PaLM Pathways Language Model explained | 540 Billion parameters can explain jokes!?

Captions
Hello! You've maybe seen how we recently struggled to decide which huge and impressive generative model we would like to cover next on the channel. Should it be DALL-E 2 from OpenAI, which can generate or edit images from just text? Or rather the slightly boring news about a new largest large language model, called Pathways Language Model, or PaLM for short, from Google AI? We suppose you already guessed from the title of this video that we decided on PaLM. We did not choose DALL-E 2 for the moment, since we released a video about diffusion models and GLIDE too recently to cover that topic again just now. Maybe later, so subscribe!

In this video, we will summarize the PaLM paper for you. It is "only" 83 pages long, and we will explain what is so interesting about this boring language model, PaLM. Spoiler: it can explain jokes. Oh, Ms. Coffee Bean, it is not only about the jokes, is it?

But first, let's thank Weights & Biases for supporting us for this video! How many times have you cracked a hard problem and wanted to show your solution to someone? Well, W&B Reports have got you covered. If all your insights are in a particular chart, you can share it in a Report with a couple of quick comments. But if you have more to show, a Report can also hold a long-form, more polished piece of analysis; just check out this Report about DALL-E mini. It's like a blog post on steroids! Reports can also help you keep the things you want to share separate from the things you would rather not share: you can use a Report as a dashboard reporting a smaller subset of metrics than the main W&B workspace. Create private shareable links or public, view-only links and share them with anyone! What are you waiting for? Go to wand.me/ai-coffee-break to see how Reports can help you quickly share your work with colleagues around the world!

Now, back to boring large language models. When it comes to large language models, we have been bored lately. We have been bored by GLaM from Google AI, which has a total of 1.2 trillion parameters, but don't be that impressed! The model relies heavily on sparsity, so it activates only 97 billion parameters at a time, which is 8% of the capacity of the whole model. And on the sparse-model landscape with trillions of parameters, we could also count Wu Dao 2.0 with 1.75 trillion parameters or Google's Switch Transformers with 1.6 trillion parameters.

But let's take a look at dense models, shall we? We have been bothered by LaMDA from Google AI, which is a 137 billion parameter model specialized in dialog. Then came Gopher from DeepMind with 280 billion parameters. And let's not forget the largest monolithic model so far, Megatron-Turing NLG from Microsoft and NVIDIA, with 530 billion parameters.

Ok, and of course Google needed to claim the biggest model again, so they made PaLM just a bit bigger, just enough to overshadow Megatron by 10 billion parameters. So yeah, they made a 540 billion parameter model. Keep in mind that this model is dense: all parameters are used during inference, there is no sparsity here.

In terms of model architecture, PaLM does not innovate; it is a GPT-like dense transformer decoder. The beginning of the text goes in, and the transformer predicts the next word while attending to the previous words. The authors do use some tricks for training this beast. For example, they compute attention and the feedforward layer in parallel, which is what the people from EleutherAI have been doing with GPT-J-6B.
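Here is a minimal sketch of that parallel block, written as illustrative PyTorch-style code rather than PaLM's actual implementation. The paper writes the standard block as y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))) and the parallel block as y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)):

```python
# Illustrative sketch only, not PaLM's code: a "parallel" decoder block in the
# spirit of GPT-J and PaLM, where attention and MLP read the same normalized
# input and both outputs are added to the residual stream.
import torch.nn as nn

class ParallelDecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),  # PaLM actually uses SwiGLU; plain GELU keeps the sketch short
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        h = self.norm(x)  # one shared LayerNorm instead of two sequential ones
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        # both sub-layers depend only on h, so their large matrix multiplications
        # can be fused or overlapped, which speeds up training at large scale
        return x + attn_out + self.mlp(h)
```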
PaLM also uses SwiGLU activations, rotary positional embeddings, and a special "lossless" vocabulary to account for the many languages in the training data, for code, and for numbers. These components have already been shown by previous work to help train large language models. PaLM's innovation is completely geared towards training this beast efficiently.

Now, what does "efficiently" mean for Google? It certainly does not mean the same thing it means for you, when you are happy to parallelize your training across the 4 GPUs your cluster node has. ;) For Google AI, it means, and I hope you are seated, that they used 6144 TPU chips for training. Wow. Just for comparison, the most impressive numbers so far were established by Megatron-Turing NLG on 2240 A100 GPUs and by Gopher, which went up to 4096 TPU v3 chips.

So how does it work, how can Ms. Coffee Bean train on 6144 TPU chips too? Well, she can't, but Google could, because they used two TPU v4 pods with 3072 chips each and fast intercommunication. Each pod holds a full copy of the model parameters, partitioned over its 3072 chips. To let the two pods communicate efficiently, the authors used the Pathways system, which Google introduced in a previous paper. In a nutshell, a batch of training data is divided in half, and each half is sent to one pod for forward and backward propagation. Then the gradients computed on each half of the batch are transferred to the other pod, and vice-versa. But this also means that the transfer has to wait until the other pod finishes, and once the waiting is over, a burst of data transfer of 81 terabytes per second happens all at once. You can imagine that all this waiting means that, during this time, the cores cannot really compute anything interesting.

What is really impressive is that, through a lot of engineering tricks Ms. Coffee Bean has no idea about, they achieved an analytically computed hardware FLOPs utilization of over 50%, which is unprecedented. In this table comparing PaLM to previous large language models, you can see a FLOPs utilization of 46.2%, so smaller than the 57.8% just mentioned. This is because the table reports the FLOPs utilization without rematerialization. Rematerialization means that one saves memory by not storing some intermediate activations and re-computing them when needed instead (a small code sketch of this follows below). With rematerialization, the chips are busier when training PaLM than when training the other models, which do not use the trick, so for comparability the table reports the number without rematerialization. We see how high this is nonetheless, a lot higher than for the comparatively inefficient GPT-3 model, for example.

Okay, we get it. This thing is big, it's training fast, so what is it eating, exactly? Well, it is feeding on lots and lots of training data, 780 billion tokens, to be exact. 50% of the data consists of social media conversations in multiple languages, almost a quarter consists of filtered webpages, and the rest is books, Wikipedia, news, and code. So already from the training data, we can expect that PaLM will have multilingual abilities and will be able to do code completion: it was trained on 196 GB of source code in Java, HTML, JavaScript, Python, PHP, C#, XML, C++, and C.
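Before moving on to what the model can actually do, here is the promised sketch of rematerialization. This is a generic illustration in PyTorch, not PaLM's TPU implementation:

```python
# Generic sketch of rematerialization (also known as activation or gradient
# checkpointing), not PaLM's actual implementation: intermediate activations
# inside each block are not stored during the forward pass and are recomputed
# during the backward pass, trading extra compute for a much smaller memory footprint.
from torch.utils.checkpoint import checkpoint

def forward_with_rematerialization(blocks, x):
    for block in blocks:
        # each block is re-run during the backward pass to regenerate its activations
        x = checkpoint(block, x, use_reentrant=False)
    return x
```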
And surprise, surprise, the model can do code completion, fix compilation errors, and translate code from one programming language into another, even though its pre-training set contained 50 times less Python code than that of Codex 12B. The hypothesis here is that PaLM can do code completion even though only 5% of its training data consisted of code, because it benefits from its size and from its training on natural language. This makes it more data-efficient when it comes to code.

On natural language, PaLM improves the state of the art on common benchmarks, like natural language inference, where one has to say whether a sentence entails or contradicts another, reading comprehension, question answering, the complicated Winograd-style schemas, and so on. But we did not get to the jaw-dropping part yet: Google had previously organized the collection of BIG-bench, which stands for "Beyond the Imitation Game Benchmark" and contains quite tricky tasks, like delivering persuasive arguments for false statements, or finding the Russian proverb closest in meaning to a given English proverb. BIG-bench is quite the crazy collection of tasks, on which even the average human has only about a 50% success rate. Well, by scaling PaLM from 8 billion to 62 billion to 540 billion parameters, we see how PaLM, in 5-shot learning, surpasses the average human!

One word here: maybe you are wondering what few-shot learning means. Well, I remember a time when few-shot learning meant that one trains a model further on 10, maybe 100 samples, and updates the parameters of the model while doing so, with the feedback from the few examples. Of course, catastrophic forgetting is a huge problem here, and the updates on the little data can undo what a lot of training has previously achieved. But there were parameter updates through gradients. Since GPT-3, what we have just described is rather called fine-tuning, and the default meaning of "few-shot learning" has shifted to something else: the model receives a couple of solved examples as a prompt which is prepended to the input. So we have problem, solution, problem, solution, then a new problem, and the GPT-like model completes the sequence, hopefully with the correct solution (a tiny code sketch of this recipe follows below). Since GPT-3, we have come to expect large language models to be sensitive to this kind of prompting and to give correct answers without any updates being done to them. And we wonder: why is this called few-shot learning when there are no updates and no feedback loop in case the model answers incorrectly? Like, where is the learning? Well, we do not want to start a discussion about what "learning" means now. We just wanted to make the point that few-shot learning, as we see it from GPT-3 onwards, is just glorified pattern completion.

Now, with PaLM in this few-shot learning setting, there are six tasks where we can observe a discontinuous improvement with scale, that is, a certain size at which the model suddenly "gets it" once it has enough parameters. With enough scale, the capability of solving the logical sequence tasks just emerges. Is this understanding? I have no idea, but for the English proverbs, Ms. Coffee Bean would assume that enough parameters are simply enough to memorize the proverbs? No idea, honestly. I wonder if there is anybody in this world who understands what kind of "understanding" happens in these models. We observe these discontinuous jumps, but do we really know why they happen? Honestly, no.
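Here is the promised tiny sketch of the few-shot prompting recipe. The task, the examples, and the model call are all invented for illustration; no real API is implied:

```python
# Few-shot prompting as glorified pattern completion: solved examples are
# simply prepended to a new problem, and the model is asked to continue the text.
# The translation task and the generate() call below are made up for illustration.
solved_examples = [
    ("Translate English to French: cheese", "fromage"),
    ("Translate English to French: bread", "pain"),
]
new_problem = "Translate English to French: apple"

prompt = ""
for problem, solution in solved_examples:
    prompt += f"{problem}\n{solution}\n\n"
prompt += new_problem + "\n"

# completion = language_model.generate(prompt)  # hypothetical call; no parameters
# are updated, and there is no feedback loop if the model answers incorrectly
print(prompt)
```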
Speaking of emergence and reasoning, just look, another capability emerges from model scale and data scale: if one gives a reasoning example to PaLM, then PaLM reasons too! The authors call this "chain-of-thought prompting". PaLM gets as input a question, its answer, and the reasoning behind it. Then, for the next question, the model follows the pattern and gives a reason for its answer too! Incredible. (A small sketch of such a prompt follows further below.)

Following the same idea of prompting, where the output is a completion of the given examples, PaLM can also explain jokes. But keep in mind, it can do so because its input is an example of an explained joke followed by a new joke. This makes the joke-explanation example from the blog post that circulated on social media a little bit misleading, because it omits the prompt. You decide how much cherry-picking has gone into these examples, because we know from playing with GPT-3-like models that they sometimes succeed incredibly and sometimes fail gracelessly.

So, be excited, these emergent capabilities are indeed impressive, coming from a "slightly conscious", I mean, "boring" language model. But also keep in mind that there are experiments in the paper evaluating fairness and biases that highlight several problems with the model reinforcing biases and exhibiting toxicity. There is a lot we do not understand about these models, like why do the bad things, such as toxicity, happen? We do not even know why the good things happen! Okay, scale enables the emergence of some key capabilities in the model, as we have seen with PaLM and logical reasoning. But in the open questions of the paper, we can also see a discussion of a possible alternative to sheer model scale, namely the effect of having lots of high-quality training data. So high-quality data could be an alternative to huge parameter counts. Or it could also mean that PaLM, with its 540 billion parameters, would learn even more from more high-quality data.

So what do you think about this paper? On the one hand, the model shows impressive capabilities; on the other hand, it is boring, there is no architectural innovation in it. But you can imagine that, even though we have not seen the modeling innovation yet, now that Google knows how to train these things efficiently on 6144 TPU chips at once, we will see more research and innovation with large models on text, but also on images, video, audio, you name it. We are living in exciting times, and I will not even start to say how a little PhD student feels about this. How do you feel about realizing over and over again that Sutton's bitter lesson is really a thing and that scaling things up is the way to go? Or at least one of the ways. See you next time! Okay, bye!
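As promised above, here is a small illustrative chain-of-thought prompt. The questions, the worked-out reasoning, and the model call are invented for this sketch and are not taken from the PaLM paper:

```python
# Chain-of-thought prompting: the in-context example contains not just an answer
# but the reasoning that leads to it, so the model imitates the pattern and
# produces a reasoning chain for the new question as well. All text below is
# invented for illustration.
cot_example = (
    "Q: A box holds 4 rows of 6 apples. 5 of the apples are rotten. "
    "How many good apples are there?\n"
    "A: 4 rows of 6 apples are 24 apples. 24 - 5 = 19. The answer is 19.\n\n"
)
new_question = "Q: A baker bakes 23 muffins and sells 17. How many muffins are left?\nA:"

prompt = cot_example + new_question
# completion = language_model.generate(prompt)  # hypothetical call; the model is
# expected to continue with a reasoning chain ending in "The answer is 6."
print(prompt)
```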
Info
Channel: AI Coffee Break with Letitia
Views: 20,992
Keywords: PaLM, Pathways, Pathways Language Model, Google AI, foundation model, BIG bench, bigger GPT-3, TPU v4, chain of thought reasoning, few-shot learning explained, what is few-shot learning, dense sparse models, neural network, AI, artificial intelligence, machine learning, visualized, deep learning, easy, explained, basics, comprehensible, research, computer science, women in ai, algorithm, example, machine learning research, aicoffeebean, animated, animation, illustrated, letitia parcalabescu
Id: yi-A0kWXEO4
Length: 16min 32sec (992 seconds)
Published: Tue Apr 26 2022