Hello!
You may have seen how we recently struggled to decide which huge and impressive generative model to cover next on the channel. Should it be DALL-E 2 from OpenAI, which can generate or edit images from just text? Or rather the slightly boring news about the new largest large language model, called Pathways Language Model, or PaLM for short, from Google AI? We suppose you have already guessed from the title of this video that we decided on PaLM. We did not choose DALL-E 2 for the moment, since we released a video about diffusion models and GLIDE far too recently to revisit that topic just now. Maybe later, so subscribe! In this video, we will summarize the PaLM paper for you. It is “only” 83 pages long, and we will explain what is so interesting about this boring language model, PaLM. Spoiler: it can explain jokes. Oh, Ms. Coffee Bean, it is not only about the jokes, is it? But first, let's thank Weights & Biases
for supporting us for this video! How many times have you cracked a hard problem and wanted to show your solution to someone? Well, W&B Reports have got you covered. If all your insights are in a particular chart, you can share it in a Report with a couple of quick comments. But if you have more to show, a Report can also become a long-form, more polished piece of analysis; just check out this Report about DALL-E mini. It’s like a blog post on steroids! Reports also help you separate the things you want to share from the things you would rather not: you can use Reports as a dashboard reporting a smaller subset of metrics than the main W&B workspace. Create private shareable links or public, view-only links and share them with anyone! What are you waiting for? Go to wand.me/ai-coffee-break to see more about how Reports can help you quickly share your work with colleagues around the world! Now, back to boring large language models. When it comes to large language
models, we have been bored lately. We have been bored by GLaM from Google AI, which has a total of 1.2 trillion parameters, but don’t be too impressed! The model relies heavily on sparsity, so it activates only 97 billion parameters at a time, which is about 8% of the capacity of the whole model. And in the sparse-model landscape with trillions of parameters, we could also count Wu Dao 2.0 with 1.75 trillion parameters, or Google’s Switch Transformers with 1.6 trillion parameters. But let’s take a look at dense models, shall we?
We have been bothered by LaMDA from Google AI, a 137 billion parameter model specialized in dialogue. Then by Gopher from DeepMind, with 280 billion parameters. And let’s not forget the largest monolithic model so far, Megatron-Turing NLG from Microsoft and NVIDIA, with 530 billion parameters. Ok, and of course Google needed to claim the biggest model again, so they made PaLM just a bit bigger, just enough to overshadow Megatron by 10 billion parameters. So yeah, they made a 540 billion parameter model. Keep in mind that this model is dense: all parameters are used
during inference; there is no sparsity here. In terms of model architecture, PaLM does not innovate: it is a GPT-like dense transformer decoder. The beginning of the text goes in, and the transformer predicts the next word while attending to the previous words. The authors do use some tricks for training this beast. For example, they compute the attention and the feedforward layer of each block in parallel, which is what people from EleutherAI have been doing with GPT-J-6B, as sketched below. PaLM also uses SwiGLU activations, Rotary Positional Embeddings, and a special “lossless” vocabulary to account for the many languages in the training data, for code, and for numbers.
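To make the parallel formulation concrete, here is a minimal PyTorch sketch (not PaLM’s actual code; the class, dimensions, and the single shared LayerNorm are our own illustration, with a SwiGLU-style feed-forward included): instead of running attention first and the feed-forward block on top of its output, both branches read the same normalized input and their outputs are added to the residual.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelTransformerBlock(nn.Module):
    """Sketch of the 'parallel' block: y = x + Attn(LN(x)) + MLP(LN(x)),
    instead of the standard serial y = x + MLP(LN(x + Attn(LN(x))))."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # SwiGLU-style feed-forward: swish(x W) * (x V), projected back to d_model.
        self.w = nn.Linear(d_model, d_ff, bias=False)
        self.v = nn.Linear(d_model, d_ff, bias=False)
        self.out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x, attn_mask=None):
        h = self.norm(x)  # one shared LayerNorm feeds both branches
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        mlp_out = self.out(F.silu(self.w(h)) * self.v(h))
        # Both branches depend only on h, so they can be computed in parallel
        # (or their input matmuls fused), which is the point of this formulation.
        return x + attn_out + mlp_out

y = ParallelTransformerBlock()(torch.randn(2, 16, 512))  # toy usage
```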
These things have already been proven to help train large language models by previous work. PaLM’s innovation is geared entirely towards training this beast efficiently. Now, what does “efficiently” mean for Google? It certainly does not mean the same as it means for you when you are happy to parallelize your training across the 4 GPUs your cluster node has. ;) For Google AI, this means, and I hope you are seated, that they used 6144 TPU chips for training. Wow. Just for comparison, the most impressive numbers so far were 2240 A100 GPUs for Megatron-Turing NLG and up to 4096 TPU v3 chips for Gopher. So how does it work? How can Ms. Coffee Bean
train on 6144 TPU chips too? Well, she can’t, but Google could, because they used two TPU v4 pods with 3072 chips each, connected by fast communication links. Each pod holds a full copy of the model parameters, partitioned over its 3072 chips. Now, to let the two pods communicate efficiently, the authors used the Pathways system, which was introduced by Google in a previous paper.
In a nutshell, a batch of training data is divided in half, and each half is sent to one of the pods for forward and backward propagation. Then the gradients computed on one half of the batch are transferred to the other pod, and vice versa. But this also means that the transfer must wait until the other pod finishes, and once the waiting is over, a burst of data transfer of 81 terabytes per second happens all at once.
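To picture that dance between the pods, here is a toy NumPy-only sketch; the forward_backward function and all numbers are stand-ins we made up. It only mimics the “compute on your half, exchange gradients, apply the same update” pattern described above, not the actual Pathways implementation.

```python
import numpy as np

def forward_backward(params, batch):
    # Stand-in for a full forward + backward pass inside one pod
    # (in reality, model-parallel over the pod's 3072 chips).
    rng = np.random.default_rng(len(batch))
    return rng.standard_normal(params.shape)  # placeholder gradient

params_pod_a = np.zeros(10)           # each pod holds a full copy of the parameters
params_pod_b = params_pod_a.copy()
batch = np.arange(16)
half_a, half_b = np.split(batch, 2)   # the batch is split in half across the two pods

# 1) Each pod computes gradients on its half of the batch, in parallel.
grad_a = forward_backward(params_pod_a, half_a)
grad_b = forward_backward(params_pod_b, half_b)

# 2) The pods exchange gradients: the bursty cross-pod transfer,
#    which can only start once both pods have finished their backward pass.
# 3) Each pod applies the same averaged update, so the two copies stay in sync.
avg_grad = (grad_a + grad_b) / 2.0
learning_rate = 1e-2
params_pod_a -= learning_rate * avg_grad
params_pod_b -= learning_rate * avg_grad
```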
But you can imagine that all this waiting means that during this time, the cores cannot really compute anything interesting. What is really impressive is that, through a lot of engineering tricks Ms. Coffee Bean has no idea about, they achieved an analytically computed hardware FLOPs utilization of over 50%, which is impressive and unprecedented.
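As a back-of-the-envelope illustration of what such a utilization figure means, here is a tiny calculation. None of these constants are from the paper: the per-chip peak is roughly the commonly quoted bf16 figure for TPU v4, and the achieved throughput is a made-up number chosen only so the result lands near the reported ballpark.

```python
n_chips = 6144                    # TPU v4 chips used for training (from the video)
peak_flops_per_chip = 275e12      # assumed bf16 peak per chip, FLOP/s (approximate)
theoretical_peak = n_chips * peak_flops_per_chip   # if every chip were 100% busy

achieved_flops_per_s = 9.8e17     # illustrative, made-up achieved throughput

utilization = achieved_flops_per_s / theoretical_peak
print(f"hardware FLOPs utilization: {utilization:.1%}")  # ~58% with these made-up numbers
```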
In this table comparing PaLM to previous large language models, you can see a FLOPs utilization of 46.2%, which is smaller than 57.8%. This is because the table reports the FLOPs utilization without rematerialization. Rematerialization means that one saves memory by not storing some intermediate activations, but re-computing them when needed. This means that with rematerialization, the chips are busier when training PaLM than when training the other models, which do not use this trick; so, for comparability, the table shows the number without rematerialization. We see how high this number is nonetheless, a lot higher than for the comparably inefficient GPT-3 model, for example.
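Rematerialization is the same idea PyTorch exposes as gradient checkpointing. The sketch below is not how PaLM is implemented (that lives in Google’s TPU stack); it is just a toy stack of stand-in layers showing the memory-for-compute trade.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# A toy stack of feed-forward blocks standing in for transformer layers.
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    for _ in range(8)
)

def run(x, rematerialize=False):
    for block in blocks:
        if rematerialize:
            # Do NOT store this block's intermediate activations for backward;
            # recompute them during the backward pass instead (extra FLOPs, less memory).
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)  # standard mode: all activations are kept in memory
    return x

x = torch.randn(4, 512, requires_grad=True)
run(x, rematerialize=True).sum().backward()  # less memory, more compute
```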
Okay, we get it. This thing is big, it’s training fast, so what is it eating, exactly? Well, it is feeding on lots and lots of training data: 780 billion tokens, to be exact. 50% of the data consists of social media conversations in multiple languages, almost a quarter consists of filtered webpages, and the rest is books, Wikipedia, news, and code. So already from the training data, we can expect that PaLM will have multilingual abilities and will be able to do code completion: it used 196 GB of source code in Java, HTML, JavaScript, Python, PHP, C#, XML, C++, and C. And surprise, surprise, the model can do code completion, fix compilation errors, and translate code from one programming language into another, even though its pre-training set contained 50 times less Python code than that of Codex 12B. The hypothesis here is that PaLM can do code completion even though only 5% of its training data consisted of code because it benefits from its size and from its training on natural language. This makes it more data-efficient when it comes to code data. On natural language, PaLM improves
state-of-the-art on common benchmarks, like natural language inference, where one has to say
whether a sentence entails or contradicts another, reading comprehension, question answering, the
complicated Winograd-style schemas, and so on. But we have not gotten to the jaw-dropping part yet: Google had previously organized the collection of BIG-bench, which stands for “Beyond the Imitation Game Benchmark” and contains quite tricky tasks, like delivering persuasive arguments for false statements, or finding the Russian proverb closest in meaning to a given English proverb. BIG-bench is quite the crazy collection of tasks, where even the average human has a 50% success rate. Well, by scaling PaLM from 8 billion to 62 billion to 540 billion parameters, we see how PaLM, in 5-shot learning, surpasses the average human! A quick word here: maybe you are wondering
what few-shot learning means. Well, I remember a time when few-shot learning meant that one trains a model further on 10, maybe 100 samples, and updates the parameters of the model while doing so, with the feedback from those few examples. Of course, catastrophic forgetting is a huge problem here, and the updates on that little data can undo what a lot of training has previously done. But there were parameter updates through gradients. Since GPT-3, however, what we have just described is rather called fine-tuning, and the default meaning of “few-shot learning” has shifted to something else: the model receives a couple of solved examples as a prompt which is prepended to the input. So we have problem, solution, problem, solution, then problem, and the GPT-like model simply completes the sequence, with the solved examples as its only context.
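Here is a tiny sketch of what such a prompt looks like; the toy task and the commented-out complete_text helper are made up for illustration and are not PaLM’s interface.

```python
# An in-context ("few-shot") prompt: solved examples, then a new problem.
prompt = """\
Q: What is the capital of France?
A: Paris

Q: What is the capital of Japan?
A: Tokyo

Q: What is the capital of Romania?
A:"""

# The language model is simply asked to continue the text. No gradients,
# no parameter updates: the "learning" is pattern completion in the prompt.
# completion = complete_text(model, prompt)  # hypothetical helper; expect " Bucharest"
```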
Since GPT-3, we have come to expect large language models to be sensitive to this kind of prompting and to give right answers even without any updates being done to them. And we are wondering: why is this called few-shot learning when there are no updates and there is no feedback loop in case the model answers incorrectly? Like, where is the learning? Well, we do not want to start a discussion about what “learning” means now. We just wanted to make the point that few-shot learning, as we see it from GPT-3 onwards, is just glorified pattern completion.
Now, with PaLM in this few-shot learning setting, there are six tasks where we can observe a discontinuous improvement with scale, that is, a certain size at which the model suddenly “gets it”, once it has enough parameters. With enough scale, the capability of solving, say, logical sequence tasks just emerges. Is this understanding? I have no idea. For the English proverbs task, Ms. Coffee Bean would assume that enough parameters are simply enough to memorize the proverbs? No idea, honestly. I wonder whether there is anybody in this world who understands what kind of “understanding” happens in these models. We observe these discontinuous jumps, but do
we really know why they happen? Honestly, no. Speaking of emergence and reasoning, just look, another capability emerges with model scale and data scale: if one gives a reasoning example to PaLM, then PaLM reasons too! The authors call this “chain-of-thought prompting”. PaLM gets as input a question, its answer, and the reasoning behind it. Then, for the next question, the model follows the pattern and gives the reasoning for its answer too! Incredible.
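Here is a sketch of such a chain-of-thought prompt, in the spirit of the examples from the chain-of-thought line of work; the exact wording is ours, and generate is a hypothetical stand-in for querying the model.

```python
# One worked example with its reasoning, followed by a new question.
# The model is expected to imitate the pattern: reason first, then answer.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more.
How many apples do they have now?
A:"""

# answer = generate(model, cot_prompt)  # hypothetical call
# A chain-of-thought completion would read something like:
# "The cafeteria started with 23 apples. They used 20, so 23 - 20 = 3.
#  Then they bought 6 more, so 3 + 6 = 9. The answer is 9."
```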
Following the same idea of prompting, where the output is a completion of the given examples, PaLM can also explain jokes. But keep in mind, it can do so because its input is an example of an explained joke followed by a new joke. This makes the joke-explanation example in the blog post that circulated on social media a little bit misleading, because it omits the prompt. You decide how much cherry-picking has gone into these examples, because we know from playing with GPT-3-like models that they sometimes succeed incredibly and sometimes fail gracelessly. So, be excited: these emergent capabilities are indeed impressive, coming from a “slightly conscious”, I mean, “boring” language model. But also keep in mind that there are experiments in the paper evaluating fairness and biases, and they highlight several problems, with the model reinforcing biases and exhibiting toxicity. There is a lot we do not understand about
these models, like why do the bad things, such as toxicity, happen? We do not even know why the good things happen! Ok, scale enables the emergence of some key capabilities in the model, as we have seen with PaLM and logical reasoning. But in the open questions of the paper, we can also see a discussion about a possible alternative to sheer model scale: the effect of having lots of high-quality training data. So high-quality data could be an alternative to huge parameter counts. Or it could also mean that PaLM, with its 540 billion parameters, would learn even more from more high-quality data. So what do you think about this paper? On the
one hand, the model shows impressive capabilities; on the other hand, it’s boring: there is no architectural innovation in it. But you can imagine that, while we have not seen the model innovation yet, now that Google knows how to train these things efficiently on 6144 TPU chips at once, we will see more research and innovation with large models on text, but also on images, video, audio, you name it. We’re living in exciting times, and I won’t even start to describe how a little PhD student feels about this. How do you feel about realizing over and over again that Sutton’s bitter lesson is really a thing and that scaling things up is the way to go? Or at least one of the ways. See you next time! Okay, bye!