Hello!
You may have seen how we recently struggled to decide which huge and impressive generative model to cover next on the channel. Should it be DALL-E 2 from OpenAI, which can generate or edit images from just text? Or rather the slightly boring news about the new largest large language model, called Pathways Language Model, or PaLM for short, from Google AI? We suppose you have already guessed from the title of this video that we decided on PaLM. We did not choose DALL-E 2 for the moment, since we released a video about diffusion models and GLIDE far too recently to revisit that topic just now. Maybe later, so subscribe! In this video, we will summarize the PaLM paper for you. It is “only” 83 pages long, and we will explain what is so interesting about this boring language model, PaLM. Spoiler: it can explain jokes. Oh, Ms. Coffee Bean, it is not only about the jokes, is it? But first, let's thank Weights & Biases
for supporting us for this video! How many times have you cracked a hard problem and wanted to show your solution to someone? Well, W&B Reports have got you covered. If all your insights are in a particular chart, you can share it in a Report with a couple of quick comments. But if you have more to show, a Report can also become a long-form, more polished piece of analysis; just check out this Report about DALL-E mini. It’s like a blog post on steroids! Reports also help you separate the things you want to share from the things you would rather not: you can use Reports as a dashboard reporting a smaller subset of metrics than the main W&B workspace. Create private shareable links or public, view-only links and share them with anyone! What are you waiting for? Go to wand.me/ai-coffee-break to see more about how Reports can help you quickly share your work with colleagues around the world! Now, back to boring large language models. When it comes to large language
models, we have been bored lately. We have been bored by GLaM from Google AI, which has a total of 1.2 trillion parameters, but don’t be too impressed! The model relies heavily on sparsity, so it activates only 97 billion parameters at a time, which is about 8% of the capacity of the whole model. And in the sparse-model landscape with trillions of parameters, we could also count Wu Dao 2.0 with 1.75 trillion parameters, or Google’s Switch Transformers with 1.6 trillion parameters. But let’s take a look at dense models, shall we?
We have been bothered by LaMDA from Google AI, a 137 billion parameter model specialized in dialogue. Then by Gopher from DeepMind, with 280 billion parameters. And let’s not forget the largest monolithic model so far, Megatron-Turing NLG from Microsoft and NVIDIA, with 530 billion parameters. Ok, and of course Google needed to claim the biggest model again, so they made PaLM just a bit bigger, just enough to overshadow Megatron by 10 billion parameters. So yeah, they made a 540 billion parameter model. Keep in mind that this model is dense: all parameters are used
during inference; there is no sparsity here. In terms of model architecture, PaLM does not innovate: it is a GPT-like dense transformer decoder. The beginning of the text goes in, and the transformer predicts the next word while attending to the previous words. The authors do use some tricks for training this beast. For example, they compute the attention and the feedforward layer of each block in parallel, which is what people from EleutherAI have been doing with GPT-J-6B, as sketched below. PaLM also uses SwiGLU activations, Rotary Positional Embeddings, and a special “lossless” vocabulary to account for the many languages in the training data, for code, and for numbers.
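To make the parallel formulation concrete, here is a minimal PyTorch sketch (not PaLM’s actual code; the class, dimensions, and the single shared LayerNorm are our own illustration, with a SwiGLU-style feed-forward included): instead of running attention first and the feed-forward block on top of its output, both branches read the same normalized input and their outputs are added to the residual.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelTransformerBlock(nn.Module):
    """Sketch of the 'parallel' block: y = x + Attn(LN(x)) + MLP(LN(x)),
    instead of the standard serial y = x + MLP(LN(x + Attn(LN(x))))."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # SwiGLU-style feed-forward: swish(x W) * (x V), projected back to d_model.
        self.w = nn.Linear(d_model, d_ff, bias=False)
        self.v = nn.Linear(d_model, d_ff, bias=False)
        self.out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x, attn_mask=None):
        h = self.norm(x)  # one shared LayerNorm feeds both branches
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        mlp_out = self.out(F.silu(self.w(h)) * self.v(h))
        # Both branches depend only on h, so they can be computed in parallel
        # (or their input matmuls fused), which is the point of this formulation.
        return x + attn_out + mlp_out

y = ParallelTransformerBlock()(torch.randn(2, 16, 512))  # toy usage
```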
These things have already been proven to help train large language models by previous work. PaLM’s innovation is geared entirely towards training this beast efficiently. Now, what does “efficiently” mean for Google? It certainly does not mean the same as it means for you when you are happy to parallelize your training across the 4 GPUs your cluster node has. ;) For Google AI, this means, and I hope you are seated, that they used 6144 TPU chips for training. Wow. Just for comparison, the most impressive numbers so far were 2240 A100 GPUs for Megatron-Turing NLG and up to 4096 TPU v3 chips for Gopher. So how does it work? How can Ms. Coffee Bean
train on 6144 TPU chips too? Well, she can’t, but Google could, because they used two TPU v4 pods with 3072 chips each, connected by fast communication links. Each pod holds a full copy of the model parameters, partitioned over its 3072 chips. Now, to let the two pods communicate efficiently, the authors used the Pathways system, which was introduced by Google in a previous paper.
In a nutshell, a batch of training data is divided in half, and each half is sent to one of the pods for forward and backward propagation. Then the gradients computed on one half of the batch are transferred to the other pod, and vice versa. But this also means that the transfer must wait until the other pod finishes, and once the waiting is over, a burst of data transfer of 81 terabytes per second happens all at once.
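To picture that dance between the pods, here is a toy NumPy-only sketch; the forward_backward function and all numbers are stand-ins we made up. It only mimics the “compute on your half, exchange gradients, apply the same update” pattern described above, not the actual Pathways implementation.

```python
import numpy as np

def forward_backward(params, batch):
    # Stand-in for a full forward + backward pass inside one pod
    # (in reality, model-parallel over the pod's 3072 chips).
    rng = np.random.default_rng(len(batch))
    return rng.standard_normal(params.shape)  # placeholder gradient

params_pod_a = np.zeros(10)           # each pod holds a full copy of the parameters
params_pod_b = params_pod_a.copy()
batch = np.arange(16)
half_a, half_b = np.split(batch, 2)   # the batch is split in half across the two pods

# 1) Each pod computes gradients on its half of the batch, in parallel.
grad_a = forward_backward(params_pod_a, half_a)
grad_b = forward_backward(params_pod_b, half_b)

# 2) The pods exchange gradients: the bursty cross-pod transfer,
#    which can only start once both pods have finished their backward pass.
# 3) Each pod applies the same averaged update, so the two copies stay in sync.
avg_grad = (grad_a + grad_b) / 2.0
learning_rate = 1e-2
params_pod_a -= learning_rate * avg_grad
params_pod_b -= learning_rate * avg_grad
```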
But you can imagine that all this waiting means that during this time, the cores cannot really compute anything interesting. What is really impressive is that, through a lot of engineering tricks Ms. Coffee Bean has no idea about, they achieved an analytically computed hardware FLOPs utilization of over 50%, which is impressive and unprecedented.
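As a back-of-the-envelope illustration of what such a utilization figure means, here is a tiny calculation. None of these constants are from the paper: the per-chip peak is roughly the commonly quoted bf16 figure for TPU v4, and the achieved throughput is a made-up number chosen only so the result lands near the reported ballpark.

```python
n_chips = 6144                    # TPU v4 chips used for training (from the video)
peak_flops_per_chip = 275e12      # assumed bf16 peak per chip, FLOP/s (approximate)
theoretical_peak = n_chips * peak_flops_per_chip   # if every chip were 100% busy

achieved_flops_per_s = 9.8e17     # illustrative, made-up achieved throughput

utilization = achieved_flops_per_s / theoretical_peak
print(f"hardware FLOPs utilization: {utilization:.1%}")  # ~58% with these made-up numbers
```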
In this table comparing PaLM to previous large language models, you can see a FLOPs utilization of 46.2%, which is smaller than 57.8%. This is because the table reports the FLOPs utilization without rematerialization. Rematerialization means that one saves memory by not storing some intermediate activations, but re-computing them when needed. This means that with rematerialization, the chips are busier when training PaLM than when training the other models, which do not use this trick; so, for comparability, the table shows the number without rematerialization. We see how high this number is nonetheless, a lot higher than for the comparably inefficient GPT-3 model, for example.
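Rematerialization is the same idea PyTorch exposes as gradient checkpointing. The sketch below is not how PaLM is implemented (that lives in Google’s TPU stack); it is just a toy stack of stand-in layers showing the memory-for-compute trade.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# A toy stack of feed-forward blocks standing in for transformer layers.
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    for _ in range(8)
)

def run(x, rematerialize=False):
    for block in blocks:
        if rematerialize:
            # Do NOT store this block's intermediate activations for backward;
            # recompute them during the backward pass instead (extra FLOPs, less memory).
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)  # standard mode: all activations are kept in memory
    return x

x = torch.randn(4, 512, requires_grad=True)
run(x, rematerialize=True).sum().backward()  # less memory, more compute
```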
Okay, we get it. This thing is big, it’s training fast, so what is it eating, exactly? Well, it is feeding on lots and lots of training data: 780 billion tokens, to be exact. 50% of the data consists of social media conversations in multiple languages, almost a quarter consists of filtered webpages, and the rest is books, Wikipedia, news, and code. So already from the training data, we can expect that PaLM will have multilingual abilities and will be able to do code completion: it used 196 GB of source code in Java, HTML, JavaScript, Python, PHP, C#, XML, C++, and C. And surprise, surprise, the model can do code completion, fix compilation errors, and translate code from one programming language into another, even though its pre-training set contained 50 times less Python code than that of Codex 12B. The hypothesis here is that PaLM can do code completion even though only 5% of its training data consisted of code because it benefits from its size and from its training on natural language. This makes it more data-efficient when it comes to code data. On natural language, PaLM improves
state-of-the-art on common benchmarks, like natural language inference, where one has to say
whether a sentence entails or contradicts another, reading comprehension, question answering, the
complicated Winograd-style schemas, and so on. But we have not gotten to the jaw-dropping part yet: Google had previously organized the collection of BIG-bench, which stands for “Beyond the Imitation Game Benchmark” and contains quite tricky tasks, like delivering persuasive arguments for false statements, or finding the Russian proverb closest in meaning to a given English proverb. BIG-bench is quite the crazy collection of tasks, where even the average human has a 50% success rate. Well, by scaling PaLM from 8 billion to 62 billion to 540 billion parameters, we see how PaLM, in 5-shot learning, surpasses the average human! A quick word here: maybe you are wondering
what few-shot learning means. Well, I remember a time when few-shot learning meant that one trains a model further on 10, maybe 100 samples, and updates the parameters of the model while doing so, with the feedback from those few examples. Of course, catastrophic forgetting is a huge problem here, and the updates on that little data can undo what a lot of training has previously done. But there were parameter updates through gradients. Since GPT-3, however, what we have just described is rather called fine-tuning, and the default meaning of “few-shot learning” has shifted to something else: the model receives a couple of solved examples as a prompt which is prepended to the input. So we have problem, solution, problem, solution, then problem, and the GPT-like model simply completes the sequence, with the solved examples as its only context.
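Here is a tiny sketch of what such a prompt looks like; the toy task and the commented-out complete_text helper are made up for illustration and are not PaLM’s interface.

```python
# An in-context ("few-shot") prompt: solved examples, then a new problem.
prompt = """\
Q: What is the capital of France?
A: Paris

Q: What is the capital of Japan?
A: Tokyo

Q: What is the capital of Romania?
A:"""

# The language model is simply asked to continue the text. No gradients,
# no parameter updates: the "learning" is pattern completion in the prompt.
# completion = complete_text(model, prompt)  # hypothetical helper; expect " Bucharest"
```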
Since GPT-3, we have come to expect large language models to be sensitive to this kind of prompting and to give right answers even without any updates being done to them. And we are wondering: why is this called few-shot learning when there are no updates and there is no feedback loop in case the model answers incorrectly? Like, where is the learning? Well, we do not want to start a discussion about what “learning” means now. We just wanted to make the point that few-shot learning, as we see it from GPT-3 onwards, is just glorified pattern completion.
Now, with PaLM in this few-shot learning setting, there are six tasks where we can observe a discontinuous improvement with scale, that is, a certain size at which the model suddenly “gets it”, once it has enough parameters. With enough scale, the capability of solving, say, logical sequence tasks just emerges. Is this understanding? I have no idea. For the English proverbs task, Ms. Coffee Bean would assume that enough parameters are simply enough to memorize the proverbs? No idea, honestly. I wonder whether there is anybody in this world who understands what kind of “understanding” happens in these models. We observe these discontinuous jumps, but do
we really know why they happen? Honestly, no. Speaking of emergence and reasoning, just look, another capability emerges with model scale and data scale: if one gives a reasoning example to PaLM, then PaLM reasons too! The authors call this “chain-of-thought prompting”. PaLM gets as input a question, its answer, and the reasoning behind it. Then, for the next question, the model follows the pattern and gives the reasoning for its answer too! Incredible.
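Here is a sketch of such a chain-of-thought prompt, in the spirit of the examples from the chain-of-thought line of work; the exact wording is ours, and generate is a hypothetical stand-in for querying the model.

```python
# One worked example with its reasoning, followed by a new question.
# The model is expected to imitate the pattern: reason first, then answer.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more.
How many apples do they have now?
A:"""

# answer = generate(model, cot_prompt)  # hypothetical call
# A chain-of-thought completion would read something like:
# "The cafeteria started with 23 apples. They used 20, so 23 - 20 = 3.
#  Then they bought 6 more, so 3 + 6 = 9. The answer is 9."
```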
Following the same idea of prompting, where the output is a completion of the given examples, PaLM can also explain jokes. But keep in mind, it can do so because its input is an example of an explained joke followed by a new joke. This makes the joke-explanation example in the blog post that circulated on social media a little bit misleading, because it omits the prompt. You decide how much cherry-picking has gone into these examples, because we know from playing with GPT-3-like models that they sometimes succeed incredibly and sometimes fail gracelessly. So, be excited: these emergent capabilities are indeed impressive, coming from a “slightly conscious”, I mean, “boring” language model. But also keep in mind that there are experiments in the paper evaluating fairness and biases, and they highlight several problems, with the model reinforcing biases and exhibiting toxicity. There is a lot we do not understand about
these models, like why do the bad things, such as toxicity, happen? We do not even know why the good things happen! Ok, scale enables the emergence of some key capabilities in the model, as we have seen with PaLM and logical reasoning. But in the open questions of the paper, we can also see a discussion about a possible alternative to sheer model scale: the effect of having lots of high-quality training data. So high-quality data could be an alternative to huge parameter counts. Or it could also mean that PaLM, with its 540 billion parameters, would learn even more from more high-quality data. So what do you think about this paper? On the
one hand, the model shows impressive capabilities; on the other hand, it’s boring: there is no architectural innovation in it. But you can imagine that, while we have not seen the model innovation yet, now that Google knows how to train these things efficiently on 6144 TPU chips at once, we will see more research and innovation with large models on text, but also on images, video, audio, you name it. We’re living in exciting times, and I won’t even start to describe how a little PhD student feels about this. How do you feel about realizing over and over again that Sutton’s bitter lesson is really a thing and that scaling things up is the way to go? Or at least one of the ways. See you next time! Okay, bye!