Hi everyone, I'm Thomas Wolf from Hugging Face, and today we're going to talk about an exciting topic: the future of NLP, well, more precisely the future of transfer learning in NLP. To be honest, this talk is like a personal walk through some of my favorite papers and research directions of the last few months, so I really hope you enjoy it as much as I do. We're going to talk about a lot of things. We'll start by talking about model size and data requirements, then we'll talk about in-domain versus out-of-domain generalization. We'll move on to fine-tuning and model evaluation, what the problems and the limits of these are, and then we'll end up discussing common sense and inductive biases.
OK, so let's start with the elephant in the room. You've probably noticed that models are getting bigger and bigger. There is a nice graph that Victor Sanh made last year which shows how these models are just getting crazy big; the size is increasing exponentially. The state-of-the-art models are now over 1 billion parameters, and actually far above that, because several models are now around 10 billion parameters, like T5 and the Turing model from Microsoft. And you have a huge problem with these models, because they don't even fit on one GPU, not even on two GPUs; you need something like four to eight GPUs just to load these models and run them with a batch size of one. Now why is this a problem? Well, it's a huge problem,
because if you check the current leaderboards, for instance the GLUE leaderboard that you can see here, you can see that the competition is narrowing. It's all about the same teams now: there's Google, Microsoft, Amazon, Baidu, Facebook, and that's pretty much it. You see there are no academics there, because the models are just too big, the computational requirements are too big. So there's a huge problem of diversity, and also of where academia fits in current NLP research. Another problem that you've probably seen is the environmental cost of training these models: they require a lot of energy, and the energy consumed generates carbon dioxide, so we know that training these models isn't good for the environment. What can we do? And the last problem was very well stated by François Chollet: if we just keep going bigger, what do we expect? Do we expect to see a phase transition at some point, or is it just like building a bigger ladder to try to reach the moon? Now there is another option,
which is to go the other way. We've known since this very nice paper, I really like the title, "Optimal Brain Damage" from 1989, that neural nets are over-parametrized: they have too many weights, and we can just prune them. The most recent example is the lottery ticket hypothesis, which says that if you take a randomly initialized model, you can actually find a sub-network inside this model which already has good performance on your target tasks. You don't even need to train your model: you can just take this big model and find a small sub-network inside it that already works nicely for your task. And we see that in fine-tuned models as well: when you fine-tune models you can remove weights, you can select a subset of the weights, and you keep the performance. Here you see an example on NLP tasks where you can actually remove something like 90 percent of the weights and keep the same performance. So we want to push in this direction, and here is a small promotion: we're running a competition that actually started two days ago, which is about building the most efficient models you can. It's called SustaiNLP, it's a workshop that will be collocated with EMNLP at the end of the year, and the goal is to reach the same performance threshold as current state-of-the-art models, like BERT-base for instance, while being as energy efficient as you can. The competition is only on inference for now, because inference is actually one of the biggest parts when you look at the lifetime computational cost of a model: when these models are deployed in applications on thousands of servers, inference is the biggest part of their lifetime environmental cost. So if we can get better on inference, we're already a long way toward our goal of more efficient models.
Now, how can you reduce the size of a model? Let's go a bit into that. There are mostly three techniques you can use: the first one is called distillation, the second one is pruning, and then there is quantization. So, distillation: here is a good example. We made a model called DistilBERT at the end of last year which keeps about 95% of BERT's performance on GLUE while being about 40% smaller. How do you do that? Well, you take BERT as a teacher model, which means you have a pre-trained BERT, and then you train a student model, which will be smaller, and you train the student model to reproduce the generalization capabilities of the teacher. Here is a good example: you see this sentence, "I think this is the beginning of a beautiful...", and the model is asked to complete it. BERT is trained to predict masked tokens, so here you see the top predictions of BERT, and they all make sense: "day", "life", "future", "story", all these top predictions that BERT thinks are possible make sense. This is what we call generalization: the BERT model learned to generalize beyond the simple training example in this sentence. And what we do is train the student model to generalize in the same way as the teacher model, so it learns the inductive biases that the teacher model has learned. It's very easy: we just use a cross-entropy loss, it's called knowledge distillation, and we compute a cross-entropy between the output distribution of our student and the output distribution of our teacher. When you train, you can use a temperature to emphasize the lower probabilities; that's a very common trick.
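To make this concrete, here is a minimal sketch of that distillation loss: a soft cross-entropy between the teacher's and the student's output distributions, with a temperature to emphasize the lower-probability classes. This is just an illustration of the idea, not the exact DistilBERT training code (which also combines a masked-language-modeling loss and a hidden-state cosine loss).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature; higher temperature emphasizes
    # the teacher's low-probability ("dark knowledge") classes.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Cross-entropy between teacher and student distributions, scaled by T^2
    # (as in Hinton et al.) so gradient magnitudes stay comparable.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean() * temperature ** 2
```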
A lot of people have been publishing on distillation since the end of last year; you can see a couple of papers here. The state-of-the-art distillation models are quite complex now. TinyBERT is a good example, where the student model actually has a smaller hidden-state size than the teacher model and also mimics the hidden states of the teacher, so you have down-projections from the teacher to the student. They also used a lot of data augmentation, so it's kind of tricky to know exactly what part of the good performance of these latest models comes from data augmentation and what part comes from distillation, but definitely people can get very small models with good performance using a mix of distillation and data augmentation. Now let's move on to the second technique you can use to reduce the size of a model, which is called pruning. In pruning you directly work on your original model and you remove weights from it to make it smaller. There are various ways you can prune.
One simple way is to remove attention heads from your transformer. This was shown in two nice papers from last year, one by Elena Voita from the University of Edinburgh and another by Paul Michel at CMU, and they show that you can actually remove a lot of the heads of a transformer model after it has been trained and keep very good performance. At the top you see the results on translation: you can remove up to 90 percent of the heads and keep a very good BLEU score. At the bottom you see the results on GLUE, the General Language Understanding Evaluation benchmark, and you see pretty much the same performance. One way you can identify the heads you should remove is by using what Michel and colleagues call the head importance score, which is essentially the gradient of the loss with respect to the output of the attention head, and if you remove the least important heads first you can keep these very good performances. What is interesting, as you can see on the scatter plot here, is that if you remove heads that are less important for one task, here MNLI, this is quite resilient to domain shift. In the middle you have the two parts of the dataset, what is called the matched set and the mismatched set, and if you remove heads that are not useful for one, you can see that it's correlated; OK, the relation is not exactly linear, but there is some correlation with the heads that are not important on the other domain. This means that some heads are actually not useful for anything, at least on MNLI. That's interesting, because it means head pruning is quite resilient to domain adaptation.
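If you want to play with head pruning, the transformers library exposes it directly; here is a small sketch. The layer and head indices below are arbitrary, just for illustration; in practice you would rank heads with an importance score like the gradient-based one above and remove the least important ones first.

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Remove heads 0 and 2 in layer 0, and head 11 in layer 5 (arbitrary choices for the example).
model.prune_heads({0: [0, 2], 5: [11]})

# The config keeps track of which heads were removed, and the attention projection
# matrices are physically shrunk, so the model really does get smaller.
print(model.config.pruned_heads)
```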
Now you can also go further and remove individual weights. Removing individual weights is more fine-grained, because each specific weight can be removed, but the problem is that you end up with very sparse matrices that are not so good for GPUs; we'll talk about that later. You can still get very good performance this way. Here is a nice paper from the ASAPP team, by Ziheng Wang, Jeremy Wohlwend and Tao Lei, where they prune pre-trained models by removing weights with a nice differentiable L0 pruning; they use a hard concrete distribution, which is basically the Gumbel-softmax trick.
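The differentiable L0 approach needs special machinery at training time; as a much simpler sketch of what plain unstructured weight pruning looks like (simple magnitude pruning, not the method of that paper), PyTorch ships utilities you can apply to any layer:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(768, 768)

# Zero out the 90% of weights with the smallest magnitude (unstructured sparsity).
prune.l1_unstructured(layer, name="weight", amount=0.9)
print(float((layer.weight == 0).float().mean()))  # roughly 0.9

# The matrix is still stored densely, which is exactly the GPU-efficiency
# problem discussed below; prune.remove just makes the pruning permanent.
prune.remove(layer, "weight")
```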
And the last approach is layer pruning. In layer pruning, and this is a nice paper by Angela Fan from last year at Facebook as well, you actually remove full layers of the transformer, so this is really a lot: you remove the whole layer. The way you can do that and still have the model behave quite well is by training the model to be resilient to it: during pre-training you randomly remove layers, a bit like dropout; it's a structured approach to dropping layers, and the model learns to behave well without some of its layers. It works well because these transformer layers are a repetition of the same module, and you have these residual connections, these shortcut connections, which means that one layer and the next are always also connected through a shortcut. So when you remove a layer it's actually less aggressive than in a fully connected model without shortcut connections. Layer pruning is very interesting as well, and you keep dense matrices because you remove full blocks of weights.
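Here is a toy sketch of that structured layer dropout idea, in the spirit of LayerDrop rather than Facebook's actual implementation: during training, whole layers are skipped at random, so skipping a layer is just the identity map, which fits the residual structure described above, and at inference time you can simply keep fewer layers.

```python
import torch
from torch import nn

class LayerDropEncoder(nn.Module):
    """Toy encoder that randomly skips whole layers during training (LayerDrop-style sketch)."""

    def __init__(self, layers, layerdrop=0.5):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.layerdrop = layerdrop

    def forward(self, x):
        for layer in self.layers:
            # Drop the whole layer with probability `layerdrop` during training;
            # the activation just passes straight through, so the model learns
            # to behave well even when some layers are missing.
            if self.training and torch.rand(1).item() < self.layerdrop:
                continue
            x = layer(x)
        return x

encoder = LayerDropEncoder([nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(6)])
out = encoder(torch.randn(10, 2, 64))  # (sequence, batch, hidden)
```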
Now, why am I talking about this problem of sparsity? Because we will run all these models on GPUs, and CPUs and GPUs are really optimized for dense matrix multiplication: they have trouble with sparsity. When you use these sparse models on a CPU or a GPU, they're usually way slower, it can be like three to four times slower to run. So they're smaller indeed, but they're also a lot slower, and that's not efficient, so you're losing what you were actually looking for, which was energy efficiency. There are various ways you can try to circumvent this. One way is to use what OpenAI has been promoting, which is block sparsity: instead of removing single weights, you remove blocks of weights, and these blocks have a nice size that is adapted to your CPU or GPU kernels, which means that you keep dense matrix multiplication. You can remove blocks, and when you apply strong sparsity it means you just keep a few blocks; your matrix is just a few blocks, as you can see here, so this helps. Another approach is to use full sparsity but with patterns that you control, so you can keep the advantages of optimized CUDA kernels.
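As a rough sketch of the block-sparsity idea (my own toy code, not OpenAI's blocksparse kernels), you can score whole tiles of a weight matrix and keep only the strongest ones, so what survives is still a set of dense blocks:

```python
import torch

def block_prune_mask(weight, block=32, keep_ratio=0.25):
    """Keep only the fraction `keep_ratio` of (block x block) tiles with the largest L2 norm.
    Entire tiles are kept or zeroed, so the surviving blocks stay dense and kernel-friendly."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    tiles = weight.reshape(rows // block, block, cols // block, block)
    scores = tiles.pow(2).sum(dim=(1, 3)).sqrt()           # one score per tile
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    tile_mask = (scores >= threshold).float()              # 1 = keep tile, 0 = drop tile
    return tile_mask[:, None, :, None].expand_as(tiles).reshape(rows, cols)

w = torch.randn(128, 128)
w_block_sparse = w * block_prune_mask(w)
```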
Now, the more you structure the sparsity, usually the less performance you can keep, because you are constraining the model more. With unstructured sparsity you can usually keep the best performance on all the metrics, but you lose efficiency; and the more you structure the sparsity, the better your energy efficiency is and the worse your performance is. Another alternative is to switch chips and try, for instance, the new IPU from Graphcore, which is a chip specifically designed for sparse models: it's made of a lot of small modules that can process data independently, each with a small local memory attached, and they can process sparse matrices very efficiently.
The last technique I want to talk about for shrinking models is quantization. Quantization is also very interesting. We know that using float32, full-precision floating-point weights, is not the most efficient option; these neural networks also work well with reduced precision, and even with quantized integer weights, so we can do that for our transformers as well. You convert the float32 full-precision weights into int8 integers, so you really reduce the size of your model a lot. You can use dynamic quantization, for instance, where you have a scale and zero-point conversion, and this works very well. There was nice work by Intel called Q8BERT, and it really works well; you can try it, it's very easy to apply quantization in Python. And a bit like layer pruning, you can do quantization-aware training: you tell your model that it is going to be quantized at the end, you train it in a way that gets it used to being quantized, and you have better performance at the end.
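Dynamic quantization is indeed only a couple of lines in PyTorch; here is a minimal sketch on a BERT model. Weights of the linear layers become int8, activations are quantized on the fly with a scale and zero point, and the result is meant for CPU inference.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Quantize the weights of all Linear modules to int8; activations get a dynamic
# scale/zero-point at runtime. The quantized model is a drop-in replacement for CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```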
OK, we've talked a lot about these big models and how to reduce their size. Now there is another thing that has been increasing exponentially recently in NLP, which is the requirement for more data: people are using more data for pre-training, and people are also using more data for fine-tuning. This is a problem, because when you compare two models that were pre-trained on two datasets of very different sizes, it's really hard to tell if one model is better because it was pre-trained on more data, or if it's better because of the novel architectural design that the authors introduced. A good recent example from last year was XLNet, the transformer from Google that was the successor of Transformer-XL. XLNet uses a smart autoregressive training scheme, so you can do autoregressive training while still being able to attend to both contexts, the left and the right context. Usually, when we do autoregressive training, the right context of each token is masked, but in XLNet they do autoregressive training over a random permutation of the tokens, so the model actually learns to pay attention to both contexts. Now, the problem was that XLNet was also trained on a lot more data than BERT, so when they compared it to BERT it was really hard to tell which part of the improvement came from training on something like 20 times more data and which part came from having this new autoregressive architecture. There was a huge debate, and it was kind of settled by RoBERTa, which was a very simple BERT architecture, exactly the same as BERT but basically just trained on more data, and RoBERTa outperformed XLNet. This showed again the bitter lesson of NLP, and the bitter lesson of machine learning in general, as Rich Sutton talks about it, which is that having more data usually wins over having a smarter model. And now there is this recent paper that we're going to talk a lot more about, which is called "Scaling Laws for Neural Language Models". It's a paper from OpenAI, and it's a really in-depth study of what happens when you increase the data size, when you increase the model size, and when you change the architecture, so it's a very good study. Now, this was about pre-training,
but we see the same thing in fine-tuning, which is that when people fine-tune, they do a lot of data augmentation. A good example is the Winograd Schema Challenge, which was a very interesting dataset for a long time. It's very simple; you see one example here: you have a sentence that says, for instance, "the trophy would not fit in the brown suitcase because it was too big", and the question is, what was too big: was it the trophy or was it the suitcase? The model has to do a classification between these two options. That's very interesting because you need some common sense: you need to know that a suitcase is usually bigger than a trophy. It's a very small dataset, you only get something like 300 examples, and for a long time it was very hard for deep learning models to get good performance on it. The way it was eventually solved was to generate artificial augmented datasets with some heuristics: extracting from Wikipedia sentences where the same noun appears twice, like two times "trophy", and replacing one occurrence with "it". With these heuristics you can build a huge dataset from any crawled text corpus, pre-train your model on that, then do the fine-tuning on the Winograd Schema Challenge, and you can solve the task. But you can see that we're not really happy about this, because scientifically we have not really learned anything about common sense by doing that; we've just learned that more data is better.
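Just to make the heuristic concrete, here is a toy version of it; real pipelines use a parser and much stricter filtering over large crawled corpora, so this is only the core idea: find a repeated noun, replace its second occurrence with "it", and keep the noun as the answer.

```python
DETERMINERS = {"the", "a", "an"}
STOPWORDS = DETERMINERS | {"was", "is", "because", "and", "it", "to", "of"}

def make_wsc_like_example(sentence):
    """Toy augmentation heuristic: find a repeated noun and replace its second
    occurrence (plus its article) with 'it', keeping the noun as the answer."""
    words = sentence.rstrip(".").split()
    seen = set()
    for i, w in enumerate(words):
        key = w.lower()
        if key in seen and key not in STOPWORDS:
            start = i - 1 if i > 0 and words[i - 1].lower() in DETERMINERS else i
            masked = words[:start] + ["it"] + words[i + 1:]
            return {"text": " ".join(masked) + ".", "answer": key}
        seen.add(key)
    return None

print(make_wsc_like_example("The trophy was shiny because the trophy was new."))
# -> {'text': 'The trophy was shiny because it was new.', 'answer': 'trophy'}
```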
So let's talk a bit more about pre-training first. I mentioned the scaling laws paper; let's go into it in more depth. This paper is about one single architecture: the transformer trained for autoregressive language modeling, so you only have the left context of each token, and the transformer is trying to predict the next token given the beginning of the sentence. This is with GPT-2, but they experiment with many model sizes, with many sizes of dataset, and they also did some nice scans over the architecture. There were always open questions about the transformer: what is the optimal ratio of the number of heads with respect to the model size, what is the optimal ratio of the number of layers with respect to the hidden dimension of the model, and they show that none of this really matters much, as long as you are in the very flat sweet spot around the hyperparameters of the original "Attention Is All You Need" paper. As long as you're in this sweet spot you're good, so these models are actually very robust to this simple hyperparameter exploration. And what they show is that, just by scaling the model size and scaling the dataset size, you get a very clear power law, which means a straight line on a log-log plot: every time you double your model size you get a steady improvement in the loss, and every time you double your dataset size you also get a steady improvement. And they go over very wide ranges, over several orders of magnitude.
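To give a feel for what these power laws look like, here is a tiny sketch using roughly the constants I remember from the paper; treat the exact exponents and scales as ballpark values, not something to build on.

```python
# L(N) and L(D): loss as a power law of model size N (non-embedding parameters)
# and dataset size D (tokens). Constants are approximately those reported by OpenAI.
def loss_vs_model_size(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

def loss_vs_data_size(n_tokens, d_c=5.4e13, alpha_d=0.095):
    return (d_c / n_tokens) ** alpha_d

# Doubling the model size multiplies the loss by 2 ** -0.076 (about 0.95):
# a constant factor each time, i.e. a straight line on a log-log plot.
print(loss_vs_model_size(1e9), loss_vs_model_size(2e9))
```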
Now you can read this paper, it's very interesting. They show two things that were particularly interesting to me. One, and there was a follow-up paper by Eric Wallace at UC Berkeley showing the same thing, is that it's actually better to have a model that is too big: it's better for your model to be bigger, relative to the dataset, than we used to think. If your model is slightly too big for your dataset, compared to the way we used to balance dataset size and model size, you can actually get better results; your loss goes down faster. And there is another interesting thing in this paper, which is also something we saw a little earlier with pruning: in these transformer models the embeddings and the layers behave really differently. So when you prune, you should prune embeddings and layers differently, and here they show that the capacity of the model is really defined by the layers: all the power laws they observed work well if you don't take the embedding parameters into account when you compute the size of the model. Now there is one last very interesting thing: they have two laws for the decrease of the loss. One law relates the decrease of the loss to the capacity of the model, as you increase the capacity of the model, and one relates it to increasing the dataset. Both can be related in terms of computation: more data means more computation, and a bigger model also means more computation, so you can connect these two power laws. And what you see on this graph is that you have two different slopes, which means that at some point the two laws cross, and beyond that point you don't really know which loss you should expect: you have one law defined by using the optimal dataset size and one law defined by using the optimal model capacity, and at some point their predictions don't agree anymore. This happens far above what we've been experimenting with right now, somewhere around the trillion-parameter regime, and this is where they say the transformer architecture may be breaking down; that's the open question they leave.
bigger models are actually related to one idea the idea is that maybe there
will be a qualitative jump in behavior if we get enough data okay
the idea is like maybe just getting more data is enough to see a qualitative like
a phase transition how the model behaved and there is some
hints of this it's a quite interesting idea I think it's very controversial
somehow because more data as I was thinking more data bigger model is this
video research program right and there's this nice paper from AI - from Allen
thalmor and people at and at Olay to Israel they show that actually just
comparing birds on robots you can you can you can invert this a phase
transition okay so comparing birch row better is interesting because they are
the same architecture they're exactly the same models just that bird's was
trained on only 137 billion tokens or D and roberto is trained on two point two
Tara seconds okay so what I was really trained on a lot more data and here you
can see this very interesting is they were short evaluation so you just take
the free trade model you don't find unit and you ask a question that are kind of
like the window grad scheme a challenge question okay here you ask it a 25 year
old person age is then a 30 year old person I mean the model has to predict
if the was younger or older so it has to actually compare numbers together and
use some common sense if you want and you can see that vert is pretty bad
perch is the blue curve and Rho beta which is the green curve is actually
like super good at comparing this remember in the in the range of ages for
people you can see the same on size comparison if you ask Roberta to compare
the size of like the Sun to have the table to a house and like that Roberta
is usually pretty good out of the box so it has some form of what we would call
common sense and this is out of the box okay just by pre-training it's also even
able to compare birth rate like birth year sorry like if you asked if somebody
was very born in this year or this year who is older us which means it's the
reverse than the H okay the year the higher the
your birth year of this was the younger you are and the model is able to do this
sweep swatch swap so that's very actually surprising I think now there is
Now there is this big question about what happens when you do fine-tuning. We've seen that for pre-training, more data is just better, and you may even see some phase transition; now what about fine-tuning? Fine-tuning means you've taken this model pre-trained on a huge dataset and now you want to adapt it to one task. There is a very important paper here from DeepMind, "Learning and Evaluating General Linguistic Intelligence". It's a paper that poses a lot of questions, it's an opinion paper, and you should definitely read it; I think it's one of the most important papers of last year, it's about one year old now. What it says is that the current datasets are too easy to solve with little generalization. Why? Because the training datasets we fine-tune these models on are usually quite big, like MNLI, SNLI or SQuAD; they're really big datasets to fine-tune on, and they give us models that don't really have good sample efficiency. Let's look at what this means. Say we have two models: model A has 90 percent accuracy with something like a hundred training examples, but then it doesn't get any better with more training examples; it plateaus at 90 percent. Model B takes like one million examples to get to 90 percent accuracy, but then it can increase a little bit more and it ends up plateauing at 92 percent. If we just do what we usually do, comparing the models at the end of fine-tuning, we will say, oh, model B is better, because it can reach 92 percent accuracy. Well, actually, we should probably prefer model A, because model A is able to reach a very good score with just 100 training examples. That's really great; that's what we wanted from transfer learning: one of the initial goals of transfer learning was to make these models work on very small datasets. This is called sample efficiency: how much better your model gets with one additional example.
There are a lot of other problems with these models which are related. One problem is that when you fine-tune on these big datasets, you usually get models that work well only on the exact training and fine-tuning domain. You have models that work well on SQuAD, for instance, which means they work well on Wikipedia question answering, in this very narrow field of question answering. But we don't really want SQuAD models; we would like question answering models that work on any question answering task. And this is related to sample efficiency, because it means that if you just give the model a few general question answering examples, you would like it to already work well on them; you would like the model not to need to be fine-tuned on all of Wikipedia just to answer questions. There is a related metric here, which is called online code length, which measures how much better your model gets with each additional sample. It's an information-theory metric, related to how well your model can compress the data, so it's a very important metric, and it's probably the way to go forward.
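Here is a rough sketch of how you would measure this prequential "online code length"; the model and fit function below are placeholders, not a real API. Each batch is first scored by the model trained on everything before it, then added to the training set, and the accumulated negative log-probabilities, in bits, tell you how quickly the model learns from additional samples.

```python
import math

def online_code_length(model, batches, fit):
    """Prequential coding sketch: score each batch with the model trained on all
    previous batches, then retrain including it. `model.prob(y, x)` and `fit(...)`
    are assumed interfaces, stand-ins for your actual model and training loop."""
    total_bits = 0.0
    seen = []
    for batch in batches:
        for x, y in batch:
            p = model.prob(y, x)                  # probability of the true label
            total_bits += -math.log2(max(p, 1e-12))
        seen.extend(batch)
        model = fit(model, seen)                  # retrain / fine-tune on all data seen so far
    return total_bits
```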
Here are just a few examples. You can see the benefits of transfer learning first: at the bottom you see a BERT that is trained from scratch on question answering, so this BERT is randomly initialized. It's pretty bad: at the end, having trained on the full SQuAD dataset, you just don't get very high. Now you can see that if BERT is already pre-trained, with its usual pre-training on the Toronto Book Corpus and Wikipedia, it goes a lot faster; this is the benefit of transfer learning, and it reaches a nice accuracy. And now if you look at the last curve, which is a BERT that was pre-trained and then also trained on another question answering dataset, you can see that it already starts very high, which means that this model is very sample efficient, because it was already fine-tuned on another question answering dataset before. So when you look at these models with the online code length metric, you can see that they are actually very different. And when you fine-tune these models, you have to understand: the BERT model was pre-trained, but to fine-tune it we add a linear layer on top, and this linear layer is randomly initialized, so there is no shortcut here; you will have to train this linear layer. You cannot really bypass this: when you use this model with a task-specific layer added on top, you need to train that task-specific layer, which means that you can't really do zero-shot. These models always have to catch up somehow; they always need at least a few examples to train this last layer, however small it is. This is what we call task-specific components, and it's a very strong limit on how sample efficient these models can be.
Now, this was just to show you that when you investigate sample efficiency, you can see that it's also a good way to check whether the model is actually learning the task using the knowledge it had from before, or whether it's learning the task from scratch. Here you see a comparison between BERT and RoBERTa. Remember, we saw in the blue and green diagram that RoBERTa has some kind of common sense, better than BERT, and we can see that here, because when we fine-tune RoBERTa it's a lot more sample efficient: with just a few examples it's already getting better metrics than BERT. By comparing the sample efficiency curves, which show the performance of your model as you progressively use more and more samples to fine-tune it, just by comparing the curves for BERT and RoBERTa, I think you can get a good idea of how much your pre-training is helping you get good performance on your target task. So this paper is also a very nice investigation of that, and it poses a nice question of how much data we should need. This leads us to the next topic, which is in-domain versus out-of-domain generalization.
What we would like in general is out-of-domain generalization; what we usually have is in-domain generalization. What does this mean? Let's have a look. We've trained our model on question answering datasets; now we are experimenting with, say, real life, where question answering is different: the domain is different, the language people use is not Wikipedia language, and we see there is a strong performance drop, because our model is not really capable of out-of-domain generalization. Here is another nice example, from this paper by Tom McCoy: they show that if you train BERT so that it has good performance on your fine-tuning dataset, say on GLUE, you can then test it on another dataset which is out of domain. The way they build the out-of-domain data here is with heuristics, for instance a lexical overlap heuristic. In MNLI you have two sentences and you have to say whether one entails the other or contradicts it, and there are very simple heuristics that work on this dataset: usually little lexical overlap means contradiction, and a lot of lexical overlap means entailment. So they built an adversarial dataset called HANS, which is in the transformers library actually, you can use it, we have an example on it, and it is adversarial in the sense that in HANS, for many examples with a lot of lexical overlap, the correct label is non-entailment; here is an example. And what they show is that you can fine-tune several BERTs on MNLI with different random seeds, so the difference between these models is very small, it's just the weight initialization of the last layer, and these models behave similarly in domain, they have very similar performance; but when you test them on the adversarial HANS dataset, they behave really differently. There is huge variability: some of them are pretty good, well, none of them is really great, but some of them are really bad. This means that the in-domain test performance gives you basically no indication of how your model will behave in the real world, which is kind of bad.
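If you want to see this for yourself, here is a small sketch that probes an MNLI-fine-tuned checkpoint with a HANS-style example: high lexical overlap, yet the correct label is non-entailment. The exact checkpoint name is just one that is available on the model hub; any MNLI model would do.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"  # any MNLI-fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Lots of lexical overlap, but the hypothesis is NOT entailed (it was the doctor who danced).
premise = "The doctor near the actor danced."
hypothesis = "The actor danced."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]
for i, p in enumerate(probs.tolist()):
    print(model.config.id2label[i], round(p, 3))
```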
Here are more examples of what they do on MNLI. There are various heuristics they use to design the adversarial set, and you can see you get more or less variability in the fine-tuned models: some heuristics lead to a really huge variance, which means that you can't really know how your model will behave in the real world unless you are able to test it on real data, and some have a smaller effect.
Now, it's really hard to investigate out-of-domain generalization. One way is to use this kind of heuristic; another way is to try to build the datasets ourselves so we can control them. One really interesting field in this line of work is the work on compositionality. Compositionality is about investigating how your model is able to combine the various parts of a sentence to build a meaning representation. This is very important because we think that, in linguistics, composition is something important that we do: when I say "the blue dog is going out", you kind of gather "blue" and "dog" together into a single meaning, and then you combine this with the rest of the sentence to build up the full meaning. There's a nice piece of work on the SCAN and PCFG SET datasets, a really long but super interesting paper by Dieuwke Hupkes from the University of Amsterdam, where they build a huge dataset that replicates some properties of natural language datasets. They build a dataset in which you have to combine instructions together to generate an output, and you have to combine the instructions compositionally to generate the right output: for instance, there is something like "repeat X", where X has to be treated as a single entity to which the repeat function is applied. And they were able to "naturalize" the dataset, to reproduce the depth and length statistics of a big translation dataset, the WMT challenge. So you have this artificial dataset, where you generate the instructions yourself but which replicates natural language quite well, and with it you can actually study out-of-domain generalization: for instance, from your training set you can remove some instructions, some words, so the model will never see them, or remove some ways of combining words, and you can investigate how the model learns to handle that.
model will learn to do that and I like one of the most fascinating
experiments have so last year which is called over generalization over the
duration is like super fascinating it's it's a bit like when you're when you're
kid you're learning language and you make mistakes but this mistakes are like
smart mistakes okay for instance you will add Edie at the
end of a past verb verb in the past tense but it's like an irregular verb we
say instead of saying I went we say I go with okay and this is called smart
mistake because it means you've learned room you've just not learned yet the
exception and we really want our model to learn rules so they can generalize
outside of the training domain okay so you can investigate that by putting some
irregular verbs in this article that said so here you can investigate that
and the nice thing about these paper is that they compared the complete various
architectures they compared lsdm they comport kind of nets the commands former
together and what they see is the really very important C's like Alice then they
cannot struggle with this question of over memory of over generalization and
transformer are really lot better you three can cornetist somewhere in the
middle so this is this nice graph where you see on the top you have like very
few examples like a very few exception so it's kind of hard for the model to
learn that so here you see during the training the red means that the model is
over generalizing the blue means that the model has lone exception and the
gray mean that the model actually don't we know what to do
so it's predicting like random output which is neither the role Naser expect
the exception and when you have only a very few sample
like very few exception although just can't really get them cornet manage a
little bit to do that but transformer and
they don't when you have a lot of exception the model learned just to
memorize them that would we see in your network okay they are very good at
memorizing brute force memorization and when you were in the middle you see a
bit something that's similar to the way a human learn which is much you start to
have over the ionization during your training you have the peak when you
actually open your eyes everywhere and then you learn that there are some
exception so it's very interesting and it shows that this model are capable of
some out of the main generalization somehow now talking about in domain and
Now, talking about in-domain and out-of-domain generalization poses the question of how you measure the distance between two domains, and that's a very open question. There is a large body of work on domain adaptation that tries to show that you can extract features from your datasets and compute similarity metrics on them, but it's definitely still an open question how you can measure the distance, in a statistical sense, between two datasets, so that you know when you're not in domain anymore. So, talking about in-domain versus out-of-domain generalization, and about this question of sample efficiency, we saw that these task-specific components are actually a problem, because we have to fine-tune them on each task, and they limit how efficient we can be: they increase the number of samples we need to learn the target task we care about at the end. And this is actually related to the rise of natural language generation, of text-to-text models, so let me show you a little bit. Recently we've seen more and more text-to-text models.
This was started by a nice paper by Bryan McCann and colleagues called the Natural Language Decathlon (decaNLP). It was a benchmark where you have ten tasks to tackle, and they were all cast in the same format, in the same framework: they were all cast as question answering tasks. So for translation you would have a question which is "translate this from English to German", and then you have a context which is the English sentence. For summarization, you formulate it as a textual input which says "summarize this", and then the model has to generate the output: it has to generate the German translation, or it has to generate the summary. So these are not classification tasks like we saw before, they are generation tasks. GPT-2 is a big model that got a lot of PR, but it was also a very nice paper, called "Language Models are Unsupervised Multitask Learners", and there were a lot of zero-shot experiments in it. How do you do zero-shot with GPT-2? You do the same thing: you formulate your task with a prompt. For summarization they used what they called "TL;DR" (too long, didn't read): they put the text to summarize together with this prompt, and the model, which is only trained to generate text and is not fine-tuned at all, so this is zero-shot, tries to generate a plausible completion, and a plausible completion of "TL;DR" is a summary. GPT-2 is quite good at that.
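Here is what that zero-shot prompting looks like in practice, a minimal sketch with the small public GPT-2 checkpoint (which will give a much rougher "summary" than the large model used in the paper):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

article = (
    "The Hubble Space Telescope was launched in 1990 and has since provided "
    "some of the most detailed images of distant galaxies ever captured."
)
# No fine-tuning at all: the "TL;DR:" prompt alone nudges the model to produce
# a plausible continuation that reads like a summary (zero-shot).
prompt = article + "\nTL;DR:"
print(generator(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"])
```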
A lot of tasks can be formulated like that, like the LAMBADA dataset, which is a very interesting task where you try to predict the last word of a passage, and that last word is something that is not explicitly said in the beginning but is just implied: for instance, people are talking about giving birth, but it's not explicitly mentioned in the paragraph, and at the end you have to complete with the word "pregnancy". So the model has to understand the underlying meaning of the passage to be able to produce the right word. This is a completion task that can be formulated as text generation, where you generate the next word. And we've seen a rise of models like this, which are trained to generate words and where we recast our usual classification tasks, our usual NLP tasks, as text-to-text generation tasks. We've seen that in a lot of recent models: Facebook's BART model is pre-trained with a text-to-text objective. It was pre-trained by giving it corrupted text, where you have randomly dropped tokens, randomly dropped words, or the text is shuffled, you can see all the objectives here, and the model is trained to regenerate the clean text from that. So you can formulate this denoising objective as text-to-text generation, corrupted text to clean text, and they even trained a multilingual model called mBART on this. We have both of these models in transformers now, so you can try them.
These models are trained in this framework, and the most famous one is the recent T5 from Google. It's mostly famous because, for a short amount of time, it was the biggest model, the 11-billion-parameter model. T5 is pre-trained and fine-tuned like this: it's pre-trained with a denoising objective, like the one we saw for BART, and it's fine-tuned in a text-to-text format. So for instance on GLUE tasks like MNLI, where you have to predict entailment, neutral or contradiction, you formulate your task as a text input and the model has to generate the word "entailment", the word "neutral" or the word "contradiction".
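Concretely, with the released T5 checkpoints this looks like the sketch below: the MNLI pair is written out as plain text and the model generates the label as a word. The exact prompt format follows the convention of the T5 paper as I recall it; double-check it against the checkpoint you actually use.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# The task name, the premise and the hypothesis all go into one plain-text input...
text = ("mnli premise: A soccer game with multiple males playing. "
        "hypothesis: Some men are playing a sport.")
ids = tokenizer(text, return_tensors="pt").input_ids

# ...and the "classification" comes out as generated text, e.g. "entailment".
out = model.generate(ids, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```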
Why is this great? This is great because with this we don't have to fine-tune any additional layer. We don't add any layer to our model: we take the same architecture for pre-training and for fine-tuning, so there is nothing to train from scratch, which means that in theory we can do zero-shot, because no weights need to be fine-tuned on the target task; the model is ready to be used on the target task. Now, usually this means we need to do target-task-informed pre-training. That's something very interesting in the T5 paper, which you should read as well: during pre-training they prepare the model for these tasks by also showing it some examples of the fine-tuning tasks, so the model knows that it will be asked to decide, say, entailment or contradiction between sentences, and then you can do zero-shot. And actually, when you look at what Sam Bowman is saying about GLUE and SuperGLUE, its successor, it is really hard now to find datasets for a good classification task, a good NLU task, where these models don't already reach human performance, and usually they reach human performance because they are taking advantage of this fine-tuning data. So in general, I think we should really focus on zero-shot adaptation for transfer learning, zero-shot or very sample-efficient few-shot adaptation. I hope that's one takeaway from all this discussion.
Now let's go back a little bit. We've talked a lot about the quantity of data, we've talked a lot about the size of the models, but there are some common problems behind all these models, which is their lack of robustness. There are a few aspects to this: one is the lack of robustness when fine-tuning, so let's start by talking about that, and then we'll talk about the lack of robustness in general, before we get to common sense. When you fine-tune these models, you can usually see a pattern like this. This is a nice graph from Jason Phang's paper, "Sentence Encoders on STILTs", where they fine-tune BERT while just varying the random seed, and they show that these models very easily fall into what we could call bad local minima. You can see this on-off behavior: sometimes the model works, and sometimes it just doesn't work at all. There was a follow-up recently from UW, the Jesse Dodge paper, which also explores how the performance of these models behaves when you just vary the random seed for fine-tuning, and they saw the same thing: the models are very sensitive to the random seed and very easily fall into local minima. What I call a local minimum here is that the model has bad performance and stays stuck there. So how do we solve that? Usually we solve it with the very brute-force approach that you can see for RoBERTa, for instance: you fine-tune hundreds of models with various fine-tuning setups, you explore the full hyperparameter space, and you just keep the best one. We'll talk a little more about that later; that's one way to mitigate it. The other way is that we probably just need better regularization. The Mixout paper is very interesting here: they observe that we usually use dropout when we fine-tune these models, and in dropout we replace some weights with zero. Well, when you do fine-tuning, is it really a good idea to regularize the model toward zero? Maybe instead of replacing the weights with zero, we should replace them with their pre-trained values, so we keep the model a little closer to the pre-trained model. They show that this amounts to a form of adaptive L2 regularization, and the models behave better with this regularization, as you can see on the plot here.
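The core of the Mixout idea fits in a few lines; here is a toy, in-place sketch (the real method also rescales the weights, and proper implementations wrap the linear layers, so treat this only as the intuition):

```python
import torch

def mixout_(weight, pretrained_weight, p=0.1):
    """With probability p, reset a weight to its pre-trained value instead of
    zeroing it as dropout/DropConnect would. Toy version of the Mixout intuition."""
    mask = torch.rand_like(weight) < p
    weight.data[mask] = pretrained_weight[mask]

# Usage sketch: keep a frozen copy of the pre-trained weights and call this
# between fine-tuning steps on the layers you want to regularize.
layer = torch.nn.Linear(768, 768)
pretrained = layer.weight.detach().clone()
# ... some fine-tuning steps later ...
mixout_(layer.weight, pretrained, p=0.1)
```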
You can do much more complex things with regularization: all the work from Microsoft on the MT-DNN models that were topping the GLUE leaderboard for some time is also about the various regularizations you can apply. You can do regularization where you try to limit how much the weights move during fine-tuning; there are a lot of ways you can regularize these models, but it's probably the direction you should go. And then the last thing you can do is simply to fine-tune a lot of these models and ensemble them. You can also fine-tune them with multi-task learning, where you try to increase the dataset size by gathering several tasks together, then you train several models that you ensemble, and if the result is too big at the end, because you have several models, you can just distill them back into a single model, which is nice but very complex.
And actually, when you look at the typical setup right now to get the SOTA on GLUE, which a colleague sent me the other day, it's pretty crazy. Look at this: you have to pre-train your model, so here, as we said, just use as much data and as much compute as you can. Then you have to tune the fine-tuning hyperparameters a lot. You do data augmentation: for some specific tasks of GLUE you will start by fine-tuning on MNLI, which is the biggest dataset in GLUE, so you get some transfer; you augment that with unlabeled data, like we saw for the Winograd Schema Challenge; you augment with additional labeled data wherever you can get it. Then you can use some tricks that are not formally forbidden but that everybody uses, like pairwise ranking on QNLI and WNLI, where you actually exchange information between examples of the dataset. And then you fine-tune as many models as you can, you take the five to ten best of all these models, you ensemble them, and you submit that as your result. This is just crazy compute, and it's definitely overfitting a lot to GLUE, and it's a big problem. Why? Because of all this hyperparameter search: we now know that if you take pretty much any kind of model, as was shown by the Gábor Melis paper on LSTMs, and you tune it well enough, you can actually reach very good performance, but you've used a huge compute budget to tune it. This was formalized in a nice paper by Jesse Dodge called "Show Your Work", from last year, which says that we should not just report the final evaluation metric, but we should also report what happened during the hyperparameter search, because that gives information about how much compute you need to actually get this model to good performance.
We talked about this for datasets: we said that if your model needs something like one million samples to get good performance, that should be reported, and we should know it, so we can also select the more efficient model. It's the same for hyperparameter search: if your model needs a crazy hyperparameter search to reach good performance, we should know about it. So the Show Your Work paper says that you should give these curves showing how your model was behaving during the whole hyperparameter search, and what the best configuration was. We see the same thing with standard splits: these datasets have standard train/test splits, which is nice for comparing models, but it leads people to overfit on the standard split. Specific heuristics that happen to work on the standard train/test split can be overfit and almost encoded into the models, which is bad. So the Gorman and Bedrick paper at ACL, "We Need to Talk about Standard Splits", is also a very interesting read, and they advocate for random splits. I don't know if random splits are the solution; maybe the solution would be to have several standard splits, but definitely a single standard split is bad and people overfit to it, and we can see that, because when you test these models on other splits they can actually be kind of worse than what was expected.
So this was about hyperparameter search, but now there is also an underlying question about how our models behave, and we know that they are brittle and spurious. Let's see a little bit what this means. Brittle means that if we change the input a little bit, we see weird behavior. This was very visible in the adversarial SQuAD paper by Jia and Liang: they show that if you add a distracting sentence at the end of the SQuAD context, the answers that the model predicts for the questions are totally different; the model is just lost, because with this very small modification we've left the training domain. But our models are also spurious. Spurious means that they do the least amount of work needed to get the best performance, so if there are easy heuristics they can leverage, they will. For instance, we've talked about lexical overlap: when there is a lot of overlap between the two sentences of an MNLI example, the model will very quickly learn that a lot of overlap means the right classification is entailment. But this is not what we would like: we would like the model to get to the semantic meaning of the sentences and not just stay at the surface, at the lexical overlap heuristic. This is what "spurious" means, and it also means the models are very fragile, because in both cases, as soon as we move a little bit away from the training dataset, the model gives wrong predictions and fails in unexpected ways. So how can we solve that? Well, one way would be first to get a better idea of how these models behave, and I think here linguistics is very important. There is this nice talk by Ellie Pavlick on why we should care about linguistics. I think it's really time for linguistics to come back into NLP, to replace some of the pure machine learning approaches and to help design better evaluations, because these linguistic rules are kind of the underlying rules we would like our models to learn. So they are probably not the right thing to turn into training datasets, but they are probably very good for building evaluation datasets, failure-case evaluation datasets.
Now, what we really want to do is to provide good inductive biases, which means making it easier for our models to learn these rules than to learn the shortcut heuristics. That's what an inductive bias is: it means the model moves toward the solution that we want faster than toward a solution that we don't want. One idea is all the work on compositionality that we showed: we know that compositionality is important for a model to get a good understanding of the meaning of a sentence, so if we can build models that are good at compositionality, that would be one way to incorporate some linguistics. So this is one possibility. How can we do that? Well, we can tweak the architecture of the model. These models have attention heads, so we can try to encode in these attention heads some of the graphs that we see in linguistics, like dependency graphs or shallow semantic-role relations. This was the nice work, for instance, by Emma Strubell, "Linguistically-Informed Self-Attention for Semantic Role Labeling", which was a best paper at EMNLP two years ago, I think, where some of the attention heads were trained with additional annotated data for semantic role labeling, and the model reaches better performance because it was able to embrace this inductive bias. There is a lot of nice work as well with graph convolutional networks, for instance the work by Diego Marcheggiani, Jasmijn Bastings and Ivan Titov, where they explore semantics in translation with graph neural networks encoding the dependency trees. So you can try to modify the architecture. A different way is to try to add the inductive bias in the training data. Here is one funny example; it's an anonymous paper under review, and I think it's very nice. They try to help BERT by providing, after each training sample, a form of semantic role labeling: they will say, OK, this word is the predicate, this word is the first argument. They add this semantic role labeling information after the training sample, so it simply becomes part of the input to the model, and then at test time you just don't use it, you don't input the annotation data, and you see that the model actually gets improved robustness from being trained with this information. They experimented on adversarial SQuAD and the SWAG dataset.
So now, what we probably want to do is to work on pre-training, to get some deeper linguistic information into our pre-training datasets as well: for instance, we can add more linguistically informed span-level representations, and we can try to have more modular ways to pre-train our models. This is all very open, but how we can incorporate this linguistic information in pre-training will probably be important as well. An alternative way to make these models robust is to add common sense, because they are usually lacking some linguistic abilities, but common sense as well; there are real limits here. Why do we need common sense? Because we can't learn everything from text. Let's have a look. You know, for instance, that sheep are white, but in text we usually don't say that, because it's too obvious, and on the contrary we often talk about black sheep, because that's a common expression. So when you ask a model that was trained only on text what color a sheep is, it will say, oh, I'm not sure, it's probably black, because it has no way to know that white is the real answer. This is called reporting bias: it means that in text we usually don't state the obvious, we don't write down common sense. So how can we help models learn that? Well, there are various ways. We can try to add a knowledge base: in the knowledge base we would have this "sheep" node, and we connect it to a "color: white" node, so the model can learn that this is the color of a normal sheep. We can add multimodality: you show a picture of a white sheep to the model, and the model can just look at it and say, oh yeah, OK, it's white, I can see it in the picture. And the last option is that maybe you can just tell the model: we can have human-in-the-loop training, a bit like how humans actually learn; the model can ask "what color is the sheep?", a human answers "it's white", and then the model knows it. These are all possible ways, but you can see that all of them involve something other than text: we need something more structured like a knowledge base, or we need pictures, we need another modality for the model to be able to learn common sense.
Yejin Choi has been working a lot on common sense; she's done really great stuff. The first question is: what is common sense? It's this basic level of practical knowledge about everyday situations. There is a very blurry frontier here with bias, because some part of this common sense is exactly the bias that we don't want in our models, but some part is just normal knowledge, like here: it's OK to keep the closet door open, but it's not OK to keep the fridge door open, because the food will go bad. The way they tackle this is by building a lot of datasets that try to capture some common sense, and I really want to highlight some of them. WinoGrande is really fascinating: it's an extension, now that the Winograd Schema Challenge is solved (we talked about that earlier), they built a successor to it, which is WinoGrande. They have all these common-sense datasets, Cosmos QA, a lot of visual common sense as well, and they're all very interesting. If you want to read just a few papers, you should probably read ATOMIC, COMET and the WinoGrande papers. ATOMIC is this relational knowledge base that I was talking about, which encodes crowd-sourced common sense; COMET is a nice paper by Antoine Bosselut about using pre-trained transformers to actually learn this knowledge graph; and WinoGrande is the successor to the Winograd Schema Challenge. Let me show you a little bit about WinoGrande, because it's interesting: we are building all these datasets from crowd-sourced information, and it's actually very hard to build a good crowd-sourcing setup; I think they have a lot of experience on this topic, which is very interesting. In WinoGrande, the main problem is that we want people to have ideas, and not just to generate the simplest ideas they have, so they use a way to enhance creativity with what they call random anchor words: the crowd workers on Mechanical Turk get a random word that they are supposed to use in the example they are writing, and this helps a lot in making them creative and producing new examples. They also use a very strong data validation setup, where they fine-tune state-of-the-art models on the data and try to remove the examples that are too easy to predict. So this can become an online data creation process: each time you have a better state-of-the-art model, you filter the examples again to remove the easy ones. And they show that current state-of-the-art models have pretty bad performance on this, or at least you would need a lot of samples to get to human performance. So that's probably the way forward for creating crowd-sourced datasets: trying to get more diversity and trying to filter these datasets better.
And this leads us to the same idea for models: the datasets should now be evolving, they should be dynamically updated as new models come in, and the models actually have the same problem, which is the continual learning question. What is it? The example I like is that BERT was trained on data from around 2018, so it will always think that the president of the United States is the 2018 president; but things will change in the future, maybe the president, maybe even the countries will change, and the model will just never learn this. How do we overcome that? Well, we need models that are able to evolve over time. But then there is the main problem with this approach: a lot of people have been working on continual learning, and the main problem is called catastrophic forgetting, which means that you want to learn new things without forgetting everything you've learned before. There are different approaches to tackle this, from memory to regularization to dynamically growing models, and this is probably the way NLP should go forward: to have models that are able to adapt and to generalize to other domains as well. OK, so that's the end for today. That was a very long talk; I hope you liked it. The plan for this series now is to do a lot of smaller talks in the future, with more basic information, so if you liked this you can say so and send us feedback, and we'll continue with this NLP series.