The Future of Natural Language Processing

Hi everyone, I'm Thomas Wolf from HuggingFace, and today we're going to talk about an exciting topic: the future of NLP, or more precisely the future of transfer learning in NLP. To be honest, this talk is a personal walk through some of my favorite papers and research directions of the last few months, so I really hope you enjoy it as much as I do. We're going to talk about a lot of things. We'll start with model size and data requirements, then we'll talk about in-domain versus out-of-domain generalization. We'll move on to fine-tuning and model evaluation, what the problems and limits of these things are, and then we'll end by discussing common sense and inductive biases.

Okay, so let's start with the elephant in the room. You've probably noticed that the models are getting bigger and bigger. There is a nice graph that Victor Sanh made last year which shows how these models are just getting crazy bigger. It's exponentially increasing, so now the state-of-the-art models are over 1 billion parameters, and they are actually far above that, because several models are now around 10 billion parameters, like T5 and the Turing model from Microsoft. You have a huge problem with these models because they don't even fit on one GPU, not even on two GPUs: you need four to eight GPUs just to load these models and run them with a batch size of one.

Now why is it a problem? Well, it's a huge problem because if you check the current leaderboards, for instance the GLUE leaderboard that you can see here, you can see that the competition is narrowing. It's all the same teams now: Google, Microsoft, Amazon, Baidu, Facebook, and that's pretty much it. You see there are no academics there, because the models are just too big and the computational requirements are too big. So there's a huge problem of diversity, and also of where academia fits in current NLP research.
Another problem that you've probably seen is the environmental cost of training these models. They require a lot of energy, and the energy consumed generates carbon dioxide, so we know that training these models isn't good for the environment. So what can we do? The last problem was very well stated by François Chollet: if we do go bigger, what do we expect? Do we expect to see a phase transition at some point, or is it just like building bigger ladders to try to reach the moon?

Now, there is another option, which is to go the other way. We have known since this very nice paper, I really like the title, "Optimal Brain Damage", in 1989, that neural nets are over-parametrized: they have too many weights. We can just prune them. The most recent example is the lottery ticket hypothesis, which says that if you take a randomly initialized model you can actually find a subset inside this model which already has good performance on your test tasks. You don't even need to train your model: you can just take this big model and find a subset, a small sub-network inside of it, that's already good for your task. We see that in fine-tuned models as well: when you fine-tune models you can remove weights, you can select the weights, and you keep the performance. Here you see this example on NLP tasks where you can remove around 90 percent of the weights and keep the same performance.

So we want to push in this direction, and here is a small promotion. We're running a competition, which actually started two days ago, about building the most efficient models you can. It's called SustaiNLP, a workshop that will be co-located with EMNLP at the end of the year. The goal is to reach the same threshold as current state-of-the-art models, like BERT-base for instance, while being as energy efficient as you can. The competition is only on inference for now, because inference is actually one of the
biggest parts when you look at the lifetime computational cost of a model. When these models are deployed in applications on thousands of servers, inference cost is actually the biggest part of their lifetime environmental cost. So if we can get better at inference, we're already a long way toward our goal of getting more efficient models.

Now, how can you reduce the size of the models? Let's go a bit into that. There are mostly three techniques that you can use: the first one is called distillation, the second one is pruning, and then there is quantization.

So, distillation. Here is a good example: we made a model called DistilBERT at the end of last year, which gets around 95% of BERT's performance on GLUE and is around 40% smaller. How do you do that? Well, you take BERT as a teacher model, which means you have a pre-trained BERT, and then you train a student model, which is smaller, to reproduce the generalization capabilities of the teacher. Here is a good example. You see this sentence, "I think this is the beginning of a beautiful...", and the model is asked to complete it. BERT is trained like that: BERT is trained to predict masked tokens. Here you see the top predictions of BERT, and you see they all make sense: "day", "life", "future", "story". All these top predictions that BERT thinks are possible make sense. This is what we call generalization.
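The teacher-student idea can be sketched in a few lines of plain Python. This is only an illustration, not the DistilBERT training code: the function names, the toy logits, and the temperature value of 2.0 are assumptions made for the example. It computes a cross-entropy between the teacher's softened distribution over candidate tokens and the student's, so a student that spreads probability over "day", "life", "future" and "story" the way the teacher does is rewarded more than one that only knows "day".

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Toy logits over five candidate completions (made-up numbers).
vocab = ["day", "life", "future", "story", "banana"]
teacher_logits = [4.0, 3.5, 3.0, 2.5, -2.0]

# A student that generalizes like the teacher...
good_student = [3.8, 3.6, 2.9, 2.4, -1.5]
# ...and one that puts almost all its mass on a single token.
narrow_student = [6.0, -3.0, -3.0, -3.0, -3.0]

loss_good = distillation_loss(good_student, teacher_logits)
loss_narrow = distillation_loss(narrow_student, teacher_logits)
print(f"good student:   {loss_good:.3f}")
print(f"narrow student: {loss_narrow:.3f}")
```

The narrow student gets the higher loss even though its top prediction matches the teacher's, which is the point of training on the full output distribution rather than on a single label.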
The BERT model learned to generalize beyond just the simple training example in this sentence, and what we do is train the student model to generalize in the same way as the teacher model, so it learns the inductive biases that the teacher model has learned. It's very easy: it's called knowledge distillation, and we just use a cross-entropy loss between the output of our student and the output of our teacher. When you train, you can use a temperature to emphasize the lower probabilities; that's a very common trick in NLP.

A lot of people have been publishing on distillation; you can see a couple of papers from the end of last year. The state-of-the-art distillation models are quite complex now. TinyBERT is a good example, where the student model has a smaller hidden-state size than the teacher model and also mimics the hidden states of the teacher, so you have a down-projection from the teacher to the student. They also used a lot of data augmentation, so it's kind of tricky to know exactly which part of the good performance of these latest models comes from data augmentation and which part comes from distillation. But people can definitely get very small models with good performance using a mix of distillation and data augmentation.

Now let's move on to the second technique you can use to reduce the size of the model, which is called pruning. In pruning, you work directly on your teacher model and you remove weights from this model to make it smaller. There are various ways you can prune, and one simple way is to remove attention heads in your transformer. This was shown in two nice papers from last year, one by Elena Voita from Edinburgh University and another by Paul Michel at CMU. They show that you can remove a lot of the heads of a transformer model after it has been trained and keep very good performance.
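As a rough sketch of what removing a head means mechanically: the output of a pruned head is zeroed before the per-head outputs are concatenated. The helper names, the toy head outputs, and the per-head scores below are all made up for illustration; how such scores are obtained in practice is a separate question.

```python
def mask_heads(head_outputs, head_mask):
    """Zero out the output of every head whose mask entry is 0.

    head_outputs: one output vector per attention head (list of lists).
    head_mask:    one 0/1 flag per head; 0 means the head is pruned.
    """
    return [[value if flag else 0.0 for value in head]
            for head, flag in zip(head_outputs, head_mask)]

def concatenate(head_outputs):
    """Concatenate per-head vectors, as done before the output projection."""
    return [value for head in head_outputs for value in head]

def mask_from_scores(scores, num_to_remove):
    """Build a 0/1 mask that drops the heads with the lowest scores."""
    ranked = sorted(range(len(scores)), key=lambda h: scores[h])
    removed = set(ranked[:num_to_remove])
    return [0 if h in removed else 1 for h in range(len(scores))]

# Toy layer with 4 heads of dimension 2 (made-up numbers).
head_outputs = [[0.5, -1.0], [2.0, 0.1], [0.0, 0.3], [-0.7, 0.9]]
scores = [0.9, 0.05, 0.4, 0.01]  # hypothetical per-head importance scores

head_mask = mask_from_scores(scores, num_to_remove=2)
pruned = concatenate(mask_heads(head_outputs, head_mask))
print(head_mask)  # [1, 0, 1, 0]
print(pruned)
```

Masking like this keeps the tensor shapes unchanged, which is convenient for measuring the effect of a head; to actually save memory and compute, the corresponding rows and columns of the projection matrices would have to be removed.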
So on the top you see the results on translation: you can remove around 90 percent of your heads and keep a very good BLEU score. At the bottom you see the results on GLUE, the General Language Understanding Evaluation benchmark, and you see pretty much the same behavior. One way to identify the heads you should remove is by using what Paul Michel and colleagues call the head importance score, which is the gradient of the loss with respect to the output of the attention layer. If you remove the least important heads first, you can keep this very good performance.

What is interesting, as you see on the other graph here, is what happens if you remove heads that are less important for one task. Here it's on MNLI, and you can see that this is quite resilient to domain adaptation. In MNLI you have two parts of the dataset, what are called the matched dataset and the mismatched dataset. If you remove heads that are not useful for one, you can see that it's related: okay, the graph is not exactly linear, but there is some correlation with the heads that are not important on the other domain. This means there are some heads that are not useful for anything, at least on MNLI. That's interesting because it means this is quite resilient to domain adaptation.

Now, you can also directly remove weights. When you remove weights it's more fine-grained, because each specific weight can be removed, but the problem is that you end up with very sparse matrices that are not so good for GPUs. We'll talk about that later. But you can also get very good performance. Here is a nice paper from an ASAPP team, with Jeremy Wohlwend and Ziheng Wang, which works with pre-trained models as well, removing weights with a nice differentiable L0 pruning. They use
a hard concrete distribution, which is basically the Gumbel-softmax trick.

The last approach is layer pruning. On layer pruning there is this nice paper by Angela Fan from last year, at Facebook. They remove full layers of the transformer, so this is really a lot: you remove the full layer. The way you can do that and have the model still behave quite well is by training the model to be resilient to it. During pre-training you randomly remove layers, like a dropout. It's a structured approach to dropout on layers, so the model learns to behave well without some layers. It works well because transformer layers are a repetition of the same module, and you have this residual connection, this shortcut connection, which means that one layer and the next are always connected by a shortcut as well. So when you remove a layer it's less aggressive than in, say, fully connected models without shortcut connections. So layer pruning is very interesting as well, and you keep dense matrices because you remove full blocks of weights.

So why am I talking about this problem of sparsity? Because we run all these models on GPUs or TPUs, and CPUs and GPUs are really optimized for dense matrix multiplication. They have trouble with sparsity, and when you use a sparse model on a CPU, GPU, or TPU, it's usually way slower: it can be three to four times slower to run. So these models are smaller indeed, but they're also a lot slower, and they're not energy efficient either, so you're losing what you were actually looking for, which was energy efficiency. There are various ways you can try to circumvent this. One way is to use what OpenAI was promoting, which is block sparsity: instead of removing single weights, you remove blocks of weights, and these blocks have a size that is adapted to
your GPU or TPU kernels, which means that you keep dense matrix multiplications. Well, you remove blocks, but when you have strong sparsity it really means you just keep a few blocks: your matrix is just a few blocks, as you can see here. So this helps. Another approach is to have full sparsity but with patterns that you control, so you can keep the advantages of optimized CUDA kernels. Now, the more you structure the sparsity, usually the less performance you can get, because you're constraining the model. With unstructured sparsity you can usually keep the best performance on all the metrics, but you lose the efficiency; the more you structure the sparsity, the better your energy efficiency and usually the worse your performance. Another alternative is to switch chips and try, for instance, the new IPU from Graphcore, chips that are specifically designed for sparse models. They are made of a lot of small modules that can process data independently, each with a small RAM associated with it, and they can process sparse matrices very efficiently.

Now, the last technique I want to talk about for shrinking models is quantization. Quantization is also very interesting. We know that using float32, full-precision float weights, is not the optimal way; we know that neural networks also work well with lower precision and even with quantized integers. So we can do that for our transformers as well: you convert the float32 full-precision weights into integers, so you reduce the size of the model a lot. You can use dynamic quantization, for instance, where you have a scale and zero-point conversion, and this works very well. There was a nice work by Intel called Q8BERT, and it's really working well. You can try it, it's very easy in Python, and it's very easy to apply quantization. And, a bit like layer pruning, you can do
quantization-aware training: you tell your model that it's going to be quantized at the end, and you train it in a way that it gets used to being quantized, so you have better performance at the end as well.

Okay, we've talked a lot about these big models and how to reduce their size. Now there is another thing that has been increasing exponentially recently in NLP, which is the requirement for more data. People are using more data for pre-training, and people are also using more data for fine-tuning. That's a problem, because when you compare two models that were pre-trained on two datasets of very different sizes, it's really hard to tell if one model is better because it was pre-trained on more data or because of the novel architectural design that people usually introduce. A good recent example from last year was XLNet, the transformer from Google that was the successor of Transformer-XL. XLNet uses a smart autoregressive training, so you can do autoregressive training while still being able to attend to both contexts, the left and the right context. Usually autoregressive training only goes one way: the right context of each token is masked. But in XLNet they do autoregressive training with a random permutation, so the model learns to pay attention to both contexts.

Now, the problem is that XLNet was also trained on a lot more data than BERT, so when they compared it to BERT it was really hard to tell which improvement came from training on 20 times more data and which improvement came from having this new autoregressive architecture. So there was a huge debate, and it was kind of settled by RoBERTa, which is a very simple BERT architecture, exactly the same as BERT but basically just trained on more data. RoBERTa outperformed XLNet, which showed that
basically this was the bitter lesson of NLP, and the bitter lesson of machine learning in general, as Rich Sutton talks about it, which is that having more data usually outperforms having a smarter model. And now there is this recent paper that we're going to talk a lot more about, called "Scaling Laws for Neural Language Models". This is a paper from OpenAI, and it's a really in-depth study of what happens when you increase the data size, when you increase the model size, and when you change the architecture, so it's a very good study.

Now, this is for pre-training, but we see the same in fine-tuning, which is that when people fine-tune they do a lot of data augmentation. A good example is the Winograd Schema Challenge. The Winograd Schema Challenge was a very interesting dataset for a long time. It's very simple. You see one example here: you have a sentence that says, for instance, "The trophy would not fit in the brown suitcase because it was too big", and the question is, what was too big? Was it the trophy or was it the suitcase? The model has to do a classification between these two. That's very interesting because you need some common sense: you need to know that a suitcase is usually bigger than a trophy.

For a long time it was a very hard challenge. It's a very small one, with only about 300 examples, and for a long time it was very hard to get good performance from deep learning models on it. The way it was solved was to generate artificial data-augmentation datasets with some heuristics, extracting from Wikipedia sentences where the same noun appears twice, like "trophy" twice, and replacing one of them by "it". With these heuristics you can build a huge dataset from any crawled text dataset, you can pre-train your model on that, then do this fine-tuning on the Winograd Schema Challenge after that, and you can solve the task. But you can
see that we're not really happy about that, because scientifically we have not really learned anything about common sense by doing that. We've just learned that more data is better.

So let's talk a bit more about pre-training first. I talked about the scaling laws paper; let's go into depth on this paper. This paper is about one single architecture: the transformer trained for autoregressive language modeling. So you only have the left context for each token, and the transformer is trying to predict the next token given the beginning of a sentence. This is like GPT-2, but they experiment with many sizes of the model and many sizes of the dataset, and they also did a nice scan over the architecture. There was always some question about the transformer: what is the optimal ratio of the number of heads with respect to the model size? What is the optimal ratio of the number of layers with respect to the dimension of the model? They show that all of this doesn't really matter as long as you're in the very flat sweet spot around the hyperparameters that were pretty much in the original "Attention Is All You Need" paper. As long as you're in this sweet spot, you're good. So these models are actually very robust to this simple hyperparameter exploration.

What they show is that just by scaling model size and scaling dataset size you have a very clear power law, which means the loss is a straight line on a log-log plot: each time you double your model size you get the same constant-factor improvement in the loss, and each time you double your dataset size you also get a constant-factor improvement. And these power laws hold over very wide ranges, over several orders of magnitude.

You can read this paper, it's very interesting. They show two things that were interesting for me. One is something that was also studied in a follow-up paper by Eric Wallace and colleagues at UC Berkeley, who
show that it's actually better to have a model that is too big. It's better for your model to be bigger than we used to think for a given dataset: if your model is slightly too big for your dataset, compared with the way we used to match the size of the dataset and the model before, you can actually get better results, and your loss goes down faster.

There is another interesting thing in this paper, which is also something we saw a little bit earlier with pruning: in these transformer models, the embeddings and the layers behave really differently. So when you fine-tune, when you prune, you should treat embeddings and layers differently. Here they show that the capacity of the model is really defined by the layers, and all the power laws they observed work well if you don't take the embeddings into account when you compute the size of the model.

Now, a last very interesting thing is that they have two laws for the decreasing loss. One law relates the decrease in loss to increasing the capacity of the model, and one relates it to increasing the dataset. Both can be related in terms of computation: more data means more computation, and a bigger model also means more computation, so you can connect these two power laws. What you see on this graph is that you have two slopes, which means that at some point they cross, and you don't really know which loss you should have. You have one law defined by using the optimal dataset size and one law defined by using the optimal model capacity, and at some point their predictions don't fit together. That point is far above what we've been experimenting with right now; it's around the
[unclear] parameters regime, and here what they say is that the architecture, the transformer architecture, is breaking down. That's OpenAI's view.

Okay, so all this exploration of more data and bigger models is related to one idea: that maybe there will be a qualitative jump in behavior if we get enough data. Maybe just getting more data is enough to see something like a phase transition in how the model behaves. There are some hints of this, and it's quite an interesting idea. I think it's very controversial somehow, because, as I was saying, is "more data, bigger model" really a research program? There's this nice paper from AI2, from Alon Talmor and people at AI2 Israel, and they show that just by comparing BERT and RoBERTa you can see hints of this phase transition. Comparing BERT and RoBERTa is interesting because they have the same architecture; they're exactly the same model, it's just that BERT was trained on only 137 billion tokens and RoBERTa was trained on 2.2 trillion tokens. So RoBERTa was really trained on a lot more data.

Here you can see, and this is very interesting, a zero-shot evaluation: you just take the pre-trained model, you don't fine-tune it, and you ask questions that are kind of like the Winograd Schema Challenge questions. Here you ask, "a 25-year-old person's age is ... than a 30-year-old person's", and the model has to predict whether it's "younger" or "older". So it has to compare numbers and use some common sense, if you want. You can see that BERT, the blue curve, is pretty bad, and RoBERTa, the green curve, is super good at comparing these numbers in the range of ages for people. You can see the same on size comparison: if you ask RoBERTa to compare the sizes of things like the sun, a table, a house and so on, RoBERTa is usually pretty good out of the box. So it has some form of
what we would call common sense, and this is out of the box, just from pre-training. It's even able to compare birth years: if you ask whether somebody who was born in this year or that year is older, it's the reverse of age. The higher your birth year, the younger you are, and the model is able to do this swap. So that's very surprising, I think.

Now there is this big question of what happens when you do fine-tuning. We've seen that pre-training on bigger data is just better, and you may even see some phase transition. Now what about fine-tuning? Fine-tuning means you've taken these pre-trained models and now you want to adapt them to one task. There is a very important paper from DeepMind, "Learning and Evaluating General Linguistic Intelligence". It's a paper that poses a lot of questions. It's an opinion paper, and you should definitely read it; I think it's one of the most important papers of last year. It's now one year old. What they say is that the recent datasets are too easy to solve with little generalization. Why? Because the training datasets we have for the models are often quite big, like MNLI or SNLI or SQuAD. They're really big datasets to fine-tune on, and they give models that don't really have good sample efficiency.

So let's focus on what this means. Let's say we have two models. Model A has 90 percent accuracy with around a hundred training examples, but then it doesn't get any better with more training examples; it plateaus at 90 percent. Model B takes around one million examples to get to 90 percent accuracy, but then it can increase a little bit and ends up plateauing at 92 percent. If we do what we usually do, comparing the models at the end of fine-tuning, we will
say, oh, model B is better because it can reach 92 percent accuracy. Well, actually we should really prefer model A, because model A is able to reach a very good score with just 100 training examples. That's really great; that's what we wanted from transfer learning. One of the initial goals of transfer learning was to make these models work on very small datasets. This is called sample efficiency: it means how much better your model gets with one additional example.

There are a lot of other problems with these models which are related. One problem is that when you fine-tune on these big datasets, you usually get models that work well only exactly on the fine-tuning domain. So you have models that work well on SQuAD, for instance, which means they work well on Wikipedia question answering, this very narrow field of question answering. But we don't really want SQuAD models; we would like to have question answering models that work on any question answering task. This is related to sample efficiency, because it means that if you give just a few general question answering examples, you would like your model to already work well on them. You would like the model not to need to fine-tune on all of Wikipedia just to answer questions on Wikipedia. There is a related metric called online code length, which says how much better your model gets with each additional sample. It's an information-theoretic metric which is related to how much your model can compress the data, so it's a very important metric, and it's probably the way to go forward.

So here are just a few examples. You can see here the benefits of transfer learning first. At the bottom you see a BERT that's trained from scratch on question answering, so this BERT is not pre-trained, it's randomly initialized. It's pretty bad at the
end: you train on the full SQuAD dataset and you just don't get very high. Now you can see that if BERT is already pre-trained, with its usual pre-training on the Toronto Book Corpus and Wikipedia, it goes a lot faster. So this is the benefit of transfer learning, and it reaches a nice accuracy. And now if you look at the last one, which is a BERT that was pre-trained on another question answering dataset, you can see that it already starts very high. This means the model is very sample efficient, because it was already fine-tuned on another question answering dataset before.

So when you look at these models with the online code length metric, you can see that they're very different. When you fine-tune this model, you have to understand that the BERT model was pre-trained, and to fine-tune it we add a linear layer on top, and this linear layer is randomly initialized. So there is no shortcut here: you have to train this linear layer, you cannot really bypass it. When you use this model with a task-specific layer added on top, you need to train the task-specific layer, which means that you can't really do zero-shot. These models always have to catch up somehow; they always need a few examples to be able to train this last layer, however small the layer is. This is what we call task-specific components, and that's a very strong limit on how sample efficient a model can be.

Now, this was just to show you that when you investigate sample efficiency, it's also a good way to see whether the model is actually learning the task using the knowledge it had from before, or whether it's learning the task from scratch. Here you see a comparison between BERT and RoBERTa. As we've shown with the blue and green diagram, RoBERTa has some kind of good common sense, better than BERT, and we can see that because when
we fine-tune RoBERTa it's a lot more sample efficient: with just a few examples it's already catching on, it's already getting better metrics than BERT. By comparing the sample efficiency curves, which show the performance of your model as you use progressively more samples to fine-tune it, just by comparing the curves for BERT and RoBERTa, I think you can have a good idea of how much your pre-training was helping to get good performance on your target task. So this paper is also a very nice investigation of that, and it poses a nice question of how much data we should need.

This leads us to the next topic, which is in-domain versus out-of-domain generalization. What we would like in general is out-of-domain generalization; what we usually have is in-domain generalization. What does that mean? Let's have a look. We've trained our model on question answering datasets, and now we are experimenting with real life, where question answering is different: the domain is different, the language people use is not Wikipedia language, and we see there is a strong performance drop, because our model is not really capable of out-of-domain generalization.

Here is another nice example, in this paper by Thomas McCoy, which shows that if you train BERT and it has good performance on your fine-tuning dataset from GLUE, you can then test it on another dataset which is out of domain. The way they built the out-of-domain data here is by using some heuristics, for instance the lexical overlap heuristic. In MNLI you have two sentences and you have to say whether one entails the other or contradicts the other. There are very simple heuristics for that in the dataset: usually, if there is a "not", it means contradiction, and if there is a lot of lexical overlap, it usually means entailment. So they built an adversarial dataset called HANS, which is in the transformers library, actually; you
can use it, we have an example for it, and it is adversarial: in HANS, when there is a lot of lexical overlap, the correct label is contradiction. Here is an example. What they show is that they can fine-tune several BERTs on this with different random seeds. So the difference between these models is very small: it's just the weight initialization of the last layer. In domain these models behave similarly, with very similar performance, but when you test them on the adversarial HANS dataset they behave really differently. There is this huge variability: some of them are pretty good, well, none of them is really great, but some of them are really bad. This means that the in-domain test performance gives you no indication of how your model will behave in the real world, which is kind of bad. Here are more examples of what they do on MNLI, with the various heuristics they use to design the dataset, and you can see you have more or less variability in the fine-tuned models. Some heuristics lead to really huge variance, which means you can't really know how your model will behave in the real world unless you're able to test it on real data, and some have a smaller effect.

Now, it's really hard to investigate out-of-domain generalization. One way is to use this kind of heuristics. Another way is to try to build the datasets ourselves so we can control them. One really interesting field in this area is the work on compositionality. Compositionality is about investigating how your model is able to combine the various parts of a sentence to build a meaning representation. This is very important because we think that in linguistics composition is something important that we do. When I say "the blue dog is going out", you kind of gather "blue"
and "dog" together in a single meaning, and then you combine this with the rest of the sentence to build up the meaning. So there's a nice work called SCAN, and PCFG SET, which was actually a really, really long but super interesting paper by Dieuwke Hupkes from Amsterdam University, and they built a huge dataset that replicates some natural language datasets. So they built a dataset in which you have to combine instructions together to generate an output, okay, and you have to combine instructions compositionally so you can generate the right output. So you have, like, it's written like "repeat something", and this something has to be considered as a single entity, and then the repeat function has to be applied on it. And they were able to naturalize the dataset, so they were able to reproduce the depth and the length of, like, a very big translation dataset, the WMT challenge. Okay, so there is this artificial dataset, you generate these instructions yourself, but it really replicates natural language well. And with this artificial dataset you can actually do some out-of-domain generalization. So for instance, in your training part you can remove some instructions, some words, so the model will never see them, or some ways to combine words, and you can investigate how the model will learn to do that. And one of the most fascinating experiments I saw last year is the one on what is called overgeneralization. Overgeneralization is super fascinating. It's a bit like when you're a kid, you're learning language and you make mistakes, but these mistakes are smart mistakes. Okay, for instance, you will add "-ed" at the end of a verb in the past tense, but it's an irregular verb: instead of saying "I went", you say "I goed". Okay, and this is called a smart mistake because it means you've learned the rule, you've just not learned the exception yet. And we really want our models to learn rules so they can generalize outside of the training domain, okay.
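To make the idea of a "smart mistake" concrete, here is a minimal sketch, assuming a toy English past-tense task. The verb table and the function names (`regular_past`, `classify_prediction`) are hypothetical and only for illustration; this is not the setup or the code of the paper. It sorts a predicted past tense into three buckets: correct, overgeneralization (the regular rule applied to an irregular verb), or something else.

```python
# Toy illustration only: the verb table and helper names are hypothetical,
# they are not taken from the PCFG SET paper.
IRREGULAR_PAST = {"go": "went", "eat": "ate", "see": "saw"}


def regular_past(verb: str) -> str:
    """Apply the regular rule: add "ed" (or just "d" after a final "e")."""
    return verb + "d" if verb.endswith("e") else verb + "ed"


def classify_prediction(verb: str, predicted: str) -> str:
    """Label a predicted past tense as correct, overgeneralization, or other."""
    expected = IRREGULAR_PAST.get(verb, regular_past(verb))
    if predicted == expected:
        return "correct"
    # The "smart mistake": the rule was learned, the exception was not.
    if verb in IRREGULAR_PAST and predicted == regular_past(verb):
        return "overgeneralization"
    # Neither the rule nor the exception.
    return "other"


if __name__ == "__main__":
    for verb, predicted in [("go", "goed"), ("go", "went"), ("go", "gone"), ("walk", "walked")]:
        print(verb, predicted, classify_prediction(verb, predicted))
```

Counting how often each label shows up over the course of training would be one simple way to see whether a model goes through a phase of applying the rule everywhere before it picks up the exceptions.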
So you can investigate that by putting some irregular verbs, some exceptions, in this artificial dataset. And the nice thing about this paper is that they compared various architectures: they compared LSTMs, they compared ConvNets, they compared Transformers together. And what they see is really very important: LSTMs kind of struggle with this question of overgeneralization, Transformers are really a lot better, and ConvNets are somewhere in the middle. So this is this nice graph where you see, on the top, you have very few examples, like very few exceptions, so it's kind of hard for the model to learn that. So here you see, during the training, the red means that the model is overgeneralizing, the blue means that the model has learned the exception, and the gray means that the model actually doesn't know what to do, so it's predicting, like, random output, which is neither the rule nor the expected exception. And when you have only very few samples, like very few exceptions, the models just can't really get them; ConvNets manage a little bit to do that, but Transformers, they don't. When you have a lot of exceptions, the model learns just to memorize them; that's what we see in neural networks, okay, they are very good at brute-force memorization. And when you are in the middle, you see something that's a bit similar to the way a human learns, which is that you start to have overgeneralization during your training, you have the peak when you actually overgeneralize everywhere, and then you learn that there are some exceptions. So it's very interesting, and it shows that these models are capable of some out-of-domain generalization somehow. Now, talking about in-domain and out-of-domain generalization poses the question of how you measure the distance between your two domains, and that's a very open question. There is a large body of work on domain adaptation that is trying to show that you can actually extract some features from your
dataset and you can compute some similarity metrics on them, but it's definitely a very open question: how can you measure the distance, like in a statistical sense, between two datasets, okay, and how can you know when you're not in domain anymore. Okay, so I think, talking about in-domain versus out-of-domain generalization, when we were talking about this question of sample efficiency, we saw that using these task-specific components was actually a problem, okay, because on each task we have to fine-tune them, and they are limiting how efficient we can be, like they are increasing the number of samples we need to learn the target task that we have at the end. And this is actually related to the rise of NLG, so let me show you a little bit. Okay, recently we've seen more and more text-to-text models. This was, I think, started by this nice Salesforce paper by Bryan McCann, which was called the Natural Language Decathlon. It was a benchmark where you have, like, ten tasks to tackle, and they were all cast in the same format, in the same framework: they were all cast as question answering tasks. So if you have translation, you would have a question which is "translate this from English to German", and then you have a context, which is the English sentence. Okay, when you have summarization, you actually formulate that as a textual input, which is "summarize this", and then you have the news example, and your model has to generate the output: it has to generate the translation, the German translation, or it has to generate the summarization. So it's not a classification task like we saw before, it's a generation task. GPT-2 is a big model that made a lot of PR, but it was also a very, very nice paper, called "Language Models are Unsupervised Multitask Learners", and there were a lot of zero-shot experiments in it. How do you do zero-shot with GPT-2? You do the same: you actually formulate your
task with a prompt, which is, for instance, "summarize"; or for summarization what they did is they put the text to summarize and then "TL;DR", too long, didn't read. And the model is not actually trained to generate this, there is no training, it's zero-shot: the model tries to generate some plausible completion, and the plausible completion will be a summary, and GPT-2 is quite good at that. A lot of tasks can be formulated like that, like the LAMBADA dataset, which is a very interesting task where you try to predict the last word, and the last word of the sentence is something that is not explicitly said in the beginning but is just implied. For instance, people are talking about giving birth, but it's not explicitly mentioned in the paragraph, and at the end you have to complete with the word "pregnancy". Okay, so the model has to understand the underlying meaning of the sentence to be able to put in the right word. So this is a completion task, it can be formulated as text generation where you generate the next word. And we've seen a rise of models like this, which are trained to generate words, and where we actually recast our usual classification or usual NLP tasks as text-to-text generation tasks, okay. And we've seen that in a lot of recent models: Facebook's BART model, which was pre-trained with text-to-text objectives, so it was pre-trained by giving it corrupted text, where you have randomly dropped tokens, randomly dropped words, or the text is shuffled, you can see all the objectives here, and the model is trying to regenerate the clean text from that. Okay, so you can formulate this denoising objective as text-to-text generation, corrupted text to clean text generation. And they even trained a multilingual model called mBART on this. So we have both of these models now in transformers, so you can try them. And these models, they're trained in this framework, and the most famous one is the recent T5 from Google. It's mostly famous because for some time, for a short amount of
time, it was the biggest model, so the 11 billion parameter model. And T5 is pre-trained and fine-tuned like this: it's pre-trained with a denoising objective, like the one we saw for BART, and it's fine-tuned in a text-to-text format. So for instance on GLUE tasks like MNLI, you have to predict entailment or contradiction, and you will formulate your task as a text input, and the model will have to generate the word "entailment" or the word "contradiction". Why is it great? This is great because with this we don't really have to fine-tune any additional layer, okay, we don't actually add any layer to our model. We take the same architecture for pre-training and for fine-tuning, there is nothing to train from scratch, which means that in theory we can do zero-shot, because no weights need to be fine-tuned on the target task. Okay, the model is ready to be used on the target task. Now, usually this means we need to do, like, target-task-informed pre-training. That's what is very interesting in the T5 paper, that you can read as well: during the pre-training they prepare the model for these tasks by giving some examples of the fine-tuning tasks as well, so the model knows that it will be asked this kind of entailment-or-contradiction question. But then you can have zero-shot. And actually, when you look at what Sam Bowman is saying about GLUE, and SuperGLUE which is the successor to it, it is actually really hard now to find some datasets, okay, where we can have a good classification task, like a good NLU task, where these models don't already reach human performance. And usually they reach human performance because they are taking advantage of this fine-tuning on the task. Okay, so in general I think we should really focus on zero-shot adaptation for transfer learning, like zero-shot or very sample-efficient
adaptation. Okay, I hope that's the takeaway of all this discussion. Now let's go back a little bit. We've talked a lot about the quantity of data, we've talked a lot about the size of the models, but all these models have some common problems beyond this quantity of data, which is their lack of robustness. There are a few things, but for instance one is the lack of robustness when fine-tuning, so let's start by talking about that, and then we'll talk about the lack of robustness more generally and we'll get to common sense. When you fine-tune these models you can usually see some pattern like this. So this is a nice graph from Jason Phang's paper, called "Sentence Encoders on STILTs", and they fine-tune BERT with just varying the random seed, and they show that these models very easily fall into what we call local minima. So you can see this kind of behavior: sometimes the model works, and sometimes it just doesn't work at all. So there was a follow-up recently by UW, the Jesse Dodge paper, and it's also exploring, when you just vary the random seed for fine-tuning, how the performances of these models behave, and they saw the same thing, that the models are very sensitive to the random seed and they very easily fall into local minima. What I call local minima is that the model has bad performance and is stuck in this minimum. Okay, so how do we solve that? Usually we solve that with a very brute-force approach, that you can see for RoBERTa for instance, which is that you will fine-tune hundreds of models on various fine-tuning setups, you're exploring the full hyperparameter space, and you just keep the best one. Okay, we'll talk a little bit more about that later. Now, that's one way to mitigate that. The other way is that we probably just need better regularization. Okay, so the Mixout paper is very interesting. They show that actually, we usually use dropout to fine-tune these models; in dropout we
replace some weights by zero. Okay, well, when you do fine-tuning, is it good to actually have the model regularized toward zero? Maybe instead of replacing the weights with zero we should replace them with the pre-trained values, so we keep them a little closer to the pre-trained model. Okay, and they show that it's actually some form of adaptive L2, and the models are behaving better with this regularization objective, as you can see on the graph here. Now, there can be a lot more complex regularization, and all the work of Microsoft on the MT-DNN models, that were topping the GLUE leaderboard for some time, is also about all these various regularizations you can do. So you can also do regularization where you try to limit the variation in the weights during fine-tuning. There are a lot of ways you can do that to regularize these models, but it's probably the way you should go. And then the last thing you can do is actually just to fine-tune a lot of these models and to ensemble them. So you can fine-tune them with multi-task as well, in which you actually try to increase the dataset size by gathering several tasks together, and then you train several models that you ensemble, and if it's too big at the end, because you have several models, you can just distill them back into a single model. Okay, which is nice, but that's very complex. And actually, when you look at the typical setup right now to get the SOTA on GLUE, that was sent to me the other day, it's pretty crazy. Look at this: you have to pre-train your model, so here, as we said, just use as much data as you can, as much compute as you can. Then you have to tune the fine-tuning hyperparameters a lot. Okay, you do task adaptation, so for some specific tasks of GLUE you will start by fine-tuning on MNLI, which is the biggest dataset of GLUE. Then you do some data augmentation: you increase that with unlabeled data, like we saw for the Winograd Schema
Challenge. You increase that with additional labeled data wherever you can use it. Okay, then you can use some tricks that are actually normally forbidden, but that everybody used to do: a pairwise ranking where you actually exchange information between examples, in QNLI and WNLI. And then you fine-tune as many models as you can, you take the five to ten best of all these models, you ensemble them, and you submit that as your result. So this is just crazy compute, and it's definitely overfitting a lot to GLUE, and it's a big problem. Why is it? Because of all this hyperparameter search. Now we know that actually if you take any kind of model, that was shown by the Gábor Melis paper for LSTMs last year, if you tune it well enough you can actually reach some very good performance, but you've used a huge compute budget to tune it. Okay, so this was actually formalized in this nice paper by Jesse Dodge, which is called "Show Your Work", which was an EMNLP paper last year, which says that we should not just report the end evaluation metrics, but we should report what happened during the hyperparameter search, because it gives information on how much compute you need to actually get this model to good performance. Okay, we talked about that for datasets: we said that if your model needs, like, a one-million-sample dataset to get good performance, it should be reported and we should know that, so we can also select the more efficient models. And here it's the same for hyperparameter search: if your model needs a crazy hyperparameter search to reach good performance, we should know about that. So the "Show Your Work" paper says that you should give these curves that show, during all your hyperparameter search, how your model was behaving and what was the best one. We see the same thing with standard splits. These datasets are split into standard train and test sets, which is nice to compare models, but it leads people
to overfit on these standard splits. Okay, so the specific heuristics that will work on the standard train split, we can overfit on them and almost encode them in the models, which is bad. So this Kyle Gorman and Steven Bedrick paper at ACL, "We Need to Talk about Standard Splits", is also a very interesting read, and they advocate for random splits. I don't know if random splits are the solution, maybe the solution would be to have several standard splits, but definitely just a single standard split is bad and people overfit. And we can see that because when you try to test these models on other splits, they can actually be kind of worse than what was expected. Okay, so this was about hyperparameter search. Okay, but now there is an underlying question as well on how our models are behaving, and we know that they are brittle and spurious, so let's just see a little bit what this means. Brittle means that if we change the input a little bit, we can see weird behavior. This was very visible in the Jia and Liang paper on SQuAD: they show that if you add, like, a random sentence at the end of the SQuAD context, the answers that the model predicts for the question are totally different. The model is just lost, because with this very small modification we've left the training domain. Okay, but our models are also spurious. Spurious means that they really do the least amount of work to get the best performance. So if there are easy heuristics that they can leverage, for instance, like we've talked about, lexical overlap, when you have a lot of overlap between the two sentences of MNLI, the model will very quickly learn that a lot of overlap means that the right classification should be entailment. But this is not what we would like: we would like a model to go to the semantic meaning of the sentence and not just to stay on the surface form, just to stay on the lexical overlap heuristics. So this is called spurious, and it's
also a sign that our models are very fragile, because in both these cases, when we leave the training dataset a little bit, the model gives wrong predictions and fails in unexpected ways. So how can we solve that as well? Well, one way would be first to get better ideas of how they behave, and I think here linguistics is very important. There is this nice talk by Ellie Pavlick, "Why should we care about linguistics?" Well, I think it's really the time for linguistics to come back to NLP, to replace some of the machine learning approaches and to help design better evaluation, okay, because we know that these linguistic rules are kind of the underlying rules we would like our models to learn. So they are probably not good to try to build a training set, but they're probably very good to build evaluation datasets, fair evaluation datasets. Okay, now what we really want to do is to try to provide good inductive biases, which means that our models will want to learn these rules: it will be easier for a model to learn these rules than to learn these heuristics. This is what is called an inductive bias; it means the model goes toward the solution that we want faster than toward a solution that we don't want. So one idea is all the work on compositionality that we showed. We know that compositionality is important for a model to get a good understanding of the meaning of a sentence, so if we can build models that are good on compositionality, that would be one way to incorporate some linguistics. So this is one possibility. So how can we do that? Well, we can tweak the architecture of the model. These models have these attention heads, so we can try to encode in these attention heads some of the graphs that we see in linguistics, like dependency graphs or, like, shallow SRL relations. And this was the nice work, for instance, from Emma Strubell, LISA, "Linguistically-Informed Self-Attention for Semantic Role Labeling", that was the best paper
at EMNLP two years ago, I think, where some of the attention heads were actually trained with additional augmented data, with semantic role labeling, okay, and the model reaches better performance because it was able to embrace these inductive biases. There is a lot of nice work as well on graph convolutional networks, for instance this work by Diego Marcheggiani, Joost Bastings and Ivan Titov, where they explore semantics in translation with graph neural networks encoding the dependency trees. Okay, so you can try to modify the architecture. A different way is to try to add this inductive bias in the training data. Here is one funny example, it's an anonymous paper on OpenReview, I think it's very nice. They try to help BERT by providing, after a training sample for BERT, a form of semantic role labeling. So they will say, okay, this word is a predicate, this word is the first argument. Okay, they add this semantic role labeling information after the training sample, so this just forms the input for the model, and then at test time you just don't use this, you don't input the augmented data, and you see that the model actually gets improved robustness by being trained with this information. Okay, they experimented on an adversarial SQuAD dataset and on the SWAG dataset. Okay, so now probably what we want to do for that is that we probably want to work on pre-training, to get some deeper linguistic information in our pre-training dataset as well. Like, for instance, we can add more linguistically informed span-level representations, or we can try to have more modular ways to pre-train our models. And this is all very open, but how we can incorporate this linguistic information in pre-training will probably be important as well. An alternative way to make these models robust is to add common sense, because usually they are lacking some linguistic stuff, but common sense as well; there are some limits. So why do we need common sense? Because we can't
learn everything from text. Let's have a look. You know that, for instance, sheep are white, okay, but in text usually we don't say that because it's too obvious, and on the contrary we often talk about black sheep, because that's a common expression. Okay, so when you ask a model that was trained on text only, when you ask it what color a sheep is, it will say, "oh, I'm not sure, it's probably black", okay, because it has no way to know that white is the real answer. Okay, this is called the reporting bias: it means that in text we usually don't state the obvious, we don't write down common sense. So how can we help the model learn that? Well, there are various ways. We can try to add a knowledge base: in the knowledge base we have this sheep node, and we connect it with a "color: white" node, so the model can learn that this is the color of a normal sheep. We can add multimodality, where you show a picture to the model with a white sheep, and the model can just look at it and say, "oh yeah, okay, it's white, I can see it on the picture". And then the last way is maybe you can just tell the model: we can have human-in-the-loop training, like the way humans are actually learning. The model can ask, "what color is the sheep?", and a human or somebody says, "yeah, it's white", and then the model knows it. So this is human-in-the-loop training. These are all the ways, but you can see that all of them involve something other than text: we need something more structured, like a database, or we need pictures, we need another modality for the model to be able to learn common sense. So Yejin Choi has been working a lot on common sense, she's done really, really great stuff. The first question is, what is common sense? It's this basic level of practical knowledge about everyday situations. There is a very blurry frontier here with bias, okay, because some part of this common sense is exactly the bias that we don't want in our models, but some part is just normal things, like
here: it's okay to keep the closet door open, but it's not okay to keep the fridge door open, because the food might go bad. Okay, so the way they tackle that is to build a lot of datasets, like a lot of datasets that try to gather some common sense, and I really want to highlight some of them. WinoGrande is really fascinating, it's an extension: now the Winograd Schema Challenge is solved, okay, we talked about that earlier, and now they built a successor to it, which is WinoGrande. They have all these commonsense datasets, Cosmos QA, a lot of visual commonsense datasets, they're all very interesting. And maybe if you want to read just three papers, you should really probably read the ATOMIC, COMET and WinoGrande papers. ATOMIC is this relational database that I was talking about, okay, that encodes common sense in a graph. COMET is a nice paper by Antoine Bosselut which is about using pre-trained transformers to actually augment this knowledge graph, and WinoGrande is the successor to the Winograd Schema Challenge. So let me show you just a little bit about WinoGrande, okay, because that's interesting. We are building all these datasets from crowdsourced information, and actually it's very hard to build a nice crowdsourcing setup, and I think they have a lot of experience on this topic, which is very interesting. So here in WinoGrande, the main problem is that, first, we want people to have ideas and not to generate from just the simplest ideas they have. So they use some way to enhance the creativity, with what they call random anchor words. So the people in the crowdsourcing, the Turkers, they will have this random word that they are supposed to use in the example they are writing, and this helps a lot to make them creatively write new examples. And they also use a very strong data validation setup, where they fine-tune state-of-the-art models on the data and try to
remove data that's too easy to predict. Okay, so this could be like an online data creation process, where each time you have a better state-of-the-art model you actually filter new examples, okay, to remove them. And they show that the state-of-the-art models right now have very bad performance on this, well, pretty bad anyway; you would actually need a lot of samples to get to human performance. So that's probably the way forward in creating crowdsourced datasets, like trying to get more diversity and trying to filter these datasets better. Okay, and this leads us to the same idea for models. Okay, the datasets should now be evolving, they should be dynamically updated as new models come in, and the models actually have the same problem, which is called the continual learning question. What is it? The simple description is that BERT was trained on, like, 2018 data, so it will always think that the president of the United States is the 2018 president, okay, but maybe that will change in the future, maybe the president, maybe the countries will change, and the model will just never learn this. Okay, and how do we overcome that? Well, we need models that are able to evolve over time. Okay, but then there is the main problem with this approach. A lot of people have been working on continual learning; the main problem is called catastrophic forgetting, which means that you want to learn new stuff without forgetting everything that you've learned before. And here you have different approaches to try to tackle this, from memory to regularization to dynamically growing models, and this is probably the way NLP should go forward, to have these models that are able to adapt and that are able to generalize to other domains as well. Okay, so that's the end for today. That was a very long talk, I hope you really liked it, and the plan for our NLP series is to do a lot smaller talks in the future, with more basic information. So if you like this you
can say it, and you can send us feedback, and we'll continue with the NLP series as well.
