An Introduction to Transfer Learning in NLP and HuggingFace

Captions
Okay, I think we can get started. Hi everyone, welcome to Bay Area NLP. I'm Meghana Ravikumar, a machine learning engineer at SigOpt, your host for today, and the speaker for the upcoming session on November 12th. Today our speaker is Thomas Wolf. Thomas is co-founder and Chief Science Officer at Hugging Face, where his team is on a mission to catalyze and democratize NLP research. Prior to Hugging Face, Thomas gained a PhD in physics and later a law degree, and he has worked as a physics researcher and as a European patent attorney. Today he'll be talking to us about an introduction to transfer learning in NLP and Hugging Face. Over to you, Thomas.

Thanks a lot, happy to be here. Let's see if I can share my slides... do you still see the slide now? Yes? Perfect. So today the topic of the talk is my favorite topic of discussion, which is transfer learning and the tools we build at Hugging Face. I'll start with a brief introduction to transfer learning and talk a little bit about the limits, the problems, and the trends; the second part of the talk is about Hugging Face and our libraries, and I'll finish with some hands-on material, walking you through the best resources for using our tools. Feel free to ask questions; I'll stop after each part and answer what's in the chat. In particular, if you can't stay for the whole talk, don't hesitate to ask your question even if it's not related to what I'm currently talking about.

Let's start with transfer learning. What is transfer learning? I guess a lot of you know a little bit about it, but it's good to start from the basics. The traditional way we trained machine learning and statistical models for a long time was to gather a separate dataset for each task we were faced with, and each time to initialize a model from scratch and train it on the data gathered for that task. So if you were faced with three different tasks in a row, you had three different datasets and three different models, each initialized from random initial conditions.

Now, that's not how humans learn. Humans don't start from scratch every time we face a new task; we reuse everything we've learned during our studies and all the challenges we've tackled in life, what we usually call knowledge, to bootstrap the way we tackle new tasks. This gives us two main advantages. The first is that we can learn a task from just a handful of examples, because we complete those examples with all the knowledge we already have. You can give most humans a single example and they understand what you want; if you had to give someone twenty examples, you'd think they were a very slow learner. But machine learning models trained from scratch need thousands of examples. The second advantage is that we can reach better performance, because we can connect the dots between the data points and understand the actual underlying task, and solve that instead of some surface-form correlation between the data points.
Transfer learning is one way to try to do that with machine learning models: to start from a knowledge base, so to speak. Last year we gave a very long tutorial at NAACL, three hours and about 300 slides, with a lot of hands-on exercises, so it's a good resource if you want more information on what I'm going to cover in this introduction. Alongside it there's a very small and simple code base for training a transfer learning model in NLP that worked quite well; we should actually integrate it into transformers one day, because it's a lot smaller than the code base we have, about a hundred lines for the model and a hundred lines for the training, and it reached state of the art.

There are many ways to do transfer learning, but the one I'm going to talk about today is called sequential transfer learning, because there is a sequence of tasks. A sequence means at least two steps: the first is usually called the pre-training step and the second is called the adaptation step. (I can see questions coming in already; I'll put them aside and answer at the end of this part.)

The first step is pre-training. During this step we gather as much data as we can, so it's a very computationally intensive step, but the idea is that we're free to use as much computation as we want, because we only do this step once. We end up with what we call a general-purpose model, a model which is not specially designed for one task but designed to be a knowledge base to start from when training on a specific task. You probably know most of these models. The early ones were the word embeddings, the first time we trained general representations on huge corpora; GloVe, for instance, was trained on Common Crawl, a really huge amount of data that was probably only matched by GPT-3 recently. More recently we've started to pre-train full models: GPT and BERT are full neural nets, not just embeddings, used as the starting point for the second step.

The second step is called adaptation, or fine-tuning. In this step the idea is that you use a smaller dataset, one specifically designed for your target task. There are many, many tasks in NLP; I'll show just a few and, if you're not a specialist, give a quick explanation. Text classification: say you have a lot of tweets and have to classify them into two categories, positive and negative emotion. Word labeling: you have a paragraph and have to label the names of companies, the names of actors, and so on, meaning each word or token gets a label. Question answering: your knowledge base can be, for instance, a paragraph from Wikipedia, and your task is to underline the answer to a given question.
So in question answering you usually have two inputs. You can also have abstractive question answering, where instead of labeling the answer you generate it from scratch. And then there are a lot of other tasks; the most common are translation, which is what Google Translate does, summarization, and so on.

Let's go quickly over these two steps, starting with pre-training. In pre-training we've seen the rise of what we call the language modeling objective. What is it? The language modeling objective means that we train our general-purpose model to predict text itself: we try to maximize the probability of the tokens in the text. What's nice is that we can decompose this probability, for instance as a product of probabilities of sentences, or as a product of probabilities of tokens, roughly P(text) = Π_i P(token_i | token_1 … token_{i-1}), so there are many ways to rephrase it. The main advantage is that the text is its own label. We call this self-supervised: we don't have to annotate the text, which is super useful because we can use basically any text we want, and in many languages there is enough text on the internet to train a high-capacity model without overfitting. This really matters for what we consider low-resource languages. If we had to annotate in every language, there would be a deep difference between languages where annotators are easy and cheap to find and languages where annotation is hard or expensive; but if nothing needs annotation, we can use everything humans generate on the internet. The last nice thing, as I said, is that it's a very versatile objective: we can decompose the probability as a product over tokens, or approximate it with a denoising objective where we corrupt the text and try to reconstruct it, which is roughly an approximation of maximizing the probability of the text. This lets us explore a diversity of possible objective functions, better or worse, and people are still exploring them.

That's the objective; now let's talk a little about the models. The main models in large-scale transfer learning today, large-scale in the sense that they are pre-trained on corpora as big as we can manage, use the Transformer architecture, for the reason that Transformers are computationally efficient: they make a lot of the computation parallel, which in practice is more efficient than LSTMs. Let me show you how the architecture works, with the example of how BERT is trained. We take a sentence from the corpus, here a typical internet-language sentence, "my dog is a good boy," which you're very likely to meet somewhere on the internet, and we mask one token; here we mask the word "is," and the goal of the model is to predict it. This is the denoising approximation of the probability I just mentioned.
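[Editor's note: a minimal sketch, not from the talk, of this masked-token prediction using the transformers pipeline API; the bert-base-uncased checkpoint name is the standard one and is assumed here.]

```python
# Predict a masked token with a pre-trained BERT, via the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# [MASK] is the mask token for this checkpoint.
for prediction in fill_mask("My dog [MASK] a good boy."):
    print(prediction["token_str"], round(prediction["score"], 3))
```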
Because our models cannot process strings, the first step is to convert the words into integers, word indices in our vocabulary, and we have an input embedding matrix, a mapping from each word to a vector. This is just like word embeddings: each token is associated with a vector, which gives the first hidden state. As you can see, this vector is not dependent on the context: the vector for "dog" here is independent of the fact that there's "my" before and "a good boy" after. But we would like to make them context-dependent, so that the vector at the masked position can become a vector able to predict the masked token. To make them dependent we use the attention mechanism, which basically just replaces each vector by a weighted average of the vectors in its left and right context. It's very simple; in my opinion the math of the attention mechanism is a lot simpler to explain than the math of an LSTM. After that we have a little non-linearity, because if everything is linear we just fall back to a linear model, and we repeat this several times: 12 times in BERT, around a hundred times in GPT-3. We end up with final hidden states, usually with the same dimensionality as the input states, and these are context-dependent because of all the weighted averaging. We can then project them back onto the vocabulary, actually using the transpose of the input embedding matrix, and use the masked token as the label. That's how BERT is trained.

There's another way to train these models, called causal, or autoregressive, which is probably the more beautiful term. It's a little different: in each hyper-column you start with one word and end up with the next word, so you try to guess the next token at each step. Obviously you have to change the attention a little: here it can only pay attention to the left context, otherwise the task is just too easy for the model, because the answer is sitting in the right context; if the attention could look right, the model would simply see "is" there and predict it. These models are inherently less powerful in one sense, because they cannot use the right context to guess anything; to guess a word they cannot use the end of the sentence. So they are weaker for some NLP tasks where you also need the right context, for instance NER. But the other side is that they get a lot more training signal, because there's a signal for every token, whereas with masking we can only mask a few tokens, since we need enough context left to guess the missing ones. Typically you can train a GPT-style model two to four times faster than a BERT-style model. So these are the two main architectures. They're very similar; there are tricks you can change inside, a layer norm that can move up or down, but mostly they look really the same.
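[Editor's note: a minimal sketch, not from the talk, of the autoregressive objective in practice; the gpt2 checkpoint and generation parameters are illustrative assumptions.]

```python
# GPT-2 generates a continuation one token at a time, left context only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("My dog is", max_length=15, num_return_sequences=1)
print(result[0]["generated_text"])
```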
Now let's talk about adaptation. Adaptation is the second step, and this is where we hope to realize the two advantages I mentioned at the beginning: data efficiency, because we leverage the general-purpose model and hope to need only a few labeled examples, and higher performance. How do we adapt? Remember, the general-purpose model is not specifically designed for the task, so we need to adapt it a little. It's quite simple: usually we remove the pre-training head, in our case the pink boxes that projected the vectors back down to the vocabulary, and replace it with a head specifically designed for our task. In pretty much every case this task-specific head is trained from scratch on the target task. We want to keep it simple if possible, because we have only a few examples and don't want too many parameters; but if the target task comes with a lot of training data, the head can be something very complex, even a full LSTM on top, if we have enough data not to overfit. In some cases it has to be complex because the task is structurally different. Say we want to use our model for summarization: there we have two inputs, the text we want to summarize and the beginning of the summary we've already generated, but we pre-trained the model with just one input, a single sentence. One simple way to adapt is to duplicate the model, which is what this little diagram shows: one copy digests the news article, the other digests the beginning of the summary, and you add cross-attention between them. There are other ways, which I'll show in a minute.

To make this more concrete, I'll show two examples of very different tasks. The first is text classification, to walk through the full pipeline. It's a bit of a toy task: we have an input sentence, "Jim Henson was a puppeteer" (the sentence should fit in this blue box; the scaling is a bit off), and we have to predict whether the sentence is true or false. The first step is tokenization: we have to convert this string into something the model can digest, which means integers. These models, as I told you, are made to be trained on open-domain data, on Common Crawl, on anything, so we need to handle complex and rare words without having to learn every word. The way we do that is: if a word is too complex or too rare, like "puppeteer," which is really not a common word, we split it into prefixes and suffixes until we know all the sub-parts; this can go down to the letter level if the word is really strange, or just made up. Here, for instance, "puppet" is common and the suffix is common, so we split it like that.
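[Editor's note: a minimal sketch, not from the talk, of this subword splitting with the standard bert-base-uncased WordPiece vocabulary.]

```python
# Rare words are split into known subword pieces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Jim Henson was a puppeteer"))
# ['jim', 'henson', 'was', 'a', 'puppet', '##eer']
# tokenizer.convert_tokens_to_ids(...) then maps the pieces to integers.
```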
Now all the tokens are in our vocabulary. Vocabularies are typically between twenty thousand and fifty thousand tokens; for a multilingual model it can go higher, to two hundred thousand for instance. Once all the tokens are in the vocabulary, we convert them into integers and use that as input to the model we saw earlier, our pre-trained BERT or GPT, and as you remember we end up with a vector for each input token. Now we need to go from these vectors to our two classes. So we pool the output and project it down to two classes: we add a classifier head, the adaptation head, which is trained only on our target task. It can be very simple. For the pooling we can take the average or the max, or add a specific classification token trained to hold a good representation of the sequence. Once we have a single vector, we project it with a linear matrix: the output dimension for BERT is 768, and we're projecting to two classes, so that's only about fifteen hundred parameters, something we can actually learn without overfitting on a small dataset.

Here's an example. The task is called TREC-6; it's a six-class classification task, as the name indicates, and it's an example we did in the tutorial last summer, so if you want to play with the code, look at the link I showed on the first slide. You can see two things here. The first is that after just one epoch we're already over 90 percent accuracy. This is what we call data efficiency: this training set is really small, about 2,500 examples, which is really something you could label yourself with a small team; with eight people doing one or two hundred labels each, you're good.
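[Editor's note: a toy sketch, not the tutorial's exact code, of the pipeline just described: pre-trained encoder, simple pooling, and a small linear head trained from scratch. The average pooling and head here are illustrative choices.]

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = torch.nn.Linear(encoder.config.hidden_size, 2)  # 768 -> 2 classes

inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
pooled = hidden.mean(dim=1)                   # simple average pooling
logits = head(pooled)                         # unnormalized class scores
```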
So this is very data efficient. The second thing: after just three epochs on this small dataset we reach an error rate of 3.6 percent, which last year was the state of the art on this dataset. So we also reach very good performance. These are the two promises, and they work.

Now a second, very different example, to show the wide range of tasks you can tackle with this: a dialogue generation task. Here we have several inputs. We have a persona, a kind of small knowledge base for our chatbot; in this case the chatbot is supposed to be an artist with four children who loves watching Game of Thrones. We have a dialogue history exchanged with a user: the user said "hi," the bot answered "hello, how are you today," and now we have a new utterance from the user, "i'm good thank you, how are you." (And to the question in the chat: yes, I will share the slides.) The task of the model is to generate the answer, and you can see the gold answer from a human, which is pretty nice: "great thanks, my children and i were just about to watch game of thrones." This is a nice dataset by Facebook, by FAIR New York, called Persona-Chat. If you want to tackle it, you see there are a lot of different types of inputs: a small knowledge base, a dialogue history, the last user utterance, and also the beginning of the reply if you generate it word by word. How do you handle all of this? There are many ways; we had a paper at ACL last year about it. The two main ways: you can concatenate everything into one single input, which works surprisingly well, just separating the different types of input with separator tokens so the model can figure it out by itself (there's a small sketch of this concatenation just below); or you can use a multi-input model, duplicating the model as I showed you earlier, with one copy encoding some inputs and the other copy encoding the rest, connected together. The result is very good performance. This was a competition we participated in, two years ago now (it would be at NeurIPS now), and we got state-of-the-art results; I think it was one of the first times people used transfer learning for dialogue generation.

Before I finish this first part and take questions, let me cover the trends and limits. The first trend, which you've very likely seen, is that the models are getting bigger and bigger. When BERT came out, about two years ago now, it was presented as the first 340-million-parameter model; the paper talks about scaling to huge models, about reaching the limits of the largest possible model. Now that looks really small, because we have models three orders of magnitude bigger or even more. Very quickly we had this race to larger models: for some time T5 was the biggest at eleven billion parameters, then there was GPT-3, and around the same time the Google GShard model with 600 billion; in that paper they even trained a trillion-parameter model.
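[Editor's note: going back to the dialogue example, a minimal sketch of the single-sequence concatenation approach. This is not the paper's exact setup; the checkpoint, the choice of EOS as separator, and the strings are all assumptions for illustration.]

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
SEP = tokenizer.eos_token  # GPT-2 has no dedicated [SEP]; reuse EOS

persona = "i am an artist. i have four children. i love game of thrones."
history = ["hi", "hello, how are you today?", "i'm good thank you, how are you?"]

# Everything becomes one flat sequence the model learns to disentangle.
flat = SEP.join([persona] + history)
input_ids = tokenizer(flat, return_tensors="pt").input_ids
```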
And at the other end we now have small distilled models like DistilBERT that we're trying to push.

This race is a big problem for several reasons. One reason you can see here: this is the leaderboard of the GLUE benchmark, and basically all the top spots belong to the same handful of teams, Google, Baidu, Microsoft, Facebook. There are just no academics there anymore, because it's too costly; academia cannot spend ten million dollars training a model. There's also a huge environmental cost: training these models, even just once, is very expensive. And, more important in my opinion, it's very expensive to use them in production, which is actually far more compute-intensive than training, because you deploy the model on thousands of servers for years, while the training is just a few days on a few thousand TPUs. And the last question is whether we're actually reaching toward anything with this scaling: is scaling the models a research program? I think François Chollet put it quite well when he said that training ever bigger networks gets us closer to AGI in the same sense that building taller towers gets us closer to the moon.

There's another trend we've been trying to push at Hugging Face, which is reducing the size of these models. We can reduce them during training, but also, as I mentioned, after training, to be able to run inference efficiently: instead of deploying your 175-billion-parameter model, you deploy a smaller model that was smartly designed to take advantage of everything the big model has learned. There are several techniques. One is distillation: you train a smaller model to mimic the big model, and you can get pretty much the same performance as BERT with a model about two times smaller. You can also do pruning; in our team Victor Sanh has been leading this size-reduction effort, he did DistilBERT and more recently movement pruning, a smart way to do pruning in transfer learning that works a lot better in the extreme-sparsity regime. When you do pruning you want efficient sparsity, efficient GPU kernels, and we have a recent library on this, which we can talk about in the Q&A. And the last technique is quantization: you quantize the weights, moving from floats to integers.
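[Editor's note: a minimal sketch, not from the talk, of post-training dynamic quantization in PyTorch; the checkpoint name is illustrative.]

```python
# Store the Linear layers' float32 weights as int8: a smaller model
# and faster CPU inference, with no retraining required.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```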
Now there are two limits of these models that I want to highlight. The first is brittleness: they are very sensitive to what we'd call out-of-domain drift. If you train them on one type of Wikipedia paragraph and then add distractor sentences that look a lot like Wikipedia but shift the statistics a little, the model is very easily lost. Here, for instance, there's a sentence that looks a lot like the answer but has the wrong number; it doesn't actually match, yet the model falls right into the easy trap. That's brittleness: change the input a little, drift a little from the training distribution, and you lose the model.

Spuriousness is something slightly different. It's the idea that these models very quickly grab the easiest way to find the answer, so if there's a little bias in your training dataset, the model will grab it. There are two good examples in MNLI. If you don't know MNLI, it's a task where you have two sentences and have to say whether the first entails the second, whether they contradict each other, or whether they're neutral, unrelated. The first example is lexical overlap: in this dataset, when the premise and the hypothesis (those are the names of the two sentences) share a lot of words, the pair is usually an entailment. So the model learns the rule that a lot of words in common means entailment, which has no relation to the underlying meaning. You can very easily design adversarial examples, and there's a nice paper by Tom McCoy with a dataset called HANS, where pairs with a lot of overlap are contradictions. For instance: "the doctor was paid by the actor" versus "the doctor paid the actor." They mean the opposite, the person paying is swapped, but because the two sentences share a lot of words, the model falls into the easy trap and says entailment. You get the same with negation: in MNLI, negation is very often associated with contradiction, so it's easy to build examples pairing a negation with an entailment that the model gets wrong. That's spuriousness: these deep learning models are very lazy, and where they can find an easy way to get good accuracy, they definitely will.

Taking a step back, there are more fundamental limits to language modeling, and maybe even to NLP, namely that a lot of information is missing from text. We often call this reporting bias; there's a nice paper on it by Gordon and Van Durme, and it's also something shown by Mitchell. The idea is that we don't write down the obvious common sense. I usually take the example of the color of sheep: we rarely write that sheep are white, but we talk a lot about black sheep, so when you ask a language model like GPT-2 what color a sheep is, it may very well answer black, or pink, but not white, because it's very rare to read that there were a lot of white sheep in the field; nobody writes that, it's too obviously common sense. There are lots of things like this: if you enter a room with windows in the walls, you won't mention it, but you will mention entering a room with no windows, because that's strange. We only state the non-obvious, and our models are brittle on this. If you want to address it, you actually need to leave the pure text domain.
For instance, you can add a database that connects sheep with the color white; you can add multimodality, an image or a film where the model can see that sheep are white; or you can add a human in the loop, which is how humans are trained, someone who says no, this is wrong, that's not the color of a sheep. But you have to leave pure text.

The last problem, one I personally like a lot, is continual learning. We train our models and then they are static, but the world keeps evolving. BERT, GPT, even GPT-3 have no idea what COVID is, and it's very hard to teach it to them, even though today it defines much of how we interact with the world. It's a big problem, and it's very hard to just add a piece of knowledge to one of these models, because of what is called catastrophic forgetting: when you add something new with a little additional training, you run a huge risk of the model forgetting many other things while it focuses on the new data. This is not solved today.

Okay, I'll take a little break and answer a few questions before continuing. Let me make this chat box bigger... there are a lot of questions already.

"Human-in-the-Loop"? Yes, it's a great book, definitely.

Amrit is asking: with larger models, how can we use them in real-time scenarios where latency is critical? Yes, you generally have to use smaller models. One thing about Transformers, though, is that the computational cost grows with the input length, normally quadratically (there are a few recent models where it grows linearly). So if your inputs are short, you can actually use very big models quite easily. We used to run a BERT model in production for our chatbot, just because user messages were very short, so it was quite efficient. Benchmark it on your use case; sometimes a very large model is efficient enough. If not, you need a smaller model, and there are also many low-level ways to make these models faster: ONNX compilation, operator fusion; we have a lot of experience with that at Hugging Face. But the simplest way, if you don't want to dive into low-level work, is to use a distilled model, a smaller approximation of a large model.

"Regression-type tasks..." — oh, that's not a question. I'll scan for the questions.

"In general, when fine-tuning a model, is it better to train the full model or to freeze the pre-trained weights?" That's a common question: should you freeze or train everything? In computer vision freezing is very common; in NLP it's very uncommon. There's a nice paper by Matthew Peters and Sebastian Ruder that investigated this; I'll post it later. The takeaway is basically: if your target task is very different from your pre-training task, you might want to freeze to avoid drifting, but in most other cases you want to train everything; you just get better performance. If you have computational limits and can't afford to fine-tune the whole model, sometimes you freeze because full training is too expensive, but if you can, I'd rather advise gradient checkpointing or, even easier, gradient accumulation instead of freezing.
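[Editor's note: a minimal sketch, not from the talk, of the two options just discussed; the checkpoint name and the commented training loop are illustrative assumptions.]

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Option A: freeze the pre-trained encoder, train only the new head
# (uncommon in NLP, per the answer above).
for param in model.base_model.parameters():
    param.requires_grad = False

# Option B: fine-tune everything, accumulating gradients over several
# small batches to simulate a larger batch when memory is tight:
#   loss = model(**batch).loss / accumulation_steps
#   loss.backward()
#   if (step + 1) % accumulation_steps == 0:
#       optimizer.step(); optimizer.zero_grad()
```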
Okay, I'll go to the last question for now, on using BERT for topic modeling. That's a good question. My point of view: BERT is very bad for topic modeling; you should not use it for that. These models are not very good as pure feature extractors; you have to fine-tune them on a specific task. A lot of people want to use them without fine-tuning, for sentence similarity or for clustering inputs, and it usually doesn't work well.

"Is the 512 limit per training document or per sentence?" The 512-token limit for BERT is per input, so if you want a longer input you either have to slide a window across your data, or use one of the models that can handle bigger inputs. There are a few in the transformers library; this is a very common question, so we should add some high-level pointers. The Longformer, for example, can handle long inputs.

"Do you recommend gradient accumulation?" Yes, I do.

"Do you recommend adding words to the vocabulary?" Yes, I recommend adding words if you have enough data to train them well. If you don't have enough data, it's not a good idea. That's the deep learning world.

Okay, let's continue a little (we're running quite late) and then I'll take questions again: Hugging Face and the libraries. At Hugging Face, what interests us in current AI research is this idea of creativity, of generating things, of interacting. We really started with a conversational AI chatbot, four years ago now, something designed to interact with humans, to create new stuff, to create interesting interactions. The product was actually quite widely used, with a few million users, and we were able to train our own BERT-like models on the messages. While we were developing this product we open-sourced a few tools, and these tools caught so much attention in the community that very soon we were working only on the open-source part. So we decided that instead of doing our own stuff internally and sharing pieces of it, we would do everything externally and do everything we can to catalyze and democratize NLP research, to help researchers out there do better work. The consequence is that the whole field advances a lot faster than what we could achieve alone in our own little laboratory.
We're interested in natural language understanding, in classification, but also, as I said, in all the possibilities of generation; I personally think the artists working with AI are doing some of the most interesting things that exist today. The way we try to democratize and catalyze NLP work is through two things. The first is knowledge sharing, which is why I'm here talking with you today: we do a lot of tutorials, workshops, and some summits (this one was unfortunately cancelled because of COVID). The second, which you probably know better, is our open-source libraries. We try to build them in the way we think is most efficient, finding the right balance between research code, which is a snapshot of some research that you usually have to rewrite entirely to adapt to your use case, and fully productionized one-liner code, which is nice but not flexible enough. Finding that balance is not always easy, but we do what we can. We also like to break barriers: code from researchers should be usable by real-life practitioners, and you should be able to use your favorite framework. It shouldn't matter that a model was trained in TensorFlow; you should be able to use it in PyTorch if that's your framework of choice, and the reverse as well. And we like to make these models very fast to run.

We have three main libraries today. The first and biggest is called transformers. This is the library where we make it easy to access pre-trained models, the general-purpose models I've been talking about: very easy to spin them up and run them, and also very easy to modify them and understand what's happening inside. We don't want them to be black boxes; we want them to be open boxes. That's the way code and model sharing should be done: everything open so you can investigate, which is better for everyone. We want them to be state of the art; the library was really designed for researchers at first, and now a lot of practitioners use it in applications, but the idea is that you get research-level performance, in both TensorFlow and PyTorch. We have a lot of models now; this slide is already out of date, it's from two weeks ago, but you have all the models you've heard about: the BERT and GPT families (except GPT-3, which is closed source), DistilBERT from our company, CamemBERT, a French model, a lot of multilingual models; XLM-RoBERTa, by the team of Alexis Conneau at FAIR, is really great. We have a few multimodal models, MMBT and LXMERT, models that can ingest images and text.
We have dialogue models, and efficient translation models: the MarianNMT models trained by Helsinki-NLP, a really amazing team doing great work on machine translation, better than Google Translate on many language pairs actually. So many models, and it's very easy to use them; I'll show you more later, and you can check them out, it's all open source. We also have the model hub, a place where people can upload models they've trained, and the nice thing is that other people can then use them in one line. We have models in many, many languages; obviously English is over-represented, but there are a lot of other languages. You can test the models running live on each model page, play with them, and probe the biases they may have; you can show, for instance, that they're all strongly gender-biased, which is bad. So look at the model hub on huggingface.co; there are about three thousand models, a bit more now.

The second library: while we were developing transformers, we saw that tokenization was very often a bottleneck, because it involves a lot of for-loops in Python, and Python and for-loops are not friends. So Anthony on our team decided to develop the tokenizers library in Rust, designed to be super fast but also very versatile. It's not the fastest out there, but it's definitely fast enough, processing about one gigabyte in 20 seconds, which is good for most purposes, and it's very flexible: you have BPE, byte-level BPE, WordPiece, SentencePiece, all these algorithms in it. It can also show you alignments: if you want to know where a token comes from in your original string, and vice versa, you have all this high-level access, super fast. It's a really nice library, with bindings in Python and Node.js, and it's in Rust, which is really cool if you know Rust.
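[Editor's note: a minimal sketch, not from the talk, of the token/character alignment feature just mentioned; Tokenizer.from_pretrained assumes a recent tokenizers release, and the checkpoint name is illustrative.]

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode("My dog is a good boy")
print(encoding.tokens)   # the subword tokens (plus special tokens)
print(encoding.offsets)  # (start, end) character span of each token
```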
And just before the summer we noticed something still annoying in the way we were using our models: access to datasets and metrics, which in my view are the two ends of the pipeline. At the beginning you want easy access to data; it's all open, but it's just hard to get, and you always spend time redoing the same preprocessing to load SQuAD or datasets like that. So we decided to make this very easy, the same way we made the models easy to access. The same goes for metrics: there are now quite complex metrics for evaluating natural language generation, smart metrics based on BERT like BLEURT and BERTScore, and even BLEU and ROUGE are always a bit complex to use, so we decided they should be one-liners too, something that works out of the box and is easy to inspect. So we released the datasets library, with datasets and metrics. Originally it was called nlp, but people started adding vision and multimodal datasets, so now it's just called datasets. The slide is a bit out of date; there are more than 160 datasets and metrics now. As always with what we build, it works with all the main frameworks: NumPy, pandas, PyTorch, TensorFlow 2. It's made to be really fast and very memory-efficient. One reason we developed it was to process huge corpora efficiently: if you want to train your model on Wikipedia in any language, or on Common Crawl, it's quite hard right now because you have to put everything in memory, or work around it. The datasets library makes this very efficient: you can load Wikipedia using just nine megabytes of RAM, and iterate over it at pretty much full speed, because we use a zero-copy serialization format called Apache Arrow, which is really efficient, and there's smart caching: once you've processed something, it's cached by default, so you don't spend time reprocessing it.

This slide is the full script to prepare a PyTorch data loader to train on GLUE MRPC (from the time the library was still called nlp). You load the dataset in one line, you load the associated metric if you want, you load the tokenizer from transformers, you tokenize the two sentences of MRPC with a simple function and map it over the dataset, you set the format to keep only the columns your model needs, and you're done: about twenty lines and you're ready to train a model, with everything visible. There's an associated hub, the same hub as for models actually, where you can see the description and the citation, and we want to add a lot more information there; we have a PhD student doing really nice work on dataset sheets, and we'd like to push for dataset and model cards generally. There's also a visualization tool made by Sasha Rush, an explorer on the web: if you click "browse" you can browse the dataset directly, at the same place.

Usually at this point I'd show a few resources, but I can see the conversation going, so let me check the questions. How are we on time? Okay, good.

"Is it only for text, or can it manipulate images?" We have no image support right now; we have some multimodal models that use image and text, but images alone are something we should maybe try at some point.

"Off topic: do you plan to support multilingual long-document transformers?" Yes, I think we should. There are a lot of questions about the Longformer, and I think we should probably try to make it more widely usable in any type of model.
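[Editor's note: the MRPC preparation script described a moment ago, reconstructed as a sketch with the current datasets API rather than the slide's exact code; the checkpoint name is illustrative.]

```python
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

dataset = load_dataset("glue", "mrpc", split="train")
metric = load_metric("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # MRPC pairs two sentences per example.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

dataset = dataset.map(tokenize, batched=True)
dataset.set_format(type="torch",
                   columns=["input_ids", "attention_mask", "label"])
# dataset now yields torch tensors and can feed a DataLoader directly.
```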
"A little off topic, but could you expand on the connection between quantum mechanics and ML, and the quantum theory of language?" I don't really know; for me these are two different parts of my life. I used to do quantum mechanics, then I did law, I was a patent attorney, and at the end I was doing patent law for startups applying deep learning models in real life, and I saw that the math was much the same... but more the same as statistical mechanics: I was doing quantum statistical mechanics, and statistical mechanics and machine learning are really, really close. The quantum part, I don't really know.

"Does datasets support custom data loading?" Yes, it does; you can just load a bunch of CSV or text files. Let me reduce the size of my window and pass a few links in the chat... and let me actually show you.

"When will Big Bird be added?" We usually add models when there are open-source pre-trained weights; I'm not sure about Big Bird, I think right now there are none. But I wanted to show you something on the datasets library: we now have very nice documentation (I spent some painful nights making it), where you have "loading a dataset," and you can load from local files, CSV, JSON, and text files, with all the same facilities. It's really made to load big datasets: if you have one terabyte of text files, it's possible to load it.

"Leaderboards?" You have that on Papers with Code, definitely.

"Any model you recommend for sentiment classification?" We want to add to the model hub, very soon I think, some way to compare models against each other, to see which are best, because people are understandably a bit lost now that there are 3,000 models. That's in the pipeline; we'll try to do it as soon as we can. For sentiment classification, let me show you what you can do with the hub. If you go on huggingface.co, there's the model hub and the dataset hub. In the model hub you can filter the models for text classification; if you want classification in Italian, for instance, there are a few. Here's one by NLP Town, a nice Belgian team; this one is multilingual. Let's try it in English: "I like you. I love you." And let's try "I hate you," or "I don't like you," which is quite strong... that was just a cache issue, so we'll wait a little while it loads. You can do this kind of thing, look at the model specification, compare with a RoBERTa-large; that one is classification into contradiction/neutral/entailment, so it's not sentiment at all. But you can play with the models like this, look at the documentation, and browse the list of languages I showed earlier. If you want Japanese, for instance, you can see all the models for Japanese; most of them are translation models, but there is, for instance, this BERT-based Japanese sentiment model.
These pages have model cards, uploaded by the people who trained the model, and they can be more or less complete.

"On the leaderboard, which sizes do we compare?" Size is actually the first comparison we want to add: being able to easily see the various sizes, because some of these models are obviously a lot smaller than others. The DistilBERT family, for instance, is smaller than the BERTs, but it's hard to know at a glance; you have to read everything. We'll add comparisons, definitely.

"Example code for using a downloaded BERT model instead of the web API?" To show what the web API is: here you can play with DistilBERT and use the API endpoint; you have to log in, and to use the hosted API you subscribe. But if you want to run a model yourself for free, just go on the GitHub, into the examples folder, where you have examples for all the various tasks. If you're doing text classification, you go into text-classification and there's a bunch of scripts you can use, depending on whether you want TensorFlow or PyTorch; there it's open, you decide what you want to use.

"A tool for interpretability?" Yes, we have something called exBERT, which is nice. You can filter models by tag, and there's an exBERT tag; on those models you can click "visualize in exBERT" and see how the model makes its predictions. In transformers, if you play with the library, you know you can extract the model's attention at the end, and exBERT is a way to visualize it: you select a model like bert-base-cased and you can see across the layers (well, it's a bit hard to read, but still) how the predictions are made, and you can mask a token by double-clicking on it. This was made by a team from Harvard and MIT, Hendrik and Sebastian, and it's a nice way to investigate what's happening; you can also do it yourself from the model outputs.

"Is there a model that can help identify the intent of a sentence?" For this you have to fine-tune a model on an intent classification task. The easiest way is to take the text-classification example and replace the dataset with an intent classification dataset you have.

(By the way, please try to put your full question in one message; otherwise it gets interleaved and I get a bit lost.)
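[Editor's note: for the earlier question about running a downloaded model locally instead of through the web API, a minimal sketch; the sentiment checkpoint named here is a standard one from the hub, chosen for illustration.]

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("I like you. I love you."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```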
"AllenNLP and their visualizations: will this be integrated?" AllenNLP won't be integrated into this, because they won't be integrated into us: we're already integrated into them, so that would be a loop that wouldn't work. Their visualization is, I think, similar to the exBERT tool I showed you.

"A custom NER pipeline?" NER is something I want to improve; we have examples, called token-classification, that I want to make a little easier, but you can already play with them and adapt them to your task.

"What is needed for you to include GPT-3?" A lot of people ask if we should train one. I don't think we want to do that, partly because we don't want to be the only provider; we're more about sharing and making things easy to share. Maybe we could try to train it, but I also think that quite soon other people will train one: when GPT-2 came out, just a few months later T5, Turing, Megatron, all these things were open. So I think it won't take very long for another model like this to come out; if nobody does one, we might train one at some point. To be honest, I'm not super excited about this model; I find it a little bit boring.

"If you want the embedding of a domain-specific word, is the only way to expand the vocabulary and redo the pre-training?" If you want to add domain-specific words, you usually do it when fine-tuning. I showed you the two steps of transfer learning, but you can definitely do it in three steps: you start from a pre-trained representation, and before the adaptation you do another step of pre-training, on your target-domain data, so the model learns your vocabulary and how your domain works; then you take this twice-pre-trained model and fine-tune it on the target task. This is very well explained in the ULMFiT paper by Sebastian Ruder and Jeremy Howard, and it's useful if your vocabulary is quite specific, like medical-domain vocabulary.

"How do you make money?" [Meghana] Sorry Thomas, I'm just going to hop in; maybe the last couple of questions. [Thomas] Right. One good thing to end on, if you want to ask more questions: we have a forum, which is a good place to ask; it's pretty active and I'm there too. On our main page there's a menu with two cool things you probably want to explore. The first is our blog, where we have a lot of cool demos I didn't talk about today. Joe did a nice zero-shot topic classification: it's possible to do zero-shot very well with these models; you just write the names of the classes and use the model directly. In the example, "who are you voting for in 2020," you put some candidate classes and the model picks the most relevant ones. This zero-shot pipeline is also in the library.
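[Editor's note: a minimal sketch, not from the talk, of that zero-shot classification pipeline; the example text and labels follow the demo just described, and the default underlying model is whatever the library ships.]

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
print(classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "sports", "business"],
))
```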
There's also a nice continual-learning model on the blog; I was talking about the problem of continual learning, and RAG, a nice piece of work by Facebook that we've been open-sourcing, is a model able to update its database. I think this is a lot more interesting than GPT-3, because this model could learn about COVID, which is kind of cool. So look at our blog, and the other nice thing is the forum; it's a Discourse forum, so check it out, it's pretty active. I was there this afternoon and there are already 19 new posts, and I'm there too, with all the other people, so if you have other questions, that's the best place to ask. The GitHub issues are better for bugs, but the forum is really great.

[Meghana] Okay, great. Well, thank you everyone for joining us today, and thank you Thomas for a great session. Please join us again on November 12th; I'll be speaking about efficient language models, finding your optimal architecture. It uses Hugging Face and is based on their DistilBERT publication, so it'll be a nice continuation into practical application. Thanks everyone, and have a great day.

[Thomas] Great, thanks everyone, and see you soon on the forum maybe. Bye.
Info
Channel: Robert Monarch
Views: 1,497
Rating: 5 out of 5
Keywords: Machine Learning, Bay Area NLP, HuggingFace, Transfer Learning, Natural Language Processing
Id: qWUslmU7BjY
Length: 67min 17sec (4037 seconds)
Published: Mon Oct 19 2020