Hugging Face Course Workshops: Pretraining Language Models & CodeParrot

Video Statistics and Information

Captions
Hello Leandro, hello Merve! How's it going? I'm good. I'm hosting Leandro today; he's a machine learning engineer at Hugging Face, and today we're going to go through the course material on causal language modeling. He's going to show us how he built CodeParrot, a causal language model that can write code. Anything to add? No, that sounds good — shall we get started? Sure, let me start sharing my screen. You should be able to see it now, right? Yes. Awesome.

To set the scene a little, I want to give a high-level overview of what we are actually doing with CodeParrot. One thing a lot of people have already heard of is GitHub Copilot: if you go to copilot.github.com you can see what Copilot can do. Essentially it's a very fancy auto-completion tool — you write some code and it completes it for you — and it's a really cool tool that helps you write code faster. What I want to show you today with CodeParrot is a replication of Copilot, to see what it takes to build something like Copilot that can generate code for you.

Let me start with the course. We have a chapter on causal language modeling, because language modeling is the technology behind Copilot. Causal language modeling essentially means you try to predict the next word based on the previous words. Models such as GPT, GPT-2, GPT-3 or GPT-Neo all use causal language modeling in the background: if you can predict the next word, you can continue and predict the word after that, and so on, and that lets you generate long sequences of text, which these GPT-2 and GPT-3 models are really good at. In this chapter of the course we show what it takes to actually pre-train such a language model, because usually when you work with Transformers you only fine-tune: you take a pre-trained model from Google or Facebook or some other big research lab and fine-tune it. Here we look at what it takes to do the pre-training of such a large model yourself. Any questions so far? Awesome.

Before we start training a large language model, the most essential thing you need is a very large dataset to train on: the larger the language model you want to train, the more data you need. Since we want to build something that generates code, we used GitHub and the code on GitHub to build a large dataset of Python files. We're just going to do it for Python; if you want to do it for another language the procedure is pretty much the same, you just need to gather a lot of data for that language. We created that dataset and it's available on the Hugging Face Hub: if you go to transformersbook/codeparrot, that's where you find the full raw dataset with about 20 million Python files, which are about 180 gigabytes.
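Going back to the causal language modeling idea for a second: a minimal sketch of what next-token generation looks like in practice, using the stock gpt2 checkpoint purely as an illustration (CodeParrot is trained the same way, just on code):

```python
from transformers import pipeline

# Causal language modeling in practice: the model repeatedly predicts the next
# token given everything generated so far. The gpt2 checkpoint is just a stand-in here.
generator = pipeline("text-generation", model="gpt2")

prompt = "def hello_world():"
outputs = generator(prompt, max_new_tokens=30, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])
```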
Is there an API to do this, or did you scrape it yourself from the repos? There is an API from GitHub to access data inside repos, but as far as I know it's rate limited and pretty slow, so it's not meant for scraping the whole of GitHub. However, there is a public database on Google BigQuery where you can query all the files that are on GitHub — I think it's several terabytes of files — and we just filtered for the subset that is Python. You can actually see it on the dataset card: we added the procedure for how to get it. You go to Google BigQuery, search for the GitHub dataset, and then you can filter, for example, for all the files that end in .py, and it creates the dataset for you.

I wonder about deduplication, but maybe we can get to it later when we answer questions, because I have a question about that. Sure — deduplication is certainly an issue. Just to give you a rough idea: we started an initial training run with the raw CodeParrot dataset and found the performance was not great. When we looked a little closer, we found there were a lot of duplicated files; I think the file that was duplicated the most appeared about 13,000 times in the dataset. Obviously, the model trained on that data became really good at replicating that one file, but it wasn't so good at generating general Python code. So that's certainly an issue, and maybe we can have a look at it a little later. That would be nice.
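Exact deduplication of this kind can be done with the Datasets library itself. Here is a minimal sketch — not the actual CodeParrot preprocessing script — that hashes each file's content and keeps only the first occurrence of every hash:

```python
import hashlib
from datasets import load_dataset

# Minimal exact-deduplication sketch (not the real CodeParrot preprocessing):
# hash each file's content and keep only the first occurrence of every hash.
# Shown on the full dataset for clarity; in practice you need enough disk for ~180 GB,
# or you run it on a subset first.
ds = load_dataset("transformersbook/codeparrot", split="train")

seen = set()

def is_first_occurrence(example):
    h = hashlib.md5(example["content"].encode("utf-8")).hexdigest()
    if h in seen:
        return False
    seen.add(h)
    return True

# Keep num_proc at its default of 1 so the `seen` set is shared across all examples.
deduplicated = ds.filter(is_first_occurrence)
print(f"kept {len(deduplicated)} of {len(ds)} files")
```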
Sounds good. So once we have the dataset, we can almost start training; we just need a little bit of preprocessing. Usually there is a maximum context size for the model — most models can only look at so many tokens at once. For BERT it's about 512 tokens, for GPT-2 it's 1,024, and GPT-3 can look at about 2,048 tokens at once. But the dataset comes with files that are longer and shorter than that, and somehow we need to transform them efficiently so we can train the model on them. I have a little slide to visualize that.

Imagine you have a text and you tokenize it, so you split it into tokens, and the tokenized text is much longer than your context size — maybe you're looking at books, which are much longer than the context window of the model. One easy thing you can do is tokenize the whole text and split it into chunks of the context length. At the end of the book there might be a page that doesn't fit into one of those context windows, and you throw it away. That works really well when your tokenized text is much longer than your context window. But if the context window is comparable to the size of the documents, it can be wasteful. For example, here I have three samples of tokenized text and a context window: the first and the third I could use, but the second one I would need to throw away because it's too short, and that's really wasteful.

Can't you just pad the shorter one somehow? That's one option, but it's also a bit wasteful, because you end up with a lot of padding tokens, and you still have to pass all those pad tokens through the model. Even though they're padded and ignored in the loss, they are still computed in the model — you still compute the attention and so on, and only after the computation is the padding ignored — so that's also not very efficient.

So you basically decide which documents to get rid of based on the context size? There's a trick we can use so we don't need to get rid of those documents: we concatenate all the samples and just add an end-of-sequence token in between, so the model knows where one document ends and a new one starts. That's what you can see at the bottom of the slide: we add everything together with the end-of-sequence token in between, and after that we split it, and then we hardly need to throw anything away — maybe a little bit at the very end of the concatenated sequences. If we scale this up and do it for a thousand documents, we only need to discard a tiny fraction of those thousand documents, which is much less wasteful. Does that answer your question? Awesome.

In the course we train a very small version of CodeParrot — a little CodeParrot, maybe a CodeParrot chicken — and we train on very tiny context windows of just 128 tokens. That's much cheaper than training a model with a large context window, and it allows you to train that small model on a single GPU in a reasonable time: if you run it for a day or two you already get pretty good results, and you only need one GPU, not expensive infrastructure. In that course example we actually use the first approach, because with a small context window most documents are much longer than the window and we can easily chunk them; for the real CodeParrot we do the other thing, where we concatenate many samples.

You can see here how to do that. We load a tokenizer we created for code, which tokenizes code efficiently. If you pass that tokenizer an element of the dataset and set truncation to True, and do only that, it cuts off everything longer than the context size of 128 tokens. But we just said we don't want to throw away what comes after. So you can add an extra argument, return_overflowing_tokens, which means that even if there are tokens beyond the 128-token window, they are still returned — the tokenizer does the windowing for you. In addition, you can also return the length of each of those windows with return_length. If you do that for one sample, you can see you get many chunks that each have 128 tokens, and at the very end one that only has 41 tokens; that last one is too short, so we throw it away. Then we apply that logic to the whole dataset with a map function: we apply the function we just defined, and in a second step it goes through each element, checks whether its length equals the context length, and drops the short leftover chunk at the end.
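As a rough sketch of that preprocessing step, following the course chapter — the stock gpt2 tokenizer and the tiny in-memory dataset here are stand-ins for the dedicated code tokenizer and the real data:

```python
from datasets import Dataset
from transformers import AutoTokenizer

context_length = 128
# Stand-in for the dedicated code tokenizer mentioned above; the mechanics are identical.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tiny toy dataset with the same "content" column as the real one, just so this runs end to end.
raw_dataset = Dataset.from_dict({"content": ["def add(a, b):\n    return a + b\n" * 50]})

def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,  # keep the chunks beyond the first window
        return_length=True,              # report the length of every chunk
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:     # drop the short leftover chunk at the end
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["content"])
print(tokenized_dataset)
```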
That gives you a larger dataset of training samples that each have that fixed context length, and that's pretty much all the preprocessing you need to train a little CodeParrot. You can then feed that data to a model. First we need to initialize a new one, and we can do that by loading the configuration of an existing model — here we load the configuration of GPT-2 — and then define some extra information: because we have a special tokenizer, we tell the new model what the vocabulary looks like, what the context length is, and what the beginning- and end-of-sequence tokens are. That's all it takes to initialize a fresh GPT-2 model; in this small example we create a GPT-2 model with about 100 million parameters.

And that's all you need. You then initialize a data collator, which makes sure your data is batched nicely, and you pass everything to the Trainer: the newly initialized model, the tokenizer, some training arguments that specify the batch size, how long to train and so on, the data collator, and the dataset we just created with its fixed-size context windows. That's all the code you need to train a CodeParrot-mini.
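Piecing those steps together, a minimal sketch of the small-model setup might look like this — continuing from the `tokenizer`, `context_length`, and `tokenized_dataset` of the previous snippet; the training hyperparameters are illustrative, not the ones from the course:

```python
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Fresh GPT-2 configured for our tokenizer and context size (values are illustrative).
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = AutoModelForCausalLM.from_config(config)  # randomly initialized, not pre-trained

# For causal LM the collator just pads and stacks batches; mlm=False means no masking.
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="codeparrot-mini",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    learning_rate=5e-4,
    weight_decay=0.1,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
trainer.train()
```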
I thought what would be cool now is to look at the real deal: the large CodeParrot. For that I'll switch over to the actual CodeParrot code. You can find it in the Transformers repo under examples/research_projects, where there is a codeparrot folder, and under scripts there is a script called codeparrot_training that does all the training. A few things are different from what I showed before. One, we train on a larger context window of 1,024 tokens. Two, we do it in a distributed fashion: we don't want to train on just one GPU, because a larger model needs more compute, so we go for 16 GPUs — that's what we used for CodeParrot, but you could use more or fewer depending on your infrastructure. What you're going to see is what it takes to do distributed training, which you usually don't need when you just fine-tune. And the last change is that instead of a 100-million-parameter model, we train a 1.5-billion-parameter model, so roughly ten times bigger. By the way, are you scrolling right now? No? Okay, I thought you were — it's fine now, I just wanted to check.

So you used Accelerate for the distributed training? Yes, that's the main difference. You could also use the Trainer to do distributed training — the Trainer can do that — but Accelerate gives you a little more freedom to play with the training loop itself.

The whole script is about 250 lines of code, and if you stripped the logging and some whitespace you could probably get it down to about 150 lines, so it's actually not that much code to train something like Copilot. One of the biggest chunks is the ConstantLengthDataset that we define, and the idea of that dataset is exactly to implement the logic from before: you take a few samples from your dataset, tokenize them, concatenate them, and then split the long sequence of tokens into model inputs.

There are a few arguments you can pass to that dataset. The first one is infinite, which controls whether the dataset should restart once it reaches the end — we have about 200 gigabytes of data, but at some point we'll reach the end, and if infinite is set to True and we want to keep training, it just restarts from the beginning. We have a sequence length, which is the length of the sequences we want to generate, and a number of sequences, which says how many sequences we concatenate in one go: if we set it to 1,024, we want to produce roughly 1,024 inputs at once, and once we've produced them we start again and produce the next batch of sequences — batching of batches, in a sense. To estimate how many texts we need to concatenate to get roughly that many sequences, we have a heuristic called characters per token: we simply measured with our tokenizer that, on average, a token corresponds to about 3.6 characters. Knowing that we want to create 1,024 sequences, that each sequence has 1,024 tokens, and that each token has about 3.6 characters, we can multiply those together to know how many characters to gather, and therefore how many text files to fetch; once the buffer reaches that number of input characters, we know we have enough to create 1,024 sequences of length 1,024.
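That buffer-size heuristic is just a multiplication; as a quick back-of-the-envelope check (the variable names are illustrative, not necessarily the ones in the script):

```python
# Back-of-the-envelope buffer size for the ConstantLengthDataset heuristic.
seq_length = 1024        # tokens per training sequence
num_of_sequences = 1024  # sequences produced per refill of the buffer
chars_per_token = 3.6    # measured average for the code tokenizer

input_characters = int(seq_length * num_of_sequences * chars_per_token)
print(f"characters to buffer before tokenizing: {input_characters:,}")
# -> 3,774,873 characters with these exact values, i.e. the "roughly 3.6 million"
#    quoted in the talk once 1024 is rounded down to a thousand.
```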
By the way, I don't want to interrupt you, but we have two questions. One: is there a specific reason why you chose the GPT-2 config over something like GPT-Neo, which has local and global attention? Not really, there's no deep logic to it. The main goal was to replicate something like Copilot, and Copilot uses GPT-3 in the background, which has some minor modifications over GPT-2, but since we have GPT-2 available we just used that. Checking whether GPT-Neo performs better on these kinds of tasks would be an interesting experiment.

The other question: how long does it take to train on GPU without using Accelerate? I'm not sure I understand the question — a GPU is an accelerator, right? I think they mean the Accelerate library. Accelerate doesn't actually speed up the training itself; it's just a library that helps you easily distribute the training over several GPUs. You could use another framework for that — you can do distributed training in native PyTorch, for example — it's just more code and more things to do manually. I think the question is really how long it takes on one GPU. In that case: we trained two models. The small GPT-2 model with the full context size of 1,024 took one day on 16 GPUs, so it would take about 16 days on a single GPU, and the large model took one week on 16 GPUs, so about 16 weeks on one GPU — quite long. And you don't actually save any money by using fewer GPUs, because cost scales with GPU-hours, so it's better to spend the money over a short amount of time than to wait four months and only then find out there was an issue in your training.

We can continue if you'd like — we don't have any other questions right now. Good. In this block we create an iterable dataset, and we'll see a bit later exactly why, but essentially we don't want to load the whole dataset into memory: the dataset is about 200 gigabytes, and if we also tokenized it all it would take a huge amount of memory, so we would need a huge machine, which comes with a whole lot of issues. There is a cool feature in the Datasets library called streaming: you never load the full dataset, you just stream single samples.

So you were streaming during your training loop? Exactly. Originally we streamed directly from the Hub, which is quite cool: without downloading the dataset you can just stream it from the Hub and it fetches sample after sample. At some point we had connection issues — the connection dropped and the training got interrupted — so we ended up storing the dataset locally on disk but still streaming from there, so you still never load the full dataset at once, you just go through the files one by one. And since we're not really data-constrained — the thing that takes a long time is the forward and backward passes, not the data processing — streaming is not a bottleneck.
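Streaming from the Hub looks roughly like this; the split name is an assumption on my part, but the repo id is the raw dataset shown earlier:

```python
from datasets import load_dataset

# Stream the raw CodeParrot data from the Hub instead of downloading ~180 GB.
# The split name "train" is an assumption; check the dataset card for the exact layout.
streamed = load_dataset("transformersbook/codeparrot", split="train", streaming=True)

# A streamed dataset is an iterable: samples are fetched lazily, one at a time.
sample = next(iter(streamed))
print(sample["content"][:200])  # the "content" field holds the raw Python file
```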
So with an iterable dataset, you just need to define what your __iter__ function does: it's a generator that yields new samples, and in there we implement the logic I mentioned before. We have a buffer, and the buffer is filled with text until there are enough characters to generate those 1,024 sequences of 1,024 tokens — at about 3.6 characters per token, that's roughly 3.6 million characters. The loop goes through the dataset, which is an iterator: we get a sample, take the field called "content" that holds the Python file, append it to the buffer, and add the length of that file to the buffer length. Once the buffer is long enough — once it holds more than that target number of characters — we stop; so we just gather text files until we have enough characters in the buffer, and then we tokenize all those texts. Should I make it a little bigger? No, it's fine. After we tokenize the texts, we concatenate all the token ids with the concatenation token — the end-of-sequence token — in between, which gives one very long sequence of tokens. Then we chunk it into the sequence length: we go through that long array, pick out chunks of 1,024 tokens, and yield each chunk as a tensor. So it's a dataset you can loop over, and it always returns a window of 1,024 tokens for you.
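A compact sketch of that iterable dataset, simplified from what the real codeparrot_training script does — treat the names and details as approximations:

```python
import torch
from torch.utils.data import IterableDataset

class ConstantLengthDataset(IterableDataset):
    """Simplified sketch: buffer raw text, tokenize, concatenate with an EOS token,
    and yield fixed-length windows of token ids."""

    def __init__(self, tokenizer, dataset, infinite=False,
                 seq_length=1024, num_of_sequences=1024, chars_per_token=3.6):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id  # separator between documents
        self.dataset = dataset
        self.infinite = infinite
        self.seq_length = seq_length
        # How many raw characters to buffer before tokenizing one big batch.
        self.input_characters = seq_length * chars_per_token * num_of_sequences

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            buffer, buffer_len = [], 0
            # 1) Fill the character buffer with raw Python files.
            while buffer_len < self.input_characters:
                try:
                    text = next(iterator)["content"]
                    buffer.append(text)
                    buffer_len += len(text)
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)  # restart in "infinite" mode
                    else:
                        more_examples = False
                        break
            # 2) Tokenize everything and concatenate with the EOS token in between.
            tokenized = self.tokenizer(buffer, truncation=False)["input_ids"]
            all_token_ids = []
            for ids in tokenized:
                all_token_ids.extend(ids + [self.concat_token_id])
            # 3) Yield fixed-size windows; the short tail is simply dropped.
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    yield torch.tensor(input_ids)
```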
So far so good — but we have three questions, if you'd like to answer them. First: is DeepSpeed used for the training? No, we don't need DeepSpeed. We used A100 GPUs, and they have enough memory that you don't need the optimizations DeepSpeed gives you. DeepSpeed also comes with some drawbacks in terms of efficiency — you need extra optimization steps, so it's a bit slower — and since we don't need it, we don't use it.

Another question: for deduplicating the dataset, do you think it would be faster to remove duplicate content in BigQuery, or with Hugging Face Datasets? That's a good question. You probably could do it in BigQuery, and it might be more efficient that way, but doing it with Datasets doesn't actually take that long. Since we do exact deduplication — we only remove files that are exact duplicates — it takes, if I remember correctly, about two or three hours to deduplicate 200 gigabytes of data, which is actually quite fast. I used a machine with a few more CPUs, about 16 cores, so that helps, but it's still pretty fast, so I didn't think it was worth optimizing further.

Is CodeParrot appropriate for generating SQL code, conditioned on a task description and some schema info? The architecture of the model in general, yes, but not the model we trained, because we only trained it on Python code. It could be that there are SQL bits inside the Python code we trained on, because people also use Python to run SQL queries, so maybe it works, but I suspect it's not great at it. If you fine-tune it on a dataset with more SQL it should work better, and if you just want to generate short SQL queries you could do something similar to the course example, with a very small model and a small context size, because you don't need many tokens to describe a query. Makes sense.

This is a nice question: how do you evaluate the quality of CodeParrot — did you use something like perplexity? We measure validation loss and perplexity during the language model training, but evaluating code models is a little tricky. Most scores you usually use for generated text don't really work for code. You can compute perplexity, but even for summarization it's not a very good metric for telling you whether the summaries are good; there you usually use scores like ROUGE or BLEU. Those don't really work for code either, because what they do is compare the overlap between, say, a reference translation and the model's translation — but in code, the words used inside a function can be very different while the logic is exactly the same. One person names a variable x and another names it y, and with BLEU or ROUGE you would get a very low score just because the words differ, even though the code is equivalent.

What we did for downstream evaluation is look at how well CodeParrot can actually generate working code. There is a very nice benchmark from OpenAI that they used in the Codex paper — Codex is the model behind Copilot — where they have a set of programming challenges: the model gets the signature of a function plus its docstring as input, and it needs to generate the function body from that. Yes, I remember that one — and I remember some lab later made an even harder, more categorized challenge because they thought that one wasn't challenging enough. Exactly, there were some other coding challenges; I'll post the link when I find it. Awesome, thanks.

Each of those coding challenges comes with a set of unit tests, so the model generates a function, you run the unit tests against it, and you can see how many of the generated functions pass their tests. That's a much better way to evaluate code models. That evaluation code is also in the repo: there is a human_eval script that runs the evaluation of a model. And in Datasets we have two things for this: the HumanEval dataset, which contains the coding challenges, and a metric called code_eval, to which you pass a generated function and a set of unit tests, and it checks whether the function passes the tests by executing the Python code. That's cool.
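The code_eval metric can be used on its own; a minimal sketch with a toy candidate and test case (not the actual HumanEval problems):

```python
import os
from datasets import load_metric

# code_eval executes untrusted model-generated code, so it must be enabled explicitly.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = load_metric("code_eval")

# One toy problem: a unit test plus two candidate completions (one wrong, one right).
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a * b", "def add(a, b):\n    return a + b"]]

pass_at_k, results = code_eval.compute(
    references=test_cases, predictions=candidates, k=[1, 2]
)
print(pass_at_k)  # e.g. {'pass@1': 0.5, 'pass@2': 1.0}
```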
So another one: why do you think there were problems when you had duplicates in your dataset? What we found is that when we evaluated on these downstream tasks, the model didn't perform as well as we hoped: it didn't pass many unit tests. At first I thought something was wrong in the generation — that I had broken something — so I investigated that for a long time. What I did next was look at the validation loss on individual samples, and there were some samples with a very high validation loss and some with a very, very low validation loss. When I looked at the samples with very low loss and googled those code snippets, I found they came from very popular frameworks — there was NumPy code with a very low loss, for example — and I thought that's probably because it's popular, so there's probably duplication. That's why we deduplicated, and after deduplication we trained another model and saw the performance increase by more than 50 percent, which was a very strong hint that duplication really was the issue. It changes many things, like the data distribution, so it makes sense that the model performs badly if there are too many duplicates in the data. To give you a sense of how bad the duplication was: we found that about 1.1 percent of the unique samples were responsible for something like 30 percent of the data. Only a few samples appeared so many times that they dominated the dataset to some degree — and the model obviously wins by memorizing those samples, because that gives it a low loss.

We have a couple more questions — like last week, there are so many questions, but I hope that's fine for you. Sure. This is a collator-related question, in my opinion: your dataset was several Python files that you fed to the model a chunk at a time — the question is how that relates to the sequence length and the number of sequences here. So, outside of that ConstantLengthDataset you actually don't see anything of those 1,024 sequences; you don't know that's happening under the hood. If you get data from that dataset, you always get one sample, and that one sample always has 1,024 tokens. The number of sequences is just there so we don't waste too many documents and can concatenate a lot of them before we tokenize; it's not what the trainer sees. The trainer just gets one sample at a time, and the batching and everything else happens outside of that dataset. I hope that answers the question.

Are you still there? The internet was gone for a second, I'm very sorry. No worries, happens to the best of us. Do you have any other questions? Maybe we can answer one or two more and then I'll show the rest of the script, just so we finish it, and then I can answer more questions. I'll pick out the really good questions, and for the rest: if you'd like to ask Leandro questions, there is a general event chat room on our Discord — I'll post the link — so you can ask your questions over there, and that way we save a little time for the rest of the presentation. So you can go on.
Sure, sounds good — I'm happy to answer questions, I just thought it would be nice to finish so people get the general overview, and since the rest is not so special I can show it fairly quickly. There's a pretty big chunk of the script that sets up logging, which is not very interesting, so I won't go into much detail; just so you know, we log with TensorBoard and Weights & Biases, and at the same time we do normal Python logging, to be sure all the important information is stored somewhere.

The part that's a bit more interesting is creating the actual data loaders — the thing that will later generate the batches. You can see that we load the dataset in streaming mode; the dataset name is the CodeParrot training set. One thing that's quite interesting: because we stream, we can't shuffle the dataset globally — we never have the full dataset, so we can't shuffle all of it — but it's still good to add a little noise to the data, because maybe all the NumPy files follow each other and then you have a big chunk that's just NumPy, and that might hurt your training. Something like that actually happened in BigScience, I think: they had an issue where a big chunk of a file was just backslashes, and the model was trained on a billion backslashes for a while, which is obviously not great. What you can do instead is shuffle a buffer: the dataset gathers a number of files and shuffles those before yielding them, which adds a bit of noise, a bit of shuffling, to your stream. With that, you create the ConstantLengthDataset we just looked at. For training we make it infinite, because we want to keep drawing data until we decide we've done enough steps; for validation we don't want that — every time we validate, we want to go over the full validation set exactly once and then stop — so we set infinite to False. Then you pass those to a normal PyTorch DataLoader with a batch size, and it gathers samples from the ConstantLengthDataset to build each batch.
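A rough sketch of that dataloader setup, reusing the ConstantLengthDataset sketch from above — the real script takes the dataset names and buffer size from its arguments, so the values here are stand-ins, and the validation data should of course be a held-out split rather than the training data used below:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

def create_dataloaders(tokenizer, seq_length=1024, batch_size=2, shuffle_buffer=1000):
    # Stream the data so the full ~200 GB never sits in memory.
    # "transformersbook/codeparrot" is the raw dataset mentioned earlier, used as a stand-in
    # for both splits; the real script uses separate train/validation datasets.
    train_data = load_dataset("transformersbook/codeparrot", split="train", streaming=True)
    # Shuffling a streamed dataset only shuffles a buffer of samples, not the whole set.
    train_data = train_data.shuffle(buffer_size=shuffle_buffer, seed=0)
    valid_data = load_dataset("transformersbook/codeparrot", split="train", streaming=True)

    train_dataset = ConstantLengthDataset(tokenizer, train_data,
                                          infinite=True, seq_length=seq_length)
    valid_dataset = ConstantLengthDataset(tokenizer, valid_data,
                                          infinite=False, seq_length=seq_length)

    # Each sample is a 1-D tensor of seq_length tokens, so the default collate just stacks them.
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
    eval_dataloader = DataLoader(valid_dataset, batch_size=batch_size)
    return train_dataloader, eval_dataloader
```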
The next part is just a helper function for the weight decay: we usually don't want to apply weight decay to the bias terms and the LayerNorm weights, so this helper splits the model's parameters into two groups — the biases and LayerNorm weights get a weight decay of zero, and the other parameters get the weight decay we defined. There's another helper to log some metrics; it just logs to Weights & Biases and TensorBoard. Then the more interesting part starts: an evaluate function. Every N steps we want to evaluate to see how well we're doing, so we have a helper that evaluates the model on the validation set. If you're used to PyTorch it looks pretty familiar; there isn't much that betrays complicated distributed training with Accelerate, except for a few things. We need to apply the gather function: if you have multiple workers, gather collects the results from all of them into one place, into one list. At the end we return the loss, which is the mean over all the gathered samples from all the workers, and the perplexity, which we can simply calculate as the exponential of the loss.
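Boiled down to a hedged sketch, the evaluation helper might look like this — assuming `model`, `eval_dataloader`, and an Accelerator instance (created a bit later in the script) are available, and with the loss handling simplified:

```python
import torch

def evaluate(model, eval_dataloader, accelerator):
    """Simplified sketch: average the loss across all workers with accelerator.gather
    and report perplexity = exp(loss)."""
    model.eval()
    losses = []
    for batch in eval_dataloader:
        with torch.no_grad():
            # In causal LM the inputs double as the labels; the model shifts them internally.
            outputs = model(batch, labels=batch)
        # Repeat the scalar loss per sample, then gather it from every worker
        # so all processes agree on the metric.
        losses.append(accelerator.gather(outputs.loss.repeat(batch.shape[0])))
    loss = torch.mean(torch.cat(losses))
    perplexity = torch.exp(loss)
    model.train()
    return loss.item(), perplexity.item()
```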
A few more things happen before we can start training. At the beginning you need to create an Accelerator, and there's some housekeeping for the arguments we want to parse. One interesting thing is that we clone the repository from the Hub that contains the initialized model: we initialize the model before we run this script and push it to the Hub, and then we clone that repository with the huggingface_hub library — we clone the checkpoint we pushed when we initialized CodeParrot, and here we get it back for training. You can also see an attribute that we use in a few places in the script, is_main_process, which checks whether the worker running the script is the main process. We don't want every worker to clone the repository: imagine the script runs on 16 GPUs and 16 processes start cloning the same repository into the same folder — that's probably not ideal. If you guard the step with is_main_process, only the main worker does it and the others skip it.

One thing that's quite cool, if you're used to Git, is that you can create a branch. Here we create a branch named after the experiment, so when we run the script it creates a branch on the Hub for that experiment and pushes all the model checkpoints to that branch. Maybe you run many experiments with different settings, and at the end you just merge the best model into main, and that's the one people then download from the Hub.

Didn't you use Weights & Biases — why did you need this as well? You mean you could also upload your checkpoints to Weights & Biases? No, I mean in general you can compare your runs anywhere — it can be TensorBoard, it can be Comet — so I just wanted to ask. Ah, but this is not for the metrics, it's for the actual models: each experiment gets its own branch. I can quickly show you: if you go to my model on the Hub and open Files and versions, you can see there are two branches — the main branch and one called "still spaceship" — and if you go to that one and look at the commits, you can see that at the beginning I added the model and the tokenizer, and then every 50,000 steps I pushed the model to the Hub. So all the checkpoints and all the models are there and you can still use them if you want, and when you're happy with a model you merge it into the main branch, which is the final model you see there. I think that's a really cool feature of the Hub: you can version all your models and checkpoints, and once you're happy with an experiment you just merge it into main. I didn't know that — you learn something new every day. I've learned a lot today; I'm usually more of an NLU person than an NLG one, so I'm continuously learning. I'll have to digest it a little, and then I'll come up with questions after the stream.

Cool — we're almost done, so just a few more minutes and then I'll show the rest. Here we load the model, we build the data loaders with the function we created before, and we set up an optimizer and a learning rate schedule; that's also just plain PyTorch, nothing special about it. The interesting part is what happens here: we have a model, an optimizer, and the train and eval data loaders, and we pass them through prepare. That's where all the magic of the distributed training happens: before that you have a normal model and normal data loaders, and after you prepare them, Accelerate takes care of pushing the model to all the workers and of creating distributed data loaders. After that you can pretty much go back to a normal PyTorch loop, and you see very little difference to running a PyTorch loop on a single GPU — everything is handled by that prepare function, which is quite nice. That's really good — I also hadn't used Accelerate before, and it looks very good that you can keep your native PyTorch workflow.

And this is the main training loop — the last thing I want to show — and there's no magic here either. If you're used to a PyTorch training loop, iterating over the train dataloader looks completely normal. You do a forward pass through the model with the batch as inputs and the same batch as labels, because in causal language modeling the input data is at the same time also the labels: to predict the next word you just need to shift the tokens by one to get the labels, and the model does that internally for you, so if you pass the batch as the labels it aligns the labels and the inputs accordingly. We do some gradient accumulation, because we can't actually fit a full batch even on a large A100 GPU: we use a batch size of 512 for the large model, and we have 16 workers, and we can fit at most two samples on one GPU, so one forward pass over the 16 GPUs gives us 32 samples — but we want batches of 512, so we accumulate the gradients over several forward passes, and once we've accumulated enough we do the actual optimization step. We also do some gradient clipping, but that's all pretty standard stuff.

Do you do the clipping because of the accumulation? No, the clipping is more about getting rid of gradients that are too large. But when gradients accumulate, can't that also cause exploding gradients — that's why I wanted to ask whether it's related. I don't think it's because of the accumulation, because the same thing could happen without accumulation if you could actually fit the full batch on the device: we divide the loss by the number of accumulation steps, so it's scaled correctly, but you can still get exploding gradients, and then you clip. I'm learning nice know-how today.
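Putting the loop together as a hedged sketch — it assumes `model` and the dataloaders from the earlier sketches, the hyperparameters are illustrative, and the real script adds logging, a learning-rate schedule, and checkpointing on top of this:

```python
import torch
from torch.optim import AdamW
from accelerate import Accelerator

# Illustrative settings; the real script reads these from its arguments.
gradient_accumulation_steps = 16
max_train_steps = 50_000
max_grad_norm = 1.0

accelerator = Accelerator()
optimizer = AdamW(model.parameters(), lr=5e-4)

# prepare() wraps everything for distributed training; afterwards the loop is plain PyTorch.
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

model.train()
completed_steps = 0
for step, batch in enumerate(train_dataloader, start=1):
    # The batch doubles as the labels; the model shifts them internally for next-token prediction.
    loss = model(batch, labels=batch).loss
    loss = loss / gradient_accumulation_steps  # scale so the accumulated gradient matches a big batch
    accelerator.backward(loss)                 # use Accelerate's backward, not loss.backward()
    if step % gradient_accumulation_steps == 0:
        accelerator.clip_grad_norm_(model.parameters(), max_grad_norm)  # gradient clipping
        optimizer.step()
        optimizer.zero_grad()
        completed_steps += 1
    if completed_steps >= max_train_steps:
        break
```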
Maybe the only other interesting part is here: at certain steps — the save-checkpoint steps, say every 50,000 steps — we want to evaluate the model, so we apply the evaluation function, and after that we want to save the model. There are a few steps here that are a little special. One is wait_for_everyone, which makes sure that all the workers have finished optimizing and are ready for the next step — you synchronize all the workers so they're on the same page. Then there's the unwrap step: when you pass the model through prepare, it adds a wrapper layer around the model, and you don't want to save that extra layer, so you remove it for saving. And then, if it's the main process — and only the main process — we push that checkpoint to the Hub, which is quite cool, because then you have all those checkpoints on the Hub ready to use.

That was great — that's it. Shall we answer a few more questions? Maybe one or two, and then I have to go give a talk at a conference in about two minutes. If you want to ask Leandro more questions, you can go to our Discord channel and ask in the general event chat room, because we might not have time to answer all of them here.

I'll ask one really good question: as Codex was also trained on Python, do you think that if a model were trained on more languages, it would improve the benchmarks? That's an interesting question, and it's what Copilot actually does: Copilot is the commercial version of Codex, and it doesn't only work for Python, it works for many languages. In natural language there is something called cross-lingual transfer: language models trained on many languages can leverage what they learn in one language in other languages — if you train a multilingual model on sentiment classification in French, and that model can also handle English, it also learns a little about how to do sentiment classification in English. So maybe that would help. It would also be more expensive to train, because you need more data and more diverse data.

We can wrap up, if you'd like. Thank you so much for coming. Thanks for having me. It was great — I learned a lot, and it's great that you can have your own Codex or Copilot at home. Definitely. Maybe as a last thing: you can go to Spaces and actually play with CodeParrot yourself — there's a little demo that highlights code, so you can see which tokens or code parts CodeParrot thinks might contain a bug. That looks great. If you have any other questions, head to Discord. Thank you so much for coming, and thanks to all the participants for asking questions and watching — see you in the next session. Thanks for the questions, and bye!
Info
Channel: HuggingFace
Views: 410
Id: ExUR7w6xe94
Length: 58min 1sec (3481 seconds)
Published: Fri Dec 17 2021