How To Create A GPT-2 Tokenizer In Hugging Face | Generative AI with Hugging Face | Ingenium Academy

Captions
In this video we are going to create our own tokenizer. We learned about tokenizers, and we even created our own dataset, so let's create a GPT-2 tokenizer and train it on our dataset so that it can tokenize our data even better.

First and foremost, make sure to install Transformers, Datasets, and PyTorch. We're going to load in the dataset that we created in the last video. I'm using the ingenium-academy namespace, but obviously use the path to your dataset in your repository. This may take a second to load, but it comes in, and just to verify it looks the same: here we go, we have our DatasetDict with train, validation, and test splits.

Now we're going to use a new piece of functionality: the dataset.map function. It works similarly to the map function in Python, where you map a function over a list. We're going to create a new column and use that column as the data we train our tokenizer on. Each example has a title and a body, and we're going to create what's called a full_article by concatenating the title and the body inside a formatted string. This function is applied to every single example across the train, validation, and test sets, and it returns a full_article column containing the concatenated title and body. As you can see, we now have our dataset with the full_article column on each of the train, validation, and test splits, and we can print one out: a title and then a body, all concatenated together.

So now let's create our own tokenizer. First we build a batched generator called training_corpus. It iterates over the training data, creates batches of 1,000 examples, and pulls out only the full_article from each one, so it yields batches of 1,000 full articles. Training on batches like this is a bit quicker than going one example at a time. Then we load in the GPT-2 tokenizer, because that's the tokenizer we're going to retrain on our data. We keep a reference to it as the old tokenizer so we can compare the tokenizations between the two, and we call train_new_from_iterator on the training corpus with a vocab size of 52,000.
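A minimal sketch of these steps follows. The dataset repo id, the exact full_article template, and the title/body column names are assumptions based on the description in the video; they aren't shown verbatim in the captions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the dataset created in the previous video.
# "ingenium-academy/reuters" is a placeholder -- use the path to your own repo.
dataset = load_dataset("ingenium-academy/reuters")

# Concatenate title and body into a single full_article column on every split
# (train / validation / test). The exact template is an assumption.
def create_full_article(example):
    return {"full_article": f"TITLE: {example['title']}\nBODY: {example['body']}"}

dataset = dataset.map(create_full_article)
print(dataset["train"][0]["full_article"])

# Generator that yields batches of 1,000 full articles from the training split,
# so the tokenizer trainer doesn't have to go one example at a time.
def training_corpus():
    train = dataset["train"]
    for i in range(0, len(train), 1000):
        yield train[i : i + 1000]["full_article"]

# Start from the pre-trained GPT-2 tokenizer and train a new one on our corpus.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus(), vocab_size=52000)
```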
Calling train_new_from_iterator trains a new tokenizer from the old one. The GPT-2 tokenizer has already been trained on some data; here we're retraining it on our dataset of Reuters articles, and it creates a brand new tokenizer with a vocab size of 52,000. It may take a second, but it won't be too long.

Our tokenizer is done, so let's test it on an example from our dataset. This is the second example, a Reuters article, and we're going to use the old tokenizer to tokenize it to see how it actually breaks up the words. As you can see, the word TITLE gets broken up into separate tokens. The goal of a tokenizer is to break up text efficiently, and since all of our examples contain TITLE, it's probably wise for it to treat that as a single token; splitting it in half is not more efficient. If we look at the tokenizer we just trained, it does have TITLE as one token, which is far more efficient: it's been adapted to our data.

So now that we have our new tokenizer, let's push it to the Hub. As before, we need to find our access token: go to Settings, then Access Tokens, and copy it. I have two of them right now because I created one in the last video, but let's go ahead and log in. Awesome. Now let's push this to the Hub. I already have this repository, so it's going to overwrite what I currently have, and it saves under your models, so it shows up as a new model. You can go right to it; I haven't created a model card for this, and I think that's okay, but you can see that all the files we need for our tokenizer are there.

If I go back and pull it down again, it loads in the tokenizer we just pushed to the Hub. I'll pull out that same example from the dataset (we didn't need to, but just to check), tokenize the full article, and as you can see we get the same tokenization, so everything works just like it did in our notebook.

To recap: we just created a new tokenizer by retraining a pre-trained GPT-2 tokenizer on our dataset of Reuters articles. We can now use this tokenizer as the tokenizer for a new GPT-2 model that we train from scratch on our dataset, and that's quite powerful: just using Hugging Face and not that many lines of code, considering the complexity that's going on under the hood.
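A minimal sketch of the comparison and the Hub round-trip, continuing from the snippet above; the repository name reuters-gpt2-tokenizer and the username placeholder are assumptions, since the captions don't name the actual repository.

```python
from huggingface_hub import notebook_login
from transformers import AutoTokenizer

# Compare the old and new tokenizers on the same article.
example = dataset["train"][1]["full_article"]
print(old_tokenizer.tokenize(example)[:20])  # "TITLE" is split into several sub-tokens
print(tokenizer.tokenize(example)[:20])      # the retrained tokenizer keeps it as one token

# Log in with your access token (Settings -> Access Tokens on the Hub),
# then push the new tokenizer to your account.
notebook_login()
tokenizer.push_to_hub("reuters-gpt2-tokenizer")  # placeholder repo name

# Pull it back down and check that the tokenization matches.
reloaded = AutoTokenizer.from_pretrained("your-username/reuters-gpt2-tokenizer")
assert reloaded.tokenize(example) == tokenizer.tokenize(example)
```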
Info
Channel: Ingenium Academy
Views: 382
Id: jee78mdPL6A
Length: 5min 52sec (352 seconds)
Published: Tue Sep 19 2023