Fine-Tuning an LLM for Summarization | Generative AI with Hugging Face | Ingenium Academy

Captions
So in this video we're going to learn how to do full fine-tuning on a model to train it to summarize dialogues. We'll pull our dialogue dataset from Hugging Face, pull our model from Hugging Face, and then train it using the Transformers library.

First things first: if you're in Google Colab, make sure to go up to Runtime and change your runtime type to a T4 GPU, or some other GPU; a TPU works too, just don't use the CPU. The model we're using is called BART. It has about 400 million parameters, and training it is very computationally intensive, so having a GPU is going to speed up your training dramatically; on a CPU the training will take a very long time. I've already installed all these packages, so please pip install them.

Now we're going to load in our model. BART is a sequence-to-sequence model, and this is its model card ID; you can look it up on Hugging Face if you'd like. The model is pretty large at 400 million parameters, and since I've already loaded it in, I won't get the usual loading progress output.

Next I'll load in our dataset. We have to pip install a couple of requirements real quick, and then we can load it. I've already loaded this in, so I may not get the progress bars, but nonetheless I do have my dataset here: it contains a train, validation, and test set, and every example comes with an id, a dialogue, and a summary.

Now I'm going to generate a summary using our base LLM, the BART model we pulled off Hugging Face. It's been trained on general text, so it should be able to summarize text reasonably well, but for our use case it's not going to do so well, and you'll see that. I've created a function that takes as input a sample and an LLM (our BART model), and it builds a prompt that says "Summarize the following conversation," puts the dialogue right after that, and ends with "Summary:", which BART is expected to complete. I'm going to print this out; it may take a second, because BART is kind of big, so generation is a little slow. Minimal sketches of the loading steps and of this helper follow below.
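Here is a minimal sketch of the model-loading step. The captions don't spell out the exact checkpoint; facebook/bart-large-cnn (~400M parameters) is assumed here based on the "bart-cnn" naming of the fine-tuned model later in the video:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/bart-large-cnn"  # assumed checkpoint (~400M parameters)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```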
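And a sketch of the dataset loading. The dataset isn't named explicitly in the captions, but the id/dialogue/summary schema and the 14,732-example train split match the SAMSum dataset, which is assumed here:

```python
from datasets import load_dataset

# SAMSum ships as a .7z archive, so `pip install py7zr` is one of the
# extra requirements mentioned above.
dataset = load_dataset("samsum")

print(dataset)               # DatasetDict with train / validation / test splits
print(dataset["train"][0])   # {'id': ..., 'dialogue': ..., 'summary': ...}
```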
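Finally, a sketch of the baseline-generation helper described above; the exact prompt wording and generation settings are assumptions rather than the video's literal code:

```python
def generate_summary(sample, llm):
    # Build the prompt: instruction, then the dialogue, ending with "Summary:"
    prompt = (
        "Summarize the following conversation.\n\n"
        f"{sample['dialogue']}\n\n"
        "Summary:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = llm.generate(inputs["input_ids"], max_new_tokens=100)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

sample = dataset["test"][0]
print("Dialogue:\n", sample["dialogue"])
print("Generated summary:\n", generate_summary(sample, model))
print("Reference summary:\n", sample["summary"])
```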
So as you can see, this is our dialogue: a conversation between two people named Hannah and Amanda. And what did the model generate? BART pretty much just replicated part of the conversation, with fragments like "Betty's number... let me check... Hannah asked Larry." Hannah didn't actually ask Larry, so not only is it failing to summarize, it isn't really following the dialogue either: it was Amanda, not Hannah, who said "ask Larry." The correct summary, the label, reads: "Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry" (with "she" referring to Hannah). That's the summary we want, and what we're getting right now is clearly not good summarization. But it will get there.

First we're going to prepare our dataset for training. If you noticed, we have 14,732 training examples, which is fairly big and would take forever with this type of model, even on a GPU. So we go through each example in our training data, build the same prompt as before, and create two columns, input_ids and labels: input_ids is the tokenized prompt, and labels is the tokenized summary — our prompt completion. We also have to set the pad token, because this model is one of those models where you have to specify the token used to pad any sequences that aren't long enough. Then we map this tokenization function over every example in the dataset, setting batched=True so it runs a bit faster, and we remove the old columns; we only want to keep input_ids and labels.

We're also going to filter this dataset — filtering is another dataset preprocessing function you'll use a lot with Hugging Face. Here I'm only keeping the examples whose index modulo 100 equals zero; it's just a way to keep every 100th example and shorten the dataset from 14,732 down to a smaller number, since we only want a little bit of data. You could have used shuffle and select instead — I'd actually advise shuffle and select, to get a better representation of the data — but this should be fine for our example. As you can see from the progress output, the map ran over every example across train, validation, and test. Printing the shapes of our datasets, we see 148 examples in the training set (14,732 divided by 100 comes out to about 148), and the validation and test sets have shrunk as well. Just to show you what happened in the train set (and in the validation and test sets too): the only columns left are input_ids, the tokenized prompt, and labels, the tokenized correct summary. We're simply training the model to take in that prompt and produce that summary.

Next, you need to get your access token, because after we train this model we're going to push it to our account. Go back to Hugging Face, grab your access token as usual, copy it, come back, and log in. It should say the token is valid and the permission is "write" — remember, you need write permission so you can commit changes to your repo.

Now for the new thing we're learning here: Transformers has a Trainer class that lets you train a model very easily, and we pass it a set of training arguments. This is where we specify the directory name of our model on the Hub. I'm going to push mine to ingenium-academy (my account) slash the name of my model, bart-cnn-samsum-finetuned, and I set the output directory to the same name; you would put your own account there, obviously. This is the learning rate I chose, and this is the number of training epochs; I only chose one epoch because full fine-tuning takes about three minutes per epoch. You'll see that PEFT doesn't take three minutes — it takes a little under two — so you shave some time off because you're only training about one percent of the parameters. I also set auto_find_batch_size to True; it will probably settle on a batch size of one, since BART is kind of large for a Google Colab GPU, and it's just going to log everything at the end of its steps. Make sure to pass your tokenizer along with your model to the Trainer, so that whenever you pull your model from the Hub it comes with the tokenizer. I give it the train and eval sets and instantiate everything. Sketches of this preprocessing, the login, and the Trainer setup follow below.
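Here is a sketch of that preprocessing, reusing the same prompt template as before; the function name and max lengths are illustrative assumptions:

```python
def tokenize_inputs(batch):
    # input_ids = tokenized prompt, labels = tokenized reference summary
    prompts = [
        f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
        for dialogue in batch["dialogue"]
    ]
    model_inputs = tokenizer(prompts, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(batch["summary"], max_length=128, truncation=True, padding="max_length")
    return {"input_ids": model_inputs["input_ids"], "labels": labels["input_ids"]}

# Some models ship without a pad token, in which case you set one yourself;
# BART already defines <pad>, so this is a no-op safeguard here.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

tokenized_dataset = dataset.map(
    tokenize_inputs,
    batched=True,                                  # faster than example-by-example
    remove_columns=["id", "dialogue", "summary"],  # keep only input_ids and labels
)
```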
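Next, the filtering step, plus the shuffle-and-select alternative recommended above:

```python
# Keep only examples whose index is divisible by 100, i.e. every 100th example.
tokenized_dataset = tokenized_dataset.filter(
    lambda example, index: index % 100 == 0,
    with_indices=True,
)
print({split: tokenized_dataset[split].shape for split in tokenized_dataset})
# train shrinks from 14,732 rows to 148

# Alternative: a random subset is usually more representative than a fixed stride.
# small_train = tokenized_dataset["train"].shuffle(seed=42).select(range(148))
```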
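And a sketch of the login and Trainer setup. The repo name matches the one mentioned in the video; the learning rate and logging/evaluation strategies are placeholders, since the captions don't state the exact values:

```python
from huggingface_hub import notebook_login
from transformers import Trainer, TrainingArguments

notebook_login()  # paste an access token with *write* permission

training_args = TrainingArguments(
    output_dir="bart-cnn-samsum-finetuned",
    hub_model_id="ingenium-academy/bart-cnn-samsum-finetuned",  # your-account/your-model
    learning_rate=1e-5,          # placeholder; the video's exact value isn't in the captions
    num_train_epochs=1,          # full fine-tuning takes ~3 minutes per epoch on a T4
    auto_find_batch_size=True,   # likely lands on batch size 1 for BART on a Colab GPU
    logging_strategy="epoch",    # assumed; the video just says it logs at the end
    evaluation_strategy="epoch", # assumed, so we see the validation loss too
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,  # so the tokenizer is pushed alongside the model
)
```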
Once everything is instantiated — I'm not going to train the model here, because it does take three minutes, but I did train it beforehand, as you can see — this is our training loss, and this is our validation loss, which is the cross-entropy loss. Then we just push the model to the Hub. I've already pushed mine; if I go into my account and look it up, it's right here, bart-cnn-samsum-finetuned, and it should show up in your Hub too (give it some time if the model card isn't right at first). You can then use it just like any other Transformers model — and it definitely feels good to contribute. Here is some code to load your model back in and retest it; I'll load it in, give it a second, and generate the summary again. Sketches of the training, push, and reload steps are below.

And here is the new summary it gives. We had our dialogue before, and remember what the model did last time: it just regurgitated the conversation, and got it wrong at that. Now it says "Hannah asked Amanda if she has Betty's number" — "hey, do you have Betty's number?" — that part is correct. "Amanda can't find it" — that's also correct. "She asked Larry to call her" — that's not correct: Amanda is asking Hannah to call Larry; Amanda isn't doing it herself. "Amanda is going to text Larry to ask him to call Betty" — that part is incorrect as well; Hannah is going to text Larry. The ground truth is: "Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry." So the model's output isn't that close, but it's a lot better than what we just had: it's no longer regurgitating the conversation and getting it wrong; it's actually trying to produce a summary, and although it gets part of it wrong, it gets part of it right.

In the next video we're going to show how to do PEFT on this model, and you'll see that PEFT is not only faster but also improves the model just a tad. Let's hop into that video.
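For completeness, a sketch of the training and push steps described above:

```python
trainer.train()        # one epoch of full fine-tuning, roughly three minutes on a T4
trainer.push_to_hub()  # commits the model (and tokenizer) to your Hub repo
```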
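And a sketch of loading the fine-tuned model back from the Hub and retesting it, reusing the generate_summary helper from earlier; the repo id is the one assumed above:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo_id = "ingenium-academy/bart-cnn-samsum-finetuned"  # assumed repo name
finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Re-run the same test example through the fine-tuned model.
sample = dataset["test"][0]
print("Fine-tuned summary:\n", generate_summary(sample, finetuned_model))
print("Reference summary:\n", sample["summary"])
```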
Info
Channel: Ingenium Academy
Views: 2,632
Id: msgLLudzlLg
Length: 10min 6sec (606 seconds)
Published: Tue Sep 19 2023