Fine-Tuning T5 for Question Answering using HuggingFace Transformers, Pytorch Lightning & Python

Captions
Okay guys, in this video we are going to fine-tune T5 for question answering. We'll dive a bit deeper into what T5 really is, we'll use PyTorch Lightning to do the fine-tuning, and we'll explore the Transformers library by Hugging Face: how to load the model, fine-tune it, and use it to generate answers to our questions. Let's get started.

Today's video sponsor is MLExpert. MLExpert is a tool created by me that will help you better prepare for the machine learning interview and, in general, get better at machine learning, at least at solving practical machine learning problems. It contains a lot of theoretical and practical questions, a number of complete projects in the areas of computer vision, NLP, time series and tabular data, and general tips and tricks on what you should do to land the machine learning job that you want. There are also programming tips, for example how to work with Git, SQL and Bash, and how to prepare your environment, for example how to set up virtual environments. And there is a mock interview: after you go through the questions contained in the system, the mock interview is a live interview over Zoom or similar software, conducted by me. We'll chat, I'll ask you some questions, and you'll have the chance to ask me questions if something isn't clear. I'm happy to say that we already have almost 90 or 95 people, so I believe just five spots are left, and those might be gone by the time this video is out, but the first 100 people will get a 50% discount. This will be a paid service; I'm not sure of the price yet, but it's not going to be cheap. If you want to be part of the initial MLExpert discount, put your email in the sign-up form, and when MLExpert launches I'll send you a discount code or a special link that you can use to get the 50% discount off the initial price. I'd also like to say that the price is probably never going to be discounted after launch, and it will probably only increase, so if you want to get the MLExpert tool and become part of the community, please go and sign up right now. Thank you.

Here is the notebook that we used in the previous video. As you can see, I'm currently connected to a Google Cloud instance with a P100 GPU, which is great, because we are going to need it right now. I've made just a few changes: I'm using a newer PyTorch Lightning version and the latest tokenizers library. Notice that we don't actually need the sentencepiece library anymore, because tokenizers includes that functionality. I also did some cleanup of the imports; the rest is pretty much unchanged. But we are going to make one change to the dataset. After the data frame for the dataset is created, I'm going to drop duplicated rows, but I'm not going to search for duplicate questions; I'm going to search for contexts that are not unique and drop those instances. To do that, I'll insert a new cell and call drop_duplicates from pandas, dropping based only on the context column, and after that call reset_index to keep everything tidy. The new data frame has a much smaller number of rows, roughly a five-fold decrease in the size of our dataset, which is good, because that will speed up our training and our experiments with the model.
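Here is a minimal sketch of that deduplication step; the column names are taken from the description above, and the toy data frame only stands in for the one built in the previous video:

```python
import pandas as pd

# Toy stand-in for the question-answering data frame from the previous
# video; the real one has "question", "answer_text" and "context" columns.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "context": ["same passage", "same passage", "other passage"],
})

# Keep only the first row for each unique context, then reset the index.
df = df.drop_duplicates(subset=["context"]).reset_index(drop=True)
print(df.shape)  # the real dataset shrinks roughly five-fold
```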
After this is complete, the rest is pretty much the same, so I'm just going to run all of it, load the tokenizer again, run basically everything, and change the batch size that we had before.

Let's now take a minute to discuss what T5 really is. To do that, I'll show you the official blog post from Google, which is about a year old now, 11 months or so, presenting their work on exploring transfer learning with T5. There is of course also a paper, called "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Right off the bat you can see that this is just another Transformer, but the difference between something like, say, BERT and T5 is the way the model is trained. BERT is trained using the masked language modeling task, which basically says: hey, there is a missing word or token right here, what are your predictions for that token? So that's a missing-token prediction task, if you will. With T5, you can see that we have different types of tasks, such as translation, summarization and question answering, and all of those are different pre-training tasks. Every time, the input is text, and no matter the task, the output is text as well. That's quite different from BERT-style models, and it's very interesting, because T5 models just take text in and put text out; you basically don't have to do any additional post-processing of the outputs. This can encompass pretty much all of the NLP tasks you might be looking to tackle. For example, if you're doing text classification, say sentiment analysis, you can input the text and output, for example, "positive" or "negative", just as text. That's your training set, and you're pretty much good to go with T5.
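As a quick illustration of that idea (this example is mine, not from the video, and the "sentiment:" prefix is hypothetical):

```python
# Sentiment analysis framed as text-to-text: both the input and the
# target are plain strings, so no task-specific head or output
# post-processing is needed.
train_pairs = [
    ("sentiment: I loved this movie!", "positive"),
    ("sentiment: The plot made no sense at all.", "negative"),
]
```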
So what was T5 trained on? It was trained on a large pre-training dataset called C4. The folks at Google have actually open-sourced it, and I'll include a link in the tutorial that goes with this video. It's based on the Common Crawl dataset, a crawled dataset that contains a lot of the HTML pages available on the web, and there are various pre-processing steps they took to make the dataset cleaner, let's say. The model was trained on a lot of tasks, and you can see some of them in the appendix of the paper, Appendix D I believe. There you can see a task called RTE ("rte sentence1: ..."), examples of question answering, another one, CoLA I believe, which is basically an "is this grammatically correct?" type of task from NLP, and the summarize task right here. So each task input consists of the name of the task, followed by a colon, and then the input text that you want to process.

We are not actually going to add a new task to the T5 model; we're going to fine-tune it using just the raw text, without a task identifier, let's say. You might want to add another task to your model, that's up to you, but we are going to use the model just for question answering. Another interesting thing about the T5 model is that it's implemented in the Transformers library by Hugging Face. The folks at Hugging Face state that this model is still a work in progress, but I believe it's already very well implemented. The page quotes the abstract from the paper, which again is very much worth reading, and they also mention that it achieves state-of-the-art results on many benchmarks, which is another great achievement for the NLP community. You can see that it is a sequence-to-sequence generation model, which you can use through T5ForConditionalGeneration and its generate method; we are going to look at what this method does. Right at the start there is a training section, and some of you might think that the model needs decoder input IDs; I actually saw this in a comment on the previous video. As far as I understand, the decoder input IDs are optional, because the forward function automatically creates them based on the labels that we pass in. So I believe that training this model can really be simplified to passing in the input IDs, the attention mask, and the labels, and I'm going to show you how you can do that later in this video. The T5 model is very, very interesting, and we are going to use this one in particular, T5ForConditionalGeneration: we'll see how to instantiate it with pre-trained weights and then fine-tune it on our dataset for question answering.

Now that we have some understanding of the T5 model, let's take it for a spin by loading a pre-trained t5-base model. To do that, I use T5ForConditionalGeneration, as shown previously, and call the static method from_pretrained, passing in "t5-base", which is the model name, and asking for a dictionary as the output. This goes ahead and downloads the pre-trained weights; you can see that this model is very large: even though it's the base model, it's almost 900 megabytes of weights. Next, we are going to use this model to translate some text from English to German, just with the pre-trained weights, and see how all of this works. I use the tokenizer and pass in the "translate English to German" task, which is a built-in task, followed by the sentence I want to translate: "I talk a lot, so I want to tune myself out." I want this to return tensors in PyTorch format, and I take just the input IDs and assign them to a variable. Then I call the model.generate method (we're using T5ForConditionalGeneration again, so we have the generate method) and pass in the input IDs, and this returns the generated IDs from the model. If we have a look at those, they pretty much resemble IDs of some tokens, as you can see.
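A sketch of that whole round trip, including the decoding step described next; the example sentence follows the one used in the video:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)

# The task prefix tells T5 what to do with the input text.
input_ids = tokenizer(
    "translate English to German: I talk a lot, so I want to tune myself out.",
    return_tensors="pt",
).input_ids

generated_ids = model.generate(input_ids=input_ids)

# Decode the generated token IDs back into readable text.
preds = [
    tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for gen_id in generated_ids
]
print("".join(preds))
```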
We are going to dive a bit deeper into how generate works in the next couple of minutes, but for now let's look at it as a black box. Those generated IDs are just tokens, and we can use the decoder, sorry, the tokenizer, to decode them. To do that, I create a variable called predictions, and in it I iterate over the generated IDs; for each one I call tokenizer.decode, passing the generated ID, skipping the special tokens, and cleaning up tokenization spaces so the output is a bit easier to read. If you print this, you'll see just an array with some words in German, I believe, and for a better representation of all this you can join the predictions into an output string. You can take that and pass it through Google Translate: "I talk a lot, so I want to express myself", at least according to Google, which is very cool. Not quite the same meaning, but still pretty okay, I guess.

Next we are going to look at a summarization task, and I'm going to copy a text from here. The task is: "summarize: Do I need to be liked? Absolutely not. I like to be liked. I enjoy being liked. I have to be liked. But it's not like a compulsive need to be liked. Like my need to be praised." Let's see how our model summarizes this. I basically do the same thing as before: use the tokenizer to encode this and get the input IDs (sorry, I need to run this first), then use model.generate and take the predictions, just the same as before. Now we have the predictions, and once again I join them to see what the output is. The summarization is: "Do I need to be liked? Absolutely not. I like to be liked. I enjoy being..." So that's pretty much the model's summary.

In all honesty, the generate method has a world of parameters, and working with them can be a little complicated. To understand how to work with them and how to adjust the generation, we need at least some view, some understanding, of how the generation works under the hood. So what is the magic behind the generate method? There is a blog post on the Hugging Face site that contains the answer to that question; the article is called "How to generate text: using different decoding methods for language generation with Transformers". It presents the basic building block of generating text with Transformers, and I believe other NLP methods as well, which is called auto-regressive language generation. In short, auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. What they mean by that is: given a sequence that starts with some initial words from the context that you have, you look at the probability distribution over all possible next words and pick a next word, for example the one with the highest probability; then, with that new sequence, you do the same thing again, and again, and again. Of course, besides always taking the next word with the highest probability, you can use other techniques; the Hugging Face Transformers library implements several decoding strategies, sketched below.
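A hedged sketch of the main strategies, reusing the model and input_ids from the translation example above; the parameter names come from the public generate API, and the values here are illustrative:

```python
# Greedy search: always take the single most probable next token.
greedy_ids = model.generate(input_ids, max_length=50)

# Beam search: keep the 4 most probable partial sequences at each step.
beam_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)

# Sampling: draw the next token from the filtered probability distribution.
sampled_ids = model.generate(
    input_ids, max_length=50, do_sample=True, top_k=50, top_p=0.95
)
```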
It implements greedy search, it implements beam search, and finally it does sampling for you; there are multiple types of sampling right here, and you can go into the details of what each of those is. The T5 model, at least the one for conditional generation, implements, I believe, all of those methods. One interesting feature of the T5 model is that it contains a config that has different parameters for the different tasks you might want to do with the model. Let me show you what this looks like. If I access the config property of the model, you can see that we have some task-specific params right here, and at least some of those are used when the generation is happening. For example, for summarization it says: use beam search (we know it's beam search because the number of beams is more than one; we have four beams here), with a maximum generated sequence length of 200 tokens and a minimum of 30. For translation from English to German you have other parameters for the maximum length and another prefix. So you can see that this provides parameters for the task-specific generation settings. And of course, looking at the outputs of the summarization, I can't help but think about a very well-explained image of what is happening, so let's just look at it.

All right, let's pass some parameters to our model, an encoding and some labels, and see what the model actually returns when we fine-tune it. I'm going to use the question encoding that we created in the previous video; the question comes from the dataset we have from the BioASQ challenge. Let's do the pass through the forward method: I pass in the input IDs from the encoding, the attention mask, and the labels, and this will be the output of our model. If we look at the output, it is a Seq2SeqLMOutput, a sequence-to-sequence language modeling output. It contains logits, a tensor with this crazy shape: the last dimension is the size of our model's vocabulary (you can also find the vocabulary size in the config), and the first dimension is the batch size, which is currently one, since we have just a single example. You also get the loss, and this is just the number we are going to backpropagate the error through when we train the model. And as you can probably guess, we basically get an output over every token in the vocabulary for each position in the sequence, at least with the t5-base model.
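A minimal sketch of that forward pass; the `encoding` and `labels` variables are assumed to hold the tokenized question/context pair and the tokenized answer from the previous video:

```python
# Forward pass with labels: the model computes both logits and loss.
output = model(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    labels=labels,
)

output.loss    # scalar loss, used for backpropagation during training
output.logits  # shape: (batch_size, sequence_length, vocab_size)
```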
Let's build a Lightning module using the T5ForConditionalGeneration model. To do that, I'm going to create a class called BioQAModel, which extends pl.LightningModule. Here we need to implement the __init__ and forward methods, plus the training step, validation step and test step; those hooks are defined by the base LightningModule class, so this should be pretty straightforward. In the constructor I call super().__init__(), the parent's constructor, and then instantiate the model, which is of course the T5ForConditionalGeneration, pretty much the same thing that we did before, with return_dict=True. After this, we define the forward method, which accepts the input IDs, the attention mask, and optional labels; optional, of course, because when we're testing the model, the labels won't be included. I take the output of the model by passing in the input IDs, the attention mask and the labels, and return the loss and the logits right here.

Now that the forward method is pretty much ready, I'm going to implement the training step, which accepts the batch and the batch index; this is all coming from PyTorch Lightning, and if you're not familiar with PyTorch Lightning, you can look at my previous two or three videos, in which I go through some of its basics. The documentation on the official PyTorch Lightning website is of course also a great place to better familiarize yourself with the library. Here we take the input IDs from the batch, do the same for the attention mask and the labels, and get the loss and the outputs from the forward method based on those parameters; so we're basically calling forward with the parameters from the batch, including the labels, all of which comes from the data module we built in the previous video. I log the train loss, displaying it in the progress bar and sending it to the logger, which will print it to the screen (and of course we'll look at the training progress in TensorBoard as well), and I return the loss. I do basically the same thing for the validation and test steps, so I just copy and paste the code; it's pretty much the same, except I log different metric names. Finally, we configure the optimizers. There are a lot of fancy ways to do this; you could use a learning-rate scheduler, as we did in the BERT example with PyTorch Lightning, but I found that in this case the standard AdamW optimizer (Adam with the weight decay fix) with a learning rate based on the value proposed in the original paper is doing just fine. I pass in the parameters of the model, and the learning rate is going to be 0.001. This should pretty much give you the model that you want. I'll create an instance of this model right now, and if everything is okay, this should load the model. Yes, we get a warning, but it basically says that this is okay; it's expected when you're initializing from a checkpoint, which is what we are actually doing.
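A condensed sketch of the module as described; the batch keys are assumed from the data module of the previous video, and AdamW here is the implementation shipped with Transformers at the time:

```python
import pytorch_lightning as pl
from transformers import AdamW, T5ForConditionalGeneration

class BioQAModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(
            "t5-base", return_dict=True
        )

    def forward(self, input_ids, attention_mask, labels=None):
        # Labels are optional; when given, the model also returns a loss.
        output = self.model(
            input_ids=input_ids, attention_mask=attention_mask, labels=labels
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        loss, _ = self(batch["input_ids"], batch["attention_mask"], batch["labels"])
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, _ = self(batch["input_ids"], batch["attention_mask"], batch["labels"])
        self.log("val_loss", loss, prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        loss, _ = self(batch["input_ids"], batch["attention_mask"], batch["labels"])
        self.log("test_loss", loss, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        # Plain AdamW with the learning rate mentioned above.
        return AdamW(self.parameters(), lr=0.001)

model = BioQAModel()
```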
Now that the model is ready, we can use the Trainer from PyTorch Lightning to fine-tune it on our dataset. To do that, I first create a checkpoint callback that makes sure we record the best model during training. I call the ModelCheckpoint constructor: save the checkpoints to the checkpoints directory under the name best-checkpoint, keep just the best one, output something whenever a model is saved, monitor the validation loss, and of course save the model with the minimum validation loss. After that, I create an instance of the Trainer, which makes sure our model gets trained. I pass in the checkpoint callback, a maximum number of epochs, which is six in our case, I specify that we have a single GPU, and a progress bar refresh rate of 30, because we are using the Google Colab environment here. If I run this, it says that we have a GPU, that the GPU is being used, and that we are not using a TPU. Next, I load TensorBoard and point it at the lightning_logs directory, which of course doesn't exist at the moment, because we haven't started training; but once you start training your model, the event files observed by TensorBoard become visible, and you'll be able to press the refresh button during training to look at some intermediate results. Next, I call trainer.fit and pass in the model and the data module. Once I do this, the training process should start, and hopefully, if we did everything correctly, we'll see a printout of the model summary and then the actual training process will begin.

The error says that something is wrong with the forward method, and if I go to it, you can see that I forgot to write self.model, so I run the model initialization one more time. After that completes, I go back and make sure we actually remove the training logs folder, sorry, the lightning_logs folder. Now that the directory is removed, I can start the training process again. And again I have a typo: it says that train_dataset doesn't exist; if I go to the data module, we have the train_dataset attribute, I believe, so if we rerun this and go back to the training step, this should finally, hopefully, run the training correctly. Let's wait and see what happens. Of course, the training process is going to take some time; it looks like it has already started, and I'll come back at some point during training to show you some of the intermediate results in TensorBoard. After that, we'll load the best checkpoint, the best model, and look at how we can use it to generate answers to some questions from our dataset.

As you can see, we are currently training the model; we've already done three epochs, and if I go to the TensorBoard dashboard and refresh, you can see that we already have some data for the first three epochs: the training loss and the validation loss. It is decreasing, at least by a bit; if you play around with the charts, you can see that the loss is currently going down little by little, which is pretty cool. You can also play around with the smoothing right here and get differently smoothed curves. I had to make some changes to get the TensorBoard logger to work: previously this worked by default, but this time I had to create a TensorBoardLogger, specify the lightning_logs directory that I want and a name for the model, and pass the logger to the Trainer. After that, everything seems to be working as expected.
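A sketch of the whole training setup; the argument names follow the PyTorch Lightning API of early 2021, and the experiment name "bio-qa" and the `data_module` variable are assumptions carried over from the previous video:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

# Record only the checkpoint with the lowest validation loss.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    filename="best-checkpoint",
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode="min",
)

# Explicit logger pointing TensorBoard at the lightning_logs directory.
logger = TensorBoardLogger("lightning_logs", name="bio-qa")

trainer = Trainer(
    logger=logger,
    checkpoint_callback=checkpoint_callback,
    max_epochs=6,
    gpus=1,
    progress_bar_refresh_rate=30,
)

trainer.fit(model, data_module)
```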
After the model is trained, we'll get back to how to generate answers to questions from the validation set.

All right, the training is now complete; it took about 25 to 30 minutes to train and save the best checkpoint. If we go back to TensorBoard and refresh the data once again, you can see the train loss right here, which appears to be decreasing, and the validation loss, which is not decreasing by that much; it appears that at epoch number 3, I believe, we had probably the lowest value. Thank goodness we saved only the best checkpoint: this is the checkpoint for epoch number three, or rather number two, because we start counting from epoch zero. If I go ahead and run the trainer.test method, we'll evaluate the current model, based on the last epoch, I believe. After that, I'll start making some predictions; to do that, I'm going to load the BioQAModel from checkpoints/best-checkpoint.ckpt and then freeze it. We'll do that after the test run is complete, and after the loading is done we'll take some questions from the validation set and have a look at some of the outputs of our model.

All right, the testing has started; we have around 130 batches of questions in it, I believe, and the test loss is this number right here. I mean, looking at that loss alone doesn't tell you much, of course, but we are going to look at some of the model outputs, and you can decide for yourself whether they're good enough. Of course, this is not very rigorous, and you might want to look at some other ways to evaluate your answers, but that is really out of the scope of this video. So now the best model is loaded and frozen; freezing basically turns off the gradient calculations, dropout and such, so the model is used just for predictions and the process is sped up.
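A sketch of those evaluation and loading steps, assuming the trainer from above and the checkpoint path configured in the callback:

```python
# Evaluate on the test set (uses the test dataloader from the data module).
trainer.test()

# Load the best checkpoint and freeze it: gradients are turned off and
# the model is put into eval mode, so it is used for inference only.
trained_model = BioQAModel.load_from_checkpoint("checkpoints/best-checkpoint.ckpt")
trained_model.freeze()
```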
Next, I'm going to write a function called generate_answer that takes a question from our dataset. In it, I use the tokenizer to encode, very similarly to what we did in our data module, the question itself and the context, with a max length of, again, 396. I want to pad the values to the max length, do truncation only on the second sequence (the context), return the attention mask, add the special tokens, and return tensors in the form of PyTorch tensors; this will be the source encoding. Then we call the model's generate method. I'm calling it on the model field of the BioQAModel, which might be a bit hard to follow, but if I show you the BioQAModel __init__ again, we had this model field, which is the pre-trained T5ForConditionalGeneration, now fine-tuned on our dataset. I pass in the input IDs and the attention mask right here, and I specify the number of beams equal to one, which amounts to greedy search: it will take the next word from the vocabulary with the highest probability while generating the text. I want a max length of 80, which is based on going through the dataset, and we are going to incur a repetition penalty; these are values that I probably found somewhere on the internet, and I think they work quite alright on this dataset. Of course, the generate method has a lot of parameters, and you can go through those in your own time; this should actually be very beneficial for you if you are solving a similar problem. Finally, I get the predictions by iterating over the generated IDs and calling tokenizer.decode on each one, skipping special tokens and cleaning up tokenization spaces, and then I join the resulting array into a nice string. This should pretty much be the generate_answer function that we are going to use right now; a sketch follows.
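A hedged sketch of generate_answer as described; `question` is assumed to be a row of the validation data frame with "question" and "context" fields, and the repetition-penalty value is illustrative, since the exact number isn't stated in the narration:

```python
def generate_answer(question):
    # Encode the question/context pair the same way as in the data module.
    source_encoding = tokenizer(
        question["question"],
        question["context"],
        max_length=396,
        padding="max_length",
        truncation="only_second",   # truncate the context, never the question
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt",
    )

    generated_ids = trained_model.model.generate(
        input_ids=source_encoding["input_ids"],
        attention_mask=source_encoding["attention_mask"],
        num_beams=1,             # greedy decoding
        max_length=80,
        repetition_penalty=2.5,  # illustrative value
    )

    preds = [
        tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for gen_id in generated_ids
    ]
    return "".join(preds)
```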
Let's take a simple question from the validation set, the first one, and look at the question text itself: what is the mode of inheritance of some syndrome, with an answer text of "autosomal dominant" or something like this. Let's generate an answer using our fine-tuned model for this simple question. I have an error right here, so this should be fixed, and if I call it again, it says exactly the same thing as the reference answer, which is kind of amazing, at least to me. If I haven't made an error somewhere, I'm pretty much blown away by the performance, but this might just be a quirk. Let's have a look at some other question. I'm going to go to the next one: here we are asking which enzyme is involved in the maintenance of DNA, something I'm not really familiar with, and we have another, I guess, domain-specific answer. If we generate the answer for this one, we again get an exact match of the answer, which is again great, but this might just be another fluke. Let's take one more question and generate its answer; and again, exactly the correct answer. I mean, how great is this? Of course, I might have some data leakage or something like that, but if that's not the case, and everything is working as expected, this is very, very interesting and actually very accurate, which is, I guess, great.

So this is the basic approach you can use to fine-tune T5 for your downstream task; in this case we did question answering. Even if those answers are not quite as correct as they might seem in this video, I believe T5 is mostly going to give you very good results, especially compared to the small amount of code you have to write when you leverage great libraries like the Hugging Face tokenizers and Transformers, and of course PyTorch and PyTorch Lightning. I think you can do pretty amazing stuff with these kinds of models. Of course, it might take a lot of work to pre-process your data, depending on the task you are solving, but you'd have to do that either way. And this is pretty much it! If you have any questions about T5, please leave them down in the comments. Please like, share and subscribe. Thanks for watching, guys, I'll see you in the next one. Bye!
Info
Channel: Venelin Valkov
Views: 9,525
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning, PyTorch, HuggingFace, Transformers, PytorchLightning, Python
Id: r6XY80Z9eSA
Length: 50min 20sec (3020 seconds)
Published: Mon Jan 11 2021