Create Custom Dataset for Question Answering with T5 using HuggingFace, Pytorch Lightning & PyTorch

Captions
Hey guys! In this series of videos we are going to fine-tune T5 for question answering. We'll dive a bit deeper into what T5 actually is in the next video, but in this one we are going to prepare some data that we'll be able to use to train our question answering model. We'll have a look at the data, define a dataset based on the PyTorch Dataset class, and then create a data module using PyTorch Lightning.

As a little preview for those who don't know yet: T5 is, of course, another transformer, but it differs from BERT and BERT-like models in that it takes text as input and outputs text as well, so it's basically a sequence-to-sequence model. T5 is very, very accurate and, of course, very, very big, so in the next video we'll see what you need to do to fine-tune it. In this one, we are going to have a look at the dataset provided by BioASQ, a large-scale biomedical semantic indexing and question answering challenge. The people behind this project provide a set of questions and answers, which we are going to use to prepare our dataset for question answering. As a little preview of what we'll do over the next couple of videos: here is a pre-trained model, and if we input a question like "Treatment of what disease was investigated...", with the expected answer being some type of acute stroke, you can see that the model we will fine-tune basically gets the correct answer based on the question and another piece of text that is called the context.

And for the first time in these videos we have a sponsor, and the sponsor is actually me. I want to introduce you to ML Expert, which will hopefully better prepare you for the machine learning interview. ML Expert is currently in development, but you can sign up for it right here on the web page, mlexpert.io. If you scroll down to "sign up now" you can input your email, and you will get a 50% discount, at least if you're among the first 100 people who subscribe, when ML Expert launches. I expect ML Expert to launch sometime during January next year, so this will be very soon, I hope. What will ML Expert do for you? It contains a lot of questions, both theoretical and practical, related to machine learning, and a lot of complete Jupyter notebook projects of the kind that get asked during the machine learning interview process. Sometimes people ask you: hey, can you solve this NLP problem for sentiment classification, or time series anomaly detection, or maybe detect some objects in some kind of images? The problems here are written from scratch and completely solve a set of tasks. They might not cover your particular problem or your particular question, but there will be enough explanation and code to adapt them to your particular dataset, hopefully. After that there will be a mock interview with an ML engineer, which is going to be actually me. So if you complete the problems and theoretical parts in ML Expert, you will unlock a mock interview, which we'll conduct via Zoom or some other software.
This will be a full-blown, realistic interview in which I'm going to ask you some questions and follow up on your answers, and you will be able to ask me some questions. Hopefully this will make you a bit more secure and more confident in your abilities when you actually go to the interview you're interested in. Another cool thing, I believe, is that it will contain a lot of practical programming questions for which you'll have to know the basics of Git, SQL, and a bit of style, how to write better code in Python or other languages. And of course I'm going to include a personal set of tips and tricks for the interview day itself. During the past year I've been working as a full-time machine learning engineer, and together with a colleague of mine I've been conducting a lot of interviews, actually. So I've been in the process, I know what type of questions we ask and what we are looking for, and I think this should be very, very helpful to you if you're interested in landing a machine learning job. So give ML Expert a look. Thank you!

All right, now that the plug is over, let's get started with creating our dataset using the BioASQ data that we have. I'm going to start by opening a brand new Jupyter notebook, and in here I'm going to change the runtime type to GPU. We want a GPU to train T5 in later videos, but for now I'm just going to connect and check the GPU that we have. Unfortunately, I currently can't connect to a GPU backend, but that's quite all right, at least for now.

Let's start by installing the transformers library by Hugging Face. Then we are going to install PyTorch Lightning, and I'll also need to install the tokenizers library, again by Hugging Face, and sentencepiece, which is basically a tokenizer from Google. I'm going to install all of it and specify the versions that we want: for tokenizers, the version we are going to use is 0.9.4, and for sentencepiece by Google it should be 0.1.94. Let's install all of it; this should not print a wall of information, since I'm using the quiet flag everywhere. After this is complete, I'm going to restart the runtime, just in case. And we still don't have a GPU; that's okay.

I'll continue by copying and pasting a lot of imports. The important stuff here, I guess, is that we are importing termcolor and textwrap, which we will use for printing the questions and answers from the data we are importing, and of course NumPy, Torch, pandas and Matplotlib. From transformers we are going to look at T5ForConditionalGeneration and the tokenizer, which is of course going to be the T5 tokenizer. We're not going to take a deeper look at the model itself, but we are going to try to understand what the tokenizer is actually doing in this video. Next, I'm going to call the method seed_everything from PyTorch Lightning. This method was actually pointed out to me by a subscriber; I found this knowledge in the comments of the videos I'm uploading to YouTube, so thank you for letting me know about it. It's pretty great, actually, because it seeds the Torch and NumPy random generators, and it also seeds other processes of your Python program, for example when you're using multiprocessing or something like that: when you're spawning new processes, all those processes are going to be seeded with the same value that we put in right here, which is just great.
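Here's a minimal sketch of the setup described above. The tokenizers and sentencepiece pins match the versions mentioned in the video; the transformers and pytorch-lightning pins, and the seed value, are assumptions on my part.

!pip install --quiet transformers==4.1.1        # version pin assumed
!pip install --quiet pytorch-lightning==1.1.2   # version pin assumed
!pip install --quiet tokenizers==0.9.4
!pip install --quiet sentencepiece==0.1.94

import json
import textwrap
from pathlib import Path

import numpy as np
import pandas as pd
import torch
import pytorch_lightning as pl
from termcolor import colored
from torch.utils.data import Dataset, DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Seeds Python, NumPy and Torch RNGs (and spawned worker processes) with one value.
pl.seed_everything(42)  # seed value assumed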
For the next step, we are going to get the data. The data, as I've already mentioned, is from the BioASQ challenge, but I actually got the zip file from the BioBERT model. If you go to GitHub, you can see that there is already a BERT model that is pre-trained for biomedical language representations, and there you might have a look at the datasets, where you'll find the question answering dataset. This will lead you to a Drive link containing the QA zip file, and in there you have various datasets; the one we are going to use is BioASQ, specifically the train files, which are not shown here, but I'm going to show you those right now. So this is where I got the data. I basically put it on my Google Drive, and we can download it using its id and gdown. After it's downloaded, we unzip it, but I want a quiet unzip, and the BioASQ zip will be extracted right here. You can see that the data is in this folder, which is very similar to what I've shown you, and the training data contains BioASQ with three different JSON files, which we'll use to create our question answering dataset.

All right, let's have a look at some of the files. I'm going to open the folder, take a sample file, let's say this one, open it and load the JSON from it. The data is now a dictionary that contains "data" and "version". If I look at the version, you can see the version code for this dataset, and if I go to the data and check it, you see that this is actually a list, and if we look at its length, we have just a single element. So I'm going to explore the keys of this single element, and you can see that we have "paragraphs" and "title". Of course, I've been through this file before, so I know basically what's happening here: we have just a single title, which matches the version, at least in this example of the data. The paragraphs, of course, contain more data, and if we check their length, we have a lot of paragraphs. I'm going to assign them to a variable called questions, which of course isn't "paragraphs", but you'll see why in a second. If I take, for example, the first element, you can see that we have a context, which is a wall of text, then we have "qas", so we have some answers with an answer start index and a text, then an id of the question, which we are not really interested in, and next we have the question itself: "What is the inheritance pattern of ... syndrome?", our first biomedical question. So it looks like we have all of this in a single paragraph, a single JSON object, and the JSON object has been converted to a dictionary using json.load. This is very good, at least for now.
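A sketch of the download and exploration steps, assuming the folder layout of the BioBERT QA archive; the Drive file id is left as a placeholder, and the exact archive and file names are assumptions.

!gdown --id <drive-file-id>   # placeholder - use the id from the video
!unzip -q BioASQ.zip          # archive name assumed

with Path("BioASQ/BioASQ-train-factoid-4b.json").open() as f:  # file name assumed
    data = json.load(f)

data.keys()                    # dict_keys(['data', 'version'])
questions = data["data"][0]["paragraphs"]
questions[0]["context"]        # a wall of biomedical text
questions[0]["qas"]            # question text, id, and answers (text + answer_start)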
Next, I want to extract that information and create a pandas DataFrame from it. For that purpose, I'm going to build a function, which we'll call extract_questions_and_answers, and it will get a path to a JSON file similar to this one. I open the file just like we did before and take the questions from it, another copy and paste; then I'm going to convert all of this into a set of data rows. For each question I take the context, which will be a string like the one we saw, and then I iterate over the questions and answers; you can see clearly that we have an array of those. For each question and answer, I take the question text and the answers contained within it. Then, for each answer, we take the text and the start of the answer within the context, and I create another variable, the answer end, by summing the answer start with the length of the answer text; this gives us the end index of the answer. I append all of this to the data rows. I'm doing all of this to show you what the dataset actually contains; I haven't really explained what the context, the question and the answers are, but we'll see that in a second. Finally, now that we have all the data rows, we convert them into a DataFrame.

I'll extract the questions and answers from, let's say, the file we had here and have a look at the head of the resulting DataFrame. You can see that we have a question (the questions sometimes appear to be duplicated, I guess), the context, the answer text, the answer start and the answer end. One interesting thing about this type of dataset is that the answer is contained within the context. So the job of the model is going to be: given a question and given a context, give me the characters within the context that start at some position and end at another position. Extract the text from the context, and this text should be the answer. The model receives the question and the context, and the result is a subset of the context.

Okay, now we'll do the same thing for all the files we have, to train our bio T5, I guess. To do that, I'm going to take all the paths to the factoid files: I search the BioASQ folder for all the BioASQ train files, convert the result to a list and sort it. This returns those paths, and you can see that we have three files. I call the extract_questions_and_answers function for each file, which gives us an array of DataFrames; so let's iterate over the factoid paths and append the resulting DataFrames, and after this is complete we concatenate all of them. This gives us the final DataFrame, the dataset on which we'll train our model.

Let's check the number of questions and answers that we have. Of course, since some of the answers can be found multiple times in a single context, the number of unique questions is a bit lower, I believe, but let's actually check. Well, it appears that we have only 443 different questions, so it looks like at least some of the questions are repeated multiple times. Let's do the same thing for the answers... actually, let's check the contexts: how many different unique contexts do we have? This might give us a better number: we have around 2.5k examples of question and answer pairs. Some of the questions might be repeated, but that's all right, given that the contexts are different.
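Following the steps just described, here's a sketch of the extraction function and the concatenation over the three factoid files; the glob pattern and folder name are assumptions.

def extract_questions_and_answers(factoid_path: Path) -> pd.DataFrame:
    with factoid_path.open() as json_file:
        data = json.load(json_file)
    questions = data["data"][0]["paragraphs"]
    data_rows = []
    for question in questions:
        context = question["context"]
        for question_and_answers in question["qas"]:
            question_text = question_and_answers["question"]
            for answer in question_and_answers["answers"]:
                answer_text = answer["text"]
                answer_start = answer["answer_start"]
                # End index = start index + answer length.
                answer_end = answer_start + len(answer_text)
                data_rows.append({
                    "question": question_text,
                    "context": context,
                    "answer_text": answer_text,
                    "answer_start": answer_start,
                    "answer_end": answer_end,
                })
    return pd.DataFrame(data_rows)

factoid_paths = sorted(Path("BioASQ").glob("BioASQ-train-*"))  # pattern assumed
df = pd.concat(
    [extract_questions_and_answers(p) for p in factoid_paths]
).reset_index(drop=True)
df.question.nunique(), df.context.nunique()  # 443 unique questions, ~2.5k contexts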
This looks to me like a much more representative count of how many distinct examples of question and answer pairs we actually have from this biomedical dataset, which is quite good, I believe, because T5 is a very, very large model, and you'll see in the next video that this will be more than enough to train a very reasonable model.

Next, let's check a sample question and try to visualize the data that we have. I'll take this one because its context contains a rather small amount of text. So we have "What is the synonym of ..." some disease, we have a context, which is a large text, and then we have the answer, which is some sort of medical term that I'm not familiar with, and I'm not going to pretend that I am. Next, I want to show you what the data looks like using some text coloring: I'm going to try to color the text within the context, so we'll see what is really happening in the context text and where the answer is within that same context, using the data that we have. You can see that we have the answer start and the answer end, so let's build a color_answer function. This will take a question; we extract the answer start, the answer end and the context from it, and I'll use a function called colored, imported from the termcolor package. This function accepts text as a first parameter and the name of the color that you want (I believe you can also pass in a hex color). First, we color the text up to the answer's starting point, so all of the characters until the answer start, and I want to paint this in white. Next, we paint the actual answer itself: I start at the answer start and continue until the answer end, adding one more character because the indexing is exclusive, and I want to paint this in green. Finally, I color the rest of the text in white. If I print the sample question itself and then print the colored answer, it looks like this did the job: here is the answer within the context, and you can see that it is really green, and at least this works in Google Colab.

I'm not really happy with the result, though, because you have to scroll horizontally. To make this a bit better looking, we are going to wrap the text itself. I'll use a width of 120 characters; wrapping converts the text into an array in which each element contains at most 120 characters, and we print each element on a new line. Let me just try to reduce this to 100. All right, here we have the text wrapped to 100 characters, and this shows you a question along with the context, and the answer is printed in green. You can see that in this particular example the answer can be found two times within the context. We are going to give both of those examples to our model, but we won't really be doing any sort of fine adjustment for multiple answers within the same context.
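A sketch of the coloring helper, assuming answer_end = answer_start + len(answer_text) as computed above, so the slice below needs no extra character; the row index used for the sample question is illustrative.

def color_answer(question_row) -> str:
    answer_start = question_row["answer_start"]
    answer_end = question_row["answer_end"]
    context = question_row["context"]
    return (
        colored(context[:answer_start], "white")
        + colored(context[answer_start:answer_end], "green")
        + colored(context[answer_end:], "white")
    )

sample_question = df.iloc[240]  # illustrative index
print(sample_question["question"])
# Note: the ANSI color codes count toward the wrap width, so lines may come
# out slightly short; this mirrors the approach taken in the video.
for line in textwrap.wrap(color_answer(sample_question), width=100):
    print(line)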
Okay, next we are going to do some tokenization. T5 comes with a standard tokenizer, which is pre-trained, and we are going to use the base T5 model: I call T5Tokenizer.from_pretrained and pass in the model name right here, and this will go and download the tokenizer. Now we can give our tokenizer a try. We call it just like you'd call a function, and we input a sample question, which is going to be "Would I rather be feared or loved?". This is kind of an existential question, I believe; of course, some of you might know where this is coming from. And a somewhat perfect answer to this one is: "Easy, both. I want people to be afraid of how much they love me." So this is a perfect answer to, I guess, a perfect question. If we call this and assign the result to a sample_encoding variable, you can see that it is a dictionary that contains input_ids and an attention_mask, which is very, very similar to what the BERT tokenizers are doing. From here, we can look at the input_ids of the sample encoding, which are rather ugly, I believe: we have an encoding, an input id, for each token. We can do the same thing for the attention mask. Of course, we are not doing any padding here: each token has been encoded, and there is no padding because we haven't specified a maximum length or a truncation strategy, etc.

Another cool thing you can do here is use the tokenizer to decode the input ids. To do that, I go over each input id in the sample encoding's input_ids and call the tokenizer's decode method on it, skipping the special tokens and cleaning up the tokenization spaces. This returns an array, and if we join it we get something like "Would I rather be feared or loved?" (with an extra space right here, I believe) "Easy, both...". You can see that there is a special token in there, the separator token; you get this because we are passing in the question and answer pair, so the tokenizer knows that when you pass two strings you are referring to a sequence-to-sequence task such as question answering, I believe, and it gives you the correct encoding. And if you don't pass in those two parameters, you get basically the same thing, I believe, so they might really be optional.

All right, now that you know what the tokenizer is doing, we can encode a sample question and have a closer look at the tokens themselves. To do that, I call the tokenizer and pass in the sample question we got earlier; we encode the question and the context, I pass in a maximum length of 396, and I want to pad all the question and context pairs to the maximum length, so each example is going to contain that many tokens, or input ids. For the truncation, I want to truncate only the context, because if we truncated the question it would be pretty bad, I believe. I want it to return the attention mask, which I believe is true by default, I want it to add the special tokens, which again I believe is true by default, and I want this to return tensors. Now that this encoding is done for us, if I fix this error here, we can check that the keys are again input_ids and attention_mask.
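A sketch of the tokenizer calls described above; the exact model name is an assumption (the video just calls it the base T5 model).

MODEL_NAME = "t5-base"  # model name assumed
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

sample_encoding = tokenizer(
    "Would I rather be feared or loved?",
    "Easy, both. I want people to be afraid of how much they love me.",
)
sample_encoding.keys()  # dict_keys(['input_ids', 'attention_mask'])

# Decode the input ids one by one, then join them back together.
tokens = [
    tokenizer.decode(input_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for input_id in sample_encoding["input_ids"]
]
" ".join(tokens)

encoding = tokenizer(
    sample_question["question"],
    sample_question["context"],
    max_length=396,
    padding="max_length",
    truncation="only_second",  # truncate the context, never the question
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt",
)
encoding.keys()  # dict_keys(['input_ids', 'attention_mask'])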
Another interesting thing you can do right here is call the tokenizer's special_tokens_map: you can see that we have a pad token, an unknown token, and the end-of-sequence token, so you get what those tokens represent. And of course you can call eos_token to get the character representation, and you can get the id of this token. So everywhere you see an input id of 1, you basically have the EOS, or end-of-sequence, token; you know that this is the place where the first sequence, the question, is ending, and from that point on all of the input ids correspond to the context. Continuing with this, we can decode the encoding using the tokenizer; I want to squeeze the dimension, and you can see right here that the special tokens are added and then we have a lot of padding at the end. This is the result of the encoding decoded back into the text representation of the input ids.

All right, another thing that you should know about is that we also need to prepare the labels, the answers. To do that, I create a sample answer encoding: we use our tokenizer, just like for the question and context pair, but this time on the answer text. Here I specify a maximum length of 32, the padding is again going to be to the max length, and for truncation I'll just say true, so we're using the standard strategy of truncating up to the maximum length. I again want the special tokens. If we decode the answer encoding, you can see that we have pretty much the same thing: the answer, decoded, along with the end-of-sequence token, and then a lot of padding tokens. So this basically gives us a way to encode the answer.

Another interesting thing about this T5 model is that when we pass in the labels, which are basically the input ids of the answers, we need to convert the label positions that should be ignored, the masked ones, to -100; when computing the loss for the labels, those values will be excluded from the computation. To do that, I get the labels, which will again come from the answer encoding, and if we check them, you see that we have a lot of zeros right here. We need to convert those zeros to -100, and I do that using this comparison; now the labels contain -100 where previously there were zeros. We'll use this trick when we're building our dataset.
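A sketch of the answer encoding and the -100 masking trick just described:

answer_encoding = tokenizer(
    sample_question["answer_text"],
    max_length=32,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt",
)

labels = answer_encoding["input_ids"]
# T5's pad token id is 0; label positions set to -100 are ignored by the loss.
labels[labels == 0] = -100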
Now that we know how the tokenizer does its job and have a way to encode both the question and context pair and the answer, we are going to build a dataset that extends the PyTorch Dataset. In it, we build a constructor that accepts the data for the dataset, a tokenizer, which is going to be the T5 tokenizer, a maximum token length for the source text, which is the pair of question and context, and a max token length for the target, which is the answer. I assign the tokenizer to a tokenizer field; next we have the data, and I do the same thing for the source max token length and the target max token length. Next, we override the length method and return the number of rows in our DataFrame. Finally, we need to override the getitem method. Here I take the data row at the index position and encode the source, which is the pair of question and context; I take the encoding from before and adjust it just a little bit, so it takes the question and context and the source max length. I do the same thing for the answer, the target encoding, with the target max length, and I believe that's it for the target encoding. Finally, we need to get the labels from the target encoding, which again are just the input ids from the tokenizer, and I do the same thing we did right here: replace all the zeros with -100. The result of the getitem method is going to be a dictionary that contains the question as plain text, then the context, again as plain text, then the answer text, and next, from the encodings of our tokenizer, the input ids, which I flatten because the data loader will take care of the batching for us, the attention mask, again flattened, and finally the labels, which we flatten as well.

Now that we have this dataset, let's try it out: I create a sample dataset from our QA dataset class, passing in the DataFrame for the data and the T5 tokenizer. If I iterate over the sample dataset, break after the first item and print the data, you can see that we have a lot of data right here; let's look at the question and, let's say, the first 10 values of the input ids and the first 10 values of the labels. Okay, so we have the question, the answer text and some of the input ids from the tokenizer, so this appears to be working quite well for us. Next, we split the dataset into train and validation sets, and this is already seeded for us thanks to the PyTorch Lightning library, so we have a training DataFrame and a validation DataFrame with those shapes. We have a lot of examples for training.
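Putting the pieces together, a sketch of the dataset class and the split; the class name and the split size are assumptions based on the walkthrough.

from sklearn.model_selection import train_test_split

class BioQADataset(Dataset):
    def __init__(self, data, tokenizer, source_max_token_len=396, target_max_token_len=32):
        self.tokenizer = tokenizer
        self.data = data
        self.source_max_token_len = source_max_token_len
        self.target_max_token_len = target_max_token_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data_row = self.data.iloc[index]
        source_encoding = self.tokenizer(
            data_row["question"],
            data_row["context"],
            max_length=self.source_max_token_len,
            padding="max_length",
            truncation="only_second",
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        target_encoding = self.tokenizer(
            data_row["answer_text"],
            max_length=self.target_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        labels = target_encoding["input_ids"]
        labels[labels == 0] = -100  # mask padding positions in the loss
        return dict(
            question=data_row["question"],
            context=data_row["context"],
            answer_text=data_row["answer_text"],
            input_ids=source_encoding["input_ids"].flatten(),  # DataLoader batches these
            attention_mask=source_encoding["attention_mask"].flatten(),
            labels=labels.flatten(),
        )

sample_dataset = BioQADataset(df, tokenizer)
train_df, val_df = train_test_split(df, test_size=0.05)  # split size assumed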
The final part of this video is to build a Lightning data module. We extend pl.LightningDataModule, and basically here we create the data loaders for the training and validation or test sets. I take a lot of parameters: the training DataFrame and the test DataFrame, which are pandas DataFrames, the tokenizer, the T5 tokenizer, then the batch size, which we'll default to 8 (a bit larger than we can actually train with on Google Colab), and then I pass in the exact same parameters as before, because we need those to instantiate the datasets. The first thing I do here is call the super init method, then I store the batch size, the train DataFrame and the test DataFrame, and I take the field initialization from before and paste it in right here, to speed up the process a bit. All right, so we have most of the parameters that we need. Next, I build a method called setup, in which we construct the training dataset and the test dataset; we use our QA dataset class, and I pass in the training DataFrame, the tokenizer, the source max token length and the target max token length, and I do the same thing for the test dataset.

Now that both of those datasets are stored as fields, or properties, of our data module, I create a train dataloader method, which is actually overriding a PyTorch Lightning method; it basically returns the train dataset wrapped in a data loader with the batch size that we've passed in as a parameter. I want to shuffle it, just in case, not really necessary, I believe, and the number of workers is going to be four; I've found that this works quite well with the Google Colab machines that I'm getting. Then I create the validation data loader, which is another data loader like the previous one; I pass in the test dataset, and I fix the batch size to one to emulate real-world performance, because you're mostly predicting on single examples in production. That might not be the case for you, actually; let me know in the comments below. For the test data loader, I'm just going to copy and paste the validation one. You might want to create three different datasets to evaluate the performance a bit better, but that's not what I'm after here; I just want to fine-tune this model and see how well it does on a question answering dataset, and you'll see how it does in the next video. Next, I specify the batch size constant right here and the number of epochs for which we'll train the model in the next video. Then I instantiate the data module, just to be sure that everything is working properly, passing in the train DataFrame, the validation DataFrame, the tokenizer, and the batch size, which is the batch size constant right here. I call the setup method, and it appears that everything passes, at least for now; we'll see in the next video if we got anything wrong.

In the next video, I'm going to show you how you can fine-tune the T5 model. We'll build a Lightning module based on the T5ForConditionalGeneration model that is available from Hugging Face. I'll explain a bit more about why T5 is so powerful and what the different types of T5 models are. Of course, we're using one of the smallest available models, and even though it is the small, or rather base, T5, it's actually really huge even compared to BERT base and other models; it's a rather large model.

Thanks, guys, for watching. I hope that you've enjoyed this video, that you have very good holidays, and that you're going into the next year with a little bit of hope that everything that is happening in the world right now will end soon. I wish you all the luck in the world, and I wish you to be healthy. Once again, I'm going to plug mlexpert.io, which I'll put in the description below, if you want to prepare for the machine learning interview; I believe that this resource, which is again created by me, is going to be one of the most helpful tools in your toolbox to land the machine learning job that you want. I'll see you in the next one. Bye bye, guys!
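To recap the pipeline built in this video, here's a minimal sketch of the Lightning data module walked through above; the class and constant names, the epoch count, and the default lengths are assumptions based on the walkthrough.

class BioQADataModule(pl.LightningDataModule):
    def __init__(self, train_df, test_df, tokenizer, batch_size=8,
                 source_max_token_len=396, target_max_token_len=32):
        super().__init__()
        self.batch_size = batch_size
        self.train_df = train_df
        self.test_df = test_df
        self.tokenizer = tokenizer
        self.source_max_token_len = source_max_token_len
        self.target_max_token_len = target_max_token_len

    def setup(self, stage=None):
        self.train_dataset = BioQADataset(
            self.train_df, self.tokenizer,
            self.source_max_token_len, self.target_max_token_len,
        )
        self.test_dataset = BioQADataset(
            self.test_df, self.tokenizer,
            self.source_max_token_len, self.target_max_token_len,
        )

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size,
                          shuffle=True, num_workers=4)

    def val_dataloader(self):
        # Batch size of 1 to emulate single-example prediction in production.
        return DataLoader(self.test_dataset, batch_size=1, num_workers=4)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=1, num_workers=4)

BATCH_SIZE = 8
N_EPOCHS = 6  # epoch count assumed; set for training in the next video

data_module = BioQADataModule(train_df, val_df, tokenizer, batch_size=BATCH_SIZE)
data_module.setup()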
Info
Channel: Venelin Valkov
Views: 5,145
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning, Pytorch-Lightning, HuggingFace, Transformers, Python, PyTorch
Id: _l2wJb3QPdk
Length: 55min 48sec (3348 seconds)
Published: Sat Jan 02 2021