4 - Summarization Fine Tuning BART | GPT2 T5 PEGASUS using HuggingFace | NLP Hugging Face Project

Video Statistics and Information

Captions
Hi everyone, welcome to NLP with Hugging Face tutorial 4. In this lesson we are going to learn about summarization. Summarization in natural language processing is the automatic generation of a concise, coherent summary from a longer text. There are many use cases. Suppose you have a long legal document, say 100 pages of legal text, and you want to condense it into one or two pages just to understand what those pages are about. Or you may want to summarize a large news article; in fact, many news sites nowadays print a short summary right below the headline. There is even a startup that built a successful product on exactly this idea, Inshorts, which serves short versions of news stories that users can read in 20 to 30 seconds.

There are two types of summarization: extractive and abstractive. Extractive summarization picks sentences out of the source text and stitches them together as-is to form the summary. Abstractive summarization does not copy sentences verbatim; it takes the key points and writes new sentences that convey the exact context of the source text, so you will not find the original sentences copied into the summary.

For the coding we are going to select a GPU runtime, although for now we are starting with the basics and you could begin on CPU as well, and we toggle the header visibility to get more space for our code. We will be doing abstractive summarization using large language models, specifically Transformer models available on Hugging Face, which we will download. But first we need to install the necessary libraries and Python packages: transformers, accelerate, datasets, bertviz for visualization, umap-learn (used for fast dimensionality reduction), and sentencepiece (used for sentence tokenization). We also upgrade urllib3, since we need a recent version to download data from the internet, and install py7zr, which handles 7-zip archive compression; at the end of the tutorial we will use it to compress our model for download. You can simply copy this cell and run it: as soon as you do, the notebook connects to the GPU and installs all these packages.
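A rough sketch of that Colab install cell; the exact package list is reconstructed from the audio (in particular, "bertviz" is my best guess for the visualization package), so adjust as needed:

    !pip install transformers accelerate datasets bertviz umap-learn sentencepiece
    !pip install --upgrade urllib3   # recent urllib3 needed to download data from the web
    !pip install py7zr               # 7-zip compression, used at the end to package the model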
The installation is done; now let's clear these logs. Here we will be using the CNN/DailyMail dataset, which consists of about 300,000 pairs of news articles and their corresponding summaries. There are models on the Hugging Face Hub that are already fine-tuned on CNN/DailyMail, but to make sure you properly learn how to fine-tune Hugging Face models, how to fine-tune Transformers, we will go through this exercise ourselves. If you open Hugging Face and search for "cnn_dailymail" under Datasets, you will find it. The dataset has three columns: article, highlights, and id; we can safely ignore the id. If you click on an article you will see a very long text that gets summarized in the highlights with just a few sentences. For example, one article's highlights read: mentally ill inmates in Miami are housed on the "forgotten floor"; Judge Steven Leifman says most are there as a result of avoidable felonies; while CNN tours the facility, a patient shouts "I am the son of the president"; Leifman says the system is unjust and he is fighting for change. So that whole article is about a jail where people with mental illness are held. Similarly, another article about Baghdad, Iraq describes how violence and the increased cost of living drive women into prostitution, showing how people are suffering through the violence there.

You can download this dataset with the datasets package. I write: from datasets import load_dataset, and then dataset = load_dataset for "cnn_dailymail" with the version we want. CNN/DailyMail has several versions, which you can check under Files and versions; I want to download only version 3.0.0, so I pass version "3.0.0". The dataset card lists three versions and says that versions 2.0.0 and 3.0.0 can both be used to train models for abstractive and extractive summarization, while version 1.0.0 was developed for machine reading and comprehension and abstractive question answering only. We will use the latest version.
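That loading step, sketched; recent datasets releases take the version as the config name (the second argument), while the video's older release accepted version="3.0.0" as a keyword:

    from datasets import load_dataset

    # Download version 3.0.0 of the CNN/DailyMail dataset
    dataset = load_dataset("cnn_dailymail", "3.0.0")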
It is downloading the data now, which may take a while: version 3.0.0 has about 312k rows in total, split into a train split of 287k rows, a validation split of 13.4k, and a test split of 11.5k, and after downloading it also generates those train, validation, and test splits.

The data is downloaded, so let's check it. Printing the dataset shows the three splits we discussed: train with about 287,000 rows, validation with about 13,000 rows, and test with about 11,000 rows. There are three columns: article, highlights, and id. The id column we do not need; article is the full news story and highlights is the summarized news. And do remember, this is an abstractive summary: the highlights are not extracted from the article but are newly generated text, so you will not find an exact match for the summary sentences in the article. Whenever the summary sentences match the source exactly, that is extractive summarization; when you only get the context, without exact word matching, it is abstractive summarization.

Now let's look more closely at the train split. Taking the first row, you can see it has article, highlights, and id fields. If I display the first 300 characters of the article you get a preview of the full text, and printing the highlights field (you can remove the 300-character slice there) shows the full abstractive summary of this article.
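Those inspection steps, as a minimal sketch:

    print(dataset)  # DatasetDict with train / validation / test splits

    sample = dataset["train"][0]      # first row: article, highlights, id
    print(sample["article"][:300])    # first 300 characters of the article
    print(sample["highlights"])       # the abstractive reference summary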
Now our task is to generate summaries with several models. We will use GPT-2, T5, BART, and Pegasus: GPT-2 comes from OpenAI, T5 from Google, BART from Facebook, and Pegasus also comes from Google, the originator of T5. We start with GPT-2. GPT-2 is a decoder-only language model and was not specifically built for summarization, but we will still use it for summarization here. Note that we are going to use all four models as-is, without any kind of fine-tuning, using the base checkpoints, and then compare the generated summaries so we can see which model performs best out of the box.

I write: from transformers import pipeline, set_seed. You do not strictly need set_seed, but setting a seed means you get the same output every time you run the pipeline; if you do not set it, the output may differ between runs. You can look the pipeline up in the documentation: a pipeline is like a container that bundles the necessary tokenizer and model for a task, here text generation, so you can use it directly without writing very lengthy code. So I create pipe = pipeline with the "text-generation" task and model equal to gpt2-medium.

Let's search for this gpt2-medium model on Hugging Face to learn more about it. It has 355 million parameters, is a Transformer-based architecture, and was released by OpenAI. The GPT-2 family includes the base gpt2 model, gpt2-medium, gpt2-large, and gpt2-xl. There are GPU memory limitations, so we cannot use a very large model or our GPU will run out of memory; we have to stick with the medium-size GPT-2. Let's run this: the first time you run it, it downloads the model and stores it in the cache, so we have to wait for some time.
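That cell, sketched (the seed value itself is arbitrary):

    from transformers import pipeline, set_seed

    set_seed(42)  # optional: makes repeated runs produce the same sample
    pipe = pipeline("text-generation", model="gpt2-medium")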
The model is downloaded; now we are ready to use this pipe for text summarization. First we need input text, which we already have: the first article of the train split, the one we have already seen. If you read the GPT-2 paper, it describes that if you append "TL;DR" at the end of the input text, the model will automatically attempt a summary of the given text. That means we have to build a query: I write input_text for the article, and then query equals input_text plus a newline plus "TL;DR:". Appending this at the end of a GPT-2 query tells GPT-2 to generate a summary for it. Then I write pipe_out = pipe(query) with max_length=512, which bounds the total generated sequence, and one more input parameter, clean_up_tokenization_spaces=True, which cleans up unnecessary spaces in the decoded text.

We simply run it, and a summary is generated with gpt2-medium and placed in pipe_out. Printing pipe_out shows a generated_text field, and if you look at it, it does not contain only the summary: the original input and the generated summary are combined together, because a text-generation pipeline returns the prompt along with the continuation. So to read only the summary, we trim off the first len(query) characters of pipe_out[0]["generated_text"] and print what remains.

At first this fails, and the problem comes from max_length: our input article is quite long, so the prompt alone exceeds the generation budget and the model cannot generate anything. To fix it, I take only the first 2,000 characters of the article, assuming the total token count then stays under 512. For any Transformer model, the total length must stay below its maximum acceptable length; the older Transformers generally come with a 512-token limit, while newer ones can handle around 4,096 tokens.

Now, printing the trimmed output, the generated summary says that "there are actually a lot of mentally ill inmates in this facility", and then, separated by a few newline characters, that "there is no medical facility for the mentally ill in this jail unless you count the solitary confinement cells, which are so small that inmates are forced to sleep outside in the cold with no lights". So GPT-2 is talking about a jail where the people held are mostly mentally ill. We will generate summaries with the other models too and compare them at the end, so I first create a dictionary called summaries, using as the key the model name with its parameter count: gpt2-medium with its 355 million parameters.
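Putting that together as a sketch (the dictionary key is just a comparison label; the 2,000-character cut keeps the prompt within the generation budget):

    input_text = dataset["train"][0]["article"][:2000]
    query = input_text + "\nTL;DR:\n"   # the TL;DR suffix nudges GPT-2 to summarize

    pipe_out = pipe(query, max_length=512, clean_up_tokenization_spaces=True)

    # text-generation returns prompt + continuation, so strip the prompt
    summary = pipe_out[0]["generated_text"][len(query):]
    print(summary)

    summaries = {}
    summaries["gpt2-medium-355M"] = summary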
I store the GPT-2 summary in that dictionary and run the cell. So GPT-2 we have used successfully; let's try the T5 Transformer next. If you search for T5 on the Hub you will see t5-base and t5-small; I will use the t5-base model to generate the summary. T5 is a versatile model that uses both the encoder and decoder modules of the Transformer; it was developed by Google, and you can read the detailed documentation if you want to know more about it. Creating the pipeline is very simple: pipe = pipeline with the task "summarization" and model equal to t5-base, then run it; the first run downloads the model and stores it in the cache. The t5-base model has around 223 million parameters, so compared with GPT-2 it is the smaller model. Then pipe_out = pipe(input_text), the same input text we used for the GPT-2 model.

Checking pipe_out, this time it has a summary_text field instead of generated_text, and we no longer need to trim by the query length; we can read the output directly. Do you see the difference between the GPT-2 output and the T5 output? GPT-2 returned the query along with the abstractive summary, while T5 generates the summary directly. This happens because GPT-2 was not trained specifically to generate summaries, but T5 was trained on summarization among its tasks, which is why we get proper output here. It says: mentally ill inmates housed on forgotten floor of pre-trial detention facility in Florida, and so on; it seems to be talking about the same thing we saw earlier. I store it in the summaries dictionary under t5-base with 223 million parameters.
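The T5 cell, sketched:

    pipe = pipeline("summarization", model="t5-base")
    pipe_out = pipe(input_text)

    # summarization pipelines return the summary directly, no prompt echo
    print(pipe_out[0]["summary_text"])
    summaries["t5-base-223M"] = pipe_out[0]["summary_text"]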
Now let's use the BART model. Searching for BART, we have facebook/bart-large and facebook/bart-large-cnn: bart-large is without any fine-tuning on the CNN data, while bart-large-cnn is fine-tuned on the CNN/DailyMail dataset. Either would work here; the only difference is that the fine-tuned model will give much better accuracy than we could get with the others, because bart-large-cnn was fine-tuned on exactly the data we have already downloaded. It is fairly obvious that when we use a model already fine-tuned on the desired task and run it on similar data, we get better accuracy, so let's try that. BART was developed by Facebook, and bart-large has around 400 million parameters, a little more than gpt2-medium and almost double t5-base. So pipe = pipeline with "summarization" and model equal to facebook/bart-large-cnn, and then pipe_out = pipe(input_text). You can clearly see it downloading 1.63 GB; this is the largest model we have used so far, so it takes some time. The summary reads: mentally ill inmates are housed on the "forgotten floor" of a Miami-Dade jail; most often they face drug charges or charges of assaulting an officer. This generated summary is much clearer than what we saw earlier. I store it in the summaries dictionary as bart-large-cnn with 400 million parameters, taking pipe_out[0]["summary_text"].

Before comparing the summaries, let's finish with the Pegasus model. Pegasus was developed by Google; we will use google/pegasus-cnn_dailymail, which, like the BART model we just used, is fine-tuned on the CNN/DailyMail data. I copy the model name from the Hub and use the same technique: pipe = pipeline with "summarization" and that model name. This model has around 568 million parameters, which is huge, so it takes a lot of time to download: earlier we saw 1.63 GB, now it is 2.28 GB. Then pipe_out = pipe(input_text), and again the output will carry a summary_text field. Once the model is downloaded and loaded, it runs and computes the summary.
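Both cells, sketched together:

    # BART-large fine-tuned on CNN/DailyMail
    pipe = pipeline("summarization", model="facebook/bart-large-cnn")
    pipe_out = pipe(input_text)
    summaries["bart-large-cnn-400M"] = pipe_out[0]["summary_text"]

    # Pegasus fine-tuned on CNN/DailyMail
    pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
    pipe_out = pipe(input_text)
    summaries["pegasus-cnn-568M"] = pipe_out[0]["summary_text"]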
The summary is generated, and you can see the output: mentally ill inmates in Miami are housed on the "forgotten floor"; the ninth floor is where they are held until they are ready to appear in court; most often they face drug charges or charges of assaulting an officer; they end up on the ninth floor severely mentally disturbed. I store this in the summaries dictionary as pegasus-cnn with 568 million parameters. We have now used four models to generate summaries, so let's put them all together: for each model in summaries, print the model name in uppercase, then print its stored summary, then an empty line. Let's run this; it takes a moment to print the whole thing.
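The comparison loop, sketched:

    for model_name in summaries:
        print(model_name.upper())
        print(summaries[model_name])
        print()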
So we have the gpt2-medium output: "there are actually a lot of mentally ill inmates in this facility... there is no medical facility for the mentally ill in this jail unless you count the solitary confinement... the inmates in this county jail had been moved..." and so on, with something clearly off. The t5-base model generated "mentally ill inmates housed on forgotten floor", etc. What do we see? The GPT-2 summary is not very accurate, and the reason is that GPT-2 hallucinates: it invents facts on its own, since it was not explicitly trained to generate summaries, so you cannot expect GPT-2 to produce a faithful summary. Then we have bart-large-cnn: "mentally ill inmates are housed on the 'forgotten floor' of a Miami-Dade jail; most often they face drug charges", etc. The t5-base and bart-large outputs are of roughly similar quality. Finally the Pegasus CNN model: "mentally ill inmates in Miami are housed on the 'forgotten floor'; the ninth floor is where they are held until they are ready to appear in court". Of these four, the Pegasus model is the most accurate, so whenever you want to generate summaries I would suggest using Pegasus, because it is clearly more accurate than the other three here.

Now it is time to fine-tune our model. Suppose you have your own data, for example customer-service conversations. Say you are a data scientist and your task is to summarize what the customer representatives discussed with customers, or you have a set of emails you do not have time to read, or you need to present summaries of long emails or messages to your seniors or your boss. In those cases you have to train, or rather fine-tune, your summarization model on your custom data, and that is what we will do now using the SAMSum dataset, a conversation dataset released by Samsung; we will fine-tune our model on it.

One more thing: our RAM is now quite full, and while fine-tuning on the SAMSum data these resources would probably be exhausted. So I am going to comment out the BART-large and Pegasus cells to save some space for fine-tuning, and then restart the runtime, in fact Restart and run all, to free up memory. This way the notebook will not consume all that memory, because the BART and Pegasus models are quite large, and we will have enough room to fine-tune. For fine-tuning we will use the bart-large-cnn model on our custom dataset. I'll see you then. Okay, the notebook has restarted and is re-running the code we wrote earlier, with the BART and Pegasus cells commented out to save space; otherwise our RAM was about to run out.

So what is the SAMSum data? It is a dialogue dataset provided by Samsung: two people have a conversation and a summary is presented alongside it. For example: Amanda: "I baked cookies, do you want some?" Jerry: "Sure." Amanda: "I'll bring you some tomorrow." And the summary is "Amanda baked cookies and will bring Jerry some tomorrow." What is the purpose of a custom model on a custom dataset like this? Suppose you are a data scientist and your company wants chat summarization between customer representatives and customers: a customer can chat with a representative, but management might not have time to go through all the text data, so you can write code that presents a summary. That is what we will build with the SAMSum data. The dataset has three splits, train, test, and validation, with about 14k training rows and roughly 819 rows each for the test and validation sets.

Let's write the code while the earlier cells keep running. First the imports: from datasets import load_dataset; from transformers import pipeline; and from transformers import AutoModelForSeq2SeqLM. Why a sequence-to-sequence language model? Because the input here is a sequence and the output is also a sequence. Along with that we also need AutoTokenizer. As I have been telling you in previous videos, which you can watch on the KGP Talkie channel on YouTube where I have covered the Hugging Face NLP tutorials, AutoTokenizer automatically fetches the correct tokenizer for your model, so you do not need to remember which tokenizer goes with which model. Finally, I import PyTorch.
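The import cell, sketched:

    from datasets import load_dataset
    from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
    import torch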
Everything is imported successfully, and the earlier code has also executed; it seems to be printing the same result several times because we commented out the BART and Pegasus cells, so it is the T5 output printing. Now I set the device to the GPU, since a GPU is available with us, so we will use it for computation. For the model checkpoint I write model_ckpt and set it to the bart-large-cnn model. Why do I want to fine-tune bart-large-cnn for the SAMSum data? Because it is already fine-tuned for the summarization task; if I fine-tuned BART alone, without the CNN fine-tuning, I might not get as good accuracy. Since this model was already fine-tuned on a similar kind of task, fine-tuning it again will give better accuracy. So the checkpoint is facebook/bart-large-cnn. Then the tokenizer, which is what I was talking about: I can simply use AutoTokenizer and load it from_pretrained with the model checkpoint, and load the model with the auto class, AutoModelForSeq2SeqLM.from_pretrained with the same checkpoint. So we have loaded the tokenizer and the model; while the model downloads (depending on internet speed it might take some time), let's load the data: I write samsum = load_dataset("samsum"). This loads the data with all the available splits. Printing it shows the three splits, train, test, and validation, as we discussed previously. Looking at the first sample of the train split, it is the same example we saw: Amanda talking about the cookies, Jerry saying she can bring them, and finally the summary.
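A sketch of those cells; note that newer datasets releases may additionally require trust_remote_code=True for samsum, and the .to(device) move is optional since the Trainer handles device placement:

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_ckpt = "facebook/bart-large-cnn"

    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

    samsum = load_dataset("samsum")
    print(samsum)              # train / test / validation splits
    print(samsum["train"][0])  # dialogue, summary, id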
On this kind of data we now need to prepare our inputs, but before that we should do some data exploration: we will compute the length of each dialogue and the length of each summary. Why do we want this? There are two reasons to understand your data before using it. We need these lengths because the tokenization length is limited: you can only feed the model a maximum of 512 or 1,024 tokens at a time, and if you provide more, it creates problems; I mean to say the model then cannot take in the whole context. So we visualize how many tokens there are, make a rough estimate, and then pick a suitable model.

I build the dialogue lengths with a list comprehension: for each x in the SAMSum train split, I take x's dialogue, split it, and take the length of the result; and similarly I get the summary lengths. That is all; it is pretty simple, so let's run it. With the dialogue and summary lengths computed, we can use a pandas DataFrame to visualize the histograms: I import pandas as pd, build a DataFrame from the two lists, do a transpose so we see the data in vertical (column) format, set the column names to dialogue length and summary length, and call the DataFrame's hist method with a figure size of 15 by 15.

It is fairly simple code, and now you can clearly see the result: the dialogue lengths are almost all below 500 tokens, which means you do not need to worry about anything; you can use whichever model you want, since the minimum token support of essentially any Transformer model is 512, so whatever Transformer you select will work with this data. The summary lengths are below roughly 60 to 70 tokens, and one more thing you can see from the output: in general, the summary length seems to be about 10 percent of the dialogue length.
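The length analysis, sketched; lengths here are whitespace-split word counts, a rough proxy for token counts:

    import pandas as pd

    dialogue_len = [len(x["dialogue"].split()) for x in samsum["train"]]
    summary_len = [len(x["summary"].split()) for x in samsum["train"]]

    data = pd.DataFrame([dialogue_len, summary_len]).T
    data.columns = ["dialogue_len", "summary_len"]
    data.hist(figsize=(15, 15))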
Perfect, so now we have the token lengths and we are sure they will work with our model. Next we build the data collator stage; the data collator is where you prepare your data for fine-tuning, a preprocessing step that combines multiple steps together so your data is ready for training. I define a function get_feature(batch). What feature are we getting? As you know, in a Transformer the text data is first converted into encodings (token IDs), those encodings are converted into embeddings (context embeddings plus positional embeddings), and the embeddings are fed through the attention heads of the encoders and decoders, which finally generate context-based representations. Tokenization happens outside the model; the other steps, generating the embeddings and the encoder-decoder processing, all happen inside the model. So here we do the tokenization: using the tokenizer we already loaded with the AutoTokenizer class, I first tokenize the dialogue as the input text, and then the summary as the target text for this task. I set max_length to 1024, since BART supports a maximum of 1,024 tokens, and truncation=True, so anything that goes beyond 1,024 tokens is truncated from the end. The encodings return input_ids, an attention_mask, and the label encodings, so I return a dictionary with input_ids, attention_mask, and labels.

With get_feature ready, we apply it over the SAMSum data with the map method to generate the encodings: I write samsum_pt (pt for PyTorch) = samsum.map(get_feature, batched=True); batched=True means I want batch processing. At first something seemed wrong: after the mapping, the dataset still had only the original three columns, id, dialogue, and summary. What was wrong? I had not returned the encodings from get_feature; once I return them, the three additional columns get added. And indeed, after fixing it, three more columns appear: input_ids, attention_mask (this should be attention_mask, not "attention_mast", another small typo), and labels. I reloaded the data to undo the earlier mistaken run, ran the mapping once again, and now I can see input_ids, attention_mask, and labels.

Now let's define the columns we will use for the final training: columns = input_ids, attention_mask, and labels. Then I write samsum_pt.set_format with type equal to torch and columns equal to columns. What does this mean? It sets the dataset format so those three columns are returned as PyTorch tensors; there is no other change, and the available columns remain exactly the same as before.
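The whole preprocessing, sketched; text_target assumes a reasonably recent transformers release (older code wraps the summary tokenization in tokenizer.as_target_tokenizer() instead):

    def get_feature(batch):
        # Tokenize the dialogue as input and the summary as target labels
        encodings = tokenizer(batch["dialogue"], text_target=batch["summary"],
                              max_length=1024, truncation=True)
        return {"input_ids": encodings["input_ids"],
                "attention_mask": encodings["attention_mask"],
                "labels": encodings["labels"]}

    samsum_pt = samsum.map(get_feature, batched=True)

    columns = ["input_ids", "attention_mask", "labels"]
    samsum_pt.set_format(type="torch", columns=columns)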
We have now reached the stage where we can start fine-tuning our model, but first we have to build the trainer. I start with the data collator: from transformers import DataCollatorForSeq2Seq, and then data_collator = DataCollatorForSeq2Seq with the tokenizer and model=model. With the data collator ready, let's import TrainingArguments and Trainer from transformers and create the training arguments, which hold all the parameters used for fine-tuning. I set output_dir to "bart_samsum", the directory where the model checkpoints will be stored; num_train_epochs=1, so fine-tuning runs for just one epoch; warmup_steps=500; per_device_train_batch_size=4 and per_device_eval_batch_size=4, meaning batches of four examples per device will be used for training and evaluation; weight_decay=0.01, so the weights decay at that rate while training; logging_steps=10, so the fine-tuning logs are written every 10 steps; evaluation_strategy="steps" with eval_steps=500, so evaluation happens every 500 steps; save_steps set to 10 to the power 6, so checkpoints are only saved after a million steps, which is effectively never; and since we are running this large model on a very small GPU, gradient_accumulation_steps=16, otherwise training might fail.

With the training arguments set, we prepare the trainer: trainer = Trainer, passing model=model, args=training_args, tokenizer=tokenizer, data_collator=data_collator, train_dataset=samsum_pt["train"], and eval_dataset=samsum_pt["validation"]. So the model and tokenizer we downloaded some time back are passed in, the training arguments come from above, the data collator we just created collects all the preprocessing steps together along with the tokenizer, the train split will be used for training, and the validation split for evaluation. Let's run this and start fine-tuning: trainer.train() (something was wrong at first, a small typo, now fixed). The model loads, the logging of training loss and validation loss starts, and training begins successfully: we are running one epoch, about 230 steps in total, and four steps are already done. This is a large model, so it will take some 20 to 30 minutes, and we have to wait for it to finish.
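The whole training setup, sketched; note that evaluation_strategy is renamed eval_strategy in the newest transformers releases:

    from transformers import DataCollatorForSeq2Seq, TrainingArguments, Trainer

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    training_args = TrainingArguments(
        output_dir="bart_samsum",        # where checkpoints are written
        num_train_epochs=1,
        warmup_steps=500,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        weight_decay=0.01,
        logging_steps=10,
        evaluation_strategy="steps",
        eval_steps=500,
        save_steps=1e6,                  # effectively: don't save mid-run checkpoints
        gradient_accumulation_steps=16,  # keeps the large model within GPU memory
    )

    trainer = Trainer(model=model, args=training_args,
                      tokenizer=tokenizer, data_collator=data_collator,
                      train_dataset=samsum_pt["train"],
                      eval_dataset=samsum_pt["validation"])
    trainer.train()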
The training is done; now we are ready to save our model, and once we save it we can download it and test it. I write trainer.save_model and give the model a name. In fact, give it a different name from before: earlier we already used bart_samsum as the checkpoint output directory, which was created during training, so I save to bart_samsum_model and it is stored in a separate location. In that folder you can see all the JSON config files along with the PyTorch model weights. Now we create a pipeline like we have been creating earlier: pipe = pipeline with "summarization" and model equal to the bart_samsum_model path.

We also have to provide some generation keyword arguments, which are important when generating text. In our previous video, NLP tutorial 3, where I taught text generation, I explained these: length_penalty puts a penalty on the output length (the longer the output, the higher the penalty); num_beams controls beam search and thus the variety in the generated text; and max_length bounds the output, and here I want a maximum of 128 tokens. To test, I also need to write a custom dialogue, with the generation arguments passed alongside it. Let's build one with two speakers. The first speaker is myself, Laxmikant: "What work are you planning to give Tom?" Then Julie: "I was hoping to send him on a business trip first." Laxmikant again: "Cool, is there any suitable work for him?" And Julie: "He did excellent in the last quarter. I'll assign a new project once he is back." For this conversation I want a summary, so I call print(pipe(custom_dialogue, ...)) with the generation arguments and run it; based on our fine-tuned model, it will try to generate a suitable summary for this dialogue. At first something was wrong: the length_penalty argument was not right; while typing very fast I had made a typo. After fixing it, the model runs; it is a large model, so it takes some time, and then the summary text is ready.
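Saving and testing, sketched; the length_penalty and num_beams values here are typical choices rather than necessarily the video's exact ones, and the dialogue is abbreviated:

    trainer.save_model("bart_samsum_model")

    pipe = pipeline("summarization", model="bart_samsum_model")
    gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}

    custom_dialogue = (
        "Laxmikant: What work are you planning to give Tom?\n"
        "Julie: I was hoping to send him on a business trip first.\n"
        "Laxmikant: Cool, is there any suitable work for him?\n"
        "Julie: He did excellent in the last quarter. I'll assign a new project once he is back."
    )

    print(pipe(custom_dialogue, **gen_kwargs))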
It says: Julie wants to send Tom on a business trip first; she will assign him a new project once he is back; Tom did excellent in the last quarter and she wants to give him more work, and so on. So basically you can see our model is doing very well: it summarized the chat as "Julie wants to send Tom on a business trip first and will assign him a new project once he is back." Now you have seen how you can do abstractive summarization.

Finally, let's see how to download your model. You can run !zip -r bart_samsum.zip bart_samsum_model, which zips that model folder recursively into bart_samsum.zip; once the zip is created, the option to download the file appears and you can simply download it. That is all for this lesson; thanks a lot for watching, and I'll see you in the next one.
Info
Channel: KGP Talkie
Views: 10,469
Keywords: kgp talkie, kgp talkie videos, machine learning tutorials, kgp talkie data science, HuggingFace Tutorial, Summarization Fine-Tuning, Gpt2, T5, BART, PEGASUS, Text Summarization, huggingface tutorial for beginners, huggingface projects, nlp projects, latest nlp projects, transformer projects, nlp with transformers, fine tuning bart, bart fine tuning, fine tuning transformers, fine tuning llm, fine tuning t5, fine tuning pegasus, t5 fine tuning, t5 fine tuning huggingface
Id: CDmPBsZ09wg
Length: 75min 20sec (4520 seconds)
Published: Fri Sep 22 2023