Text Summarization with Google AI's T5 in Python

Captions
Hi, and welcome to this video on text summarization. We're going to build a really simple, easy-to-use text summarizer using Google AI's T5 model. This is insanely easy to do: all together we need about seven lines of code, and with that we can summarize text using Google's T5 model, which is at the cutting edge of text summarization at the moment. It's really impressive that we can do this so easily, so we'll run through it quickly and see what we can do.

We need to import torch and the transformers library, and from transformers we just need AutoTokenizer and AutoModelWithLMHead. While those are importing, we can initialize our tokenizer and model. All we do is load the tokenizer from pretrained, using the t5-base model, and then do the same for the model with AutoModelWithLMHead, making sure we also return a dictionary here.

For the input, I'm going to take some text from the PDF page about Winston Churchill. I've already formatted it over here, so I'm just going to paste it in; this is exactly the same as the text I highlighted, just without the numbers and headers.

We run that and build our input IDs. All we're doing here is taking each of the words and splitting them into tokens: imagine we split one sentence into a list of words, so "his", "first", "speech", "prime", "minister" would each be a separate token. We split the text into those tokens and convert each token into a unique identifier number. Each identifier is used by the model to map that word, which is now a number, to a trained vector that represents the word. We add "summarize:" at the front, followed by our sequence; because we are using PyTorch we want to return "pt" tensors, and we set a max length of 512 tokens, which is the maximum number of tokens T5 can handle at once, truncating anything longer. Now we can have a look at those inputs, and we can see we have our tensor of input IDs.
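For reference, here is a minimal sketch of the setup and tokenization steps described so far. It assumes torch and the transformers library are installed; the Churchill passage is shortened to a placeholder string, and the variable names are mine rather than taken verbatim from the video.

    import torch
    from transformers import AutoTokenizer, AutoModelWithLMHead

    # load the pretrained T5 tokenizer and model (t5-base), returning dict-style outputs
    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)

    # the text to summarize; in the video this is a few paragraphs about Winston Churchill
    sequence = "Winston Churchill was a British statesman..."  # placeholder text

    # prepend the "summarize: " task prefix, tokenize to input IDs as PyTorch tensors,
    # and truncate to T5's 512-token limit
    inputs = tokenizer.encode(
        "summarize: " + sequence,
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )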
Now we need to run these input IDs through our model, so we call model.generate, which will generate a certain number of output tokens, also numeric representations of words. All we need to do is pass our inputs and then give a max length and a min length: let's tell the model we don't want anything longer than 150 tokens or shorter than 80 tokens. We also have a length penalty parameter here; the higher the number, the more the model is penalized for going below or above that min and max length, and we're going to use quite a high value of 5. We also use two beams.

What we also need to do is assign these outputs to another variable, outputs, and when we want to access them we use outputs[0], as this is the tensor containing our numeric word IDs. Now we can use the tokenizer again to decode the outputs, converting them from numeric IDs back into text, and we assign that to another variable as well. Finally, we can print our summary.

Here we can see that the model has taken some of the information, I think entirely from the second paragraph, and created a summary of the full text. Out of the box this is pretty good, because if you read through it, it includes a lot of the main points. The first paragraph isn't that relevant, and I would say the final paragraph isn't either; most of the information we want, from my point of view, is in the second and third paragraphs. The model has quite clearly only extracted information from the second paragraph, which is not ideal, but for an out-of-the-box solution it still performed pretty well.

So that's it for this model. I hope this has been useful and insightful; it shows how quickly we can build a pretty good text summarizer and implement Google's T5 model in almost no lines of code. I hope you enjoyed it, and I will see you in the next one.
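And a sketch of the generation and decoding steps, continuing from the snippet above. The parameter values follow the video; skip_special_tokens is my addition, used to strip T5's padding and end-of-sequence tokens from the printed summary.

    # generate summary token IDs with beam search
    outputs = model.generate(
        inputs,
        max_length=150,
        min_length=80,
        length_penalty=5.0,
        num_beams=2,
    )

    # decode the first output sequence from token IDs back into text and print it
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(summary)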
Info
Channel: James Briggs
Views: 10,529
Id: egDIqQIjDCI
Length: 6min 58sec (418 seconds)
Published: Fri Nov 27 2020