Generate Blog Posts with GPT2 & Hugging Face Transformers | AI Text Generation GPT2-Large

Video Statistics and Information

Captions
"Thanks everyone for attending class today, and as discussed, your homework tonight is to write a five-million-word blog post by tomorrow morning to demonstrate your eminence in social media." "Uh, excuse me sir, did you say five million?" "Yep, Nick, sure did. Good luck!" "Uh, I think we're going to need a little AI help for this one. GPT-2, here we come."

What's happening, guys! In today's video we're going to take a look at how we can use AI and natural language processing to generate our own blog posts from a single sentence. Let's take a deeper look at what we'll be going through. In order to generate our blog post we're going to use a library called Hugging Face, which gives us easy access to a state-of-the-art natural language model called GPT-2. Using GPT-2 we can pass through a sentence and have an entire blog post generated; it's quite powerful and lets you do quite a few different things. First up we'll set up Hugging Face GPT-2, installing it and getting it ready. Then we'll load up our model and encode our sentence. Once the sentence is encoded we can pass it to our model, decode the result, and have our blog post generated. We'll also export it as a text file.

Here's how it all fits together. First we install Hugging Face Transformers, the library for natural language processing. Then we preload the GPT-2 Large model, which gives us a whole heap of power and grunt for generating blog posts. We can then pass through a single sentence describing what we want our blog post to be about and have an entire post generated. You can configure how long you want the blog post to be, and you can export it to a text file so you can paste it up on your blog or push it out elsewhere. Ready to do it? Let's get to it!

Alrighty. In order to use Hugging Face Transformers and natural language processing to generate our blog posts, there are five key things we need to do. First, install and import our dependencies. Second, load our Hugging Face GPT-2 model. Third, tokenize our sentence: reduce it to individual words and encode them, converting them to a number representation. Fourth, generate our text and decode it, converting the numbers back to their representative words. Fifth, output our result.

Our core dependency is Hugging Face Transformers, and specifically we're going to use the GPT-2 model. It's readily available: you don't need access to a special API, it's all open source, and you can start using it pretty easily. As always, all the code shown in this tutorial is available via GitHub; just check the link in the description below, or head to github.com/nicknochnack/Generating-Blog-Post-with-GPT2-Large.

Alrighty, cool. On that note, let's actually start writing some code. The first thing we need to do is install and import our dependencies.
Our core dependency in this particular case is Hugging Face Transformers, and to install it inside our notebook we can just type "!pip install transformers". So let's go ahead and do it. Alrighty, that's our core dependency installed. I already had it installed, so it went pretty quick, but if you're doing it from scratch it might take a little longer; that exclamation-mark pip command is really all there is to it.

Next we need to import our dependencies, and we've imported two key things, a model and a tokenizer: from transformers import GPT2LMHeadModel, GPT2Tokenizer. Let's take a quick step back and look at what these two things do. The tokenizer takes our input text and encodes each word to a number. We then pass those numbers to the model, our GPT-2 model, which generates output tokens: numbers representing words or word sequences. We then use the tokenizer again to decode those numbers back into words. So effectively we take words, convert them to numbers, pass them to the model, the model generates new number sequences, and we decode those back into an output sequence that represents a blog post.

Next we need to load our model and our tokenizer, which takes two lines of code. First we create a new variable called tokenizer and set it equal to GPT2Tokenizer.from_pretrained('gpt2-large'), using the from_pretrained method to load the pre-trained gpt2-large tokenizer. Then we instantiate a model using the GPT2LMHeadModel class we imported above, again calling from_pretrained with 'gpt2-large'. Now, if you get an out-of-memory error or something along those lines, you may want to use the non-large model: just remove the "-large" to load the smaller version. Here we'll stick with the large model because it lets us generate bigger, more sophisticated blocks of text. The other keyword parameter we've passed through is pad_token_id, which sets the token used to pad our text, and we've set it to tokenizer.eos_token_id. If we take a look at that value, the token that will be used to pad our text is token 50256.
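Pieced together from the steps just described, the setup cell looks something like the sketch below; it mirrors the narration rather than being a verbatim copy of the video's notebook.

    # Install the library once inside the notebook:
    # !pip install transformers

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Load the pre-trained gpt2-large tokenizer and model; swap 'gpt2-large'
    # for plain 'gpt2' if you run into out-of-memory errors
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2LMHeadModel.from_pretrained('gpt2-large',
                                            pad_token_id=tokenizer.eos_token_id)

    # Sanity check: padding uses GPT-2's end-of-text token
    print(tokenizer.eos_token_id)   # 50256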
If we decode that ID, we get an end-of-text token: wherever padding is needed, it's filled in with this <|endoftext|> token. We'll go into the decode method in a little more detail later on, but for now just know that you can encode your text and you can also decode it. Reading the whole line back: model = GPT2LMHeadModel.from_pretrained('gpt2-large', pad_token_id=tokenizer.eos_token_id), where eos_token_id is the identifier for the end-of-sentence token; in this case it was the number you just saw, and decoding it gives a set of angle brackets, some bars, and "endoftext" in between.

Okay, our model and tokenizer are now loaded. The next component is to set up a sentence and encode it using the tokenizer we just created. So let's go ahead and do it. Alrighty, our sentence is now tokenized, and that took two lines of code. First we created a standard Python string saved inside a variable called sentence, and that sentence is "I like ice cream". Then we used our tokenizer to encode it into a set of identifiers, which are what we'll pass to the model we set up above. To do that we've written tokenizer.encode, passed through our sentence, specified that we want our tensors returned as PyTorch tensors, and saved the result in a variable called input_ids. So we've effectively converted the sentence "I like ice cream" into these input IDs, and we can see that the first ID is 40.
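As a minimal sketch of the encoding step as narrated (variable names follow the narration):

    # Our prompt sentence
    sentence = 'I like ice cream'

    # Encode the sentence into token IDs, returned as a PyTorch ('pt') tensor
    input_ids = tokenizer.encode(sentence, return_tensors='pt')
    print(input_ids)   # the first ID should be 40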
If we take that first ID and type tokenizer.decode, passing through that value, decoding the number back gives us just the word "I" again. If we try the other IDs, the third decodes to "ice" and the fourth to "cream". So you can see we're converting the sentence into a number representation, which is the way our GPT-2 model likes receiving inputs, and that's how we'll send it. A key thing to note: when our GPT-2 model generates our blog post, we'll get a set of identifiers back again, so eventually we'll use tokenizer.decode to decode our generated outputs.

Now it's time to actually generate our blog post. You might have guessed it's going to be based on "I like ice cream", but you could change this sentence and generate blog posts on a bunch of other topics, and we'll take a look at how to do that once we've generated our first one. On that note, let's do it. Alrighty, that's our blog post generated. We haven't output anything to the screen yet, but we'll come to that in a second. What we've written is one single line: model.generate, which uses the model from above, passing through our input IDs and a number of other keyword parameters that we'll delve into a little deeper shortly.

Let's take a look at what's been generated. If we inspect the output, we get a set of identifiers, because remember: we pass in identifiers, and we get identifiers back. If we decode it, we get a whole blog post. It's quite a short one, but we'll flesh it out in a second. Decoding the output gives us: "I like Ice Cream Sandwich, but I don't like it as much as I thought I would. It's not that I dislike it, it's just that it doesn't feel quite right to me. I'm not sure if I'll ever be able to get used to it. I'm sure there are plenty of people out there who love this new version of Android" (ah, it's talking about Ice Cream Sandwich, the version of Android; interesting) "but for me, I just can't get over the fact that Android 4 feels so different from the..." and it stops there. Interesting: we said "ice cream" and it adapted that into Ice Cream Sandwich, the Android version. But you can see we've started to generate the beginning of a blog post. Let's take a step back and look at how this was actually done, and then we'll update our topic and generate a slightly larger blog post.
To generate, we've written model.generate and passed through the input IDs we tokenized above. Then we've specified the maximum length of the text we want to generate (keep in mind that although we're using this for a blog post, you could generate a whole bunch of other kinds of text), and we've set that limit to 100 tokens. Then we've specified the number of beams: we're using a technique called beam search to find the most appropriate next word in the sequence, and num_beams, effectively how many candidate search paths we keep, is set to five. We've also specified no_repeat_ngram_size=2. This parameter stops our model from repeating certain sequences over and over, because sometimes the model gets stuck on a certain probability for the next word and just keeps repeating; here we're saying a sequence of two tokens can only appear once. And we've specified early_stopping=True, so if we reach a point where we're not getting great output, generation stops. In this case, it produced the blog post above.

Now, if we want to, we can output this as well, and that's our lightweight blog post written out (again, we'll flesh it out in a second). We've used our tokenizer to decode the output: text = tokenizer.decode(...). Remember, the encode method takes our sentences to their identifiers, effectively our embedding, and the decode method takes an existing set of identifiers and converts it back to its representative words. So we take the output from our model, pass it to tokenizer.decode, grab the first element of the array, and pass skip_special_tokens=True, because we don't want end-of-text tokens and other special tokens in the output, just the words. We store all of that inside a variable called text, then use some standard Python functionality to write it out: with open(...) as f, then f.write(text). This generates a new text document for the blog post, named after our ice-cream prompt (it initially went out with a .text extension when it should really be .txt). If we go to the folder we're working in, we can see the document we just wrote, containing the lightweight blog post we generated.
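Putting the generation, decoding, and export steps together, a sketch of the cells just described might look like this (the exact output filename here is a guess at what was typed in the video):

    # Generate up to 100 tokens with beam search, blocking repeated 2-grams
    output = model.generate(
        input_ids,
        max_length=100,
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True,
    )

    # Decode the generated IDs back into text, dropping special tokens
    # such as <|endoftext|>
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(text)

    # Export the result with standard Python file handling
    with open('blogpost_icecream.txt', 'w') as f:
        f.write(text)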
So far we've gone through an entire workflow and generated a lightweight blog post, but what if we want to generate something a little more substantial, say, using a different sentence as our input? All we need to do is replace the sentence and we should be able to generate a new blog post. In this particular case the sentence I'm going to use is drawn from an existing blog post I saw, a Hacker Noon article about how no-code tools need to adapt to the specialists already operating in that field. So we'll use "no code needs to adapt to specialists", or at least some text out of it, as the input to our model. We paste that in, run it, and take a look at our tokenizer: again it converts the sentence to input IDs, and decoding them gives our sentence back. The other change we'll make is to the maximum length: rather than 100, we'll bump it up to 500, so ideally we generate something more substantial than a single line. Let's run it; it may take a little while, because it's pushing quite a sophisticated model through a number of processing steps to generate a significant blog post.

Five minutes later... and there you go: a significantly more substantial blog post. If you take a look, you can see we've generated a whole heap of additional tokens, and right down here we get our full blog post. It starts with "No code needs to adapt to specialists", to which it has appended a question mark we didn't have, and continues: "I'm not sure what the answer is to this question, but I think it's a good question to ask yourself. If you're an expert in a particular area, then you should be able to adapt your code to the needs of specialists in that area. But if you don't know what your specialty is, you shouldn't have to worry about it. You should just write code that works for you, and if it doesn't work, that's your problem, not the specialist's problem." Interesting. "If I were to write a function that takes a string and converts it to a number, I wouldn't worry too much about whether it would work for a specialist in the field of number theory. I would write the function in such a way that it could be used by anyone who wanted to use it, regardless of whether they knew anything about numbers or not. In other words, my function would be a general-purpose function, which means that I could write it in any language I wanted." You can see it's writing material that's particularly relevant: it's talking about programming languages and the contrast between specialists and non-specialists, and it's generating something quite substantial. If we check the length of the output with len(output[0]), you can see it's hit the maximum length we set for our generated blog post. And again, if we want to, we can write this one out too; we'll call this one specialists.txt, and if we look in our folder again, you can see our next blog post has been output.
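For this longer run, the only changes narrated are the new prompt, the bigger max_length, and the new filename; here's a sketch, assuming the other generate parameters stay the same as before:

    # New prompt drawn from the Hacker Noon article
    sentence = 'no code needs to adapt to specialists'
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    # Same beam-search settings, but allow up to 500 tokens this time
    output = model.generate(input_ids, max_length=500, num_beams=5,
                            no_repeat_ngram_size=2, early_stopping=True)
    print(len(output[0]))   # should sit right at the 500-token cap

    with open('specialists.txt', 'w') as f:
        f.write(tokenizer.decode(output[0], skip_special_tokens=True))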
And that about wraps up how to use the Hugging Face NLP model to generate blog posts. We did quite a fair bit there. We installed and imported our dependencies and specifically started working with GPT-2. We took a look at how our tokenizer allows us to encode and decode our sentences: remember, we encode before we send anything to the model, and we decode the outputs that come back. Then we generated some text using the model.generate method and decoded that output, using a small input to generate a small block of text, and then we increased our max length to generate a slightly longer blog post, which we can output pretty easily using some standard Python functionality.

And that about wraps it up. Thanks so much for tuning in, guys! Hopefully you found this video useful. If you did, be sure to give it a thumbs up, hit subscribe, and tick that bell so you get notified when I release future videos, and let me know what types of blog posts you're generating. Thanks again for tuning in. Peace!
Info
Channel: Nicholas Renotte
Views: 7,116
Rating: 4.9435029 out of 5
Keywords: gpt-2 tutorial, gpt2 tutorial, hugging face, ai text generator, gpt 2 tutorial, text generation, natural language processing
Id: cHymMt1SQn8
Length: 20min 38sec (1238 seconds)
Published: Sat Feb 20 2021