NLP Tutorial 11 - Automatic Text Generation using TensorFlow, Keras and LSTM

Video Statistics and Information

Reddit Comments
  • Original Title: NLP Tutorial 11 - Automatic Text Generation using TensorFlow, Keras and LSTM
  • Author: KGP Talkie
  • Description: NLP #LSTM #TextGeneration In this lesson, I will show how to use LSTM to generate text sequences on given seed text. Please like and subscribe to this ...
  • Youtube URL: https://www.youtube.com/watch?v=VAMKuRAh2nc
  • 1 upvote · posted by u/aivideos · Jan 09 2020
Captions
Hi, welcome back to a new lesson; this is KGP Talkie. In this lesson I'm going to show you how you can do text generation using TensorFlow, Keras, and LSTM. First of all I'll describe how we are going to generate the text, but before that I need to explain how we are going to build the data, then how we are going to train the model, and finally how we are going to make predictions. I'll be talking mostly about data preparation: almost the first 30 or 40 minutes go into preparing the data, because otherwise the algorithm might not work well — a data scientist devotes most of their time to data preparation rather than just writing and building models. Once the data is prepared, I'll show you how to build the LSTM model and prepare the training data; then the final LSTM model will be built and trained for 100 epochs, although you can increase that. So without wasting your time, let's go ahead and start.

The idea is this: we will train a deep learning model, specifically an LSTM, by converting each word into a vector. The model learns to predict the next word given a particular sequence of words. For example, in step 1, if we have "the man is walking", the model predicts "down". Once "down" is predicted, the model again takes, say, the last four words and predicts the next word; that next word is then included in the sequence of words, from which it tries to predict the following word, and so on. It keeps going, predicting word after word, each prediction being appended to the input sequence for the next step.

But how does it predict the next word? To make it do that we need to train the deep learning model, and to train it we need to provide data. That data starts out in text format and is later converted into integers — a vector encoding similar to a word-embedding technique. For this I have chosen the work of William Shakespeare, the sonnets and plays. It is a very large text, and it is available on the MIT website, ocw.mit.edu; I have provided the link in the video description so you can download it from there.

We will use this text to train our deep learning model. First of all we need to import the text into our Google Colab notebook, then we will clean it and prepare it so we can feed it to the model. In Google Colab you first need to set the runtime; I'm going to use the GPU runtime here. Once it is set you can click Connect, or it will connect automatically. Once it is connected, you will see that the RAM and the disk are initialized.
Initially 12 GB of RAM and 68 GB of disk are allocated for this notebook, but you will see later that 12 GB of RAM is not sufficient to train this model, so Google will crash the Jupyter notebook, and once it has crashed it will ask you to increase the RAM, which is then automatically upgraded to 25 GB. So without wasting time, let's start coding.

Since we are going to use TensorFlow 2.0, we need the magic command `%tensorflow_version 2.x` so that the notebook selects the 2.x version. Then I import TensorFlow with `import tensorflow as tf`. I also import `string`, which I'll need for the punctuation, and `requests`, so that I can fetch the text file into the notebook. Run the cell: it imports these three necessary libraries and reports that TensorFlow 2.x is selected (TensorFlow 2.0 was released very recently).

Now let's read the text directly from the internet: `response = requests.get(url)`, where we provide the link. Once that is done, `response.text` holds the whole text returned by `requests.get`, and you can print it. It might take a little time to fetch, since this is quite a large text with millions of characters. One thing you might notice is that a newline character is inserted at every line break; for example, right after "This is the 100th Etext file presented by Project Gutenberg" there is a newline, then another one after the next line, and so on. To deal with that, we split the text at the newline character: `data = response.text.split('\n')` gives us a list of strings.

Let's print the very first line: "This is the 100th Etext file presented by Project Gutenberg". As you can see, this is header material and not Shakespeare's original work; the work itself starts at line number 253: "From fairest creatures we desire increase, that thereby beauty's rose might never die, but as the riper should by time decease...". So we need to omit the initial lines, which we can do with `data = data[253:]`. With this we get our updated data: the zeroth location now holds the very first line of the original play, and the header has been removed from the file. A sketch of this step follows.
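As a minimal, hedged sketch of this fetch-and-trim step: the exact URL is the one given in the video description; the path below is the commonly cited MIT OCW copy of the complete works and is an assumption on my part.

```python
import string
import requests
import tensorflow as tf  # TF 2.x; in Colab, run the magic %tensorflow_version 2.x first

# Assumed URL: the MIT OCW copy of Shakespeare's complete works.
url = 'https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt'
response = requests.get(url)

data = response.text.split('\n')  # one string per line of the file
data = data[253:]                 # drop the Project Gutenberg header
```

The re-joining of these lines into one continuous string happens in the next step.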
Now we have the original work here. Since `data` is a list, `len(data)` shows how many lines there are: more than 124,000 lines are present. To make it one continuous text we need to join all the values in this list, which we can do with `data = ' '.join(data)`. The list is thereby converted into a single text in which the newline characters are gone and the header is removed, so printing `data` prints the whole play minus the header portion (it might take a little time to print).

Perfect: "From fairest creatures we desire increase, that thereby beauty's rose might never die" — but you can still see the punctuation in there. So what we are going to do is remove the punctuation, leaving only alphabetic text, and that is the text on which we will finally train our deep learning model.

For that we first write a cleaning function, `clean_text(doc)`, which takes the document. First we need to create tokens: the document is split on whitespace, and the pieces are our tokens, so `tokens = doc.split()`. The split is based on whitespace, and since `split()` with no arguments collapses runs of whitespace, extra spaces do not produce extra tokens. Then I handle the punctuation: we build a translation table with `table = str.maketrans('', '', string.punctuation)`, which is the standard technique to remove punctuation from a string. Now I apply it token by token: this is a list comprehension, a for loop over the tokens, `tokens = [w.translate(table) for w in tokens]` — each `w` is an individual word, and the new list will contain no punctuation, because `translate(table)` strips it from every token.

After this I remove any tokens that are not purely alphabetic. Again a comprehension with a for loop, `tokens = [word for word in tokens if word.isalpha()]`: `word.isalpha()` checks whether the word consists of letters only, and only then is it kept in the new list. Once we have that, we need to convert everything to lowercase.
(Sorry — my Google Home had just turned on and was asking me to enable its microphone.) Nevertheless, let's continue. So now we have the tokens with the special characters and punctuation removed, and we need to convert the capital letters into lowercase. To do that: `tokens = [word.lower() for word in tokens]` — I'm sure you understand this pattern by now. With this, the tokens are converted to lowercase; together, these lines remove special characters and punctuation and split the full text into whitespace-separated tokens.

It's saying "unable to connect to the runtime": my runtime got disconnected, I think because my internet connection dropped, so let me reconnect. All right — once I got connected to the internet again, I reloaded and reconnected the Jupyter notebook runtime. I have not changed any of the code; it is exactly the same as before.

So let's run the cell containing the function that cleans the text, removing any non-alphabetic characters and extra whitespace, and then get the tokens with `tokens = clean_text(data)`. Let's print the first 50 tokens, which we will later use as our seed text. (It first raised an error about a built-in object; I had made a small mistake in the function, and after fixing it, it runs fine.) "from fairest creatures we desire increase that thereby beautys rose might never die" — you can see that the special characters and punctuation have been removed from each token.

Now, before combining these tokens, let's see how many we have in total. `len(tokens)` gives the total number of words present in this play: 898,199 words. But how many individual words are there? To count the unique words we convert the list to a set: `len(set(tokens))` gives the total number of unique words, 27,956. We can say this will be our total vocabulary size during word embedding: the full play contains 27,956 unique words, and combinations of those words make up the 898,199 total words. There are many repetitions — "desire" and "increase" are repeated many times — which is why the number of unique words is much smaller, while the number of combinations of those unique words is high.
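A sketch of the cleaning function exactly as described (it only needs the `string` module imported earlier):

```python
def clean_text(doc):
    """Split on whitespace, strip punctuation, keep alphabetic tokens, lowercase."""
    tokens = doc.split()
    table = str.maketrans('', '', string.punctuation)     # maps every punctuation char to None
    tokens = [w.translate(table) for w in tokens]         # strip punctuation from each token
    tokens = [word for word in tokens if word.isalpha()]  # keep letters-only tokens
    tokens = [word.lower() for word in tokens]
    return tokens

tokens = clean_text(data)
print(tokens[:50])       # the first 50 tokens, later reused as seed text
print(len(tokens))       # 898,199 total words
print(len(set(tokens)))  # 27,956 unique words
```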
Perfect. Now let's create the data sequences we will use to train our model. As I told you earlier, we use a particular set of previous words to predict the next word. In the figure there were just four input words, but in this lesson I'm going to use the 50 previous words: after those 50 words we predict the 51st word, then I take the new last 50 words and predict the 52nd word, and similarly I keep creating line after line of training data.

So let's prepare the data for that. I set `length = 50 + 1`: 50 words as input, plus 1 because I'm also going to include the output word. I will first create the 51 words together, and at the final stage I'll take the last column as the output, so that I can prepare the data and train the model; I'll explain that in detail later. For the time being, understand that the first 50 words are the input and the last one is the output for which the model will be trained. We start with `lines = []`, an empty list.

Now a for loop: `for i in range(length, len(tokens))`, i.e. from 51 up to the total number of tokens. Inside it I create a sequence, `seq = tokens[i-length:i]`. Notice that the range starts at `length`, which is 51, so on the first iteration `i - length` is 51 - 51 = 0 and the slice runs from token 0 to token 51; that becomes the first sequence. Once we have those individual tokens, I join them into a sentence: `line = ' '.join(seq)`. That gives one line, a sequence of 51 words, and I feed that sentence into the empty list with `lines.append(line)`.

One more thing: if you remember, I told you that even 25 GB of RAM is not sufficient to embed this full play, which has nearly a million words. So here I'm just going to take the first 200,000 words, about a quarter of the total play; otherwise the RAM will overflow and you will not be able to train this model. So inside the loop: `if i > 200000: break`. (See the sketch below.)

Perfect. With this, let's see how many sequences we got in total, with `len(lines)`; it might take a little time to complete. It says we have about 199,000 lines — call it nearly 200,000 lines of data. Inside these lines, at the zeroth location, we have the first 50 words: "from fairest creatures we desire increase that thereby beautys..." — the same tokens we saw earlier, with `tokens[0]` being "from".
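A sketch of this sliding-window step, with the 200,000-word cap used in the video as a memory workaround:

```python
length = 50 + 1   # 50 input words + 1 output word
lines = []

for i in range(length, len(tokens)):
    seq = tokens[i - length:i]   # a window of 51 consecutive tokens
    line = ' '.join(seq)
    lines.append(line)
    if i > 200000:               # cap at ~200k words so Colab RAM is not exhausted
        break

print(len(lines))   # about 199,000 training lines
```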
Each line consists of 51 words. Let's verify: positions 0 through 50 make 51 words in total, and the word at index 50 — the 51st word — is "self". Now look at `lines[1]`: it holds the same window shifted by one word, starting from "fairest" and ending one word further on. So this is how it will work: the model first takes the first 50 words and learns to predict "self"; then "self" is included in the input, and it tries to predict the following output word, and so on. That is how the algorithm is designed.

Now it's time to start building our LSTM model — build the LSTM model and prepare X and y. As I told you earlier, I will prepare X and y in such a way that the first fifty words are used to predict y: since each of these lines has 51 words, the first 50 words are used to predict the fifty-first.

Before we start, we need to import the necessary libraries and packages. First `import numpy as np`, then the Keras pieces from TensorFlow: from `tensorflow.keras.preprocessing.text` we use the `Tokenizer` that comes built in with Keras; from `tensorflow.keras.utils` we import `to_categorical`; from `tensorflow.keras.models` we import `Sequential`; from `tensorflow.keras.layers` we need not only the `Dense` layer but also `LSTM` and `Embedding` (I'll tell you why we use the embedding layer); and finally from `tensorflow.keras.preprocessing.sequence` we import `pad_sequences`, since we will use padding later. These are the packages we will use throughout this lesson. (It complained about my numpy import — it should of course be `import numpy as np`. Fixed.)

Now we work on tokenization. Create the tokenizer, `tokenizer = Tokenizer()`, and fit it on the text: `tokenizer.fit_on_texts(lines)`. We first fit the lines on the tokenizer, and then the sequences of words are encoded as integers. This is a kind of vocabulary building in which every unique word is assigned an integer; since any machine learning algorithm works only on numerical values, we need to convert the text data into numbers, and that is what the tokenizer does. Once that is done, we create `sequences = tokenizer.texts_to_sequences(lines)`, which returns a list of the integer values created by the tokenizer for each line.

Let's look at these sequences. If you print `sequences` you will see a very big array — actually, I should not have printed it; it might hang my computer.
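The imports and the two tokenizer calls, collected as a runnable sketch:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)                    # assign an integer index to every unique word
sequences = tokenizer.texts_to_sequences(lines)  # encode each 51-word line as 51 integers
```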
And hang it did — this is quite a big array. What we need to do next is convert these sequences into a NumPy array: `sequences = np.array(sequences)`. As I said, I should not have printed that whole sequence; you should always avoid printing any very large output, otherwise your notebook may crash. Mine is not recovering, and it says I need to restart the notebook, so I'm going to exit this page.

All right, I have reconnected my runtime. This can happen to you too: sometimes, when your output gets very large, your notebook gets disconnected and you need to restart it and run all the cells again. I did the same here; the cell numbers may have changed (it was at 23, now it is 19), but I have not changed the code a bit — it is all the same, I have just re-run all the cells so that it could connect again.

So, we were at `sequences = np.array(sequences)`, and now I'm going to create X and y. Do you remember that the first 50 words are used as the input vector and the fifty-first word as the output vector? `sequences` is now a two-dimensional array with rows and columns: the rows represent the lines of the play, and the columns the word positions, 51 columns in total. The first 50 columns are X and the 51st column is y. So: select all the rows (with the `:` slice) and every column except the last, i.e. up to the 50th column — that goes into X. For y, we again take `sequences` and select all the rows but only the last column. There is just a minor difference between the two: `:-1` says "every column but the last one", while `-1` says "only the last column", which is the output and goes into y. (See the sketch below.)

With this you get X and y. Look at the very first line, `X[0]`: it is a vector of size 50, and each number in it is the integer value for a particular word. This is a form of word encoding in which each unique word is assigned a unique digit, and based on these integer sequences there is a final output, `y[0]`. If you have some knowledge of deep learning, you will recognize that this has now become a traditional supervised problem: we have a sequence of numbers and we need to predict the final output as a number.
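The X/y split from the paragraph above, as code:

```python
sequences = np.array(sequences)  # shape: (number_of_lines, 51)

X = sequences[:, :-1]  # all rows, first 50 columns: the input words
y = sequences[:, -1]   # all rows, last column only: the word to predict
```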
So now X and y are numbers, but one more thing: `y[0]` is something like 307, the integer code of one particular word, and we need to change y into a one-hot encoding so that we can also print a probability for each predicted word. That is `y = to_categorical(y, num_classes=vocab_size)`, where the number of classes is the vocabulary size.

So we need to get the vocabulary size. Have a look at `tokenizer.word_index`: if I print it, you will see the index assigned to each word — for instance, some index corresponding to "help" and the next to a different word — and if I look at the very first entry, index 1 is assigned to the most frequent word. Note that the indexing starts from 1, not 0. Printing the length of this dictionary gives the total vocabulary: about thirteen thousand. Since the indices start from 1, we need to add one, so `vocab_size = len(tokenizer.word_index) + 1`, and with this we have the correct vocabulary size. Note that this number is smaller than before: in the original text the total number of unique words was 27,956, but only about 13,000 are left, because after removing the punctuation we also reduced the total size by keeping only the first ~200,000 words.

So the number of classes is `vocab_size`. We also need the sequence length, which is `X.shape[1]`: we are taking the first fifty words to predict the fifty-first, so I store `seq_length = X.shape[1]`, which is 50. (These lines are sketched below.)
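One-hot encoding the target and recording the fixed input length:

```python
vocab_size = len(tokenizer.word_index) + 1     # +1 because Keras word indices start at 1
y = to_categorical(y, num_classes=vocab_size)  # one-hot targets -> per-word probabilities
seq_length = X.shape[1]                        # 50 input words per training example
```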
Now we are going to build our LSTM model. Up to now it took almost 40 minutes to prepare the data; now I'll build the LSTM model and then finally train it. The model I'm going to create is very simple: `model = Sequential()`. (If you have not seen my previous videos, please go watch them; I have talked about Sequential models and LSTM in detail there.)

Since we are using word embedding here, I start with an Embedding layer. The embedding layer asks for the input dimension, which is the total vocabulary size, `vocab_size`, and the output dimension, which will be just 50 — so this is a kind of encoding: from the roughly thirteen thousand vocabulary words we produce 50-dimensional vectors. It also asks for the input length, which is 50 as well: `seq_length`. With this we have our first, embedding, layer.

Then with `model.add` I'm going to use two LSTM layers, each with 100 hidden units. Since we are stacking two layers, the first one needs `return_sequences=True`. Then we add the final LSTM layer, which also has 100 units; since no further LSTM layer follows it, `return_sequences` keeps its default of False. After that I add a Dense layer with 100 units — 100 neurons — using the ReLU activation function. Then comes the final layer, also a Dense layer, and here it is very important to understand that the number of units in the final dense layer must be the vocabulary size, because we are going to predict a probability for each word; its activation function is softmax, so that we also get the probability of each predicted word.

Run this, then print the summary of the model with `model.summary()`. You can see there are more than 2 million parameters in this model, so this is going to take some time if you are doing it on a Google Colab notebook. One thing you should also notice here is that the RAM is almost full — 11.3 GB out of 12.72 GB — so the notebook can crash at any time, after which Google will ask you to increase the RAM, which then automatically goes up to 25 GB.

After that, `model.compile`, where I use the loss function `categorical_crossentropy`, the optimizer `adam`, and the metric `accuracy`. Now we reach the really interesting part: `model.fit`, the final phase of model building. It will train the model, which is definitely going to take time. I pass the input X and the output y; the batch size I use is 256 (you can change the batch size and the epochs), and I'm going to train for 100 epochs. Two things before I start it. First, it is going to take a lot of time on Google Colab — around an hour to complete the training. Second, the notebook may crash — and yes, it just crashed. It restarts automatically and asks whether I want more RAM: yes and yes. Now the notebook gets 25 GB of RAM, and all the previous work is deleted.
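The architecture and training call, collected from the description above (hyperparameters as stated in the video):

```python
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))  # 50-dim vectors per word
model.add(LSTM(100, return_sequences=True))  # first of two stacked LSTM layers
model.add(LSTM(100))                         # final LSTM; return_sequences defaults to False
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))  # one probability per vocabulary word
model.summary()  # 2M+ parameters

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=256, epochs=100)
```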
So what we need to do now is Runtime → Restart and run all. This is going to take some time, but we are not changing any code; it will automatically come back here and start fitting the model. A few remarks while it runs. First, your notebook may crash because of low RAM; it will then ask you to increase the RAM, so let it increase to 25 GB. Second, the accuracy is not going to be great, for a few reasons: we have only used about a quarter of the play; I increased the batch size to 256, and a larger batch size trains faster but may decrease accuracy; and the number of epochs is just 100, which is not much, at least for this model — you might need to train for at least 500 epochs to get better accuracy. But the overall idea is to show you how you can do automatic text generation using Keras, LSTM, and deep learning. It is taking about half a minute per epoch, so it's going to take at least 50 minutes to complete the training; in the meantime, let me write the remaining code, in which I'll provide a seed text and finally do the prediction.

All right — the training of this model has completed. I ran it for 100 epochs, and out of those 100 epochs we reached 46% accuracy, which we can call neither quite good nor bad; it is an okay accuracy. You have the option to improve it: since this model has not overfitted, you can increase the number of epochs to increase the accuracy.

So we have our trained model; let's add a few cells here so we can start generating. As I told you earlier, the model needs the first 50 words as a seed text so that it can start predicting the next words. So let's just select a random line — say `lines[12343]` — and use it as the seed text; once I provide it as a seed, the algorithm will automatically generate the words that follow. Remember two things. First, this is not going to reproduce the exact play: this is an ML algorithm, so it predicts words based on whatever it has learned. Second, you can change the words in this line, or provide your own set of words for it to continue from, but always remember to provide only words on which the model has been trained, because we trained it on a limited number of words; the model can fail if it gets a word that is not listed in its tokenizer dictionary.

Now let's write the function that will generate the text. It is quite a long function, so I'll explain it as I write it.
The function is `generate_text_seq(model, tokenizer, text_seq_length, seed_text, n_words)`: I provide the model, the tokenizer, the text sequence length, the seed text (the first 50-odd words), and `n_words` — how many words I want to predict. Inside, I start with `text = []`, an empty list. Then a for loop: `for _ in range(n_words)` — I'm not going to use the loop index — so `n_words` defines how many words I want to generate and therefore how many times we iterate.

First I need the encoded sequence for my text: `encoded = tokenizer.texts_to_sequences([seed_text])[0]`. I pass the seed text, and the `[0]` takes the first item of the returned array, which contains the encoded text — the text encoded into numbers, since, as we saw earlier, every unique word has been assigned a particular integer. Since the encoded seed text may grow longer than 50 words, we need to truncate the sequence to a fixed length. We do that with the `pad_sequences` we imported earlier: `encoded = pad_sequences([encoded], maxlen=text_seq_length, truncating='pre')` — I provide the sequence to pad, the maximum length (the text sequence length, which is 50), and `truncating='pre'` so it trims from the front.

After this we predict: `y_predict = model.predict_classes(encoded)` returns the index of the most probable next word given the encoded integer sequence. Once I have this predicted index, I need to find which word it corresponds to. Start with `predicted_word = ''`, then a for loop, `for word, index in tokenizer.word_index.items():`. For each of these items I check whether the prediction matches: `if index == y_predict:` — `y_predict` gives the index of the predicted word, so when the index is equal to it, that word is the predicted word, and I break out of the loop.

After that, `seed_text = seed_text + ' ' + predicted_word`: the predicted word is added to the seed text. Finally, `text.append(predicted_word)` collects the word, and once the loop is done I `return ' '.join(text)` — all the generated words joined together. A hedged sketch of the whole function follows.
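Here is the function as a sketch. One hedge: `model.predict_classes` existed in the TF 2.0-era Keras used in the video but was removed in later TensorFlow releases, so the line below uses the equivalent `np.argmax(model.predict(...))`; everything else follows the description above.

```python
def generate_text_seq(model, tokenizer, text_seq_length, seed_text, n_words):
    """Generate n_words of text, one word at a time, from a seed text."""
    text = []
    for _ in range(n_words):
        # Encode the current seed text as integers.
        encoded = tokenizer.texts_to_sequences([seed_text])[0]
        # Keep only the last text_seq_length (50) words, truncating from the front.
        encoded = pad_sequences([encoded], maxlen=text_seq_length, truncating='pre')

        # Index of the most probable next word.
        # The video calls model.predict_classes(encoded); on newer TF use argmax:
        y_predict = np.argmax(model.predict(encoded), axis=-1)[0]

        # Map the predicted index back to its word.
        predicted_word = ''
        for word, index in tokenizer.word_index.items():
            if index == y_predict:
                predicted_word = word
                break

        seed_text = seed_text + ' ' + predicted_word  # feed the prediction back in
        text.append(predicted_word)
    return ' '.join(text)
```

Called as in the video: `generate_text_seq(model, tokenizer, seq_length, seed_text, 100)`.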
To recap what is happening: just as in the diagram earlier, the predicted word is put back into the seed text so that the next word can be predicted from it, and each predicted word is also added to the list `text`, which is finally joined to form a complete sentence.

Perfect — let's run this function. I call `generate_text_seq`, passing the model, the tokenizer we already have, the sequence length (50), the seed text we got from `lines[12343]`, and finally 100, to generate the next hundred words. Running it, I get an error: "pad_sequences() got an unexpected keyword argument" — I had spelled the argument wrong; it should be `maxlen`. Fix that, run again — yes, perfect. It has generated the next hundred words (you could generate fewer for simplicity).

So after providing the seed text, which you can see printed, the model predicted the words that follow it, and honestly it is not bad: the model has learned from whatever Shakespeare wrote, and based on that learning it automatically generates these sequences. It is quite good at producing verbs and auxiliary verbs in sensible places, and the words it produces are ones present in Shakespeare's works; most of them might not be in a modern dictionary — "thou", for example, would look wrong nowadays — but they are part of Shakespeare's vocabulary, which is why they are predicted here.

So in this lesson you have learned how to do automatic text generation using Keras and LSTM. I'm sure it has helped you understand word embedding, how to encode text data into integer values, and how to use Keras, TensorFlow, and LSTM. Thank you so much for watching this lesson; if you have any questions, leave a comment below, and please don't forget to like and subscribe to this channel so that you get updates directly in your inbox. Bye bye, have a nice day.
Info
Channel: KGP Talkie
Views: 21,941
Rating: 4.9173555 out of 5
Keywords: nlp training, natural language processing, data science, deep learning, lstm, keras, tensorflow, text generation, text generation using, text generation using lstm, text generation using rnn, text generation using keras, nlp lessons, automatic text generation, python lessons, free deep learning training, neural network, machine learning
Id: VAMKuRAh2nc
Length: 63min 54sec (3834 seconds)
Published: Sat Jan 04 2020