Fake News Detection using LSTM in TensorFlow and Python

Captions
hi everyone, welcome to this new lesson. In this lecture I'm going to teach you how to create a model for a fake and real news detection system using an LSTM and deep learning. We will use a variety of libraries, and we will read the fake news and real news datasets from my GitHub repository. We will analyze the fake and the real news, then move on to data cleaning, where we will learn several ways to clean the dataset. After that we will do a little more preprocessing of the text data so that we can feed it into a word-to-vector conversion; for that we will use the gensim library, which converts text data into a set of vectors. We will test the vector conversion and try to understand how it is doing, then I'll explain how you can get the weight matrix. Finally I'll show you how you can design your deep learning model in just five lines, train it, and reach a great accuracy, about 99%, for our real and fake news detection system. Then I'll show you how to test it on real data: when I gave it a simple made-up example it said it was fake news, and when I took data from a real news publisher on the internet it said it was real news. 99% is a great accuracy for any system, so you can bet on this one; it is one of the best ways to detect real and fake news from text data. I have only one request: please like this video and subscribe to this channel. Thanks a lot. Let's go ahead and design this system; we will be coding line by line in Google Colab.

hi everyone, welcome to this brand new lesson. In this lecture I'm going to teach you how to do fake and real news detection using an LSTM in deep learning. We will use a fake and real news dataset that I have uploaded to my repository: visit github.com/laxmimerit and open the fake-real-news dataset repository there. It has two CSV files: Fake.csv, which contains the fake news, and True.csv, which contains the true news. These datasets were prepared manually; I took them from Kaggle and uploaded them to GitHub so that you can access them easily. Click on Fake.csv and open it in a new tab; the file is large, so you may not get a preview, in which case click on "View raw". It will take a while, and then it will display the text. It is a comma-separated file with the columns title, text, subject and date; this one is the fake news, and the true news file has the same columns.

We will be using TensorFlow, NumPy and pandas here, and also NLTK; we will not use NLTK's features in depth, but we may sometimes use it for stopwords and a few other things. We will also use wordcloud, a text visualization Python package that will help us see the total number of words and how the words are distributed inside the text data. So let's run this. Once again, we are importing NumPy, pandas, matplotlib, seaborn, NLTK, and re, which is regex (not "regression"), and then wordcloud.

Whenever you create a new Google Colab notebook, since we are going to use deep learning and an LSTM, you also need to set the runtime: click "Change runtime type" and select GPU. By default it is set to None, but an LSTM is a deep learning model and needs a GPU for faster training; without one, this could take days to train. So now we are connected to a machine with about 12 GB of RAM and almost 68 GB of disk, which comes free with Google Colab.

Next we import Tokenizer, which we will use to tokenize our text data, and pad_sequences, which we will use to pad the sequences that are not long enough. Because this is text data, some documents could be just 100 or 200 words, and if we fix a constant length of, say, 700 words, then a 100-word document needs 600 words of padding to reach 700. Deep learning models take only a fixed-length input: if you have designed a system that takes inputs of length 700, you cannot feed it 500 or 600 words, or more than 700; it always needs a fixed-length input. We also import the Sequential model, into which we will feed our layers, and the Keras layers Dense, Embedding, LSTM, Conv1D and MaxPool1D. Then we import train_test_split, which will split our dataset into a training set and a test set, and classification_report and accuracy_score from sklearn.metrics, which we will use to measure how accurate our model is.

Now let's get started by importing the fake dataset; after that we will import the true dataset. If you open Fake.csv on GitHub, the URL bar shows raw.githubusercontent.com/laxmimerit/..., which is the raw data, so we can read it with pd.read_csv by passing that link. If you run it (ah, I also needed to run the import cell first, which I had missed), you get the dataset with title, text, subject and date. We read it into a variable named fake, for the fake text data; then you can see the first few rows with fake.head() and the columns with fake.columns. There are four columns; later on we will merge the title and the text together. The date says when the news article was published, the subject is what it was about, and the text is the actual main body.
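As a reference, here is a minimal sketch of the setup cell described above. The raw-file URL is a placeholder based on the repository named in the video; use the actual raw link of Fake.csv.

```python
# Core libraries used throughout the lecture
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Keras utilities for tokenizing, padding, and building the model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Conv1D, MaxPool1D

# Evaluation helpers from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Read the fake-news CSV straight from GitHub
# (placeholder URL: substitute the raw link of Fake.csv in the repository)
fake = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/fake-real-news-dataset/main/Fake.csv')
fake.head()
```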
Let's see how many types of subject these news articles have. Run fake.subject.value_counts() and you will notice about 9,050 articles under News, about 6,800 under politics, about 4,500 under left-news, about 1,500 under Government News, 783 under US_News and 778 under Middle-east. Those are the subjects in the fake dataset. If you want to visualize this as a graph, you can use sns.countplot(data=fake, x='subject'); the parameter I pass as x is the column I want to count. The labels still overlap each other, so we enlarge the figure with plt.figure(figsize=(10, 6)). Superb: there you have News, politics, Government News, left-news, US_News and Middle-east.

Now I'm going to show you how to plot the word cloud. If you remember, earlier I said I was importing wordcloud so that we could visualize the text data. Before that, we have to merge all the text together (after this word cloud we will explore the real news). fake.text is a single pandas Series; calling .tolist() converts it into a list, where each item is one row of the Series. But to feed the data into the word cloud we need one single string, so this list has to be merged into a single text; if you check the type you will see it is a list, and we don't want a list. For that I use join. It works like this: say you have a list of words; ' '.join(list) joins the elements with a space, and if you put a comma instead, they are joined with commas. So that's what I do here: I join all the text data with a space. Now that we have the single string, we can use the word cloud.
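A sketch of the subject plot and the join trick, following the steps just described:

```python
# Bar chart of article counts per subject; enlarge the figure so labels don't overlap
plt.figure(figsize=(10, 6))
sns.countplot(x='subject', data=fake)
plt.show()

# Merge every article into one big string for the word cloud
text = ' '.join(fake.text.tolist())

# The same join idea on a tiny list:
words = ['fake', 'news', 'detection']
print(' '.join(words))   # fake news detection
print(','.join(words))   # fake,news,detection
```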
To build it I write wordcloud = WordCloud(...). It accepts a font_path input, which we don't need; it also takes width and height parameters, which I will leave at their defaults for now (you can change them if you want). I keep most of the settings as they are and call .generate(text) on the text we built. Then I display it with plt.imshow(wordcloud): the word cloud object is in the wordcloud variable (lowercase, unlike the WordCloud class), and imshow renders it as an image. I use the default settings, but you can definitely change things: the background color, the width, the face color (the text colors), even the edges of the word cloud. Since the data is quite large it takes time; it may take around 20 to 30 seconds to plot.

Now the word cloud is plotted, but it is quite small. To increase its size we use figure = plt.figure(figsize=(10, 20)), the width and height in plot units; in fact instead of 10 and 20 I'll use 10 and 10. Apart from that, remember the WordCloud itself has a width parameter, which I set to 1920, and a height parameter, which I set to 1080. You will also see axes and a grid being drawn, which I need to switch off: plt.axis('off') means don't plot any axes, plt.tight_layout(pad=0) means there won't be any padding, and finally plt.show() hides the text output. Run it; this one is awesome. A word cloud is plotted, and the words with a larger font size are the words that occur most frequently: donald trump, said, say, now, one, even, democrat, american, republican, featured image, obama, country, think, united states, president obama, twitter. All of those words occur so frequently that they dominate the cloud. That is how you can visualize the text, and this was the fake news.

Similarly, we can explore the real news dataset. I read it with real = pd.read_csv(...): remember, there are two files in this dataset, one fake and one true, and this time I take the URL of True.csv and paste it in. With the real news loaded, I once again put all the text together as I did before; I just copy the same code and replace fake with real, since everything else is the same (I'll skip repeating the subject analysis to save time, but you can definitely do it to understand this dataset). So real.text.tolist() produces the list of texts, I join and generate again, and after 20-30 seconds it shows a similar word cloud.

In the real news you will see that most of the words are almost the same: donald trump, president donald, united states, barack obama, and so on. But if you look carefully at the two clouds, you'll notice white house is not mentioned as often in the fake one, and here you also get north korea and, clearly, new york. The main thing hiding inside this data, though, is "WASHINGTON (Reuters)". That Reuters tag is the source of the news, and it is not mentioned in our fake news. So you see the difference: a fake story of course carries no source, but a real story has a reference to its publication, here Washington/Reuters. This alone could be one indicator of whether a particular news item is real or fake. And that is what we are going to do in this lecture series: use an LSTM and deep learning models to evaluate, based on the text of the news, whether a particular item is fake or real.
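Putting the word-cloud steps together, roughly as run in the video (the True.csv URL is again a placeholder):

```python
# Large word cloud from the merged fake-news text
wordcloud = WordCloud(width=1920, height=1080).generate(text)

plt.figure(figsize=(10, 10))
plt.imshow(wordcloud)     # render the cloud as an image
plt.axis('off')           # no axes or grid
plt.tight_layout(pad=0)   # no padding around the image
plt.show()

# The real news is loaded and visualized the same way
# (placeholder URL: use the raw link of True.csv in the repository)
real = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/fake-real-news-dataset/main/True.csv')
```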
Perfect. So, the differences (let me just delete the exploration cells): if we evaluate the whole dataset, real news versus fake news, these are the few differences we find. The Reuters information ("WASHINGTON (Reuters)") is present in the real news; some texts are tweets from Twitter; and a few real-news texts do not contain the publication information at all. We need to handle those cases too, so let's get started with cleaning the dataset.

First of all, we are going to remove the Reuters information from the text. This Reuters tag appears in the real text data: real.sample(5) shows five random rows, and there you can see the publisher information, WASHINGTON, ADDIS ABABA, and so on. We are going to extract this information from the text. First we create a list of the indexes that do not have publication information. How? We split each text at the first hyphen: if the Reuters information is there, the part before the hyphen is the publisher, and if there is no hyphen we won't get that information. So I write a small routine to evaluate this: unknown_publishers = [], then for index, row in enumerate(real.text.values):. (real.text.values gives an array; enumerate iterates over the texts, returning the index and the row.) Inside the loop I use a try block. Why try? Because if there is no hyphen in the text, this will raise an exception, and if it raises an exception we know there is no hyphen, meaning no Reuters information is present. So inside try I write record = row.split(' - ', maxsplit=1); notice there is a space on each side of the hyphen, and maxsplit=1 means there will be only one split. As I said, if there is no hyphen, accessing the second piece will raise an error. And if there is a hyphen, we still check whether this particular item actually came from a tweet, since tweets are short, roughly under 260 characters: I assert that record[0], the part before the hyphen, is shorter than 260 characters, so if it is a whole tweet rather than a short publisher tag, the assertion raises an error, which is an exception, so control goes to the except block. Inside except, either the indexing raised an error or the assertion did; either way, if we land here we know the row is either tweet data or has no hyphen, meaning it has an unknown publisher, so I do unknown_publishers.append(index). Run it; it first complains that "enumrate" is not defined (an extra letter had crept into enumerate), and after fixing that we have our unknown publishers.

Let's see how many there are: wrap the list in len() and you will see there are 31 unknown publishers in total, meaning 31 rows of text. Now let's look at the text of these rows: real.iloc[unknown_publishers] selects them, and adding .text gives just the text column. These are the texts with an unknown publisher. One thing to note: there is no guarantee that there is always exactly one space before and after the hyphen. Just to check, splitting without the space gave a total of 16 and with one space a total of 22, so the pattern matters; for now we have 31 unknown publishers in total.

Since most rows do have the publisher information, let's extract it from the text; we already have the unknown-publisher indexes. I set publisher = [] (an empty list) and tmp_text = [] (the temporary text list, which also has to cover the unknown publishers). I'm going to create a new column for the publisher information, and since a few rows do not have it, I handle those too: for index, row in enumerate(real.text.values):, and if the index is in unknown_publishers, I append the row (that particular text) to tmp_text, append 'Unknown' to publisher, and continue. Instead of continue we can also write else: if the index is in the unknown publishers do the first branch, and if it is not, i.e. the publisher is known, do this one: record = row.split(' - ', maxsplit=1), the same split as before. If we split like this, the first element is the publisher itself and the second element is the text, so I do publisher.append(record[0]), the 0th element being the publisher, and tmp_text.append(record[1]).

Let's run it... it says "list index out of range". The reason is that we are not capturing the data correctly here; let me just debug this code and get back to you. Well, a small mistake in the code can hurt a lot. I debugged it and found the error, and you only need to change one thing; the rest stays the same (I added a few cells while debugging that you don't need). You just need to come back to the split and make sure you append record[1] there. You might ask: we are not printing anything, so why does this matter? Because if any of these rows does not have the text part (remember, after the split the second part is the text and the first is the publisher), then accessing the second part throws an error, and with that error the row is saved as an unknown publisher instead. After this you will see the count change from 31 to 35.

One more thing you will notice: row number 8970 is empty, so we need to remove that row as well. Before dropping it we inspect it with real.iloc[8970]; there is no text inside this row, so we drop it with real.drop(8970, axis=0) (axis=0 means drop a row) and assign the result back: real = real.drop(8970, axis=0). Now the rest is the same, just run it... and it seems we are still getting an error; wait a second. Once again I did a little debugging, and I'm sorry, this kept happening, but I finally found where the error was coming from: earlier the split pattern was a hyphen with surrounding spaces, so I removed the spaces, and I also reduced the length threshold from 260 to 120. So you need to come back and make just two changes: remove the spaces from the split pattern, and use 120. After that, you will see there are 221 unknown publishers (these are their texts), and the empty row we have already removed, so we don't need to run that again. For the rest, come back to the extraction loop, remove the spaces there too, and finally add a .strip() so stray whitespace around the pieces is trimmed. Run it, and now there is no error; it has run successfully after so many tries; I'm really sorry for that.

Now I create the new columns: real['publisher'] = publisher and real['text'] = tmp_text. That means we have replaced the text column with the cleaned tmp_text and stored the publisher data alongside it. Then I check real.head() and real.shape: the shape shows 21,416 rows in total. This tmp_text comes from our loop; each row had two things, the text and the publisher information, and we have now separated them. Many rows have an unknown publisher, but their text is kept.
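Here is a consolidated sketch of the whole publisher-extraction stage with the fixes arrived at after debugging (bare-hyphen split, 120-character threshold, strip(), empty row dropped first); treat it as a reconstruction of the final cells, not a verbatim copy:

```python
# Drop the row whose text turned out to be empty (found while debugging)
real = real.drop(8970, axis=0)

# Pass 1: collect positions whose text has no short "PUBLISHER-..." prefix
unknown_publishers = []
for index, row in enumerate(real.text.values):
    try:
        record = row.split('-', maxsplit=1)
        record[1]                       # IndexError if there is no hyphen
        assert len(record[0]) < 120     # a very long prefix is a tweet, not a publisher tag
    except:
        unknown_publishers.append(index)

# Pass 2: split publisher and body apart
publisher = []
tmp_text = []
for index, row in enumerate(real.text.values):
    if index in unknown_publishers:
        tmp_text.append(row)            # keep the text as-is
        publisher.append('Unknown')
    else:
        record = row.split('-', maxsplit=1)
        publisher.append(record[0].strip())
        tmp_text.append(record[1].strip())

real['publisher'] = publisher
real['text'] = tmp_text
```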
Now let's check whether the fake news has empty text the way the real news did. I take fake.text.tolist() to convert the column into a list, and then run a list comprehension over it, checking whether each text is empty. I also want to know the index, so I use enumerate, and I check str(text).strip() so that any whitespace is removed first; if the result is empty, I keep the index. This gives us the list of indexes of rows with empty text; it is quite a long list, so I store it as empty_fake_index. Then fake.iloc[empty_fake_index] shows those rows: in the fake news these texts are indeed empty, and we need to handle them before we can feed the data into our deep learning model. The text of these rows seems to be present in the title itself, so we are going to merge the title and the text together.

I preprocess it like this: real['title'] + ' ' + real['text'], the title first, a space, then the text (I swapped the order so the title comes first), and I assign the result back into real['text']. We do the same for fake, so for both the real and the fake news the titles and texts are now joined. Next we will preprocess the text, but before that I want to convert it to lowercase, since a lot of it is currently in capital letters. For that I use real['text'].apply(lambda x: str(x).lower()), which lowercases everything, and I do the same for fake, assigning back into the text column in both cases. Run it, and the text is lowercase.

Now, as you know, there is no label for supervised learning in our real and fake text data, so we have to add one: in real I add a class column equal to 1, and in fake I add class equal to 0. With the class added to both, it is time to combine the two frames; we only need the text column and the class column. If you list the columns of real you see title, text, subject and so on, but we only want text and class, so I do real = real[['text', 'class']] and the same for the fake DataFrame (I said "database" by mistake; I meant DataFrame), which now also has just text and class. Then I append them together, real.append(fake), and finally I ignore the indexes while appending with ignore_index=True. Run it and the real and fake rows are combined; I store the result in data. Looking at some random rows of this dataset, you can see fake news and real news mixed together.

In the next stage of preprocessing we are going to remove special characters, dots and other punctuation; for that we will use the preprocess_kgptalkie package from my GitHub repository, laxmimerit/preprocess_kgptalkie.
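A sketch of the empty-text check and the merge-and-label cells just described (note DataFrame.append was removed in pandas 2.0, as flagged in the comment):

```python
# Indexes of fake rows whose text is empty (the title still carries the story)
empty_fake_index = [index for index, text in enumerate(fake.text.tolist())
                    if str(text).strip() == '']
fake.iloc[empty_fake_index]

# Prepend the title to the body, then lowercase everything
real['text'] = (real['title'] + ' ' + real['text']).apply(lambda x: str(x).lower())
fake['text'] = (fake['title'] + ' ' + fake['text']).apply(lambda x: str(x).lower())

# Labels for supervised learning: 1 = real, 0 = fake
real['class'] = 1
fake['class'] = 0

# Keep only the two columns the model needs, then stack the frames
real = real[['text', 'class']]
fake = fake[['text', 'class']]
data = real.append(fake, ignore_index=True)  # pd.concat([real, fake], ignore_index=True) on pandas >= 2
```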
Open the repository page and scroll down; you will see the dependencies for this package, which we also need to install, and then the install command for the package itself, which you can copy from there and paste into the notebook. We prefix the commands with an exclamation mark so that they run as shell commands inside Google Colab. It installs the dependencies and then the preprocess_kgptalkie package. Once it is installed, we are going to use one of its methods, the one that removes special characters. We import the package as ps, run the cell, and then use it like this: data['text'].apply(lambda x: ps.remove_special_chars(x)). To see how it works, pass it a string like "this is great!" with punctuation in it, run it, and you will see all the special characters removed from the data. That is what the preprocess package does for us, and I assign the result back into data['text']. With that, the preprocessing with the preprocess_kgptalkie package is done. Superb.

Now we have reached the vectorization topic. Currently our data is text, and we have to convert it into numerical data; for that we have the vectorization technique known as Word2Vec. If you feed a word into Word2Vec, the word is converted into a vector of numbers; that mapping is the word-to-vector conversion. You can also choose how many columns, that is, how many dimensions, each vector has: you could choose 4 dimensions, or 300, or however many you want. Word2Vec is a fairly recent technique; it was introduced at Google in 2013.

Let's do our Word2Vec conversion; for that we use the gensim library. First we read our target out of data: y = data['class'].values gives the array for y, which is the class, the target. Then we need the text as lists of words, because that is the input gensim expects: data['text'].tolist() gives a list where each element is one document, and each document string still needs to be converted into an actual list of words; we need to create a list of lists. For that I use a list comprehension with a for loop: for each document d in the list, d.split() converts it into a list of words.
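A sketch of these cells, assuming the package exposes a remove_special_chars helper as used in the video (check its README for the exact name and install command):

```python
# Hypothetical install command; copy the exact lines from the repository README
# !pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

import preprocess_kgptalkie as ps

# Strip punctuation and other special characters from every document
data['text'] = data['text'].apply(lambda x: ps.remove_special_chars(x))

# Target vector: 1 for real, 0 for fake
y = data['class'].values

# gensim wants a list of token lists, one inner list per document
X = [d.split() for d in data['text'].tolist()]
```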
This may take a moment to process; meanwhile I assign the result as X. Now everything is converted into a list of lists: type(X) shows a list, and the type of the first element is also a list. If you want to see the first document's words you can index X[0], and you can use print to show everything. This X now needs to be fed into our gensim Word2Vec model.

I set the dimension, DIM = 100, meaning each word will be converted into a vector of 100 numbers. Then w2v_model = gensim.models.Word2Vec(...), where I feed sentences=X (the list of lists), size=DIM, and a window. The window says how much context is used when building the vectors: a window of 5 means 5 words in a sequence are taken together, so words that occur together are learned as related; instead of 5 I'll use 10. I also set min_count=1, meaning that even if a word occurs only once, a vector is still generated for it. Run it; the training takes time to converge, so we will wait, and once the conversion completes we will continue the lecture.

The conversion is done. Now we can check how many words are in the vocabulary: len(w2v_model.wv.vocab) gives the total, and it says we have about 231,000 vectors, one per unique word. Run it and you can see the word-vector vocabulary. Let's explore these vectors a little more. I can get the vector for a particular word, say 'love' (there was a warning at first; we need to go through .wv): w2v_model.wv['love'] gives the array for love, and similarly you can do it for other words, like 'china', 'usa' or 'india'. These are the vectors we have created. You can also find the most similar words: the most_similar method returns the list of words most similar to the word you pass in as a parameter.
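A minimal sketch of the Word2Vec step. Note the size keyword matches gensim 3.x as used in the video; gensim 4 renames it to vector_size and replaces wv.vocab with wv.key_to_index:

```python
import gensim

DIM = 100  # each word becomes a 100-dimensional vector

w2v_model = gensim.models.Word2Vec(
    sentences=X,   # list of token lists
    size=DIM,      # gensim 3.x keyword; vector_size=DIM on gensim >= 4
    window=10,     # words within 10 positions of each other count as context
    min_count=1,   # keep even words that occur only once
)

print(len(w2v_model.wv.vocab))             # vocabulary size (~231,000 here)
print(w2v_model.wv['love'])                # the 100-number vector for one word
print(w2v_model.wv.most_similar('india'))  # nearest words in the vector space
```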
For 'india' it returns countries like pakistan and malaysia: when I ask for the words most similar to india, the words that are mentioned with india most often come up, like pakistan, malaysia, china, ghana, norway, modi, indian, australia, narendra and tunisia. Similarly, for 'china' you mostly get beijing, taiwan, chinese, pyongyang and beijing's, and for 'us' you get american, iranian, nato, iran and nato's. These co-occur very often, which shows how the words are correlated with each other: if modi appears, then narendra, india, abe and gujarat come up; for 'gandhi' you get related names, and likewise for 'trump'. That is how you get the most similar words with gensim.

Now our text can be converted into a set of vectors, and here we have two ways to train our model. One is to use these vectors directly; the other is to feed them as initial weights into the model and let the model re-create them. I have found that if we feed these vectors as initial weights and let the model retrain them, we get better accuracy. So we create tokenizer = Tokenizer() (which we already imported from Keras), then tokenizer.fit_on_texts(X) to fit it on the text data; the tokenization is done. Note that X is currently text; after this step the text is converted into sets of sequences, not vectors, just numbers like 1, 2, 3, 4, 5. So I do X = tokenizer.texts_to_sequences(X), and if you look at X now you will see it has become a list of integer sequences. What do these numbers mean? Each word is assigned a sequence number, and to see the mapping you can use tokenizer.word_index: run it and you will see the word index, like "the" mapping to 1, "to" to 2, "a" to 3, "of" to 4, and so on. Whenever "the" appears the tokenizer uses 1, whenever "to" appears it uses 2; it has simply replaced each word with its index. Let's comment that cell out. So we have our tokenizer and our set of sequences.
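The tokenization step, as described:

```python
# Map every word to an integer id, then turn each document into an id sequence
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)

# The mapping starts at 1, e.g. [('the', 1), ('to', 2), ('a', 3), ...]
print(list(tokenizer.word_index.items())[:5])
```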
Now let's analyze the text data we currently have in X. I'm going to plot a histogram with plt.hist, but first let me show you what goes into it: [len(x) for x in X] computes the length of each document. Why do we need these lengths? I want to know how many words are present in each row; each x is a sequence of words, so the length of the list gives the word count. I feed that into plt.hist, set bins=700, and call plt.show(). This shows a histogram of the total number of words per news item, and it says that most items have fewer than 1,000 words; you can see it in the plot. So we are going to clip, or truncate, every item that has more than 1,000 words in its text.

Let's see how many that is: I create a NumPy array, nos = np.array([len(x) for x in X]), and nos > 1000 marks the longer ones; the length of nos[nos > 1000] tells us there are in total 1,584 news items with more than 1,000 words. So we truncate at 1,000: I set maxlen (I first typed 700, but as discussed we are using 1,000), then X = pad_sequences(X, maxlen=maxlen); I had a lowercase x there, it should be the capital X. What we get is every sequence at length 1,000: when a sequence is longer than 1,000 it is truncated, and when it is shorter, zeros are added, meaning it is padded. If you now check the length of any element of X you will always get 1,000; check any index at random, it is always 1,000.

One more thing I wanted to tell you: if you look at the word index, it starts from 1, not 0. To make sure everything works correctly, I set the vocabulary size to the number of words plus one. We add the plus one because there can be words that do not appear among these tokens; the model will consider those an unknown word and needs another index for it. So vocab_size = len(tokenizer.word_index) + 1, with the +1 for the unknown words.

Now let's finally get the vectors, the weight matrix. As I told you earlier, we feed the Word2Vec vectors as the initial weights of our model, and the model retrains these weights to find the best accuracy.
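A sketch of the length analysis and padding:

```python
# Histogram of document lengths (in tokens)
plt.hist([len(x) for x in X], bins=700)
plt.show()

# Count the documents longer than 1,000 tokens
nos = np.array([len(x) for x in X])
print(len(nos[nos > 1000]))   # 1,584 in this dataset

# Truncate long documents and zero-pad short ones to a fixed length
maxlen = 1000
X = pad_sequences(X, maxlen=maxlen)

# +1 because the word index starts at 1 and unseen words need a slot
vocab_size = len(tokenizer.word_index) + 1
```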
For that I define a function: def get_weight_matrix(model):. (I first gave it a vocab parameter as well, but we already know the vocab size globally, so we don't actually need to pass it.) Inside, weight_matrix = np.zeros((vocab_size, DIM)) creates a matrix of zeros with one row per vocabulary word and DIM = 100 columns; this will be the weight matrix for the words. Then for word, i in tokenizer.word_index.items(): (i is the index, word is the word, going through the vocabulary item by item), and for each one I fill the corresponding row of the zeros array from the model: weight_matrix[i] = model.wv[word]; remember, the vectors live under .wv. Once that is done, the function returns the weight matrix.

Then I call it: embedding_vectors = get_weight_matrix(w2v_model). It complains "cannot interpret 100 as a data type"; just a second, let me check... np.zeros is there, the vocab size is there, the dimension is 100, let me see if DIM was modified somewhere... it wasn't. Ah, the problem is actually not that, I'm sorry: the shape passed to np.zeros has to be a tuple, np.zeros((vocab_size, DIM)), not two separate arguments. Now everything is okay and the embedding vectors are being calculated. Check embedding_vectors.shape: there is one row per word, and each row is a vector of size 100.

Everything is in place; now we only need to create our model, and it is quite simple: model = Sequential(), then model.add(Embedding(vocab_size, output_dim=DIM, weights=[embedding_vectors], input_length=maxlen, trainable=False)). The weights are the embedding vectors we just created from Word2Vec, input_length is maxlen, which is 1,000, and trainable is False.
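The weight-matrix function in its fixed form (the shape passed to np.zeros must be a tuple):

```python
def get_weight_matrix(model):
    # One row per vocabulary word, one column per embedding dimension
    weight_matrix = np.zeros((vocab_size, DIM))
    for word, i in tokenizer.word_index.items():
        weight_matrix[i] = model.wv[word]  # copy the pre-trained vector into row i
    return weight_matrix

embedding_vectors = get_weight_matrix(w2v_model)
print(embedding_vectors.shape)  # (vocab_size, 100)
```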
As I said earlier, if you want these vectors to be retrained, set trainable=True; since I put False, the model will not retrain them. (Earlier I said the model would retrain these vectors, but with False it will not; you can put True if you want. In machine learning there is no fundamental rule for such choices: try and test, and keep whichever works best for you.)

Now let's add the LSTM: model.add(LSTM(units=128)), with 128 units in the layer. Then finally the dense layer: model.add(Dense(1, activation='sigmoid')). We use a sigmoid since we have only two classes, 0 and 1; otherwise we would have had to use a softmax. Let's compile the model: model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']). (Accuracy is reported by default anyway, so we might not even need to set it, but anyway.) It first said that the weights keyword was not understood, and then the optimizer as well; I had typos there. After fixing them the model compiled successfully, and you can see model.summary() (sorry, sorry): it has the LSTM, the dense layer, and one embedding layer carrying the pre-trained weights we took from our gensim model.

Everything is in place; now we just need to do the train-test split and train: we have train_test_split, giving X_train, X_test, y_train and y_test, then model.fit(X_train, y_train, validation_split=0.3, epochs=6). We train for six epochs and then see how much accuracy we get. Run it; it takes some time, so we wait.

The training has completed, and we got almost 99% accuracy in detecting fake and real news. That is a huge accuracy level: 99.33% in training and almost 99% validation accuracy, so the model is not overfitting and we don't need to look for overfitting techniques, though you can still try changing some parameters and see if you can push it toward 100%. Still, 99% is a huge accuracy.

Now let's see how much accuracy we get on the test dataset. To check it we do y_pred = model.predict(X_test); since model.predict gives us probabilities, I convert them: if the probability is greater than or equal to 0.5 we convert it to 1, otherwise to 0, writing it with .astype(int).
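The five-line model and the training calls, assembled from the steps above:

```python
# Frozen Word2Vec embedding -> LSTM -> sigmoid output
model = Sequential()
model.add(Embedding(vocab_size, output_dim=DIM,
                    weights=[embedding_vectors],
                    input_length=maxlen,
                    trainable=False))
model.add(LSTM(units=128))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# Hold out a test set (default test_size=0.25), then train
X_train, X_test, y_train, y_test = train_test_split(X, y)
model.fit(X_train, y_train, validation_split=0.3, epochs=6)

# Probabilities -> hard 0/1 labels at a 0.5 threshold
y_pred = (model.predict(X_test) >= 0.5).astype(int)
```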
So we have y_pred; let's compute the accuracy score: accuracy_score(y_test, y_pred), where y_test is the true value and y_pred the predicted value. Run it and we still get a huge accuracy, about 98.90%, which is almost 99%. As you know, we took the default parameters in several places, for example the default test size of train_test_split, which is 0.25. This is a huge accuracy; congratulations. Let's also print the classification report: classification_report(y_test, y_pred). Precision is 99%, recall is 99%, F1-score is 99%; everything here is perfect, we have got a very strong model for our fake news detection.

Now you may ask: when you have your own text, how do you run this classification on custom text data? Say you have a string like "this is the news". If you try to run the model on this data directly you will get an error (I pasted it into the wrong cell at first; let me stop that), because the model does not work on raw text. To understand why, look at X_test: it is a sequence of word indexes, so you have to convert your own text into a sequence of indexes in the same way. If you remember, we used a tokenizer for that. I first tried tokenizer.texts_to_matrix with the text; it does generate a matrix, and a huge one, so the text did get converted into numbers, but that is not going to work here. You need to go back to where we actually did our tokenization: there we used texts_to_sequences, then we did the padding, and then we got our data; we have to do the same here as well. So instead of texts_to_matrix, I do x = tokenizer.texts_to_sequences([text]); now it is converted into sequences (I copy the result and paste it into a small x so I have everything in one place). Then we apply the padding: x = pad_sequences(x, maxlen=maxlen); maxlen is the same value we used before, which was 1,000 I believe. It pads the sequence, adding the zeros, and that is how the padding is done. Finally, once your x looks like X_test, you can run model.predict on it.
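Scoring and custom-text inference, as a sketch (the sample string is made up):

```python
print(accuracy_score(y_test, y_pred))          # ~0.989 in the video
print(classification_report(y_test, y_pred))

# Classify a new piece of text: same tokenizer, same padding as in training
sample = ['this is the news']                  # hypothetical input text
x = tokenizer.texts_to_sequences(sample)
x = pad_sequences(x, maxlen=maxlen)
print(model.predict(x))                        # < 0.5 -> fake, >= 0.5 -> real
```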
So we use that, and with our x the prediction says 0, which means this one is actually fake news; it is not real news. Moreover, you should feed it a reasonably large piece of text to judge whether a particular news item is fake or what kind of news it is. That is how you do it; and if you have a set of texts, or a text file, you just read the file, feed the data in here, and run the prediction. For the time being I'm just going to paste in a copied news item, a COVID news story. This is a real COVID news article, and I paste it in (the first copy wouldn't open, no worries; the text has to be fairly large, and we weren't getting a large enough piece, so let me try once more). I copied it, put it in, ran the prediction, and it says this one is real news, because, as you saw, I copied it from a real news publisher. We have got a perfect fake and real news detection system here; congratulations. I have just one request: please like and subscribe to this channel. Thanks a lot, bye bye, take care.
Info
Channel: KGP Talkie
Views: 10,164
Rating: 4.9589043 out of 5
Keywords: kgp talkie, kgp talkie videos, machine learning tutorials, kgp talkie data science, kgp talkie ml, kgp talkie for machine learning, fake news detection, fake news in python, fake news detection with deep learning, fake news with nlp, use lstm in fake news detection, fake news detection using python and lstm, accurate fake news detection, fake news detection using, fake news detection using machine learning, fake news detection using deep learning, fake news detection using nlp
Id: eLjs52-gsJQ
Length: 90min 38sec (5438 seconds)
Published: Sun May 02 2021