NLP Tutorial 8 - Sentiment Classification using SpaCy for IMDB and Amazon Review Dataset

Video Statistics and Information

Captions
Hi guys, welcome back to a new lesson. This is Laxmi Kant from KGP Talkie. Let me first show you the final result of this lesson. At the end of this lesson we will be predicting the sentiment of a sentence. For example, if I say "Wow, this is amazing lesson" and run it, it returns 1, which means this sentence has a positive sentiment. But if I say "Wow, this sucks" and run it, you will see a 0, which means a negative sentiment. If I run clf.predict on "Worth of watching it. Please like it", it says 1, and if I test one more, "Loved it. Amazing", you again see a 1. Perfect, this is the result we will be getting in this lesson.

So let's start from the beginning. First I will explain what NLP is and how we can do text data cleaning with spaCy and scikit-learn, then what bag of words and TF-IDF are. Then we will start the step-by-step, line-by-line coding: I'll show you how stop words work, how you can do lemmatization, part-of-speech tagging, syntactic dependency parsing, entity detection and so on, and then we will start text classification on the Amazon, IMDB and Yelp review datasets. After that I'll do some exploratory data analysis, then tokenization, then text data cleaning, then vectorization and feature engineering, and finally training and testing of the model, which gives the result I showed you at the start of this lesson.

If you have not watched my previous videos, you can go and watch them on my channel, KGP Talkie. On KGP Talkie I have made many videos on NLP, TensorFlow 2.0, classical machine learning and plain Python, and apart from that on feature selection, random forests, decision trees and more. You can watch those lessons, download the working files from the video descriptions and test them yourself. So without wasting time, let's go ahead and start this brand-new lesson.

Let's get started with this NLP tutorial on Amazon and IMDB review sentiment classification using spaCy. The first question that comes to mind is: what is NLP? Before that, if you do not know anything about NLP, I would recommend watching the basic NLP lessons on my channel KGP Talkie. There I have already done NLP tutorials 1 to 7, and this will be NLP tutorial number 8. In those tutorials I covered the basics of NLP, so this one is a little more advanced, but I will cover everything together in this lesson.

Natural language processing is the field of artificial intelligence concerned with the processing and understanding of human language. A computer only understands numeric values or numeric data; it does not understand humans directly. So in NLP we convert text or voice data into a numerical form, then train our models so that the computer can learn from that data and perform as we instruct it. There are many applications of NLP,
such as text classification, which we are going to do in this lesson, spam filtering, voice-to-text messaging, sentiment analysis, spelling and grammar checking (like Grammarly, a product you might have heard about), and chatbots; almost every company now uses a chatbot as the first line of customer support. Then there are automatic search suggestions, automatic search corrections, automatic review analysis systems which can pick out negative and positive reviews and let the customer-care team act accordingly, and machine translation like Google Translate, and so much more. Around 50 percent of all data is in text form, and text data is mostly unstructured, which is why working with natural language processing is a little difficult.

If you do not have the necessary libraries, you need to install them: pip install scikit-learn (if you are using Anaconda, scikit-learn comes automatically), then pip install spacy, and then you need to download an English language package for spaCy. There are three types of models available in spaCy: the small model, which is the basic one we will use in this lesson (it will not affect performance here), and the medium and large models. We use the medium or large model when we work with word vectors (word2vec). In this lesson we are going to use a bag-of-words model, but in coming lessons we will use word2vec models, and for those we will need the larger packages: en_core_web_md for the medium model or en_core_web_lg for the large one.
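For reference, here is a minimal sketch of the installation steps just mentioned, written as they would appear in a Jupyter notebook (the `!` prefix runs a shell command; the model names are the standard spaCy English packages):

```python
# Run these once to set up the environment:
# !pip install scikit-learn spacy
# !python -m spacy download en_core_web_sm   # small model, used in this lesson
# !python -m spacy download en_core_web_md   # medium model (word vectors), for later lessons
# !python -m spacy download en_core_web_lg   # large model (word vectors), for later lessons

import spacy

# Quick check that the small English model is installed and loadable.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
```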
So how will we be cleaning our text data? There are several things we can do with spaCy, like case normalization, where we convert everything to lower case; removing stop words, which are frequently occurring words like "and", "hence", "so", "that", "okay", "he", "hi", "hello" (these occur most of the time); and removing punctuation and special symbols, which also come up very frequently and do not add much information for text classification. Then we can do lemmatization or stemming. Stemming means stripping a word down to a shorter form by cutting characters from its end; for example, for "continuously" and "continuous", the "ly" gets cut off and "continuously" becomes just "continuous", because the two words convey the same information. Lemmatization means bringing a word down to its root form, and that root form is known as the lemma. For example, "run", "runs" and "running" all have the root form "run". Then we have part-of-speech tagging, with which we can find the part of speech of each word; with syntactic dependency analysis we can find out how the words are related to each other; and with entity detection we can find out whether any organization, geographic location, person and so on is mentioned in the text. Finally, we will be using bag of words and the TF-IDF formula to do the text classification, that is, the Amazon and IMDB review sentiment classification.

So let's understand what bag of words is. This is the simplest word-embedding technique. It is not used very much in industry, but as a learner, as a beginner, you should know what bag of words is and you should at least be able to build, train and test a model with it. How does it work? I have already explained bag of words in my previous videos, but I will explain it briefly here too. Say you have three documents: doc 1, doc 2 and doc 3. Bag of words makes one column per unique word across the documents and then counts how many times each word occurs in each document. If the unique words are, say, "I", "am", "hi", "is" and "kidding", then for each document we record the frequency of each of these words, and that frequency table is the bag-of-words representation.

How do bag of words and TF-IDF work together? If a word occurs most of the time, its frequency will be high, and with a plain bag of words that word will contribute more. That is the disadvantage, because a word occurring very often does not necessarily convey much information; for example, words that many people mention all the time do not tell us much about the topic being discussed. That is why TF-IDF was introduced, which means term frequency times inverse document frequency. The term frequency is how many times a particular word occurs in a particular document, and it is multiplied by the inverse document frequency, which depends on how many of the documents in the whole collection contain that word. The formula (as scikit-learn uses it) is simple: if you have n documents, then tf-idf(t, d) = tf(t, d) * idf(t), where idf(t) = log((1 + n) / (1 + df(t))) + 1, n is the total number of documents in your dataset and df(t) is the number of documents in which the term t occurs. That is TF-IDF.
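As a small illustration of the counting and weighting just described, here is a sketch using scikit-learn on a made-up three-document corpus (the documents are invented for the example; get_feature_names_out assumes scikit-learn 1.0 or newer):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great",
        "the movie was boring",
        "the acting was great and the story was great"]

# Plain bag of words: one column per unique word, raw counts per document.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: the same counts re-weighted with idf(t) = log((1 + n) / (1 + df(t))) + 1
# and each row length-normalized, so words appearing in every document
# (like "the" and "was") get lower weights than rarer, more informative ones.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```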
So how will spaCy fit in? We will use TF-IDF from the scikit-learn library, but we will also use spaCy's nlp module for cleaning our dataset, that is, the case normalization, removal of stop words and punctuation, lemmatization and so on that we just discussed. The nlp object in spaCy is organized as a pipeline: if we pass any text to it, it first tokenizes that text. Tokenization means breaking the text into pieces; it first splits the text on spaces and then splits off punctuation like commas, question marks, exclamation marks, periods and so on. Tokenization is always the first step, and after it come the tagger, the parser, NER, sentence segmentation and more. I have already covered these in detail in previous videos: you can disable a few components, enable a few, change the pipeline order and so on, so go ahead and watch those if you want the details.

Let's get started with the coding. First of all we need to import spacy, assuming you have already installed it. One more thing: once you install spaCy from the Jupyter command window or any command window, go to the Kernel menu and choose Restart & Clear Output. That way the Jupyter notebook makes sure the packages you just installed get linked to the new notebook session; otherwise it might cause problems. So import spacy, and also import displacy, which we will use to visualize our data (we will not need Matcher or Span here). Run this; the first time you run it, it might take a little while. Then we load the small English package from spaCy with nlp = spacy.load('en_core_web_sm'): "en" says load the English language library and "sm" says load the small model. There are multiple models; see the previous videos for the others. Now nlp is loaded.

Next, let's take a text we want to tokenize. I'll go through each of these steps first and explain them, and then I'll show you how to classify text using these spaCy nlp modules and scikit-learn's TF-IDF package. We start with tokenization. Say the text is "This is first sentence. And this is another one. Here third one." All we do is call doc = nlp(text). If you print doc you will not notice any change, it shows the same text as before, but the doc now contains a lot of information: tokenization, parsing, part of speech and so on. Let's print the tokens: for token in doc: print(token), and you can see each token printed out.
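Here is a compact sketch of the loading and tokenization steps just walked through (same model and example text as in the video):

```python
import spacy
from spacy import displacy  # used later for dependency and entity visualization

nlp = spacy.load("en_core_web_sm")

text = "This is first sentence. And this is another one. Here third one."
doc = nlp(text)  # runs the full pipeline: tokenizer, tagger, parser, NER

# Printing the doc shows the original text; iterating gives the individual tokens.
for token in doc:
    print(token.text)
```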
Now let's break the text into sentences. To do that we create a sentencizer: sent = nlp.create_pipe('sentencizer'). You see there is no sentencizer in the pipeline by default, so this is how we can add sentence segmentation; in fact we can add any module anywhere in between, before the tagger, between tagger and parser, between parser and NER, or after NER. So first create the sentencizer with nlp.create_pipe('sentencizer'), then add it to the pipeline with nlp.add_pipe(sent, before='parser'). We add it before the parser, because if parsing has already been done the sentencizer will not work properly; that is why the sentencizer has to go before the parser. Now get the doc again with doc = nlp(text) and print the sentences: for sent in doc.sents: print(sent). Now you see there are three sentences; it has broken the whole document down into three sentences, which you can verify are indeed three sentences. This is how beautifully spaCy works, and it works better and faster than the traditional NLTK.

Next, let's look at stop words. Why stop words? Because we need to remove stop words to reduce our dataset and also to reduce the effect of the most frequent, most repeated words. If you look at these three sentences, "is" occurs most of the time, "one" occurs two times, "this" as well, and these are actually stop words. So let's see what these stop words are and how many there are: from spacy.lang.en.stop_words import STOP_WORDS, and then stopwords = list(STOP_WORDS). Print the stop words and you see the most frequently repeated words in English, like "beyond", "hers", "anyhow", "seen", "before", "although", "therefore", "beforehand", "something". They do not carry much information, which is why they should be removed. Check the length: there are 326 stop words at the time I am making this video; if you are watching later, this count might have changed, since it can grow or shrink depending on the people maintaining the spaCy library.

Now let's remove these stop words from the sentences and see how many actual words are left. To do that we write: for token in doc: if token.is_stop == False: print(token); if the token is not a stop word, print it. Oh, it says that all the words are actually stop words! So let's make some changes to the text; I am going to add "Google" and a couple of other words.
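A sketch of the sentencizer and stop-word steps, written against the spaCy 2.x API used in the video; in spaCy 3.x you would call nlp.add_pipe("sentencizer", before="parser") with the component name instead of the object:

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

# spaCy 2.x style: build the component, then insert it before the parser
# so sentence boundaries are set before parsing runs.
sent = nlp.create_pipe("sentencizer")
nlp.add_pipe(sent, before="parser")

doc = nlp("Apple this is first sentence. And Google this is another one. Here third one.")
for s in doc.sents:
    print(s.text)

stopwords = list(STOP_WORDS)
print(len(stopwords))  # 326 at the time of the video; the list changes between releases

# Tokens that are not stop words (punctuation is still included at this stage).
print([token.text for token in doc if not token.is_stop])
```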
Our sentence is now "Apple this is first sentence. And Google this is another one. Here third one." Let's run these cells again. At first we still do not get anything; let me see why... oh, the spelling was wrong, my mistake. Now you see the tokens "Apple", "sentence", "Google" and "third". These four words remain, along with the punctuation, which we will remove later; the point is that these four words are not stop words, while everything else was. Remember how large the stop-word list was: a bag of words is mostly a sparse matrix that becomes very large, but if we filter out the unnecessary words, the bag of words becomes smaller, which takes less time to train your model, increases the accuracy and helps your model generalize. So that is how you remove stop words.

Now let's see how you do lemmatization. Say you have doc = nlp("run runs running runner"). Suppose the words in your text originate from the same root but have different forms: some in the present participle, some plural, some in the past participle. In that case spaCy's lemmatizer brings those variations down to the root form. Run it and print: for lem in doc: print(lem.text, lem.lemma_), which shows the real text next to its lemma. You see that "run" stays "run", "runs" becomes "run", "running" also becomes "run", and "runner" stays "runner". That means these four words have been reduced to just two distinct words, "run" and "runner". That is how lemmatization works.

Now let's look at part of speech. Take doc = nlp("All is well at your end!") and do part-of-speech analysis on it: for token in doc: print(token.text, token.pos_), where pos_ is the part of speech. Run it and it says "All" is a determiner, "is" is a verb (auxiliary), "well" is an adverb, "at" is an adposition, "your" is a pronoun, "end" is a noun and "!" is punctuation. So with this we get the part of speech, and with dependency parsing we can also understand the relations between the words, which we can look at with displacy as a graph. Before entity detection, let's display the dependencies using displacy: displacy.render(doc, style='dep'). Run this and you see that, along with the part of speech, it also shows how each word depends on the others; for example, "end" depends on "your", as in "your end".
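The lemmatization, part-of-speech and dependency steps above, as one small sketch (same example phrases as in the video; displacy.render shows the diagram inline in a Jupyter notebook):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# Lemmatization: variations of a word collapse to their root form (lemma).
for token in nlp("run runs running runner"):
    print(token.text, "->", token.lemma_)

# Part-of-speech tags for each token.
doc = nlp("All is well at your end!")
for token in doc:
    print(token.text, token.pos_)

# Dependency visualization (arcs between words); renders inline in Jupyter.
displacy.render(doc, style="dep")
```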
You can find the complete annotation details at spacy.io in the API annotation documentation on part-of-speech tagging; there is a complete list there. When we do the Twitter sentiment analysis and the Facebook post sentiment analysis, we will talk more about part of speech and dependencies.

Now let me show you how to do entity detection. I already have a fairly long text here, a small news item from the New York Post. If you run it and look at the doc, it says that New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak. Now I am going to do entity detection on this news using displacy. I have already covered named entity recognition in my previous videos (for example in tutorial number 4), so you can watch those for the details, but in short, call displacy.render(doc, style='ent') and run it. Once you run it you get a fantastic, very beautiful named-entity visualization. It says New York City is a GPE, that is, a geopolitical location; Tuesday is a DATE; "at least 285" is a CARDINAL, that is, a number; September is a DATE; Brooklyn is a GPE; Williamsburg is also a GPE; "four" is a CARDINAL; Bill de Blasio is a PERSON (he is the mayor); Tuesday is a DATE; "Orthodox Jewish" is a NORP, that is, nationalities, religious or political groups; "six months old" is a DATE; and there is a MONEY entity as well. It looks beautiful and you can analyze the text easily.
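A sketch of that entity-detection step; the snippet below is a shortened stand-in for the New York Post text used in the video:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

news = ("New York City on Tuesday declared a public health emergency and ordered "
        "mandatory measles vaccinations amid an outbreak, Mayor Bill de Blasio said.")
doc = nlp(news)

# Highlighted entities (GPE, DATE, PERSON, CARDINAL, ...) inline in Jupyter.
displacy.render(doc, style="ent")

# Or inspect them programmatically.
for ent in doc.ents:
    print(ent.text, ent.label_)
```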
Now I am going to start the text classification, finally. You have got the overall idea of how we are going to work: we will do lemmatization, and we will also remove stop words and punctuation. But first let's understand the data we are going to use. I already have this dataset; I downloaded it from Kaggle (it is the UCI Machine Learning sentiment-labelled sentences dataset), but I will upload it to my GitHub and you can get the link from the video description. I will push the link for this Jupyter notebook as well as for the data, so you do not need to download it from anywhere else. There are three files. The first is the Amazon cells labelled file; if you open it, you see a review and a sentiment. "So there is no way for me to plug it in here in the US unless I go for a converter" is a negative review, so it is 0. "Good case, excellent value" is 1, that is, positive. "Great for the jawbone" is positive, while "Needless to say, I wasted my money" is a negative review. So that is the Amazon file. Similarly we have the IMDB file with its 0 and 1 labels, and these are tab-separated files, and then we have the Yelp labelled file. These are the three files we will be reading, and finally we will combine them for the final analysis.

Let's first import the necessary libraries: import pandas as pd, and then several things from scikit-learn. From sklearn.feature_extraction.text import TfidfVectorizer (if you have seen my first video, I explained the TF-IDF vectorizer there), from sklearn.pipeline import Pipeline, from sklearn.model_selection import train_test_split, and from sklearn.metrics import accuracy_score, classification_report and confusion_matrix. These are the five things we will need during our text classification.

I have shown you the three datasets, so let's read them all and do some exploratory analysis. I am going to read the Yelp data first: data_yelp = pd.read_csv('datasets/yelp_labelled.txt', sep='\t', header=None). I need to define the separator as a tab, because this is not a comma-separated file, and set header=None, otherwise the first row becomes the header. After this, look at the first few lines with head(): this column is the review and this one is the sentiment. Now I create the column names, columns_name = ['Review', 'Sentiment'], and assign them with data_yelp.columns = columns_name. Now data_yelp.head() shows Review and Sentiment in the header of the pandas DataFrame. Check the shape with data_yelp.shape: it has 1,000 reviews and two columns, the first column is the review and the second is the sentiment of that review.

Now we read the Amazon data as well: data_amazon = pd.read_csv('datasets/amazon_cells_labelled.txt', sep='\t'), then data_amazon.columns = columns_name. Looking at the first five lines and the shape, it says there are only 999 reviews; I think we missed the header argument, so we should add header=None. Now we get the full 1,000. Great. Similarly, we are going to read the data for IMDB; we have read the Amazon and Yelp labelled files, so let's read the IMDB labelled file next.
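Here is a sketch of the imports and the three file reads described above, including the IMDB read covered next; the file names follow the UCI sentiment-labelled-sentences dataset and the datasets/ folder layout used in the video, so adjust the paths to wherever you keep the files:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

columns_name = ["Review", "Sentiment"]

# Tab-separated files with no header row: review text in column 0, 0/1 label in column 1.
data_yelp = pd.read_csv("datasets/yelp_labelled.txt", sep="\t", header=None)
data_amazon = pd.read_csv("datasets/amazon_cells_labelled.txt", sep="\t", header=None)
data_imdb = pd.read_csv("datasets/imdb_labelled.txt", sep="\t", header=None)

for df in (data_yelp, data_amazon, data_imdb):
    df.columns = columns_name

print(data_yelp.shape, data_amazon.shape, data_imdb.shape)  # (1000, 2) (1000, 2) (748, 2)
```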
For the IMDB data: data_imdb = pd.read_csv('datasets/imdb_labelled.txt', sep='\t', header=None), with the tab separator and header=None as before, and then data_imdb.columns = columns_name, so the column names are assigned. Check the shape of this data: it has 748 rows in total. Looking at the first few lines, these are the reviews and these are the sentiments.

So we have three DataFrames: data_yelp, data_imdb and data_amazon. Now we need to combine all the data into a single pandas DataFrame. I call data_yelp.append(), appending first data_amazon and then data_imdb, with ignore_index=True. Now the shape of the combined data should be more than 2,000, and indeed it is (2748, 2): 2,748 rows and two columns. With this we have our new DataFrame, data. Looking at the head, the first five rows are from data_yelp, then data_amazon is appended, and finally data_imdb is appended into data.

Let's check the distribution of the sentiment with data['Sentiment'].value_counts(): we have 1,386 positive reviews and 1,362 negative reviews in total. Then we check whether there is any null value with data.isnull().sum(): there are no null values present in the dataset. Perfect.

Let's add a few empty cells here and move on to tokenization. Our data is ready; as I told you earlier, we will do tokenization, then remove stop words and punctuation, then do lemmatization, and convert all the text to lower case. For that I first import string and get the punctuation characters from it: punct = string.punctuation. Print punct and you see the punctuation characters.
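A sketch of combining the three frames and the quick checks that follow; the video uses DataFrame.append, which recent pandas versions have removed, so pd.concat is shown here as the equivalent:

```python
import string
import pandas as pd

# Stack the three review sets into one frame and renumber the index
# (assumes data_yelp, data_amazon and data_imdb from the previous step).
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)

print(data.shape)                        # (2748, 2)
print(data["Sentiment"].value_counts())  # 1386 positive, 1362 negative
print(data.isnull().sum())               # no missing values expected

punct = string.punctuation               # the punctuation characters to strip out later
```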
Now I am going to write a function for data cleaning. How will the data cleaning work? It will check whether each word is a stop word, as we did earlier in this lesson, whether it is punctuation, and whether it is a pronoun; the words that survive will be converted to lower case. So I write the function: def text_data_cleaning(sentence), which takes a sentence. Oh sorry, I first need to write the body before I can run it. First we get the doc: doc = nlp(sentence). Then I make a list of tokens and do the lemmatization: tokens = [], then for token in doc, check token.lemma_ (do you remember .lemma_? That is the lemma, the root form of the word). If the root form of the word is not a pronoun, that is, if token.lemma_ != "-PRON-", then convert it to lower case: temp = token.lemma_.lower().strip(), which converts the lemma to lower-case text and strips off any extra characters. If it is a pronoun, there is no real lemma for it (spaCy gives pronouns the placeholder "-PRON-"), so in the else branch I just take the lower-case form of the token itself: temp = token.lower_. Every time, I append temp to the tokens list: tokens.append(temp). With this we have lemmatized the text and converted it to lower case.

After this I clean the tokens: cleaned_tokens = [], then for token in tokens, check whether it is a stop word or punctuation and remove those: if token not in stopwords and token not in punct (do you remember stopwords, with its more than 300 entries, and punct?). Only if the token is not in the stop words and not in the punctuation do we add it: cleaned_tokens.append(token). Finally, return cleaned_tokens. Let's run this; that is the complete function. For testing, I call text_data_cleaning("Hello all, how are you? Like this video") and once I run it you see ['hello', 'like', 'video']. "How are you" and "all" are very common words present in the stop-word list, so they are removed, and the punctuation has been removed too; the extra spaces are also gone. Only the necessary words, the tokens, are returned from text_data_cleaning. So in the first loop we have done lemmatization and lower-casing, and in the second loop we have removed stop words and punctuation.
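Putting that cleaning function together as one self-contained sketch; the "-PRON-" check matches spaCy 2.x, where pronouns get that placeholder lemma (in spaCy 3.x pronouns have ordinary lemmas, so that branch matters less):

```python
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")
stopwords = list(STOP_WORDS)
punct = string.punctuation

def text_data_cleaning(sentence):
    """Tokenize, lemmatize, lower-case, and drop stop words and punctuation."""
    doc = nlp(sentence)

    tokens = []
    for token in doc:
        if token.lemma_ != "-PRON-":            # normal word: take its lemma
            temp = token.lemma_.lower().strip()
        else:                                   # pronoun: keep the lower-cased token itself
            temp = token.lower_
        tokens.append(temp)

    cleaned_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    return cleaned_tokens

print(text_data_cleaning("Hello all, how are you? Like this video"))
# expected output along the lines of: ['hello', 'like', 'video']
```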
Now we do the vectorization and feature engineering, that is, TF-IDF. For the classifier I am going to use an SVM, so from sklearn.svm import LinearSVC. And do you remember that I imported TfidfVectorizer and Pipeline at the start of this section? We will need all of those packages now.

First I create the vectorizer: tfidf = TfidfVectorizer(tokenizer=text_data_cleaning). If you press Shift plus double Tab you will see it takes a lot of inputs, like lowercase=True, preprocessor, tokenizer and analyzer. Into tokenizer I pass the function name, text_data_cleaning, so the tokenization will be done according to that function; that means this scikit-learn package is going to use a function written with the help of spaCy's functionality. Then I define the classifier we are going to use: classifier = LinearSVC().

Next, the X and y data: do you remember that data has two columns? X = data['Review'] and y = data['Sentiment']. Now we perform the train/test split: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42); the random state is fixed so that you get the same result. So we have X_train, X_test, y_train and y_test. Do you remember that I created a Pipeline in an earlier lesson? The pipeline helps us perform the same operations on the training and the test data; we do not need to prepare the two separately. First I check the shapes of the training and test sets; there are 2,198 samples in the training set, and X_train is still just raw text. When we pass the data through the TF-IDF vectorizer, the whole of X_train will be converted into a bag-of-words (TF-IDF) matrix.

Now I create the classifier pipeline: clf = Pipeline([...]), where I pass a list of tuples. The first operation is the vectorizer, so the first tuple is ('tfidf', tfidf), the tfidf I created above, and after that tuple comes the classification operation, ('clf', classifier). How it works: it will first do the vectorization and then the classification. Run this and we have our clf. Finally we call clf.fit(X_train, y_train). Because of the pipeline we also do not need to prepare the test data separately; we can directly call clf.predict(X_test). This is a fairly large dataset, so it might take a little time since my computer does not have a GPU; let it finish, wait a few seconds. Now it is done, and you can see the default parameters that were used: in the TfidfVectorizer the analyzer is 'word', we do not pass stop_words because we already handle that inside the tokenizer we defined (the tokenizer is our text_data_cleaning function), and for the LinearSVC the C value is 1.0.
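The vectorizer, split, pipeline and fit steps, gathered into one sketch (it assumes data and text_data_cleaning from the sketches above):

```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# TF-IDF vectorizer that tokenizes with our spaCy-based cleaning function.
tfidf = TfidfVectorizer(tokenizer=text_data_cleaning)
classifier = LinearSVC()

X = data["Review"]
y = data["Sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline vectorizes and classifies in one step, for training and prediction alike.
clf = Pipeline([("tfidf", tfidf), ("clf", classifier)])
clf.fit(X_train, y_train)
```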
So clf.fit has run. Now let's get the predictions: y_pred = clf.predict(X_test). Let's print the classification report with classification_report(y_test, y_pred); it says we have about 78 percent accuracy. Let's print the confusion matrix as well with confusion_matrix(y_test, y_pred). In the confusion matrix the y axis is the actual value and the x axis is the predicted value; 0 is the negative sentiment and 1 is the positive sentiment. Perfect.

Now let's test it on real-world sentences and see how it performs. If I say clf.predict(["Wow, this is amazing lesson"]), it returns 1, which means this is a positive sentiment. If I try "Wow, this sucks", it returns 0; "wow" is there in both, but "this sucks" is what makes the difference, so this one is a negative sentiment. And if I call clf.predict again with "Worth of watching it. Please like it", it says 1, which means a positive sentiment.

So that is all for this lesson. Thank you so much for watching this video. Please do not forget to like and subscribe to this channel so that you get updates directly in your inbox. Bye bye, have a nice day.
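To recap, the evaluation and sanity checks from the end of the lesson as one sketch (it continues from the pipeline sketch above; the 78 percent figure is what the video reports, so your exact numbers may differ slightly):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))  # around 0.78 accuracy in the video
print(confusion_matrix(y_test, y_pred))       # rows: actual 0/1, columns: predicted 0/1

# The pipeline handles cleaning and vectorization for new sentences too.
print(clf.predict(["Wow, this is amazing lesson"]))           # -> [1] positive
print(clf.predict(["Wow, this sucks"]))                       # -> [0] negative
print(clf.predict(["Worth of watching it. Please like it"]))  # -> [1] positive
```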
Info
Channel: KGP Talkie
Views: 14,019
Rating: 4.9730639 out of 5
Keywords: machine learning, data science, natural language processing, sentiment analysis, machine learning tutorial, learn data science, artificial intelligence, neural network, machine learning basics, machine learning tutorial for beginners, data science for beginners, sentiment analysis machine learning, sentiment analysis twitter, sentiment analysis tutorial, sentiment analysis python nltk, sentiment analysis in deep learning, sentiment analysis using naive bayes classifier
Id: cd51nXNpiiU
Length: 57min 33sec (3453 seconds)
Published: Mon Oct 28 2019