Complete Exploratory Data Analysis (EDA) on Text Data in Python | Text Data Visualization in Python

Captions
Hello everyone, welcome to the new section. In this section we are going to perform EDA on text data, and we will be working with the Women's E-Commerce Clothing Reviews dataset, which has many input parameters. We will first start with data preparation and data cleaning; after the cleaning we will start our feature engineering, and then our distribution plots with Plotly. We will see distribution plots in the form of univariate analysis as well as bivariate analysis, and then the unigram, bigram and trigram analysis of the words, trying to find the most frequently occurring words and their frequencies. Then we will look at part-of-speech tagging, that is, what types of words are being used in these reviews. That leads us to the bivariate analysis, where we will use multiple inputs, such as Division Name and polarity, to analyze the reviews and see how they are correlated with each other. Finally, in the last piece of the section, we will take two continuous variables and look at them as a distribution plot, a histogram and a joint plot. That is all for this lesson; thanks for watching, and I'll see you in the next lesson, where we will start this section.

All right, now let's go ahead and get started with exploratory data analysis on text data. In this lesson we will be taking the Women's E-Commerce Clothing Reviews dataset. This data is available on Kaggle — search for "Women's E-Commerce Clothing Reviews" — and it contains around 23,000 customer reviews and ratings. Its input columns are Clothing ID, Age, Title, Review Text, Rating and Recommended IND, which indicates whether the product is recommended or not.
After the recommended indicator it has Positive Feedback Count, and then Division Name, Department Name and Class Name. What we are going to do is copy this column description into the notebook, so that we can later understand our text data in a better way; I'll just paste it here on a few new lines. If you want to download this data yourself, let me also paste the Kaggle link here, although I have already included the file in the working directory as "Womens Clothing E-Commerce Reviews.csv".

Now let's go ahead and import the necessary packages, but before that, let me insert a few cells so that we can work here comfortably. The Python packages we will be using in this lesson: pandas, of course, so import pandas as pd, then import numpy as np, then import matplotlib.pyplot as plt, and import seaborn as sns. Finally, as you know, we put the %matplotlib inline magic here, which makes sure that all plots are drawn in the Jupyter notebook itself. Apart from that, let's zoom this notebook in a little so that you can follow the code on your mobile as well. We will also be using Plotly here, so first let's import it: import plotly as py, and then import cufflinks as cf. We will be using Plotly and cufflinks in offline mode, which means you also need from plotly.offline import iplot — and we have already imported cufflinks.
Now we are going to set this Jupyter notebook to offline mode, meaning the notebook can use Plotly and cufflinks without connecting to the Plotly service: py.offline.init_notebook_mode(connected=True), and then cf.go_offline(). With that we have Plotly and cufflinks in offline mode, and we are ready to get started. Before that, if you do not have Plotly and cufflinks on your computer, here is how you can install them: run pip install plotly, then pip install cufflinks, and we are also going to use TextBlob, so run pip install textblob as well. We will use TextBlob for part-of-speech tagging and also for sentiment analysis. You can simply run these cells by hitting Shift+Enter, and they will install the necessary packages. Now we are ready to go ahead, and I'll see you in the next lesson, where we will start importing the data and analyzing it.

All right, now let's go ahead and get started with the data import. We will use a pandas DataFrame to import the CSV file we saw earlier, Womens Clothing E-Commerce Reviews. So df — that's the DataFrame — equals pd.read_csv, the comma-separated-value reader. Type the first few letters of the filename and press Tab on your keyboard; it will auto-complete the file name, "Womens Clothing E-Commerce Reviews.csv". Then I set index_col=0. Now let's go ahead and see the first few lines of this DataFrame with df.head().
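Put together, the preamble of this lesson looks roughly like the sketch below. The offline-mode calls are the real Plotly/cufflinks API, but they only have a visible effect inside a Jupyter notebook, and the Kaggle CSV may not be on your disk, so both are shown as comments here and a tiny stand-in frame with the same columns is built instead:

```python
import pandas as pd

# In the notebook, the full preamble is:
#   import numpy as np
#   import matplotlib.pyplot as plt
#   import seaborn as sns
#   import plotly as py
#   import cufflinks as cf
#   from plotly.offline import iplot
#   py.offline.init_notebook_mode(connected=True)  # offline mode
#   cf.go_offline()
#   df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', index_col=0)

# Two stand-in rows with the same columns as the Kaggle file:
df = pd.DataFrame({
    'Clothing ID': [767, 1080],
    'Age': [33, 60],
    'Title': [None, 'Some major design flaws'],
    'Review Text': ['Absolutely wonderful - silky and sexy and comfortable',
                    'I had such high hopes for this dress'],
    'Rating': [4, 3],
    'Recommended IND': [1, 0],
    'Positive Feedback Count': [0, 0],
    'Division Name': ['General', 'General'],
    'Department Name': ['Intimate', 'Dresses'],
    'Class Name': ['Intimates', 'Dresses'],
})
print(df.head())
```

The stand-in values are illustrative, not rows from the real file; with the CSV in place, the commented read_csv line produces the same shape of frame.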
There we have Clothing ID, Age, Title, Review Text, Rating, Recommended IND, then Positive Feedback Count, then Division Name and Department Name, and apart from that one more, Class Name. These are the data we currently have with us. Now, there are a few columns which we do not actually need in this lesson, like Clothing ID and Title — they don't tell us anything useful here — so I'm going to drop them. I can write df.drop, and if you press Shift+Tab there you will see the first argument is actually called labels, so we pass the labels we want to drop: Title, and apart from that Clothing ID. We drop them along axis=1, that is, from the columns, and I also want the result to replace the current DataFrame, so inplace=True. Now if you run df.head() you can see we no longer have Title or Clothing ID in this DataFrame. Next, let's see whether there are any null values present in this dataset. We can check that with df.isnull(), which gives True or False for every cell, and if you want the total number of null values in each column you chain a .sum() onto it. It tells us that Age has no null values, but Review Text has 845 nulls. If Review Text has 845 null values, there is no need to include those rows in our data visualization, so we are going to drop them. Apart from that, there are 14 rows in Division Name, Department Name and Class Name which have null values, so first of all we are going to drop those rows.
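A minimal sketch of the column drop and the null check described above, using a small stand-in frame so the snippet runs without the Kaggle file:

```python
import pandas as pd

# Stand-in frame; in the lesson, df comes from read_csv.
df = pd.DataFrame({
    'Clothing ID': [767, 1080, 1077],
    'Title': [None, 'Pretty!', None],
    'Age': [33, 34, 60],
    'Review Text': ['Absolutely wonderful', None, 'I had such high hopes'],
    'Division Name': ['General', 'General', None],
})

# Title and Clothing ID tell us nothing useful here, so drop them.
df.drop(labels=['Title', 'Clothing ID'], axis=1, inplace=True)

# True/False per cell, then summed per column -> null counts.
print(df.isnull().sum())
```

On the real frame, the same .isnull().sum() call shows 845 nulls in Review Text and 14 in each of the three name columns.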
We drop the null values with df.dropna, passing a subset — the columns in which to look for nulls — and there we use Review Text. Apart from that, we also pass Division Name: notice that Division Name, Department Name and Class Name each have 14 nulls, so it is very likely that all three are null on the same rows. I'll just include Division Name, and then check how many values are still left null. I also set inplace=True here, so the DataFrame is replaced with the cleaned one. Then we run df.isnull().sum() again, and it says there are no null values left in this DataFrame, which means we are good to go ahead. Perfect.

Now let's go ahead and look at the review text and try to understand what kind of values it holds. First I take df['Review Text'], which is a Series; then I convert it into a list, and once it is a list I join all the entries together, so that the whole column is converted into one single piece of text. So now you see the entire column of reviews combined into one large text. There are a lot of special characters — the special characters are not going to bother us — but you can also see backslashes, and those backslashes might create a problem. And there are a lot of words in contracted form which we need to expand: for example "it's", which should become "it is", and "I'm", which should become "I am".
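The dropna step and the list-join into one big string can be sketched like this (stand-in rows again; in the real frame the same call removes the 845 null reviews and the 14 rows with null names):

```python
import pandas as pd

df = pd.DataFrame({
    'Review Text': ['Absolutely wonderful', None, "it's so pretty"],
    'Division Name': ['General', None, 'General'],
})

# The Division/Department/Class nulls sit on the same rows, so
# checking one of them alongside Review Text catches all of them.
df.dropna(subset=['Review Text', 'Division Name'], inplace=True)
print(df.isnull().sum().sum())   # prints 0 -> no nulls left

# Series -> list -> one single text, joined with spaces.
text = ' '.join(df['Review Text'].tolist())
print(text)
```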
Similarly, you can see something like a backslash, a single quote and an eight, then a double quote — that is, 5' 8", five feet and eight inches — and there are more like it. So we first need to convert these contracted forms into their expanded forms, and after that we can remove the backslashes. We will do those things in the next lesson; thanks for watching, and I'll see you there.

All right, now let's go ahead and get started with the text cleaning. As we saw earlier, there were a lot of words in contracted form, and we need to convert those contracted forms to their expanded forms. I have taken these words from Wikipedia, where both the contracted and the expanded formats are given: "i'm" expands to "i am", "ain't" to "am not", "aren't" to "are not", "can't" to "cannot", and similarly we have a lot of other contractions along with their expanded forms. Now we need to apply these contractions to the dataset, so that every contracted form in the data is converted to its expanded form. I'm going to do this with a lambda function — with lambda functions we can do these things very easily — but before that I need to define a method which performs the operation. So def cont_to_exp, for contraction to expansion, taking a text x as input. Inside it I first check if type(x) is str, because we should only operate when the text is a string; if it is a number, we cannot apply the contractions. Then I start replacing characters with x = x.replace.
First of all, I'm going to replace the backslash with nothing. As you see, if we put a single backslash in the pattern, it escapes the character immediately after it and the pattern means something else, which is not what we want; we need to write a double backslash, which Python treats as a single literal backslash character, and we replace that with the empty string. Then I write for key in contractions, and inside the loop value = contractions[key]. So I iterate over the contractions dictionary, and whenever a contraction occurs in the text we replace it with its value — this is a dictionary, so there is a key (the contraction) and a value (the expansion we need) — with x = x.replace(key, value). Finally, after the for loop, I return x; and in the else branch, when the data is, say, an integer, we return x directly as well. So there we have our method for contraction-to-expansion. Now let's go ahead and test it on some new text. I'll write a sample string — note that I use double quotes for the string so that I can use a single quote inside it — saying "i don't know what date is today", and then "i'm 5 feet and 8 inches". As soon as I want a double quote inside a double-quoted string, I need to escape it with a backslash, which makes sure the double quote is included in the string.
Let's go ahead and run it, and then apply this contraction-to-expansion method: we call cont_to_exp, pass in x, and print the result. Once you print it you see we have "i do not know what date is today. i am 5 feet and 8 inches". Now we are going to apply it to the text data we had earlier, using a lambda, and we also put the %%time magic command at the top of the Jupyter cell, so it reports the total time the cell takes to execute. Then df['Review Text'] = df['Review Text'].apply(lambda x: cont_to_exp(x)). How does it work? The review text is passed row by row into x, the x variable is handed to the method we defined earlier, and the final result is returned and stored back into that particular row of Review Text. Let's run it and see how much time it takes — it says there is an error; yes, there was a small typo — and then it took around 1.5 seconds, so we are done here. Once again we look at df.head() to see the first five lines. Perfect, let's run it. One thing you will notice is that there are still some backslashes inside the text, but otherwise most of the things have been corrected — a lot of the "it's" have been converted into "it is" — yet a few backslashes still appear there.
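The cleaning function from this lesson can be sketched as follows, with a handful of dictionary entries standing in for the full Wikipedia contraction list:

```python
# A few entries standing in for the full contraction list.
contractions = {
    "i'm": "i am",
    "don't": "do not",
    "aren't": "are not",
    "can't": "cannot",
    "it's": "it is",
}

def cont_to_exp(x):
    # Contractions only live in strings; numbers pass through untouched.
    if type(x) is str:
        # '\\' is one literal backslash character; strip it out.
        x = x.replace('\\', '')
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x

x = "i don't know what date is today. i'm 5 feet and 8\""
print(cont_to_exp(x))
# -> i do not know what date is today. i am 5 feet and 8"

# Applied to the whole column (with %%time at the top of the cell):
# df['Review Text'] = df['Review Text'].apply(lambda x: cont_to_exp(x))
```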
That happens because this is the raw representation of the text and we have not used print here; that is why a backslash is shown. We could use print, but the problem with printing the whole text is that we get an error — the IOPub data rate is exceeded, so the input/output operation is not executed. What we can do instead is take just a few characters: let's print the first thousand characters, and there you see we have got "5 feet and 8 inches". So that is all for this lesson; thanks for watching, and in the next lesson we will work on word lengths and a little more of the data cleaning part. I'll see you there.

All right, in this lesson I'm going to show you how to do feature engineering on text data: we will calculate a few features from the text before any visualization. This is part of text data preparation and cleaning, and it is really important whether you are just going to do visualization or you are going to write some machine learning code; either way you need these things. Now let's go ahead and do this feature engineering. First I'll look at df.head() again — we have Age, Review Text, Rating and all those things. We are going to calculate the total number of words present in each review text, and apart from that the number of characters, and then the average word length for each review. Average word length means how many characters a word has on average: for example, "silky" has five characters, and the words around it have three or four. So let's go ahead and calculate those features.
We will also calculate the sentiment polarity, using TextBlob to compute the sentiment of the text, but before that I need to import it: from textblob import TextBlob. There we have TextBlob, and then df.head() again. Now let's start with the polarity — the sentiment polarity for a particular text. We write df['polarity'] = df['Review Text'].apply(...), again applying a lambda function — we will be using lambda functions very frequently. So we have lambda x, and inside it TextBlob(x).sentiment.polarity, which returns the polarity into df['polarity']. Apart from that, I'm also going to compute the review length: df['review_len'], again applied over Review Text (just press Tab for auto-completion), with apply(lambda x: len(x)), so we are calculating the total number of characters in each review. Then we calculate the total number of words present in each review, in a column called word_count: again df['Review Text'].apply with lambda x. You see we have the complete text for each row; to count the words we need to split each row, and those rows can be split with x.split().
Then we take the length of that split list — the length of the words — so the total word count is calculated and stored in word_count. Finally, I'm going to calculate the average word length in each review, and for that we write a method: def get_avg_word_len(x), taking the row's text as x. Inside it, words = x.split(), then word_len = 0, then for word in words — we iterate over the words — and for each one, word_len = word_len + len(word), so the total character count across the words is accumulated. Remember, you could have done len(x) directly, but that would include the spaces, and we do not want to count the spaces, because words are not made of spaces; that is why we sum the lengths of the individual words one by one. Once the loop is done, we return the total divided by the number of words — there is no assignment needed there, just return word_len / len(words). len(words) is how many words there are, and word_len is the total number of characters in those words. Perfect, so let's run it and then apply it in a new column: df['avg_word_len'] = df['Review Text'].apply(lambda x: get_avg_word_len(x)). It will take a little time.
Now it has completed everything; let's see the first few lines with df.head(). You have got the polarity, and we have also got word_count and avg_word_len, which says how many characters a word is made of on average — in the first review it is 5.75 characters per word on average. Perfect. Now I'm going to show you a few random samples so you can sanity-check the polarity — in fact you can verify it right here. You see "Absolutely wonderful - silky and sexy and comfortable", and it says the polarity is 0.63, so yes, that is a strongly positive review. Another one has a polarity of just 0.07, which is basically a neutral review — neither positive nor negative — and if you read it you will understand: "I had such high hopes for this dress", meaning someone hoped this was going to be great, but it turned out just so-so, so there is just a neutral sentiment there. And with 0.55 it says "I love, love, love", so there is clearly positive sentiment. In short, if the polarity is +1 it is a very strongly positive sentiment, if it is -1 it is strongly negative, and if it is somewhere around zero it is a mostly neutral review. That is all for this lesson; thanks for watching, and in the next lesson we will start plotting the data. Bye-bye, see you there.

All right, now let's go ahead and get started with the distribution of sentiment polarity. We already have our df, in which we have calculated a few custom features: polarity, average word length, review length and word count.
We are going to apply our distribution analysis on the polarity first, and then continue with the other columns. We will use cufflinks and Plotly together: cufflinks is a Python package which binds the pandas DataFrame and Plotly together, and it is really easy to use. By default we can simply write df.iplot() and it will automatically plot all the numerical columns. Let me show you what df.iplot() does without selecting any particular column: everything is plotted, and the traces are heavily mixed together. So instead I'm going to select only the polarity column, so that we plot polarity alone. With this we can see that the polarity varies between -1 and +1, but in general it sits around 0.25 on the positive side, which suggests that most of the reviews are positive. Perfect. Now I'm going to change this into a histogram so you can see it more clearly. If you press Shift+Tab you will see that by default the kind is a scatter plot, so I change the kind to 'hist', and now it becomes a histogram. With this histogram the shape is already visible, but we are going to adjust a few things: I'll change the color, and the color I'm going to use is red. There we have the red color, but the red bars are not clearly separated, so I'm also going to set bins — which says how many bins the histogram will have — to 50. With 50 bins you will be able to visualize it much more clearly.
It shows that most of the reviews are concentrated somewhere around 0.17 to 0.22, and a lot of the reviews sit between about 0.12 and 0.32 — if you take just those four tallest bars you can already see that most of the customers are satisfied. There are still a few customers who are not satisfied, but the number of unsatisfied customers is really low in comparison to the satisfied ones. Perfect. Before ending this lesson, let me show you a few more things, like how to add titles. I'm going to set xTitle — the x label — to 'Polarity' (in fact, let's capitalize the P), then yTitle, the equivalent of the y label, which is 'Count', and apart from that the plot title, which I'll set to 'Sentiment Polarity Distribution'. Let's run it: now the plot carries the title Sentiment Polarity Distribution, with Count on the y axis and Polarity on the x axis. So that is all for this lesson; thanks for watching, I'll see you in the next one.

All right, now let's go ahead and get started with the distribution of review ratings and reviewers' age. In our DataFrame df, the Rating column represents the rating of a particular product by a particular reviewer. So I write df['Rating'], which returns a Series, and on this Series I call iplot — the Plotly iplot — with kind='hist'.
That is, a histogram plot. Once again I set xTitle, this time to 'Rating', and similarly yTitle to 'Count' — how many ratings there are for each rating value — and then a title for the plot: 'Review Rating Distribution'. Perfect, let's plot it. What difference do you see here? The default color is used, since this time I have not set a color, and that's okay. Most of the ratings are five-star, then we have four-star and three-star, and there are a few one-star ratings: those are the mostly unsatisfied customers, the fives are the highly satisfied ones, the fours you can call satisfied, and the threes are, okay, the neutral reviews. So that is how you do a review rating distribution. Now let's see the reviewers' age distribution and how you can do that: again df, then ['Age'], then .iplot, again with kind='hist', this time with bins=40, then xTitle='Age', yTitle='Count', and finally the title. The title says "Reviewers Age Distribution" — I'll just write "Dist" there. Let's plot it. Once again you will see that most of the reviewers who left a review are somewhere between 35 and 39 years old, and near that you see peaks at 30 and at 44, meaning most of the reviewers are between about 30 and 44 years old.
That means most of the people fall inside that roughly fifteen-year range, 30 to 44 years; these are the people who mostly left the reviews. Now let's see how to change the color of this plot. As we did earlier, you can set color='red'; you can also use yellow if you want, or magenta, or green — and the default color in Plotly is orange, like this. Similarly you can change the line color as well: set linecolor='red' and you will see the outline change to red; make it 'yellow' and it changes to yellow; make it 'black' and it is black. I don't know if grey is there — yes, grey is also there, and the default line color is grey. Perfect, so that is all for this lesson; thanks for watching, and in the next lesson I'll show you the distribution of review text length, word length and other things. Bye-bye, see you there.

All right, now let's go ahead and get started with the distribution of review text length and word count; we will also take a look at the average word length. Once again we start the same way, this time with review_len: df['review_len'].iplot(). If you run that you get the default plot, so I change the kind to 'hist'. It shows that the largest group of reviews is somewhere around 500 characters, while the others mostly fall between 100 and 300 characters; still, most review lengths are at or near 500 characters.
somewhere near 500, you can say. Now I'm also going to set the total bin size — otherwise we can leave the default — and then xTitle='review_len', yTitle='Count' again, and for the title I'll say 'Review Text Length Distribution'. Perfect. That's how you plot the review length distribution. Similarly, copy it, paste it, and change it to word count: we have the word_count column, so update the column name there, and the title becomes 'Word Count Distribution'. Let's run it. It says that most of the reviews have around a hundred words per review, and then the count decreases sharply once it reaches somewhere around 110; similarly, many of the other reviews start somewhere around twenty words and vary around twenty, thirty, forty. All right. Similarly, let's copy it and paste it once more for the average word length: there I'll put the avg_word_length column, and the title becomes 'Review Text Average Word Length Distribution'. It says that on average the word length is four — that is, a typical word is made up of around four characters.
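The review_len, word_count, and avg_word_length columns plotted here come from earlier feature engineering; one plausible way to derive them, with hypothetical review strings standing in for the real column, is:

```python
import pandas as pd

# Hypothetical reviews standing in for df["Review Text"]
df = pd.DataFrame({"Review Text": ["Love this dress",
                                   "Runs small but fits nicely"]})

# Character length of each review
df["review_len"] = df["Review Text"].str.len()

# Number of whitespace-separated words
df["word_count"] = df["Review Text"].str.split().str.len()

# Mean word length per review
df["avg_word_length"] = df["Review Text"].apply(
    lambda s: sum(len(w) for w in s.split()) / len(s.split()))
```

Each column can then be fed to the same histogram call as in the lesson.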
More precisely, roughly 85 to 90 percent of the reviews have an average word length somewhere between about 3.8 and 4.2 characters. Of course no single word has 3.8 characters; this is an average per review, so it says that for most reviews the average word length is around four, reading a little less or a little more. Still, there are a few reviews with an unusually high average word length; those could be wrong spellings, repeated characters, or genuine long words as well. Apart from that, let's look at the documentation once more. Copy the word count distribution cell, paste it, hit Enter to get some space, then click inside the iplot call and press Shift and double Tab. You will see the detailed documentation, and in it there is a kind parameter: it lists bar, box, spread, ratio, heatmap, surface, histogram, bubble, scatter — all the plot kinds we can use with Plotly here. All right, perfect. So this is all for this lesson; thanks for watching. In the next lesson I'll show you the distribution of department name, class name, and division name — the categorical values. See you in the next lesson. All right, now let's get started with the distribution of division, department, and class. Here we are going to use the groupby concept of a pandas DataFrame, together with the value counts we have used earlier: df.groupby(), and I'm going to group it by the department name —
that is, 'Department Name', with a space between the two words. This gives you a GroupBy object. What does it do? Look at the head of the DataFrame first: each row has a department name, and there are many departments. You can get their counts with df['Department Name'].value_counts(). It shows the departments that are available — Tops, Dresses, Bottoms, Intimate, Jackets, Trend, and so on — along with their counts. We can do the same thing with the groupby count(): it groups all the rows by department name and then counts, for every column, how many rows fall in each group. You can verify the numbers against value_counts(): Bottoms says 3,662, and it is the same for all the other departments as well. So either version works; by default I'm going to take the value_counts() one, copy it, paste it, and plot it with iplot(). By default that comes out as a scatter plot drawn as a connected line, since the points are joined together, so I'm also going to pass kind='bar', and then it is plotted as a bar plot. It says there are around ten thousand reviews for Tops, then Dresses and Bottoms — that is, the reviews mostly came for tops, dresses, and bottoms — then Intimate and Jackets, and finally Trend.
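A minimal sketch of the value_counts() step, with made-up department labels; the bar chart itself stays a comment since it assumes cufflinks:

```python
import pandas as pd

# Hypothetical department labels standing in for the real column
df = pd.DataFrame({"Department Name":
                   ["Tops", "Dresses", "Tops", "Bottoms", "Tops", "Dresses"]})

# Counts per department, sorted most-frequent first
dept_counts = df["Department Name"].value_counts()

# With cufflinks, the bar chart from the lesson would be:
# dept_counts.iplot(kind="bar", xTitle="Department", yTitle="Count",
#                   title="Bar Chart of Department Names")
```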
Now, apart from this, once again you can add the titles. We have yTitle='Count' — again, since this is a univariate plot — and then xTitle, where I'll say 'Department'. Then I'll also put a title; I'll make it 'Bar Chart of Department Names'. Let me try quoting it — a backslash is not going to work here, so I need to wrap the string in double quotes instead, and then it reads 'Bar Chart of Department Names', which makes sense. The same thing we are going to do with the division as well: I'll copy the cell, paste it, switch the column to the division name, and update the title to 'Bar Chart of Division Names'. Let's plot it — it says there is an error: there is no such column. What is the name? It's 'Division Name' — ah, my spelling was wrong, sorry. Now we have the bar chart of division names, with count against division. It says that most of the reviews are from the General division, then General Petite, and then the Intimates division. There is a little difference between these because every division has different departments under it, which is why the Intimates numbers shift a little. Now let's do the same thing for
the class — I think we are done with the division and the department, so we are left with the class. I'll use the 'Class Name' column, put 'Class' as the x title, and make the title 'Bar Chart of Class Names'. There we have many classes, like Dresses, Knits, Blouses, Sweaters, Pants, Jeans, and the most reviews come in the Dresses class, then Knits, then Blouses, Sweaters, and so on. All right, so this is all for this lesson; thanks for watching. In the next lesson we will start with the distribution of unigram data points. I'll see you in the next lesson. All right, now let's get started with the distribution of unigrams, bigrams, and trigrams. A unigram is just one word, a bigram is two consecutive words together, and a trigram is a collection of three consecutive words. To understand this with an example, let me first write some text: 'this is a test example'. In a unigram analysis you get the individual words: 'this', 'is', 'a', 'test', 'example'. In a bigram analysis you get pairs of consecutive words: 'this is', 'is a', 'a test', 'test example'. And similarly with trigrams, the word triples are: 'this is a', 'is a test', 'a test example'. So this is how you get the unigrams, bigrams, and trigrams. With this, we are now going to find the frequencies of these in the reviews. In a unigram analysis we analyze word by word, but in a bigram analysis we analyze a combination of words: if you want to find out which two words come together, you should use bigrams, and if you are looking for three words
that come together, then you should use trigrams — and similarly for any other n-gram. The new Python package we will be using here is scikit-learn, and it comes with Anaconda automatically, so if you have installed Anaconda you already have sklearn. There you write: from sklearn.feature_extraction.text import CountVectorizer. I'll show you how this CountVectorizer works — it counts the words, the unique words, you can say. Now let's work with CountVectorizer. I'm going to first define a small X here, and do remember that CountVectorizer takes its strings in the form of a list, so I'll write something like X = ['this is the list', 'this is the the list list list', 'this']. That's the X we have currently. Then vec = CountVectorizer(), and I fit it on this X. Now we have our vectorizer, and it says that it uses the 'word' analyzer, max_features is None, and the other parameters like stop_words are left at their defaults. Since we are doing unigrams, the ngram_range here is (1, 1), meaning it's a unigram. Then I store the bag of words: bow = vec.transform(X), passing in the same X. Let's check this bag of words: we have a sparse matrix, and it records the unique words — and indeed we have four unique words here: 'this', 'is', 'the', 'list'. Then we need the sum of the words, so let's go ahead and get
those sums: sum_words = bow.sum(axis=0), and now we have the total count of each word across the documents. Our next task is to get the vocabulary items, which you can get with vec.vocabulary_.items(). Be careful here: the vocabulary maps each word to its column index in the matrix, not its frequency. So when it shows 'this' with a 3, that does not mean 'this' occurs three times; it means its counts live at index 3 of sum_words. You can verify this: the index of 'this' is 3, and if you print sum_words, the value at index 3 is 4, because 'this' has occurred four times in X. To make it a little clearer, print them side by side: the word gives you the index, and the index position in sum_words gives you the frequency. So now what we are going to do is create a word-frequency structure, taking the vocabulary and sum_words together. I'm going to use a list comprehension: first a for loop — for word, idx in vec.vocabulary_.items() — so that we have the word and its index together. Then I look up the word's frequency with sum_words[0, idx], since sum_words is a one-row matrix, and apart
from that, I also keep the word itself, so the comprehension pairs each word with its frequency: [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]. At first it said this was invalid syntax — yes, the word and frequency need to be placed inside a tuple, with parentheses. Now it says 'this' has occurred four times, 'is' two times, 'the' three times, and 'list' four times. I'm going to store this as words_freq — that's the word frequencies — and then sort it in descending order, so the most frequent word comes first: words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True). There the key is a lambda: it takes each tuple x — this x is the (word, frequency) pair, passed in one by one — and returns the second value, x[1], which is the frequency; and reverse=True means sorted in descending order. Now if we look at words_freq we have the list in sorted form, and if you want just the first two values, you can slice it with [:2] and you will get just the top two. So this is the code; now let's put it into a function. I'm going to copy it and define def get_top_n_words(x, n): it takes the corpus x and n, the number of values it should return.
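Putting the steps above together on the same toy list:

```python
from sklearn.feature_extraction.text import CountVectorizer

X = ["this is the list", "this is the the list list list", "this"]

vec = CountVectorizer().fit(X)   # unigram vectorizer, default ngram_range=(1, 1)
bow = vec.transform(X)           # sparse document-term matrix
sum_words = bow.sum(axis=0)      # 1 x vocab matrix of corpus-wide counts

# vocabulary_ maps each word to its column index, so sum_words[0, idx]
# is that word's total frequency
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
```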
Let's paste the body inside; once you paste it you need to press Tab so the alignment is correct. The vectorizer and all those things are good to go, and we only need to change the slice to [:n]. So there we have our get_top_n_words function. Now let's check it: call get_top_n_words(X, 2) — it returns nothing, because I have not written a return statement; once I add return words_freq[:n], you get the same result as before, and similarly if you want three values you can pass 3 and it will return three. Now we are going to use it for the review text: words = get_top_n_words(df['Review Text'], 20). The review text is passed in, the unigrams are calculated, and the top 20 words are returned. It will take a little time, and then finally we have our words — these are the top 20 words. In the next lesson we will plot these words, and we will also do it for unigrams, bigrams, and trigrams, so I'll see you in the next lesson. All right, let's continue this lecture. Since we have our words and their frequencies — this is what we got for the review text unigrams — I'm now going to create a DataFrame so that we can plot it: df1 = pd.DataFrame(words), and if you look at df1 you have it. For the column names I'll pass 'unigram' and 'frequency'. Okay, there was a little error there at first, but now we have the unigrams and their corresponding frequencies.
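The finished helper can be sketched like this; the ngram_range and stop_words parameters are my addition, so the same function covers the later bigram, trigram, and stop-word lessons as well:

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None, ngram_range=(1, 1), stop_words=None):
    """Return the n most frequent n-grams in corpus as (term, count) pairs."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words).fit(corpus)
    bow = vec.transform(corpus)
    sum_words = bow.sum(axis=0)
    words_freq = [(w, sum_words[0, i]) for w, i in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n]

top = get_top_n_words(
    ["this is the list", "this is the the list list list", "this"], 2)
```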
So what I'm going to do now is plot it: df1.iplot(). Once I do this you see the unigram and its frequency, but actually it will not work well like this, so I'm going to pass kind='bar' to get a bar plot — although even this is not quite right yet; we need one more thing before plotting it as a bar plot: we need to set the index. We can do that with df1 = df1.set_index('unigram'), so the unigram is used as the index, and then finally we plot it as a bar plot. With this, it says that 'the' has the maximum frequency, then 'it', 'and', 'is', and so on. We'll also set the axes: xTitle='Unigram' — this is not the raw text data, it's the unigram — then yTitle='Count', the count for those particular unigrams, and finally the title 'Top 20 Unigram Words'. Let's plot it, and now you get the top 20 unigram words. Note that this still includes the stop words — we have not removed them yet, and that's okay for now, not a problem at all.
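The DataFrame-and-index dance, with hypothetical counts in place of the real top-20 output; the iplot() call stays a comment since it needs cufflinks:

```python
import pandas as pd

# Hypothetical top words, shaped like the output of get_top_n_words
words = [("the", 9), ("dress", 5), ("love", 3)]

df1 = pd.DataFrame(words, columns=["unigram", "frequency"]).set_index("unigram")

# With cufflinks:
# df1.iplot(kind="bar", xTitle="Unigram", yTitle="Count",
#           title="Top 20 Unigram Words")
```

Setting the index makes the unigram labels, not the row numbers, appear on the x-axis.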
Now let's check the bigrams. For that we need to copy the cell, and in fact I'm going to label the existing one as unigram, so that it is clear later. For the bigram we again need to get the words first — in fact I had not listed that part here, so let's select all those cells together by pressing C, and then paste them with V. Now, for a bigram, inside CountVectorizer — press Shift and double Tab and you will see it — there is an ngram_range parameter. I'm going to copy it from the documentation and set it to (2, 2), so now it has become a bigram. In fact we do not need the testing cell — that was just for testing purposes; it showed 'this is' two times, 'is the' one time, and so on. With these words we get the top 20 bigrams; it might take a little time to complete, and then if you print the words you will see the bigrams. Similarly, instead of writing 'unigram' in the labels we now have 'bigram', so copy it and paste it in each place, and with this you have the top 20 bigram words in the review text. Now let's do the same for the trigram: select the first row, then hold Shift and keep selecting all the cells you want to copy, press C on your keyboard to copy them, then select the target cell and press V to paste. Once you paste, it still says bigram, so I'm going to change 'bigram' to 'trigram', and we also need to set ngram_range to (3, 3).
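The only change that turns the unigram code into a bigram counter is ngram_range=(2, 2); a tiny self-contained check:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["this is the test", "this is fun"]

vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)  # bigrams only
bow = vec.transform(corpus)
bigrams = sorted(vec.vocabulary_)  # the pairs of consecutive words found
```

Passing (3, 3) instead gives trigrams, and (1, 3) would mix all three sizes in one vocabulary.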
So now it becomes a trigram — again, the testing cell was just for checking — and there we have the trigram words. Let's run it, and while it is running change the labels to 'trigram' as well. Okay, it's completed: we have the trigrams, like 'true to size', 'the fabric is', 'it is', 'but it is', 'it is not'. These are the three-word sequences that have been used very frequently. So this is all for this lesson. In the next lesson I'll show you how to remove the stop words — stop words are those words which occur very frequently, like 'these', 'to', 'the', 'a', 'is', and so on. I'll see you in the next lesson. All right, now let's see the distribution of unigrams, bigrams, and trigrams without stop words. Earlier we saw the unigram, bigram, and trigram distributions, but most of the words were coming from stop words — words with a very high frequency in English, like 'is', 'the', 'did', 'not', 'to', 'those', 'from', 'you'. So we are going to remove those stop words, but before that, let's copy the trigram cells: select the first cell, hold your Shift key, and while holding it select the last one; you will see the cells highlighted. Now press the C key on your keyboard, and it will copy them all together; then I paste with V, and there we have it. Since we are going to work with unigrams first, remove the parts we don't need by selecting a cell and pressing the X key; you will see it is deleted. Since we are working with
unigrams, we need to change the ngram_range back to (1, 1) — that means it is a unigram. Apart from that, update the labels for unigram as well, copy and paste where needed, and let's run it; then I'll show you how to remove the stop words. This is the same plot of unigrams we had earlier, and most of these unigrams are stop words — except 'dress', perhaps, the others are stop words. To remove those stop words I need to pass a parameter, stop_words. By default I'm going to select the built-in English list, stop_words='english'; otherwise you can also pass your own list of words to treat as stop words. Once I pass 'english', the English stop words are removed automatically, and the new unigrams are calculated. Now you do not have any stop words like 'the' and 'is' that we saw earlier — these are the real unigram words that occur most in the reviews. And there is the plot: mostly 'dress', 'love', 'size', 'fit', 'like', 'wear', 'great', 'just', 'fabric' — all these unigram words occur the most, you can say. Now let's copy the cells again: hold your Shift key, select down, and you will see you have highlighted all the cells; press C on your keyboard, then select the target cell and press V, and it is pasted. I'm going to say this one is the bigram, change the ngram_range to (2, 2), and also change 'unigram' to 'bigram' in the plot titles, so the titles show bigram as well. Let's run it.
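The effect of stop_words='english' can be seen on a two-sentence toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the dress is true to size", "love the dress and the fit"]

with_stops = CountVectorizer().fit(corpus)
without_stops = CountVectorizer(stop_words="english").fit(corpus)
# Words like 'the', 'is', 'to', 'and' survive in the first vocabulary
# but are dropped from the second, leaving only the content words.
```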
Now you get the bigrams without any stop words, like 'true size', 'loved dress', 'usually wear'. When we remove the stop words, these are the word pairs that come together most of the time; with the stop words still in, they might have been 'true to size' or 'I love this dress'. So this is how you can remove stop words and then plot purely the words that matter most. We have already copied the cells for the trigram, so let's paste them again by pressing the V key, and then change everything to trigram, setting ngram_range to (3, 3). Let's run it — and I think we have got an error: 'empty vocabulary; perhaps the documents only contain stop words'. Let me see — yes, there could be documents that are only stop words. So what I'm going to do is remove the testing cell, because its words were not enough: if you remember, in the previous lesson we had our toy X — let me show you the X here — yes, in that X we do not have any run of three non-stop words together, and that is why the error was raised there. Let's move ahead: I'll change 'unigram' to 'trigram' in the labels, and again 'trigram' in the titles, and run it. With this, what we get is 'fit true size', 'true size fit', 'runs true size', 'usually wear size' — there is a little variation, like 'fits' versus 'fit'. All of these are phrases in which at least these three words appear consecutively, and wherever there were stop words in between, those stop words have been removed. All right, so this is how you can visualize your words as unigrams, bigrams, and trigrams with stop
words and without stop words, and you will understand how the words are distributed in your text data. So this is all for this lesson; thanks for watching. In the next lesson I'll show you how to work with the parts of speech and their distribution in the text data. See you in the next lesson. All right, now let's get started with the distribution of the top 20 parts of speech, the POS tags. We will be using the textblob and nltk libraries here — Python packages, you can say. You need to first install NLTK if you do not have it; you can install it with pip install nltk. This is the Natural Language Toolkit. Run it: if you do not have NLTK it will be installed, and if you already have it, it will say the requirement is already satisfied. After that you need to import nltk, and once you have imported it you also need to download a few necessary resources. So you write nltk.download('punkt') — that's the punkt tokenizer — and then nltk.download('averaged_perceptron_tagger'). Let's run it; it will take a little time to complete the download, and if you have already downloaded them it will say those packages are already downloaded on your computer. All right, once you have done that, we do not need to call NLTK directly, because TextBlob uses NLTK automatically in the background, so we are going to use TextBlob directly for the part-of-speech tagging. I'm going to create a blob: blob = TextBlob(str(df['Review Text'])) — TextBlob, which we have already imported, applied to the review text converted to a string. Once you do this you will get a
blob — this is a blob. If you want to see how it works, simply copy the str(df['Review Text']) part and run it just above: you get all the text of the reviews in the form of a series converted into string format — if there is numerical data somewhere, that is also converted into its string representation. So we now have a blob, and once you have the blob you can calculate the tags by just calling blob.tags. These are the tags — but once you have all these tags, you might not understand what they mean. To look them up, write nltk.help.upenn_tagset(). It is giving an error; it says it does not have the tagset resource, so what I'm going to do is first download the tag set as well, with nltk.download('tagsets'), and then run it again. Now it prints the whole tag set: CC is a coordinating conjunction, CD a cardinal number, DT a determiner, and so on — all the tag sets present in NLTK. Let's see what RB is: RB is an adverb, like 'occasionally'. All right, so these are the tags we are going to use. Now let's write the code to collect all those tags together: I'm going to build a part-of-speech DataFrame, pos_df = pd.DataFrame(...), from blob.tags, which is a list of (word, tag) tuples, so each row has the word and its
related tag. I also need to pass the columns: columns=['word', 'pos'] — the word and its part-of-speech tag. If you want to see it, you can look at the DataFrame: we have the words and their related parts of speech. Now what we need to do is a value count on this POS column: pos_df = pos_df['pos'].value_counts(). Let's get this pos_df and look at it — this is what we have. Now we need to plot this pos_df in the form of a bar chart, which we can do with pos_df.iplot(kind='bar'). Once you do this, you see NN at the top — that's the noun, singular or mass — and similarly the other tags: DT, JJ, CD, and the conjunctions. So in the reviews, most of the words are nouns, then we have determiners, and then cardinal numbers as well. All right, so this is all for this lesson; thanks for watching. In the next lesson we will start with bivariate analysis — that means we will take two variables together and then try to find the correlation between those two variables. I'll see you in the next lesson.
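The POS tally from this lesson can be sketched without the NLTK downloads by hard-coding a few (word, tag) pairs shaped like the output of TextBlob(...).tags — the pairs themselves are hypothetical:

```python
import pandas as pd

# Hypothetical (word, tag) pairs, as blob.tags would return them
tags = [("absolutely", "RB"), ("wonderful", "JJ"), ("silky", "JJ"),
        ("dress", "NN"), ("fabric", "NN"), ("fits", "VBZ"), ("the", "DT")]

pos_df = pd.DataFrame(tags, columns=["word", "pos"])
pos_counts = pos_df["pos"].value_counts()

# pos_counts.iplot(kind="bar") would draw the POS bar chart with cufflinks
```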
So there I'm going to look at just two lines — the first two rows of this DataFrame — and we see we have the rating, the recommended index, the positive feedback count and all the other things in numerical form. So I'm going to create a pair plot between all these numerical columns. You can use sns.pairplot; once you type it, press Shift and double-Tab and you will get the information about which arguments you need to pass. You need to pass the data, and then, if you want, a hue — that means if you want to break down your data on any particular column, you can do all those things. All right, so I'm going to pass the DataFrame which we currently have, that's df, and once you pass it, all the numerical columns will be plotted in the form of a pair plot. Let's wait a moment; it might take a little time because there is a lot of data. Yes, it is plotting; just wait a few seconds. So the pair plot is plotted here, and it is quite a busy plot, because there are one, two, three, four, five, six, seven, eight columns — which means an eight-by-eight matrix is plotted here. And can we see any relation? There is the polarity, and it shows something like a positive correlation — a positive slope. It says that if the average word length is higher, then there is a fairly high probability that it's a positive sentiment. And similarly for the positive feedback count — let me see whether anything is positive there — yes, it says that if the positive feedback count is higher, then the polarity tends to be more positive, so there you see we have a positive slope. All right, perfect.
And there is a very interesting relation here between the review length and the word count: as the word count increases, the review length also increases, and that is pretty obvious — as you increase the word count, of course the number of characters will increase. So there is a perfect positive correlation here; it almost looks like a 45-degree line, which means as you increase the word count, the review length increases at the same ratio. All right, so that is the pair plot. Now let's go ahead and see a swarm plot — how we can do a categorical plot. A swarm plot I'm going to make with sns.catplot, and in the catplot I'm going to pass x='Division Name', in y I am going to pass the polarity, and finally the data, df. With this we get a swarm plot, and it says that the Intimates division has mostly positive polarity, but the General division has some negative polarity. That happens because — and it's pretty clear from these three columns of points — most of the reviews are in the General division, which we have already seen in our previous lessons. All right, now let's see it with a box plot. Let's copy the code and paste it here, and then I'm going to pass a kind parameter, and that kind is going to be 'box'. It says that the median polarity for all these divisions is around 0.25 — which means most of the reviews are positive, and people are recommending those products from that ecommerce website, or you can say those stores. All right, so that is how you can plot with the division name.
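The categorical box plot step can be sketched like this. The mini DataFrame is hypothetical (division names taken from the dataset, polarity values invented), and the `groupby` line makes explicit what the box plot's middle line shows — the median polarity per division.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import seaborn as sns

# hypothetical mini version of the reviews data
df = pd.DataFrame({
    "Division Name": ["General", "General", "General Petite",
                      "General Petite", "Intimates", "Intimates"],
    "polarity": [0.30, -0.10, 0.25, 0.20, 0.40, 0.35],
})

# the middle line of each box is the median polarity of that division
medians = df.groupby("Division Name")["polarity"].median()
print(medians)

# kind="swarm" gives the swarm plot from the lesson; kind="box" the box plot
g = sns.catplot(x="Division Name", y="polarity", data=df, kind="box")
```

Swapping `"Division Name"` for `"Department Name"` reproduces the department-level plot discussed next.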
Now let's plot it with the department name on x and the polarity on y. Here we can see that the Trend department has the least number of reviews, and it seems the Dresses, Bottoms and Tops departments have the most reviews. If you copy it and paste it again — since this is the department, I'm going to change it to Department Name — and then pass kind='box'. It should work, I think; let me see why this error is coming. It says "could not interpret" — yes, I think it's the Department Name. All right, so there we have it: it says that for the Trend department the median review polarity is a little lower compared to the other five departments, which means the store needs to pay attention to this Trend department. All right, now let's see the division name against the review length. That we can do with sns.catplot, where I'm going to pass x='Division Name', then y='review_length', then data=df, and finally kind='box'. With this we get the division name and review length in the form of a box plot, so the x-axis is the division name and the y-axis is the review length. There we see that the General and General Petite divisions have almost the same review length, and the Intimates division has a slightly smaller review length. And similarly we can also do it with the department name, so let's copy and paste it here and pass Department Name, and it's pretty clear that the Dresses, Jackets and Trend departments have larger review lengths compared to Tops and Intimates. All right, so this is all for this lesson. Thanks for watching — I'll see you in the next lesson, where we will see the distribution of sentiment polarity with the recommendations.
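The review-length comparison above can be made numeric with a small pandas sketch. The reviews here are invented examples and the column names follow the dataset's `Department Name` / `Review Text` convention; the per-department medians computed below are exactly what the boxes in the box plot draw.

```python
import pandas as pd

# hypothetical mini version of the reviews data
df = pd.DataFrame({
    "Department Name": ["Dresses", "Dresses", "Tops", "Tops", "Trend", "Trend"],
    "Review Text": ["Lovely dress, fits well and looks great",
                    "Runs small but the fabric is nice",
                    "Cute top", "Nice shirt",
                    "Interesting piece, not for everyone honestly",
                    "Bold look"],
})

# review length = number of characters, as used in the box plots
df["review_length"] = df["Review Text"].str.len()

# median review length per department — the middle line of each box
summary = df.groupby("Department Name")["review_length"].median()
print(summary)
```

On the real data, the same `groupby` confirms at a glance which departments attract the longest reviews.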
All right, now let's continue our lesson on bivariate analysis. Before we continue, we need to import Plotly Express and the Plotly graph objects. So I'm going to write import plotly.express as px, and then I also need import plotly.graph_objs as go — that is the graph objects. Perfect, we're good to go. First of all, I'm going to get x1, the polarity when the recommended index is 1. So with df.loc we get all those rows of the DataFrame where the recommended index is equal to 1: inside df.loc I'm going to pass the condition on 'Recommended IND' being equal to 1, and then I also need the polarity. So I'm going to select that recommended-index condition and then the polarity — I think I need to put it inside two square brackets here. There, now you get your x1: in x1 we have the rows with recommended index 1 and their polarity. Now, similarly, we are going to get x0 — when the recommended index is 0 — so I'm just going to copy it, paste it, and with the same formula we get x0. Now what I'm going to do is build trace1: trace1 = go.Histogram(...), so it will be plotted in the form of a histogram. There, x is going to be x0, and then I'm going to pass a name for this plot.
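The x1/x0 selection above can be sketched with a toy DataFrame (column names `Recommended IND` and `polarity` assumed from the dataset). One detail worth showing up front: single brackets on the column return a Series, while double brackets return a one-column DataFrame — a distinction the lesson runs into when the histogram fails.

```python
import pandas as pd

# toy stand-in for the full reviews DataFrame
df = pd.DataFrame({
    "Recommended IND": [1, 1, 0, 1, 0],
    "polarity": [0.8, 0.5, -0.2, 0.6, 0.1],
})

# single brackets on the column name → a Series
x1 = df.loc[df["Recommended IND"] == 1, "polarity"]

# double brackets → a one-column DataFrame, which is what caused
# the plotting error in the lesson
x1_frame = df.loc[df["Recommended IND"] == 1, ["polarity"]]

print(type(x1).__name__, type(x1_frame).__name__)
```

`go.Histogram` wants the Series (or array-like) form, so the single-bracket version is the one to pass.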
That name will be used as the legend name: 'Not recommended'. And then I'm also going to use opacity, so that when the graphs for x1 and x0 sit on top of each other, the graph behind is still a little visible; the opacity I'm going to use is 0.8. All right, let's run it, and then I'm going to copy and paste it for trace number two. In trace2 the name is going to be 'Recommended', and there we pass x1. So we have got trace1 and trace2; now we are going to prepare the data, and the data is a list: [trace0, trace1]. All right, with the data done, let's get the layout: layout = go.Layout(...) — that is the graph-object Layout — and I'm going to set barmode='overlay', which means x1 and x0 will be plotted on top of each other. Then we have the title, and in the title I'm going to say 'Distribution of sentiment polarity of reviews based on recommendation'. Once that is done, I'm going to say fig = go.Figure(data=data, layout=layout), and finally we can call fig.show(). It says "trace0 is not defined" — yes, those should be trace0 and trace1, so let's rename them. With this it says "module object is not callable" — I think something is a little wrong here; that should have been plotly. It still says it is not correct — let me see why; I think that L should be capital. And once you do this, you get something like this — oh, this is not very clear, so what we are
going to do is remove the recommended-index column from here, and similarly we are going to remove it from here as well. All right, so there we have x1 and x0. Let's go ahead and see — I think it's still a little wrong. If I do it with iplot — there I have iplot(fig) — let's see. Okay, just give me some time to fix it. So I got the error, and the error is actually coming from here: this creates a DataFrame, but we do not need a DataFrame. Once you put the column name inside double square brackets, the selection becomes a DataFrame — you can check it: if you run it and check the type of x1, you will get a DataFrame. But we need it in the form of a Series, and if you use a single square bracket, you will get a Series. That is the Series which we are going to pass here, and then you get the histogram plot. We had set the opacity at 0.8; I'm going to decrease it a little to 0.7, and with 0.7 you get quite a good plot. This plot says that for the recommended products, most of the reviews are positive, and the not recommended products also have some positive reviews — people say the product is good — but they carry negative polarity as well. So both recommended and not recommended products show positive polarity, but the not recommended ones also have a clear negative tail. All right, so this is all for this lesson — thanks for watching, I'll see you in the next lesson. All right, now let's plot the distribution of ratings based on the recommendation. Earlier we plotted it based on the polarity of reviews, and I have given this heading; now I'm going to plot it based on the ratings.
What we need to do is copy that code so we can save some time and work a little faster. You need to select the first cell, then hold the Shift key on your keyboard and click on the last cell, and you will see the cells have been highlighted. Then press the C key to copy, select the target cell, press the V key, and you will see that just below the selected cell the whole code has been pasted. Since we are working with the rating, the polarity just needs to be changed to rating — here and here — and apart from that we only need to change the title a little: it is now the distribution of ratings, so you can say "Distribution of reviews rating based on the recommendation". All right, let's run each cell, and we see that most of the not recommended products have either a one-star, two-star or three-star rating, while the recommended products have four-star and five-star ratings. There is also a very small number of not recommended products with a high rating, and I think we do not have any not recommended product with a five-star rating at all. All right, perfect. Now let's see the distribution with a joint plot. I'm going to write sns.jointplot and pass x='Rating' first, and for y — let's see the rating against the review length. Actually, we can properly plot a joint plot only between two continuous variables — categorical variables are possible but not recommended. So there we have y='review_length', and then data=df, that's the data
frame. Let's go ahead and run it. Once you run it, you will see x is the rating — actually, this rating is also a categorical variable, so I'm going to change it to the polarity. All right, now you see quite a good joint plot, and it is in the form of a scatter plot. If you want to change the plot type — I mean the kind of this plot — you can pass kind='kde', and it will be converted into a kernel density estimation plot. Just wait a second and you will see it — this is a very beautiful plot — and there you see the polarity is mostly positive, with most of the polarity and review length concentrated in one region. Apart from this, let's copy it and check the polarity against the age. So similarly I'm going to put age here: x is the polarity and y is the age. Just wait a second. Now it is pretty clear that the polarity is concentrated around 0.25, and the age is concentrated around 35 — the people who have given these reviews are mostly around 35 years old, and overall the polarity is concentrated around 0.25. All right, so this is all for this lesson. Thanks for watching — I'll see you in the next lesson.
Info
Channel: KGP Talkie
Views: 10,037
Keywords: text data visualization, eda on text data, data visualization in python, text data processing in python, ecommerce text data visualization in python
Id: HVBk2Ge_Q98
Length: 109min 34sec (6574 seconds)
Published: Sat May 16 2020