Sentiment Analysis in R

Captions
Hi everyone, David Maley here, and today we're going to do sentiment analysis. This is really cool: we're going to look at online sentiment on Twitter for a grocery store chain out on the West Coast. By the end you'll be able to measure customer sentiment online, negative and positive, and end up with something like this, where you can see the replies were majority positive with some negative. I'll show you how it comes out so you can see what the words are and how a store, a company, or whatever you're looking at rates in a negative and positive manner online.

I'm not going to scrape the data from Twitter here. There are many ways to do that: you can pay someone, you can use the Twitter API to get the data, or you can collect it manually. Maybe I'll do another video on that, but in this video we're assuming you already have the data from Twitter. You just need to cleanse it: remove things like usernames and junk pieces, words that aren't going to be part of the analysis, so you're left with the actual replies.

So let's get started. We're going to load these libraries: tidytext (I had it in there twice; once is plenty), textdata, readxl (because I'm reading an Excel sheet), dplyr, stringr, tibble (it's late at night, sorry), and ggplot2. Then we bring in not one but three different dictionaries of sentiments: AFINN, Bing, and NRC. They each contain different groupings of words with ratings for how negative or positive they are, and we pull each one in with get_sentiments.

Then I bring in the data. I want to be able to choose my file manually, so I use read_xlsx, which is from the readxl library above, with file.choose(); when you run that, it pulls up a little window so you can browse your files and find the data. In this case I already have it, and it holds everything people said, all their replies, one row per reply. Some replies have several words, some have a single word; it could be anything. What we want is one word per row, so how do we do that? We read the file into data2, then pipe data2 into mutate() (the %>% is the pipe) to add linenumber = row_number(), and pipe that into unnest_tokens(word, text).
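As a minimal sketch, assuming your reply column is named text (everything else here follows the steps just described):

library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(textdata)   # provides the AFINN and NRC lexicons on first use
library(readxl)     # read_xlsx()
library(dplyr)
library(stringr)
library(tibble)
library(ggplot2)

# The three sentiment dictionaries
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")

# Pick the Excel file of Twitter replies interactively
data2 <- read_xlsx(file.choose())

# One word per row; the reply column must be named "text"
data_by_word <- data2 %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)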
What unnest_tokens means is one word per row: if a reply had five words in it, you get five rows. Just make sure that the column holding your replies is named text at the top of your data, or change that argument to whatever the column is called; if I had named the column words, I'd have to pass words instead, and if the names don't match it won't read it. Run this and it reads the data word by word.

Next, I know there are words in there that are nonsensical, meaning stop words, things that don't count. Is "i" negative or positive? Just "i"? What about "see" or "sea"? "Sea" could be positive, but I doubt it. We want those removed. There's a data dictionary called stop_words that comes from the tidytext package we loaded above, and running data(stop_words) brings it in.

Keep in mind, though, that stop_words doesn't cover curse words, and we can't hand an executive a report full of curse words; people online post all kinds of stuff, so we've got to remove that too. We'll create a little process for it: two variables, a and b. Into a go all the actual curse words we think we'll run into (I'm not going to repeat them here), and b gets the label "curse" for each one. The reason for b is that stop_words has two columns we need data for, not just the word itself. We pass a and b into data.frame() and put the result into curse_words. The two columns in stop_words are word, the word itself (in our case the curse word), and lexicon (here, "curse"), so we assign c("word", "lexicon") to colnames(curse_words) to rename its columns to match. Then we append curse_words back onto stop_words with rbind(stop_words, curse_words) and assign the result to stop_words. Now our stop_words has all the curse words in it along with the original stop words.

We have this stop_words table, but we haven't applied it to our data yet, so now we actually remove the stop words. We take data_by_word, which is what we created above (one word per row, hence the name, very simple), put it into cleansed_data, and then pipe cleansed_data into anti_join(stop_words), assigning the result back into cleansed_data.
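The same steps as code; the entries in a are placeholders, since the actual curse words aren't repeated here:

a <- c("word1", "word2", "word3")       # placeholder curse words
b <- c("curse", "curse", "curse")       # one "curse" label per word

curse_words <- data.frame(a, b)
colnames(curse_words) <- c("word", "lexicon")   # match stop_words' columns

data(stop_words)                         # from tidytext
stop_words <- rbind(stop_words, curse_words)

# Remove stop words (curse words now included) from the tokenized replies
cleansed_data <- data_by_word %>%
  anti_join(stop_words, by = "word")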
The anti_join is the negative of a join: a plain join would mean "I want the stop words in there," and anti_join means "I want them removed." It strips all of the stop words, which now also include the curse words, out of cleansed_data.

Once that's done, we want to count and show the most common words (the top ten by default). We take cleansed_data and pipe it into count(word, sort = TRUE). Watch what happens when I run that: it prints rows one through ten of a tibble showing each word and how many times it appears in the replies: safeway, replying, mask, store, trump, i'm, lol, people, tears, and food.
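That count is a single piped line:

# Top words by frequency; printing shows the first ten rows of the tibble
cleansed_data %>%
  count(word, sort = TRUE)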
Then, if we go down here, we want to graph the most common words. Here we go: from safeway, replying, store, mask, and trump at the top, all the way down to grocery, don't, stores, mask, floor, california (ca being California), and biden. That's not to put a political slant on this; it's just the way the data is. By the way, I didn't mention it above, but this is data from October 1st through October 12th of this year for a grocery store chain out in California.

Next, based on this data, what happens if we graph the most common words without the top ones? We don't want safeway in it, and we don't want replying; we want the rest. Run it again and here we go: the top five are store, mask, trump, i'm, and tears, and the bottom five are stores, mask, floor, california, and biden. I think there's really a tie among those bottom ones, but that's just the way it is.

Before we create the word cloud, let me go back over this; I'm going a little too fast for myself (it's late at night). We take cleansed_data and pipe it (a lot of pipes here) into count(word, sort = TRUE), because you want to know how many of each there are. Then we pipe into filter(n > 13). You could pick any cutoff; it depends on how many replies you have and how many words you want to show, and in this case n > 13 showed this many. Then mutate(word = reorder(word, n)), then ggplot(aes(word, n)) with geom_col(), xlab and ylab (your x and y labels), and coord_flip(). You can't cram the title into that pipeline once you've loaded it up with this much, so you assign the plot into p and then add p + ggtitle("Most Popular Words") to put the title up there. If you want to see more advanced ways of doing graphs like this, I've got some other videos that go into much more depth.

The second chart is basically the same thing piped into p, except we remove safeway and replying from the count: instead of just n > 13, the filter also gets word != "safeway" and word != "replying".

Now let's create the word cloud based on the most popular words. This is a little different: we bring in the wordcloud library, pipe in cleansed_data, remove replying, stores, store, and safeway (a word != condition for each), anti_join the stop_words (we don't want any stop words in there), count the words, and call wordcloud(word, n, max.words = 50).
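A sketch of both charts and the word cloud; the n > 13 cutoff and the dropped words are the choices described above:

p <- cleansed_data %>%
  count(word, sort = TRUE) %>%
  filter(n > 13) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  ylab("Count") +
  coord_flip()
p + ggtitle("Most Popular Words")

# Same chart, minus the two dominant words
p <- cleansed_data %>%
  count(word, sort = TRUE) %>%
  filter(n > 13, word != "safeway", word != "replying") %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  ylab("Count") +
  coord_flip()
p + ggtitle("Most Popular Words")

# Word cloud of the 50 most popular words
library(wordcloud)
cleansed_data %>%
  filter(!word %in% c("replying", "stores", "store", "safeway")) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))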
Here word is the actual word, like tears or trump or i'm or store; n is the count of that word; and max.words is how many we want to show, in this case 50. Sometimes you'll see the cloud get cut off at the top and bottom, because the plot device remembers the space from a previous plot. The best fix is simply to clear the plot window and re-run it; then it shows the whole thing. Looking at it, the words with the most instances render largest and the rarest render smallest, so trump comes out a little bigger than biden, along with grocery, heart, laugh out loud, quote, and so on. Keep in mind this isn't negative versus positive yet; it's just the most popular words.

Now for what I showed you at the beginning: the positives and negatives, in a measured manner. We use the same approach, but with library(reshape2). We take cleansed_data and join it against the sentiment dictionaries we brought in earlier (AFINN, Bing, and NRC), anti_join the stop_words, and count(word, sentiment, sort = TRUE). The difference is that we then bring in acast(word ~ sentiment, value.var = "n", fill = 0) and pipe that into comparison.cloud(). You could use any colors you want; here red is negative and blue is positive, and max.words is 50.
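A sketch of the comparison cloud. I join against the Bing lexicon here since it labels words simply positive or negative; that choice is my assumption, and any of the three dictionaries loaded earlier could be swapped in:

library(reshape2)   # acast()

cleansed_data %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "blue"), max.words = 50)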
Now let's run that. See, we're missing the top and the bottom again, so clear the plot and run it once more; there we go, now we have all of our words. (In some cases, like hell, some people might consider a word a curse word; I didn't put it in my list, but it's up to you, and if you had, it wouldn't show here.) You can see from this that there are more positive words than negative ones; eyeballing it, maybe 60 to 70 percent positive, give or take. Negative words are things like liar, tired, and fear, and they're sized by how often they appear, so love, joy, and smiling dominate the positive side, and bad shows largest among the negatives.

So we've got our negatives and positives, but what if we also want a measured number, the exact percentage of positive to negative? We take our cleansed data joined with the sentiments and count(sentiment, sort = TRUE), like we did before, and put that into a sentiments table. Then we split that table in two, subsetting it (that's how we're doing it, with subset) into sentiment_positive and sentiment_negative. Next we need the occurrences, so we aggregate n (remember, n is the count) by sentiment, once with data = sentiment_positive and once with data = sentiment_negative, summing each into positive_score and negative_score. After that it gets pretty simple: you compute a ratio from the positive score's n column over the total occurrences, using negative_score$n as well; the dollar sign is how you access a column in a data frame. Run that and there's your score: 0.87. Multiply it by 100 to get a percentage: 87 percent positive, which means the opposite, 13 percent, is negative.

But remember I said we were probably looking at 60 or 70 percent? You have to lower that number a little. The stop words themselves don't matter; they're neither positive nor negative. But we appended the curse words to them, and curse words are pretty much all negative, so by removing them we inflated the positive number and deflated the negative one. What you'd do is go back through this and run it the same way, but without the curse words in the stop words, and see what your number comes out to. I've already done that for you: it ends up being 78. The difference, about nine points, is the bias or skew we induced by removing the curse words, so even though it says 87 here, the truer number is 78.

If you wanted to track this over time, say comparing this October 1st through 12th window against November 1st through 12th, or even the whole month of November, you could, because in the end you're just comparing percentages. If it's 78 now, curse words taken into account, and in November it goes up to 80, you could say there's a two-point gain in positive sentiment on Twitter for this grocery store chain.
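The whole measured-score computation as a sketch; the exact ratio formula (positive occurrences over total occurrences) is my reading of the steps above, and the final figures are the ones quoted in the video. The table is named sentiments_count here so it doesn't shadow tidytext's built-in sentiments data:

# Count positive vs. negative sentiment-bearing words
sentiments_count <- cleansed_data %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, sort = TRUE)

# Split the counts apart with subset()
sentiment_positive <- subset(sentiments_count, sentiment == "positive")
sentiment_negative <- subset(sentiments_count, sentiment == "negative")

# Aggregate the occurrences into one score per side
positive_score <- aggregate(n ~ sentiment, data = sentiment_positive, FUN = sum)
negative_score <- aggregate(n ~ sentiment, data = sentiment_negative, FUN = sum)

# Ratio of positive occurrences to all occurrences; ~0.87 for this data set
ratio_positive_negative <- positive_score$n /
  (positive_score$n + negative_score$n)
ratio_positive_negative * 100    # ~87 percent positive

# Rerunning without the curse words appended to stop_words gave 78 percent,
# so the skew induced by removing them is about nine points
87 - 78

# Comparing two windows is then just comparing percentages:
# 80 (November) - 78 (October) = a 2-point gain in positive sentiment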
Again, the data for this came from Twitter. We used the replies for that period of time and scrubbed out things like usernames and similar labels, leaving just the actual replies. It's very interesting and very enlightening; you can see all kinds of things people are talking about. What's neat is that you can also identify negative things and get ahead of the curve. Say a store has an issue with items being out of stock, or something worse; it could be anything, a pothole out front, construction blocking the entrance so people can't get to the store, or a recall that affected people in a negative way. You might quickly spot something you didn't know before, and that's what's cool about data science: if you can figure that out, you can get ahead of it and try to fix it. Your marketing department and your directors can figure out how to put a positive spin on it somewhere, do a giveaway, offer coupons, any kind of marketing effort to turn around a negative.

I hope you found this interesting; this is one of my favorite analyses to do. Sometimes I call it sentimental analysis, but it's really sentiment analysis. It's really cool stuff when you think about it, and the implications are huge. There are bots people have created, and of course the CIA and NSA have taken some of that technology and code; they store all the phone calls people make inside the US, messaging, any way you communicate with people electronically, and use it to try to predict and spot bad things coming to fruition. People have also created bots to predict the stock market, and they use sentiment analysis to do that.
In those bots, they go and see what the sentiment is. The sentiment here is basically good for this grocery store chain, but that's not the whole economy; for markets you'd want to look at a larger niche, something like the world in general, which means a lot of data, and go through it to figure out what's negative and positive and work from that. One guy actually did really well with this: he created a bot, and during the last great recession, in 2008, when most people lost something like 40 or 50 percent of their money because of the dive in the stock market, he went up 684 percent over the same time frame. Why? He used sentiment analysis in a bot to predict where things were going, what he should be investing in, and what he should be switching to.

Well, I hope you found this interesting and educational. Please take a moment to subscribe, like, and share, and please try this out; try this code and make these things for yourself. It's pretty interesting what you can do. Thanks again, and have a great day.
Info
Channel: Tech Know How
Views: 2,826
Rating: 5 out of 5
Keywords: sentiment analysis in r example, sentiment analysis, sentiment analysis in r: the tidy way, wordcloud, data science, twitter analytics, sentiment analysis r, sentiment analysis with R, sentiment analysis basics in R, how to analyze sentiment in r, R sentiment analysis, R sentiment analysis tutorial, learn sentiment analysis, creating sentiment analysis in R, twitter sentiment analysis with r, sentiment analysis r code, sentiment analysis r project, sentiment analysis in r
Id: engYXDjfQ18
Length: 24min 6sec (1446 seconds)
Published: Mon Oct 19 2020