Sentiment analysis with tidymodels for Animal Crossing user reviews

Captions
Hi, my name is Julia Silge, and I am a data scientist and software engineer at RStudio. In this video we are going to use this week's #TidyTuesday dataset on Animal Crossing, specifically the part of the dataset that is about user reviews, and we are going to train a sentiment analysis model: we'll use the text of the user reviews to train a model that predicts how positive or negative each review is. This is such a common real-world task, where you have some text that someone has submitted, or that you get as input, and you need to understand how positive or negative its affect is. Text is one of my big interests, so I'm really excited to walk through how to do this, and it's a great opportunity to show that the tidymodels framework is flexible and can be applied to all kinds of data, including text.

All right, it's Animal Crossing time. What we're building is basically a sentiment model, and this is such a common thing to want to do with text in the real world. Users have come and entered text, and those reviews are associated with scores. We want to learn from the text in one of two ways: from an inferential standpoint, we want to understand what people are talking about when they are more likely to be happy versus upset; from a predictive standpoint, when someone types in some text, we want to predict whether they are likely to be happy or unhappy with the product.

Let's do a little exploratory data analysis before we get to the modeling. The model we're going to build will take the text and predict the rating, but first let's understand the rating itself by looking at the distribution of grades; these are the ratings that have been scraped from the review website. That's a pretty weird distribution, right? Well, if you have experienced ratings on the internet, it does not look weird at all: you know that people either come to complain ("Animal Crossing is the worst") or to declare their love ("10 out of 10, I love it").
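A minimal sketch of this first step. The code is not fully legible on screen in the video, so the data URL and the name `user_reviews` are assumptions based on the standard #TidyTuesday release for that week:

```r
library(tidyverse)

# Assumed: the user_reviews file from the 2020-05-05 #TidyTuesday drop
user_reviews <- read_tsv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/user_reviews.tsv")

# The distribution of grades is strongly bimodal: lots of 0s and 10s
user_reviews %>%
  count(grade) %>%
  ggplot(aes(grade, n)) +
  geom_col()
```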
So from our lived experience we understand that this is a normal way people use review websites, but from a modeling standpoint, that does not look like a set of numbers I want to model with regression; it is not a distribution of a continuous variable that would be a good candidate for that. Instead, I'm going to turn this into a classification problem. Let's take eight, nine, and ten and call those the good reviews, from the happy people who like Animal Crossing, and take seven and below and call those the bad reviews. (We could have put the cut elsewhere; there are so few sevens that it doesn't really matter.) Then we'll build a model that uses the text to predict which bin each review falls in.

When you are building a text model, it is always a good idea to actually read some of your data, so let's look at, say, five of the high-scoring reviews, the people who are very happy; a sketch of how to pull these up follows below. Notice the top one: "I am a single player that does not have a shared island...", and then partway through it starts over, "I am a single player that doesn't have to share an island", and it ends with the word "Expand". This one is kind of weird; it has repeated text. Here's another: "Firstly I would like to say..." and then it repeats, "Firstly I would like to say...", although it doesn't look like the same amount of text, and again it ends with "Expand". We also have non-English reviews in here. So this is messy data. I wonder if the duplicated prefix is the same number of characters each time; one option would be to remove the first N characters from any review that ends with "Expand", because the concern is that those opening words are going to get counted twice. We could also deduplicate the words, or just throw away every review that ends in "Expand", though that doesn't seem like a great idea. It looks like there were problems with the scraping of these reviews, and we ended up with an imperfect dataset, as we always do. That trailing "Expand" is definitely a signal that a review has this problem.

Now let's look at some of the really unhappy reviews. People are unhappy about the role of multiplayer; we have this issue with "Expand" again; and people really do not like the limitation of having one island per console. So we can see some of the problems here.
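A sketch of reading a sample of reviews at each extreme, as described above; the exact cutoffs, sample size, and seed are assumptions:

```r
# Read a few reviews at each extreme to get a feel for the text;
# the seed just makes the sample reproducible
set.seed(123)

user_reviews %>%
  filter(grade > 8) %>%
  sample_n(5) %>%
  pull(text)

user_reviews %>%
  filter(grade < 3) %>%
  sample_n(5) %>%
  pull(text)
```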
I think I'm going to take out the word "Expand" when it appears at the end of a review. I don't think I'm going to try anything super fancy like removing the first N characters, because I haven't counted them up manually and I'm not sure the duplicated prefix is the same length every time. This is an issue we have to deal with in this imperfect data, since those opening words go in twice; I'm not sure how much of an effect it has, but at the very least we can remove the "Expand" from the end of the text so that we don't have that anymore. We can also make a new column called rating using case_when(): when grade is greater than six, call the review "good", and the rest of the time call it "bad". Let's call the result reviews_parsed. We could try to clean this text up more carefully, but in the interest of moving forward, let's go ahead.

Let's do one more bit of exploration before we move on and look at the number of words per review. We take the parsed reviews, tokenize them with unnest_tokens(), count by user_name, and call the new column total_words. That gives us the length of everyone's review, and we can make a histogram to see what the distribution looks like.

Okay, that is definitely strange. This is not a natural distribution of people generating language. What I'm referring to is this edge, this cliff where the distribution falls off and then comes back up, like something has happened. Not everything is getting cut off, because we do have some very long reviews, but certain reviews are being truncated somehow. What this looks like to me is a scraping problem, and if I wanted to do the best job possible with this model, my next step would be to go back to the scraping and figure out what went wrong, because this looks like a pretty difficult problem to solve at this stage; it would be better to fix the data generation process if possible. But boy, when is that not true? We always have imperfect data; that's just life, and we do the best we can with the data we have.

So let's build a model. The data has some issues, but let's see what we can do with it even so. The first thing we want to do as we set up our model is split the data: we use the initial_split() function to split it into training and testing sets, with strata so that the training and testing data are split about evenly between the good and the bad ratings. Let's call this review_split.
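A sketch of the cleaning, the words-per-review histogram, and the split. The names reviews_parsed, review_split, and total_words come from the narration; the regex, the training/testing object names, and the seed are my assumptions:

```r
library(tidytext)
library(tidymodels)

# Strip the trailing "Expand" widget text and bin grades into two classes
reviews_parsed <- user_reviews %>%
  mutate(
    text = str_remove(text, "Expand$"),
    rating = case_when(
      grade > 6 ~ "good",
      TRUE ~ "bad"
    )
  )

# Words per review; the histogram shows the suspicious truncation cliff
reviews_parsed %>%
  unnest_tokens(word, text) %>%
  count(user_name, name = "total_words") %>%
  ggplot(aes(total_words)) +
  geom_histogram()

# Split into training and testing sets, stratified by rating
set.seed(123)
review_split <- initial_split(reviews_parsed, strata = rating)
review_train <- training(review_split)
review_test <- testing(review_split)
```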
We make a training set with training(review_split) and a testing set with testing(review_split), so now we have training data and testing data: we build our model using the training data and evaluate it using the testing data.

This is text data, and we're going to use all this text to predict the rating, good versus bad. Text data needs to be preprocessed pretty heavily to get it into a format that can be modeled; we basically need to convert it into numbers we can do math on. We're going to use recipes for the preprocessing, and not only the base recipes package but also textrecipes, an add-on package for text preprocessing maintained by Emil Hvitfeldt. We declare our recipe the way we normally do, with rating as the thing we are predicting and text as the predictor, using the training data.

Now we start adding steps to the recipe. The first step is to tokenize the text, meaning take the strings and break them apart into tokens. The default tokenization is into words, but we could try other things; the model we're building here is a good first default, and one change that might make it perform better would be to tokenize into n-grams, like bigrams or trigrams, instead. Next, let's remove stop words; these are words like "is", "to", and "of" that are very common but don't hold a lot of meaning. Again, this is not a bad default, but it's something you could explore to see whether it helps or hurts your model. Next we do a filtering step so that we don't keep every single token in the dataset. I'm going to filter fairly heavily so this model runs quickly in this demonstration: we keep only the top 500 tokens after removing stop words. You could raise this so that the vocabulary you use in your modeling is larger. Then I need to convert these tokens into some weighting; the two most common choices would be term frequency (counts, or proportions) and tf-idf, and I'm going to use tf-idf because it often outperforms plain term frequency for predictive accuracy. Finally, I'm going to normalize, meaning center and scale, all the predictors after converting to tf-idf, because the model I want to use is sensitive to centering and scaling.

So that is our recipe. Remember, when we declare a recipe we are only setting up the steps, not actually running them and computing what needs to happen; prep() is when that computation happens. Let's prep the recipe so I can show you what it produces.
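A sketch of the recipe as narrated; review_rec and review_prep are assumed names:

```r
library(textrecipes)

# Preprocessing recipe: tokenize into words, drop stop words, keep the
# 500 most frequent tokens, weight by tf-idf, then center and scale
review_rec <- recipe(rating ~ text, data = review_train) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors())

review_prep <- prep(review_rec)
```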
In the prepped recipe we can see that we tokenize, we remove stop words, and then the centering and scaling happens on all these columns, because at the end of this we actually have a ton of columns. Instead of data with a single text column, the output has many, many columns, 500 of them, because that's how many tokens I said I wanted, with names derived from words like "another", "beautiful", "click", "change", and "2020".

So preprocessing is the first thing we need to do to make a model. The second thing is to declare our model specification. We are training a classification model, so let's use logistic regression, but with lasso regularization. We have to tune the penalty, the regularization penalty, because we don't know what the right value is, and we say mixture = 1 so that we get the lasso. We set the engine to glmnet, which is the computational engine we'll use for the fitting, and call this lasso_spec. That is our model specification.

Let's put these two things together into a workflow. A workflow helps you combine the pieces of a model, like a preprocessor (the recipe) and a model (lasso_spec). Let's call it lasso_wf, for lasso workflow, and look at what we have: a preprocessor and a model put together, and we can see that it's a model that still needs to be tuned, so we couldn't fit it right away on one set of data; it would not know what to do. This is the model we are going to fit to understand, and to be able to predict, which Animal Crossing reviews are positive and which are negative.

So we need to tune this parameter and figure out the right value for this dataset. First we need a set of candidate penalty values, so let's make a regular grid using the penalty() function. If you look at penalty(), it's a function that gives us values for the amount of regularization as a tuning parameter, and it has the right kind of transformation built in so that the values get spaced out properly on a log scale. Since I'm going to run this live, let's do 30 different values; when I knit this up for my blog I might go a little higher, or maybe not, maybe this will turn out well. So that's a grid of 30 values of the regularization penalty to try.

The next thing we need in order to tune the model is some data for the tuning, so let's make some resamples. There are around 2,200 reviews, so let's make bootstrap resamples: we take the training data and, in each bootstrap resample, do stratified resampling so that the analysis set and the assessment set are about balanced between the good and the bad user reviews. Let's call this review_folds.
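A sketch of the specification, workflow, grid, and resamples; lambda_grid is my name for the grid, the other names follow the narration:

```r
# Lasso logistic regression with a tunable regularization penalty
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Bundle preprocessing and model into one workflow
lasso_wf <- workflow() %>%
  add_recipe(review_rec) %>%
  add_model(lasso_spec)

# 30 candidate penalty values, spaced on a log scale
lambda_grid <- grid_regular(penalty(), levels = 30)

# Bootstrap resamples of the training data, stratified by rating
set.seed(123)
review_folds <- bootstraps(review_train, strata = rating)
```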
Okay, now I believe we have everything we need to tune a grid. I'm going to use parallel processing, so let me make sure that is set up, and I need to set a seed here too. So let's run a tuning grid. What do we need in order to tune? The first thing is the model we are tuning: that workflow combining the preprocessing recipe, which transforms the text data into a numeric representation we can model, with the model specification. We need the resamples, which are basically the data we'll do the tuning on. And we need the grid of tuning parameters to try: all the different values of the regularization parameter at which we'll train a model, so that we can compare across all of them and see which one works best. The other thing I'm going to do here is set some metrics instead of using the defaults. For a classification model like this, the default metrics would be area under the ROC curve and accuracy. I still want the ROC AUC, but I would also like to know whether it's harder to find the good versus the bad ratings, so instead of the defaults I'm setting metrics, positive predictive value and negative predictive value along with ROC AUC, to be computed during the tuning procedure, so that I can look at them afterwards and use them to understand the situation with this particular dataset.

While that's running, which hopefully won't take too long, let's talk about what we're going to do afterwards. I'm going to use the collect_metrics() function, which lets me get out the three metrics I asked for: for every value of the regularization penalty, I have negative predictive value, positive predictive value, and area under the ROC curve. It looks like negative predictive value is lower for most of these, but let's make a little visualization: penalty on the x-axis, the mean value of the metric on the y-axis, each metric a different color, drawn as a line chart, with a separate little graph for each metric using facet_wrap(). I bet I need scales = "free", although actually, maybe not: these are all on the same scale, so it might be easier to compare without it. I do need scale_x_log10(), because remember, penalty is transformed onto a log axis.

Okay, this is quite interesting. Let's think about what these shapes are telling us (let me make it a little bigger so you can see). Take area under the curve first: it goes up and then down, so if we were picking the best model by AUC we would pick the value at that peak. Positive predictive value peaks not in exactly the same place, a little further in, but not too far off. But look: negative predictive value hasn't started to turn over yet.
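A sketch of the tuning run and the metrics plot; the seed and the exact ggplot styling are assumptions:

```r
library(doParallel)
doParallel::registerDoParallel()

set.seed(2020)
lasso_grid <- tune_grid(
  lasso_wf,
  resamples = review_folds,
  grid = lambda_grid,
  # Custom metrics instead of the defaults, as discussed above
  metrics = metric_set(roc_auc, ppv, npv)
)

# Visualize each metric across the candidate penalty values
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_line(size = 1.5, show.legend = FALSE) +
  facet_wrap(~.metric) +
  scale_x_log10()
```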
We could explore this further, maybe by making the vocabulary bigger and seeing what happens if we tune the regularization parameter with a larger vocabulary; that might be something interesting to try. This plot also helps us understand the trade-offs we have to make: if we pick the value that maximizes AUC, negative predictive value is actually down here, and if we tried to maximize negative predictive value we would get much worse positive predictive value, and so forth. So this is good to understand: there is a trade-off between detecting these two different kinds of user reviews, and it's really common that it is difficult to identify both things well at the same time.

So let's take those results and select the best one. Here's where we have to decide: let's choose the best penalty by area under the ROC curve and call it best_auc. Then we can take this best_auc value, which is a value for the penalty, and finalize our workflow with it: we take the workflow, which remember was tunable, and as the second argument we pass the new parameters to update it with. Let's call this final_lasso. We are choosing our final model now: where the penalty previously said tune(), we now say, aha, I have tuned, and here is the penalty I want to use.

Now that we've done this, let's explore the final model a little, for example with variable importance. The vip package does variable importance calculations and plots. Let's take the final_lasso we just finalized, fit it to the training data, and then, since it's a workflow, pull out the fitted model from it and pipe that to the vi() function from the vip package. We say which lambda we want to evaluate the importance at, and we want to evaluate it at the best-AUC penalty. Let's run that; it has to fit on the whole training set. (Wait, why did that not work? Oh, did I not have the package attached? Let's try again.) That gives us an output for all 500 of the tokens we kept, things like "great", "relaxing", "fantastic", "daily", "awesome", "perfect", "nice"; those are definitely positive. Let's make a visualization where we can see the top ones, because 500 things is too many to put on one plot (if you need one takeaway from this video, that can be it). Let's take the top 20 on each side. The importance for the negative ones comes out as negative, so let's rank by the absolute value of the importance within each sign group. Then let's get some variables ready for plotting: take the absolute value of importance, strip off the "tfidf_text_" prefix at the beginning that we don't need anymore, and reorder the variable factor so the terms come out in a nice order in our plot.
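A sketch of selecting the best penalty, finalizing the workflow, and preparing the variable importance data; vi_data is my name for the saved result (the video does not save it, which is why the plot refits later):

```r
best_auc <- lasso_grid %>%
  select_best("roc_auc")

final_lasso <- finalize_workflow(lasso_wf, best_auc)

library(vip)

# Fit on the training data, pull out the glmnet fit, and compute
# variable importance at the chosen penalty; keep top 20 per sign
vi_data <- final_lasso %>%
  fit(review_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_auc$penalty) %>%
  group_by(Sign) %>%
  top_n(20, wt = abs(Importance)) %>%
  ungroup() %>%
  mutate(
    Importance = abs(Importance),
    Variable = str_remove(Variable, "tfidf_text_"),
    Variable = fct_reorder(Variable, Importance)
  )
```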
Now we set up the aesthetics: x = Importance, y = Variable. You can do this in the new version of ggplot2, which is so nice: you don't have to coord_flip() all over the place; you can just put the x and the y, and if you give a geom a numeric versus a character variable, it will know to draw it the right way for you. Then we facet_wrap() by sign, and we have to use scales = "free_y", I think. (This has to fit again; I probably should have saved that fit. Oh well.)

So what are the most important words that push a review toward being negative? "Unacceptable", "worst", "greedy", "money", "boring", "second", "one", "children", "frustrating", "copies". Notice that a lot of these are about having to buy second copies and having only one island per console: the "this is unacceptable" greediness we saw when we were reading some of those example reviews, and the "copies" and "children", where the multiplayer experience when you have children playing together is not great. On the positive side, it's just "great" and "relaxing", people are playing it "daily", it's "perfect" and "awesome" and "nice", and the "music" is great. And "bombing": I think we saw an example of that, where the people writing positive reviews accuse the people writing negative reviews of review bombing. That's kind of interesting to see. Variable importance, love to see it.

Finally, let's do a last fit. With last_fit() you take your final model and you pass it the split itself; remember, the split knows where the data is, but the data isn't literally inside it, which is kind of nice. We're really getting to the end now, final_final_final_final, you know. What last_fit() does is fit to the training data and evaluate on the testing data; it's a very nice convenience function for when you've figured out what you want your model to be and you want to finish. The result has predictions in it, which are on the testing data (look how big it is), a final workflow that is fit, and metrics, and the metrics it has are computed on the testing data. We can compare these to what we got from tuning, and notice: yep, we did not overfit when we tuned. The predictions, like I just said, are also on the testing data; we have about 700 of them, with the real rating and the predicted class, and you can see the model got this one right, and this one, but not this one. So we can look at a confusion matrix: first you give conf_mat() the truth, then the predicted class.
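A sketch of the importance plot, using the vi_data prepared above; fill color and labs() are my styling assumptions:

```r
vi_data %>%
  ggplot(aes(Importance, Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(y = NULL)
```

And a sketch of the final evaluation; review_final is my name for the last_fit() result:

```r
review_final <- last_fit(final_lasso, review_split)

# Metrics here are computed on the testing data
review_final %>%
  collect_metrics()

# Confusion matrix on the test-set predictions
review_final %>%
  collect_predictions() %>%
  conf_mat(truth = rating, estimate = .pred_class)
```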
That gives us a little confusion matrix. Looking at it: for the bad comments, the people who are upset and say Animal Crossing is no good because of, you know, only one island per console, we are doing better at identifying those; the proportion correct there is better than the proportion for the good reviews. This is related to the difference we saw before between positive predictive value and negative predictive value. (The baseline level here is "bad" because "bad" comes first in the alphabet, before "good", but we could have changed that if it were important to what we're doing.) So yes, it looks like it is easier to detect the bad user comments than the good ones, but our overall accuracy is not shabby, considering this was very much a sensible-defaults text model.

The things you can do to try to do better than this mostly happen in the preprocessing steps: include not only unigrams but also bigrams and trigrams (a sketch of that idea follows below); figure out which stop word list makes the most sense and is the best fit for your data; raise the number of tokens you keep, which would likely help, especially since we're using regularization; and ask whether tf-idf is the best weighting. The different choices you make there are where you are most likely to get the biggest improvements. Lasso is really great for text; it is a workhorse. Other kinds of models you could try here, if you're interested in different kinds of machine learning models for text in the non-deep-learning regime, are support vector machines, which can sometimes work well, and naive Bayes, which can also work well.

At my house we have three children and two Nintendo Switches, so we have for sure experienced a lot of what we saw in the output of this model. People who had positive experiences with Animal Crossing, the positive user reviews, loved so many aspects of the gameplay; it's so delightful. But it seemed like a really big driver of the negative reviews was the experience around multiplayer when you have more than one player on one Switch, and I would certainly say we have experienced that at our house as well. I hope this was helpful, and I will see you next time!
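As one example of those preprocessing ideas, here is a hedged sketch of swapping word tokens for n-grams using textrecipes' step_ngram(). This variant is not shown in the video, and the argument values are assumptions to illustrate the idea:

```r
# Same recipe as before, but build uni-, bi-, and trigrams after
# tokenizing; num_tokens/min_num_tokens control the n-gram sizes
review_rec_ngram <- recipe(rating ~ text, data = review_train) %>%
  step_tokenize(text) %>%
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors())
```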
Info
Channel: Julia Silge
Views: 4,674
Rating: 5 out of 5
Id: whE85O1XCkg
Length: 42min 41sec (2561 seconds)
Published: Wed May 06 2020