Tidy Tuesday live screencast: Analyzing Animal Crossing in R

Captions
Hi, I'm Dave Robinson, and welcome to another live screencast where I'll be using R and RStudio to analyze data I've never seen before. As usual, the data set comes from the Tidy Tuesday project, a wonderful weekly data project in R run by the R for Data Science online learning community. So let's see what data we have this week. I saw in advance that it's going to be an Animal Crossing data set. Now I have to admit I've never played Animal Crossing and I don't know that much about it. I've heard some really interesting, vague things about it, about raccoon landlords and turnip stock crashes. So I'm going to really rely on the live screencast, and hopefully people who join can give me some context and throw out some ideas of things we can analyze from this Tidy Tuesday data set. I'm really excited to grab this data and see what we can find out.

There's info about villagers, items, crafting, accessories. We might be doing something with images; we have links. And there's review data. There's critic... is it one row per critic? No, this is critic scores and reviews. There are user reviews, there's information on each villager, and there's information on each item. I hadn't looked at this in advance. As always, if you're watching this live, please comment. There's usually something like a 20-second delay, but I do have an eye on the comments as I'm doing my analyses, and I'm really interested to hear ideas from people.

So let's get started. I'm going to pull in this data using the tidytuesdayR package. I'm going to create a new R Markdown document and bring in the Tidy Tuesday data for Animal Crossing. Hmm, this has happened before: I only got one data set out. I should submit an issue. I'm going to go ahead and use the
read_csv lines. Those who've watched for a while know that I like to set a custom theme. I'm going to save this as animal crossing, get rid of the tidytuesdayR loading code, and work with some of these data sets directly.

All right, so there's information on each villager, information on each item, and the reviews. What should we start with? Wow, look at all these things. There's villagers, there's items. Items looks like kind of an interesting data set; I wonder what people do with items. We have categories, whether they're orderable, a buy value. I wonder what's in the recipe columns. Do these make up parts of a recipe? Let's see: bamboo basket, bamboo bench, candle holder all require bamboo piece, so it sounds like that's what they're made of. There's a games ID... I'm just trying to get a feel for all the things we have in here. I don't know what the game IDs are; I assumed they were all Animal Crossing. There's a full character ID and a link to the image.

We could think of a couple of things to do with this data. We don't have anything over time, so we're not really doing graphs of metrics. We could do songs. We could definitely do text analysis where we look at critic and user reviews; we could try, for example, predicting the grade based on the text, which I've done in a few previous projects. We could also make a Shiny application that allows someone to explore items, like an interactive dashboard. I really like the items, but so far I've done a lot of text analysis... yeah, the text is fun. Okay, I'm going to start with the text. All right, so
let's do some text analysis. What I'm going to do is start with our critic data. It looks like we have the review text, with phrases like "New Horizons" and "like its predecessors". I wonder what grades critics mostly gave to Animal Crossing. The most common was 90, some 100s, some 80s, and various things in between. One issue here is that I'm not sure I'm going to use this to predict their grades, because you could just say 90; who gets a 90 versus an 80 versus a 100 based on the text? There's not as much dynamic range as we might have wanted. Also, critics only have about 100 reviews, so I'm not going to do any machine learning on that. The 3,000 user reviews are a better distribution to work with. I wonder what the distribution of grades is there.

What are some of the user reviews? Looking at the first six: the limitation of one island per Switch, not per cartridge, is nonsensical; beware if you have multiple people... So it looks like people are really complaining. Look at this, one island per Switch; wow, they really don't like that. Is it just me, or did everybody complain about the number of Switches? That's pretty wild. If you have multiple people who want to play in one house, they cannot each have their own account. All six of the first six mention that, which seems a little odd. Maybe the fans in the live stream can tell me whether that really is a universal issue.

All right, let's take a quick look at our user reviews then. If I'm doing text analysis, people who've been following me a while know that I'll be using tidytext. I'll say library(tidytext), then take user_reviews and call unnest_tokens(word, text). Then we can
find things like the most common words. Actually, I'm curious about one thing before we do that, which is whether the grades have been changing over time. I missed that we had a date. So I load library(lubridate) and group by month = floor_date(date, "month"). The floor_date function is really handy for summarizing. I don't know how many months there are here, so I'm going to quickly count them. Not enough: we only have reviews starting in March. If I group by week instead, now we have a few weeks, and I can summarize with n_reviews = n() and avg_grade = mean(grade). I included the number of reviews because I wanted to look at that changing too.

So it looks like the number of reviews peaked in late March. That was around the time the quarantine was really starting up, and I did hear a lot about Animal Crossing in late March. Interesting that the early reviews were positive, then they got kind of negative, and recently they got positive again. I wouldn't pay too much attention to this last week; there's only about one day in it, with seven reviews. So maybe it got unpopular and then got popular again. I could look at things like which words pop up in those. One thing I learned is that the game was only released on March 20th, so it makes sense that there aren't reviews from before that point.

Okay, so I'm going to start by unnesting the words. It's really simple to remove the stop words that come with the tidytext package, stop words being things like "my" and "me". Then I can ask questions like... actually, does the same user ever write multiple reviews? No. So I could just use username to identify a review. I could count by username,
word. The story here is that I don't necessarily want to double-count if someone used the same word multiple times. Now I have a set of user review words: I know that on this day this user used this word this many times. That seems pretty interesting. Did I include the grade? Oh, look at me, I did not. I also need grade; because it's unique for each review, I'll still get the same rows. Yes, same number of observations, 88,000, and now I have username, date, grade, word.

So now I might be curious which words are positively or negatively associated with grade. I've done this in previous analyses, like the wine ratings one. If I group by word and summarize avg_grade = mean(grade) and n_reviews = n(), the number of reviews that use that word, that's a simple approach to how positively a word is rated. Then arrange(desc(n_reviews)) shows which words were associated with which grades. The word "play" had a lower grade than average; the word "game"... really, we might be interested in doing some kind of filter, say words that are in at least 50 reviews, and then arrange(desc(avg_grade)) to see the words with the highest grades, or the lowest.

Among those are words like "bombing". Let me filter user_reviews with str_detect(text, "bombing"); I don't know if "bombing" is a term from the game or, I don't know, an Australian term for something. Okay, people are complaining about "rating bombing": don't trust people review bombing, and so on. So that's actually really interesting. I wasn't sure what I was going to do with this review data, but the fact that there are accusations of
review bombing is going to be kind of interesting; we might get a little bit of information here. So what I'm going to do is visualize our ratings over time. I'll graph week against avg_grade with geom_line. I should probably say filter(n_reviews >= 20), which drops that last week, and I can also throw in a geom_point with the number of reviews as the size, which seems to be generally helpful. I'm telling a story now about the so-called review bombing of Animal Crossing: x is time, y is average grade, size is number of reviews. This gets across the idea that it launched, then the average was very low for a while, and it has gone a little back up.

But that's not the only story about the review bombing. The other is that there were so many ones and zeros, which I think is really interesting in itself. So let's take a quick look at that. I made this graph earlier: grade with geom_histogram. Most reviews were very low or very high, so it certainly was not a normal distribution; we see these tens and these zeros. I don't know if that's typical for a game, since I don't have data from reviews of other games.

I'm just going to read through the comments to get a little bit of context on this. Between these two stories, I'm a little bit curious, so I'm going to add a couple of things. I'm going to call this summary by_week, and then add pct_zero = mean(grade == 0). It's
probably going to look really similar, but I'll also add pct_one = mean(grade == 1). Taking the mean of a logical in R is a nice way to turn it into a percentage. Aha, we see a little bit of a story of a bombing that dies out here, where early on there weren't a lot of... oops, I said one, I absolutely meant ten. I'm not that interested in one.

What is going to make this interesting? I'm going to call these pct_zero and pct_ten. The question I'm asking is whether the grades have been getting more polarized, whether the polarization has been changing. I could have asked that with a standard deviation, but I'm going to do it this way: gather pct_zero and pct_ten into type and value, and then graph value as a function of week with color = type. I'm going to clean this up; it's not going to look amazing in its first iteration.

I'm also going to do a filter. The game launched on the 20th, so I want the weeks to start then, and I'm going to set week_start in floor_date. What is going on here? This looks exactly the opposite of what I wanted. Oh, I see, week_start is a number. I'm trying week_start = 1, then other values, to make the week start on a different day; I don't even know what day 5 is, but I know it makes the week start on the 20th. Even so, there are so few reviews in that first week of May that I don't know it's really meaningful; it's
still a fraction of the rest of them. I kind of want the data to end on a complete week. So what I'm going to do is start the week on a Monday, because while that doesn't make the first week complete, it does make the last week complete, and that means I'm not going to throw out data. I summarize by week; I don't need the filter anymore, though it doesn't hurt. Now this is the pattern I'm seeing. It's worth noting that week_start can change your results, but I don't like throwing out data.

So here's the review graph, and I title the earlier one "Most reviews are very low or very high". On this one I throw in scale_y_continuous(labels = percent_format()). I don't think the fact that the two lines are meeting is particularly meaningful. I probably need expand_limits(y = 0). I'm also going to do one more thing: relabel type with ifelse, so that pct_zero becomes "percentage rated 0" and the other becomes "percentage rated 10". I don't know if that is clear: percentage rated zero, percentage rated 10.
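The weekly summary described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the screencast's actual code: the toy `user_reviews` table and its values are assumptions, and `pivot_longer()` is used here in place of the `gather()` call mentioned in the talk, since the two reshape the data the same way.

```r
library(dplyr)
library(lubridate)
library(tidyr)

# Toy stand-in for the user reviews (assumed data): one row per review,
# with a date and a grade on a 0-10 scale
user_reviews <- tibble(
  date  = as.Date(c("2020-03-20", "2020-03-21", "2020-03-24",
                    "2020-03-25", "2020-03-26", "2020-03-31")),
  grade = c(10, 10, 0, 0, 10, 4)
)

by_week <- user_reviews %>%
  # week_start = 1 makes every week begin on a Monday
  group_by(week = floor_date(date, "week", week_start = 1)) %>%
  summarize(
    n_reviews = n(),
    avg_grade = mean(grade),
    pct_zero  = mean(grade == 0),   # mean of a logical is the share of TRUEs
    pct_ten   = mean(grade == 10),
    .groups   = "drop"
  )

# One row per week and type, ready to plot with color = type
by_week_long <- by_week %>%
  pivot_longer(c(pct_zero, pct_ten), names_to = "type", values_to = "value")
```

Changing `week_start` shifts which reviews fall into which bucket, which is why the choice of starting day can change the picture.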
That labeling is maybe a little clearer. One thing we can see is that there might be a resurgence, kind of a counter-bombing. Early on, people mostly rated it ten rather than zero; then a lot of people rated it zero for a while; and then they kind of met back up in the middle. I can still throw in a geom_point with size = n_reviews, and then labs: x is time, y is percentage of reviews, size is total reviews in the week.

So this is looking at it by week. We see where that so-called review bombing keeps going. We also see an increase in the number of reviews overall, with lots and lots of zeros; during this time almost half of the reviews were zeros. And then there's a counter-bombing popping up here. We're going to check the sentiments to get a feel for that.

Something I'm seeing in the chat, from Caleb Darrow: Bunny Day events were April 1st to 12th, so review bombing and bad reviews sort of make sense. What is a Bunny Day? Somebody tell me in the chat. I'll add a geom_vline with lty = 2, where xintercept is a combination of the start and end dates; I think I have to do as.numeric of as.Date of both, so 2020-04-01 and 2020-04-12.
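As a rough sketch of marking an event window on a time-series plot, the snippet below draws dashed vertical lines at the two Bunny Day dates mentioned in the chat. The `by_week` values here are made up for illustration, and `linetype = 2` is spelled out where the talk uses the shorthand `lty = 2`.

```r
library(ggplot2)

# Made-up weekly shares of zero-rated reviews (illustrative values only)
by_week <- data.frame(
  week     = as.Date(c("2020-03-16", "2020-03-23", "2020-03-30",
                       "2020-04-06", "2020-04-13")),
  pct_zero = c(0.10, 0.30, 0.45, 0.50, 0.35)
)

# Start and end of the event, as given in the chat
bunny_day <- as.Date(c("2020-04-01", "2020-04-12"))

p <- ggplot(by_week, aes(week, pct_zero)) +
  geom_line() +
  # Dates are stored as days since 1970-01-01, so convert them to numbers
  geom_vline(xintercept = as.numeric(bunny_day), linetype = 2) +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(y = 0) +
  labs(x = "Time", y = "% of reviews rated 0")
```

Printing `p` draws the line with two dashed markers bracketing the event.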
So that's how I'd show where Bunny Day is. It's an annual special event... how could it be an annual special event if the game has only been out since March 20th? Well, it looks like it was an Easter event, and a lot of people said it was annoying; people were disrupted by eggs. And one thing I'm seeing is that the review bombing might have started the week that this event started. So I'm going to note that in here. I'll add a title like "Reviews got more polarizing in the middle of the game's run", and then you have the revenge of the counter-bombers near the end.

All right, so that was a quick exploration of things over time. I now want to return to the average grades by word. Some of them are about positive sentiment, but some also have to do with the fact that there's a review bombing going on. So let's take a look at how we might visualize that. I'm going to require that a word be in at least... I really want to include "bombing". Tuning a threshold to include one word is kind of a no-no, but I'm going to do it anyway. I'll plot n_reviews against avg_grade with geom_point and scale_x_log10, and add geom_text(aes(label = word), vjust = 1, hjust = 1, check_overlap = TRUE). I love this graph for showing which words are positively or negatively associated. I missed my second plus. That second word is "animal". The story coming out here is that there are a lot of positive terms, and then there's "bombing". The ones that are bad are ruined, greedy, money, fix, ridiculous, profile, player, Nintendo,
island, Switch, etc., and family. So it really does look like a lot of this is about how greedy people think they are, in that they allow only one person per account. There's not a lot of ambiguity in that.

I was thinking about how to visualize this. Should I do a lasso regression? I could. The funny thing is, I'm not going to do a word cloud, but you could literally just do a word cloud of the common words in zero-star reviews, or rather the words that are disproportionately in zero-star reviews. Basically I could take this by-word summary and filter avg_grade < 2; that gives 23 words. Or I could take the bottom 20 words with top_n(20, -avg_grade) and only graph those. That's one thing I could start with. I could make a ggplot of this with a title like "What words were associated with low-grade reviews?" and a subtitle mentioning the 25 reviews. There are a few ways I can do this.

Okay, so people are saying they're greedy, it's ruined, "my girlfriend can't play", like what we saw with the word "wife" before, and so on: money, bought. I'm not seeing words like "egg" or "bunny" popping up. I do wonder what I'd see if I filter word == "bunny"; it didn't pop up in enough of these. So I'm going to try changing the threshold: make it 25 in this graph I was making earlier, and 75 here. With a minimum of 25
reviews, I'm really not seeing the word "bunny" pop up. I don't think it's in the stop words. It was only in six reviews. Okay, so that's one thing we can learn.

So people are saying Nintendo was greedy, they ruined it, etc. That's the general sense we're getting from this. This is not controlled; I'm not doing text regression here. If you want to look at text regression, I recommend going to the wine ratings screencast, where I predict the score of a wine based on the words in its review. The truth is I think we do understand most of what's coming out of this: it's pretty clear that it has to do with people complaining about the number of islands. Hopefully Nintendo's working on it. So I'm not going to fit a regression model like a lasso, because it really looks like it's all about the islands and the Switches.

One thing I am going to do is look not at the by-word summary but at user_review_words. I'm really interested in which words tend to appear together in the same review, so I can find natural clusters, and that might help us understand which reviews are about the review bombing. Ooh, I could actually do topic modeling for this. Should I do topic modeling? Yes, I'm going to try it. I don't believe I've done topic modeling in one of these screencasts before, and honestly it's been a couple of years since I've done it, so I'm going to quickly remind myself how tidying stm models works in the tidytext package. Let's quickly check... stm, this
cast_sparse, nice. Okay, this is not going to be so bad. What I'm going to do is take our review words, group by word, and filter to only the words that appear in at least 25 reviews. It looks like that's about 600 words, which I think is a solid start. What topic modeling does is a kind of clustering: it asks what the things are that people talk about. I have a suspicion there will be a topic of "island", "Nintendo", "one" that is associated with negative reviews and with that kind of review bombing. Topic modeling says there are a few groups of words that tend to appear together, and it tells you what those groups are, how those topics are associated with words, and how they are associated with documents. Right now we have words and documents; we want to break that down into words and topics, and topics and documents.

So I'm going to take our set of words and cast_sparse it into a matrix with one row for each username (that's a document), one column for each word, and n for how many times the word appears. That is a document-term matrix; I'll call it review_matrix. Some reviews may not have had any words that pass this threshold, but I don't think that will make a difference in our results. What does this data look like? It looks like a sparse matrix: how many times did this person use this particular word.

Then we use stm to fit a structural topic model on this. I used to use the topicmodels package; now I use stm for this. It's going to estimate both the distribution of topics within documents and of words within topics at the same time. What if I said there are only six topics? Later I can try... actually,
I'm going to say there are only four topics. There are going to be more than four; I'm just playing around. (I think my internet is a little bit faster now, so do tell me how the quality is going. That's on me for possibly being on the wrong network. Tell me in the comments if it shuts off or anything like that. Someone was complaining about buffering... okay, people say it's not been too bad. Great.)

So what I'm going to do is take our review_matrix and say K = 4 topics. I don't need verbose, and I don't know what the init type "Spectral" actually refers to here, but it can't hurt. Actually, I kind of like it being verbose; it tells me a little bit about how long things are taking. Iteration one, two, three, four... look at that, it's actually telling me the topics as it goes. Isn't that pretty fun? Here goes our topic modeling. I really don't think I've done topic modeling yet in any of these screencasts. One of the topics is game, time, review. The fourth is game, island, Switch, play, player; I bet that fourth topic is the review bombing. The positive one, the first, is animal, crossing, fun, love. People must be saying "10 out of 10" right there in the review.

Does it really need this many iterations? Can I tell it how many to do? I should probably make it a little faster, because I'm probably going to try a couple of different versions of this topic model. There's max.em.its for the maximum iterations, and emtol, the EM tolerance. I think the story is that it needs to get that number, the relative change between iterations, below 10 to the minus 5. I'm going to turn up the tolerance... well, I might as well let it finish; it's probably going to hit that pretty soon. So this is
like a model that's converging. Ah, here it is, it converged. A topic model prints out like this: four topics, 3,000 documents, and a 602-word dictionary. That doesn't seem too helpful yet. We could look at the individual matrices, but my favorite approach is that the tidytext package has a tidy method for these topic models. Ooh, I think I need to upgrade my tidytext... or no, maybe if I reinstalled tidytext it would fix it, but I don't blame tidytext for me having the newest version of dplyr.

So the story here is that there's a term, "anti", that is very rare, basically non-existent, in topics one, two, and three, but is moderately common in topic four. Same with the word "console", which is very rare in topics one, two, and three and really common in topic four, at three percent of words. I bet you topic four is the review bombing one. How can I visualize this? I could group_by(topic) and take the top betas from each topic with top_n. Why do three? I'll do six. Then I graph term and beta with geom_col (I flip those two, so it's beta and term) and facet_wrap by topic with scales = "free_y". The other thing I'm going to do real quick is mutate term = reorder_within(term, beta, topic), which comes from the tidytext package and reorders term by beta within a topic, and I forgot an extra step, scale_y_reordered(). This is an approach that lets me sort each of these terms.

Okay, so what this is showing is, for one thing, that the Spanish reviews landed in
topic three. Then there's the one that landed on the island complaints: I need more than one Switch, you can only have one player, Nintendo sucks, that kind of thing. If I make this the top 15 or 12... let's try 12, maybe it'll look a little better. Yeah: game, island, Switch, play, player, Nintendo, console, buy, experience, "one out of ten". So it looks like there's a clear negative topic that pops up. The words "review bombing" landed in topic three, not this one.

So this was an unsupervised way to try creating clusters of topics that are discussed in these reviews. I think four is a little small. I could do this with a varying number of topics with a little bit of effort, but I'm not quite going to do that. Let's do it with six, and let's set the EM tolerance to, what is it, ten to the minus four.

I had a question earlier about what the E-step and the M-step are. Those stand for the expectation and maximization steps. Without going into the details, topic modeling is an iterative process. (That was actually so fast that, while I'm continuing to talk, I'll rerun it with a slightly lower tolerance. It's always fun to do that.) Basically you first estimate how topics are associated with words, then how documents are associated with topics, then topics with words again, documents with topics again, back and forth, an E and an M step. I don't remember which of the E and the M step is the one for words and which is the one for documents; it's been too long since I looked at topic models. But that's what this expectation-maximization algorithm is trying to do.

Here we go. I need to tidy topic_model_6, and now I'm going to take a look at our six-topic model. Aha, now topic four...
there's no reason that one topic number has to be specific to something, that number four is always island, Switch, game, Nintendo or anything like that. That's not necessarily true, but it did pop up this way. It looks like four and also five are both about these negative reviews. Six is just describing things in the game; it's kind of a grab bag of other things: crafting, villagers, Tom Nook the raccoon landlord. Three is where Spanish ended up, with "de". I probably should have removed the Spanish stop words and Spanish reviews, but you'd still get these other ones.

Right now I'm just picking the number of topics arbitrarily. Some people ask how to choose how many topics. I've kind of just been playing: try this one, try another, try another. There are principled ways to choose the number of topics, but I don't think I'm going to go into them now. It's a little bit of an art, you can also apply some methods, and I don't have tons of practice in it. I'm starting just by saying I've clustered them into a few topics, and one thing I look for is whether it's splitting up topics that seem very similar. Here these two topics do look similar; one mentions Nintendo and the other doesn't. Four and five look like negative ones, one and two look like positive ones, three is Spanish, six is a grab bag. That's the way I would describe each of these topics.

That's the association of words with topics. I could also pick one term: does the word "switch" appear in every topic? Nope, it's mostly in topic four. And the word "progress" is mostly in topic five, a little in four, and a little in six. So the story is that these two topics represent
people complaining now why did why is it useful to uh to do topic models part of it is because we don't just we aren't just top tiding the word the word topic matrix we can also tidy the matrix with topics so how what is the document well if i actually say mutate user uh what is it username user underscore name yeah username is it's actually in like one two three indexes into the row names of the matrix we used so the matrix review matrix oh row names review matrix indexed by document that was trying to try to look for so now i've got a username column uh and um so now we can say okay how much is each document associated with a particular username that means that for one thing i could try picking what documents are so most associated with one particular um with a particular i'm actually going to say this is gamma as tight as topic model gamma it's a tidied version of the of the gamma topic model now if i take this i could say group by topic what are the most associated user uh usernames and i can say top end one by gamma the story here is that this gamma is going to be like oh this document had was 23 percent topic one this stocking was only six percent topic one that's the way these get um these get divided down so here you can say top then which documents fell into which uh documents fell into which topics it looks like for example documents 699 by toby was only uh was only about was basically just topic three which i think was the spanish topic so my guess is it's a short review entirely in spanish reviews four and five i bet you those are about um our negative reviews about uh the island switching issue and now if i took our um this join it to user reviews by username i can actually say aha i was wrong about the topic three it wasn't in spanish it was hold on just just remind ourselves for a second what is each topic three it does include spanish but it but it also includes the word bombing so three i wasn't tyler yeah so if i look at the text here are my six 
reviews the first one i've had this game since launch it's been over a month now so it's fun it's this uh et cetera it's addictive uh i love it so it's the things like it's that's just a positive review two is okay so this is the message this is the one that starts with two which is just this is the best game stop review bombing it uh that's interesting and then three that's a weird one stop review bombing with fake words wow this is a whole experience this uh this is a whole adventure yeah okay no i guess okay it looked like it had some odd parts in i don't know why it says stop i don't know why these words keep popping up here maybe it's a scraping issue because it doesn't include spanish and that seems to be the relevant part huh and then this one yes it does look in fact like this is item number four because it's just the word one item one island per console wow we really have our work set out for us in terms of topic modeling yeah this is indeed one island per console repeated over and over uh you usually don't let me be clear in text mining you usually don't see things like one island per console repeat repeated over and over yeah this is a topic so the fact that says the word island and console does make this a classic example of review four and then five is very similar it includes yup some various kinds of text so it's interesting as i bets so there's some repeated text here i don't know if they just copy pasted it again and again or if this is some kind of bug um but hmm uh and then all right then we have six which is about some um yeah i think there might be a bug in this i think there's issues in this scraping because there's repetition and there's these various things yeah worth knowing like notice the text repeats itself uh ooh i don't think this person actually did write one on for console again and again i think that they just i think the text always repeats itself okay that is really i should have read a couple of um all right here's the thing uh i 
should have read a couple reviews we're going to need to click there this looks like there's a real scraping issue here uh so i had read some reviews in the first place or at least not closely enough to see the problems so let's talk about that for one second with regard to our text uh it doesn't make a big difference because until now i haven't looked at how many times the word used um how could i fix it if i wanted to uh the answer is i could look yeah here's what i would do user reviews user reviews user reviews what i would do is check this out i would say if i take my user reviews and i say start text is string sub no string sub text 1 to 20. what i get is like okay let's get the first 20 characters why 20 yeah let's make it make it 30 uh just to like um uh it actually i'm um i'm gonna say either 30 characters or these the length of it whichever is shorter why am i doing that because if something is only 10k if there if the thing's only 10 characters long i don't want to cut it down uh so what i'm doing then is saying uh here's a string subset i'm just kind of hacking this through to try and truncate it when it repeats multiple times and i say if this starting text appears again in there in the rest of the string i'm going to want to do something to it i'm going to want to here we go one more while i get here i'm going to get here we go here i am trying to clean this uh the scraping data uh start text and now uh index yeah we're gonna do is say string match i'm trying to remember how i'm trying to actually remember how this works aha string remove you know what i can do okay yeah what i can do is i can actually say string remove from the start text everything oh sorry from text everything start text onward so i'm actually using uh what i'm going to say is is any character and then it's actually going to truncate the last character off of it which i'm um oh i know what i'm gonna do string replace this is actually a little complicated what i'm up to now what i'm 
doing is i'm turning this into a regular expression paste zero i'm saying okay i want to find cases that look like this and i want to string replace them with just the stuff that with the one character that comes before does that work fingers crossed that works uh there's a method here where i say string replace this regular expression oh oh it doesn't work it doesn't work oh my god because uh you can't i don't think it's vectorized over regular expressions is it i was going to do this i need to do a whole thing i need map to character i'm mapping onto the text and the regex string replace with the replacement does this work can you imagine this works wouldn't that be great that'd be grand uh and uh the the backslash backslash one means replace with this capturing group so what comes before here uh so i'm going to now pull new text see it does this ever oh no it's too long i'm crashing this a little bit uh you know it looks way better yeah i actually think i'm at least of the ones i'm looking at here i'm not seeing text repeat and i'm actually going to say filter string length new text is not equal to string i'm trying to show how i'm cleaning this data string length of text because i just didn't like uh i don't like the um on nuts nope it didn't do anything it did nothing it did absolutely nothing what if i just use string remove you know what i did i put the word regex in quotes anyone else catch that nobody commented it i'm counting on you folks actually i was mostly yeah uh new technology new text errored uh oh oh it's not gonna work oh man oh man none of this works i really thought i could okay comment this this is just a complete failure um the problem is that there's not there's gonna be things that don't look like a regular expression there i could figure out how to solve it i don't know how off the top of my head so i'm not gonna solve i'm not gonna solve that that's a little annoying okay uh we can't do anything with the repetition one thing we could do 
is treat each word as if it only occurred once that's not really in the spirit of topic modeling but it's something that could be done it's just very frustrating that some documents look like they have repeated words over and over okay i'm not going to spend longer on this maybe it's relatively rare but okay the main story is here we go we still have our topic model gamma all right that was fun take topic model gamma oh boy it's really going a little slower let's can i clear my session not my workspace my oh yeah here it is good i think that was what's causing the slowdown if i inner join this with our user reviews by username i've got dates and i've got grades on them why is that interesting i'll show you because what if we look at the association between grade and gamma by each topic so i can actually say here we are topic model gamma i'm actually going to do this in the join because it's so relevant that i'm going to be doing other things with it how is each topic associated with the positivity of the grade i'm actually not going to do a scatter plot because there's so many 0s and 10s i'm instead going to look at the correlation between the gamma and the grade so group by topic summarize correlation is cor of gamma and grade so this is across aha now we see indeed there are positive topics and negative topics topic two is very positive topic one is pretty positive topics four and five are highly negative the higher the gamma the more a document is associated with that topic the more negative it is so the lower the review grade is so that's pretty handy is there a way to visualize this yeah you could do like you could say i'm not crazy about it because basically you have something numeric which is the gamma compared to an average rating it's like i don't know
i just think it's not going to be like a correlation is pretty helpful i could have used a spearman correlation as well if we suspect this is not normal and we definitely suspect it's not normally distributed a spearman correlation is the correlation of ranks instead of the actual values the numbers aren't that different except for topic three where now it looks like there's a higher positive association maybe we should use the spearman correlation it doesn't seem that different and i have a suspicion that if the spearman correlation is different in one case that means something but the story is yeah we have our positive topics and negative topics that's kind of interesting and the last thing let's try looking at how it changes over dates so what we'll do is we'll say let's see group by week is floor date i have that group by week done earlier and i'll group by the week and by the topic yeah topic and i'll summarize as well i can get the average rating never hurts to get the average grade but i'll also say average gamma actually it does hurt because that's the same across topics it's not good what i'll say is what is the average amount of association between this topic one two three four et cetera and the gamma so i wonder as time goes on do some topics become more common than others every topic has some gamma but if i say color equals factor of topic i think that they might so topic five was one of our negative topics and topic five was increasing and then decreasing topic four was also a negative one and it's like a really common negative one so it looks like topic four is one of the most common overall i'm actually going to throw in an expand limits y equals zero and say average gamma document topic association it's not just a random association it's actually saying on average what percentage of the words in this document are drawn from the topic but that gets the idea across the
idea is that once we have a topic model fit we can look at how it has changed over time and yeah we can see topics four and five our negative topics did rise during that time they were representing a larger set of documents so that's one thing we can find and i wonder what went up near the end six went up near the end that was a positive topic i think and two went up near the end but it also was high so it's like it started out being mostly topics two and three and then topic five kind of really showed up top n just to remind ourselves we're talking about recreating this graph to which i'm going to add one more thing i'm going to say show legend and fill equals factor topic i just like making this maybe a little more colorful all right so that was some text analysis and some topic modeling on animal crossing reviews i didn't end up getting to the other things if i hadn't found an interesting story in the animal crossing reviews i definitely would have moved on to some of the other data sets but i actually think there is really something here in terms of what are people talking about how much is this review bombing about complaining about one issue in the game and how has that possibly changed over time there were a few graphs i was able to make that told that story i think i started with the reviews over time the fact that it was mostly polarizing and that that amount of polarization did hit a peak somewhere in the middle the fact that they never really mentioned the bunnies means i'm going to remove this graph from it and yeah and then i did some text analysis i tried cleaning up this repeating text issue but realized i didn't have a good approach
in mind and looked at the negative reviews and saw what was most associated with them then ended with topic modeling and found those negative reviews in there topic modeling can be used as i did subjectively to understand what's discussed it can also be used as a machine learning method for dimensionality reduction turning this high dimensional data set into a six or a ten or twenty dimensional data set okay so that's all the time we have i definitely enjoyed getting a sense of the animal crossing community and the reviews i didn't look at the other data sets i think there's probably a lot of cool things that can be done with the data on the villagers or looking at the items and how they might relate to each other what the prices are like but yeah that covers this data set from now on i plan on having a screencast once every week so tune in next week tuesday at 5 p.m i hope you had a lot of fun i certainly did i'll see you next time
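The repeated-text cleanup attempted in the screencast ran into trouble partly because the review text was pasted into a regular expression, so a review containing regex metacharacters could break the pattern. Below is a minimal sketch of one way around that, assuming the scraping bug simply appends copies of a review to itself: search for the opening characters as a literal string and cut the text where they reappear. The function name `truncate_repeat` and the window size are illustrative choices, not what was run in the video.

```r
# Cut a review off at the point where its own opening text shows up again.
# Assumption: a duplicated review restarts with the same first `n` characters.
truncate_repeat <- function(text, n = 30) {
  one <- function(x) {
    # Leave missing or very short strings untouched
    if (is.na(x) || nchar(x) < 2) return(x)
    start <- substr(x, 1, min(n, nchar(x)))
    rest <- substr(x, 2, nchar(x))
    # fixed = TRUE treats `start` as literal text, so characters such as
    # "(" or "?" inside a review cannot break the search
    pos <- as.integer(regexpr(start, rest, fixed = TRUE))
    if (pos == -1L) return(x)
    # A match at position `pos` in `rest` starts at `pos + 1` in `x`,
    # so keeping characters 1..pos drops the repeat
    trimws(substr(x, 1, pos))
  }
  vapply(text, one, character(1), USE.NAMES = FALSE)
}

reviews <- c(
  "one island per console. one island per console. one island per console.",
  "Great game (best in years?) and I love it.",
  "short"
)
cleaned <- truncate_repeat(reviews, n = 10)
print(cleaned)
```

A short window such as `n = 10` risks truncating reviews that legitimately reuse their opening words, which is why a longer window like the 30 characters mentioned in the video is probably the safer default on real data.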
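The gamma steps described in the screencast (finding the review most associated with each topic, then correlating gamma with the grade per topic) can be sketched on toy data. The tibble below is a made-up stand-in for the tidied gamma matrix after it has been joined to the user reviews; the user names, grades and gamma values are invented for illustration. `slice_max()` is used here as the newer dplyr equivalent of the `top_n()` call mentioned in the video.

```r
library(dplyr)

# Toy stand-in for the tidied gamma matrix joined to the reviews:
# one row per (review, topic); gamma sums to 1 within each review.
reviews_gamma <- tibble(
  user_name = rep(c("ann", "ben", "cat", "dan"), times = 2),
  grade     = rep(c(10, 9, 1, 0), times = 2),
  topic     = rep(1:2, each = 4),
  gamma     = c(0.9, 0.7, 0.2, 0.1,   # topic 1
                0.1, 0.3, 0.8, 0.9)   # topic 2
)

# Review most strongly associated with each topic
top_reviews <- reviews_gamma %>%
  group_by(topic) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

# Correlation of topic share with grade, per topic.
# Spearman works on ranks, which suits grades piled up at 0 and 10.
topic_grade_cor <- reviews_gamma %>%
  group_by(topic) %>%
  summarize(
    pearson  = cor(gamma, grade),
    spearman = cor(gamma, grade, method = "spearman"),
    .groups  = "drop"
  )

print(top_reviews)
print(topic_grade_cor)
```

In this toy data topic 1 plays the role of a positive topic and topic 2 a negative one, so the two correlations come out with opposite signs, mirroring the pattern described for the positive and negative topics in the video.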
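The last analysis, how much each topic shows up week by week, comes down to averaging gamma within each week and topic. Here is a minimal sketch with invented reviews and gamma values; the table and column names are assumptions chosen to match the wording of the transcript.

```r
library(dplyr)
library(lubridate)

# Hypothetical review dates
reviews <- tibble(
  user_name = c("ann", "ben", "cat", "dan"),
  date = as.Date(c("2020-03-23", "2020-03-25", "2020-03-30", "2020-04-01"))
)

# Hypothetical document-topic proportions, two topics per review
gammas <- tibble(
  user_name = rep(c("ann", "ben", "cat", "dan"), each = 2),
  topic     = rep(1:2, times = 4),
  gamma     = c(0.8, 0.2,  0.6, 0.4,  0.3, 0.7,  0.1, 0.9)
)

# Average share of each topic among the reviews posted in each week
weekly_gamma <- gammas %>%
  inner_join(reviews, by = "user_name") %>%
  group_by(week = floor_date(date, "week"), topic) %>%
  summarize(avg_gamma = mean(gamma), n_reviews = n(), .groups = "drop")

print(weekly_gamma)
```

The resulting table can be passed to ggplot2 with `week` on the x axis, `avg_gamma` on the y axis and `color = factor(topic)`, which is the kind of line chart described in the video.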
Info
Channel: David Robinson
Views: 2,793
Rating: 5 out of 5
Id: Xt7ACiedRRI
Length: 59min 36sec (3576 seconds)
Published: Tue May 05 2020