Tidy Tuesday screencast: scraping and analyzing ramen reviews in R

Captions
Hi, I'm Dave Robinson, and welcome to another one of my weekly screencasts, where I'll be using R to analyze data I've never seen before. As usual, this data comes from the Tidy Tuesday project, a great weekly project run by the R for Data Science online learning community, which releases a new dataset every week. Let's see what they released this week: ramen ratings. I don't actually eat that much ramen (I like some kinds), but this is going to be fun. Just like last week, we have a ratings dataset; last week it was wine ratings, this week it's ramen. So I'm going to go into my R Markdown document, load the tidyverse, set the ggplot2 theme the way I usually do, and download the ramen ratings dataset. I usually start by viewing it: we have brand, variety, style, country, and stars. This has a few categorical variables and then a numeric variable, so I'm already thinking of a linear regression to predict the star ratings, but I also wonder whether I can get text or something like that to do text prediction, like I did last week with the wine dataset. This might be fun; I might actually be doing a little web scraping to grab the reviews. Let me just take a look at the site for a second, and let's find a bad rating. This one gave zero stars: someone hated Singapore street noodles, "classic curry, what were they thinking, noodles come out light and fluffy to the point they break too easily"; disgusted, ashamed. Interesting. Notice that we've got a review number here, but I don't have a link, so I might come back to this in a minute to see if we can grab the reviews themselves.
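The setup narrated above can be sketched as follows. This is a minimal reconstruction, not the exact screencast code; the CSV URL is the standard Tidy Tuesday repository location for the 2019-06-04 ramen ratings week, which I'm assuming is the dataset being used here.

```r
library(tidyverse)

theme_set(theme_light())

# Download the week's Tidy Tuesday dataset
# (URL assumed from the tidytuesday repository for 2019-06-04)
ramen_ratings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-04/ramen_ratings.csv")

# Columns: review_number, brand, variety, style, country, stars
ramen_ratings
```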
But in the meantime, let's start by looking at brand, variety, style, and country. All of these are categorical variables we might use as predictors. Since we have multiple predictors, I'm going to gather them: gather everything besides review_number and stars into category and value columns. Now I'm going to count the category and the value, and I want only the top few from each category, so I'll group by category and take the top 16 by n. What am I trying to do here? I'm trying to create a bar plot of all four of these categories in the same plot, so I'll put value and n in the aesthetics, add geom_col, facet_wrap by category, and coord_flip. There are a few steps I'm missing: scales = "free_y", and the reordering, which is ungroup followed by mutate(value = fct_reorder(value, n)). I didn't want to create four separate bar plots; I wanted to see all of these at once. What I notice is that the vast majority come from Japan, the United States, and South Korea; not a wide selection of countries in general. Similarly for styles: most of them are in packs, bowls, and cups, with some tray and box mixed in. For brand, there's a long tail of brands, just as there is for countries. This means we generally want to lump each of these. For variety, if I count it, only a few have beef or chicken or miso ramen; it's so wide-ranging that this seems like almost a superset of the brand column, so we're not going to get much out of it unless we tokenize it, since some varieties share words like "curry".
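The faceted bar plot described above might look roughly like this (a sketch following the narrated steps, using the gather()/top_n() verbs of that era of the tidyverse):

```r
# Gather the categorical columns into long form and plot the most
# common levels of each category in one faceted bar plot
ramen_ratings %>%
  gather(category, value, -review_number, -stars) %>%
  count(category, value) %>%
  group_by(category) %>%
  top_n(16, n) %>%
  ungroup() %>%
  mutate(value = fct_reorder(value, n)) %>%
  ggplot(aes(value, n)) +
  geom_col() +
  facet_wrap(~ category, scales = "free_y") +
  coord_flip()
```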
We find that style probably needs to be lumped, so I'll say style = fct_lump(style, 4): there are about five common ones and then everything else. Box isn't even common enough to be a style, and I need replace_na for style equals "Other"; so here we have pack, bowl, cup, tray, and other, with box lumped into Other. That's one lump; we know that country should also be lumped. It looks like country is never missing, which is useful, and I can count: one, two, three... twelve, so fct_lump(country, 12). Now "Other" doesn't get ordered perfectly, because it appears in multiple facets; I've got a trick for that that I might show in a minute. Keeping an eye on all of these at the same time: for brand I don't know exactly where to cut it off, but I'm going to say fct_lump(brand, 10), however you feel about it. Notice that Other is definitely the most common brand; there's this wide variety, and if brand is an important distinguishing feature, we want to separate these out. I didn't even look at how many ratings there are: about 3,000, so fewer than the wine dataset (I don't mean to keep comparing them). I'm going to call this ramen_ratings_processed; it's not that the data was dirty, it's just that we did some aggregation on it, and this gives a feel for how it splits up. I'm also going to show you how to keep Other from sitting at a different position in each facet. Because fct_reorder can't give a different order for each facet, I have a tool for that in the drlib package, my personal package available on GitHub; I haven't seen anyone else with a tool for this, so I still use my own. It's called reorder_within.
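The lumping steps narrated above might look like this (cutoffs of 4, 12, and 10 as chosen in the narration):

```r
# Lump rare levels of each categorical predictor into "Other"
ramen_ratings_processed <- ramen_ratings %>%
  mutate(style = fct_lump(style, 4),
         country = fct_lump(country, 12),
         brand = fct_lump(brand, 10)) %>%
  replace_na(list(style = "Other"))
```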
You reorder value within category, and you also need to add scale_x_reordered; there it is. With these two steps, reorder_within and scale_x_reordered from the drlib package, we can order each facet to show the most common categorical variables. I'm going to title this "Categorical predictors (after processing)", with x as "Predictor" and y as "Count". You can get a sense that there are lots of "Other" countries and even more "Other" brands, so brand is generally going to be a hard distinguishing feature; then there's style, and variety. All right, now I'm going to try a linear model to predict stars, similar to previous sessions: stars on ramen_ratings_processed, explained by brand, country, and style. The broom package is nice for tidying this up. We're actually going to visualize it; it's probably a good idea to do a coefficient plot, so I'll add confidence intervals with conf.int = TRUE, and I'm going to plot estimate against term with geom_point and geom_errorbarh (I want horizontal error bars, with xmin = conf.low and xmax = conf.high). I neglected to do any ordering: term = fct_reorder(term, estimate). I also almost never want the intercept term in a coefficient plot. That's pretty cool. So this is the estimated effect on ramen rating; the title is "Coefficients that predict ramen ratings" (I don't know why I capitalized "Ramen" there), subtitle "Less-common brands, countries, and styles were lumped into Other". Now that I see this, I realize I probably want to use Other as the reference level. One thing we didn't discuss is that in this model, what are all the terms relative to?
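The reorder_within trick can be sketched roughly as follows, assuming drlib is installed from GitHub (devtools::install_github("dgrtwo/drlib"); the same pair of functions later also shipped in tidytext):

```r
library(drlib)  # for reorder_within() and scale_x_reordered()

# Order bars within each facet independently, so "Other" can sit in
# a different position in each panel
ramen_ratings_processed %>%
  gather(category, value, -review_number, -stars) %>%
  count(category, value) %>%
  mutate(value = reorder_within(value, n, category)) %>%
  ggplot(aes(value, n)) +
  geom_col() +
  facet_wrap(~ category, scales = "free_y") +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "Categorical predictors (after processing)",
       x = "Predictor", y = "Count")
```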
What is each brand relative to? Because Other is one of those categories, I actually want to set the reference level during the processing step. Let's show this graph again: what is the reference level? One gets chosen every time you do a linear model. For brand it'll be fct_relevel(brand, "Other"); no question, Other is the most common, and we want to compare each brand to everything else. For country, let's also use Other; it's pretty common, so there's enough data. But for style we don't want to use Other; it doesn't really make sense as a reference style, so let's use Pack, since it's the most common. What we then find is that Cup is actually negative compared to Pack, and do we have positive brands? Yes, there they are. One thing I can do here, now that I realize I want to separate each of these categorical predictors (I don't like having them all mixed on one y-axis), is a little bit of extraction on the tidied data. So I'm going to say extract... how do I split this? The way to split it is: some lowercase characters, then an uppercase character, then everything else; it looks like that's the only shape of predictor name I have here. extract() is a great tidyr verb: I give it a regular expression, with some number of lowercase letters, then an uppercase letter and anything after it, apply it to the term column, and extract it into two columns, one called category and the other called term. By breaking it up into parenthesized capture groups, that worked pretty well, and it means I can now say color = category. I like that, because now I can see the ordering within each category. Actually, I
like it even more if I facet by category, with only one column. I think I missed something; oops, I forgot scales = "free". Nope, it's going to be "free_y" here, even though it's on the x-axis, because of the coord_flip. Here we go, that's pretty good. I can leave in the colors; not all that necessary, but I'll sort of leave them in anyway, and then say theme(legend.position = "none"). Brand, country, style. On a graph like this I'll often throw in a geom_vline with lty = 2 and xintercept = 0. Zero has a very important meaning here: anything that overlaps zero doesn't have a statistically significant effect. So Acecook is a bad brand relative to the general market of brands that were lumped together into Other. Now I've created a coefficient plot. Like I said, if you watch this each week: last week I did a wine analysis with a very similar coefficient plot, and that's one of the reasons I'm rushing through this a little; we've seen this before. Generally it shows that Indonesia, Malaysia, and Singapore have the best ramen, followed by Japan and Taiwan; the UK, Vietnam, and Thailand are not really distinguishable from the "Other" category, which is most other countries in the world. The US just barely edged up, but given how many hypotheses we're testing, I wouldn't call it particularly significant. Cups are less good; the "Other" styles may be better, but it's hard to say, because Other was pretty rare for style. So these are two graphs that give a feel for how we would do a prediction. I didn't do anything with the really sparse column, variety. I'll tell you something: if I were just doing this on a regular Tidy Tuesday, what I would do is unnest_tokens on the variety column of ramen_ratings_processed and count the words.
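Put together, the model and the faceted coefficient plot described above might look like this. It's a reconstruction from the narration; in particular, the extract() regex ("lowercase prefix, then an uppercase start of the level name") is my reading of the pattern described, and it assumes every lumped level name starts with an uppercase letter.

```r
library(broom)

# Reference levels as chosen in the narration: "Other" for brand and
# country, "Pack" for style
fit <- ramen_ratings_processed %>%
  mutate(brand = fct_relevel(brand, "Other"),
         country = fct_relevel(country, "Other"),
         style = fct_relevel(style, "Pack")) %>%
  lm(stars ~ brand + country + style, data = .)

fit %>%
  tidy(conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  # Split e.g. "countryJapan" into category = "country", term = "Japan"
  extract(term, c("category", "term"), "^([a-z]+)([A-Z].*)$") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term, color = category)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, lty = 2) +
  facet_wrap(~ category, ncol = 1, scales = "free_y") +
  theme(legend.position = "none") +
  labs(title = "Coefficients that predict ramen ratings",
       subtitle = "Less-common brands, countries, and styles were lumped into Other")
```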
We see words like "noodle", "instant", etc. I could also group by the word, summarize the mean of stars, and ask: what is the average rating for things that have those words? I'll also include the count, and probably filter out the missing stars first. So we can see that, in general, ones that have "chicken" are lower rated, and the word "ramen" itself has a higher average rating. Now, we haven't distinguished this from the other factors. Like I said, in a previous session we did wine ratings with lasso regression, and that's a great method for seeing, across all these words, which have a positive effect and which have a negative effect. But because I did exactly that last week (even if you didn't see the screencast), I want to try something new, so I'm not going to go deeper into it. Just know that this is how I would work with that column: I wouldn't lump it, because it's too sparse (we saw there are too many distinct values), but I absolutely could ask which words, when they appear, have a positive or negative influence, since some of these words are pretty common across noodles. Instead, what I'm going to try is web scraping. Oops, I saved it as Untitled; let me rename that. I think in one previous screencast I've shown web scraping with the rvest package, but it's been a while, so I'm excited to try this out again. What we're going to do is try getting the text of these reviews to augment our data a little bit. Our ramen data comes from this page, and it looks like this is what we were working with, review number and so on, so this was probably scraped. I'm actually curious: do we have the text of the reviews for screen scraping? No, we do not; the review text is not provided on GitHub.
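The quick word-level look at variety described above can be sketched as:

```r
library(tidytext)

# Tokenize the variety column and compute the average rating per word
ramen_ratings_processed %>%
  filter(!is.na(stars)) %>%
  unnest_tokens(word, variety) %>%
  group_by(word) %>%
  summarize(n = n(),
            avg_rating = mean(stars)) %>%
  arrange(desc(n))
```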
So what I'll do is pull this data myself and get the reviews. We're going to use two really amazing tools: one is called rvest, the other is called SelectorGadget. SelectorGadget is an extension you can install in Chrome (and probably in other browsers as well). What it does is let me select a section of this web page, and it gives me a CSS selector, which is a really potent way of describing part of a page. I'm going to start with read_html, which downloads the HTML and parses it; I'll call the result ramen_list. If I looked at it, it would be an XML document, but rvest provides tools for pulling nodes out of it. In this case I want to pull the node "#myTable". Here we go: it actually pulled out that one node, and I can then feed it to html_table. It surprises me that html_table doesn't return a tbl_df, so let's make it one. Here we are: ramen reviews. Notice that's how we could have scraped this data ourselves with just a couple of lines of code. We can also do some janitor::clean_names(), which fixes the naming for us; that's so cool, I didn't know janitor would turn it into review_number. Then a select to remove that last column. I bet this is how the original data was created over time; that's pretty cool. But we didn't need to do that, and it's not what we're doing today. What we're doing is taking a look at our ramen reviews: I want to get each of these links and know what it links to, so we can start downloading their text. I don't know how many hours it would take to download them all.
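The table scrape might look roughly like this. Both the page URL (The Ramen Rater's "The List" page, the source credited for this dataset) and the "#myTable" id (as found with SelectorGadget in the narration) are assumptions from context:

```r
library(rvest)

# The Ramen Rater's big list page (URL assumed from the narration)
ramen_list <- read_html("https://www.theramenrater.com/resources-2/the-list/")

# Pull the reviews table out by its CSS id and turn it into a tibble
ramen_reviews <- ramen_list %>%
  html_node("#myTable") %>%
  html_table() %>%
  tbl_df() %>%
  janitor::clean_names()
```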
I can't just sit here and let it run, so I don't know if I'm going to download all the reviews or just a few; I'll figure that out. But let's get these review links. In SelectorGadget I click this button once to say I want this column of links, and once to say I don't want these other links, and I get 3,180; that sounds about right for the number of links here. Yep, that actually worked. So it says I want the links in the table: read_html, then html_nodes (not html_node) with "#myTable a". This says I want just the links in the table; call these review_links. That worked pretty well. Then I want html_attr of the review links, "href", and I'm going to say review_number is as.integer of html_text of these review links, and create a little table. Here we go, now we have the review links. I wonder where we got an NA; let me filter for is.na(review_number). Oh, what was that like before? I'm going to add this as a column... oh shoot, I called the same thing twice, my bad. One thing I want to check is what it used to say: filter where review_number is NA. We had a little bug there. Instead of as.integer, I'm going to do readr's parse_number on this, which gets rid of the parts that aren't clean; if there's a little extra text, it realizes it shouldn't try to parse that. Now we have all the review links; I just want to make sure I got that right. So I'm going to grab one of these: Yum Yum. I'm going to grab this one's page, and I'm also going to visit it, because we're now going to scrape one review. It has a longer name, Yum Yum Tem Tem Tom Yum, something like that. This looks pretty good. What I want to do is grab the text out of the review. I don't think there's anything here like a named review section I'd want to grab; it's probably all written by the same reviewer.
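Extracting the links and review numbers as narrated might look like this (same assumed URL and table id as above):

```r
# Grab every link inside the reviews table
review_links <- read_html("https://www.theramenrater.com/resources-2/the-list/") %>%
  html_nodes("#myTable a")

# parse_number() is more forgiving than as.integer(): it ignores any
# stray non-numeric text around the review number
reviews <- tibble(
  review_number = parse_number(html_text(review_links)),
  link = html_attr(review_links, "href")
)
```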
So in SelectorGadget I say: I want this paragraph; I do not want paragraphs that are just a picture, but I do want this one (I'll leave the picture); I don't want the "subscribe to blog via email" section; I want to be in this main section; and I don't want any of these. Notice that every time I click, the selector changes. It wants ".entry-content p": paragraphs within entry-content is how you read that CSS selector, and that's really handy, a nice way of specifying the section I want. So I'm going to go ahead and say: take this page, html_nodes to grab all the nodes that fit this pattern (looks pretty good), then html_text. Now we actually see the text, including how many stars it got. What I notice (and we'll probably see it more as we render more of them) is that there's a lot of small text, some chatting, some captions, and then the interesting part, the part that actually says "this is good" or "this is spicy", the actual review, looks like it comes in a line with "Stars". We're probably going to pull just that one out, but in the meantime, notice I can throw in a quick str_subset to say the text must have at least one character; str_subset keeps only strings matching a regular expression. I could have said it must have the word "Stars", but this is fine. So I'll say get_review_text is a function of a URL: read_html, grab each of these nodes, and there we go, a little function for getting the review text from any given URL. Now I can say mutate(text = map(link, get_review_text)); purrr is pretty cool, and we've got a list column of character vectors now. I'm going to take the first 25, map them to get_review_text, leave the data nested (why not), and leave in the link (it never hurts to have that). I'll add a message, because I like to see progress happen.
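The little scraping function described above might look like this (".entry-content p" is the selector found with SelectorGadget in the narration):

```r
# Scrape the review paragraphs from a single review page
get_review_text <- function(url) {
  message(url)   # show progress, so a long run doesn't look frozen

  read_html(url) %>%
    html_nodes(".entry-content p") %>%
    html_text() %>%
    str_subset(".")   # keep only paragraphs with at least one character
}

# Try it on the first 25 reviews; text becomes a list column of
# character vectors
review_text <- reviews %>%
  head(25) %>%
  mutate(text = map(link, get_review_text))
```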
It might be a little dull otherwise, and you always worry that it's gotten frozen. Speaking of getting frozen, there's something I often do here that works really well with tasks like this; I'll get to it in a second. First, take a look at our output: now we have our text. Any time we have a text column (I've already done a little of this), we always find ourselves counting words with sort = TRUE, and I usually like to throw in an anti_join on stop_words; you can find out more about this in the book by me and Julia Silge, Text Mining with R. What we then find is a couple of words appearing exactly five times: all of them have "pork" and "yum" five times. Oh no, did you catch this? I didn't catch it: I was reading the same URL every single time. I imagine some of you were following along at home and shouting at me; I had read_html with the string fixed there, which wasn't very handy. I was thinking these numbers were too clean; I knew there was some repeated text, but why would "pork" and "yum" be repeated exactly? Okay, that's fixed. We get 25 review texts, and now we see some of the common words are things like "click" and "enlarge" from captions; those are a little less interesting. And if we take a look, I bet we're going to see a line like "watch me cook an instant noodle recipe" multiple times. Do we? Yes, we see it here, and here, and here, maybe in every review. So there's some boilerplate that pops up on every page that's not interesting. In fact, I think the only paragraph I really want, if I want to analyze something about the reviews themselves, is the one I mentioned before: it's always that last paragraph that has the word "Stars". So I'll take review_text; it's one observation
per paragraph, and I can filter with str_detect(text, "Stars"), and yes, every one of them has one of these "Finished (click to enlarge)" bits. Honestly, I could have filtered for "Finished" too. I still want to remove it, so I probably want str_remove; whoops, I need mutate(text = str_remove(text, ...)), removing from the text the word "Finished" and then everything up to a period and a space; the question mark means non-greedy. I could have matched "click to enlarge", but I didn't want to hard-code that. There we go: we've removed it, and now each paragraph starts with something like "this was nice", or often it says what the reviewer added: spring onion, etcetera. I'll call these review_paragraphs. So now I can ask: of these review paragraphs, what are the common words? I'm probably also going to filter with str_detect for at least one alphabetic character, [a-z], in each word, because of barcodes. We saw "stars" and "added" pop up a lot; mostly we have descriptive terms: broth, noodles, flavor, soft, onion, bean. Some of them are probably positive, some negative. All right, here's what I'm going to do: I'm going to increase it to a hundred, and then we'll do a little more analysis in the meantime. This is actually a common approach I take: I keep increasing the number. In the meantime, I want to show you something really cool. The purrr package has these functions like possibly. What possibly does is this: if you're running a somewhat slow function on many thousands of items, the worst-case scenario is that it fails on just one of them (say one of these pages didn't have any paragraphs, for whatever reason), and it breaks, and suddenly all the work you've done so far is gone. So what you do is run it safely. I want to remind myself where it is in R for Data Science... yes, there's a great section in R for Data Science called "dealing with failure" that covers this; that's the one I'm going to link to.
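The filtering and boilerplate removal above might look roughly like this; the exact "Finished" regex is my reconstruction of what's described:

```r
# Keep only the paragraph containing the star rating, and strip the
# leading "Finished (click to enlarge). " boilerplate. The .*? is
# non-greedy, so it stops at the first ". " after "Finished"
review_paragraphs <- review_text %>%
  unnest(text) %>%
  filter(str_detect(text, "Stars")) %>%
  mutate(text = str_remove(text, "Finished.*?\\. "))
```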
Here it is: dealing with failure. I'm adding a note ("see here for more: possibly and other dealing-with-failure functions") for the people who are reading this in the future. And what did I tell you? I got a 404, and I haven't set it up yet. So this is super handy if you haven't seen it before. What it says is: wrap the function in possibly, with a default of NULL, and quiet = FALSE, and now possibly always succeeds. The default value it returns, NULL, comes back whenever there's an error, which here is effectively equivalent to an empty vector; that's kind of good, because the function normally gives me a character vector, so it gives me an empty result if it runs into any kind of error. I've added quiet = FALSE so that it'll show the error as it goes by; if I'm just watching, I'll see the error pass. Obviously we could have looked into it; it's just a link to a review that maybe was removed. While I'm getting impatient (if you're watching this on video, you're welcome to skip forward), let me wait until this is done. Almost done. All right, now I'm going to look at review_text again and unnest it. Ah, actually I'm glad I didn't run this, because what happened is I have to add an extra step: filter(!map_lgl(text, is.null)), since there's no row for the one that failed. I could have used character(0) as the default, couldn't I; that's what I should have done, character(0), for next time. Now I know; but for now I
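The possibly() pattern described above can be sketched as:

```r
# possibly() wraps a function so that one failing URL (e.g. a 404)
# returns the default (here NULL) instead of aborting the whole map().
# quiet = FALSE prints each error as it happens.
get_review_text_safely <- possibly(get_review_text, NULL, quiet = FALSE)

review_text <- reviews %>%
  head(250) %>%
  mutate(text = map(link, get_review_text_safely)) %>%
  filter(!map_lgl(text, is.null))   # drop the failures afterward
```

Using character(0) as the default instead of NULL would have made the later unnest step work without this extra filter, as noted in the narration.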
can filter those out and unnest. When you have a step that takes a long time, it's nice to keep the result in as raw a form as you can, because then I can do whatever processing I want afterward; if I later decide I want to look at the first line of each review, I can do that myself. So here are my review paragraphs, there they are. It shows we had 97 cases of finding "Stars"; it seems some of them didn't have that "Finished" line where they gave a star rating. I wonder, if I search for "Finished"... there we go, so some of them don't have stars. This is the thing about screen scraping: it's never exact. Let me check: filter for not str_detect(text, "Stars") and pull the text. "Not scored", "no score". Okay, so we do have ones that aren't scored; I can leave those out anyway. Here are my review paragraphs, 99 with scores. Now I'm deciding whether I want to fit a model, like I said I've done before, or whether I want to download more data first. It's really addictive to download more data; I could just do it all day instead of actually doing analysis. But it's hard to do anything with this few reviews, so I'm going to run this. Oh, it's not quietly = FALSE, is it; it's quiet = FALSE. So I'm going to run this and just stare at my screen, or maybe my phone, and you can fast forward; I'll download 250 reviews. If you find yourself letting this run in the background while hearing me talk: one thing I was thinking is that I should have puzzles ready, or something quick I can do, a trick of the day, while I'm waiting for code to run. It's not a common situation in these screencasts, but it would have been useful in this case. Oh, look at that, we caught an error; I hope you caught yours too. So we have our review text, and now I can try running this line again. Good. And we have 247 of the two hundred
fifty with a review paragraph. Now we can work with these paragraphs (we could do other things with them, but let's work with these). We find that almost all of them use the word "stars", as well as "barcode"; we saw they start with the word "added"; and we work our way down from there. I'm going to call this review_paragraphs_tokenized. I'm also going to inner_join it with my original ramen dataset, ramen_ratings, by review_number. Hmm, the review numbers in the original one were character, probably because they used a similar scraping approach there. I'm going to fix that: mutate(review_number = parse_number(review_number)) on ramen_ratings. I'm doing something wrong... oops, okay, sorry about that. Wait, it wasn't ramen_reviews, it was ramen_ratings; sorry, folks, here it is. And it should be ramen_ratings_processed, which does make a difference. So now we have brand, variety, etc., but importantly we have stars. We can take this data and group_by(word), then summarize: number = n(), and reviews = n_distinct(review_number), because a word can appear multiple times within a review, and then avg_rating = mean(stars). Here we go. I'm deciding whether I want to use a lasso regression on this, like I did last week; I'm not going to, just because I've done a bunch of lasso recently, so I'm just going to use the average rating, and we're going to try a different type of plot. I'll arrange this descending by how many reviews a word appears in. Oh, I forgot to filter out !is.na(stars) first. So I'll call this review_words, and this shows my words by how many times they appear: "stars", "bar", and "code", you know, not very interesting, and so on. Here's what I'm going to do instead: look at co-occurrences.
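The join and per-word summary described above might be sketched like this. The parse_number() fix assumes, per the narration, that review_number arrived as character in the original data:

```r
# Tokenize the review paragraphs and join back to the ratings
review_words <- review_paragraphs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]")) %>%   # drop bare barcodes/numbers
  inner_join(ramen_ratings_processed %>%
               mutate(review_number = parse_number(review_number)),
             by = "review_number") %>%
  filter(!is.na(stars)) %>%
  group_by(word) %>%
  summarize(number = n(),
            reviews = n_distinct(review_number),
            avg_rating = mean(stars)) %>%
  arrange(desc(reviews))
```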
I didn't do this with the wine dataset, so let's try something else with this data: let's look at what sets of words commonly appear together, so we can get a visualization of a network. For that I'll use the widyr package and say pairwise_cor(word, review_number, sort = TRUE); I think I first need to select down to review_number and word, and I don't know what will happen if I don't do that. So, for example, some of these always appear together: "bar" and "code", "code" and "bar". That actually makes this less interesting, because those end up being too common; "fork" and "lengths" always appear together too. If I join this with review_words (I'll call this word_cors)... oh, it just hit me: we're including words that are too rare. If a pair of words only appears once, the correlation is perfect; and some appear too often. So review_words_filtered is review_words, filtered to where reviews is less than 200 (I don't want "noodles", "added", "barcode", or "stars") and reviews is greater than ten; we can start with that. Now I take the tokenized data, semi_join it on review_words_filtered so we're only using common-enough words, take distinct, and run pairwise_cor. Here we go; I like that more. So here are my word correlations: "mung" and "sprouts", "mung" and "bean"... I don't know my way around these vegetables, spices, and such. So what I'm going to say is: take our word_cors, take the two hundred highest correlations, and use the ggraph package, along with igraph. We've created correlation graphs before; you can find out a lot more about this at tidytextmining.com, in the chapter on relationships between words, n-grams and correlations, where we build these kinds of word-correlation networks.
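The filtering and pairwise correlations might look like this (cutoffs of 10 and 200 reviews as chosen in the narration):

```r
library(widyr)

# Drop words that are boilerplate-common or too rare: a pair seen in
# only one review correlates perfectly, which is uninformative
review_words_filtered <- review_words %>%
  filter(reviews >= 10, reviews < 200)

word_cors <- review_paragraphs %>%
  unnest_tokens(word, text) %>%
  semi_join(review_words_filtered, by = "word") %>%
  distinct(review_number, word) %>%
  pairwise_cor(word, review_number, sort = TRUE)
```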
build a graph that looks a lot like that. I'm actually going to link to it, just to make sure; we've used it at least a few times, so there's more on correlation graphs with words there. Here we go: I'll take the top 200, then graph_from_data_frame (I need igraph for graph_from_data_frame), and ggraph. ggraph is a great package built by Thomas Lin Pedersen; it's based on the grammar of graphics, but it's an approach for network plots: geom_edge_link, geom_node_point. Here's our graph so far; I almost always set.seed for these. It doesn't look like much yet, does it? Give it a minute. I'm going to add geom_node_text(aes(label = name), vjust = 1, hjust = 1); actually, I'll do it a little differently, with repel = TRUE. I'm a little out of practice with this, but here it shows, oh, nicely: "hydrated", "style", "mouthfeel", "seafood", "shrimp", et cetera. That's pretty cool, but I can actually get a little cooler: I want to include all the filtered words as vertices. Hmm, how many words are in here? I tried this once; it's going to have a lot of extra points, and I'm not so sure I'm going to love that. Yeah, I don't love this.

What I'm going to do, and I always find myself doing this (I don't have a faster approach yet), is take the filtered correlations, and then my nodes are my review_words_filtered, but filtered to where word is in either item1 or item2 of the filtered correlations, and pass vertices = nodes. So now I've created a subset of the vertices. Why did I do that? Because now it lets me use aesthetics at the node level: I can say that size is the number of reviews a word is in, so the more common words end up showing up bigger. Now I've got some overlap, some words that are a little hard to read and such; I'll be working on that in a second. Oh, one great thing: I can
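Putting those pieces together, a minimal version of the network plot might look like this; word_cors and review_words_filtered are tiny toy stand-ins for the objects built in the screencast.

```r
library(tidyverse)
library(igraph)
library(ggraph)

# Toy stand-ins for the real correlation and word-count tables
word_cors <- tibble(item1       = c("broth", "salty", "broth"),
                    item2       = c("salty", "broth", "rich"),
                    correlation = c(0.5, 0.5, 0.3))
review_words_filtered <- tibble(word    = c("broth", "salty", "rich"),
                                reviews = c(30, 12, 8))

# Keep only words that appear in the top correlations, so vertex-level
# aesthetics (like size) line up with the graph
filtered_cors <- head(word_cors, 200)
nodes <- review_words_filtered %>%
  filter(word %in% filtered_cors$item1 | word %in% filtered_cors$item2)

set.seed(2019)  # network layouts are random; fix the seed for reproducibility
p <- filtered_cors %>%
  graph_from_data_frame(vertices = nodes) %>%
  ggraph() +
  geom_edge_link() +
  geom_node_point(aes(size = reviews)) +
  geom_node_text(aes(label = name), repel = TRUE)
```

Passing vertices = nodes to graph_from_data_frame is what makes aes(size = reviews) available inside geom_node_point.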
also do theme_void; it's one of my most common operations, and I'd say it makes this a bit better than a scatter plot. I love how this shows "pepper", "chewiness", "savory", "broth"; it's quite an interesting shape of network. The way this batch of things ("sprouts" and such) looks like a group of ingredients actually makes me think that 200 connections is not quite the right number. But yeah, I kind of like this graph: it shows one really heavy cluster and kind of spreads out from there.

Now, why show this as a graph? It's a little bit interesting, but what could I learn from it? Well, I can learn from it if I use another aesthetic that I set up earlier with this in mind: average_rating. That is, for this word, what's the average rating of a review that contains it? It's much simpler than a linear model, and technically it could be confounded across many of the words, but it's pretty good for a graph like this, because I can say size and, yeah, color = average_rating. I need to change this a little bit, but here we go. Hmm, some areas are a little bit brighter, some darker, but I'm going to set up a scale_color_gradient2 where low is red, high is blue, and midpoint is, I don't know; we saw earlier the median was around three point six. Hmm, I need a lower midpoint. No, everything is blue right now; I meant a higher midpoint. Here we go. So notice that now you can look for areas where blue means higher ratings and red means lower ratings: "excellent", "fried", "mouthfeel", "quantity" (quality and quantity are both positive), and it looks like "impressed" and "hearty" are positive, while this whole area of "salty", "salt", "life", "baked", "cheddar", "chicken", "pepper" appears in lower reviews. "Alright" is quite a negative one. So that's definitely some things I can learn
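The diverging color scale being tuned here can be sketched in isolation; the data is a toy stand-in, and the midpoint of 4 reflects the higher value eventually settled on in the screencast.

```r
library(tidyverse)

# Toy data: words with their average review ratings
words <- tibble(word = c("salty", "broth", "rich"),
                average_rating = c(3.2, 4.0, 4.6))

# Values below the midpoint shade toward red, above it toward blue;
# picking the midpoint is what makes "brighter" and "darker" regions readable
p <- ggplot(words, aes(word, average_rating, color = average_rating)) +
  geom_point(size = 4) +
  scale_color_gradient2(low = "red", high = "blue", midpoint = 4)
```

If the midpoint sits below most of the data, as happened in the screencast, everything shades blue and the scale carries no information.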
from this. Notice it's a little bit hard to see all these points; I have a trick I use, and I can't remember if I've ever used it in a screencast. It's a silly trick: what I do is create a separate point under the first one that is slightly bigger, and that actually makes it look like each of these points is outlined. I don't have a better way of doing that; I've done it for a lot of my graphs, and just notice that it really makes it easier to see these points. Let me see, yeah, I need to add a couple of labels: color = "Average rating", size = "# of reviews", title "Network of words used in ramen reviews", subtitle "Based on 250 ramen reviews: words used together and their star ratings". Great. I also added, just in time, the set.seed back into my script, which is nice.

Good. So about 20 minutes ago I spent five minutes letting that scrape run; it's important to get as much data as you can while still fitting within your time constraints. I'm happy with how I did on that, because I spent five minutes and got 250 reviews. I could have gotten a little more, but now I get to see these discoveries: "rich", "fried", "mouthfeel", "hearty", "impressed", these blue areas, are kind of positive, and there are negative areas like "salty", "salt", "life", "chicken", "baked"; "alright" is the most negative. This whole area of "sprouts", "onion", I think, is some particular kind of ramen that might be really common, and it's middle of the road; it looks like around a 4 rating on average. "Instant" tends to be associated with negative reviews, maybe somewhat farther on the red scale.

OK, so that was our network plot. Let's review for a moment what we did. We did some exploration of the original ramen data; in particular, I was interested in the categorical fields so we could understand what to use as predictors.
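The outline trick described here can be sketched on a toy version of the graph: draw a slightly larger point underneath, then the colored point on top. Everything besides the trick itself (the toy data, sizes, and midpoint) is illustrative.

```r
library(tidyverse)
library(igraph)
library(ggraph)

# Toy graph with vertex attributes for size and color
edges <- tibble(item1 = "broth", item2 = "salty", correlation = 0.5)
nodes <- tibble(word = c("broth", "salty"),
                reviews = c(30, 12),
                average_rating = c(4.1, 3.4))
g <- graph_from_data_frame(edges, vertices = nodes)

set.seed(2019)
p <- ggraph(g) +
  geom_edge_link() +
  geom_node_point(aes(size = reviews * 1.1)) +                    # slightly bigger point underneath
  geom_node_point(aes(size = reviews, color = average_rating)) +  # colored point on top looks outlined
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_color_gradient2(low = "red", high = "blue", midpoint = 4) +
  theme_void() +
  labs(title = "Network of words used in ramen reviews",
       color = "Average rating",
       size  = "# of reviews")
```

Layer order matters: ggplot2 draws layers in the order they are added, so the larger uncolored point has to come first.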
We didn't follow up more on this, but we used those predictors to create a coefficient plot based on brand, country, and style, and I mentioned that if I were going to use the other one, the column called variety, I would have tokenized it. Then we moved to the web-scraping side: we learned how to use the SelectorGadget tool to go into each of these pages and pick a CSS selector that would extract the text for me. So I extracted all of the links from the table, basically created some of the data, and then I created a function to extract the text. Finally, once I had all this text tied to stars, I decided to make one of my favorite kinds of plots for working with text and ratings, which is one of these network plots, using color as the average rating. If I'd had about twenty more minutes, I would have let it run not just on the 250 reviews but on a larger selection; I guess it maybe would have taken an hour to run for all of them, and it definitely would have been fun. OK, that was the ramen, tidytext, and rvest analysis. I hope you had a great time; I had so much fun. I'll see you next week!
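The text-extraction function recapped here might look roughly like this; the "p" selector is a placeholder for whatever SelectorGadget actually identified on the review pages, so treat the selector and the function name as assumptions.

```r
library(rvest)

# Hedged sketch of the per-page scraper: read the page, pull out the
# nodes matched by a CSS selector (found via SelectorGadget), and
# collapse their text into one review paragraph.
get_review_text <- function(url) {
  read_html(url) %>%
    html_nodes("p") %>%   # placeholder selector; the real one was page-specific
    html_text() %>%
    paste(collapse = " ")
}
```

Because read_html also accepts a literal HTML string, the function is easy to check without hitting the network.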
Info
Channel: David Robinson
Views: 6,637
Rating: 4.9784946 out of 5
Keywords: data science, rstats, tidytuesday
Id: tCa2di7aEP4
Length: 60min 54sec (3654 seconds)
Published: Tue Jun 04 2019