David Robinson - Ten Tremendous Tricks in the Tidyverse

Captions
Hi, thanks so much. I'm really excited to be here; this is, I believe, the sixth time I've spoken for Jared at conferences and meetups, and it gets more fun every time. It's my first time in DC, so I'm really excited to be here.

My talk starts with the Tidy Tuesday project. Who's heard of it? Isn't it fantastic? It's a project run by the R4DS (R for Data Science) online learning community, and every week they release a dataset to the data science community and say: do anything you can with it — make graphs, do analyses. What it inspired me to do was start opening the dataset each week and recording myself analyzing it, so I've ended up doing 36 Tidy Tuesday screencasts in the last year. I spend an hour with each dataset, dive in, learn some things about it, and post the recording on YouTube. Here's an example where I'm trying out building a network graph. It's been really fun to learn about how I work with new data and to practice that skill, but it's also been a blast to hear from people what they learned from the screencasts, because it's not always what I expect. One thing I found out is that all of us are walking around with tricks — things we know how to do, in my case often with the tidyverse, that we might not think of as obscure tricks, but that are a critical part of our workflow and might help other people.

With that, I decided I'd go the clickbait route and say: here are my ten tricks that only a true tidyverse user will appreciate — number six will blow your mind — and walk through a rapid-fire series of things that you may or may not know you can do in the tidyverse. This is actually the first time I'm giving this talk, so I'm really interested in learning what everybody already knows, what might be extremely obscure, and what's somewhere in between, so I'm going to ask about each trick who has already heard of it. Sound good?

These ten tricks are grouped into themes: first, a couple of tricks around counting and summarizing data; second, tricks about ggplot2 visualization and forcats for categorical data; and third, three tricks from tidyr — some of its less well-known functions.

First, my favorite function: count(). Who's used count() in dplyr before? Fantastic. Hadley Wickham has said that most of data science is counting, and sometimes dividing, and I find you can get a really long way just by counting data. In fact, I did a little analysis of all the screencasts I've done and found that count() was the fourth most used function, after mutate(), filter(), and group_by(). So I really do love counting.

What would I do with count()? A lot of the examples I'm going to show come from my Tidy Tuesday screencasts, just because that's a lot of code I have lying around where I've used these tricks. This one is from the first screencast I did, where I counted the number of majors in each category: this was a table that divided college majors into categories, and count() shows there are ten majors in Agriculture and Natural Resources, eight in the Arts, and so on. Some of you may have used count() this way, but did you know it has three additional arguments: sort, wt, and name? sort says we're going to count the number of times each category appears, but then sort the result in descending order to find the largest first. wt means that instead of finding the number of observations, we sum up a column; here I set the weight to total, which means I sum up the total-graduates column and find the number of graduates in each category. And third, the result doesn't have to be named n — you can give name = "graduates". I think that's new as of dplyr 0.8.0, so fairly recent. With this trick you get three tricks for the price of one. Who's used sort before? Who's used wt? Who's used name? Awesome. That's the first trick.

The second trick is another way to use group_by() or count(). We've seen that you can count one of the existing columns; here was a dataset of bridges in Maryland and what year they were built, and count() could say how many bridges were built in each year. But I might not want to count it by year — I might want to count it by decade. In the same call to count(), I can create a new column, kind of like a mutate() and then a count(). That ends up with a dataset aggregated by decade — 1900, 1910, 1920 — instead of by year. In fact, there's a bonus trick within this one: ten times the truncated division of year by ten, 10 * (year %/% 10), is how you aggregate years into decades. Who already knew that you could create a new variable inside a group_by() or count()? It's actually more obscure than I expected, but it's a really fun trick — it saves you the extra step of creating the variable first. From a dataset like this, I'd end up making a geom_line() of the number of bridges built in each decade.

Third is add_count(). Who's ever used the add_count() function in dplyr? It was added, I think, about three years ago, and it's a little bit more obscure. Who has ever had to do this: a group_by(), then a mutate(), then an ungroup(), where all you're adding is n = n()? It's really helpful, and I'll show a couple of cases. It's a three-step process where I'd say we want to add this n column: for these four observations the total for that group is four, for the two B's it's two, for the one C it's one. That can be done in one step with add_count() — three steps collapsed into one. Where I find this really useful is combining it with filter(). This was a dataset of space launches for particular US space vehicles, and because I ran add_count() on a particular variable and then filtered on that n, I was able to visualize only the vehicles that had at least 20 launches. add_count() plus filter() is a combination I like to play with.

A fourth way I really like to aggregate data with summarize() is creating a list column. This is relatively recent behavior in dplyr — I'm not sure if it was introduced last year or the year before — but you can use summarize() to create a list column. I'm going to pull out a Tidy Tuesday dataset of New York City restaurant inspections, showing, for each restaurant, how many times it was inspected and the average score, where a higher score means more violations. I could find the average inspection score with a group_by() and a summarize(mean(score)), which adds a column called avg_score with that value. But did you know that if you put a list() as the output in that summarize(), you don't end up with a value — you end up with a list column? This field in the result is actually a list, and every one of these objects is a t-test. We could put almost anything into that summarize(x = list(...)), and anything we put in there is turned into an object in this list column. Why would we want to create a table of many objects? That covers a much larger scope than I have time to talk about today, but one example is that we could use the broom package to tidy them and then visualize multiple models. Here it shows what types of restaurants have the most violations: it turns out smoothie places and cafes have the fewest, while Indian, Latin, Caribbean, Spanish, and Thai have some of the most, and you get individual confidence intervals for each of them. There's a chapter on this — R for Data Science, chapter 25 — that is really worth checking out if you're interested in visualizing multiple models. Who has created list columns with summarize() before?

Trick number five is a story of a couple of tricks in one that combine to form something I do a lot, especially in these screencasts — it's actually my favorite plot. When I ask someone what their favorite plot is, some people praise something really elaborate, but the truth is I really like a plot like this: a bar plot that's sorted. In fact, I showed this plot earlier, when I asked which dplyr functions I use the most. It's a combination of four steps: a count(), then an fct_reorder(). Who has used fct_reorder(), or the built-in reorder(), before? It's so amazing for visualization — it turns the column into an ordered factor, which lets you visualize it in this ascending way. Then a geom_col() call: who has used geom_col() as opposed to geom_bar()? If you've used geom_bar() before, you might have had to say stat = "identity"; more recently you can say geom_col() — as in "column" — instead. And who's used coord_flip()? coord_flip() is great because I think a horizontal bar plot is so much easier to read, and you can fit more text. That combination of tricks forms a graph that I make a lot; here are just five examples from some of the screencasts I've done — here's a bar plot, here's the frequency of certain categories, you can stack them — you can really do a lot with this graph.

Who has used fct_lump() before? Continuing our tour of forcats: what's great about fct_lump() is that you might have a lot of levels and want to combine some of them into "Other". This is from one of RStudio's wonderful cheat sheets, the one for forcats, and it shows that if you have a vector with a couple of levels, you can fct_lump() it and all the less common levels get combined into "Other". I'll show an example from a screencast — I'm having a little trouble with the clicker. This was a dataset of horror movies that included their reviews as well as their rating, like PG, PG-13, or R. If I just did a boxplot, it would look kind of out of order, with a few sparse levels — there's only one PG-rated horror movie in this dataset — and it wouldn't be really useful. But with two steps, fct_lump() and then, as we saw in the last trick, fct_reorder(), we end up with a much more interpretable graph that combines all the rare levels into "Other". So I really like fct_lump() before box plots, bar plots, counts — things like that.

Who's used log scales before? This is a tidyverse trick in the sense that it's a ggplot2 function, but it's really also a data science trick that I think is important and sometimes underappreciated. You might think that if I had to give advice to new data scientists I'd start with something like "you must start with a specific scientific question" — nope, my actual number one piece of advice is: try putting your axis on a log scale. Why is that important? I'm going to show a couple of datasets where it's really pivotal. One is a wine dataset, where we asked: what is the distribution of wine prices? If we looked at it on a regular scale, everything interesting would be crammed into one tiny part of the graph. Put it on a logarithmic scale — one where every fixed amount of space on the x-axis represents a multiplication of the value, from ten to a hundred to a thousand — and it suddenly becomes more of a bell-curve shape. This is a distribution statisticians call log-normal, meaning the log of the data is approximately normal. Why does it matter that it's normal? Well, suppose we were trying to predict a wine's rating from its price. If we made a scatter plot of price against how experts rated the wine, everything would get crammed together and the prediction wouldn't be very effective — in fact, it would quickly go off past one hundred. Whereas if we predict the wine rating based on the log of price, we get something much closer to a linear correlation and a better prediction. And indeed, in the screencast I found it was considerably more accurate to predict wine rating from the log of price rather than from the price itself. Once you start visualizing data with log scales, you see that so much real-world data is log-normal; I somewhat suspect a lot more real-world data is log-normally distributed than normally distributed. Examples are GDP per capita across countries — this was a visualization from a screencast on plastic waste, where the x-axis is log GDP per capita of a country and the y-axis is log CO2 emissions, and on regular scales everything would have been crammed into the lower left — population across cities or countries, income across people, revenue across companies. These days if I open a dataset, unless a numeric column is something artificial like a test score, I almost expect it to be log-normally distributed and throw a scale_x_log10() onto the graph.

Moving on to a couple of tidyr functions. Who has used the crossing() function from tidyr? crossing() is amazing, and what it does is fairly simple: it takes vectors and creates a table with every combination of them. Here there's a, which is 1, 2, 3, then a couple of other vectors, and it turns into all the combinations of a, b, and c. Has anyone used expand.grid()? expand.grid() is the base R equivalent; crossing() is a little more convenient for working with other tidyverse functions because, among other advantages, it returns a tibble and doesn't convert strings to factors. It's really hard to explain in two minutes what crossing() is so useful for — it's really worth a talk or a blog post on its own — but some of the things I would do with it are: generate every combination of two numbers so you can plot a mathematical curve; and, used alongside augment() from broom, it's really good for showing how a prediction depends on particular input parameters — in this case, this shows that over time the probability of a show getting another season has become more dependent on audience opinion. And there's a really basic idea of tidy simulation: we can use crossing() to create all combinations of possible inputs and run a tidy simulation on top of those. All I can do for now is advertise the existence of the function; it's been really powerful.

I'll end with two more tidyr tricks for cleaning your data. Who has used separate() from tidyr? About half the room. This is from another one of RStudio's cheat sheets: separate() takes a column — something like a ratio — and splits it out based on a regular expression or a string. It's hard to appreciate how important this is until you've worked with data where a lot of stuff is crammed into one column. This was a dataset I had to work with during my graduate work, where everything to the right of the name is all one column, split up by pipes, with lots of really interesting information. Back in graduate school, I actually went back to my code and found an awful line for extracting the one value out of there that I really needed. Today, what I would do is use separate(): tell it all the columns I want it divided into, give it a string pattern that says "here's how to split it," and get it into five separate columns. Once you work with data in the wild that has fields combined together, separate() is a really handy trick.

The last trick I'll introduce is extract(). Who's used extract() from tidyr before? extract() is fantastic when you have data that's not necessarily separated by a delimiter but is trapped within some pattern in a column. Here, I did a screencast on Bob Ross episodes, and we had these codes for each episode. What I would do is extract() from that episode column: I'd give a regular expression for "S-something-E-something," and from that I get two columns, season and episode number. It's even really helpful that by saying convert = TRUE I get those columns as integers — it figured out that those were integer columns and converted them, so I wouldn't have to do that myself. Steps like this can take two or three or four steps of data cleaning and turn them into one.

That concludes my ten tidyverse tricks. Now, why do I like to focus so much on tricks — what is so great about them? Is the tidyverse nothing more than a bag of tricks? My take is that the tidyverse is greater than the sum of its parts. When you combine all of these ways of doing something, make every one of them a little bit easier, and ensure that they all work together fluently, these benefits aren't just a little bit of convenience here and a little bit of convenience there — they really accumulate and keep you in your data analysis flow. That's one of the reasons I've really loved learning tricks from people — when I pair-program with them, when I talk to them at a conference like this, when I read other people's code — and I've really loved sharing these ten tricks with you today. Thank you. [Applause]
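The three count() arguments described in the talk — sort, wt, and name — can be sketched roughly like this, using made-up stand-in data rather than the actual college majors dataset:

```r
library(dplyr)

# Hypothetical stand-in for the college majors table
majors <- tibble::tribble(
  ~category,     ~total,
  "Arts",         10000,
  "Arts",          5000,
  "Engineering",  20000,
  "Engineering",  30000,
  "Engineering",  15000
)

# Plain count: number of majors per category
majors %>% count(category)

# All three extra arguments at once (name requires dplyr >= 0.8.0):
# sort the result descending, sum the `total` column instead of
# counting rows, and name the output column "graduates" instead of "n"
majors %>% count(category, sort = TRUE, wt = total, name = "graduates")
# Engineering sums to 65000 graduates, Arts to 15000
```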
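The "create a variable inside count()" trick, including the truncated-division bonus for turning years into decades, might look like this (the year values are made up, not the real Maryland bridges data):

```r
library(dplyr)

bridges <- tibble::tibble(year = c(1902, 1907, 1911, 1918, 1918, 1925))

# Create the grouping variable inside count() itself:
# 10 * (year %/% 10) truncates each year down to its decade
bridges %>% count(decade = 10 * (year %/% 10))
# 1900: 2 bridges, 1910: 3, 1920: 1
```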
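The add_count() pattern, and its combination with filter() from the space launches example, can be sketched with toy data (the vehicle names are hypothetical):

```r
library(dplyr)

launches <- tibble::tibble(vehicle = c("A", "A", "A", "B", "B", "C"))

# add_count() attaches the per-group total without collapsing rows,
# replacing group_by() %>% mutate(n = n()) %>% ungroup()
launches %>% add_count(vehicle)

# Combined with filter(): keep only vehicles with at least 3 launches
launches %>%
  add_count(vehicle) %>%
  filter(n >= 3)
```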
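A minimal sketch of the summarize()-into-a-list-column trick, with simulated inspection scores standing in for the NYC restaurant data, followed by the broom step that tidies each t-test into a row with a confidence interval:

```r
library(dplyr)
library(broom)

# Simulated inspection scores for two hypothetical cuisine types
inspections <- tibble::tibble(
  cuisine = rep(c("Cafe", "Thai"), each = 30),
  score   = c(rnorm(30, mean = 10, sd = 3), rnorm(30, mean = 15, sd = 3))
)

# summarize() with list() creates a list column:
# one t.test object per group instead of a single value
tests <- inspections %>%
  group_by(cuisine) %>%
  summarize(test = list(t.test(score)))

# broom::tidy() turns each model object into a tidy row,
# giving an estimate plus conf.low / conf.high per cuisine
tests %>%
  mutate(tidied = purrr::map(test, tidy)) %>%
  tidyr::unnest(tidied)
```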
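The four-step sorted bar plot — count(), fct_reorder(), geom_col(), coord_flip() — can be sketched like this, with made-up word frequencies in place of the function-usage data:

```r
library(dplyr)
library(ggplot2)

words <- tibble::tibble(
  word = c(rep("mutate", 5), rep("filter", 4), rep("count", 3), "summarize")
)

words %>%
  count(word, sort = TRUE) %>%
  mutate(word = forcats::fct_reorder(word, n)) %>%  # order bars by frequency
  ggplot(aes(word, n)) +
  geom_col() +    # no need for geom_bar(stat = "identity")
  coord_flip()    # horizontal bars are easier to read and fit more text
```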
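The fct_lump() plus fct_reorder() boxplot from the horror movies example, sketched with hypothetical ratings and review scores:

```r
library(dplyr)
library(ggplot2)

movies <- tibble::tibble(
  rating = c(rep("R", 50), rep("PG-13", 30), "PG", "NC-17", rep("UNRATED", 3)),
  review = runif(85, min = 1, max = 10)
)

# fct_lump() collapses the rare rating levels into "Other";
# fct_reorder() then orders the boxplot by median review
movies %>%
  mutate(rating = forcats::fct_lump(rating, n = 2),
         rating = forcats::fct_reorder(rating, review)) %>%
  ggplot(aes(rating, review)) +
  geom_boxplot()
```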
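The log-scale trick is one line of ggplot2. Here is a sketch using simulated log-normal prices in place of the wine data: on a linear axis the histogram is crammed into one corner, while scale_x_log10() reveals the bell-curve shape:

```r
library(ggplot2)

# Simulated log-normal prices, standing in for the wine dataset
prices <- data.frame(price = rlnorm(1000, meanlog = 3, sdlog = 1))

ggplot(prices, aes(price)) +
  geom_histogram(bins = 30) +
  scale_x_log10()  # each fixed step on the axis is a multiplication
```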
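A tiny example of crossing() from tidyr, mirroring the a/b/c illustration in the talk:

```r
library(tidyr)

# Every combination of the inputs, returned as a tibble
# (unlike expand.grid, strings are not converted to factors)
crossing(a = 1:3, b = c("x", "y"), c = c(TRUE, FALSE))
# 3 * 2 * 2 = 12 rows
```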
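The two cleaning tricks, separate() and extract(), sketched on toy columns (the "S01E01" codes echo the Bob Ross example; the ratio column is hypothetical):

```r
library(tidyr)

# extract(): pull values trapped inside a pattern;
# convert = TRUE turns the captured strings into integers
episodes <- tibble::tibble(episode = c("S01E01", "S01E02", "S02E01"))
episodes %>%
  extract(episode, c("season", "number"), "S(\\d+)E(\\d+)", convert = TRUE)

# separate(): split a combined column on a delimiter
ratios <- tibble::tibble(ratio = c("3/4", "1/2"))
ratios %>%
  separate(ratio, c("numerator", "denominator"), sep = "/", convert = TRUE)
```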
Info
Channel: Lander Analytics
Views: 20,029
Rating: 4.9772081 out of 5
Id: NDHSBUN_rVU
Length: 20min 41sec (1241 seconds)
Published: Mon Dec 02 2019