The Lesser Known Stars of the Tidyverse

Captions
A little bit about me first. I've been an R user for around six years. I first learned at Rice University, where I went for undergraduate. I'm currently a data scientist at DataCamp; I started just two weeks ago. For those of you who aren't familiar with them, they're a data science online education company, and they're sponsoring this conference. You can go learn more about them at the booth.

A couple of things I enjoy talking about, if you find me at lunch or in the coffee breaks. A/B testing: the talk I gave at Jared's meetup was all about online experimentation. I did a lot of it at Etsy, where I was before, and I'm currently helping DataCamp build out their platform. Building and finding data science community: this is really why I enjoy being at conferences like this. All the talks are recorded, which is great for those who can't make it, so you can see these talks, but really it's the energy in the room, it's the connections that you make when you see people, and the coffee and the lunch, that make it so much fun to be here. And finally, of course, R.

All right, I have three goals for my talk today. The first is to keep you hip to the lingo. So I'm talking about the tidyverse, but what is this tidyverse? How did it come about, and who are the people behind it? The second is to stop you from doing this. We've all had those moments where we know exactly what we want our dataset to look like, how we want to reshape it, or the model, and it just won't work. I'm hoping to help solve that by teaching you some useful functions you can start using right away in your workflow. And finally, to point you to some resources. I'm hoping at the end of this talk you'll be excited to learn even more about the tidyverse, after understanding better its breadth and how the functions can work with your data analysis process, and I want to help you start that journey, so that will be the closing part of this talk.

All right, the tidyverse. The tidyverse is a set of R packages that are designed
to work really well together and to bring you through the data analysis process. This is what the tidyverse universe looks like. Most of us are probably familiar with these steps. First you have to even get your data into R. Then you work on changing it; maybe you have to munge some dates. Probably you'll try out visualizations, plotting your data, hopefully with ggplot2. And then the final step is usually to communicate it to other people. You might do that through a Shiny interactive dashboard, or an R Markdown document that you make into a Google Doc that you share with stakeholders. And what we see throughout this process, written below the big steps (import, transform), are just some of the tidyverse packages that can help you through this.

Now, most of you here are probably familiar with this man; he is going to be speaking. Hadley Wickham, for those of you who don't know, is the author behind such packages as dplyr and ggplot2, and if you've been around in R for a few years, you might have heard the term "the Hadleyverse" for these packages. So a question could be: is the tidyverse just the Hadleyverse? Is it just Hadley Wickham? The answer is absolutely not. Yeah, this is the first time, I think, I've had Hadley in the audience for this. But what I wanted to show with this is that Hadley's done a great job starting this, but here are some of the packages in the tidyverse, and these are the people behind them. In some cases they're the creators of the package; in other cases they're the current maintainers or active developers, without whom the package wouldn't look like how it is today. And that's what I really wanted to emphasize: it's grown much beyond one person, and really now there are lots of contributors and lots of packages in the tidyverse, and it makes it a really exciting place to be.

The main part of my talk is going to be this demonstration. I'm going to walk through an exploratory data analysis of Kaggle's data science and machine learning survey. This was something
they did in 2017. They surveyed about 16,000 people, ranging from those just starting in data science to people already in the field, asking questions on their demographic information, what their favorite learning platforms are, or what machine learning methods they are excited about next year. And I'll walk through some steps of a data analysis workflow, going from viewing our dataset initially, to inspecting some different columns, making plots, and, maybe everyone's favorite step, doing something really cool and new.

All right, so first up: you just type the name of your dataset, after you've imported it, into your console. And who's ever had this happen, where it just takes over the whole thing and you're like, "Oh no, what have I done? What mistakes have I made?" What's happened is you have so many columns, and it's printing all of them out, and this is really not a useful way to view your dataset, because it's just completely taken over. The solution to this is to use the function as_tibble from the tibble package. Tibbles are modern data frames, and they've changed a few behaviors of data frames, but one of the great ones is that they will only print the first ten rows and the columns that fit on the screen. So we can see here that for the rest of those columns, where the rows were all printing out, now we just see the column name along with its type. So you never have to worry about that never-ending console screen again. I guess unless you have maybe 10,000 columns. Don't do that.

Next up: examine your NAs. As we heard in an earlier talk, missing values can be really important, so how do you take a look and see what those missing values might be? One way you can do it is with a combination of sum and is.na. is.na will change the country variable to TRUE or FALSE depending on whether it's missing. Then we sum it, and because TRUE evaluates to 1 and FALSE to 0, that gets us the number of NAs. And using dplyr's summarise we can get this so that now
we have just a one-row dataset that's the number of NAs, and it's 0. But how can we do this for every column? I don't want to be running through each column name doing this. Here's where purrr can come in. purrr, to me, was initially a very intimidating package. It's all about functional programming and it has a lot of stuff, but I just want to share, with those who aren't familiar, how you can start using it. What I'm going to do here is map this function, sum of is.na, over every column. The map_df says I want a data frame back; you also have map_chr, "I want a character back", and that's how you tell it. And we can see I do indeed get back a data frame of one row and 228 columns, and what's in the row here is the number of NAs in that column. The way I've set this up, the dot that's in here says this is where the column should go, this is where I want you to apply the function, and that little tilde just stands for an anonymous function. This is a really nice way to do something over every column.

What's a problem here? Let's take a look at GenderSelect. It said that there were zero NAs. Well, what do we see here? As humans, we see there are 95 people where it's an empty string, and we recognize that as an NA, but R doesn't, so our missing values aren't actually NA. To solve this we can use dplyr's na_if, and its argument is what we want to change to NA; in this case we want to change an empty string to NA. And now if we once again look at our GenderSelect variable, we see, just as we would hope, that the 95 entries are NA.

Next step: let's examine some of our numeric columns. But how can we do this quickly? We're going to use a combination of dplyr's select_if and skimr's skim. You might be familiar with dplyr's select; select_if says I only want to select the columns if a condition is met, in this case that they're numeric. And then skimr's skim gives us a really quick overview. This is a lot of
the summary information you'd probably want, starting with how many are missing, what's the mean, the standard deviation, what are the percentiles, and finally, probably my favorite feature, the histogram. And we can immediately start seeing some things, like, oh, we might have a data quality issue, because the minimum age is zero and the maximum is 100. Probably we didn't have zero-year-olds taking our survey.

Next, let's go back and examine a single column. This is the column WorkMethodsSelect, and I'm just doing a count: let's summarize the data a bit and see what the more popular work methods are. What's an issue we have here? We can see that this must have been a multiple-selection question. Each row represents a person's answer, and the problem we have here is that some of the answers have two work methods selected; like, we see the third row is "Data Visualization, Logistic Regression". So if we want to understand what's the most popular work method, this is not a really good format for our data to be in. To help with this, we're going to use the stringr function str_split. stringr is all about working with strings. In this case we'll say: take that WorkMethodsSelect column, split it based on the comma, and with that make a new column, work_method. Let's take a look at that. Now what we have is a column of lists. That first row is a character list of five entries, and that used to be five things split by four commas. But now we've got a list, and that's not the format we wanted it to be in, so how can we spread it out so that, once again, each row will only have a single entry? For that we use tidyr's unnest. We take this dataset, which has a list in each row, and we spread it out, and we end up with over 68,000 rows, because now it's much longer. And with that we can start trying to visualize.

So let's make a plot. Let's do a plot of, as we wanted to ask, what are the most
common work methods. And we have a critical issue with this plot, which is that it's a mess. This is not helpful, right? What the heck are these x's? What are the labels? So we're going to do two things to solve this. The first is a coord_flip, and this is really nice. The other option would be to make your x-axis labels vertical instead of horizontal, but I personally like the coord_flip a lot. Now we can read it, and that's more helpful, but we still have a problem: they're not ordered. So let's order them. For that we can use forcats, which is one of my favorite packages. It's all about working with factors, and here we're saying: let's reorder the work_method column by n, by the number of people. And now we can see that, oh great, this is much easier to read. We see that data visualization and logistic regression are the most popular, versus, like, GANs, which are all the way at the bottom, not very popular.

Our final step in a data analysis process, as we heard, is maybe just 5 or 10% of the time, but it's doing something cool and new. What's the issue we have here? We have no idea what we're doing. So who's run into this? You're so excited about a new method, and you're like, "I don't even know where to start", or "I just ran into a bug, and why isn't my code working?" There are a lot of options you have here, such as reading the manual, but another option is that you can ask a question. And if you want to ask a good question, say on Stack Overflow, RStudio Community, or Twitter, a good way to do it is to make a minimal reproducible example, which is all about helping others help you. We're going to do this with two packages, tibble and reprex.

First, just to clarify what a minimal reproducible example is: let's say you asked a question that says, "Why isn't fct_reorder working?" It would be very difficult for anyone to answer that for you, because they wouldn't see your code and they wouldn't know what dataset you're working with. But what if your code is a thousand lines and
your dataset is proprietary? Well, you can't really share that either. So a minimal reproducible example is about reducing your code to the shortest possible amount of code and data that still illustrates the problem you're having. This is an optional part, but if you need to make a kind of fake dataset to do this, you can use tribble to make a toy dataset. tribble is how I always dreamed of making datasets in R: on the first row you list your column names, with a little tilde before each, and each following row is just the entries. Here I'm just saving that as an original data frame.

Next up, we can use the package reprex, from Jenny Bryan, to find any problems. What do I mean by that? Let's say we have our code; on the left here we've said, okay, this is the code that's having a problem, let me try reprex. I copy it and run reprex, but, oh no, I'm having errors. If someone was running this in a clean R session, they would have errors, and not the errors I want them to have, but errors because I haven't loaded the packages. In this case, if we look at what errors come out, it's "I can't find the function [unclear]", "I can't find the function ggplot". We go back to our code and realize we haven't done the library calls. So reprex will help you find those issues before you post and someone comes back and says, "Hey, I can't run your code."

Then we can use reprex to post your question or issue, and that's really nice. In this case we're seeing an example where you've posted it on GitHub, like maybe you want to file a bug report with a package. The nice thing is that it formats your code and includes any plots, and it makes it really easy to go and submit your issue just by copying it.

All right, so this is just an overview of what we went through today: about 11 functions. Like I said, there are so many more, not just packages but functions within the packages, and that's why I wanted to go over, for example, dplyr, which
is a very popular package, but many people don't realize the depth of the functions that it has. With the resources at the end, I'm going to point you to some places where you can find out things like that.

The first resource is Hadley Wickham and Garrett Grolemund's R for Data Science book, which will probably be given away at this conference. It's available for free online, and it's a really good resource. It's appropriate both for beginners, if you're just starting out in R, and for more advanced R users who are looking either to maybe wholesale switch into the tidyverse, or who have been using the tidyverse but aren't familiar with certain parts, such as working with dates. They can go to this book and just jump to the chapter or section on that.

Next up is the #rstats Twitter hashtag. I'm a big Twitter user, and one of the reasons I like it is that it's a very friendly community. In this case, what happened was I had a report that I was writing at Etsy, and I had about 12 plots in it, but I was going to share it with a lot of people, so I wanted to make those plots use the Etsy color scheme. I knew about theme_set, which is a way, at the top of your script, to set the background for your plot, and that will change all the plots later. But that didn't work for colors, and I wanted to know: how can I globally change the colors on my plots without having to go to every single plot and add the color scale that I wanted? And #rstats Twitter came to the rescue. LD Sessler here said, "Did you know about this package, ggthemr? Here are the functions in it, and here's a blog post I wrote about it." This was a really nice way to get an answer to my question that googling hadn't given me.

Next up, RStudio has lots of cheat sheets. You can find them at their table in the back. This one is purrr. I like it, again, as a reference: purrr is often thought of mainly for its map function, but this is a
two-sided cheat sheet, and there are so many more functions that you have here. So whether you want to dive into a package or you want a one-stop shop for working with strings, you can go to the RStudio cheat sheets and find those.

Now, of course, I'd be remiss if I didn't mention my company, DataCamp. I actually had this in my presentation even before I worked with them, and that's part of why I was so excited to join. As I mentioned, it's a data science online education company, and it's a little bit different. We have a couple of things: we have projects and problem sets, but one of our main things is courses. These courses are four hours, so they're a lot shorter than, say, a Coursera course, which might run over a few months. We have people like Charlotte Wickham; Hadley has a course; I'm working on a course right now on categorical variables. The way these courses work is you watch videos and then you do these interactive exercises right in your browser. So if you're just starting to learn, say, R, Python, or SQL, it's a nice way to not have to download anything and to get immediate feedback.

All right, in conclusion, I wanted to say again that R is great. We have, as Kristine Zhang here is sharing, all these awesome stickers (again, you can find some of them in the back) and cool package names. So maybe that's how we can help attract people initially, but really they're staying for the friendly community and the happy workflow. That's part of why I love being part of the R community so much. I work with R-Ladies, a global organization advancing gender equality in the R community, and we've had such an outpouring of support from a lot of other people in terms of sponsorships. I really think that just shows that the people helping lead R do not want to be elitist; they don't want to exclude others. R is meant for people who in some cases are never going to be professional developers, and that's kind of what the tidyverse is
about: even if you don't want to go deep into, I'd say, advanced programming, which you definitely can in R, the tidyverse gets you doing powerful things very quickly.

With that, I want to say you can find my slides very soon at this link, tiny.cc, New York R talk. I blog about some of these topics, like finding data science community and making R faster, on my blog, Hooked on Data. You can find me on Twitter, @robinson_es. And with that, thank you all so much.

[Applause]
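
The data-cleaning steps described in the talk (counting NAs per column with map_df, converting empty strings with na_if, splitting and unnesting a multi-select column, then plotting with fct_reorder and coord_flip) can be sketched roughly as below. This is a minimal illustration, not the speaker's actual code: the toy `survey` data and its values are made up, and the column names are assumptions based on how they are spoken in the talk.

```r
library(dplyr)
library(tibble)
library(purrr)
library(stringr)
library(tidyr)
library(forcats)
library(ggplot2)

# Hypothetical toy data standing in for the real survey responses
survey <- tribble(
  ~GenderSelect, ~Age, ~WorkMethodsSelect,
  "Male",        25,   "Data Visualization,Logistic Regression",
  "",            31,   "Data Visualization",
  "Female",      NA,   "Logistic Regression,Random Forests,Data Visualization",
  "Female",      40,   NA
)

# Count the NAs in every column; map_df returns a one-row tibble
na_counts <- survey %>% map_df(~ sum(is.na(.)))

# Empty strings are not NA until they are converted with na_if
survey <- survey %>% mutate(GenderSelect = na_if(GenderSelect, ""))

# Split the multi-select answers on the comma, then give each answer its own row
methods <- survey %>%
  filter(!is.na(WorkMethodsSelect)) %>%
  mutate(work_method = str_split(WorkMethodsSelect, ",")) %>%
  unnest(work_method) %>%
  count(work_method)

# Order the bars by count and flip the axes so the labels are readable
p <- ggplot(methods, aes(x = fct_reorder(work_method, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "work_method")
```

Printing `na_counts` before and after the na_if step shows why the empty strings matter: the first count reports zero missing values for GenderSelect even though one response is blank.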
Info
Channel: Lander Analytics
Views: 15,258
Rating: 4.9736261 out of 5
Id: ax4LXQ5t38k
Length: 18min 4sec (1084 seconds)
Published: Wed Aug 15 2018