Modern R: Welcome to the tidyverse

Captions
So you're coding for loops like they're going out of fashion, you're writing your own functions like some kind of Bill Gates, but every once in a while you run into a problem, and the solutions that you find involve this funny little parentheses operator (%>%) or mention the dplyr package, and you go running for the hills. Well, my friends, it's time to face those demons head on. Today I want to introduce you to modern R and the tidyverse.

By the end of this video you'll have built a familiarity with the tidyverse and some of its packages, and you'll have learned what tidy data is and how to manipulate data into a tidy format. The lesson structure, then: first I'll introduce you to the tidyverse, I'll talk a little bit about tibbles and tidy data, then I'll introduce you to the pipe operator (that's that funny little parentheses guy), and then we'll talk about some common data manipulation vocabulary: filtering, joining, grouping, that sort of thing.

The tidyverse, and I love this: "an opinionated collection of R packages". As data scientists, we're all familiar with the fact that we spend probably 80 percent of our time just cleaning data, getting it into a format that we can work with. So the idea behind the tidyverse and tidy data was to create a data standard that would facilitate exploration and the analysis side of things, where we should really be spending our time. It came about from some of the early work in Hadley Wickham's reshape package, and he has a great quote here: "Tidy datasets are all alike, but every messy dataset is messy in its own way."

So really it's a philosophy of data, and the philosophy of tidy data can be broken down into three basic points: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. It's very simple. We put each dataset in a table, we put each variable in a column, and there we have it. It doesn't seem like anything new or groundbreaking, and you're probably saying, "But Dylan, I already do this." But it's very easy to lose sight of all this. We're going to work through some examples, and I think you'll find that you're guilty of quite a few of these practices.

Now, why do we need tidy data? Why do we need to learn this new style of R programming? What was wrong with the old way of doing things? Well, nothing, but there is a general advantage to picking one consistent way of storing data: you'll frequently be using the same tools to accomplish the same tasks, and that familiarity will lead to a lot of efficiency in your work and your programming. There's also a very specific advantage to be had by keeping data in columns: the vectorized nature of R allows us to tap into some of the speed enhancements gained by inputting things as vectors. But this isn't all to say that non-tidy data is bad. Alternative representations might have substantial performance or space improvements, and specialized fields have their own conventions for storing data that will definitely be different from tidy data. But I promise you that the tidyverse will change your life. So are you taking the red pill or are you taking the blue pill? Let's see how deep the rabbit hole goes.

The first step, obviously, is going to be to install the packages in the tidyverse, so go ahead and download those; I've got the code written here for you. Then you're going to attach the tidyverse packages, and you'll see that once you've attached the family of tidyverse packages, it tells you which specific ones it has attached by default and also informs you of any conflicts with functions that you may have already attached to your environment.

A good place to start is with the tibble, as this is one of the unifying features of the tidyverse. Tibbles are essentially the modern data frame: they tweak some of the older behaviors of R's native data frame and provide some enhancements that make things easier to work with, and certainly prettier. Here I have an example of the familiar iris dataset as a tibble, and we are printing it in our console. Instead of getting the entire data frame overflowing our console, we get just the first 10 rows and only the columns that fit the width of our window, so a nice snapshot. We also have some additional information about the size of the tibble, how many rows we are hiding, and a little bit of information about the different column types, so I've got double, double, double, double, factor. Clean and concise, the way we like things: tidy.

Now, the best way to learn is by doing, so what we're going to do is work through some of the more common examples of messy data. This is where I think you'll find that, even where you believed you were following these practices, you're probably guilty of at least one of these common problems. We'll start to learn some of the vocabulary of tidy data by working through examples, the first of which is the case where column headers are actually values and not variable names.

You can go ahead and print table4a here; it is included with the tidyr package. When you print it, you see it's a very small table, three by three, with one column for country, another column for the year 1999 and another column for the year 2000. The headers of these two columns are actually values of the variable year, and that makes each row actually represent two observations. So for Afghanistan in row one we have an observation from 1999 and an observation from the year 2000. What we need to do, then, is pivot these columns into rows, and that is exactly what we are going to do, using the function pivot_longer to lengthen our data. This function takes as its first argument the tibble, then as its second argument the columns that we are going to pivot into rows, then a name for the new column containing that information, and then a name for another column that's going to contain the associated values that used to be stored in those columns.

What about the opposite situation? Here we have another table, table2, also included with the tidyr package, and in this case we have observations that are spread across rows. You can see that under the column year I have two entries for 1999, and under the type of data I've got a number of cases and a population count. So here we actually have one observation spread across two rows, and we have that for each year. In this case we need to pivot our data wider, and we do that with the pivot_wider function, which takes a table as its first argument and then simply the column we're going to take the names from and the column we're going to take the values from. In the output we can see that we now have one single row for each observation and the two variables, cases and population. Nice and tidy.

Now let's break away for a moment from cleaning messy data and talk about these tidy tools that we've been using. pivot_longer and pivot_wider are two of the most important tools in the tidyverse. There are many, many more, and they include tools for manipulating data and for importing directly into a tidy format. All of the packages and tools in the tidyverse share a common philosophy of R and of tidy data, and they're designed to work together naturally. In fact, most tidy functions, as we've seen, take a table as their first argument; this is going to come in handy soon. A lot of the principles of tidy data are closely tied to practices from relational databases, but they're presented in a manner that's accessible to statisticians and data scientists, with a standard vocabulary of verbs. Pivot is the first verb that we've learned: pivot_wider and pivot_longer. They used to be called spread and gather, so do note that sometimes these verbs change.

Another tool that we're going to become familiar with is the pipe operator. The default behavior of the pipe is to place the left-hand side as the first argument to the function on the right-hand side, and I have an example of this. What it allows us to do is chain functions together so that we can avoid ugly nesting. Here's the example I'm presenting: x is nested in the function f with an argument foo, and that's nested in the function g with the argument bar, that is, g(f(x, foo), bar). Instead of having that unwieldy code, we can chain these together with the pipe operator: send x to the function f, send the output of f to the function g, that is, x %>% f(foo) %>% g(bar), and we get a nice logical order of programming that's easy to read.

Okay, so let's pick up where we left off and get into maybe the more complicated example, as it combines both previous examples: we have data spread both across columns and across rows. This one isn't included in any of the R packages, so you won't be able to follow along, but you can see that this Mexican weather station dataset has columns for each day of the month, and then rows representing the maximum and minimum temperatures recorded at the weather station for those days in the year 2010. All of this is spread both in rows and in columns, so we're going to need to pivot wider and pivot longer, and in this example we're going to do it with the pipe operator.

To read this, we can say that we are piping the weather tibble into the pivot_longer function, and we no longer need to specify the tibble in that function because that's what the pipe does for us. Then we pipe the output of pivot_longer into mutate. This is a new verb that we're learning, which allows us to create variables or modify existing variables while preserving the rest of the structure of the table. We'll create a new variable that's going to store the data from the day columns, and then I'm going to additionally create a variable for date, so we can condense the year, the month and the day all together. Then another verb that we're learning here: we're going to pipe into the select function, which just allows us to pick the columns that we want to retain, so we can get rid of some of the redundant ones. And now we pipe that into the pivot_wider function. So first we condense our days, and then we widen our maximum and minimum temperatures. What we get looks a little bit like this, where we have one row for each date and the maximum and minimum temperatures in their own columns, in a manner that is very easy to follow. I think that's one of the benefits of the tidyverse, really: it's legible, it's easy to read.

Our next example is very topical given the current pandemic. The who dataset comes from the 2014 World Health Organization Global Tuberculosis Report, so it contains the number of cases of TB broken down by country, by year, by the manner in which they were diagnosed, and by sex and age. We can see some of those columns: we have country, and we have iso2 and iso3; those are all somewhat redundant columns representing where these cases came from. Then we have the year column, and then we have this whole list of columns with very cryptic headers. As it turns out, they represent different bits of information. The "new" at the start represents whether it's a new case or an old case; as it turns out, this data frame only contains new cases. The next couple of letters represent how the case was diagnosed, for example pulmonary smear positive or pulmonary smear negative. Then the m or the f is the sex of the patient, and the numbers represent the age group. So we actually have about four different variables stored in each of these columns that we need to split apart.

It's going to involve a sequence of pivoting and splitting, and here we have the code necessary to tidy this. First we pivot our columns longer into rows, and we're just going to call them key. Then we pipe that into mutate and overwrite key, pretty much just inserting an underscore in the word newrel so that it fits with our separate function. Then we pipe into the separate function, so we're basically going to split the key into several different words. Then we select; we're going to do an inverse selection here, so we're excluding columns by putting a negative in front of them. And then we separate the sex-age column. When we run all that, you can see that the output we get is one row for each country, each year, each type of case, each sex and each age group, so one observation for each row. Nice and tidy.

For this next example, we are going to consider the case where we have multiple observational units stored in the same table. Consider this version of the billboard dataset. There is a billboard dataset in the tidyr package, but this view of it is a little bit different. This dataset contains all of the Billboard Top 100 hits for the year 2000, so you can see here we've got some 2Pac songs and 2Ge+her, who remembers them? We have information about the song, and the week and the rank in the top 100 for each of those songs for each of the weeks. So there are actually two types of information here: we have information about the track (the artist, the song, and the duration of the song), and then we have information about its rank in the Billboard Top 100.
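As a recap of the tidying steps covered so far, here is a rough sketch in R using the example tables that ship with tidyr (table4a, table2 and who). The new column names ("year", "cases", "type", "sexage") and the object names are assumptions chosen for illustration; only "key" is named in the video, and the speaker's on-screen code may differ.

```r
library(tidyr)
library(dplyr)
library(stringr)

# Column headers that are really values: lengthen table4a.
# "year" and "cases" are assumed names for the new columns.
tidy4a <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases")

# One observation spread across two rows: widen table2.
tidy2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

# Several variables packed into each header: the who dataset.
who_tidy <- who %>%
  # Pivot every case-count column into rows, dropping missing counts
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "key",
               values_to = "cases",
               values_drop_na = TRUE) %>%
  # Make "newrel" consistent with the other "new_*" prefixes
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  # Split the key at each underscore into three pieces
  separate(key, into = c("new", "type", "sexage"), sep = "_") %>%
  # Inverse selection: drop the constant and redundant columns
  select(-new, -iso2, -iso3) %>%
  # Split sex (first character) from the age group
  separate(sexage, into = c("sex", "age"), sep = 1)
```

Each step takes a table as its first argument, which is why the whole sequence chains with the pipe instead of being nested.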
So to store this in a tidy format, we actually need to split this so that each observational unit has its own table. We're going to do that in two steps. We'll first pull out all the song information, and we do that by piping the dataset into the mutate function, creating a unique ID for each unique song. We do that with the group_indices function, which looks for the unique combinations of artist, time and track. Then I'm only going to take the distinct records of that, and then we'll select the columns relevant to the song: the ID, the artist, the track and the time. We'll save that in an object called song. Then we'll extract all of the rank information: we pipe the billboard dataset into mutate and create the same IDs as we did before, but this time, instead of selecting the distinct records, we keep every row but only the information about the rank.

We can have a look at those two tidy datasets. We now have a nice clean tibble with all the song information, with some 504 Boyz and some Aaliyah, RIP, and then we have the rank tibble, which has, for each of the songs, its corresponding rank. We can see that the common column that links these two is the ID column, so in rank anything with ID 1 is the 2Pac song, ID 2 is the 2Ge+her song, and so on and so forth.

Now, in actual practice there are very few tools that can handle this kind of relational data, so we would probably actually join them back together if we were going to work with this data, but this is the tidy format. And a join is very simple. We're going to do a right join, to be specific, which takes on the left all the songs and on the right all of the ranks, and it will keep all the rows for the ranks and replicate the songs where it needs to, by a common column, in this case ID. So we right-join song to rank, and this is the output that we get: almost the original dataset. We are just missing the column for week. And that's a new verb: join.

Okay, and now our last example. We're going to consider the case where we have a single observational unit that's spread across multiple tables or multiple files. These might be spreadsheets or CSV files or tables. For our example, say you're a scientist investigating the recent prevalence of names that rhyme with Aiden, and you've got data that's spread across several CSV files of yearly baby name tables from the Social Security department in the US. This is the example that I have. Using the dir function, we can list out all of the files in a specific folder; I've got about six files in this baby names folder. Then we're going to pipe that into the map family of functions from the purrr library. The map functions will map a function to a vector of values; in this case the values are file names and file locations, and the function is the read_csv function. So it's similar to apply: we're going to apply that function to each of the entries in this list of files. I'm using the map_dfr function, which outputs a table; there are other map functions that output lists and different types.

Here we can see the output of some of the files that I've loaded. The top part of this is for 1985, when I was born: the name, the percent of people registered in the Social Security database with that name, and the sex of that particular name. So for '85 you can see the most popular female name was Jessica. We're going to come back to this dataset, but you can see how straightforward it is to load several files of the same observational unit together in one shot.

This is really a subject for another lecture, but we can also apply the same principles of "input tidy, output tidy" to both modeling and visualization. We can do that with the ggplot2 package and the broom package. I'm not going to really get into these, it's a lot to cover, but I will just give you a quick example of some graphics using this philosophy of tidy data and clean graphics.

I'm taking that baby name dataset and just getting it into a format that I want to present nicely in visual form. I'm taking all of the names and piping them into filter, just getting the top 10 and a couple of extra ones: I'm going to filter anything that's in the top ten, any boys named baby, and any names that rhyme with Aiden. I've pulled in another list of names that satisfy that: Aiden, Jaden, Braden. I don't know who is naming their children these, but we are going to filter those. Then, with mutate, I'm going to combine all of the names that rhyme with Aiden, then group them by name, year and sex, and then summarize them, so I'm going to sum all the percentages here. group_by and summarize are a couple of new verbs that we're learning here, which allow us to do the equivalent of aggregating. Then I'm going to ungroup those and arrange them, that is, sort them by year, by sex and then by descending percent. Then I do another mutation where I just add a dummy variable that will help me order these in my plot, and then I group them once again so that I can make a faceted plot, and create one more dummy variable just to make it easier for me to organize the plots with ggplot.

I've got that all saved in my names_for_plot object, and then we're going to use the ggplot function to visualize these. You can see that the syntax is somewhat similar to what we see with piping, but instead of piping we're adding different components of the plot. First I create the plot, then I add a column geometry, then I flip the coordinates, then I split these up into different facets by a grouping variable, and then I just do some tweaking to the axes.

The output of all of that is this pretty little plot where, for 1985, I've got the top 10 most common baby names, including my name, and the names that rhyme with Aiden for the boys, and I've got the same for the year 2000 and for 2018. We can see the rise of those names that rhyme with Aiden. In '85 I was just edging them, and then in the year 2000 they start to creep into the top 10. Now, this is a group of names, so there is going to be that to consider, but by 2018 names that rhyme with Aiden eclipsed every other single name on the list for men. People are crazy.

All right, so jokes aside, I hope that you guys learned a lot in this video. We set out to familiarize you with the tidyverse and some of its packages, and to build some experience recognizing messy data and tidying it into a tidy format. We did this by manipulating it with some fundamental verbs: we learned filter, we learned mutate, just at the end there I introduced you to grouping, summarizing and arranging, and we also talked about tools for combining multiple datasets with the join family of functions. So I think we nailed all of those objectives.

If you are interested in learning further, all of this information comes from the now-classic Tidy Data paper by Hadley Wickham in the Journal of Statistical Software, and from a free book called R for Data Science, which contains a lot more examples to work through and a lot more information about this whole philosophy of data. I encourage you to check them out.

As always, thank you guys for watching. Be sure to check us out on all of our social media channels at Cape RADD, and please give me a like and a subscribe on this video. If you are a student studying marine science and looking for more experience with this kind of thing, and more practical experience in the field, check out our field course. We do lots of scuba diving, spend lots of time in the classroom learning how to analyze that data, and it's all great fun. So thank you guys, and enjoy. Look forward to seeing you next time.
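The verbs recapped above (filter, mutate, group_by, summarize, arrange and the join family) can be sketched on toy data as follows. This is only an illustration: the tibbles, their values and the object names are invented here and are not the speaker's baby-name or Billboard files.

```r
library(dplyr)
library(tibble)

# Toy baby-name table; every value here is made up for illustration.
names_tbl <- tribble(
  ~year, ~sex, ~name,     ~percent,
  2000,  "M",  "Michael", 1.5,
  2000,  "M",  "Aiden",   0.2,
  2000,  "M",  "Jaden",   0.3,
  2000,  "F",  "Emily",   1.2,
  2018,  "M",  "Liam",    1.0,
  2018,  "M",  "Aiden",   0.4,
  2018,  "M",  "Jaden",   0.3,
  2018,  "M",  "Braden",  0.5
)

rhymes <- c("Aiden", "Jaden", "Braden")

summary_tbl <- names_tbl %>%
  filter(sex == "M") %>%                       # keep only the boys
  mutate(name = if_else(name %in% rhymes,      # pool the rhyming names
                        "Rhymes with Aiden", name)) %>%
  group_by(year, sex, name) %>%
  summarise(percent = sum(percent)) %>%        # aggregate within each group
  ungroup() %>%
  arrange(year, sex, desc(percent))            # sort, largest share first

# Toy relational pair: one row per song, many rank rows per song.
song <- tribble(
  ~id, ~artist,    ~track,
  1,   "Artist A", "Track A",
  2,   "Artist B", "Track B"
)
rank <- tribble(
  ~id, ~week, ~rank,
  1,   1,     87,
  1,   2,     82,
  2,   1,     91
)

# Right join: keep every rank row and repeat the song details as needed.
joined <- right_join(song, rank, by = "id")
```

Because every id in the toy rank table matches exactly one song, the joined table has one row per rank row, which mirrors the "almost the original dataset" result described in the video.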
Info
Channel: Cape RADD
Views: 3,170
Rating: 5 out of 5
Id: a-Yb518580c
Length: 24min 39sec (1479 seconds)
Published: Thu Apr 30 2020