Manipulating and exploring data with dplyr

Video Statistics and Information

Captions
Hello and welcome to this video tutorial on using the dplyr package. This is the first time I've done a video focusing entirely on a single package, but dplyr is extremely useful for subsetting, transforming, and getting summary statistics for your data much more easily and quickly than you could using the base R commands.

So let's start by installing dplyr and loading it into memory. If you haven't installed dplyr before, then we use install.packages("dplyr"), which I won't run because I've already installed it. It may take quite a long time to install, so you might want to pause the video now while you install the package. When the package is installed, we load it into memory using the library function, library(dplyr). And then for this tutorial we're going to load in the mtcars stock dataset that's available in R.

OK, so the dplyr package has a large number of functions which are extremely useful for exploring your data. We're not going to cover all of them, but we're going to cover probably the six most important functions in the package, which I'm going to refer to as dplyr's verbs. These are select, filter, arrange, mutate, summarize, and group_by. So these are the six most useful functions in the package. Notice that they're all verbs: they all describe an action that we're going to perform, which is useful for remembering which one does which task.

We're going to start with the select function. Quite simply, the select function takes the data frame as an input and then any columns which you want to return from the dataset, so it's a way of filtering or subsetting columns. The first argument is mtcars. Actually, let's remind ourselves of what the mtcars dataset looks like. We have a data frame where each row corresponds to a vehicle, and then we have columns for miles per gallon, number of cylinders, displacement, horsepower, et cetera. So let's say we were only interested in
looking at the horsepower column, or we wanted to subset the horsepower column. Then we supply the data frame as the first argument of the select function, and the second argument is the horsepower column, and that returns the data frame with just the horsepower column and the row names. If you want to select more than one column, let's copy and paste this, then you simply supply them as additional arguments. So let's say that we want horsepower, and we also want miles per gallon and the number of gears. If we run Ctrl+Enter, then again we get a data frame, but just with the three columns that we've specified.

We can also do this using the column index. So for example, if we want columns 1 through 7, then we can supply this as an argument, and we get our data frame with columns 1 through to 7. If we were using base R to do this, we would do it something like this, which gives the same result. We'll find out a little bit later on the benefits of using the select function within dplyr rather than base R. Well, one is that it's a little bit more verbose; it's a little bit more plain English exactly what we're doing. And also you can chain dplyr verbs together to perform more complicated operations in one go, and we'll explore that a little bit later on.

So that's how we select and subset individual columns. What if we want to subset and select individual rows? For this we use the filter function. The first argument is again the data frame that we're working with, and the second argument is how we tell filter which rows to return to us. This is usually a logical expression which filter will evaluate for each row to decide whether to return it back to us. So for example, we may want to return just the rows for which the weight variable is greater than 3.5. If we run this, then we find that we get every column in the original data frame
returned to us, but only the rows for which the weight variable is greater than 3.5. If we want to filter by more than one condition, then we just supply these conditions as additional arguments. So mtcars, weight again greater than 3.5, and we only want to return vehicles for which the weight is greater than 3.5 and the carb variable is exactly equal to 4. If we run this, again all of the columns are returned, but only the rows which satisfy these two conditions. So the filter verb is a way of returning only those rows that satisfy certain conditions that we're interested in.

The third verb is arrange, which is extremely useful when you want to put your data frame into some kind of order depending on the variables. So let's say we want to arrange the mtcars dataset by the gear variable. Now our entire data frame has been returned, but all of the rows have been ordered such that the gear variable is listed from low to high. If instead we wanted to order our data frame from high to low gear, then you simply put a minus sign in front of the column that you want to order by. If we run this, this time we find the low gears at the bottom and the high gears at the top.

Now, just like the other verbs, if we want to arrange our data frame by more than one variable, we simply supply these as extra arguments. So let's copy this and paste it again. Let's say we also want to arrange by carb, let's put that in descending order, and then also weight in ascending order, and then we run Ctrl+Enter. This time it's important to note that the data frame is first ordered by the first variable that you supply, so it's been ordered in descending order by gear, from five gears to three gears. Then within each level of gear we've ordered our rows by carb, so within five gears we're going from 8 to 6 to 4 to 2 to 2. And if there are ties, then we find that the weight variable is used to
break those ties, where weight is in ascending order. OK, so you can use the arrange verb to order the rows of your data frame in a meaningful way.

The fourth verb is mutate, which sounds a bit funky, but we basically use this to create new columns from existing columns in our data frame. It might be that we want to transform a variable and save it as a new variable. So the function is mutate, the first argument is the data frame, and the second argument is the name of the new variable that we're going to create (we're going to divide the weight by the number of cylinders), followed by an equals sign and then the definition of the new variable. So we're going to take the existing weight variable and divide it by the cylinder variable. If we run this, we get our entire data frame returned, but now I hope you can see that we've got a new column, a new variable, which is the result of dividing weight by the number of cylinders.

If we want to generate more than one new column at a time, we can do this. So let's just copy and paste this existing command. What's particularly cool is that we can actually define new variables based on the variables that we've created in the same function call. So I'm going to create a new variable called inverse weight cyl, and this is going to be equal to 1 divided by weight cyl. So in this one function call, mutate is going to take the mtcars data frame and create a new variable called weight cyl, which is equal to the weight variable divided by the cylinder variable. Then, in the same function call, it's going to take 1, divide it by this variable that we've just created, and call it inverse weight cyl. So we don't have to do this in two separate function calls. If we run this, we have our entire data frame with our two new variables: weight cyl, and then the inverse of this weight cyl column.

One important thing to mention is that
these variables have not been saved onto our mtcars data frame, so if I call mtcars, these variables are not there. If you call the function just like this on its own, it'll display these variables for you, but they won't be saved. So what you should do is save your mutate call to a new data frame, for example mtcars2. Then if we call mtcars2, we now see our new weight cyl variable that was defined by this function.

Our fifth dplyr verb is summarize. Summarize is useful for generating summary statistics from your data frame that you can define in any way that you like. We start with the summarize function, the data frame is mtcars, and then the next argument is the name of the summary statistic that you're going to produce, so we'll call this min HP, then the equals sign, and then the definition of how this statistic should be calculated. This is similar to the mutate function, where we defined the new variable, the equals sign, and then how the variable is going to be defined. I want this statistic to be the smallest value of the horsepower variable.

Now, when you're exploring your data it's useful to look at lots of different summary statistics to get an idea of how your data are distributed, and we can define these additional statistics by separating them with commas. So let's define a second one called average horsepower, which is equal to the mean of the horsepower variable. Max HP, you can see where this is going, is going to be the maximum HP value. Let's go on to a new line to make it a bit easier. Med HP is going to be equal to the median HP, and then finally IQR HP is equal to the interquartile range of horsepower. Now if we run this whole function, we get the summary statistics that we've asked for, defined across our entire data frame. So we get the minimum value for horsepower, and the average, the maximum, the median, and the interquartile range. We could have gone on defining summary statistics if we wanted to for each individual variable,
but these are the ones that we've defined for now.

Now, this might have seemed quite a lot of work to get summary statistics out of just one variable in our original data frame, the horsepower variable, and you might be asking yourself: why don't we just use the base summary function, which gives us all of these summary statistics and more for every single variable in our data frame? And you'd be right, it's much easier to just type this and get summary statistics for every single one of our variables than to type this just to get summary stats for a single variable. But the summary function gives us summary statistics across our entire data frame, ignoring group membership, because we haven't informed R that we want statistics based on individual groups. For example, you may have subjects belonging to three different treatment groups, and you want these summary statistics within each of those three groups. For that we can combine summarize with the final verb that we're going to talk about from the dplyr package, which is group_by.

The group_by verb does pretty much what it sounds like: it will partition our data frame into rows which belong to particular groups of a grouping variable. So let's take our mtcars data frame, and the second argument is the name of the variable that we want to group by, so we'll group by gear. If we wanted to, we could group by additional variables. For example, if you had a factorial experiment, where people were assigned to placebo versus drug treatment and within each of those you stratified by men and women, you could group by drug treatment and then by gender, so that you would end up with four different partitions of your data. We're just going to group by the gear variable, and we're going to save this as a new object called grouped. And now this is where the power of the summarize verb comes in, because if we copy and paste this summarize
function call from up here, paste it down here, and apply it to our grouped data instead of the whole mtcars data frame, then when we run this we get our summary statistics as we've defined them, so our minimum, average, max, median, and interquartile range, stratified for cars which have three gears, four gears, or five gears. So it's a very nice way of getting summary statistics for certain groups of data.

Now, I mentioned that we can group by more than one variable. So let's create a new object called grouped2. We're going to use the group_by function and apply this to mtcars, and we're going to group by gear and the number of cylinders. Again, if we just copy and paste this summarize function and apply it to our grouped2 data, we get our summary statistics as we've defined them at each combination of gear and cylinders. So this is particularly useful for getting summary stats for combinations of grouping variables, which is a lot more involved to do in base R.

Now those are the six main verbs that I wanted to talk about in dplyr, and I hope that you can see that they're useful for selecting, transforming, and summarizing your data. But actually there's one powerful tool that comes with dplyr which makes these verbs even easier to use, particularly when you're stringing them together, and that is the pipe operator, %>%, which is a percentage sign followed by the greater-than sign, closed off by another percentage sign. So what does this mean? It means take the object on the left-hand side and pass it into the function on the right-hand side as the first argument. Let's see this in action. Let's take the mtcars data frame and pipe this into the group_by function, and we're going to group by gear. What's going to happen is the pipe operator is going to take this object, the mtcars data frame, and pass it to this function
as the first argument. So this is the equivalent of writing this. OK, so instead of putting mtcars as the first argument, we simply pipe it into the first argument position using the pipe operator. You may be thinking that this is a lot simpler than this, and for this case it is, but when you want to string multiple operations together, as you'll see in a second, piping into functions becomes very useful.

So this has the effect of grouping the mtcars data frame by gear. Now let's take that grouped data and pipe it into a summarize function which will report the median weight. What's going to happen now is the mtcars data frame is going to be piped into the group_by function, which will group it by gear, and then this grouped data will be piped into the summarize function, which will produce the summary statistic, median of weight, for each level of gear. If we weren't using pipes, this would be the equivalent of typing this, which is slightly shorter in terms of code, but I hope that the piped version is a little bit easier to read: we start with our data frame, we perform one operation on it, and we pass the result to the next operation. If you read the pipe operator as saying "and then" in your head, I find it a lot easier to keep track of exactly what I'm doing. So: take the mtcars data frame, and then group it by gear, and then summarize it by taking the median of the weight.

So when you're stringing multiple operations together, it can be much clearer and easier to use the pipe operator. And actually, to make it even clearer, the standard practice is to separate these operations onto different lines so that you run through them from top to bottom. For example, let's make a new object called subset, actually let's call it subset cars, because subset is a function. We're going to start with the mtcars data frame, we're going to pipe this, and then we're going to go on to a new line. We're going to select
columns 1 to 11, actually no, we're going to select columns 2 to 11. We're going to pipe these selected columns into filter to pick out just the rows which satisfy the condition that the weight is larger than or equal to 2. If we run this and call subset cars, then we have a data frame with the columns that we selected and only the rows which satisfy the condition that the weight variable is larger than or equal to 2. So the fact that we start with the name of our data, and on separate lines perform separate data selection or data transformation functions, makes it easier for us to read from top to bottom exactly what we're going to end up with.

Let's go a step further. Let's create an object called arranged cars, because we're happy with the data that we've selected in our subset cars data frame, but actually we want to order them in a particular way so it's easier for us to see if there are any meaningful relationships in the data. We're going to start with subset cars, pipe this into a group_by function, and group by gear and cylinder. Then we're going to pipe this into a mutate function to create a new variable, and like before this is going to be the weight divided by cyl. This is going to be piped into an arrange function, where we want to arrange the entire data frame based on this weight cyl variable. If we run this and call arranged cars, then you can see we've got our new variable and that the data frame is arranged such that this variable is in ascending order.

Don't worry about the fact that it looks slightly different from our usual data frame. dplyr likes to turn data frames into what it calls tibbles, which are a slightly different way of representing data, but for the most part they're very similar to data frames in that they have rows of data and columns of variables, so most of the time you can apply the same functions to both of them.

And then finally, let's pipe our arranged cars data frame into a
summarize function. You don't actually have to name the variables, so let's not name this one; we'll just ask for the standard deviation of this new weight cyl variable that we've created. Because our data frame is still grouped, we get the standard deviation of the weight cyl variable at each level of our groupings.

So dplyr has a lot more useful functions in it, but these are the bread and butter that I think most people can use to quite quickly subset, explore, get summary stats for, and transform their data. So play around with it on your data, particularly using the pipe operator, which, if you're performing several steps of data selection or transformation, makes it much easier for you and other people reading your code to keep track of exactly what you're doing, and allows you to express these operations in plain English. I hope that was useful, and I will see you in the next video.
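
The commands typed on screen are not captured in the captions, so the following is only a minimal sketch of the steps described above, assuming dplyr is installed and using the built-in mtcars data. The object and column names (mtcars2, wt_cyl, inv_wt_cyl, hp_stats, subset_cars, arranged_cars and so on) are illustrative guesses and may differ from those used in the video.

```r
library(dplyr)  # assumes a prior install.packages("dplyr")

# select(): keep columns by name or by position
hp_only <- select(mtcars, hp)
three   <- select(mtcars, hp, mpg, gear)
first_7 <- select(mtcars, 1:7)        # base R equivalent: mtcars[, 1:7]

# filter(): keep rows; extra conditions are combined with AND
heavy   <- filter(mtcars, wt > 3.5)
heavy_4 <- filter(mtcars, wt > 3.5, carb == 4)

# arrange(): a minus sign (or desc()) gives descending order;
# later variables only break ties in earlier ones
ordered <- arrange(mtcars, -gear, -carb, wt)

# mutate(): a new column can use one created earlier in the same call.
# The result has to be assigned, otherwise it is only printed.
mtcars2 <- mutate(mtcars, wt_cyl = wt / cyl, inv_wt_cyl = 1 / wt_cyl)

# summarise(): user-defined summary statistics (summarize() is an alias)
hp_stats <- summarise(mtcars,
                      min_hp  = min(hp),
                      mean_hp = mean(hp),
                      max_hp  = max(hp),
                      med_hp  = median(hp),
                      iqr_hp  = IQR(hp))

# group_by() + summarise(): the same statistics per group
grouped   <- group_by(mtcars, gear)
hp_by_grp <- summarise(grouped, min_hp = min(hp), med_hp = median(hp))

# The pipe passes the left-hand object in as the first argument,
# so this is equivalent to summarise(group_by(mtcars, gear), ...)
med_wt <- mtcars %>%
  group_by(gear) %>%
  summarise(med_wt = median(wt))

# A longer pipeline, one operation per line, read as "and then"
subset_cars <- mtcars %>%
  select(2:11) %>%
  filter(wt >= 2)

arranged_cars <- subset_cars %>%
  group_by(gear, cyl) %>%
  mutate(wt_cyl = wt / cyl) %>%
  arrange(wt_cyl)

# Still grouped, so this gives one standard deviation per gear/cyl group
# (NA for groups that contain a single car)
sd_by_group <- arranged_cars %>%
  summarise(sd_wt_cyl = sd(wt_cyl))
```

Printing any of these objects at the console shows the result; the grouped ones print as tibbles rather than plain data frames, as mentioned in the captions.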
Info
Channel: Hefin Rhys
Views: 15,274
Rating: 4.975831 out of 5
Keywords: rstudio, programming, statistics, dplyr, select, filter, arrange, mutate, summarise, group_by
Id: GftIZjRv9eI
Length: 26min 7sec (1567 seconds)
Published: Mon Aug 28 2017