Clean your data with R. R programming for beginners.

Video Statistics and Information

Captions

Welcome back. Today we're talking about how to clean your data, and let me tell you what I mean by that. When we are looking at data, a lot of people want to jump right in and analyze their data, to do some sort of statistical analysis. That's the wrong thing to do. What you want is a systematic approach. It doesn't have to be exactly like this, but this is a reasonably good approach: you start off with exploring your data (I've done a video on that, so this video is part of a series); next you clean your data, and I'm going to tell you in just a few minutes exactly what I mean by that; then you manipulate, describe and summarize, visualize, and then analyze. That's what you want to go through, and all of it is super duper easy. We're going through cleaning your data today, so don't go away. If you want to learn about R programming, you have come to the right place: on this YouTube channel we're creating R programming videos on everything.

So what do I mean by cleaning your data? There are a few things we're talking about. First of all, make sure that each variable is categorized as the correct type of variable; you might need to change some of them, and I'm going to show you how to do that in just a minute. Next, you might want to select the variables you want to work with, and filter out the rows (the observations) that you don't want to work with, or filter for the rows that you do want. Sometimes you're working with enormous data sets, and that becomes a very important skill. Next, find and deal with missing data; I'm going to show you how to do that in this video. Next, find and deal with duplicates; super duper easy. And then how to recode values, super duper easy as well. So don't go away, stick with me, you're going to love this.

Just so that you know, when I work in R I always use the tidyverse. The tidyverse is a collection of packages that expand the vocabulary you have within R, the number of functions that you can use. It's absolutely fantastic and definitely the way to go. I've got other videos about that, so watch those if you want to. First of all you have to install the package; you only ever do that once, and once it's on your computer it's there. Then, whenever you use R, you say library(tidyverse), which I've already done in this session, and that makes all of that additional vocabulary available to you right there and then.

Let me just say this before we carry on: R has built into it a number of data sets that you can use to practice your analysis, and I'm going to be using those data sets in this lesson. In other words, everything that I'm doing you can do at home, and actually you should, because that's the best way to learn. So try to replicate everything you see on my screen at home, using the same data. If you type data(), with open and close brackets, you can see a whole series of data sets that R has, and you can use any one of them. In this case I'm going to be using the starwars data set, which becomes available once the tidyverse is installed and loaded. Say View(starwars) and voila, there's the starwars data set. It's a lovely one to play with because it's got some missing values; it's got all the things that we might want to look at in this lesson. So we're going to work with starwars.

We said that the first thing we want to talk about is variable types. How do we explore the variable types? How do we know what types of variables we've got? Simply start off with glimpse: glimpse(starwars). Here are all of our variables, so this is a nice little summary of your data
set, and next to the names of the variables we've got the variable types. You'll see these little abbreviations: chr stands for character, int is integer, and dbl is double, which is basically a numeric variable that can include fractions between integers. What we don't have here, interestingly, is a factor. A factor is really a categorical variable. For example, size might be small, bigger and biggest; these are categories into which your data may fall.

Some of the time you get a data set like this where, let's say, gender has been defined as a character variable. These are just words, written as masculine and feminine, and R can manage that reasonably well. But we might want this to be specifically recognized as a categorical variable, as a factor. It's not the case with masculine and feminine, but sometimes you have what's called an ordinal categorical variable, where there's a natural order, as between small, medium and large, and the order matters. The way things are ordered is often important for how your results pop out and how your tables get created; there are all sorts of reasons why the order can matter. So something might be a character variable, and while that might still work in your analysis, it may be in your interest to turn it into a factor. Very often that is so; let's have a look.

Let's go back up to our data. We've said glimpse, we've seen the types of data we've got, and we've recognized that a few are characters that we'd like to be factors. We've already seen from glimpse(starwars) that gender is a character variable, but you could also say class(starwars$gender), and that will tell you what class of variable it is; it says it's character. That's just another
way of doing the same thing.

Now you can do something quite interesting: you can say unique, and in the argument put the variable you're interested in, so unique(starwars$gender). If you push enter, it tells you which values can be found in that variable. In this case there are two kinds of values, masculine and feminine, plus NA, which stands for "not available"; that's missing data.

This next line of code is how we change the variable from a character variable into a factor, and it's very simple. We say starwars$gender, the variable we're interested in, and we assign onto it, so we write over it, with as.factor(starwars$gender). as.factor is a function, and its argument is the thing we're going to make into a factor: the gender variable in the starwars data set. It takes that variable, makes it into a factor, and assigns the output back to starwars$gender. Does that make sense? If we put a 1 on the end of the name on the left, it would make a new variable called starwars$gender1 and assign the result to that, but we want to write over and replace starwars$gender with this new vector that is a factor. Now if I run class(starwars$gender), you can see it's called a factor. Easy peasy lemon squeezy.

We were saying it's sometimes important for a variable to be a factor and not a character variable because we're interested in levels, so let's quickly have a look at that. We can say levels(starwars$gender), and in this case feminine is level one and masculine is
level two. In the case of gender we don't really mind; no one is better than anybody, of course, and there's no up or down. But let's say, for argument's sake, that for our tables or for whatever reason we want this the other way around. How would we get there? Easy peasy lemon squeezy. We say starwars$gender, so again we're going to write over it, and that little arrow means everything to the right of the arrow gets assigned onto what we've written on the left. Then factor, which says we're going to work with a factor. Inside the brackets, starwars$gender is the vector we're interested in, then a comma, then levels. Keep in mind this could all be one long line; I'm just going onto the next line so that it's easier to look at, and R doesn't mind that; RStudio manages it just fine. For levels we've got this concatenation, c(), where c stands for concatenate, and we put in the levels in the order we'd like them: masculine first, feminine second. If I push command-enter it does that, and if I now say levels(starwars$gender), you can see it's swapped them around: masculine and then feminine. So we've changed the order of the levels simply by using the factor function.

Now the next part of cleaning your data: selecting variables. You might have a very large data set. I've got other videos that look into select and filter, but I'm going to do a quick summary here and show you some tips and tricks that I haven't covered in other videos, so pay attention; this is a lot of fun. When we're working in the tidyverse, this is how I do it. I always start with the data frame I'm working with, in this case starwars. Then comes this little pipe operator, percent, greater than, percent (%>%), and by the way, you can get that very easily with shift
control M, which will create it for you. It means "and then": it takes everything from the left-hand side and pipes it into the next line. I usually put everything on the next line, though you could carry on on the same line if you wanted to. Then select. That's a lovely English term; it does what it says on the can. These are the variables we want to select: name, height, and now here's a nice little trick. We can select any variable that ends with the word color, using ends_with("color"), because if we look at all of the variables we have, there are three, eye color, skin color and hair color, that end with the word color. And then I'm just going to pipe that into names() so we can look at what it turns into. Voila, there we go: name, height, hair color, skin color, eye color. If we remove the names() step, we see the data frame that we've selected.

Now, we don't want all of the observations. All we see down here is the first few rows, and it says there are 77 more. We might not want all of them in our analysis, so what do we do? We filter for the observations that we want. Take hair color. We might not want everything within hair color, so our starting point might be: what's in there? What are the possible values of hair color that we might want to filter for? We spoke earlier about the function unique, so if we say unique(starwars$hair_color), it tells us all of the values that appear in that variable: brown, black, blond, et cetera. So let's write this code again over here. We say starwars, and then select with the same selection criteria: name, height, ends
with color. Now we want to filter on hair color, and within hair color we want either blond or brown. That's what this interesting set of characters does: %in%, followed by the concatenation of blond and brown. There's another way we could have done this: we could have said hair color is equal to blond, or hair color is equal to brown, and most of you will know how to do that kind of filter. The reason I'm showing you %in% is that it really helps if there are quite a few values you want to include, because you can make quite a nice long concatenation. Another way would be to make a little object that holds all of the values we want to include, and then say %in% and put that object in there.

Then we've got an "and": height less than 180. This is what I wanted to show you: when you write filters, you've got to understand the difference between ors and ands. Here we're saying blond or brown, which means that when R looks at a particular observation, it will include that observation if it sees either blond or brown within hair color. Because we want to include observations with blond and observations with brown, we feel as if we should be saying "and", but we shouldn't. We should be saying "or", because we're saying: include this observation if its hair color is either blond or brown. If we said it had to be blond and brown, we'd be asking for both criteria to be met simultaneously for the row to be accepted, and that would never happen. So blond or brown is an "or". Then we've got an "and": the height variable must be less than 180. So we're saying the hair color can be blond or brown, and only include the particular observation that we're looking at
if the height is less than 180. Let's have a look at what that translates into. Voila: we've got hair color, and you'll see only blond and brown are included, and in each case the height is less than 180. Take one observation at a time. Luke Skywalker: is the criterion of less than 180 centimeters met? Yes. Is the criterion of being either blond or brown met? Yes. So that entire row, that entire observation, gets included. Does that make sense? Of course it does. Let's keep going.

Now we're going to talk about missing data. Missing data can be tricky, and people often take the easy route out, which is to just exclude all the missing data. That is a mistake. If you exclude all the missing data, you don't know which observations you have excluded that might be important for some other reason. Understanding what to exclude really requires you to understand your data and your variables. I'm not going to get into a lot of detail on missing data, but I'm going to show you a few tricks.

First of all, suppose we wanted the average height of all of our Star Wars characters. We say mean, which is the function, and the object we apply it to is the variable starwars$height. We push enter and we see NA. It doesn't work: R wasn't sure what to do, so it just said "not available". That's because there are missing values in that variable, and R didn't know how to calculate the mean given that they were missing. You can simply add na.rm = TRUE, where na is "not available" and rm is "remove", and it will remove the not-available observations from that vector. Setting it to TRUE makes it true that you're removing not
available values, and voila: if you look at the bottom over here, now we've got a mean. That's one easy little trick. But we want a data set from which we're removing the correct observations at the correct time, so let's take a look at that.

Our starting point is starwars, and we select a set of variables: name, gender, hair color and height. Notice we've got 77 plus 10, so 87 observations. The next thing we could do, and I don't recommend it, I just want to show you so that you understand, is to say starwars, then select those variables, and then na.omit(). na.omit will remove every row that has a missing value anywhere in the data set. Now we've got 63 plus the 10, so 73 observations. But we don't know what we've lost. We don't know which observations have been removed, and might they have been important? So it's not recommended at all unless you know your data and really understand what's happening when you do that.

Let's look at a slightly more nuanced way of doing it. The first thing is to understand where the missingness is: to know where in your data set the values are missing. Here's a little trick. We've got starwars, and then we select the variables we want; super duper simple. The next line of code is a filter. If we didn't have this exclamation mark, we'd be filtering for complete cases, in other words for just the observations that have no missing data. And this little dot here, by the way: we've got a function inside a function. filter is a function, the object inside it is another function, complete.cases, and the object inside that is the dot, which stands for the data set that's being piped in from the top. I know that sounds a bit complicated. So we've got
filter for complete cases of that data set. If we pushed control-enter here, it would do the same thing as na.omit: it omits any observation that has a missing value in any of its variables, and we're left with just our 73 observations. If we put an exclamation mark before complete.cases, it does the exact opposite. That's what the exclamation mark does; it says, do the opposite of this next function. That is interesting, because now it gives us only the observations where, somewhere in the observation, a variable has a missing value. Every single observation here has at least one missing value in one of our variables. Now we've got a sense of where the missingness is, and we can start looking intelligently at why something might be missing and what we mean by missing. Is it missing because that particular observation couldn't be captured? Or has it been erroneously recorded as missing when it's actually something else?

Imagine, for example, that in this particular analysis we wanted to look at gender, height and hair color. Everybody has a height; all of the characters must have some kind of height. So where height is missing, we could assume it's genuinely missing, that we just haven't got that data. Hair color is perhaps not quite the same. We've got NA, so there's missing data here, but look at these characters: C-3PO, these are droids. You'd say, well, droids don't have hair, so it's not that hair color is missing; it's that these characters don't have hair. In the case of height we might want to remove the observations where height is missing, but in the case of hair color we might want to replace the missingness with "none", as in "has no hair". So there might be instances where, depending on the nature of the variable
itself, you want to do something different with it. That's why I'm saying, with missing data, don't just take the sweeping approach of deleting anything that's missing. Some of the time you need to be a little more nuanced.

So let's have a look at what we can do. We said that for height we might decide it's appropriate to delete the observations where height is missing. How would we do that? Let's look at the code: starwars, and then select the variables we want. I'm keeping the filter for not-complete cases in for now, so that we can see what we're doing to the missing values, and at the end you can remove that line of code and you've got your completed data set. Let me show you what I mean. Here's the function drop_na, with respect to height. If I run that, voila: among the cases with missing data, the ones where height was missing are gone. Now if we remove the filter and pipe into View so we can look at it, here's our complete data set, and you'll notice that nothing is missing in the height column, but there are still missing values in the other two. So it has removed only the rows where height was missing.

Now let's look at the second thing I spoke about: we might want to replace NA with something else where we think it's appropriate, and we said that might be the case with hair color. So, starwars, and then select the variables we want, same old story. Filter again for our incomplete cases, so complete.cases with the exclamation mark. And then mutate, which, if you haven't learned about it, is quite lovely. Mutate just means we're going to either create a new variable, or write over and change an existing variable. The first part of the argument says which: are you creating a new variable or writing over an existing one? We've
said hair color, which means it's going to take hair color and replace it with whatever is after the equals sign. After the equals sign we've got the function replace_na: with respect to the variable hair color, it takes any NA and replaces it with the word "none", which is in inverted commas. You could put a number here; you could put anything you wanted. Let's run that and see what happens. Voila: under hair color we've got none, none, none where before it said NA, NA, NA. That could be very important in terms of your analysis.

Let me teach you a little more about the mutate function. If we didn't want to write over hair color but wanted to call the result hair color 2 instead, and ran that, you can see it has left hair color with its missing values and created a new variable called hair color 2 with none in it. The right-hand side of the code is still the same: it still says, use the original hair color variable and put the word none wherever there is an NA. But the result gets assigned to a new variable called hair color 2.
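The missing-data steps described above can be sketched roughly as follows. This is a minimal sketch, not the exact code on screen in the video: the object names sw, incomplete and cleaned are made up for illustration, the column name hair_color2 is a guess at how "hair color 2" was spelled, and dplyr and tidyr are loaded individually here where the video loads the whole tidyverse.

```r
library(dplyr)  # starwars data, select(), filter(), mutate(), %>%
library(tidyr)  # drop_na(), replace_na()

# mean() gives NA when the vector contains missing values,
# unless na.rm = TRUE removes them first
mean_raw   <- mean(starwars$height)
mean_clean <- mean(starwars$height, na.rm = TRUE)

# The four variables used in the walkthrough (sw is a hypothetical name)
sw <- starwars %>%
  select(name, gender, hair_color, height)

# Rows with at least one missing value: shows where the missingness is
incomplete <- sw %>%
  filter(!complete.cases(.))

# Drop rows only where height is missing, then label missing hair colour
# in a new column so the original hair_color is left untouched
cleaned <- sw %>%
  drop_na(height) %>%
  mutate(hair_color2 = replace_na(hair_color, "none"))
```

Writing hair_color = replace_na(hair_color, "none") inside mutate() instead would overwrite the existing column, which is the first variant the video demonstrates.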
All right, most of the time it's fine just to replace the existing variable with the new one. Not bad at all.

Let's now talk about duplicates. A duplicate is when two rows are exactly the same. I can't show you duplicates in the starwars data set because there aren't any, so to demonstrate how to deal with them I'm quickly going to make a data frame. I create a variable called names and assign to it a concatenation of four data points. We do the same to create ages, and then we combine them into a data frame called friends: we make an object called friends and assign to it a data frame made up of name and age. Voila. If I click on friends you can see a little data frame, but Peter, 22 appears twice; that's a duplicate, and we might not want that.

So first of all, identify the duplicates. There's a nice little function called duplicated. If I just type duplicated(friends), what gets produced at the bottom here is a logical vector: FALSE, FALSE, FALSE, TRUE. In other words, the fourth observation is a duplicate of something else. Is the first observation a duplicate? No, FALSE. Second? No, FALSE. Third? No, FALSE. Fourth? TRUE, it is a duplicate. That's basically how a logical vector works.

Now you can take a logical vector and put it inside the system we have for subsetting. If you haven't seen this before, I'll quickly show you how it works. Say we've got a data frame called friends, and you put square brackets after it with a comma in the middle. Anything you put before the comma tells R which rows you're interested in, and anything after the comma tells R which columns you're interested in. We haven't needed to do that until now in the analysis because we've
been working with the tidyverse, which has that lovely vocabulary of select and filter. This is the base R way, the old way of selecting and filtering. Before the comma go the rows we want, so if we put duplicated(friends) before the comma, you'll see it has identified row 4, Peter, 22; that's the duplicate observation. If we put an exclamation mark in front of duplicated, again telling it which rows we want, it gives us a little data frame with the names and ages of the observations that are not duplicates. So we've got our clean little data frame, Peter, John and Andrew, and we've removed the duplicate. That's one way of doing it. The tidyverse way is much easier: we say friends, and then distinct(), and voila, we get the exact same answer. The nice thing about doing it this way (and I would often put distinct on the next line) is that you can continue with your pipe operators and put in your next line of code, or whatever it is you want to do.

Now we're nearly there; hang in with me. Recoding variables is the last part of cleaning. Let's have a look at this line of code: starwars, select name and gender. We might want to do an analysis that needs masculine and feminine to be coded as 1 and 2, for example. How would we do that? Super duper easy. We start with starwars, and then select name and gender, the two variables we want, and then mutate. With mutate we're going to change something: create a new variable or write over an existing one. What are we going to write over? In this case, gender. And what do we use to write over it? We've got recode: we take our gender variable, and wherever we see the word masculine we replace it with 1, and
wherever we see feminine we replace it with 2. Let's see what happens. Voila: we've got ones and twos instead of masculine and feminine. And again, if we named this gender coded, for example, it would create a new variable called gender coded and leave the old one in there. Then of course we can carry on if we want to, and look at the result like that. So this is all of our gender, and we've still got some missing values; we haven't done anything with missing values at this point.

So that is cleaning your data. I hope you found it useful. Subscribe to this YouTube channel if you haven't already. I've also got teaching videos on statistics and R programming at learnmore365.com, so go and have a look at that. Otherwise have a great day, don't do drugs, always do your best, don't ever change, and we'll see you here again soon. Stay and watch another video. Take care.
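The last two steps of the walkthrough, removing duplicates and recoding values, might look something like this. It is a sketch under stated assumptions: the captions only confirm that "Peter, 22" appears twice and that the other friends are John and Andrew, so the other ages are invented, and the column names name and age, the object names deduped_base, deduped and coded, and the new column gender_coded are illustrative spellings.

```r
library(dplyr)  # starwars data, distinct(), select(), mutate(), recode()

# Small example data frame; the ages 33 and 44 are made up
name <- c("Peter", "John", "Andrew", "Peter")
age  <- c(22, 33, 44, 22)
friends <- data.frame(name, age)

# Logical vector: TRUE marks a row that repeats an earlier row
is_dup <- duplicated(friends)

# Base R: rows go before the comma, columns after it
deduped_base <- friends[!is_dup, ]

# Tidyverse equivalent, which can be followed by further piped steps
deduped <- friends %>%
  distinct()

# Recode masculine/feminine as 1/2 in a new column, keeping the original;
# rows where gender is NA stay NA
coded <- starwars %>%
  select(name, gender) %>%
  mutate(gender_coded = recode(gender, "masculine" = 1, "feminine" = 2))
```

Recent dplyr releases mark recode() as superseded, so it still works but newer code may prefer a different function for the same job.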

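The earlier steps in the captions (checking variable types, turning gender into a factor with a chosen level order, then selecting and filtering) can be sketched in the same way. Assumptions are labeled in the comments: the video overwrites starwars$gender directly, whereas this sketch works on a copy called sw, and the result name short_fair is made up. The level names rely on a dplyr version whose starwars data codes gender as masculine and feminine, as in the video.

```r
library(dplyr)  # starwars data, glimpse(), select(), filter(), %>%

sw <- starwars            # hypothetical working copy

glimpse(sw)               # variable names alongside their types
class(sw$gender)          # class of a single variable
unique(sw$gender)         # distinct values, including NA

# Character to factor; default level order is alphabetical
sw$gender <- as.factor(sw$gender)
default_levels <- levels(sw$gender)

# Set the level order explicitly
sw$gender <- factor(sw$gender, levels = c("masculine", "feminine"))

# Keep name, height and every column ending in "color", then keep rows
# whose hair colour is blond OR brown AND whose height is under 180
short_fair <- sw %>%
  select(name, height, ends_with("color")) %>%
  filter(hair_color %in% c("blond", "brown") & height < 180)
```

Note that the data set spells the value "blond", so filtering on "blonde" would match nothing, and that filter() also drops rows where the height comparison is NA.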
Info

Channel: R Programming 101

Views: 1,019

Id: sV5lwAJ7vnQ

Length: 27min 31sec (1651 seconds)

Published: Wed Dec 15 2021