Explore your data using R programming

Video Statistics and Information


Captions

Hello, all you R programming enthusiasts. My name is Greg Martin. Welcome back to R Programming 101. Today we're going to be talking about how to explore your data. Before you do anything with your data, you need to understand the data. You need to understand the parameters and dimensions of your data: how big that data set is, that data frame. You need to understand how many variables you've got, and you need to understand the characteristics of those variables before you do anything else. Super duper important and super duper easy. So don't go away; let's do this together. If you want to learn about R programming, then you have come to the right place. On this YouTube channel we're creating R programming videos on everything. [Music]

Now, the first thing I'm going to say is this: I use the functions and the packages within the tidyverse. You can watch one of my other videos about why the tidyverse is so important, but I really think that you should get into it if you aren't already. You install the tidyverse once and it's on your computer for good. Then every time you use R, you say library(tidyverse), and it calls up all of the functions and capabilities and increased vocabulary that come with the packages in the tidyverse. I'll point out some of them as we go along.

Next, importantly, R comes with a whole lot of built-in data sets that you can use to practice your analysis. This is very important. So if I press Command-Enter over data(), with open and close brackets, these are all the data sets that exist within R. Now, importantly, some of the data sets, like these ones at the bottom, the ones coming with ggplot, are additional data sets that come once you've installed the tidyverse. We're going to be using one of those additional data sets in the analysis today, called starwars. The nice thing about starwars is that it's a little bit messy: there's missing data, there are all sorts of problems with it, so
it's a lovely example of how to explore and clean our data.

Now, with anything in R, you can always put a question mark and then either the function or, in this case, the data set. If you press Command-Enter, then down at the bottom right, in the Help pane, you're going to see some information about that data frame. We can see that it tells us about the various variables that we've got here. I'm not going to get too much into that at the moment, but that's a useful thing to remember.

The first function we're going to learn about today is dim, and that's dimension. If I press Command-Enter there, it's telling us two things: that there are 87 rows, or observations, and that there are 13 variables in this particular data frame. Okay, useful. That gives us a first glimpse as to what's happening here.

Next, a very common function that's used to understand the structure of a data frame is str, which stands for structure: str(starwars). Now, here we run into a problem, and I'll illustrate that right now. Command-Enter. Look at all this. Oh, this is very messy. We've got all of these long spaces here, and if you scroll all the way back up to the very top, we start seeing what we'd expect in a structure: your variable names, the type of variable in the next column over, and then the first few instances of that variable to the right of that. The reason we've got all of this weirdness down here is that these are actually lists, and I'll point out what that means in just a minute. For some variables, each of the observations within that variable isn't one little data point but is in itself a list, and it could be anything: it could be like a variable in itself, it could be an entire data frame; there's a whole lot of things. So it gets a little bit messy. I find the str function gets a bit messy, but if you use the glimpse function, it's the same kind of thing. But if we say glimpse, and glimpse
comes with the tidyverse, by the way, and we press Command-Enter, now it's much neater. It starts off by saying observations: 87, so we got this information from our dim function. Variables: 13. Then it lists the variables quite nicely and tells you what kind of variable each is in the next column. By the way, integer is whole numbers, one, two, three, four, five; it doesn't have anything in between. Double over there, dbl, is effectively synonymous with numeric, and that could be any fraction in between integers: 72.4, et cetera. Here we've got character, chr. We don't have any factors in here. A factor is really just a categorical variable, but R manages character variables, for the most part, in the same way that it would factors. The difference is that factors have levels, which matters especially for an ordinal factor: small, bigger, biggest. There's an actual order to that, there are certain levels to it, and that can be important in certain analyses. You'll notice in this data set there aren't any of those.

Now, we've got three variables that are lists. In this data set we've got all the characters in the Star Wars universe, so we've got Luke Skywalker, and for Luke Skywalker the films variable is going to list all of the movies that he was in. That'll make a little bit more sense in a second when we have a look at the data set.

The next thing you can put in here, I haven't got it there, is View, with a capital V: View(starwars). That's going to bring up the data set here so we can actually see it, and it's quite nice and neat in this particular view. You can scroll to the right, and you can see here, for Luke Skywalker, in that observation or row, all of his characteristics: height, mass, hair color, skin color. These are all the variables. This, by the way, is what we call a neat and tidy format, where your columns are your variables and your rows are your observations. So Luke Skywalker, height 172, and I think this
is in centimeters, et cetera. Eye color blue. For each of these parameters, such as gender male, you can see it's just one data point: the gender of Luke Skywalker is equal to male. But as soon as we get to films, he's in more than one film, and that's why this is a list, and that's why it was a little bit messy when we did str. Same with the vehicles that he uses, starships, et cetera. Kind of a nice data set. You'll notice that there are a couple of NAs, which stands for not available; that's missing data, and we're going to talk about that in just a second. So that's our starwars data set.

The next thing we can do, which I haven't got yet, is head(starwars), and that's going to give you the first six rows if you just want to have a quick look. You can do the same thing with tail(starwars), which gives the last six rows. That can be useful at times, especially if you're dealing with a very big data set and you want to know what's happening at either end of it.

Now, usually in R, and it's actually good practice, if you want to look at a particular variable you would say starwars, then a dollar sign, then name, and that's going to give you that particular vector. If you say attach(starwars), then when you're using the starwars data set in other functions you don't have to be saying starwars the whole time; you can simply put in the name of the variable. It's a little bit lazy and it's probably not best practice, but for the sake of this tutorial I'm just going to do that; it makes things look a bit neater. So I'm going to say attach(starwars). So starwars has been attached.

Now, the next function that I want to teach you is the names function. If you put names(starwars), it's going to give you the names of the variables. That's a useful thing to do right off the bat: know what variables you've got, know what you've got in there. Also, look at the names. It becomes important, because you can see that what's useful
here is that they're all just one word, as in skin_color, with an underscore. That's useful at times, just to know that there are no gaps between letters within the variable names. It's not the end of the world if there are gaps, but it's useful to know, because as you write your code that can be important. The other nice thing about being able to say names(starwars) is that whenever you make reference to one of your variables in your code, what I do, which is a good idea, is type in names and then just cut and paste the name of the variable into your code instead of typing it. You don't always have to do that, and if it's simple, that's fine. But the point is, when your code goes wrong, when things fall apart, very often it's just a typo: you've typed in the name of a variable incorrectly. You can circumvent that problem by cutting and pasting the names of the variables. In this case these are all nice and simple and short, but sometimes variable names can get a little bit messy, and it's useful to be able to call up the names of your variables. And sometimes, when you're thinking about your analysis, you want to remember what variables are there. What's in there? I can't remember. names, data frame, bam: that's the list of the variables that you've got. Excellent.

The next thing that we've got is length(starwars). We've already got the dimensions, we've said that there are 13 variables, and length is just telling us that again, incidentally.

Then let's start looking at specific variables. I want to look at hair color. There it is, hair_color is one of our variables, and if I say class(hair_color), it's going to tell us that this variable is a character variable. Okay, that's interesting. You might have expected it to be a factor, and when we do our data cleaning we might want to change it to a factor; it depends on what you're going to be doing with the analysis, but
the point is you've got that option. At the moment it's just seeing this as a string of characters; there are no levels to it and nothing more has been done with it. So it's quite simple. That is class. Inside class you can stick any of the variable names and it's going to give you that. We also saw what the class of this variable was earlier when we did glimpse; it told us it was a character. But just so that you know, you can do that there. Inside class you can also put the name of the data set itself. You could have put starwars in here, and it would say that this is a data frame, keeping in mind that not everything in R is a data frame. There are other types of data, and I'm not going to get into that today.

length. Now remember, when we said length and put in starwars, it said that there were 13 variables. When we say length and put in a variable name, Command-Enter, it's going to tell us the number of observations, or the number of rows. So just remember that.

Now, this is lovely: unique. unique(hair_color) is going to tell us all of the unique values that sit within that variable. If we go into starwars here and go to hair_color, you can see it's got blond, none, brown, brown, grey, brown, and so on. If we want to know what the possible values within this variable are, we put in unique, and we can see that this is the collection of possible values that can sit there.

Now, the reason this is important to look at is, firstly, there's NA, so we know that there are some missing values. Very useful to know, and we're going to talk a little bit about missing values in just a second. And this is where it becomes important to understand your data, understand how it was collected, and understand qualitatively what this variable is, because we've got something that says none here and something that says unknown there. The temptation would be to say, well, missing
and none and unknown, it's all kind of the same thing, and we're just going to delete all of those rows and be done with it. Depending on the kind of analysis you're wanting to do, that might be the right thing to do. You might want to say, well, we just want to analyze hair color where the character actually has hair and we know what the hair color is; otherwise we're not interested, and we discard everything else. Fine. But keep in mind that these three things, NA and none and unknown, mean different things, and that might be important in terms of the way you analyze your data. NA means the data is missing. It wasn't collected; we don't know anything about it. That character may have hair, and that hair may have a color, but it's just missing; we haven't got that information. None might mean, well, I didn't collect this data, but you'd be interested in whether none means there is hair but it has no color, or there is no hair. And unknown means we don't know. It's not that the data wasn't collected; maybe it is not possible to know the color of this hair because the character always wears a hat. We're not saying it doesn't have hair, and we're not saying we didn't collect data on that data point. We're saying that, with respect to hair color, we don't know what it is. Those three kinds of observations mean different things, and it may be important in the way you analyze your data. So look at and understand the different parameters in your data.

Now, this next line of code up here, starting with View and sort, is all quite long and might be difficult to understand, so I'm going to build that line of code up one step at a time so you can understand how I got there. Let's start with the middle part. If we said table(hair_color), R would produce, and because there are a lot of values it doesn't look much like a table, a count against each of these possible values of how many observations in the data set have that value. So we can see that brown has got
18; black, there are 13 characters with black hair; there are four characters with white hair. This doesn't look like a table just because it's all squashed in there, but literally this would be a little table.

Now let's say we wanted to sort that table from biggest to smallest. Firstly, if we wanted to sort it, we could type in sort, with open and close brackets. Let me show you how this works, just in case you're not familiar with it. We want to sort something, so sort is the function, and inside the brackets you put the argument, which is what it is that you're trying to sort. Well, we're trying to sort that table, table(hair_color), and now it has sorted it from smallest to largest. We might want to add an additional argument, and I'm just going to cut and paste the correct spelling: decreasing = TRUE. Now you can see it has sorted it starting at the largest number and going down to the smallest numbers.

The final step is that you can say View, with a capital V, and all of that becomes the argument within View: what is it that we're viewing? The reason that's nice is that it pops it into this viewer, which I find much easier to look at. When you're exploring your data you want to get a mental sense of what's going on, and this is easier to look at. We can see, okay, we've got none, brown, and so on.

Now, if we remove View and instead put barplot around that exact same code, I'll show you what I mean: if I changed the word View and typed in barplot, it would draw a bar plot in that same sorted order. You can't see all of the labels, just because this is squished in a little bit, but you can see how that can be useful just to get a sense of what's going on. Remember, we're not doing any analysis now; we're trying to get a mental picture of what's going on with our data. Okay, that is looking at, in this case, a
categorical, well, in this case the variable is a character variable, but everything I've done for this character variable you would also do with a factor, or what we call a categorical variable. So if this weren't stored as a character variable, if it were a factor, everything that I've done here would still apply.

Oh, hang on, hold the phone. Before we do a numeric variable: what I did here with View, I want to quickly show you how to get there using tidyverse notation, because I think working with the tidyverse, in this case dplyr and the pipe operator, is much more intuitive. You see up here it's hair_color wrapped in table, wrapped in sort, wrapped in View. This is the way coding was always done using base R; originally, this is how R coding was done. It's important to understand that, because when you work in R you have to collaborate with other people, you've got to be able to read different kinds of code, and you need to understand how that wrapping process works. But we've all moved on. We're now working in the tidyverse, and we've got what's called the pipe operator, which means we can say: let's start with starwars and pipe it into a function called select, so we select hair_color. Now it's just selecting the hair_color variable. I'll show you as I go along: let's just run that bit of code there, and it's got starwars and it's selected just hair_color. Then the pipe operator again, which basically means 'and then', and we want to count. It runs that code, and now it's created an additional column, n, which is the number: it's created a count for each of those possible values of how many times they appeared in the data frame. Then we can arrange them, in descending order. If you just put arrange, it would do it in ascending order, and if you want it to be descending you've got to actually say so. And then, remember, the pipe
operator literally just means 'and then', the next thing you do is View. Can you see how this is much easier to read and understand and change and work with than what's up there? And ultimately what you get is the same thing. The difference is that the NA is in there, whereas when we did it up here the NA did not appear. If you wanted to remove the NA you could, and we'll get into that when we start talking about cleaning data.

Let's quickly talk about missing values as well, because part of exploring the data, part of understanding the data, is saying what is missing. It's tremendously important. You need to be able to look at the data and understand not only what's missing but, if you can figure it out, why it's missing. It matters. The reason is this: if the data is randomly missing, in other words there's no systematic bias built into the way the data was collected and it's just a random fact of the matter that you don't have all of the data points, that's one thing. If the missing data is specifically connected to some sort of systematic bias, that's very important, especially when you start analyzing your data, and it affects the extent to which you want to remove the missing values or keep them in there. We're going to talk about that when we talk about analysis, but certainly you want to understand what's missing. There are very sophisticated ways of visualizing missing data, and I'm not going to get into that too much in this video; I'll make another video that's specifically about missing values. But here is a very quick and easy way to understand what's missing with respect to a particular variable. And by the way, what I'm showing you here, you really need to do for each variable: create a little sense, for each variable, of what's going on before you start analyzing your data. It's worth it; it'll save you time. Again we've got
View again; we'll come back and stick the View in right at the end. Well, why don't I build this up again so you can see how I got there. If we take the starwars data set, remember that if you put in square brackets with a comma in the middle, what goes before the comma is the rows that you want to select, and what goes after the comma is the variables, or columns, that you want to select. If you leave either of these blank, it'll just select all of the rows or all of the columns. If we put some argument in here, between the square bracket and the comma, that will tell R to select certain rows. Does that make sense? And we can use this text right here: is.na(hair_color). Where this is true, select that row. If I just put in is.na(hair_color), R produces a vector which is literally FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, and each of these TRUEs and FALSEs corresponds to a row, or observation, of the data. We're saying to R: look at this vector of TRUEs and FALSEs and extract just the rows where it is TRUE. If we leave this part blank, it's going to use all of the variables; all of the columns will be included. So if I press Enter now, you'll see it's selected all of the variables and just five rows. In other words, these are the five rows where it is true that hair color is missing. And if in front of that I typed in View and wrapped the whole thing in View, we would see it up here, which is easier to look at. These are the five rows of data where hair color is missing.

And you can ask: is there anything peculiar about the rows where hair color is missing? We can look at the other variables. Are they all associated with the same skin color? Well, no, not really. Do they all have the same eye color? No, not really. Were they all born in the same year? We don't see a pattern. You're
looking for a pattern. You might say, hang on, hold the phone, a lot of them are droids. Can you see how, once you've extracted the rows where there is a missing value, you can start looking at whether that missingness is associated with something else? There are more sophisticated ways to look for those associations, but at this point I'm just saying eyeball it, get a sense of it, get an understanding of what's happening with your data, inasmuch as you can. That intuitive understanding is really important.

Okay, that's missing data. So far we've been dealing with a categorical variable. What if we're dealing with a numeric variable? Here we've got height, and we want to look at that. Again, we can say class(height): it's an integer. length: it's going to say that there are 87 observations. Fair enough. Now, what's quite nice is that you can say summary(height), and it's going to give you the minimum and the maximum, which together give the range, and the quartiles: 25 percent of the data sits between the minimum and the first quartile, another 25 percent between the first quartile and the median (the median is the middle value), another 25 percent between the median and the third quartile, and another 25 percent above that. The mean is, of course, the average. So it gives you all of this. You could use any one of these as an actual function, so you could put min(height), with na.rm = TRUE because there are missing values, and it would give you that value, but just doing summary is nice; it pops it all out. It also tells you how many missing values there were, so that's quite useful, a nice quick way of getting it.

But this doesn't tell you about the shape of the data. You get a sense of it, because the median and the mean aren't exactly the same but they're close enough. What we might want to do is a box plot. When you do a box plot, the big thick line is the median, the central value that divides the data in half: half of the
data is above it, half of the data is beneath it. The box itself is the interquartile range; fifty percent of the data sits within there. So half of the data sits in this narrow spectrum, you can see, and then we've got quite a wide range; there are a lot of outliers. The whiskers extend up to 1.5 times the interquartile range beyond the box: this is the interquartile range, and 1.5 times the interquartile range gives us the limit for the whiskers. Everything outside of the whiskers we refer to as outliers. So that's a box plot, useful for getting a sense of the shape of the data.

But we might want to complement that with a histogram. A histogram takes all of the values, puts them into bins, and counts up the number of observations in each bin, so we can also get a sense of the shape of our variable. Because this is a numeric variable, we're interested in whether or not it is normally distributed. We'll talk about what we mean by normal in other videos, but the point is that for a lot of statistical analysis, one of the conditions of applying a particular statistical function to a numeric variable is that the distribution is normal. If that's not the case, there are ways around it; there are other things you can do, such as non-parametric tests. But this video is about how to explore your data and how to get that sense of what's inside your data set.

Look, I haven't covered absolutely everything, but I hope that this was at least a good start. The two major kinds of variables that you're going to deal with are categorical variables and numeric variables, and we've covered more or less how to look at each of these and get a sense of them. That's definitely the first step in your analysis. The next thing that we're going to talk about, in another video, is how to clean the data. That's perhaps more extensive; it might be more than one video. Anyway, listen, I hope you found this useful. Stay and
watch another video. Subscribe to this channel if you haven't already, click on the like button, and when you subscribe, click on the bell notification if you want notifications on future videos. Great to see you here. Don't do drugs, always do your best, don't ever change. Speak to you soon, take care, bye. [Music]
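
The first half of the walkthrough (dimensions, structure, names, and a first look at one variable) can be sketched roughly as follows. This is a minimal sketch, not the code shown on screen: it assumes the dplyr package is installed, it uses starwars$hair_color instead of attach(starwars), and object names such as dims, n_vars and hair_values are illustrative. The column count is deliberately not hard-coded, since it may differ between dplyr versions.

```r
library(dplyr)  # provides the starwars data set and glimpse()

# Overall shape of the data frame
dims   <- dim(starwars)      # c(number of rows, number of columns)
n_rows <- nrow(starwars)
n_vars <- length(starwars)   # length() of a data frame is its column count

# Structure and a quick look at the contents
glimpse(starwars)            # compact listing of variables and their types
head(starwars)               # first six rows
tail(starwars)               # last six rows
names(starwars)              # variable names, handy for copy and paste

# One variable at a time, using $ rather than attach()
hair_class  <- class(starwars$hair_color)   # type of the variable
n_obs       <- length(starwars$hair_color)  # length() of a column is the row count
hair_values <- unique(starwars$hair_color)  # distinct values, including NA
print(hair_values)
```

View(starwars) is left out because it only works in an interactive session such as RStudio.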

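The second half (counting a categorical variable in base R and with the pipe, pulling out rows with missing values, and summarising a numeric variable) might look something like the sketch below. It assumes dplyr is installed; the object names hair_counts, hair_counts_tidy and missing_hair are illustrative and do not come from the video.

```r
library(dplyr)  # provides starwars, select(), count(), arrange() and %>%

# Base R: nested calls, read from the inside out. table() drops NA by default.
hair_counts <- sort(table(starwars$hair_color), decreasing = TRUE)
barplot(hair_counts)

# dplyr: the same count as a pipeline, read top to bottom. count() keeps NA.
hair_counts_tidy <- starwars %>%
  select(hair_color) %>%
  count(hair_color) %>%
  arrange(desc(n))
print(hair_counts_tidy)

# Rows where hair_color is missing; the blank after the comma keeps all columns
missing_hair <- starwars[is.na(starwars$hair_color), ]
print(missing_hair)

# Numeric variable: summary statistics, then two views of the shape
summary(starwars$height)            # includes a count of NA values
min(starwars$height, na.rm = TRUE)  # na.rm = TRUE is needed when values are missing
boxplot(starwars$height)
hist(starwars$height)
```

Wrapping hair_counts, hair_counts_tidy or missing_hair in View() in an interactive session gives the spreadsheet-style display described in the captions.
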
Info

Channel: R Programming 101

Views: 3,348

Id: 3iz-2iM4RFE

Length: 25min 38sec (1538 seconds)

Published: Fri Dec 03 2021