Descriptive statistics and data visualisation. An introduction to statistics and working with data

Captions

Hi, my name is Greg Martin. By the end of this video you're going to know how to describe, summarize and visualize your data. You're going to be able to produce the right tables, plots and graphs for different kinds of variables, and believe me, this is the first step to good statistical analysis. So let's get stuck in.

In a typical data set we've got something like a spreadsheet, and in this spreadsheet we've got columns, and our columns are our variables. In this case we've got five variables. Our rows contain our observations, and we've got four observations here. What do I mean by observations? Let's talk about someone called James. James has got certain characteristics that we're interested in: he's a 27-year-old male, he weighs 75.1 kilograms, he's been categorized as short. Together we call all of this information about James an observation, and we store that information as data under the appropriate column headings, or variable headings. And of course we can add as many observations as we want, and this is what makes up our data set.

Now, the two most common types of data that we work with are categorical and numeric data. Let's take a look at a categorical variable like gender. Each person, or each observation, in our data set can be categorized either as male or female, and you can think of these categories as buckets into which the values from any other variable can be placed and then compared. So we could compare the average weight or the average age of men and women in this group, for example.

Now let's take a look at a slightly different kind of categorical variable. In this case I want to take a look at height, and this is what we call an ordinal categorical variable. In the case of height, again we've got categories or buckets, and of course again we can categorize the observations in our data set, but in this case the order matters. It's short, medium, tall; that order matters, and that's why we call it an ordinal categorical variable. There's a natural order to
the categories.

Now let's talk about the two types of numeric variables. These are numbers; they fall by definition on a number line, but they can fall on that number line in two different ways: they can be discrete or continuous. Let's talk about age. Age is typically given as a discrete variable; in other words, each observation falls definitively on a value on the integer number line: 32, 33, 34, etc. Weight, by contrast, can fall on any number, including fractions, between two integers. So in this example Barra is 98.3 kilograms.

But now we want to talk about how it is that we're going to describe all of these variables. A trick to understanding how your numeric variable values are distributed along the number line is to imagine them actually sitting on the number line. Where there is more than one observation for a particular number, they get stacked upon each other, and as you can see this turns into an interesting shape, which we call a distribution. This is an idea that we use again and again in statistics.

So here we've got a data set, and I've shown the distribution of the numeric variables by representing each observation as a ball. Let's take a look at the most useful ways that we can describe this data. Firstly, the minimum and the maximum values. These also happen to be the parameters for the range, and the range tells us a little bit about the distribution, or the spread, of the data. If we divide all of our observations into four equal groups, each of those groups will of course contain a quarter of all the observations, and we call the two middle quarters the interquartile range. This again is telling us something about how the data is spread out. Now the next three are interesting because they're trying to tell us something about the middle of the data.

Welcome to the course on basic statistics. [Music] So sit back, enjoy, buckle up, let's do this. [Music]

First of all we've got the mean; the mean is the average. Then we've got the median,
and the median is the value that splits all of the data into two equal groups. And then of course we've got the mode, which is the most common value. Now, where the distribution, or the shape, of this data is pretty symmetrical, as in this case, then those three values will be pretty much the same. However, if the distribution of values has got a long tail to one side, the left side in this case, and we say this is left skewed, then certainly the mean, or the average, is disproportionately affected by the outliers and these extreme values. Similarly, if you have a right-skewed distribution, and remember the tail is to the right, so we say that it's right skewed, the mean is way too far to the right, so it's not a good measure of centrality. Clearly, when the distribution is skewed, the median is a more robust measure of centrality.

And finally, the standard deviation tells us roughly the average distance from the mean, in other words, how spread out the data is. So if this is the mean, then one standard deviation on either side of that is the average distance of the observations from the mean. It turns out that if your data is normally distributed, like this is, then about 68 percent of all of the observations will occur within one standard deviation on either side of the mean, and about 95 percent will be within two standard deviations.

So what have we got? We've got the mean, the median and the mode; they're telling us about centrality, about the middle: where's the middle of this data? And then we've got the range, the interquartile range and the standard deviation, and they're telling us about the spread, how spread out this data is.

We've also got terminology that tells us about the shape of the distribution. We've already talked about the idea of the distribution being symmetrical, or skewed to the left or the right. If there's one peak, like in this particular data, we might say that this is unimodal. If there were two peaks, we might say that it's bimodal. And there are other
distributions that we can talk about, but the best way to get a sense of how to describe the shape of the distribution is to visualize the data, so let's get into that for a bit.

A short pause to this video to say a big thank you to Springer Nature. Now, this video has been sponsored by Springer Nature. Springer Nature did not create the video, they had no editorial involvement with the video, they take no responsibility for the video. That's the stuff that I'm contractually obliged to say to you. There are a few things about Springer Nature that I don't have to say but want to say anyway. Springer Nature publish some of the most influential scientific journals in the world today. Something that we're not cognizant of in our day-to-day lives is that the science and technology that underpins the lives that we live at the moment, so much of the lives that we love, the medicines that we take, the fact that things work the way they do, so much of the quality of science that we depend on, is underpinned by high-quality scientific publication. It's only of the quality that it is because of publishing companies like Springer Nature that insist on scientific integrity in terms of what they publish. Springer Nature are a real force for good in this world, and I'm really grateful and really honored that they're prepared to support my work and this YouTube channel. So thank you very much, and now on with the video.

Now, the first way that we can visualize the distribution of a numeric data set is by imagining buckets that represent different intervals along the x-axis, and you can choose how big the buckets are. Let's say in this case we're going from naught to 10, 10 to 20, 20 to 30, etc. Then, by counting up how many observations fall into each of those buckets, we can create what we call a histogram.

Next, let's talk about box plots. Remember, when we were describing our data we divided our data into four quarters, and the middle two quarters were described
as the interquartile range. Well, we can draw a box that represents the interquartile range, with a line in the middle representing the median. So voila, we've got a box plot. With our box plot, the interquartile range is the box itself, and that'll have 50 percent of the data. The median, which is the value that splits all of the data into two equal groups, is represented by the line in the middle of the box. The whiskers extend out as far as 1.5 times the interquartile range beyond each end of the box, and any values outside of that range we call outliers.

Now let's take a look at a categorical variable like height. Each observation, so James, Barra, Sarah, etc., has been categorized as either short, medium height or tall, and we can summarize this categorical variable by counting up the number of observations that land up in each category. So for example, we could say that there have been four people that have been categorized as short, there are two that are medium height, two that are tall, and we know that there are eight altogether. Now, if you want to know what proportion of the total are short, which we call the relative frequency, then of course what you do is you divide the number of short people by the total, and boom shakalaka, you've got 0.5, which is a half. And if you want to know what the percentage is, well, you simply multiply by a hundred, and boom, there you go, we've got the percentage for each category.

If we want to visualize this data, of course we can use a bar chart, where the height of the bar is either the actual number of observations, or the relative frequency, or the percentage. An alternative is to use the pie chart. There are different reasons for using different graphs at different times; I'm not going to get into that today.

Now let's think about two categorical variables: gender and height. The first thing we do is we create a two-way frequency table, two-way because we've got both of our variables involved now, and we
use one in the columns and one in the rows. Again we can calculate the relative frequency or the percentage, which can be represented in brackets next to the value. So previously we had it in a separate column; now we've got it in brackets next to the value itself. You can do that percentage by rows or by columns. In this case I've done it by columns, so I've calculated what percentage each cell is relative to the column total for the column that it's in. So for example, in the male column, three men are short; that's three out of a total of five men altogether, so sixty percent of the men are short, and so on.

I'm going to show you two ways that you can visualize this data. Firstly, you can have a stacked bar chart where the height of each column is determined by the actual number of observations, or you can stack them by percentage, so that each column towers up to 100 percent, making it much easier to visually compare proportions.

If you have two numeric variables, how are you going to visually represent them? Well, one thing you can do is create a scatter plot. A scatter plot is where each point corresponds to the x and the y coordinates of a given observation, or in this case a person. So for example, we've got Sarah; she's 34 years of age and she weighs 63.5 kilograms, so this is her point on the scatter plot. And of course you can add a trend line. Remember, by convention we usually plot the independent variable on the x-axis and the dependent variable on the y-axis. What do I mean by that? Let me just explain that to you, because it's quite important. It's all about the direction of causation. In this case we think that a change in age might affect weight; in other words, as you get older you might gain weight. We don't think that a change in your weight has any causative impact on your age. Weight, by contrast, might be dependent on age, or might be affected by age; there might be some sort of
causative relationship between what your age is and what your weight is. So weight is a dependent variable, and typically, by convention, we put those on the y-axis.

Now things are about to get interesting. How would we plot two numeric variables and one categorical variable? Well, first of all, just imagine plotting your two numeric variables against each other on a scatter plot, in the way that we've described already. Now, for each of the points on the scatter plot, we use the categorical variable to assign them to a different group, in this case maybe a different color. Then of course we can also draw a trend line independently through each of those categories, one for males and one for females, and voila, we can see two graphs superimposed upon one another, and we can see the difference between the genders with respect to the relationship between age and weight.

Now let's talk about two categorical variables and one numeric variable. First of all, let's consider just the numeric variable by itself, in this case weight. How would we plot that? Well, we'd create perhaps a box plot. Now we want to use just one of the categorical variables, let's start with gender, to disaggregate that data and redraw our box plot. What we've got here is the same weight data being used to draw this plot; those observations that have been categorized as female have been used to draw the pink plot. And of course we want to represent more than one categorical variable; we want to include height here. So what do we do? We disaggregate the data once again, by height. So now we've got weight plotted, but the plot has been disaggregated first by gender and then by height. We can see the difference in weight between males and females in tall people, and the difference in weight between men and women for medium-height people, and of course we can see a difference amongst short people, and the difference there seems to be most pronounced.

If you want to learn more about statistical
analysis and other research methods, then go to learnmore365.com. I hope you enjoyed the video. Hit subscribe if you haven't before, hit the bell notification if you want to get alerts to future videos, and also consider becoming a member of the channel and get information about how to get jobs in global health and public health. Have a great day, take
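
The summary measures described in the captions (minimum, maximum, range, interquartile range, mean, median, mode, standard deviation, and the 1.5 times IQR rule for box plot outliers) can be sketched with Python's standard `statistics` module. This is only an illustration under stated assumptions: the `ages` list is made up, because the video does not list its full data set, the helper names `describe` and `outliers` are hypothetical, and the "inclusive" quartile method is just one of several conventions for computing quartiles.

```python
import statistics

# Hypothetical ages; the video does not list its full data set.
ages = [27, 34, 32, 33, 34, 41, 29, 34]


def describe(values):
    """Summarise a numeric variable with the measures named in the video."""
    # Quartile cut points; the "inclusive" method is one of several conventions.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "iqr": iqr,
        "stdev": statistics.stdev(values),  # sample standard deviation
        # Common box plot rule: points beyond these fences are drawn as outliers.
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


def outliers(values):
    """Return the values lying outside the 1.5 x IQR fences."""
    summary = describe(values)
    return [v for v in values
            if v < summary["lower_fence"] or v > summary["upper_fence"]]


if __name__ == "__main__":
    for name, value in describe(ages).items():
        print(f"{name}: {value}")
    # A long right tail pulls the mean above the median.
    skewed = [1, 2, 3, 4, 100]
    print(statistics.mean(skewed), statistics.median(skewed), outliers(skewed))
```

The small `skewed` list is there to show the point made about skewed distributions: one extreme value drags the mean well away from the median, which is why the median is the more robust measure of centrality in that situation.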

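The frequency tables described in the captions can be sketched in the same way. The totals below (four short, two medium and two tall out of eight people, and three of five men being short) come from the video; the remaining cell counts are assumptions chosen to be consistent with those totals, and the function names `frequency_table` and `column_percentages` are hypothetical.

```python
from collections import Counter

# (gender, height) pairs. The totals (4 short, 2 medium, 2 tall; 3 of 5 men
# short) match the video; the other cell counts are assumed for illustration.
people = [
    ("male", "short"), ("male", "short"), ("male", "short"),
    ("male", "medium"), ("male", "tall"),
    ("female", "short"), ("female", "medium"), ("female", "tall"),
]


def frequency_table(categories):
    """Count, relative frequency and percentage for one categorical variable."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {
        category: {
            "count": n,
            "relative": n / total,       # proportion of the total
            "percent": 100 * n / total,  # relative frequency times 100
        }
        for category, n in counts.items()
    }


def column_percentages(pairs):
    """Two-way table: each cell as a percentage of its column total.

    The first element of each pair is the column variable and the second
    is the row variable.
    """
    cell_counts = Counter(pairs)
    column_totals = Counter(column for column, _row in pairs)
    return {
        (column, row): 100 * n / column_totals[column]
        for (column, row), n in cell_counts.items()
    }


heights = frequency_table(height for _gender, height in people)
by_gender = column_percentages(people)

if __name__ == "__main__":
    print(heights["short"])
    print(by_gender[("male", "short")])
```

Computing the percentages by column, as in the video, answers "what share of the men are short?"; swapping the elements of each pair would give percentages by row instead.
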
Info

Channel: Global Health with Greg Martin

Views: 16,161

Rating: 4.9652176 out of 5

Keywords: statistics, data science, plots, graphs, data visualisation, research

Id: txNvZ3Zndak

Length: 14min 24sec (864 seconds)

Published: Thu Apr 01 2021