R programming for beginners – statistic with R (t-test and linear regression) and dplyr and ggplot

Video Statistics and Information

Captions
In this video I'm going to teach you how to do statistical analysis using R. I'm going to show you how to do a t-test, I'm going to show you how to do linear regression, and I'm also going to show you how to interpret the results of those tables. And hang on, hold the phone, there's actually more: I'm also going to show you how to use dplyr, a package you can install into R that lets you do very sophisticated data manipulation, and I'm going to show you how to use ggplot2, which is for data visualization. So there is a lot to get through. Sit back, enjoy, buckle up, let's do this.

Welcome back to this Global Health YouTube channel. If this is your first time here, I'd love to have you subscribe. Now, just briefly, what is R? R is a programming language, and we use it to do statistical and quantitative analysis. Now I know what you're thinking: you're hearing the words "programming language" and that sounds extremely scary and intimidating. It's not. It's actually very easy and simple to use. Follow the step-by-step guide that I'm going to give you and I promise you, you won't go wrong. I'm actually going to be doing this analysis using a data set that you can download and install onto your computer, so you can practice along and do exactly what I do, at home, on your computer. So why is the whole world moving to R? The big thing is it's free. Completely free, absolutely free. Are there any hidden costs? Absolutely not. It's free. You can download it and use it today.

Before I carry on, just a quick thank you to the University of Edinburgh for sponsoring this video. I couldn't do it without your support. The University of Edinburgh offer a number of fantastic online distance learning masters programs, including a masters in public health and courses in data science, technology and innovation. I've had a look at their curriculum and, believe me, it's a fantastic place to learn. Highly, highly, highly recommended. So if you want to find out more, just click on the link in the description below. Right,
let's get stuck in. So this is RStudio. At the top on the left is where we're going to write our code. Beneath that is the console, where you're going to see the outputs. At the top on the right is the environment, where we'll see any objects that we create, it's over here, and plots that we draw are going to pop up at the bottom on the right, right over here.

To install the gapminder data that we're going to be looking at, type the command install.packages and then gapminder, in inverted commas, and that's going to download the data onto your computer. You only have to install these packages once, but whenever you want to use them you've got to call them, and to do that you use the command library. So type library, gapminder, no inverted commas this time. Now, you can but you don't have to, type the command data and put in gapminder, and this is going to make it visible as an object in the environment on the right. If we click on that object, we can see the data set pop up on the screen. So we've got country, continent, year, life expectancy, population and GDP per capita.

Now, the way R works is pretty straightforward: what you do is you apply functions to objects. So we can apply the function summary to the object gapminder, and R will provide us with a summary of each of the variables in the data frame. In this case we've got the minimum, the maximum, the median, the mean and the first and third quartiles. And of course we can apply functions to subsets of our data, or just to one variable. So we can, for example, type in the name of the data set and then a dollar sign followed by the variable that we're interested in. In this case we'll have a look at the mean of the GDP per capita, and we can see the output down over here. Oh, and here's a nice little trick: we can create new objects by assigning them with this little arrow sign. In this case we're going to assign this line of code to the letter x, and now we can see a new object in our environment, and if we call x we can see that it
produces the mean of GDP per capita, as expected. If you're a little bit lazy like I am, and you don't want to be typing in the name of the data frame every time you want to refer to a variable, you can simply use the attach function. So type the command attach, then the data frame. After that you can just simply refer to the variable without having to use the name of the data frame and the dollar sign. So for example we might want the median of the population, and there is the median value for the population.

We can also use some graphics functions, like the histogram, for life expectancy or for population. Now that looks a little right-skewed, so we might want to do a log transformation of that data, and voilà. I'm not going to get into why we do log transformations in this video; I'll do another video sometime and talk about that. We could also do a box plot and look at the distribution of our life expectancy data disaggregated by continent. Let's zoom in on that. And to look at the relationship between two numeric variables we might want to do a scatter plot, so for example life expectancy as the dependent variable on the y-axis and GDP per capita as the independent variable on the x-axis. Now that graph isn't particularly linear, so we might want to do a quick log transformation on that, and voilà. Now, we're going to be doing some much more sophisticated data visualization using ggplot2 in just a few minutes, so stay tuned.

Right, now we've got a data set, and we might want to start by doing some data manipulation, and to do that we use a package called dplyr. So let's imagine we've got this data set: we've got eye color, we've got height, weight and age, and we might want to know what the average BMI, body mass index, is for people that have blue eyes. Now, once we've installed the dplyr package, and we do that exactly the way we did with install.packages earlier, we've got access to a whole lot of new and interesting vocabulary we can use in R, and we've also got access to something
called the pipe operator, and let me show you how that works. We start off by typing in the name of the data set that we're looking at, and then we put in this pipe operator. It looks a little bit like a pipe because it's a percentage sign, greater than, percentage sign (%>%), and what the pipe operator does is it takes whatever is to the left of the pipe operator and pipes it into the next line of code. Let me show you what I mean, just watch. So firstly, we might not want to use all of the variables in our data set; we want to select the variables of interest, in this particular case eye color, weight and height. Now, think of the pipe operator as the term "and then". So we've done select, and then we want to filter for rows that match some criteria, in this case people that have blue eyes, and then let's use mutate to create a new variable called BMI, which is equal to weight over height squared, and then let's create a summary, and we'll create a summary variable called average BMI, which is of course equal to the mean BMI.

So let's practice our new data manipulation skills using the dplyr vocabulary and the pipe operator on our gapminder data set. I've already installed the package dplyr and called it into the session using the library command. So let's start by typing in gapminder, and then let's select the variables country and life expectancy. Now we can see just the country and life expectancy, so we've used select to narrow down the columns that we're working with. Now we want to focus on just a subset of the rows, so we want to filter for particular countries, for example. To do that we type "and then", which is the pipe operator, and filter by rows that meet certain criteria. In this case we want countries that are either South Africa or, and the vertical line is the way of saying "or", Ireland. And then we want to aggregate our data by the two countries, so we say group by country, and we can now create a summary value for the average life expectancy in each country by typing average life equals
mean of life expectancy, and here we can see that average life expectancy in this data set for Ireland is 73 years and for South Africa is 53 years. So we can see there's a difference between the two countries in terms of life expectancy of about 20 years. We want to know: is that difference statistically significant? Before we do that, I'm going to teach you a little bit about statistics, just in case you're not familiar with some of the concepts.

Right, let's imagine that we work in a factory. This factory produces jars of fizzy drinks. We've got a hundred thousand fizzy drink jars, and these jars are filled with either purple fizzy drink or green fizzy drink. Now, the extent to which these jars are filled is variable, there's a lot of variation, and we want to know: is there a statistically significant difference between the two groups in terms of the average extent to which they are filled with their respective color drink? So firstly, we've got a sample of our jars. We don't have all hundred thousand jars with us; we're going to take a sample of jars, have a look at those, and make an inference. We're going to infer something about the larger population of jars. And we can see that in our sample there is in fact a difference in terms of the average extent to which the jars are filled with their respective colored drink. Because this is a sample, of course, it is possible, even if we randomly select the jars, that we happened by chance to select jars that demonstrated this difference. So the difference could be due to chance, or it could be real. Well, we start off by assuming that we're wrong. We start off by assuming that in the population, in the hundred thousand jars, there is in fact no difference in terms of the average extent to which the jars are filled. That's our null hypothesis, and if the null hypothesis is in fact correct, then our sample difference was just due to chance. So how can we possibly know? Well, we apply something called the t-test, and that's going to check this assumption, and
the t-test produces a p-value. The p-value is the probability of getting a difference at least as big as the one in our sample if that null hypothesis is in fact correct, that there is no difference. And if that probability is very, very, very small, then we can say we can reject that null hypothesis, we don't believe it, and we can believe the alternative, that in fact there is a real difference between the average extent to which these bottles are filled. Of course, the t-test also gives us a 95% confidence interval, and that's the range within which we can expect the true difference in the means, in the average extent to which these bottles are filled, to be found.

So let's do this exact same exercise, but on our gapminder data. Right, so we've observed a difference in the average life expectancy between South Africa and Ireland. Is this due to chance? Let's do a t-test. To do this we're going to create a new data frame, and that's going to be filtered for South African and Irish data, and we're going to call that data frame 1, or df1. To do this I'm going to delete the extra code, and we're left with code that takes gapminder and then selects our two variables of interest, which are country and life expectancy, and then filters by the two countries of interest, which are South Africa and Ireland. And now data frame 1, or df1, is in our environment. We can click on it and take a look. Now apply the t-test, with data equals df1, and compare the average life expectancy in the two countries.

What does this tell us? We can see that average life expectancy in South Africa and in Ireland is 53 years and 73 years respectively, so our observed difference is about 20 years. And our question is this: if there isn't actually a difference in the life expectancy between these two countries, if the difference between the average life expectancies is in fact zero, so that's our null hypothesis, then what are the chances that from our sample we would get the difference that we observed? Well, the chances of that happening, the probability of that happening, is the p-value, in this case four point
four times 10 to the power of negative nine. That is very, very, very close to zero. Because it is so unbelievably unlikely, we can reject that null hypothesis, the assumption that in actual fact the difference between the means is zero, and we can accept the only alternative, which is that the difference between the means is not zero, that there is a real difference. And we have our 95% confidence interval for where we think the real difference is: it's likely to be between 15 and 22 years.

Now I want to show you how to produce some fantastic graphics and data visualization using the package ggplot2. Just a quick note: you know, throughout your career you're going to need to use and communicate data, and that's why I think using a free package like R is so important, because it means wherever you work, in whatever job you have, you always have access to a package that you know and understand. So I was really excited to learn that the University of Edinburgh provide a course in quantitative analysis using R, and I think that that is actually a huge selling point for their MPH program.

Okay, back to R, and let's talk about ggplot. Again, we install the package ggplot2 and tell R that we want to use it in this session using the library function, and again we're going to start by using the dplyr pipe operator to pipe data into ggplot. So we start by saying what data frame we're using, in this case gapminder, and then filter by GDP per capita less than 50,000, and I'm just doing this to show you how easy it is to use the pipe operator to have a lot of control over what data gets fed into your graphics, and then pipe it to ggplot. Now, I don't have time in this video to teach you everything there is to know about how to use ggplot, but what I do want to do is give you a quick demonstration, so that you get excited about ggplot and data visualization, and you can do the exact same data visualization that I'm going to be doing, at home, using gapminder on your computer, to practice. First we
tell it what aesthetic to use, in other words how the variables are going to be represented on our canvas. In this case we want GDP per capita to map out against the x-axis and life expectancy to map out against the y-axis. And we want to add to that, using the plus sign, what geometry needs to be applied to the canvas, and in this case we want just a point or a dot to be drawn, so we use geom_point. So when we run that, we land up with a graph that's very similar to the one that we had earlier. And so why is it that I'm so excited about ggplot? Because this seems a little bit more complicated. Well, the truth is, and stay with me now, we're going to make some small changes and turn this into a very rich data visualization.

Let's make the continents into different colors, and now you can see all the continents are represented by different colors. It looks a little busy, so let's make the points a little transparent by using alpha equals 0.5. That's a little better; let's make it 0.3. And we can make the size of the dots proportional to the size of the population, so here we can see that the bigger dots are representing the more populous countries. And of course we can do the good old log transformation to the GDP per capita data, and you can see that the data is looking a lot more linear now. And let's add another layer to our canvas: let's put in a line that tracks the various continents using geom_smooth. Now it's a little easier to see the differences between the continents. We can make that into a linear model by saying method equals lm. Let's zoom in. It still looks a little messy, so let's divide out the various continents into separate facets using facet_wrap, and let's zoom in on that. Given that our continents are in different facets now, we don't need to use color to distinguish between them; we can use color to represent the different years. And voilà, now we have one graphic that includes life expectancy, GDP per capita, population size, continent and date: five variables in one graphic that
doesn't look too busy or too messy. We've got it all there.

Now again, I don't have the time in this video to go into a lot of detail about simple and multivariate linear regression, but given that we're working with the gapminder data, let me show you how easy it is to just do a quick linear regression model, look at some results, and try and explain them to you. So very briefly, a linear model tries to represent your data using the best-fit straight line. That line is going to have a slope and will have an intercept with the y-axis at some point. So we type lm, that stands for linear model, and we start with what we think is the response variable, in this case life expectancy, and then the explanatory variable, in this case GDP per capita, which we think might explain some of the changes in life expectancy, and this gives us an intercept and a slope. Now, if we ask for a summary of that model, we type in the word summary in front of the model, and we get the residuals and the coefficients and a couple of other bits and pieces of information that I'm not going to get into in this video. In this case the p-value is the probability of seeing a slope this far from zero if the true slope is in fact zero, that's the null hypothesis, and it's very, very small, so the slope is unlikely to be zero, and this coefficient is statistically significant.

And very quickly, just to show you how easy it is, we can do multivariate analysis simply by adding additional explanatory variables to our formula. In this case we'll add population, and voilà, we have significant p-values for both explanatory variables, and these of course, because it's regression modeling, control for each other. And this of course would be even better if we did log transformations of those variables, because we know they're right-skewed, but for the sake of time I'm not going to do that right now.

Don't go anywhere, stay and watch another video. I've got other videos on research methods and data analysis if you're interested, so check them out. Subscribe to this channel if you
haven't already. Leave your thoughts, suggestions and questions in the comment section below. Once again, a big thank you to the University of Edinburgh for their support. Until next time, thank you.
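The base R steps from the first part of the walkthrough (summary, the dollar sign, assignment with the arrow, and the quick base graphics) can be sketched roughly as follows. This is a minimal sketch, assuming the gapminder package is already installed with install.packages("gapminder"). The column names are those of the gapminder data frame, and x is the object name used in the captions. One deliberate swap: with() is used here where the video uses attach(), since it gives the same bare-name access without leaving the data frame attached.

```r
# Assumes: install.packages("gapminder") has been run once beforehand
library(gapminder)

# Apply a function to the whole data frame, then to one column via `$`
summary(gapminder)
mean(gapminder$gdpPercap)

# Assign a result to a new object with the arrow, then print it
x <- mean(gapminder$gdpPercap)
x

# with() stands in for attach() here (a swap, not what the video types):
# columns can be named directly inside the call
with(gapminder, median(pop))

# Base graphics: histograms, with a log transform for the skewed variable
hist(gapminder$lifeExp)
hist(gapminder$pop)
hist(log(gapminder$pop))

# Life expectancy by continent, then life expectancy against GDP per capita
boxplot(lifeExp ~ continent, data = gapminder)
plot(lifeExp ~ gdpPercap, data = gapminder)
plot(lifeExp ~ log(gdpPercap), data = gapminder)
```

Run line by line in RStudio, each plot call should replace the previous one in the plots pane.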
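The dplyr pipeline and the t-test on South Africa and Ireland described in the captions might look something like this. It is a sketch under assumptions: gapminder and dplyr are installed, df1 is the data frame name used in the video, while Average_life and result are illustrative names chosen here. t.test() with a formula runs Welch's two-sample test by default.

```r
# Assumes: install.packages(c("gapminder", "dplyr")) has been run beforehand
library(gapminder)
library(dplyr)

# Select two columns, keep two countries, then average within each country
gapminder %>%
  select(country, lifeExp) %>%
  filter(country == "South Africa" | country == "Ireland") %>%
  group_by(country) %>%
  summarise(Average_life = mean(lifeExp))   # Average_life is an illustrative name

# The same pipeline without the summary step, saved as df1 for the t-test
df1 <- gapminder %>%
  select(country, lifeExp) %>%
  filter(country == "South Africa" | country == "Ireland")

# Welch two-sample t-test of life expectancy by country
result <- t.test(lifeExp ~ country, data = df1)
result

# Individual pieces of the output discussed in the captions
result$p.value    # captions report roughly 4.4 times 10 to the power of -9
result$conf.int   # captions report roughly 15 to 22 years
```

Saving the test as an object is optional; it just makes it easy to pull out the p-value and confidence interval separately.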
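The final ggplot2 graphic built up in the captions (log GDP per capita against life expectancy, dots sized by population and colored by year, a linear smoother, one facet per continent) could be reproduced along these lines. This is a sketch, assuming gapminder, dplyr and ggplot2 are installed. The object name p is illustrative, and placing the color and size mappings inside geom_point() rather than the top-level aes() is a choice made here so that the smoother is not affected by them; the video may arrange them differently.

```r
# Assumes: install.packages(c("gapminder", "dplyr", "ggplot2")) was run beforehand
library(gapminder)
library(dplyr)
library(ggplot2)

p <- gapminder %>%
  filter(gdpPercap < 50000) %>%                       # control what data reaches the plot
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +      # aesthetics: x and y mappings
  geom_point(aes(col = year, size = pop), alpha = 0.3) +  # semi-transparent dots
  geom_smooth(method = "lm") +                        # straight-line fit in each panel
  facet_wrap(~continent)                              # one facet per continent

p   # printing the object draws the plot
```

Removing layers from the bottom up (facet_wrap, then geom_smooth, then the color and size mappings) walks back through the intermediate plots shown in the video.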
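The two regression models from the end of the walkthrough can be sketched as below, assuming the gapminder package is installed. The names model1 and model2 are illustrative and do not come from the video.

```r
# Assumes: install.packages("gapminder") has been run once beforehand
library(gapminder)

# Simple linear regression: life expectancy explained by GDP per capita
model1 <- lm(lifeExp ~ gdpPercap, data = gapminder)
model1            # prints the intercept and the slope
summary(model1)   # residuals, coefficients and their p-values

# Adding a second explanatory variable to the formula with `+`
model2 <- lm(lifeExp ~ gdpPercap + pop, data = gapminder)
summary(model2)
```

The captions note that log-transforming the skewed variables would likely improve the fit; in a formula that would be written as log(gdpPercap) and log(pop).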
Info
Channel: Global Health with Greg Martin
Views: 1,022,972
Rating: 4.7377334 out of 5
Keywords: R programming, r programing, r programming for beginners, statistics, bio-stats, quantitative analysis, dplyr, ggplot, ggplot2, t-test, linear regression, data visualisation, statistical analysis, data science, global health, public health, data analysis, r tutorial
Id: ANMuuq502rE
Length: 15min 48sec (948 seconds)
Published: Thu Jun 08 2017