Using R to Analyze COVID-19 | R Programming Project

Video Statistics and Information

Captions
This is a coronavirus dataset that I found on Kaggle, from Johns Hopkins University, and we'll be using the COVID19_line_list_data.csv file. If we take a look at the columns, we can see the gender of a person who got infected, their age, as well as whether they recovered or died. The link to the data set will be in the description, but if you're on this website you can scroll down and click the download button.

On my desktop I'll create a folder called covid_r and put the CSV file into it. If we open the CSV file in Excel, we can take a look at what some of the data looks like. We have a bunch of columns; some of the data is missing, but that is okay. If we scroll all the way down, we can see that we have 1085 entries, so that should be plenty for our analysis.

Now I'll open up RStudio and we'll get started importing the data set. In the top left I'll click File, Import Dataset, From Text (base), and then I'll find the CSV file that we just downloaded, so I'll go to my covid_r folder on my desktop and click the CSV file. We can see in the data frame preview what all of our data will look like, and I'll click Import. We can see some of the data here that we're viewing, but I'll copy the line from the console that actually reads in our CSV file, create a new R script, and paste it in. For convenience, I'll rename this variable to data so that I don't have to type the long name every time. I'll save this R script in the folder where the CSV file is located and just call it script.

It is usually good practice to start your R scripts with this line, which just removes all of the data and all the variables that you've previously loaded. Then I'll import a library called Hmisc. To download this library, if you haven't already, you can run this command in your console: install.packages, and then Hmisc in quotes. This will take just a few seconds. Now we can use the describe command that we imported from Hmisc, so
we'll just say describe(data) and run it. After running this we'll have a lot of output in our console, but if we scroll up we can see some informative things about our data. For instance, we have 27 columns and 1085 observations. If we scroll further down, we see that 183 entries are missing gender but the others aren't. And if we scroll down to death, we have 14 distinct values, because some entries report death as 0 or 1 but some just report the date of the death. So death is 0 if the person didn't die and 1 if the person did die, but some entries report just the date. We have to clean this up because it is inconsistent and difficult to work with.

Using the dollar sign, we'll create a new column in our data set called death_dummy, and that will be equal to as.integer(data$death != 0). So we're saying that if the death column is not equal to 0 then the person died, and if it is equal to 0 then they didn't. Running the unique command on this new column, we can see that the only values are 0 and 1, so we fixed the issue and cleaned up our data.

So how would we calculate the death rate using this new column that we just created? Well, we need to see how many people died out of the total number of people that were infected in this data set. So we will sum all the death_dummy entries and divide by the number of rows, and we get a death rate of about 5.8 percent.

Moving on: according to the media, the average person who dies from the coronavirus is older than the person who survives. Can we prove that this claim is actually correct using our data? Our claim is that people who die are older than people who survive. Let's see how we can show this. We'll make a variable called dead, and this will be the subset of our data where death_dummy is equal to 1, so it'll be all the rows where people are dead. Then the alive variable will be all of the rows where people survived. We have 63 people who
died and 1022 that survived. Now let's calculate the mean age of both groups. We'll run the mean command on dead$age and then on alive$age. Let's run all of these, and we get NA as a result. Why is that? Well, let's take a look at our data. If we scroll back and find age, we can see that some of the entries are NA, and R doesn't know how to average those. So we'll put in the following option: na.rm = TRUE. This will just ignore every entry where the age is unknown. After running this we can see the age difference: the difference between dead and alive is about 20 years.

But is this statistically significant? How can we check? For this we will employ the t.test command. Let's type it in: t.test, where the first argument will be alive$age and the second will be dead$age. Now we'll type in our alternative hypothesis, which is something we do in statistics. I won't go too much into this, but you will see what the output is. This will be "two.sided" in quotes. Then we need our confidence level, so let's just try a confidence level of 95 percent, or 0.95.

Let's actually run this command and see what our output is, and take it step by step. Here are the two means we just calculated, 48 and 68.6. If you look at the 95 percent confidence interval, we see that there's a 95 percent chance that the difference in age between a person who's alive and one who's dead is from negative 24 years to negative 16.7 years. So on average the person who's alive is actually much, much younger. We can change the confidence level to 99 percent and we'll get slightly different values, but you see the point.

Now let's take a look at the p-value. That is the probability that, from the sample, we randomly got such an extreme result. This is 2.2 times 10 to the negative 16, which is basically zero, so there's essentially no chance we'd see a gap like this if the ages of the two groups were actually equal in the population. Remember, this is the sample. Normally, in
statistics, if the p-value is less than 0.05 we reject the null hypothesis and conclude that our result is statistically significant. So in this case we have concluded that the people who die from the coronavirus are indeed much older than the people who do not.

Now we can try a similar experiment with gender. Are women more likely to die from the coronavirus than men, or is it vice versa? Let's find out. Let's copy all of the code that we just wrote and change some things. Our claim is that gender has no effect, but let's see if that's true. We'll make two subsets, men and women: men will be the subset where gender equals male, and women the subset where gender equals female. Now let's calculate the means: the mean of men$death_dummy and then the mean of women$death_dummy. Let's run these four commands. We can see that men actually have a death rate in this data set of 8.5 percent, as opposed to 3.7 percent for women. That is a pretty large discrepancy.

Again, is this statistically significant? Like in the previous example, we'll use t.test to find out whether these means are actually an accurate representation of reality in the population. We'll just fill in t.test with men$death_dummy and then women$death_dummy, keep the same alternative hypothesis and the same confidence level, and run it. We can see that the means are the same as we calculated, and we can say with 99 percent confidence that men have fatality rates from 0.78 to 8.8 percentage points higher than women.

Now let's take a look at our p-value. In this case it's 0.002, which is still much less than 0.05, so we can again reject the null hypothesis and conclude that this is statistically significant. What is statistically significant? Well, the fact that men have higher death rates than women in the sample, and that that is representative of
the population.

In the description you'll find a link to the code, as well as a link to an article that I wrote about the same topic. Thank you for watching this video, and please consider subscribing for more videos like this.
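
The import, inspection, and cleaning steps described in the captions can be sketched in R roughly as follows. This is a minimal sketch, not the video's actual script: the small CSV written to a temporary file is made-up stand-in data (the real file comes from Kaggle), the commented-out rm(list = ls()) is only an assumption about which workspace-clearing line the video means, and Hmisc is used only if it happens to be installed.

```r
# rm(list = ls())  # assumed: a common line for clearing previously loaded variables

# Made-up stand-in for the Kaggle CSV so the sketch runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,age,gender,death",
  "1,34,male,0",
  "2,71,female,1",
  "3,,male,0",
  "4,80,male,2/21/2020",
  "5,45,female,0",
  "6,66,,0"
), csv_path)

data <- read.csv(csv_path, stringsAsFactors = FALSE)

# describe() comes from Hmisc; fall back to base summary() if it is missing.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(data))
} else {
  print(summary(data))
}

# death holds "0", "1", or a date string; anything other than "0" counts as a death.
data$death_dummy <- as.integer(data$death != 0)
print(unique(data$death_dummy))

# Death rate = number of deaths divided by the total number of rows.
death_rate <- sum(data$death_dummy) / nrow(data)
print(death_rate)
```

With the real file, the read.csv line would point at the downloaded CSV instead of the temporary one.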
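
The age comparison from the captions (subsetting, means with missing values, and the two-sample t-test) might look like the sketch below. The ages are invented for illustration, so the numbers will not match the 48 and 68.6 reported in the video; only the column names age and death_dummy and the t.test arguments follow the captions.

```r
# Made-up ages for illustration; NA marks an unknown age.
data <- data.frame(
  age         = c(25, 38, NA, 44, 52, 31, 47, 60, 66, 72, NA, 81, 69),
  death_dummy = c(rep(0L, 8), rep(1L, 5))
)

dead  <- subset(data, death_dummy == 1)
alive <- subset(data, death_dummy == 0)

# Without na.rm, a single missing age makes the whole mean NA.
print(mean(dead$age))

# na.rm = TRUE drops unknown ages before averaging.
mean_dead  <- mean(dead$age,  na.rm = TRUE)
mean_alive <- mean(alive$age, na.rm = TRUE)

# Two-sample t-test (Welch by default); t.test removes NA values itself.
res <- t.test(alive$age, dead$age,
              alternative = "two.sided", conf.level = 0.95)
print(res)
```

Because alive$age is passed first, the confidence interval is for the alive mean minus the dead mean, which is why the interval in the video is negative.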
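
The gender comparison follows the same pattern. As a hedged sketch with invented rows (the counts below are not from the data set), the mean of the 0/1 death_dummy column within each subset gives that group's death rate, and t.test compares the two groups at the 99 percent level mentioned in the captions.

```r
# Made-up rows for illustration; NA marks a missing gender.
data <- data.frame(
  gender      = c(rep("male", 10), rep("female", 10), NA, NA),
  death_dummy = c(1L, 1L, 1L, rep(0L, 7),  # 3 of 10 men died
                  1L, rep(0L, 9),          # 1 of 10 women died
                  0L, 1L)                  # gender unknown
)

# subset() drops rows where the condition is NA, so unknown genders are excluded.
men   <- subset(data, gender == "male")
women <- subset(data, gender == "female")

# The mean of a 0/1 column is the share of ones, i.e. the death rate.
rate_men   <- mean(men$death_dummy)
rate_women <- mean(women$death_dummy)

res <- t.test(men$death_dummy, women$death_dummy,
              alternative = "two.sided", conf.level = 0.99)
print(res$conf.int)
print(res$p.value)
```

With so few invented rows the difference need not come out significant; the sketch only shows the mechanics.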
Info
Channel: Tech Tribe
Views: 33,749
Rating: 4.9546027 out of 5
Keywords: R programming, data analysis with r, how to program in r, how to r programming, r basics for beginners, r basics for data analysis, r basics for data science, r data analysis, r data analysis project, r data analysis tutorial, r data science tutorial, r programming example, r programming for beginners tutorial, r programming for data science, r programming project, r project, r tutorial for beginners youtube, r tutorial for data science, r tutorial introduction to r, using r
Id: D_CNmYkGRUc
Length: 8min 5sec (485 seconds)
Published: Tue Mar 24 2020