Logistic Regression in R, Clearly Explained!!!!

Video Statistics and Information

Captions
StatQuest gettin' freaky, StatQuest kind of sneaky... StatQuest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today, at long last, we're going to cover logistic regression in R. Note: a link to the code, which is chock full of comments and should be easy to follow, is in the description below.

For this example we're going to get a real data set from the UCI Machine Learning Repository; specifically, we want the heart disease data set. Note: this is the same data set we used when we made random forests in R, so if you're familiar with that data, you can skip ahead to about 3 minutes and 44 seconds in this video.

We start by making a variable called url and setting it to the location of the data we want. This is how we read the data set into R from the url. The head() function shows us the first six rows of data. Unfortunately, none of the columns are labeled. Wah-wah! So we name the columns after the names that were listed on the UCI website. Hooray! Now when we look at the first six rows with the head() function, things look a lot better.

However, the str() function, which describes the structure of the data, tells us that some of the columns are messed up. Right now sex is a number, but it's supposed to be a factor, where 0 represents female and 1 represents male. cp, aka chest pain, is also supposed to be a factor, where levels 1 through 3 represent different types of pain and 4 represents no chest pain. ca and thal are correctly called factors, but one of the levels is "?" when we need it to be NA. So we've got some cleaning up to do.

The first thing we do is change the question marks to NAs. Then, just to make the data easier on the eyes, we convert the 0s in sex to "F", for female, and the 1s to "M", for male; lastly, we convert the column into a factor. Then we convert a bunch of other columns into factors, since that's what they're supposed to be (see the UCI website or the sample code on the StatQuest blog for more details). Since the ca column originally had a question mark in it, rather than NA, R thinks it's a column of strings. We correct that assumption by telling R that it's a column of integers, and then we convert it to a factor. Then we do the same thing for thal.

The last thing we need to do to the data is make hd, aka heart disease, a factor that is easy on the eyes. Here I'm using a fancy trick with the ifelse() function to convert the 0s to "Healthy" and the 1s to "Unhealthy". We could have done a similar trick for sex, but I wanted to show you both ways to convert numbers to words. Once we're done fixing up the data, we check that we have made the appropriate changes with the str() function. Hooray! It worked.

Now we see how many samples (rows of data) have NA values; later we will decide if we can just toss these samples out, or if we should impute values for the NAs. Six samples have NAs in them. We can view the samples with NAs by selecting those rows from the data frame, and there they are. Five of the six samples are male, and two of the six have heart disease. If we wanted to, we could impute values for the NAs using a random forest or some other method; however, for this example we'll just remove these samples. Including the six samples with NAs, there are 303 samples. Then we remove the six samples that have NAs, and after removing those samples, there are 297 samples remaining. Okay!

Now we need to make sure that healthy and diseased samples come from each gender (female and male); if only male samples have heart disease, we should probably remove all females from the model. We do this with the xtabs() function: we pass xtabs() the data and use model syntax to select the columns in the data we want to build a table from; in this case, we want a table with heart disease and sex. And BAM! Healthy and unhealthy patients are both represented by a lot of female and male samples. Now let's verify that all four levels of chest pain (cp, for short) were reported by a bunch of patients. Yes!
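The loading and clean-up steps described above can be sketched in R roughly like this. This is a sketch rather than the exact code from the video: the column names follow the UCI heart disease documentation, and I read the "?"s in as NAs directly with na.strings, which has the same effect as replacing them after the fact.

```r
# Read the raw data from the UCI Machine Learning Repository;
# na.strings = "?" turns the "?" placeholders into NAs on read
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header = FALSE, na.strings = "?")

# Name the columns after the names listed on the UCI website
colnames(data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "hd")

head(data)  # first six rows, now labeled
str(data)   # structure: several columns still need to become factors

# Make sex easier on the eyes, then convert it to a factor
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
data$sex <- as.factor(data$sex)

# Convert the other categorical columns to factors
data$cp      <- as.factor(data$cp)
data$fbs     <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang   <- as.factor(data$exang)
data$slope   <- as.factor(data$slope)
data$ca      <- as.factor(data$ca)
data$thal    <- as.factor(data$thal)

# The ifelse() trick: convert hd to readable labels, then to a factor
data$hd <- ifelse(test = data$hd == 0, yes = "Healthy", no = "Unhealthy")
data$hd <- as.factor(data$hd)

# Count the rows with NAs, then remove them (303 -> 297 samples)
nrow(data[is.na(data$ca) | is.na(data$thal),])
data <- data[!(is.na(data$ca) | is.na(data$thal)),]
nrow(data)

# Make sure healthy and diseased samples come from each gender
xtabs(~ hd + sex, data = data)
```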
Then we do the same thing for all of the boolean and categorical variables that we are using to predict heart disease. Here's something that could cause trouble: for the resting electrocardiographic results (restecg), only four patients represent level 1. This could potentially get in the way of finding the best fitting line; however, for now we'll just leave it in and see what happens. And then we just keep looking at the remaining variables to make sure that they're all represented by a reasonable number of patients.

Okay, we've done all the boring stuff. Now let's do logistic regression!

Let's start with a super simple model: we'll try to predict heart disease using only the gender of each patient. Here's our call to the glm() function, the function that performs generalized linear models. First we use formula syntax to specify that we want to use sex to predict heart disease. Then we specify the data that we are using for the model. Lastly, we specify that we want the binomial family of generalized linear models; this makes the glm() function do logistic regression, as opposed to some other type of generalized linear model. Oh, I almost forgot to mention that we are storing the output from the glm() function in a variable called logistic.

We then use the summary() function to get details about the logistic regression. BAM!

The first line has the original call to the glm() function. Then it gives you a summary of the deviance residuals. They look good, since they are close to being centered on 0 and are roughly symmetrical. If you want to know more about deviance residuals, check out the StatQuest "Deviance Residuals, Clearly Explained".

Then we have the coefficients. They correspond to the following model: heart disease = -1.0438 + 1.2737 × (the patient is male), where the variable "the patient is male" is equal to 0 when the patient is female and 1 when the patient is male.
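The per-variable checks and the super simple model can be sketched like this (variable names as described in the transcript):

```r
# Verify each categorical predictor is represented by both outcomes;
# restecg is the troublesome one: only 4 patients at level 1
xtabs(~ hd + cp, data = data)
xtabs(~ hd + restecg, data = data)

# Predict heart disease using only sex;
# family = "binomial" makes glm() do logistic regression
logistic <- glm(hd ~ sex, data = data, family = "binomial")
summary(logistic)
```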
Thus, if we are predicting heart disease for a female patient, we get the following equation: heart disease = -1.0438 + 1.2737 × 0. This reduces to heart disease = -1.0438. Thus, the log(odds) that a female has heart disease equals -1.0438.

If we are predicting heart disease for a male patient, we get the following equation: heart disease = -1.0438 + 1.2737 × 1, and that reduces to heart disease = -1.0438 + 1.2737. Since the first term is the log(odds) of a female having heart disease, the second term indicates the increase in the log(odds) that a male has of having heart disease. In other words, the second term is the log of the odds ratio: the odds that a male will have heart disease over the odds that a female will have heart disease.

This part of the logistic regression output shows how the Wald test was computed for both coefficients, and here are the p-values. Both p-values are well below 0.05, and thus the log(odds) and the log(odds ratio) are both statistically significant. But remember, a small p-value alone isn't interesting; we also want large effect sizes, and that's what the log(odds) and the log(odds ratio) tell us. If you want to know more details on the coefficients and the Wald test, check out the following StatQuests: "Odds and Log(Odds), Clearly Explained", "Odds Ratios and Log(Odds Ratios), Clearly Explained", and "Logistic Regression Details Part 1: Coefficients".

Next we see the default dispersion parameter used for this logistic regression. When we do normal linear regression, we estimate both the mean and the variance from the data. In contrast, with logistic regression we estimate the mean of the data, and the variance is derived from the mean.
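To make those coefficients concrete, here's a small sketch (assuming the fitted values above) that converts the log(odds) to probabilities with plogis(); the approximate outputs are my own arithmetic, not numbers from the video:

```r
female.log.odds <- -1.0438           # intercept: log(odds) a female has heart disease
male.log.odds   <- -1.0438 + 1.2737  # intercept + the sexM coefficient

# plogis() converts log(odds) to probability: p = exp(x) / (1 + exp(x))
plogis(female.log.odds)  # about 0.26: probability a female has heart disease
plogis(male.log.odds)    # about 0.56: probability a male has heart disease

# Exponentiating the sexM coefficient gives the odds ratio
# (odds a male has heart disease / odds a female has heart disease)
exp(1.2737)              # about 3.6
```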
Because we are not estimating the variance from the data, and are instead just deriving it from the mean, it is possible that the variance is underestimated; if so, you can adjust the dispersion parameter in the summary() command.

Then we have the null deviance and the residual deviance. These can be used to compare models, compute R-squared, and compute an overall p-value. For more details, check out the StatQuests "Logistic Regression Details Part 3: R-squared and its p-value" and "Saturated Models and Deviance Statistics, Clearly Explained".

Then we have the AIC, the Akaike information criterion, which in this context is just the residual deviance adjusted for the number of parameters in the model. The AIC can be used to compare one model to another. Lastly, we have the number of Fisher scoring iterations, which just tells us how quickly the glm() function converged on the maximum likelihood estimates for the coefficients. If you want more details on how the coefficients were estimated, check out the StatQuest "Logistic Regression Details Part 2: Fitting a Line with Maximum Likelihood". DOUBLE BAM!!

Now that we've done a simple logistic regression using just one of the variables, sex, to predict heart disease, we can create a fancy model that uses all of the variables to predict heart disease. The formula syntax hd ~ . means that we want to model heart disease (hd) using all of the remaining variables in our data frame, called data. We can then see what our model looks like with the summary() function. Dang! The summary goes off the screen. No worries, we'll just talk about a few of the coefficients.

We see that age isn't a useful predictor, because it has a large p-value; however, the median age in our data set was 56, so most of the folks were pretty old, and that explains why it wasn't very useful. Gender is still a good predictor, though. If we scroll down to the bottom of the output, we see that the residual deviance and the AIC are both much smaller for this fancy model than they were for the simple model, when we only used gender to predict heart disease.
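The fancy model is a one-liner; hd ~ . expands to a formula containing every column of data except hd itself:

```r
# Use all of the remaining variables in the data frame to predict heart disease
logistic <- glm(hd ~ ., data = data, family = "binomial")
summary(logistic)
```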
To calculate McFadden's pseudo R-squared, we can pull the log-likelihood of the null model out of the logistic variable by getting the value for the null deviance and dividing by -2, and we can pull the log-likelihood for the fancy model out of the logistic variable by getting the value for the residual deviance and dividing by -2. Then we just do the math, and we end up with a pseudo R-squared = 0.55. This can be interpreted as the overall effect size. And we can use those same log-likelihoods to calculate a p-value for that R-squared using a chi-squared distribution. In this case, the p-value is tiny, so the R-squared value isn't due to dumb luck. One last shameless self-promotion: more details on the R-squared and p-value can be found in the StatQuest "Logistic Regression Details Part 3: R-squared and its p-value".

Lastly, we can draw a graph that shows the predicted probability that each patient has heart disease, along with their actual heart disease status. I'll show you the code in a bit. Most of the patients with heart disease (the ones in turquoise) are predicted to have a high probability of having heart disease, and most of the patients without heart disease (the ones in salmon) are predicted to have a low probability of having heart disease. Thus, our logistic regression has done a pretty good job! However, we could use cross-validation to get a better idea of how well it might perform with new data, but we'll save that for another day.

To draw the graph, we start by creating a new data frame that contains the probabilities of having heart disease along with the actual heart disease status. Then we sort the data frame from low probabilities to high probabilities. Then we add a new column to the data frame that has the rank of each sample, from low probability to high probability. Then we load the ggplot2 library, so we can draw a fancy graph, and the cowplot library, so that ggplot has nice-looking defaults. Then we call ggplot(), using geom_point() to draw the data, and lastly we call ggsave().
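The pseudo R-squared calculation and the plotting steps just described can be sketched as follows. This assumes the fancy model is stored in logistic; the data-frame and file names are my choices, not necessarily the ones in the video:

```r
# McFadden's pseudo R^2: log-likelihoods are deviances divided by -2
ll.null     <- logistic$null.deviance / -2
ll.proposed <- logistic$deviance / -2

(ll.null - ll.proposed) / ll.null          # pseudo R^2, about 0.55
1 - pchisq(2 * (ll.proposed - ll.null),    # p-value for that R^2
           df = length(logistic$coefficients) - 1)

# New data frame: predicted probabilities plus actual heart disease status
predicted.data <- data.frame(
  probability.of.hd = logistic$fitted.values,
  hd = data$hd)

# Sort from low to high probability, then add a rank column
predicted.data <- predicted.data[order(predicted.data$probability.of.hd,
                                       decreasing = FALSE),]
predicted.data$rank <- 1:nrow(predicted.data)

library(ggplot2)  # for the fancy graph
library(cowplot)  # older versions set nice-looking ggplot defaults on load

ggplot(data = predicted.data, aes(x = rank, y = probability.of.hd)) +
  geom_point(aes(color = hd), alpha = 1, shape = 4, stroke = 2) +
  xlab("Index") +
  ylab("Predicted probability of getting heart disease")

ggsave("heart_disease_probabilities.pdf")  # save the graph as a PDF
```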
This saves the graph as a PDF file. TRIPLE BAM!!!

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more of them, please subscribe. And if you want to support StatQuest, well, please click the like button below and consider buying one or two of my original songs. Alright, until next time: Quest on!
Info
Channel: StatQuest with Josh Starmer
Views: 291,333
Rating: 4.930161 out of 5
Keywords: StatQuest, Joshua Starmer, Statistics, Machine Learning, Generalized Linear Models, GLM, Logistic Regression, Clearly Explained
Id: C4N3_XJJ-jU
Length: 17min 15sec (1035 seconds)
Published: Thu Jul 26 2018