SAS - Logistic Regression

Video Statistics and Information

Captions
Hello YouTubers, Krohn Education here with another video on statistics. Specifically, we'll cover logistic regression and how it differs from multiple linear regression or simple linear regression.

Before we get started, let's talk about the data that I'll be using. I'll run this here quick for you so you can see what we're working with. There will be three variables: age, gender, and status. I took this data, which you can probably find online somewhere, and it models the Donner Party. The Donner Party was a group of families who tried to move from Illinois to California, and then essentially they get caught in this big snowstorm and a lot of the people end up dying. So over here, status is whether they died or survived: one equals survived, zero equals the person died. Our ultimate goal in this example is to determine if age and gender had any influence on whether the person ended up dead or alive, and if we can essentially predict whether the person would have ended up dying or surviving based upon their age and gender.

Before we go any further, I'd like to talk about the difference in assumptions between simple linear regression, multiple linear regression, and logistic regression, so I have a nice little table here. In simple linear regression and multiple linear regression, we assume that errors are independent, errors are normally distributed, errors have constant variance, and that there is a linear relationship between the predictor and the response. We could test each of these things as well, so we might do a White test for constant variance and a Shapiro-Wilk test for normally distributed errors. So we could test each of these things, but ultimately we're assuming that each of these is satisfied. In logistic regression, we don't have to assume that all of these things are going to be satisfied. In fact, we don't have to test and make sure that they're all actually being satisfied
either. So the only thing that we need to make sure of is that the errors are independent, and that's really our big assumption for logistic regression. We aren't going to be concerned whether the errors are normally distributed or whether they have constant variance. Now this last one here is a bit tricky. It says linear relationship between the predictor and the response. In logistic regression there is not a linear relationship specifically between the predictors and the response, but there is a linear relationship between the predictors and the logit of the response, and we'll talk about that in a second once we go over our models.

So let's go back into SAS and I'll give us some crude versions of our models here in this text editor. When we were in multiple linear regression, this is what we assumed our model to be; I'll just put this here in a comment. In multiple linear regression, and I'm going to model just this specific situation, we would have y equal to beta 0 for the intercept, plus beta 1, the coefficient for the x1 predictor, and then beta 2, the coefficient for the x2 predictor, and then we'd have some error at the end. This is the true error, epsilon, that we don't really know. Once we actually estimate this regression line, this would change to y hat, the betas would go to lowercase b's, and the error would essentially be a residual, but it wouldn't be included in the y hat model. So that's our multiple linear regression model.

Now let's do our logistic regression model, which is going to be a bit different. I said that there is a linear relationship between the logit of the response and the predictors. So here is the logit of the response: it's the natural log of p divided by 1 minus p, where p is just a probability (this is typically represented with a pi). And then it's just equal to, and notice that the right side is
essentially the same, except there's no error term here. So this is not an estimate, but there's still no error term. Unlike multiple linear regression, we have no true error term at the end.

So we have our logistic model now; let's talk a little bit more about what we can do with it. Notice that when I was going over the data, I said that our response variable, which is the status, can take on two values, 1 or 0. That's really what separates the logistic model from multiple linear regression, because in our logistic model y can only take on two possible values. It's a binary variable: it can be 0 or it can be 1, and that is it.

Now the nice thing about our logistic model is that it makes it very easy to calculate the odds of our response being one, or the probability of our response being one, and that's what this model really lends itself to doing. I can show you this with a simple odds calculation. What I mean by odds is the same thing as when you're gambling on a horse: maybe it's four-to-one odds, which means it's four out of five in your favor that the horse is going to win, or an 80% chance. So let's do an odds calculation. Essentially, all you do is raise e to both sides, and then you can get rid of this natural log, and the right side is now all raised as a power of e. So the odds is going to be this entire left side of the equation, p divided by 1 minus p. The nice thing here is that if we know the age and we know the gender, then it's very easy to calculate the odds, because we could just plug the age into x1 and plug the gender into x2, and there we have it. We'll be able to calculate the odds very easily once we have our estimated regression equation; we'll still need to get that first, though. While we're here, I'll show the probability equation as well, and the
probability equation is going to be very similar. This just took some algebraic manipulation, and if you know your algebra you'll be able to easily get it into this form, but essentially you just solve for p. Once you do that, it'll be e to the beta 0 plus and so on (these dots just represent the rest of the right-hand side, I just didn't write it all out), divided by 1 plus e to the beta 0 plus beta 1 x1 plus beta 2 x2. So let's say we wanted the probability of survival given that x1 is equal to some age and x2 is equal to some gender. Then we could plug those values into our estimated regression equation and find the probability, which is probably more interesting to us than the odds. It would be very useful to be able to say that a person with this age and this gender has an 87 percent chance of surviving.

We've hit the point now where you're probably saying, well, you're talking about this estimated regression equation, how do we get it? SAS will give it to us very simply, and it's very similar to proc reg that we used in multiple and simple linear regression, but now it's going to be proc logistic. So proc logistic, data equals our data, which is just called die, and then, similar to proc reg, we're going to say model status, which is the response, equals age and gender.

Before we run this, let's take a second and note this comment I have here. I'm using the descending option after proc logistic. What does that do? Descending tells SAS to model the ones rather than the zeros. By default SAS is going to model the zeros, so if we were to calculate the probability, it would actually be the probability of dying instead of the probability of surviving. For this example I specifically want the probability and the odds of surviving, that is, the probability of the response equaling one, so I'm going to say descending, which tells SAS to model the ones rather than the zeros. Alright, so let's
go ahead and run this, and we'll talk about how to interpret this printout. The very first thing we want to establish is our model, and we want to find the coefficients for our predictors. This is very easy, the same procedure as proc reg: we go down here where it says intercept, age, and gender, and here are the estimates for each of these. So intercept is 3.23, age is negative 0.07, and gender is negative 1.59. To form our estimated regression equation, all we need to do is take these estimates and plug them in: the intercept for beta 0, age for beta 1, and gender for beta 2. That will give us our estimated regression equation.

For the remainder of the video, I want to talk about the three big tests we can do: the overall test, the individual test for each predictor, and a partial test similar to the partial F test. Let's start with the overall test. The null hypothesis for the overall test is that the coefficient of each predictor is equal to zero, meaning each of them is insignificant, and the alternative for the overall test is that at least one of them is significant. Typically this is the test that is done first, just to make sure that at least one of the predictors is significant in predicting the response. This is very easy to find in the printout, because it says here "Testing Global Null Hypothesis: BETA=0", meaning testing that all the betas are equal to zero. I believe typically all three of these will yield similar results, but I always use the likelihood ratio, and we see that there is a p-value of 0.0051. So even if we chose an alpha of 0.05, we would still reject, because the p-value ends up being less than our alpha. An alpha of 0.05 corresponds to a pretty high confidence level, 95%, so that would still indicate that at least one of our betas is significant. That's pretty easy.

Let's go on now to the individual tests. The big difference here that you probably already noticed is that when we did multiple and
simple linear regression, this was a t statistic, and then we had a probability for the t statistic right here. Instead, notice we have this Wald chi-square. How do we calculate that Wald chi-square? Let me pull up the calculator and I'll show you how we calculate it from our estimate and our standard error. Let's do age. Age has an estimate of negative 0.0782 and a standard error of 0.0373. When we did the t statistic, it was just the estimate divided by the standard error, but now that we're doing the Wald chi-square it'll be a bit different: it's going to be the estimate divided by the standard error, the quantity squared. So let's divide by our standard error of 0.0373, and now let's go ahead and square, and we get 4.395, or 4.39, which is incredibly similar to what SAS gives us here. The p-value is then going to come from a chi-square distribution with one degree of freedom, and it's always going to be the right tail.

The hypotheses for these are going to be the same as for the individual t-test. If we were testing age, the null would be that beta 1 is equal to zero, and the alternative hypothesis would be that beta 1 is not equal to zero. You have to remember that this is the marginal contribution, so this is the contribution of age given that the rest of the model stays the same. It's asking: is age significant, given that gender and the intercept are already in the model? Again, let's say our alpha is equal to 0.05. We have a p-value of 0.03, so our p-value is less than alpha, and we would reject the null and conclude that age is significant. You could do that for each of these, and we would find that they are all significant with an alpha equal to 0.05.

All right, the last test is the partial test. This is really equivalent to the partial F test, and
essentially it's saying: let's construct a reduced model and determine if this reduced model is sufficient to replace the full model. Let's say that our full model includes the intercept, age, and gender, and that our reduced model only contains the intercept. The reason I'm doing this is that the printout already provides us with this comparison, so it'll be nice and easy for us to calculate. If you remember from multiple linear regression, we had a fairly complicated formula for our F statistic, but here it's going to be very simple. We're going to use this negative 2 log L, and all we're going to do is take the reduced minus the full. That will give us our statistic, and it will again follow a chi-square distribution. So we take our reduced, which is the intercept only, 61.827, and subtract the full, 51.256, and that will be 10.571. This gives us a chi-square statistic of 10.571, and the degrees of freedom are the difference in degrees of freedom between the full and the reduced. Here the full model has two more predictors, and the reduced has two fewer, so it's going to be two degrees of freedom. With 10.571 in a chi-square with two degrees of freedom, you can find your p-value. I found this p-value using a TI-83, and it looks like it is 0.005.

So what does that mean to us? The null hypothesis for this is the reduced model, so it's whatever betas need to equal zero to get you to the reduced model. In this case that would be beta 1 equals zero and beta 2 equals zero. That's our null. Our alternative is the full model. We want to see if we can reject this reduced model, and in this
case, since our p-value is 0.005, we can go ahead and reject it. We'll reject the reduced model, because according to that p-value we need to use the full model.

Now you might be saying, what if the situation is more complicated than this? What if we have, say, five predictors, and you want to test if two of them are equal to zero, and that's what gives you the reduced model? You can certainly still do that. You're just going to have to run proc logistic twice, which gives you two values of negative 2 log L, and those are what you'll subtract. You won't use the intercept-only value anymore; you'll take the reduced model's statistic and subtract the full model's statistic, and that will give you your chi-square statistic. Again, make sure that the degrees of freedom is the number of betas that are set equal to zero.

So we've now covered the big tests that we can do using this printout, and the last thing I'd like to do is some concrete examples involving this data. Let's say I want to answer this question: what are the odds that a female one-year-old child will survive, given the data that we have? The first thing we need to do is come up with our estimated regression equation, and given our parameter estimates from the printout, I have it here for us. It's going to be p divided by 1 minus p (again, p represents probability) equal to e to the 3.2304 minus, and then we have age and gender with their coefficients. Let's start by calculating odds; we're going to do odds first, not probability. Calculating the odds is very simple, since this is essentially the equation for odds. We just need to plug in for age and gender, and then we'll have our odds. We know the age is 1 and the gender is female, and, I don't know if I stated this up here, yes, I did: female is equal to zero. So now
that's essentially going to zero out this whole gender term. So all we need to do is take e to the 3.2304 minus 0.0782, and that is going to give us our odds, which comes out to be 23.39. Now, I'd say the average person does not deal with odds on a daily basis (maybe you gamble more than me), so let's determine the probability, which should be a more useful number to us. What is the probability that a female one-year-old child will survive? The probability is a slightly different calculation: we just need to take the last thing that we had and solve it for p. So our probability is p equal to 23.39 divided by 1 plus 23.39, and once you work that out it should give you 0.959. We can write this as a percent, which is 95.9 percent. You might be thinking that is very high, but if you do this again for, let's say, someone who is 75 years old, their odds of survival are going to go down significantly. I'm sorry, old people: if you're put in the same situation, your odds of survival will be much lower.

That does it for logistic regression. Before I leave, I'll tell you one thing that I left out of this model, which was an interaction term. It's very easy to add this in if you know your model needs to include the interaction. Proc logistic enables us to just put age times gender, and this will automatically generate the interaction term. There's no need to go back up to the data and create a whole new variable. Then you can just run this, and it will add age times gender to our estimates right here as our interaction term, and you could test it individually just like everything else. But
thanks for watching.
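For reference, the three equations that the video builds up verbally (the logit model, the odds, and the probability) can be written out as below. This is a reconstruction from the spoken description, not a copy of the on-screen text: the symbols are assumptions, with p the probability that status equals 1, x_1 the age, and x_2 the gender.

```latex
% Logistic model: the logit of the response is linear in the predictors (no error term)
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2

% Odds: exponentiate both sides
\frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}

% Probability: solve the odds equation for p
p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}
```

Each line follows from the one before it by the algebra described in the video: exponentiating removes the natural log, and solving for p gives the probability form.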
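The Wald test for age and the partial (likelihood ratio) test from the video can be checked numerically with a short sketch like the one below. It is only an illustration using the numbers read aloud from the printout; the variable and function names are made up for this example. Instead of a calculator or SAS, it uses the standard closed-form right-tail formulas for chi-square distributions with 1 and 2 degrees of freedom, so it does not generalize to other degrees of freedom.

```python
import math


def chi2_right_tail_df1(x):
    """Right-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_right_tail_df2(x):
    """Right-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Individual (Wald) test for age: (estimate / standard error) squared, 1 df.
age_estimate = -0.0782   # as read from the printout in the video
age_std_error = 0.0373   # as read from the printout in the video
wald_chi2 = (age_estimate / age_std_error) ** 2
wald_p = chi2_right_tail_df1(wald_chi2)

# Partial test: -2 Log L of the reduced (intercept-only) model minus the full model.
neg2logl_reduced = 61.827
neg2logl_full = 51.256
lr_chi2 = neg2logl_reduced - neg2logl_full
lr_p = chi2_right_tail_df2(lr_chi2)  # two betas set to zero, so 2 df

print(f"Wald chi-square for age: {wald_chi2:.3f} (p = {wald_p:.3f})")
print(f"Likelihood ratio chi-square: {lr_chi2:.3f} (p = {lr_p:.4f})")
```

For a reduced model that drops a different number of predictors, the same subtraction applies, but the p-value would need a general chi-square tail function with the matching degrees of freedom.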
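The odds and probability example for the one-year-old female can be reproduced with a minimal sketch like the following. The coefficients are the estimates quoted in the video; the gender coefficient is only stated to two decimals, and the coding of male as 1 is an assumption (the video only says female is 0). The function names are hypothetical.

```python
import math

# Parameter estimates as quoted in the video.
B0 = 3.2304         # intercept
B_AGE = -0.0782     # age coefficient
B_GENDER = -1.59    # gender coefficient (two decimals only in the video)


def survival_odds(age, gender):
    """Odds of survival, p / (1 - p). gender: 0 = female, 1 = male (assumed coding)."""
    return math.exp(B0 + B_AGE * age + B_GENDER * gender)


def survival_probability(age, gender):
    """Probability of survival, obtained by solving the odds equation for p."""
    odds = survival_odds(age, gender)
    return odds / (1.0 + odds)


odds_child = survival_odds(age=1, gender=0)
prob_child = survival_probability(age=1, gender=0)
prob_75 = survival_probability(age=75, gender=0)

print(f"Odds, 1-year-old female: {odds_child:.2f}")
print(f"Probability, 1-year-old female: {prob_child:.3f}")
print(f"Probability, 75-year-old female: {prob_75:.3f}")
```

Because the age coefficient is negative, the probability for the 75-year-old comes out far lower than for the child, which matches the remark at the end of the example.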
Info
Channel: Krohn - Education
Views: 32,597
Rating: 4.9189191 out of 5
Keywords: SAS, programming, statistics, chi square, logstic, regression, simple linear regression, multiple linear regression, logistic regression
Id: H2s1wgTDqqg
Length: 17min 56sec (1076 seconds)
Published: Mon Dec 19 2016