Binary logistic regression using Stata (2018)

Captions
In this video I'm going to demonstrate the use of the Stata program in carrying out a binary logistic regression. Binary logistic regression is carried out in those cases where you have a dependent variable that is dichotomous, with the levels representing group membership, and the basic idea is that you are attempting to predict group membership as a function of a set of predictor variables.

In the data set I have open right now, one of the variables is going to be the dependent measure: the donate variable. The analysis is aimed at predicting participants' intentions to donate, or not to donate, to a political cause. The donate variable is coded 0 for intend not to donate and 1 for intend to donate, and we're going to see whether donation intentions vary as a function of a set of predictors. If we take a look in the Data Editor, you can see the donate variable coded 0 for intend not to donate and 1 for intend to donate. For gender we have codes of 0 for male and 1 for female, and then we have political interest, dogmatism, political advocacy, and consideration of future consequences (CFC); these variables are all treated as continuous.

To carry out a basic logistic regression using the menu option, we'll go to Statistics > Binary outcomes > Logistic regression. We actually have a couple of options here; I'm going to stick with this one for the time being. We'll click on Dependent variable and choose donate, our outcome variable, where, as I said, a value of 0 represents an intention not to donate and a value of 1 represents an intention to donate. It is important to note that in the context of logistic regression we generally focus on the probability of membership in a target category, so we have to think of one category as reflecting a baseline or reference category and the other as representing a target category. It's very much akin to our standard notions of dummy coding, where one level serves as a reference or baseline group and the other represents a target group. At any rate, we'll go to Independent variables and select political interest, dogmatism, and gender. By virtue of gender being dummy coded with only two levels, we can include it and treat it as a scale variable even though it's technically a nominal variable. If we happened to have a nominal or ordinal variable with more than two levels, we might choose to treat it as a factor (there are ways of doing that through the program), or just create a set of dummy variables and still run it through this option.

I'm going to click OK, and we have the basic model. In evaluating the model, there are really two levels we want to pay attention to: the overall fit of the model to the data and, if the overall fit looks good, the individual predictors in the model. You'll notice that up here we have what's called the likelihood ratio chi-square test. This test assesses whether the model containing our full slate of predictors represents a significant improvement in fit over a null model with no predictors. If this test is statistically significant, that indicates evidence of good model fit, at least relative to a null model.
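As the video notes later on, Stata also echoes the equivalent command when you run a model through the menus. A minimal sketch of that command, assuming the predictors in this dataset are named polint, dogmatism, and gender (placeholder names; the actual variable names are not shown here):

    * Binary logistic regression: donate (0/1) on three predictors
    * (predictor names are assumed placeholders)
    logit donate polint dogmatism gender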
Back in the output, we have the chi-square test statistic and the p-value right here, and you can see that the p-value is less than our conventional 0.05 threshold. So we would reject the null hypothesis that the baseline model and our full model exhibit equivalent fit, and conclude that our current model exhibits a significant improvement in fit over the baseline or null model.

We also have the pseudo R-squared, which is McFadden's R-squared. This is basically an analogy to the least-squares R-squared: it's not computed in the same way and it doesn't mean exactly the same thing. It is certainly not a proportion measure in the same sense; it does not represent the proportion of variation in the dependent variable accounted for by the predictors. It's a little tricky to interpret, but that's what it is. So we'll mainly just note that our current model exhibits a significant improvement in fit over a null model and move on to the individual predictors.

You'll notice we have political interest, dogmatism, and gender as predictor variables, and in the table we have our regression coefficients, their standard errors, and z values (computed as the ratio of each regression coefficient to its standard error), along with our p-values. We interpret these pretty much the same way we do in standard least-squares regression: a p-value at or below 0.05 would be judged statistically significant for the predictor, while a value greater than 0.05 would indicate that the predictor is not significant in the model.

Just to unpack things a little, keep in mind that in least-squares regression the regression coefficient can be interpreted as the amount of change in the dependent variable for a one-unit increase on the predictor variable; in other words, the unstandardized coefficient captures the predicted change in raw-score units on the dependent variable for every one raw-score unit increase on the predictor. In the context of logistic regression, the interpretation follows the same basic idea, but what we're looking at is the predicted change in something called the log odds, which pertains to the predicted probability of membership in the target group. We are essentially trying to model the predicted likelihood, or probability, of falling into the target group, and to capture that we form a ratio of two probabilities: the probability of A, membership in the target group, over the probability of B, membership in the non-target group. This ratio is what is referred to as the odds. If you ever hear the term odds, it is simply a ratio of probabilities, the probability that one event will occur over the probability that another event will occur, assuming the two events are mutually exclusive.
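To make these quantities concrete, here is the standard algebra linking probability, odds, and log odds (textbook background, not something derived on screen in the video):

\[
\text{odds} = \frac{P(A)}{P(B)} = \frac{p}{1 - p},
\qquad
\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
\]

Here p is the probability of membership in the target group (intending to donate). Each coefficient \(\beta_j\) is the change in the log odds for a one-unit increase in \(x_j\), and \(e^{\beta_j}\) is the corresponding odds ratio.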
In logistic regression we only have two groups, so A is literally the probability of falling into the target category, in our case the intention-to-donate category, whereas B is the probability of falling into the non-target category, the intention-not-to-donate group. Essentially, we take our odds and then obtain the natural log of those odds, so what the coefficient is really reflecting is the predicted change in the log odds for every one-unit increase on the predictor variable.

Now, you may be asking yourself: why all of this? Why not just model the predictive relationship between the predictors and the predicted probabilities directly? The reason is that the relationship between our predictors and the probability values is nonlinear. This is where we introduce the concept of the logistic curve when looking at the probabilities of falling into our two groups. Because the relationship between our predictors and our dependent variable is nonlinear, we model the relationship between predictors and outcome through this process of converting our probabilities to odds and then to log odds. That's all that's happening; it may be kind of confusing, but keep in mind that's essentially what's going on.

I do want to note, too, that when we look at odds, if the probability of the target event equals the probability of the non-target event, we would expect the odds to equal one. If the probability of the target event is greater than the probability of the non-target event, we would be looking at odds greater than one, and if the probability of the target event is less than the probability of the non-target event, we would be looking at odds less than one. We want to keep that in mind as we move forward.

At any rate, even though we're talking about these coefficients in terms of log odds, you can still loosely think of them as capturing the relationship between the variables and the predicted probability of target-group membership: if the odds are greater than one, that indicates a greater likelihood of event A, membership in the target group as opposed to the non-target group, and a positive coefficient is telling us much the same thing. So we can say that political interest is positively related to the likelihood of falling into the intention-to-donate group. At higher levels of political interest you would expect a greater likelihood that a person falls into the intention-to-donate group, whereas at lower levels of political interest you would expect a lower probability of falling into that group. Dogmatism also has a positive coefficient, indicating that folks higher in dogmatism would be more likely to express an intention to donate, whereas those lower in dogmatism would be less likely to do so. Nevertheless, the first predictor was significant and the second was not, if we use a two-tailed criterion.
The fact is, I kind of expected dogmatism to be positive anyway, so if I were adopting a one-tailed criterion, it would be considered statistically significant. As for the negative coefficient right here, remember our coding: 0 for male, 1 for female. The negative coefficient indicates that the probability of falling into the intention-to-donate group was higher among males than among females, but that difference was not statistically significant in the model. So as you look at these coefficients, think about it this way: a positive coefficient indicates a positive relationship between the predictor and the likelihood of falling into the target group, while a negative value indicates that at higher levels of the predictor the likelihood of falling into the target group is lower. As I said, we're not technically expressing that likelihood in the form of probabilities; rather, we're capturing it through the vehicle of log odds.

Moving on, let's say we wish to look at the odds ratios rather than the log odds. We could do that in a couple of ways, but going back through the original menu option, we can click on Report odds ratios. You can see right here in our output that we now have odds ratios. Again, when we think about odds, we're looking at a ratio of probabilities: the probability of falling into the target group over the probability of falling into the non-target group. The odds ratio you see here, though, reflects the change in the odds for every one-unit increase on the predictor variable, so it is actually a ratio of odds. For instance, if we started at the mean on political interest, we would expect a certain odds at the mean; if we increased political interest by one unit, we would expect the odds to be multiplied by this factor to produce the odds at one unit above the mean. So the odds ratio reflects the change in the odds for every increment on the predictor variable.

You'll also notice that in our original table we have the coefficients and a confidence interval. The null hypothesis being tested is pretty much the same as in the context of least-squares regression: that the regression coefficient in the population is equal to zero. So with the 95% confidence interval around the original regression coefficients, we look to see whether the null value of 0 falls within the interval or outside it. You can see that zero falls outside this first interval, down on this end, whereas zero falls within this interval here and within this interval here. So if we were assuming a two-tailed test for each of those predictors, only one of them would have been significant, that being political interest.
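In syntax, reporting odds ratios just means adding the or option to the same command (a sketch, again using the assumed placeholder variable names):

    * Same model, reporting odds ratios (exponentiated coefficients)
    logit donate polint dogmatism gender, or

Stata's logistic command fits the same model and reports odds ratios by default.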
We also have a confidence interval for the odds ratio. There, the null hypothesis is that the odds ratio is equal to 1, so if 1 falls outside the interval, we reject the null; if it falls between the lower and upper bounds of the confidence interval, we maintain it. You can see that 1 falls between these two values here and between these two values here, so again that indicates we would maintain the null with respect to the dogmatism and gender variables, again assuming a two-tailed test.

We also have some postestimation options. If we go to Postestimation and click on Specification, diagnostic, and goodness-of-fit analysis, we can start by looking at a goodness-of-fit test. I'm going to double-click on this and go to the Hosmer-Lemeshow goodness-of-fit test. This is another sort of global measure of fit, a chi-square test; it has drawn a fair amount of criticism, but it's still information you could perhaps use. I'm going to click on that, and what we have right here is the Hosmer-Lemeshow test with its chi-square statistic and p-value. Now, unlike the likelihood ratio chi-square test above, where we were looking for significance as an indicator of good model fit relative to the baseline model, here we are looking for a non-significant chi-square test: non-significance on the Hosmer-Lemeshow test is an indicator of good model fit. In this case the chi-square value is 8.11 and the p-value is 0.4229, indicating good model fit to the data.

Another bit of information you might ask for is the classification table, with sensitivity statistics and so forth. I'm going to click on that, and what you'll see right here is a little table giving you classification results. What we're doing in our model is generating predicted probabilities of group membership, and based on those predicted probabilities we can generate a prediction as to whether a person falls into group 0, the intend-not-to-donate group, or group 1, the intend-to-donate group. The classification table is essentially looking at the correspondence between observed group membership and the group membership predicted by the model. You can see right here that we have True and Classified, reflecting the observed group memberships and the memberships based on the prediction model. In this case, 41 cases were observed to fall into the intention-to-donate group, whereas 134 cases fell into the do-not-intend-to-donate group. Of the 41 who expressed an intention to donate, 12 were correctly predicted by the model to fall into that category; the "positive" values reflect the group coded 1, the intention-to-donate group. When we look at the accuracy rate for that group, it's about 29 percent, which is pretty darn low: the model is not doing a very good job of predicting which individuals would express an intention to donate. Over here, 129 of the 134 who expressed an intention not to donate were correctly classified by the model, an accuracy rate of about 96 percent for that group. The overall accuracy rate was roughly 81 percent, but we do a really horrible job of predicting who expresses an intention to donate compared with predicting those who express an intention not to donate. So you can use this as some more information when it comes to making a judgment about the fit of the model.
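These postestimation results are also available from the command line after fitting the model, using the commands the video walks through in a moment:

    * Hosmer-Lemeshow goodness-of-fit test, using 10 groups (deciles of risk)
    estat gof, group(10)

    * Classification table with sensitivity and specificity
    estat classification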
I do want to note, too, that you can also use syntax to run these models, and the nice thing about Stata is that if you run your models through the menu options, you actually get the syntax for each of them. You can see right here it says logit and then the names of the variables; that is the command for generating a logistic regression. If you start to learn some of the syntax, it can be a little quicker and offer a little more flexibility when it comes to testing models than running everything through the menus. Just to highlight this, I'm going to copy this, put it down in the command line, and press Enter, and there you go: there's the analysis. Basically you're following the same general idea as you would with least-squares regression: you have the regression command, which in this case is logit, followed by the dependent variable name and then the names of the predictors. If you want the odds ratios, it just involves modifying that statement with a comma and "or", so I'll paste that down here, and there are our odds ratios. To get the Hosmer-Lemeshow goodness-of-fit test, you would type estat gof followed by a comma and then group(10), and to get the classification table it's just estat classification. That's the nice thing about knowing a little of the syntax: you can do things a little quicker than running everything through the menus.

I do want to show you another example, using a do-file that I've created. I've got the syntax for essentially what we just generated, all of this right here, minus the odds ratios. Let's say that I wish to test the predictors in a hierarchical sense, a kind of hierarchical logistic regression, where I have a set of predictor variables in one model and then add in another predictor, or multiple predictors, in a separate model. What I'm doing right here is I have the basic model with everything, and then I have a second model where I add in CFC as an additional predictor. So in the first model we don't have CFC and in the second one we do, and we're going to see whether there is a significant improvement in fit as a result of adding CFC. In this case we use the estimates command: estimates store, followed by a name, stores the chi-square value and the estimates from the first model, and then I do the same thing for the second model. The full batch is sketched below.
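A sketch of that batch; the stored-estimate names m1 and m2 and the variable names are placeholders standing in for whatever appears in the actual do-file:

    * Model 1: baseline predictors
    logit donate polint dogmatism gender
    estat classification
    estat gof, group(10)
    estimates store m1

    * Model 2: add CFC as an additional predictor
    logit donate polint dogmatism gender cfc
    estat classification
    estat gof, group(10)
    estimates store m2

    * Likelihood ratio test comparing the two nested models
    lrtest m1 m2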
I'm going to run this as a batch instead of running each piece separately. Once I've stored the estimates from both models, I can use lrtest, the likelihood ratio test, to test the difference between model 1 and model 2 and see whether there is value added in including the extra predictor when working in this hierarchical way. So I'm going to highlight all of this and click on the little button for Execute selection, and now I've run the analysis.

You can see right here our original model, which we've already done, but I've also generated the classification statistics, the Hosmer-Lemeshow test, and estimates store, which stored the estimates from model 1. Then we have the next model, where we've added CFC. You can see that CFC in this case actually has a negative coefficient and is not statistically significant, so adding it in didn't really help us out all that much. We have the classification statistics related to that model, and you can see essentially not much of a gain in predicting who was going to express an intention to donate, so really not much change in the overall classification accuracy. There's the goodness-of-fit test, and then right down here is the likelihood ratio test comparing model 1 to model 2. You can see there was no significant difference between the models, so adding the additional variable in model 2 did not yield a significant improvement in fit.

So that's a quick and dirty overview of binary logistic regression using Stata. There are other options available; if you want to do things like use robust standard errors or bootstrapping, you can certainly do that. This was, like I said, just a quick overview, so I hope that you find this helpful.
Info
Channel: Mike Crowson
Views: 75,545
Rating: 4.8819189 out of 5
Keywords: Stata, binary logistic regression
Id: Nbffvey80bU
Length: 28min 11sec (1691 seconds)
Published: Mon Mar 26 2018