Logistic Regression with Stata

Captions
Greetings, students. Today I'm going to spend a little time walking you through Chapter 10, on logistic regression. At this point in the class you have the skills to analyze different situations: you can analyze two categorical variables using the tabulate command, and you can analyze interval-level dependent variables with the regress command. Today we're going to talk about something a little different: logistic regression. The first thing you need to do is open up GSS 2012. I have my handy little do-file over here, and I'm just going to take a couple of minutes to explain logistic regression.

Logistic regression is designed to analyze relationships between independent variables and a binary dependent variable. Up to this point we've been talking a lot about linearity: what predicts, say, the Obama thermometer (how people feel about President Obama on a scale of 0 to 100), or the Gini coefficient (inequality on a scale of 0 to 100). These are continuous variables, so we assume linearity between our independent and our dependent variable. With logistic regression we can't assume linearity, at least not in the same way. The dependent variable is the same as a dummy variable: it's binary. We look at things like voting: you either voted or you didn't. You support gay marriage or you don't, you live in the South or you don't, you're married or you're not. You can't kind of vote, you can't kind of live in the South, you can't kind of be married. So we code these variables the same way we code dummy variables, as ones and zeros.

A common dependent variable in the analysis of political behavior is whether people voted in an election, so in this first step we're going to look at voting in 2008. The first thing I'll show you is the codebook command, which will show us how voting in 2008 is coded, and
as you can see, it's a 0 and a 1: you either voted or you didn't, or your value is missing, in which case those respondents will drop out of the model. So it's ones and zeros.

The other thing we're going to use is an independent variable that is often linked to turnout: education. In the GSS, education ranges from 0 to 20; 0 is no formal schooling and 20 is 20 years of education. I'll show you that one too. Here you can see our range is 0 to 20, we have 21 unique values, and our mean education is 13.5, so that would be graduating high school plus a year and a half of college.

We're going to model the probability of someone voting, with years of education as our independent variable. We would expect a positive relationship between the independent variable and the dependent variable: as education rises, the likelihood of the individual voting also rises. Unlike ordinary least squares, we cannot assume that a one-year change in education is associated with a consistent increase in the probability of voting. We can't assume that the effect of one unit of change in education on somebody at the low end of education is the same as it is for somebody with a high level of education. You'll see what I'm talking about better later.

What we're looking at is the logged odds of voting as our dependent variable, because that's what logistic regression computes. The constant is represented by a, and b is the coefficient that represents the change in the log odds of voting for each year of education. And I know what you're thinking: what are logged odds? Unfortunately, logistic regression is harder to interpret than linear regression. With linear regression we have our constant and our coefficients, which represent changes in the actual dependent variable. In logistic regression the constant will tell you the logged odds of voting when
education is zero, and the coefficient will estimate the change in the log odds for each unit change in education. The problem with that is that no one understands what that means. Your mother asks you what the impact of education is on the probability of voting, and you can't say, "Well, it's an increase in the logged odds of blank." That doesn't make any sense. So we have different ways of making sense of this.

Now we're going to talk about the logit command and the logistic command. The logit command will display an estimate for the constant a and the coefficient b, both expressed in terms of the log odds of the dependent variable. The logistic command will run the same analysis, but it will express the coefficients as odds ratios. I will show you this now.

Here is the logit command for our model predicting the likelihood of voting. You can see here's a coefficient; you've got a z-score, which is a lot like a t-score; the p-value; and the pseudo R-squared, which is like an R-squared. It looks a lot like what you get when you run linear regression. I'll run the other one here, logistic. This one spits out something a little different: where we had a coefficient, here we have an odds ratio.

Let's take a look at the logit model first. The constant is the estimated log odds of voting for people with no education, so that's what this number represents, and the log odds of voting increase by 0.226 for each one-year increase in education, right here. As expected, as the independent variable increases, so does our likelihood of voting. We can tell that because we've got a positive coefficient and a p-value less than 0.05, so that part of the interpretation is the same as OLS. However, we don't really understand logged odds, so we're going to take a look at the other regression, which gives us odds ratios. Notice we now have odds
ratios where the coefficients were. The odds ratio tells you by how much the odds of the dependent variable change for each unit change in the independent variable. An odds ratio of less than 1 says that the odds decrease as the independent variable increases, so we'd have an inverse relationship. An odds ratio equal to 1 says there's no relationship. An odds ratio greater than 1 says the odds of the dependent variable occurring increase as the independent variable increases, so there's a positive relationship. Again, we can see here that we have a statistically significant p-value and a pretty large z-score, which is like a t-score, and we have an odds ratio of 1.25. Because it's above 1, we have a positive relationship between years of education and the probability of voting in 2008.

In our model the odds ratio is 1.25, which means that a respondent at a given level of education has odds of having voted 1.25 times those of a respondent at the next lower level of education. Odds ratios make things a little more understandable: they tell you the percentage change in the odds with each unit of change in the independent variable. In our model, a one-year increase in education increases the odds of voting by 25 percent. Remember, if this were a 1 we'd have no relationship, so you're really looking at how far away from 1 it is. If it were under 1 we would have a negative relationship. Since it's over 1, the odds increase by 25 percent for each year of education; that's the 0.25.

The next thing the book wanted you to look at was the iterations, so that's up here. Logistic regression uses maximum likelihood estimation, MLE. What Stata will do is first try to predict the observed values of the dependent variable without using any independent variables as predictors. Basically, Stata is going to take a know-nothing guess, and it's
going to ignore whatever you think goes into the probability of whatever it is you're trying to predict, in this case voting. Then it's going to bring in the independent variable and run multiple analyses to determine the best predictive fit. Iteration 0 is the prediction based on just guessing, and we want to get far away from that number and closer to 0. As you can see, we didn't get very far before it was spitting out the same number, and we're not very different from just random guessing, which would lead you to conclude that you're missing something in your model. Obviously we only have one independent variable in this model; there may be other things we would want to control for to predict whether a person votes or not.

Another thing I need to draw your attention to is the chi-square over here. We have a Wald chi-square, which tells us whether our model is significantly better than the know-nothing model, the model in iteration 0 where we're just guessing. In this case the p-value for the model is displayed as zero, as you can see here, so we can conclude that including education as a predictor significantly enhances the predictive performance of our model. And here we have our pseudo R-squared, which is a lot like the R-squared in OLS: it's supposed to let you know how well your model explains variance in the dependent variable. As you can see, we don't have a very high pseudo R-squared, but we only have one independent variable in our model, so that will happen.

Speaking of only having one independent variable, as we move on we're going to do logistic regression with multiple independent variables. Other things may play a role in the probability that a person votes, and logistic regression, much like linear regression, can include control variables. I will show you that right now. Okay, so controlling for education, each additional year
of age increases the odds of voting by 4 percent. We can see that right here: in this one, with the odds ratios, we're at 1.04, which rounds to 4 percent. The model is statistically significant, so compared to knowing nothing and just guessing at our dependent variable, our model significantly improves our ability to predict the likelihood of voting. You can see education is statistically significant and positive, age is statistically significant and positive, and our pseudo R-squared went up quite a bit; it pretty much doubled. So we're doing a better job of predicting voting in the 2008 election with this model.

Still, the percentage increase from the odds ratios is not ideal for understanding what's going on with this relationship. What we can do is use predicted probabilities to answer questions like: controlling for age, what is the effect of a one-year increase in education on the probability of voting? Logistic regression alone will provide an inconvenient answer: it depends. In logistic regression the effect of each independent variable on the probability of the dependent variable depends on the values of the other predictors in the model. For example, an additional year of education is going to have a different impact on the probability of voting for someone who is 18 than for someone who is 45, and we'll see that later.

We can hold specific independent variables constant by using the margins command. We run our model right here and then we use the margins command, and this will give us predicted probabilities at education ranging from 0 to 20 in increments of 1, holding every other variable in the model at its mean; here, that holds age at the mean. So here we go. There's our model again, same thing, and once we get past some of this, it gives us predicted probabilities,
and so this is every year of education, and this is the probability of voting for every year of education. Age is held at the mean, which in our GSS sample is almost 48 years old. So we can see what effect an increase of one year of education has on the probability of voting for someone who is roughly 48 years old. Notice the increases in the probability: you have a 7 percent chance of voting, then 9 percent, then 11, then 14, and so on. The probability of voting changes with each year of education, but the difference between each year is not consistent, so we're not looking at a linear relationship. For example, here's a pretty good jump: somebody with nine years of education is 40 percent likely to turn out to vote; with a one-year change, at ten years of education, you're at 47 percent. That's a good seven-point difference. But down here, well, I'm not 47 years old, but I do have 20 years of education, so assuming I were 48, when I go from my 20 years of education to 21 I go from 93 percent likely to vote to 94, which isn't a very big change at all. So education makes a smaller difference at the top and the bottom of the scale.

Now, the next thing we're going to do is margins with the over option. Oh no, wait, we're going to take a look at this first. We can plot these predicted probabilities with the marginsplot command here, and this is in your book. I will run this, and hopefully it works. There we go. This shows the probabilities we just looked at for each year of education, holding age at the mean. As you can see, we never get below 0, since your probability of voting can't be negative, and we never get fully to 1, but we do have this S-curve, and that's what you end up getting when you're looking at probabilities in logistic regression. You can see that
they put a dotted line here through 50 percent, 0.5, so that you can see where on education a person becomes more likely to vote than not.

Next we have the margins command with the over option. The over option will run the margins command separately, once for each value of a categorical variable specified in the option. What we're going to do is generate a new variable, because we want to look at two different age groups and the probability of voting as education increases. So let's do that first. We're going to run our logistic regression again, and then we're going to create a variable that identifies people under 30 and people over 60, to see whether the probabilities are different for people with a certain level of education who are under 30 than for those who are over 60. This part says age greater than or equal to 60 and age not equal to missing, so we're going to cut out the missing data. Here we go: we created our new variable, and now we're going to use the margins command again, with the over option, to figure out how the probability of voting differs between these two groups.

Okay, this is going to spit out a lot of stuff, so you have to hit "more" a few times. What you see here is similar to what we just had; however, you get two values for every year of education. So you have one year of education for people under 30 and one year of education for people over 60, and as you can see, people over 60 are much more likely to vote with one year of education than people under 30 with one year of education. As we scroll down you can see the same kind of pattern: we get the probability for people under 30 alongside the probability for people over 60 with the same amount of education, and you can see the differences between the two. So if you wanted
to look at the probability of voting in the 2008 election for somebody with 20 years of education who's under 30, somebody like myself, it would be about 0.84, but if I were over 60 with the same amount of education I would be at 0.97. It's kind of common knowledge in politics that young people are less likely to vote than old people, and even controlling for education we see that.

The next thing we want to do is take a look at this visually, so we can use marginsplot with the over option. This code is also in your book. Be careful copying it, because you know how it is: if you get one thing wrong, it will yell at you. All right, sweet. I made it disappear; there it is. As you can see, we get two lines in this plot, and they both kind of have that S-curve again, never going above 1 and never going below 0. You can see our age group 18 to 30. Obviously it's 18 to 30 because you can't vote until you're 18, and I don't think there's anybody sampled in the GSS who's under 18. So we have 18-to-30-year-olds, and this is their probability of voting by education, and then we have our group of 60 and older, and this is their probability of voting at certain levels of education. You can see that at any given point they're always more likely to vote than the younger group. In fact, at about five years of education or so, maybe six, you hit 50 percent: you become more likely to vote than not when you're over 60 and you have about six years of education. At 18 to 30, you don't become more likely to vote than not until you have about twelve or thirteen years of education, so graduating high school.

Okay, so the last part we look at is kind of a hybrid of the first two things we looked at. We have seen that we can use the atmeans option to hold our
control variables at their means. That's when we were holding age at 47.888-something, and then we allowed the independent variable, our education, to vary. We have also seen that we can use the over option to specify a range of interesting values; that's what we just did when we compared the predicted probability of voting, taking education into account, for 18-to-30-year-olds and people 60 and over. Now I'm going to show you pretty quickly that we can do both of these things together.

Sometimes you want to hold variables at a certain value. If I wanted to look at predicted probabilities of voting and I had gender in my model, I wouldn't want Stata to hold gender at its mean, because that would be something like 0.5, and that doesn't mean anything: you're male or female. Same thing with, like earlier, being married. You don't want certain things to be held at the mean, because you get something that doesn't make any sense in the real world. You're not 0.5 of a man, and you're not 0.5 married. So we can hold certain things at the average, and we can hold other things at values we choose because we find them interesting. If I want to compare older women with younger women, I can hold the gender value at 1 and compare older and younger women on their probability of voting, or whatever we're trying to model.

So let's go ahead and run this. In this model we've decided to take into consideration not just education and age, like we did before, but also income (I think this is a six-category variable, and that's why it's "06") and partisanship, which takes the value of 1 if the
respondent identifies as a partisan, so a strong Democrat or a strong Republican. These are highly partisan people. We can see in our model here that income is not statistically significant, but partisanship is: if you're more partisan, you're more likely to vote, which is not real surprising. Also, if you're older you're more likely to vote, and if you have more education you're more likely to vote. These are all positive, as we would expect, and statistically significant. The pseudo R-squared has now gone up to 0.17. And the probability that we could predict the likelihood of voting better by random chance than with this model? Well, we can't, so this model is better than random chance.

All right, now they wanted us to take a look at some of the variables that we just brought in. You can see here that we have our weight, and then our means for these variables, our mins, and our maxes. We're going to use the margins command to compare these groups at certain values. As you can see, there's #delimit; make sure you put the semicolon at the beginning and the end. Then margins, if the age variable that we created is not missing (the exclamation point means "not"; it doesn't mean you're excited). Then we're using the over option again with the age variable, at education years equal to 0 through 20, so the full range of our scale in one-year increments, and we're going to hold income and partisanship at their means. We can set these variables to anything: if we wanted to look at people with high incomes, we could pick a high income category; if we wanted to hold people at a different level of partisanship, we could just put a different value in here. It's whatever we want to hold it at. By the way, as a note, I don't know what the "06" means, because this mean is obviously higher than six, which means our variable is not a six-point scale like I had figured.
So let's run this and take a look. Okay, it's going to spit out a lot of this jargon where it's telling you what it's doing, so we have to click "more" a few times. There we go. Now we're getting the probability of voting by age and education, just like we did before, except now we're holding income and partisanship at a specific value. You get the same kind of output, and you can see that the changes are different at different points on the scale. Up here, holding income and partisanship at their means, a person under 30 with 20 years of education has a value of 0.84 for the probability of voting: you are 84 percent likely to vote. The same person with the same income, the same partisanship, and the same years of education, but over 60, is more likely to vote, with a value of 0.96. So you can see that the changes are different in different parts of the scale.

These margins commands make what logistic regression spits out at you much easier to understand. Instead of listing off a coefficient and telling somebody, "Well, this has a positive 0.23 impact on the logged odds of voting," you can actually tell people what the probability is that a person with certain characteristics will turn out to vote. This is the final thing the chapter wanted us to do. I hope this was helpful, and this concludes the walkthrough of Chapter 10 on logistic regression.
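
As a rough check on the arithmetic in the walkthrough, the Python sketch below converts the reported logit coefficient (0.226) into an odds ratio and shows why one fixed odds ratio moves the probability by different amounts depending on the starting point. The function names and the three starting probabilities are illustrative assumptions chosen for this example; they are not Stata output from the video.

```python
import math


def odds_ratio(log_odds_coef):
    """Convert a logit coefficient (change in log odds) to an odds ratio."""
    return math.exp(log_odds_coef)


def shift_probability(p, ratio):
    """Multiply the odds implied by probability p by ratio; return the new probability."""
    odds = p / (1 - p)
    new_odds = odds * ratio
    return new_odds / (1 + new_odds)


# Coefficient on education reported for the one-variable logit model.
b = 0.226
ratio = odds_ratio(b)
print(f"odds ratio for b = {b}: {ratio:.3f}")

# Hypothetical starting probabilities (low, middle, high) used only to
# illustrate that the same odds ratio produces unequal probability changes.
for p in (0.10, 0.40, 0.93):
    new_p = shift_probability(p, ratio)
    print(f"start {p:.2f} -> after one more year {new_p:.3f} (change {new_p - p:+.3f})")
```

The odds are multiplied by the same factor each time, but the change in probability is largest near the middle of the scale and smaller near 0 and 1, which is the S-curve pattern described for the margins output.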
Info
Channel: Dominic Wells
Views: 89,761
Rating: 4.8486485 out of 5
Keywords: Logistic Regression, Regression Analysis, Stata (Software)
Id: jiRUQT7imAE
Length: 39min 3sec (2343 seconds)
Published: Sun Nov 22 2015