14. Multivariate Models and Confounding - Stata

Captions
Remember that the sample studies and data management techniques used throughout this course are meant to help inform your own research. Apply the principles and techniques of data management, testing, and analysis you see in these examples to your own research project.

Suppose we have carried out an observational study of smoking cessation methods and are trying to determine which method works best: drugs to alleviate nicotine addiction, therapy, combined drugs and therapy, or simply quitting. The explanatory variable is the method, while the response variable is eventual success or failure in quitting. Our study shows that the percentage succeeding with the combination of drugs and therapy was highest, while the percentage succeeding with neither therapy nor drugs was lowest. There is clear evidence of an association between the method used and the success rate. Can we conclude that the combination of drugs and therapy causes success more than using neither? It is at precisely this point that we confront the underlying weakness of most observational studies: some members of the sample opted for certain values of the explanatory variable (method of quitting) while others opted for other values, and those individuals may differ in additional ways that also play a role in the response of interest.

For instance, suppose older people are more likely to choose certain methods to quit, and suppose older people in general tend to be more successful in quitting than younger people. The data would make it appear that the method itself was responsible for success, whereas in truth it may simply be that being older is the reason for success. We can express this scenario in terms of the key variables involved: in addition to the explanatory variable (method) and the response variable (success or failure), a third, lurking variable (age) is tied in, or confounded, with the explanatory variable's values and may itself cause the response to be success or failure. We could control for the lurking variable age by studying older and younger adults separately. Then, if both older and younger adults who chose one method have higher success rates than those opting for another method, we would be closer to producing evidence of causation. The diagram demonstrates how straightforward it is to control for the lurking variable age by modifying your study design.

Notice that we did not claim that controlling for age would allow us to make a definite claim of causation, only that we would be closer to establishing a causal connection. This is because other lurking variables may also be involved, such as the level of the participant's desire to quit. Specifically, those who have chosen the drug and therapy method may already be the ones most determined to succeed, while those who have chosen to quit without investing in drugs or therapy may from the outset be less committed. To attempt to control for this lurking variable, we could interview the individuals at the outset and rate their desire to quit on a scale from one to five, with one the weakest and five the strongest, and then study the relationship between method and success separately for each of the five groups. But desire to quit is a very subjective thing, difficult to assign a specific number to, and realistically we may be unable to control for it effectively. And who is to say that age and desire to quit are the only lurking variables involved?
There may be other subtle differences among individuals who choose one of the four methods to quit smoking, and researchers may fail to conceive of these subtle differences as they attempt to control for possible lurking variables. For example, smokers who opt to quit using neither therapy nor drugs may tend to be in a lower income bracket than those who opt for drugs and/or therapy, because they cannot afford those methods. Perhaps smokers in a lower income bracket also tend to be less successful in quitting because more of their family members and co-workers smoke. Thus socioeconomic status is yet another possible lurking variable in the relationship between cessation method and success rate.

It is because of this virtually unlimited number of potential lurking variables that we can never be 100 percent certain of a claim of causation based on an observational study: observational studies cannot prove causality. On the other hand, observational studies are an extremely common tool used by researchers to attempt to draw conclusions about causal connections. To do this, great care must be taken to control for the most likely lurking variables; only then can researchers assert that an observational study may suggest a causal relationship.

So far we have discussed different ways in which data can be used to explore the relationship, or association, between two variables. When we explore the relationship between two variables, there is often a temptation to conclude from the observed association that changes in the explanatory variable cause changes in the response variable; in other words, you might be tempted to interpret the observed association as causation. The purpose of this part of the course is to convince you that this kind of interpretation is often wrong. The motto of this section is one of the most fundamental principles of this course: association does not imply causation.

For example, house fires and wildfires cause substantial damage when they occur. What variables might affect the extent of this damage? A scatter plot of the number of firefighters sent to fires (x-axis) against the amount of damage caused by fires (y-axis) in a certain city displays a fairly strong, slightly curved, positive relationship between the two variables. Would it then be reasonable to conclude that sending more firefighters to a fire causes more damage, or that the city should send fewer firefighters in order to decrease the amount of damage? Of course not. So what is going on here? There is a third variable in the background, the seriousness of the fire, that is responsible for the observed relationship: more serious fires require more firefighters and also cause more damage.

In statistics, a confounding variable (also known as a confounding factor, lurking variable, confound, or confounder) is an extraneous variable that is associated, positively or negatively, with both the explanatory variable and the response variable. We need to control for these factors to avoid incorrectly concluding that the response variable is associated with the explanatory variable. Confounding is a major threat to the validity of inferences made about statistical associations. In the case of a confounding variable, the observed association with the response variable should be attributed to the confounder rather than to the explanatory variable.
In science, we test for confounders by including these third (or fourth, or fifth, or sixth) variables that may explain the association of interest in our statistical models. In other words, we want to demonstrate that our association of interest is significant even after controlling for potential confounders. Because adding potential confounding variables to our statistical model can help us gain a deeper understanding of the relationship between variables, or lead us to rethink an association, it is important to learn about statistical tools that allow us to examine multiple variables simultaneously, that is, to look at more than two or three variables at the same time.

The general purpose of multivariate modeling techniques, such as multiple regression and logistic regression, is to learn more about the relationship between several explanatory variables and one response variable. These regression procedures are very widely used in research. In general, they allow us to ask, and hopefully answer, two questions: what is the best predictor of my response variable, and does variable A or variable B confound the relationship between my explanatory variable of interest and my response variable? For example, educational researchers might want to learn about the best predictors of success in high school; sociologists may want to find out which of multiple social indicators best predict whether a new immigrant group will adapt to its new country of residence; biologists may want to find out which factors, such as temperature, barometric pressure, or humidity, best predict caterpillar reproduction.

So how can multivariate models help us evaluate the presence or absence of confounding or lurking variables? Since the difficulty arises because the lurking variable's values are tied in with those of the explanatory variable, one way to attempt to unravel the true nature of the relationship between explanatory and response variables is to separate out the effects of the lurking variable. You may have already identified a significant relationship between your explanatory and response variables and now want to think about whether this is a real relationship, or whether the relationship is instead confounded by one or more lurking variables.

For example, consider the graphical association between birth order and the number of cases of Down syndrome per 100,000 live births. It looks like a linear association: the first-born in a family has the lowest likelihood of having Down syndrome, and with later birth order, up to a fifth-born child, there is increased risk of being born with Down syndrome. This is a statistically significant association when analyzed via a chi-square test of independence, with birth order as the categorical explanatory variable and the presence or absence of Down syndrome as the two-level categorical response variable. Another statistically significant relationship is the association between maternal age at a child's birth and the likelihood that the child will have Down syndrome: babies of younger mothers, up to about ages 29 or 30 to 34, have quite low rates of Down syndrome, while among mothers aged 35 to 39 and older the rates are clearly higher. Remember, in the case of a confounding variable, the observed association with the response variable should be attributed to the confounder rather than the explanatory variable; we test for confounders by including these third (or fourth, or fifth) variables that may explain the association of interest in our statistical models.
In these examples, it is possible that the association between a child's birth order and risk for Down syndrome is confounded by maternal age; alternatively, the association between maternal age and Down syndrome might be confounded by birth order; or both birth order and maternal age might independently predict the likelihood of a diagnosis of Down syndrome after controlling for each other.

A graph answers this question by showing that maternal age confounds the relationship between birth rank and Down syndrome, and that it is really maternal age, rather than birth rank, that is associated with Down syndrome. Birth order runs along the horizontal axis, the maternal age groups run along the z-axis, and the y-axis shows cases of Down syndrome per 100,000 live births. If we look across birth order separately for each maternal age group, we see that there really is no difference in rates of Down syndrome by birth order. In other words, once we control for the age of the mother, that is, examine the rates of Down syndrome across birth order one maternal age group at a time, there is no association between birth order and Down syndrome. If we instead look at rates of Down syndrome across maternal age for each individual birth order, we see an upward trend as maternal age increases. This is a great graphical representation showing that it is not birth order that is associated with Down syndrome; it is maternal age. In other words, once we control for birth order, there is still an association between maternal age and Down syndrome: birth order does not confound the relationship between maternal age and Down syndrome, and the relationship holds even after controlling for birth order.

Here is another question about confounding. We start with a simple question about an association between an explanatory and a response variable: is the incidence of coronary heart disease greater among men who drink coffee than among men who do not? The response variable is coronary heart disease and the explanatory variable is a history of coffee drinking. If we find a significant association in the data, we will also want to evaluate whether there are other variables that might confound, or explain, the relationship. Some people who drink lots of coffee do so while also smoking cigarettes, so we would like to evaluate whether smoking is a confounder in the relationship between coffee drinking and coronary heart disease. The relationship between coffee drinking and coronary heart disease is the one we are testing, but we also want to partial out, or remove, smoking from that association; we want to see whether the relationship between coffee drinking and coronary heart disease is still significant after we account for smoking.

A Venn diagram illustrates how our multivariate models handle this question. Coronary heart disease is the response variable and coffee drinking is the explanatory variable; the basic question is whether we know something about the presence or absence of coronary heart disease by knowing the level of coffee drinking in our sample. But we may also know, or believe, that smoking is related to both coronary heart disease and coffee drinking, so we want to include smoking in the model as a possible confounder of the relationship between coffee drinking and coronary heart disease. Other terms used to describe a potentially confounding variable in a statistical model include control variable, covariate, third variable, and lurking variable.
When we look at the association between coffee drinking and coronary heart disease, the overlap you see in the Venn diagram is what we are testing: is that overlap significant, that is, are the two variables significantly associated? When we add the potential confounder, smoking, we are asking whether coffee drinking and coronary heart disease are still significantly associated after we partial out the part of their overlap that can be accounted for by smoking. Because smoking is associated with both coronary heart disease and coffee drinking, there is a portion of the association between coffee drinking and coronary heart disease that can be accounted for by smoking, the highlighted area in the Venn diagram. What we want to do is mathematically partial that out. When we run multivariate models, we are partialling out the portion of the association between the explanatory and response variables that can be accounted for by that overlap with the third variable.

For this course we will discuss only two types of multivariate models: multiple regression, where the response variable is quantitative, and logistic regression, where the response variable is binary, that is, a two-level categorical variable. The question of when a third (or fourth, or fifth) variable in our multivariate model is a confounder is strategically important. If the variable is a confounder, then when we include it in the statistical model the association of interest is no longer statistically significant, and we can conclude that our original variables had no real relationship. Testing for confounding variables with multivariate models is vital in testing for true, statistically significant associations, or real relationships, between variables in our research. If we had run a model with maternal age and birth order predicting Down syndrome, birth order would have been significantly associated with Down syndrome in that first model; once we added maternal age to the model as a potential confounder, the association between birth order and Down syndrome would no longer be significant.
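A hedged sketch of this kind of confounding check, using the logit command that is introduced later in this lecture. The variable names (downsyndrome, birthorder, matagegrp) are purely hypothetical, since the Down syndrome data are not one of the course data sets:

* Hypothetical sketch only: downsyndrome (0/1), birthorder (1-5),
* and matagegrp (maternal age group) are assumed variable names.
* Model 1: birth order alone appears significantly associated.
logit downsyndrome i.birthorder
* Model 2: add maternal age group as a potential confounder.
* If the birth order terms lose significance here, maternal age
* confounds the birth order / Down syndrome association.
logit downsyndrome i.birthorder i.matagegrp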
So we now know that we will use multiple regression to evaluate multiple explanatory variables and/or potential confounders when predicting a quantitative response variable. How does a linear regression analysis work? Let's start with a simple example. We impose our causal model on observational data by selecting our explanatory and response variables, denoted by x and y and placed on the x and y axes of a bivariate graph. Let's return to the graph we made earlier using the Gapminder data set, in which we visualized the association between the internet use rate in a country and the percent of its population that lives in an urban setting. As you will recall, we place the explanatory variable on the x-axis and the response variable on the y-axis, so our research question is: is the rate of urbanization associated with the rate of people who use the internet? We also ran a Pearson correlation and found a fairly strong, positive, and significant linear association between these two variables (r = 0.61).

To test this model, our first goal is to determine the equation of the best fit line, the line we drew on the graph showing the best linear fit between our two variables of interest. As you may recall from high school algebra, the equation of a line is usually written y = mx + b, where x and y are the variables on the respective axes, m is the slope of the line, and b is the y-intercept, the spot where the line crosses the y-axis. In our model, internet use rate is y, our response variable, and urban rate is x, our explanatory variable, so we need to determine the slope and the intercept in order to define this best fitting line. In statistics we use terminology that is a little different from y = mx + b: the symbols y and x remain the same, but we call the slope of variable x the beta-1 value, and the intercept of the model the beta-0 value. These beta values are often called coefficients. Together, beta-0 plus beta-1 times x make up what is called the linear component of the model. In addition, there will always be some error in our data, so we must include an error term in the model as well.

So how do we find the equation of this best fit line in Stata? The command we will be using is reg. As you can see from the sample syntax, after reg you type the response variable and then the explanatory variable. For this sample research question from the Gapminder data set, we type reg followed by internetuserate (our response variable) and urbanrate (our explanatory variable). Let's run this and look at the output. First you can see the number of observations that were used in the model. Next, the parameter estimates: here we have our estimates, also known as coefficients or beta weights, for both the intercept and the variable urbanrate. The beta-1 value here is 0.72 and the beta-0 value is -4.90, so the equation of the best fit line for this graph is: internet use rate = -4.90 + 0.72 × urban rate.

Before we analyze this equation in more depth, let's look at a few more components of the output. We also have a column labeled P>|t|, which gives the p-value for the explanatory variable's association with the response variable. The reg command also provides an R-squared value, which we talked about in the chapter on Pearson correlation; we do not need to calculate it by hand, as it is given directly to us. We now know that this model accounts for about 38 percent of the variability we see in our response variable, internet use rate.

Let's return to the equation for the line we generated. Look at how the equation is written: y is a function of the variable x and some constant, so as x changes, y changes with it. In building this model we are saying that we believe x relates to y in some meaningful way. What is exciting about this equation is that we can also use it to generate predicted values for y; the symbol we use for a predicted value of y is y-hat. For example, suppose we are told that a country is 80 percent urbanized: can we predict its level of internet use? Yes, we just plug the value 80 into our equation in place of x. In a country with 80 percent urbanization, we would expect 52.7 people out of every 100 to use the internet. Also note that beta-1 tells us by how much internet use would increase for every one-unit increase in urban rate: a country with 81 percent urbanization would be expected to have an internet use rate 0.72 people higher (almost one person) than a country with 80 percent urbanization. However, this is only the expected internet use rate given what we know about urbanization, the value that rests exactly on the best fit line. Unless our data were perfectly correlated, we would anticipate that our expected values and our observed values would differ from one another to some extent.
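A minimal sketch of this step, assuming the Gapminder variables are named internetuserate and urbanrate as in the spoken syntax above:

* Fit the simple linear regression: response variable first, then explanatory.
reg internetuserate urbanrate
* Predicted internet use rate for a country that is 80% urban,
* using the coefficients stored by the model just fit
* (this reproduces -4.90 + 0.72*80, about 52.7).
display _b[_cons] + _b[urbanrate]*80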
For example, Canada has an urban rate of about 80 percent; however, its internet use rate is observed at 81.3, not 52.7. This is exactly why we include an error term in our model: we are not perfect predictors of the future. What we can do with statistics, however, is identify trends in our data and use those trends to look at what we would expect the data to look like, and these trends are incredibly important.
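To see how far each country's observed value sits from the value expected on the line, Stata can compute fitted values and residuals after the regression. A minimal sketch; the new column names are arbitrary, and the existence of a string variable called country in the Gapminder file is an assumption:

* After running: reg internetuserate urbanrate
predict yhat                 // expected (fitted) internet use rate for each country
predict resid, residuals     // observed minus expected: the error for each country
list country internetuserate yhat resid if country=="Canada"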
This equation makes a lot of sense when we are working with a quantitative explanatory variable and a quantitative response variable, but what about a categorical explanatory variable and a quantitative response variable? It obviously would not make much sense, for example, to create a scatter plot using gender as the predictor variable; however, a regression model will still be informative. Let's look at output testing the linear relationship between depression and number of nicotine dependence symptoms, where major depression is a binary categorical explanatory variable and number of nicotine dependence symptoms, ranging from 0 to 7, is a quantitative response variable. Our research question is: is having major depression associated with an increased number of nicotine dependence symptoms?

First, though, we need to look at the data management required to create a new quantitative response variable, ndsymptoms. There are seven symptoms of nicotine dependence according to the NESARC codebook, and we need to data manage each of these nicotine dependence variables so that we can add them together to create a sum score indicating the severity of nicotine dependence. We do this in accordance with a common list of criteria for nicotine dependence based on the DSM (Diagnostic and Statistical Manual).

The first symptom is tolerance, which is measured by two variables in NESARC. We create a new variable, C2BCRIT1, by recoding S3AQ8B11 (1=1) (2=0) (9=.) and recoding S3AQ8B12 the same way, with the option gen() creating the new variable. This part of the code sets the answers for the two variables so that 1 equals yes, 2 equals no, and the answer 9 (unknown) is set to missing. The code then generates the new variable C2BCRIT1 and defines it as 1 if participants answered yes to either of the two tolerance variables and 0 if they answered no to both.

The second symptom is withdrawal, and it is derived from eight variables in NESARC. We first generate eight variables in Stata based on the NESARC withdrawal variables, using the same code format to set yes answers of 1 to 1, no answers of 2 to 0, and 9 (unknown) answers to missing. Then we generate the withdrawal variable by combining the eight recoded variables. egen stands for extended generate; we use this function to generate a withdrawal count variable by summing the eight recoded values, setting the count to missing when the source values are missing. The next portion of code recodes the pertinent NESARC variable so that 1 answers are yes, 2 answers are no, and 9 answers are missing, and then we generate the C2BCRIT2 variable. According to the diagnostic criteria, someone has withdrawal if they answered yes to at least four of the eight withdrawal variables, or if they answered yes to using tobacco in the last 12 months in order to avoid withdrawal symptoms, so we code these criteria into the newly created variable C2BCRIT2: if the sum of the withdrawal symptom variables is 4 or more, or the answer is yes to using tobacco in the last 12 months to avoid nicotine withdrawal symptoms, the new withdrawal variable is defined as 1; if the sum is less than 4 and the answer is no to using tobacco to avoid withdrawal symptoms, it is defined as 0.

For the variable asking whether someone has used more tobacco than intended in the last 12 months, C2BCRIT3 is assigned 1 for a yes answer and 0 for a no answer, with 9 answers set to missing, and then the C2BCRIT3 variable is generated. For the variables asking whether someone had attempted to cut down on tobacco use more than once but could not do it, and whether someone had wanted to stop or cut down on tobacco use in the last 12 months, answers are recoded as 1 for yes, 0 for no, and 9 set to missing; then the C2BCRIT4 variable is generated and defined as 1 if participants answered yes to either of the two desire/attempt-to-cut-down variables and 0 if they answered no to both. The next new variable is defined as 1 if the answer to the variable asking whether you have found yourself chain smoking in the last 12 months is yes, 0 if the answer is no, with 9 answers set to missing; then we create this new variable, C2BCRIT5.
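The captions read this data-management code aloud, so the exact statements are not fully legible. The following is a minimal reconstruction of the pattern described, not the course's exact code; the helper names (tol1, tol2, wd1 through wd8, usedtoavoid, withdrawalcount) are placeholders, and the NESARC source names follow the lecture:

* Criterion 1 (tolerance): yes to either of two NESARC tolerance items.
recode S3AQ8B11 (1=1) (2=0) (9=.), gen(tol1)
recode S3AQ8B12 (1=1) (2=0) (9=.), gen(tol2)
gen C2BCRIT1 = 1 if tol1==1 | tol2==1
replace C2BCRIT1 = 0 if tol1==0 & tol2==0
* Criterion 2 (withdrawal): sum eight recoded 0/1 withdrawal items
* (placeholders wd1-wd8), then apply the "4 or more symptoms, or used
* tobacco to avoid withdrawal" rule; usedtoavoid is a placeholder for
* that recoded 0/1 item.
egen withdrawalcount = rowtotal(wd1 wd2 wd3 wd4 wd5 wd6 wd7 wd8), missing
gen C2BCRIT2 = 1 if (withdrawalcount>=4 & withdrawalcount<.) | usedtoavoid==1
replace C2BCRIT2 = 0 if withdrawalcount<4 & usedtoavoid==0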
For the variables asking whether someone had reduced important activities, or activities of interest, because tobacco use was not permitted at the activity, yes answers are coded as 1, no answers as 0, and 9 (unknown) answers are set to missing; the new variable C2BCRIT6 is then generated and defined as 1 if participants answered yes to either of the two activity-reduction variables and 0 if they answered no to both. For the variables asking whether tobacco use was continued despite physical or psychological problems, yes answers are coded as 1, no answers as 0, and 9 (unknown) answers are set to missing; the final new variable, C2BCRIT7, is then generated. Data management defines C2BCRIT7 as 1 if the answer to continued use of tobacco despite knowledge of physical or psychological problems in the last 12 months is yes, and 0 if the answer to each variable was no.

Now that we have finished data managing each of the seven symptoms of nicotine dependence, we can create our variable ndsymptoms, which adds the totals and counts how many symptoms each participant reported:

egen ndsymptoms = rowtotal(C2BCRIT1 C2BCRIT2 C2BCRIT3 C2BCRIT4 C2BCRIT5 C2BCRIT6 C2BCRIT7), m

The option m, for missing, tells Stata to assign a missing value to the new variable if all of the source variables are missing; if at least one source variable is not missing, the new variable will be the sum of the non-missing variables. That is the data management required to create our new ndsymptoms variable in order to answer the following research question: is having major depression associated with an increased number of nicotine dependence symptoms?

Now we can work on the code to test the linear relationship between depression (the variable MAJORDEPLIFE) and number of nicotine dependence symptoms (our new variable ndsymptoms):

reg ndsymptoms MAJORDEPLIFE

In this code the response variable comes first, then the explanatory variable. We see the same output format as with the Gapminder regression example: the number of observations, the name of the response variable, the coefficients or parameter estimates, the R-squared value we talked about in the chapter on Pearson correlation, and p-values. Thus we know that our equation is: ndsymptoms = 1.97 + 1.23 × MAJORDEPLIFE.

Let's consider what this equation actually means, since it is not the best fit line of a scatter plot. The variable MAJORDEPLIFE is our depression variable; it takes the value 0 if the individual does not have major depression and 1 if the individual does. Thus we can plug the values 0 and 1 into MAJORDEPLIFE to get the expected number of nicotine dependence symptoms for each group: we would expect daily smokers without depression to have 1.97 nicotine dependence symptoms and daily smokers with depression to have 3.2 nicotine dependence symptoms. Remember that we previously subset our data to daily smokers aged 18 to 25.
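A minimal sketch of this regression and a check of the group means, assuming the variables created above are named ndsymptoms (0-7 sum score) and MAJORDEPLIFE (0/1); the exact capitalization in the course files may differ:

* Simple regression with a binary explanatory variable.
reg ndsymptoms MAJORDEPLIFE
* The fitted values for the two groups are just the group means,
* which can be checked directly with summary statistics by group.
bysort MAJORDEPLIFE: summarize ndsymptoms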
Notice that this is also the mean number of nicotine dependence symptoms for each group, which we can see by running summary statistics. We can also generate a bar chart of the means:

graph bar (mean) ndsymptoms, over(MAJORDEPLIFE)

So although we may not be working with a best fit line, we are still generating important descriptive information from this equation. This does not mean that everyone in our sample with depression has exactly 3.2 symptoms; obviously, no one can have 0.2 symptoms. Our low R-squared value, 0.086, tells us that we are only capturing a small amount of the variability, about 9 percent, in the number of nicotine dependence symptoms among daily smokers, but nonetheless this is the value we would expect given our data. Also note that the categorical variable here is binary; if your categorical variable has more than two levels, you will need to create dummy variables for your analysis, a process we will go over in supplementary material.

There are a lot of factors that contribute to internet use rate and nicotine dependence, the response variables in each of these examples. If we had more information, and if we included those other factors in our model, it is quite possible that our expected values would be even closer to our observed values. We can include several explanatory and/or predictor variables in our model, both to evaluate the independent contribution of multiple explanatory variables in predicting our response variable and to evaluate whether specific variables confound the relationship between our explanatory variable of interest and our response variable.

While we now have evidence that depression is significantly associated with the number of nicotine dependence symptoms endorsed by the young adult daily smokers in our sample, another likely predictor of nicotine dependence symptoms is, of course, the number of cigarettes a person smokes each day. What if number of cigarettes is associated with both our explanatory variable, major depression, and our response variable, nicotine dependence symptoms? What if it is really smoking, rather than major depression, that is associated with the number of nicotine dependence symptoms? To evaluate whether this is true, we add number of cigarettes smoked per day to our model. To create this variable, which we call numbercigsmoked, we basically rename the NESARC variable S3AQ3C1, the usual number of cigarettes smoked, with a short piece of data management code. After we have created and data managed the new variable numbercigsmoked, we add it to our model and examine the output: the p-values and parameter estimates (coefficients) for each predictor variable, that is, our explanatory variable, depression, and our potential confounder, number of cigarettes smoked. Both p-values are less than 0.05 and both parameter estimates are positive, so we can conclude that both major depression and number of cigarettes smoked are significantly associated with the number of nicotine dependence symptoms after partialling out the part of the association that can be accounted for by the other. In other words, depression is positively associated with number of nicotine dependence symptoms after controlling for number of cigarettes smoked, and number of cigarettes smoked is positively associated with number of nicotine dependence symptoms after controlling for the presence or absence of depression. Note that if a parameter estimate were negative and the p-value significant, it would mean there was a negative relationship between that variable and the response variable.
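A minimal sketch of this multiple regression, assuming numbercigsmoked is created from the NESARC item S3AQ3C1 as described above; names and capitalization may differ in the course files, and any "unknown" codes in S3AQ3C1 would also need to be set to missing (not shown here):

* Create the cigarettes-per-day variable from the NESARC item.
gen numbercigsmoked = S3AQ3C1
* Multiple regression: depression and cigarettes per day as predictors.
reg ndsymptoms MAJORDEPLIFE numbercigsmoked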
Suppose we started with a different explanatory variable. Dysthymia is a pervasive, low-level depression that lasts a long time, often a few years. Suppose we wanted to test the linear relationship between dysthymia, a binary categorical explanatory variable, and number of nicotine dependence symptoms, a quantitative response variable. The code for this would be:

reg ndsymptoms DYSLIFE

You can see from the significant p-value and positive parameter estimate that dysthymia is positively associated with number of nicotine dependence symptoms; that is, the presence of dysthymia is associated with a larger number of nicotine dependence symptoms and the absence of dysthymia with a smaller number. While dysthymia is long-lasting, low-level depression, major depression is a disorder characterized by a discrete episode of severe depression. So what happens when we control for major depression in this model? As you can see, dysthymia is no longer significantly associated with number of nicotine dependence symptoms after controlling for major depression. Here we have an example of confounding: we would say that major depression confounds the relationship between dysthymia and number of nicotine dependence symptoms, because the p-value for dysthymia is no longer significant when major depression is included in the model.

As in the previous example, we can continue to add variables to this multiple regression model in order to evaluate multiple predictors of our quantitative response variable, number of nicotine dependence symptoms. The code for the multiple regression in this example is:

reg ndsymptoms DYSLIFE MAJORDEPLIFE numbercigsmoked AGE SEX

Here we can see that, when evaluating the independent associations among several predictor variables and number of nicotine dependence symptoms, major depression and number of cigarettes smoked are positively and significantly associated with number of nicotine dependence symptoms, while dysthymia, age, and gender are not.
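The three models just described can be run as a sequence; a minimal sketch, with names and capitalization following the lecture (the course files may differ):

* Dysthymia alone: significant, positive association.
reg ndsymptoms DYSLIFE
* Add major depression: dysthymia loses significance, i.e. the
* dysthymia / nicotine-dependence association is confounded.
reg ndsymptoms DYSLIFE MAJORDEPLIFE
* Full model with several predictors.
reg ndsymptoms DYSLIFE MAJORDEPLIFE numbercigsmoked AGE SEX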
Multiple regression is the appropriate statistical tool when your response variable is quantitative. If the response variable is categorical with two levels, we need to use another multivariate tool: logistic regression. The Stata command we will use is logit (l-o-g-i-t). The logit command provides the coefficients, p-values, and confidence intervals right in its output, and it is used in the following format: logit, then the binary response variable, then explanatory variable 1, explanatory variable 2, explanatory variable 3, and so on.

For example, suppose we are interested in determining whether nicotine dependence is associated with social phobia among our sample population of young adult daily smokers, that is, whether those with or without social phobia are more or less likely to meet criteria for nicotine dependence. We create a new response variable for this example based on the NESARC variable TAB12MDX:

gen nicotinedep = TAB12MDX

Our newly named response variable, nicotinedep, is a binary (yes/no) indicator of nicotine dependence, so we should use a logistic regression. Note that because the variable is coded 0 and 1, where 0 is the absence and 1 is the presence of nicotine dependence, the model predicts the presence of nicotine dependence. NESARC also has a variable that we will use as the explanatory variable, SOCPDLIFE, which indicates the presence or absence of social phobia, an anxiety disorder marked by a strong fear of being judged by others and of being embarrassed. Thus our code would be:

logit nicotinedep SOCPDLIFE

In the output you can see the number of observations, the name of the response variable, and the coefficient, p-value, and confidence interval that the logit command includes. Our regression here is significant, with a p-value of 0.000. We could write the linear equation nicotinedep = 0.053 + 1.03 × SOCPDLIFE, but let's think about this equation a little more. In a linear regression model the response variable was quantitative, so it could theoretically take on any value; in a logistic regression the response variable only takes on the values 0 and 1, so if we tried to use this equation as a best fit line we would run into problems. Instead of talking in decimals, it may be more helpful to talk about how the probability of being nicotine dependent changes based on the presence or absence of social phobia: for example, are those with social phobia more or less likely to be nicotine dependent than those without? Instead of true expected values, we want probabilities. Described visually, we will no longer find the best fit line (shown in red) very helpful, because the outcome variable cannot take on any value; instead, there is somewhere along the x-axis where the outcome moves from being more likely to be a 0 to being more likely to be a 1. Our goal is to quantify the probability of getting a 1 versus a 0 for a given value on the x-axis.

To better answer our research question, we will use odds ratios rather than coefficients. An odds ratio compares the odds of an event occurring in one group with the odds of the event occurring in another group. Odds ratios are always given in the form of odds and are not linear. Odds ratios are often a confusing topic for students when they are first introduced, so it is important to go through them conceptually and understand exactly what an odds ratio is and what it means. An odds ratio can range from zero to positive infinity and is centered around the value 1. If we ran our model and got an odds ratio of 1, it would mean there is an equal probability of nicotine dependence among those with and without social phobia: those with social phobia are just as likely to be nicotine dependent as those without, and the model would likely be statistically non-significant. If the odds ratio is greater than 1, it means the probability of being nicotine dependent is higher among those with social phobia than among those without; in contrast, if the odds ratio is below 1, it means the probability of being nicotine dependent is lower among those with social phobia than among those without.

So how do we calculate the odds ratio? It is possible to do this by hand: the odds ratio is the exponentiation of the parameter estimate, so all we would need to do is raise e to the power of the parameter estimate. However, we can also let Stata do this for us, using the logistic command rather than the logit command: the logit command provides coefficients in its output, while the logistic command provides odds ratios.
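The same model can be fit either way; a minimal sketch, assuming the variables created above (nicotinedep from TAB12MDX, and SOCPDLIFE):

* Coefficient (log-odds) scale.
logit nicotinedep SOCPDLIFE
* By-hand odds ratio: e raised to the stored coefficient.
display exp(_b[SOCPDLIFE])
* Same model reported directly as odds ratios.
logistic nicotinedep SOCPDLIFE
* Equivalently, logit can report odds ratios with the or option.
logit nicotinedep SOCPDLIFE, or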
The logistic command is formatted the same way as the logit command: logistic, then the binary response variable, then explanatory variable 1, explanatory variable 2, explanatory variable 3, and so on. Here is our code:

logistic nicotinedep SOCPDLIFE

Because both the explanatory and response variables in this model are binary, coded 0 and 1, we can interpret the odds ratio in the following way: within our sample, young adult daily smokers with social phobia are 2.8 times more likely to have nicotine dependence than young adult daily smokers without social phobia. We also get a confidence interval for the odds ratio. Remember that our data set is just a sample of a population; we do not have every young adult daily smoker in the U.S. This confidence interval tells us that we can be 95 percent confident that if we selected another sample from the population, the odds ratio for that new sample would fall somewhere between these two numbers 95 times out of 100. So, for example, the odds ratio for social phobia is 2.8; if we were to draw additional samples of young adult daily smokers in the U.S., 95 times out of 100 the odds ratio would fall somewhere between 1.71 and 4.59. It is important to keep in mind that the odds ratio is simply a statistic calculated for the sample, so by looking at the confidence interval we get a better picture of how much this value could change for a different sample drawn from the population. Based on our model, those with social phobia are anywhere from 1.71 to 4.59 times more likely to have nicotine dependence than those without social phobia: the odds ratio is a sample statistic, and the confidence interval is an estimate of the population parameter.

But what happens when we control for major depression? Here is the code we use:

logistic nicotinedep SOCPDLIFE MAJORDEPLIFE

As you can see, both social phobia and major depression are independently associated with the likelihood of having nicotine dependence. Given that both are positively associated with the likelihood of being nicotine dependent, and that both explanatory variables are binary, we can interpret the odds ratios as follows: young adult daily smokers with social phobia are 1.9 times more likely to have nicotine dependence than young adult daily smokers without social phobia, after controlling for major depression; and daily smokers with major depression are 2.9 times more likely to have nicotine dependence than daily smokers without major depression, after controlling for the presence of social phobia. Importantly, because the confidence intervals on these odds ratios overlap, we cannot say that major depression is more strongly associated with nicotine dependence than is social phobia. For the population of young adult daily smokers, we can say that those with social phobia are anywhere from 1.1 to 3.1 times more likely to have nicotine dependence than those without social phobia, and those with major depression are between 2.3 and 3.7 times more likely to have nicotine dependence than those without major depression; both of these estimates are calculated after accounting for the alternate disorder. As with multiple regression, when using logistic regression we can continue to add variables to our model in order to evaluate multiple predictors of our binary categorical response variable, the presence or absence of nicotine dependence.
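To illustrate "continuing to add variables," here is a hedged sketch; extending the model with AGE and SEX is purely illustrative and is not one of the models reported in the lecture:

* Illustrative only: additional covariates can be appended to the
* logistic model in the same way as in multiple regression.
logistic nicotinedep SOCPDLIFE MAJORDEPLIFE AGE SEX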
Another example of confounding occurs when a logistic regression model is run to test the association between panic disorder, the explanatory variable, and nicotine dependence, the response variable. Panic disorder is an anxiety disorder characterized by recurring panic attacks. First we make sure we are working with our subset of young adult daily smokers:

keep if CHECK321==1 & AGE<=25 & S3AQ3B1==1

Next we do some data manipulation to generate a panic variable. Four NESARC variables describe the presence or absence of panic disorder in a couple of forms, so we manage this new panic variable by assigning 1 for yes answers and 0 for no answers (there are no missing values in these variables): if the answers to each of these variables are 0, then the panic variable is set to 0 (no); if the answer to any of these variables is 1, then the panic variable is set to 1 (yes). Then:

logistic nicotinedep panic

Here we see a significant positive association; young adult daily smokers with panic disorder in our sample are 2.8 times more likely to have nicotine dependence than young adult daily smokers without panic disorder. However, when we add major depression to the model, panic disorder is no longer significantly associated with nicotine dependence. Here we have an example of confounding: we would say that major depression confounds the relationship between panic disorder and nicotine dependence, because the p-value for panic disorder is no longer significant when major depression is included in the model. Further, because panic disorder is no longer associated with nicotine dependence, we would not interpret the corresponding odds ratio; instead, we would interpret the significant odds ratio between major depression and nicotine dependence, namely that young adult smokers with major depression are 3.7 times more likely to have nicotine dependence than young adult smokers without major depression, after controlling for panic disorder.

By now you should be feeling a little more comfortable with the idea of generating a regression model when your outcome variable is binary. Remember to always code your outcome variable so that 0 means no outcome and 1 means that the outcome occurred. This is true whether your outcome is positive, such as graduating from college, or negative, such as developing nicotine dependence.
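A minimal sketch of the panic-disorder confounding check above. The four NESARC panic items are not named in the captions, so panic1 through panic4 are placeholders for those recoded 0/1 variables, and combining them with rowmax is an implementation choice rather than the course's exact code:

* Restrict to young adult daily smokers, as described in the lecture.
keep if CHECK321==1 & AGE<=25 & S3AQ3B1==1
* Any "yes" across the four recoded panic items counts as panic disorder
* (the items contain no missing values, so rowmax yields a clean 0/1).
egen panic = rowmax(panic1 panic2 panic3 panic4)
* Unadjusted model: panic disorder is significant (odds ratio about 2.8).
logistic nicotinedep panic
* Adjusted model: panic disorder loses significance once major depression
* is included, i.e. major depression confounds the association.
logistic nicotinedep panic MAJORDEPLIFE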
Info
Channel: Lisa Dierker
Views: 1,333
Rating: 5 out of 5
Id: 7x-bWrC0Q_c
Length: 58min 1sec (3481 seconds)
Published: Tue Feb 05 2019