Video 5: Dummy Variables

Video Statistics and Information

Captions
Welcome to this video on linear regression. In this video we will be talking about how to use and how to interpret the coefficients of dummy variables.

We start this video by talking about the differences between continuous and categorical variables. Our linear regressions always have a dependent variable and a series of regressors, or independent variables, and so far we have always worked with continuous variables. In this particular video we will be talking about the salaries of professors in the United States, so salary could be our dependent variable, which is continuous. Other potential continuous variables that we could use as regressors to explain variance in salaries are, for example, tenure: how many years a professor has been at an institution. Another continuous variable could be the number of courses taught by this professor or, in research institutions, the number of publications this professor has, which may have an impact on his or her salary.

However, there are also categorical variables, which are not numbers. For instance, we could test whether gender is associated with salary, which we actually hope it is not, and that there are no differences in the salaries or incomes of women and men in the faculty. We can also talk about a professor's qualification and rank, but I'll leave the description of these for later. Another category could be a professor's department: a professor could earn differently if he or she is in the finance, strategy, marketing or IT department. Universities can also distinguish themselves by how much they pay their professors, and the city, through its cost of living, could influence the salary of a professor.

However, note that all these are categories, not numbers, and there's a problem in how we choose to represent these categories using numbers. We need numbers because linear regression models assume that the variables, the Xs, are numbers; this is how the model works. Continuous variables are
already numbers, but what numbers should we use to represent the categorical variables? Should we assign one, two, three and four to four different cities, or maybe 100, 200, 300 and 400 to the four different cities? It's very hard to tell. Moreover, if we assign numbers we might be assuming an ordering of the categories, in the sense that, for example, being in city A is more than being in city B and more than being in city C, which makes little sense. And implicitly, by using these kinds of numberings in our models, we're giving the coefficients of these categorical variables a slope that usually makes little or no sense whatsoever.

So how do we solve the representation of categorical variables in linear regression models? We do this by using dummy variables. A dummy variable is a binary variable, which means it takes a value of 0 or 1 and no other values, and it represents whether an observation belongs to a certain category or has a particular attribute. They are also called indicator variables because they indicate whether a particular observation, in this case a professor, has a particular attribute or not. In binary notation, and also in computer science, a 1 usually represents a yes and a 0 represents a no, so following this tradition we will use a 1 to represent that a certain observation belongs to a certain category. The name is funny: we call them dummies because they are not real, they're just there to represent something else. Once we convert the categorical variables into binary variables, what we're going to have is a series of zeros and ones.

Once again, the data we'll be using for this example is professor salaries from the US. We have data on over 26,000 professors in the US and we know several attributes of them. We know their salary in thousands of US dollars per year; this is a continuous variable and is going to be our dependent variable. In our models we will be investigating what professor attributes determine a professor's salary, and we're going to have two
different attributes of them. One is going to be their qualification, which is a categorical variable: a professor can be either an academic or a professional. An academic is the usual professor who does some research and works in a university, but a professor could also be a professional who works in industry and simply takes the teaching position as a part-time job; this is, for example, the CEO of a firm who is taking a sabbatical and comes to teach at a certain university. We also have their rank, and here we have several categories: a professor can simply be an instructor, or he can be a tenure-track professor who starts off as an assistant professor, then ascends to associate professor and finally becomes a full professor, or just professor. In general we would expect that the income of these professors will increase as the rank increases as well.

Just to better exemplify what we're looking at, this is a sample of our data, where we see that there are two categorical variables, qualification and rank, which take one of the values we just described, and a continuous variable, which is the salary.

To further probe how this data looks and how things behave before running any models, we're going to look at some graphs generated in Tableau. This first graph shows a histogram of the salary for the entire population of over 26,000 professors. Note that even though it has a right tail driven by some outliers, overall it is centered between $50,000 and $100,000 per year; in fact, the mean value is $112,000 for a given year.

Now, on the left-hand side I'm showing you the distribution of qualification, and you can see that there are many more academics in the data set than professionals: about 86 percent of the professors are academics, while only close to 14 percent are professionals. On the right-hand side we show box plots of the salaries for these two qualifications. You can see that most of the outliers in the
higher end of salaries are academics, and across the different percentiles, in particular the median, academics tend to earn more than professionals. So we could expect, once we run our models, to find that academics indeed earn more than professionals, and we want to quantify by how much.

If we look at the distribution of ranks across the population, we see that only a minority of the faculty are instructors; there are many more assistants and associates, and the largest group, 33% of them, are professors. Correspondingly, as one would expect, the box plots indicate that salaries tend to increase as the ranks go higher, in the sense that instructors earn the least, assistant and associate professors earn similar wages, it appears, and the professors, the full tenured professors, are the ones who appear to be earning the most. At least that is what we get from graphically comparing their medians.

If we combine both categorical variables and see how the salaries differ by rank and qualification, we observe the following. In this case the different columns represent the ranks and we use colors to represent the different qualifications, whereby academics are shown in orange and professionals are shown in blue. We also use the size of the circle to represent how many of our observations belong to each of these individual subcategories, so the sizes don't tell us anything about the salary, just something about the mix of the population. Since our vertical axis is the average salary, we can see that for every single one of the ranks, academics on average earn more than professionals: the orange circles are always above the blue circles. Moreover, as we go across the ranks we note that there's still an upward tendency, where regardless of whether the professor is an academic or a professional, as the rank goes higher so does the average salary. Keep this graph in mind, because we're going to be coming back to it when we conclude this video.

All right, now that we know how
the data looks and more or less who should earn a higher salary than whom, let's start running some models to test this. We will first test the difference in salaries between academics and professionals. For this we're going to analyze the qualification category, which has two potential values, and we're going to create a dummy variable to indicate whether a faculty member belongs to one of these two categories. In particular, we're going to create a professional indicator. It's going to be a binary variable that has a value equal to one if that particular observation, that professor, is a professional, and a zero if it is not a professional, which in this case means that the professor is an academic. Note that we only need one binary variable to represent two categories, because the professor is only going to be in one of these two categories: it's either going to be a one or a zero.

Once we convert our categorical variable into a dummy variable (in this case we took qualification and, based on its values, constructed the dummy variable professional), we see that whenever qualification is academic we have a no, or a zero, for professional, and conversely, when qualification is professional we have a one for professional.

Now we're going to regress salary on qualification, and this is going to be our baseline model: we have an intercept and we have a slope for professional. Let's make a small pause here. What exactly does beta 1 represent, and how do we interpret it? I'm going to give you a few moments to think about this.

Traditionally, beta 1 would represent a slope, which is how much the dependent variable changes for every unit change in professional. In this case the unit goes from zero to one, so beta 1 represents the change in salary when professional turns from a zero to a one, which is exactly what we want to test: what is the change in salary when someone goes from being
a professional to being an academic, or vice versa?

Let's do some simple algebra here. What is the predicted salary for an academic under this model? Note that if a professor is an academic, then the professional variable is a zero. Thus the expected salary when professional is zero is beta 0, the intercept, plus beta 1 times zero, which means we cancel out the beta 1 and we're only left with our estimate for beta 0, the intercept. So in this case the intercept represents the expected salary for an academic, when professional is zero. Conversely, what happens if the professor is a professional? In this case professional has a value of one, in which case the expected salary is beta 0 plus beta 1 times 1, or the sum of the two estimated coefficients.

Almost always when we are using dummy variables, what we're trying to do is test how a dependent variable changes across the different categories. In this particular case we already know that the salary of an academic and the salary of a professional are given by those estimates, and what we would be interested in testing is the difference between the two: does one category earn a higher salary than the other? Note that if we subtract one from the other, beta 1 is really representing that difference. Beta 1 is the added or subtracted salary that a professional earns relative to an academic.

So we can do formal tests based on the estimated value of beta 1. If beta 1 is greater than zero, then we would be finding that a professional faculty member earns more than an academic faculty member. If beta 1 is close to zero or is insignificant, it means that it doesn't matter whether a faculty member is a professional or an academic; the salary is going to be the same, which in this case would be represented by beta 0. Finally, if beta 1 happens to be negative, that means that a professional is earning less than an academic, which is what we expect.

This is the regression
output from our model. Note that we have an intercept of 119 and the coefficient for the professional dummy variable is minus 49. Also note something that is very important for the forthcoming analysis: the coefficient for professional is statistically significant, which means that it is statistically different from zero. If we computed the confidence interval based on the standard error of 0.72, we would see that there is no way that minus 49 would turn into zero. Moreover, the p-value is pretty much zero, which means we're very certain that zero is not within the potential values of the professional coefficient.

So, recapping, this was our model, we found the following two estimates, and let's start interpreting them. What was the expected salary of an academic? Remember, if a professor is an academic, that means that the professional dummy is turned off, it is a zero, and we are only left with beta 0, implying that the expected salary of an academic is 119. We can also easily compute the expected salary for a professional, in which case the professional dummy is turned on, so we add up the estimates of beta 0 and beta 1 and find that the expected salary for a professional professor is 69.70. We were also interested in testing whether the two categories earn different salaries, which was given by beta 1, and we find that beta 1 is negative and statistically significant, statistically different from zero. So we can conclude from this analysis that academics earn more than professionals, as we had expected.

Our understanding of how we can use a dummy variable to distinguish between two categories makes it a lot easier to test differences across more than two categories, and in this case we're going to start working with the rank categorical variable. Recall that rank had four potential categories: instructor, assistant professor, associate professor or full professor. However, if we used a single dummy variable we could only distinguish between two of them: it's only a one or
a zero. What we will do to solve this issue is use more than a single dummy variable; in this case we will be using three dummy variables to represent the four different categories of rank. Note that we're leaving one out, and you'll find out why soon. For now it is important that you learn this rule: whenever we want to represent K categories we will be using K minus 1 dummies. We always leave one category out.

So how should our model look? Let's start by constructing the dummies, and remember, we need to leave one of the categories out. It is very standard to leave the least important or the lowest-value category out, so in this case we're going to leave out the instructor category and construct dummy variables for the other three categories. Namely, we're going to create a dummy variable called assistant, which is equal to one if the rank is assistant and zero otherwise. We're going to have another dummy variable for associate, which is equal to one if the rank is associate and zero otherwise. And finally we create the dummy variable for the full professor, which works in the same fashion.

If we put all these together we find this model right here. Note that we have an intercept and one coefficient for each of the three dummy variables we just created. A very important attribute of these dummy variables is going to make their interpretation easier: note that whenever any of these dummy variables is a one, all the others are going to be zero.

So how can we use this to know what the expected salary of each rank is? Let's start by thinking about the instructor. An instructor is not an assistant, not an associate, nor a full professor, so all three dummy variables in the model are going to be zeros. Thus we are only left with beta 0, the intercept, so the expected salary of an instructor is going to be whatever estimated coefficient we find for the constant term, beta 0. If we now have an
assistant professor, then only the assistant dummy variable will be turned on; the associate and full dummy variables are going to be turned off, which means we're only left with beta 0 and beta 1, and correspondingly the sum of these two coefficients is going to be the expected salary for an assistant professor. Following the same logic, the expected salary for an associate professor is going to be beta 0 plus beta 2, and finally the expected salary for a full professor is beta 0 plus beta 3, because once again only the full professor dummy variable is turned on in this case.

To very quickly clarify this, let's see how the dummy variables look in the data table. We see that whenever the rank is assistant professor, only the assistant dummy variable is turned on; when the rank is associate professor, only the associate dummy variable is turned on; when it is instructor, none of the three dummy variables is turned on; and when it is full professor, or professor, only the full dummy variable is turned on.

This is the regression output, and if we write down our model we end up with an intercept of 65.25 plus 41 times assistant plus 45 times associate plus 72 times full. Note something that will be very important, as it was before: each of the coefficients for the dummy variables is statistically significant, so whatever differences in salaries they represent, we are certain that they are statistically significant.

So once again, if this is our estimated model and we would like to know, for example, what the expected salary of an instructor is, that would imply that all three dummy variables for the ranks are zeros and we are only left with the intercept, so the expected salary of an instructor is 65.25. Now what would beta 1, or 41.71, represent, which is the coefficient for assistant? This 41.71 represents the difference in salaries between the instructor and the assistant. And remember that the coefficient for assistant, that 41.71, was
statistically significant and different from zero, so we know that the difference is there. Very similarly, the difference between an associate and an instructor is represented by the coefficient for the associate dummy variable, and the difference between the full professor and the instructor is the coefficient for the full dummy variable. Overall, note that we follow the expected trend, and you can start seeing that the higher the rank, the higher the wage.

But let me ask you a tricky question: what is the difference between an assistant professor and an associate professor? I hope that by now you see that the difference would be represented by the difference in the coefficients for assistant and associate, where we expect a difference of 3.61. But if we look at our model, we don't have any information that would tell us whether that difference is statistically significant or not. In some statistical packages this is something we can test, but in particular gretl does not offer this functionality. We can, however, redesign our model to be able to test it.

Think for a second about what we could do to test the difference between an assistant and an associate professor. Note that since we excluded the instructor parameter, all our coefficients represent the difference between an instructor's salary and each of the other ranks' salaries. So why don't we do the same, yet this time we exclude the assistant dummy variable and instead create a dummy variable that represents the instructor, and use that in our model?

This is the regression output of a model that excludes assistant and instead uses an instructor dummy variable. In this case, if all three dummy variables are turned off, the intercept represents the salary of an assistant professor, which is going to be 106. Meanwhile, the coefficient for instructor is minus 41.71, and this difference is how much less an instructor earns relative to an assistant. Now let's very quickly compare this to our prior
model, where our baseline was instructor, and note that the difference between an instructor and an assistant was plus 41.71, so we are really representing the exact same phenomenon, only that our baseline for comparison in the earlier model was instructor. Now we're using assistant as the baseline of our model, and the coefficient for associate is 3.61. Note in the regression output that even though this coefficient is small, it is statistically different from 0: the standard error of this coefficient is just 0.64, and even if we subtract 2 times the standard error from the coefficient we're not even close to 0. The low p-value also signals that this coefficient is statistically significant. In conclusion, we have evidence that an associate professor does earn more than an assistant professor, specifically about $3,600 per year.

Now let's finish this video by spicing things up a bit, and let's include all of our dummy variables. We're going to have a dummy variable to represent whether a professor is a professional or an academic, which was the first dummy variable, professional, which we used to represent the qualification categorical variable, and we're going to have the assistant, associate and full dummy variables to correspondingly represent the four ranks we just talked about. I'm going to leave the interpretation of these coefficients to you; try answering some of the following questions. Note that we have two qualifications and four potential ranks, so there are eight possible combinations here. The first thing I want you to think about is what the expected salary is for each of these eight potential combinations. Then I want you to think about and test whether the differences between the salaries of the different categories are statistically significant or not. That is my homework to you. Thank you very much.
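The rank example from the captions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`make_dummies`, `expected_salary`, `rebase`) are hypothetical, the coefficients are the values read out in the video (in thousands of US dollars per year), the associate coefficient is inferred as 41.71 + 3.61 = 45.32 from the two models discussed, and the full coefficient is the rounded 72 that was spoken. It shows the K minus 1 encoding, the expected salary per rank, and how the same point estimates look when a different category is left out.

```python
# Coefficients as spoken in the video (thousands of US dollars per year).
# Assumptions: "associate" is inferred as 41.71 + 3.61, and "full" is the
# rounded 72 from the captions, so treat all values as approximate.
RANKS = ["instructor", "assistant", "associate", "full"]
INTERCEPT = 65.25  # expected salary of the omitted category (instructor)
COEFS = {"assistant": 41.71, "associate": 45.32, "full": 72.0}


def make_dummies(rank, baseline="instructor"):
    """Encode one rank as K-1 indicator variables, leaving `baseline` out."""
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank}")
    return {r: int(r == rank) for r in RANKS if r != baseline}


def expected_salary(rank):
    """Intercept plus the coefficient of whichever dummy is switched on."""
    dummies = make_dummies(rank)
    return INTERCEPT + sum(COEFS[name] * value for name, value in dummies.items())


def rebase(new_baseline):
    """Re-express the same point estimates with another omitted category.

    The new intercept is the expected salary of the new baseline, and each
    new coefficient is a difference relative to that baseline.
    """
    new_intercept = expected_salary(new_baseline)
    new_coefs = {
        r: expected_salary(r) - new_intercept for r in RANKS if r != new_baseline
    }
    return new_intercept, new_coefs


if __name__ == "__main__":
    for rank in RANKS:
        print(rank, make_dummies(rank), round(expected_salary(rank), 2))
    intercept, coefs = rebase("assistant")
    print("baseline = assistant:", round(intercept, 2),
          {r: round(c, 2) for r, c in coefs.items()})
```

Note that `rebase` only reproduces the point estimates (for example, 3.61 for associate relative to assistant). It says nothing about standard errors or p-values, which is why the video refits the regression with assistant as the omitted category before drawing a conclusion about significance.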
Info
Channel: dataminingincae
Views: 177,375
Rating: 4.9420638 out of 5
Keywords: Dummy Variables, Statistics (Field Of Study)
Id: 9yTui_LoSOc
Length: 23min 44sec (1424 seconds)
Published: Fri Sep 12 2014