Multiple Regression Explained with Excel

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
welcome to this tutorial on multiple regression multiple regression is an extension of simple regression and allows us to consider more variables to get a better prediction than we would using simple regression before viewing this tutorial you should be thoroughly familiar with the basic concepts of simple linear regression you should know what a slope is what a y-intercept is what a least-squares line is and how to calculate that line you should also know what R squared and R represent and how to test the slope for significance if any of these terms are unfamiliar to you please go back and review my tutorial on simple linear regression before proceeding any further multiple regression is an extension of simple regression in simple regression we develop a model to explain the relationship between Y and one independent variable X that simple regression model looks like this with beta naught representing the y intercept and then beta 1 the slope for X the independent variable plus Epsilon the random error term in multiple regression we develop a model to explain the relationship between y and multiple X's that is more than one independent variable the multiple regression model would look like this you can see that it is similar to the simple regression model except that it has more terms in the equation and those terms represent the parameters for the extra variables beta naught beta 1 beta 2 and so on until beta subscript P are the parameters for the P independent variables the epsilon represents the error term and accounts for the variability in Y that cannot be explained by the model since we don't know what that error is it will drop out when we calculate our predicted Y hat line from a sample you can see by looking at the multiple regression equation that we have multiple X's and each has its own slope associated with it there is only one beta not only one y-intercept for the multiple regression model that would be the value of when all the X variables are zero as we did in the past we will be using a sample to develop an estimated multiple regression equation using the least squares method since the population parameters of beta naught beta 1 beta 2 etc are not known we will need to use sample statistics to estimate these population parameters the sample statistics would use B instead of data so we would have B naught B 1 B 2 and so on when you see the Greek letter beta it is always referring to a true population parameter obviously if we knew the true population parameters we would not need to take a sample to estimate them but we don't know the true population parameters so we take a sample and get sample statistics these sample statistics use lowercase non Greek letters B naught B 1 and so on so our regression equation using sample statistics will look like this you may recognize this as our Y hat line from simple linear regression but it has the extra independent variables the extra X's and their associated slopes now we will use the least squares method to develop an estimated regression equation the goal of the least squares line is to find the line that fits our data the best the least squares criterion is shown here notice it is the same criterion that we used in simple regression since the goal is the same to find the best line that fits our data why I represents the observed value of y for the I thought servation and y hat represents the predicted value of y for the iPhone so we have two values the observed value from the sample and the predicted value from the Y hat line of regression the line that fits the data the best is the one that minimizes the squared differences of each observed Y I value from its predicted y hat value it is the distance from each dot on the scatter diagram to the line of regression both above and below that line since multiple regression involves the use of matrix algebra to compute the y-intercept and the various slopes of each X variable we will not be solving multiple regression by hand we will instead use a computer program to do the calculations for us there are many programs that can do that for us but I will demonstrate the multiple regression in this tutorial using Excel we will then focus on how to read and interpret the computer output in my previous tutorial on simple regression I use grades as the dependent variable and number of hours studied as the independent variable now let's extend that same example to include more independent variables I will still use grade on exam as my y dependent variable but let's see if I can incorporate more independent variables to get a better prediction model here is my data set you can see I have a column for grades on exam and our studied but I also have collected data from each student for overall GPA age gender number of credits earned number of hours working full-time or part-time student status let's say I believe that a good portion of a student's grade can be predicted by the number of hours study but I want to add more variables to my model I want to see if overall GPA and age are also related to grade on exam so I have my dependent variable Y as grades on exam and now I will define my X variables let's define x1 as number of hours studied x2 as overall grade point average and x3 as age let's now see how we can use Excel for multiple regression to use Excel for multiple regression and also simple regression we first select data on the data tab you can see it's circled here on the ribbon once you click on the data tab you will get this ribbon and on the right hand side you will see data analysis in the analysis group click on data analysis and you will see a data tools dialog box up here look for a regression you might have to scroll down for it and then click okay what will pop up after you click okay is this regression dialog box now we need to put in the ranges for our wide dependent variables and our X independent variables here are the variables we decided to use in our multiple regression model so we need to put the range in for these cells in the input range box here when I put the range in I included the labels in the first row you can see I have B 1 through B 51 for the grades and C 1 through a 51 for the X variables our studied overall grade point average and age there are 50 variables here so the 1 means that I've used row 1 for the labels so because when I put the range in I included the labels in the first row then I need to select labels here next you select confidence level and finally select the output range I like to select a new worksheet so I have the output on a fresh page and here is what the output would look like for our problem this output will help us to develop this regression line our Y hat line first we need B not the y intercept so looking at the bottom third of this output where it says intercept and coefficient the number is twenty nine point seven two two that's our y intercept that is the coefficient for our intercept our y intercept otherwise known as B not now let's look for b1 the slope of variable one or the slope for the number of hours studied and that would be here three point four two three next we need the slope for x2 overall GPA and that would be here eight point two three five and finally to complete our model we need the slope for the age variable and that would be here point three six zero seven now that we have all the coefficients for the slopes and the y-intercept we can get our line of regression and this is what it would look like y hat is equal to twenty nine point seven two two plus three point four to three times x1 plus eight point two three five times x2 plus 0.36 zero seven times x3 now let's take a look at how we would use this Y hat line here is the line again now suppose we have a twenty three-year-old with a GPA of two point four who studied for three hours what would his predicted grade on the exam be Y hat will be the predicted grade on the exam x1 is the number of hours studied so that would be 3 x2 is the overall grade point average so that would be two point five and x3 is age so that would be twenty three now we can plug in all the numbers into the Y hat formula to get Y hat the predicted value of y for all of these X's and we get this Y hat line with the slopes and the X variables filled in and that would give us these numbers and finally we will get a y hat value of sixty-eight point zero five one one now we need to talk a little more about these coefficients that we just got off the computer printout and what they mean here is the Y hat line again looking at this line we can see that B naught is twenty nine point seven seven two but what does that mean well B naught is the y-intercept so it is the value of y when all of the X's are zero in multiple regression this number doesn't always make sense as you can see here an age of zero is obviously not possible so it is really the point where if we extend the line to cross the y axis where it would hit the y axis the next value we have in the regression line is B one and that is three point four to three remember for simple regression the slope tells us the change in Y for every one unit increase in X so in multiple regression it tells us the increase in Y for every one unit increase in X one holding all the other X's constant so that they don't change for our example that would represent the increase in grade for every one increase in the number of hours studied without any changes to the GPA and the age b2 is eight point two three five and that tells us the expected increase in grade for every one unit increase in GPA when the other independent variables don't change and finally b3 is point three zero seven which represents the slope for the age and that would represent the expected increase in grade for every one year increase in age now let's take another look at the output and see what other numbers might be of importance to us here we have a row titled multiple R and we have a value of 0.9 six three nine this refers to the correlation coefficient the same as it did in simple regression its values range from minus 1 to plus 1 where -1 is perfect negative relationship and plus 1 would be a perfect positive relationship the next number is R squared and that is again the same thing as it was in simple regression it is the coefficient of determination this tells us the percent of variation in Y that is explained by X next is our adjusted r-squared value this value adjusts for the number of terms in the model and it is a more accurate number so we will use this number whenever we have more than one X variable and the last line here is our observations that is our sample size in this example we had data from 50 students so n our sample size is 50 next we can test for the individual significance of each of the beta parameters specifically we would like to test if any of the slopes is equal to 0 remember if the slope is equal to 0 then there is no relationship between x and y for our example we can take beta 1 beta 2 and beta 3 this shows beta I since this is a generic representation of the hypothesis test and our test statistic is the same as it was in simple regression but instead of a subscript 1 we will use subscript I so the test statistic B sub I over SBI you remember how complicated the calculations were for simple regression but you don't have to worry in multiple regression we will be using the computer printout to get our test statistic the rejection rule is the same as it always was we reject the null hypothesis if the p-value is less than or equal to alpha or using the critical value approach we would reject if the test statistic is in the rejection region either above the positive critical value or below the negative critical value we would get the critical value from the tea-table under alpha divided in half and degrees of freedom n minus P minus 1 so to get the degrees of freedom n minus P minus 1 n is the sample size P is the number of independent variables or the number of X terms and then minus 1 so for our example we have 50 observations minus 3 there are 3 X variables minus 1 so we have 50 minus 3 minus 1 which is 46 at this point we can also define our alpha value as 0.05 so if you were to use the critical value approach you would split alpha and a half since this is a two-tailed test so alpha would be point zero two five and then you would look the critical value up under forty six degrees of freedom and then decide does the test statistic fall in the tail areas which would be the rejection regions or between the two tails which is in the non rejection region now we are ready to do a hypothesis test for the first variable x1 number of hours studied the hypotheses would look like this H naught that beta 1 is equal to zero and H a that beta 1 is not equal to zero we want to find evidence for the alternative hypothesis that the slope is not equal to zero and that would give us evidence of a relationship between x and y between number of hours studied and grade on exam all right let's see where we would get our test statistic for b1 here on the Excel output we see the test statistic is here under the column labeled t stat so the test statistic for the number of hours studied is 9.47 seven look across the row where it says our studied and look down where it says tea staff the tea staff column is your test statistic column so we got nine point four seven seven as our test statistic for b1 now is that significant or not well we could look up the critical value under alpha divided in half point oh two five and degrees-of-freedom 46 or we can simply look at the p-value right here we can see that the p-value is a very very very small number so small that it is written in scientific notation since there's not enough room for all the zeros and therefore we would reject the null hypothesis and find evidence of a significant relationship between the number of hours studied and grade on exam what about testing beta 2 the overall grade point average is that related to grade on exam so we can set up our null and alternative hypotheses to test beta 2 just as we did before but this time it would be beta 2 instead of beta one in hypotheses since we are testing to see if the slope for variable 2 that is x2 is equal to or not equal to zero let's see where we would get our test statistic for beta 2 here on the Excel output we can see the test statistic is here so the test statistic for overall GPA is 5 point 6 4 to 9 follow across the row word says GPA or down the column that has the t-test statistic it says T stat over there now is that significant or not we see that the p-value is a very very very small number so small that it's written in scientific notation since there is not enough room for all the zeros and therefore we would reject the null and find evidence of a significant relationship between overall GPA and grade on exam and finally let's test the last slope to see if it is significant this is for the variable age is that related to grade on exam so we can set up or know an alternative hypotheses as we did before but this time we have beta 3 equal to 0 and beta 3 not equal to 0 since we are testing the if the slope for age is equal to zero or not where would we get the test statistic for beta 3 yes on the Excel output we look for the test statistic right over here and we see the test statistic is one point four seven four three now is that significant or not well apparently not we chose an alpha value of point zero five and we can see the p-value is greater than that it is point one four seven two and therefore we would say do not reject the null hypothesis and we find evidence that there is no significant relationship between age and grade on exam there is so much more to talk about in multiple regression but that will be it for this tutorial the basics have been covered here I hope you enjoyed this tutorial and as always I hope you learned something you
Info
Channel: Learn Something
Views: 103,405
Rating: 4.8769231 out of 5
Keywords: Multiple Regression using Excel Data Analysis Tool, Multiple Regression, Data Analysis Tool for Regression, t test for slopes, Adjusted r squared
Id: jGd2cj4K4Ww
Channel Id: undefined
Length: 18min 48sec (1128 seconds)
Published: Fri May 08 2015
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.