Simple Linear Regression: Basic Concepts Continued Part II

Video Statistics and Information

Captions
Welcome. This tutorial is a continuation of my previous video on simple linear regression. In that video I used data collected from ten students, recording the number of hours studied as X, the independent variable, and grade on exam as Y, the dependent variable. We plotted the data points on a scatter diagram and found that a positive linear relationship appears to exist. We can see this graphically: as X increases, Y also increases, which indicates a positive slope.

In the previous video we calculated the least squares line, also known as the line of regression: y-hat is equal to 55.048 plus 4.74 times X. The slope was calculated using this formula, and we got 4.74 for the slope. Remember, X is the number of hours studied and Y is grade on exam. The slope tells us the increase in Y for every one-unit increase in X, so for this example a slope of 4.74 indicates that for every additional hour of study, a student's grade is predicted to increase by 4.74 points. Next we calculated the y-intercept, b0, using this formula, and we got 55.048. Remember, the y-intercept is the value of Y when X is zero. That means for this example, when X, the number of hours studied, is zero, grade on exam is predicted to be 55.048.

We also calculated the coefficient of determination, r-squared, which is SSR divided by SST. Again, if you are unfamiliar with these terms, please review my previous introductory video on simple linear regression; the link for it appears in the description box for this video. Here are the numbers we calculated for the sums of squares. We did this by hand for demonstration purposes, but we will also do this using Excel in this video, which is of course much quicker and much more accurate. Using our hand calculations we got 0.9505 for r-squared. R-squared can take on any value between zero and one. It tells us the proportion or percentage of variation in Y that is explained by X.
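As a quick illustration of how the slope and intercept are read, here is a minimal Python sketch of the fitted line quoted above. Only the two coefficients come from the video; the function name `predict_grade` is a made-up helper for this note, not something shown on screen.

```python
# Coefficients of the least squares line quoted in the video (hand calculation).
B0 = 55.048  # y-intercept: predicted grade when hours studied is 0
B1 = 4.74    # slope: predicted change in grade per extra hour studied


def predict_grade(hours: float) -> float:
    """Return y-hat = b0 + b1 * x for a given number of hours studied.

    Hypothetical helper name, used here only for illustration.
    """
    return B0 + B1 * hours


if __name__ == "__main__":
    # At x = 0 the prediction is just the intercept.
    print(round(predict_grade(0), 3))
    # Moving from 1 hour to 2 hours changes the prediction by the slope.
    print(round(predict_grade(2) - predict_grade(1), 2))
```

The point of the second print is that the difference between any two predictions one hour apart is the slope, whatever the starting number of hours.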
In this example, r-squared is 0.9505. That means that 95.05 percent of the variation in Y, grades, is explained by X, number of hours studied.

Next we calculated r, the correlation coefficient. This is simply the sign of the slope and then the square root of r-squared. We have an r value of positive 0.9749 for this example. R can take on any value from negative 1 to positive 1. A negative 1 would indicate perfect negative correlation, positive 1 would indicate perfect positive correlation, and of course if r is zero, that would indicate zero correlation, or no linear relationship between X and Y. For this example we have an r value of positive 0.9749. This would indicate a very strong positive linear relationship between X and Y, since positive 0.9749 is very close to plus 1 and very far away from zero.

And finally, we ended the previous video with a hypothesis test for the slope. We wanted to test whether the slope was equal to zero or not. If we rejected the null hypothesis, which we did, then we would show support for the alternative hypothesis that the slope is not equal to zero, indicating that there is a relationship between X and Y. We know the sample slope b1 is 4.74 for this example. What we are testing here is whether there is evidence from the sample data to show that the true population slope, beta 1, is not equal to zero. Remember, we always calculate a test statistic from the sample data. The test statistic is b1 over s_b1. By hand this took a bit of calculating. First we needed to calculate s_b1, but to get s_b1 we first needed to calculate s, the standard error, to put into the numerator for s_b1. S is the square root of SSE divided by n minus 2. We had already calculated SSE, the sum of squares for error, as 79.1215. Since n minus 2 is 10 minus 2, or 8, the square root of 79.1215 divided by 8 gives s, 3.1449. So now that we have s, we can put that 3.1449 into the numerator for s_b1, and we get 0.3825 for s_b1. Again, you can review all of these steps in my first video on simple linear regression; the link for that video is in the description box. Now we are ready to put s_b1 into the test statistic, so we have b1 over s_b1, 4.74 divided by 0.3825, and we get a test statistic of 12.3921.

Now that we have conducted a hypothesis test on the slope, we can continue here with a confidence interval estimate on that slope. Here you have the formula for a confidence interval on beta 1, the true slope. The sample slope b1 is 4.74 for this problem. What we want to do is take that sample slope and add and subtract a margin of error so that we have an interval for the true population slope. So the formula is b1, the sample slope, plus and minus t of alpha divided in half times s_b1. We already calculated s_b1, and we got 0.3825, so we have most of what we need from our previous calculations: 4.74 is the sample slope, plus and minus t of alpha divided in half, times s_b1, 0.3825. We are, however, missing our t value. For a 99% confidence interval, what would alpha be? Alpha is one minus our level of confidence, so alpha would be 0.01. Remember, we are constructing a confidence interval here, and with an interval you always have some values above in the upper tail and some values below in the lower tail, so with confidence intervals you always split alpha in half. Alpha divided in half here is 0.005. We are using a t table, so we need degrees of freedom. Degrees of freedom for simple regression is n minus 2, which is 8 for our example, since we have 10 observations. Now let's take a look at our t value in the t table under 8 degrees of freedom and alpha divided in half of 0.005, and we get a t value of 3.355.

So now we can take that t value from the table, 3.355, and put it into the formula for the confidence interval. We get 4.74 plus and minus 3.355 times s_b1, which was 0.3825. That gives us an interval of 4.74 plus and minus the margin of error of 1.2833, so we get a lower interval value of 3.4567 and an upper value of 6.0233. This is a 99% confidence interval on the slope. That means we are 99% confident that the true population slope, beta 1, is contained within this interval. We only know the number 4.74 from the sample of 10 students, but the population slope beta 1 is not necessarily exactly 4.74. It could be a little more or it could be a little less. By adding and subtracting a margin of error around the sample statistic, we are creating an interval within which we can be 99% confident that the true population parameter lies.

Now let's conduct a hypothesis test on the correlation coefficient. Remember, r is the correlation coefficient for the sample. We have a sample of 10 student grades, and we calculated r from that sample of 10 observations. But if we were to have data from the entire population of students, then the correlation coefficient would be denoted by the Greek letter rho. Rho looks like a P, but think of it as an R with a missing leg. Rho is the symbol that we use when we are talking about the population parameter for the correlation coefficient. So we are interested in testing whether or not rho, the population correlation coefficient, is equal to zero. R for this example is positive 0.9749. We know that r can take on values from negative 1 to plus 1, and that zero means no correlation. Our sample correlation coefficient r seems to indicate a strong positive correlation, but this is from a sample of 10 student grades. Is this enough evidence for us to reject the null hypothesis and find support for the alternative hypothesis that rho is not equal to zero? That would indicate a significant relationship or association between X and Y, number of hours studied and grade on exam.

The test statistic we use is a t-test with r in the numerator and the square root of 1 minus r-squared, divided by n minus 2, in the denominator. In the previous video tutorial we conducted a hypothesis test on beta 1, the slope, to see if it was equal to zero or not. Here we are conducting a hypothesis test on rho to see if it is equal to zero or not. These two tests are really one and the same. We will find the same result. In fact, when we do the calculations here for the test statistic, it will produce the same test statistic, the same result. So let's go ahead and calculate the test statistic. R is positive 0.9749, so filling in the formula we get 0.9749 divided by the square root of 1 minus 0.9505, divided by 10 minus 2. R-squared is the coefficient of determination; we had previously calculated r-squared to be 0.9505. And we get 12.3937. We get a number that is very close to the test statistic that we calculated for the slope beta 1. The numbers should be exactly the same. The only reason it is different here is because I'm using a handheld calculator, and therefore there are some negligible rounding errors. Not to worry, Excel will give you an exact number. The rounding errors here will not affect our conclusion.

Back to what we were doing, our hypothesis test for rho, the correlation coefficient. Our test statistic is 12.3937, so does it fall in the rejection region or the non-rejection region? We had previously looked up a critical value for alpha of 0.01, alpha divided in half of 0.005, with 8 degrees of freedom, and remember, we just looked it up, it was 3.355. So 3.355, our critical value, splits our rejection and non-rejection regions, and now we can see that 12.3937, the test statistic, falls clearly in the upper-tail rejection region. Our statistical conclusion is then: reject the null hypothesis. There is evidence that rho is not equal to zero and that a significant correlation exists between grades and number of hours studied.

We can also use the p-value approach to come to the same conclusion. For the p-value approach we use the test statistic, 12.3937, and degrees of freedom, 8. Using the t table under 8 degrees of freedom, we find that the test statistic would be off the chart, somewhere here. This would give us a tail area that is off the chart, but we can extrapolate and conclude that the value would be close to zero, certainly less than 0.0005. Let's use 0.0005 as the number, even though we know the value would be much less than that. Since this is a two-tailed test, we need to double the value that we looked up: 0.0005 times 2 is 0.001. As before, the rejection rule is: reject the null if the p-value is less than or equal to alpha. If we use alpha of 0.01, then our p-value, 0.001, is less than our alpha value of 0.01, so we reject the null hypothesis. This is the same conclusion as the critical value approach. There is evidence that rho is not equal to zero and that a significant correlation exists between grade and number of hours studied.

One more topic for this tutorial, and then we will look at the Excel output for this set of data. Remember, we calculated y-hat, the predicted value of Y for a given X value, using the line of regression shown here. We used an X value of 3 to predict Y. If a student said that the number of hours they studied was 3, then what would y-hat be? Y-hat is the predicted grade for that individual response. Plugging X equal to 3 into the model, we get y-hat equal to 69.268. So if an individual student's value for X is 3 hours studied, then the predicted grade on the exam would be y-hat, 69.268. But remember, y-hat is a point estimate. Suppose we want a more realistic estimate. We would take y-hat plus and minus a margin of error. A confidence interval estimate would be a more realistic way of expressing the individual student's response.

The formula for calculating the confidence interval for an individual response Y for a given X is shown here: y-hat plus and minus t of alpha divided in half, times s, times the square root of 1 plus h sub i. We already have y-hat; that is 69.268. We already looked up the t value for a 99% confidence interval when we calculated the confidence interval for the slope, so let's use that again, 3.355. Also, the standard error, 3.1449, was previously calculated. So all we need to find is h sub i. Here is the formula for h sub i. Let's have a closer look at it. The first part, 1 over n, is easy. N is 10 for our example, so this is simply 1 over 10. Next we have the numerator for the second part of the formula, x sub i minus x-bar, squared. That is different from the denominator, which has a summation sign. The numerator is for just one x sub i, in this case x sub i equal to 3. The denominator, however, covers all of the x sub i's, all ten of them: x-bar is subtracted from each one, the difference is squared, and the results are added up. If you look back to the previous video, you will see we already calculated this when we were calculating the slope. So the denominator will be the sum of the squared deviations of each x sub i from the mean, or 67.6. The numerator is the squared difference for just one x sub i, the one we are interested in, X equal to 3, and we subtract from it the mean, which was 4.8. Again, look back in your notes from the first video. Now we can calculate h sub i, and we get 1 over 10, plus 3 minus 4.8 squared, divided by 67.6, and that is 0.1479.

Now let's go back to the original formula for calculating a confidence interval for Y. We have y-hat for X equal to 3, and for a 99% confidence interval alpha is 0.01. Alpha divided in half is 0.005. With 8 degrees of freedom we looked that up, and we got a t value of 3.355. H sub i we just calculated as 0.1479, and s we have previously determined to be 3.1449. So now we need to plug everything into the formula: y-hat plus and minus 3.355, that's your t value, times 3.1449, that is s, times 1.0714. The 1.0714 is not h sub i; it is the square root of 1 plus h sub i. And that gives us 69.268 plus and minus 11.3044. Let me go back for a second here. This 1.0714, again, is for that entire piece that is circled in red, the square root of 1 plus h sub i. H sub i is 0.1479, so take 1 plus that, and then the square root of it all, and you will get 1.0714. Then you multiply all three terms out to get 11.3044. So 11.3044 is the margin of error, and then we get an interval for an individual response of Y when X is equal to 3 of between 57.9636 and 80.5724. So we are able to say with 99% confidence that if an individual student studies for three hours, so X is equal to 3, then the predicted grade, y-hat, will be between 57.9636 and 80.5724, which is a more realistic way of predicting the student's grade than to say they'll just get a 69.268.

Now let's see how we can get Excel to do most of these computations for us. Here is a worksheet with my ten data values. First click on the Data tab and then choose Data Analysis. Sometimes Data Analysis is not on your Excel ribbon because it hasn't been added in yet. By default it is not added in. In order to add it in, you have to go to the File menu, click Options, then click on Add-Ins, and at the bottom, where it says Manage: Excel Add-ins, click Go. Choose Analysis ToolPak and click OK, and then the Data Analysis option should appear. Once this is done and you have Data Analysis in your ribbon, choose the Regression option in the dialog box and click OK. A new dialog box comes up asking for the range of Y values and X values. Enter the range for the grades for the Y range and hours studied for the X range. Check the Labels box if you have labels in your first row, as we do, and then check Confidence Level if you want the output to have a confidence interval. The default is a 95% confidence interval. You can see I changed that to 99% for our example.

This is the resulting output. The regression equation for our example is y-hat is equal to 55.0355 plus 4.742604 X. You can see these numbers are slightly different from what we calculated by hand. We get this y-hat line from the Excel output very easily. 55.0355 is the y-intercept, b0, and it is given to us here under the column labeled Coefficients and the row labeled Intercept. The value in the equation of 4.742604 is the slope, b1, and we can find that number under the row for the independent variable X. It says hours studied, and it is right there where the green arrow is pointing. We can also find r-squared here. It actually says R Square. Remember, r-squared was 0.950537. The formula for r-squared is SSR over SST, so r-squared is 0.950537. SS refers to the sum of squares, and we can find all of our sums of squares under this column that is labeled SS. SSR is in the row for Regression, SST is in the row for Total, so we get SSR over SST, r-squared, 0.9505. So 95.05 percent of the variation in grades is explained by hours studied. Remember, r-squared is the variation in Y that is explained by X.

Another number we can get from the Excel printout is s. S is a number we use quite a bit in our various calculations. S is the standard error, and we find it right here: 3.144863. So to review, b0, the y-intercept, is here, and b1, the slope, is here. And s_b1, the standard error for the slope, can be found here: 0.382497. The test statistic for the slope is b1 over s_b1, and we can take b1 and s_b1 right off of the Excel output: 4.742604 divided by 0.382497, and we get 12.39905. We can also look at the t Stat column, the test statistic column, and we can see 12.39905 right there, so we don't even have to do the calculations. It is given to us right on the Excel output. Finally, we don't even have to look up the p-value. It is given to us right over here under the column labeled P-value. We can see it is 1.67E-06, and that means that the number is very close to zero. You would move the decimal point to the left six places; E minus 06 means left six places. That gives us a very, very small number, very close to zero, indicating support for the alternative hypothesis that the slope is not equal to zero. If the slope is not equal to zero, then there is a significant relationship between X and Y, hours studied and grade on exam.

The last thing we can look at on this output is a confidence interval estimate for the slope at 99% confidence. We get a lower 99% value of 3.4592 and an upper value of 6.026, and there it is. That concludes our tutorial for part two of simple linear regression. Thank you, and I hope you enjoyed this video.
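The hand calculations in this tutorial can be retraced from the summary numbers alone. The following is a minimal Python sketch, not the presenter's own method: the function and constant names are invented for this note, the critical value 3.355 is hard-coded from the t table lookup described above, and the denominator of s_b1 (the square root of the sum of squared deviations of x, 67.6) is the standard textbook form, which the video does not read out but which reproduces the quoted 0.3825. The interval for an individual response is what many textbooks call a prediction interval.

```python
import math

# Summary numbers quoted in the video's hand calculations.
N = 10              # number of students
B0 = 55.048         # sample y-intercept
B1 = 4.74           # sample slope
SSE = 79.1215       # sum of squares for error
SXX = 67.6          # sum of squared deviations of x from its mean
X_BAR = 4.8         # mean number of hours studied
R = 0.9749          # sample correlation coefficient
R_SQUARED = 0.9505  # coefficient of determination
T_CRIT = 3.355      # t table value: 8 degrees of freedom, upper-tail area 0.005


def standard_error(sse, n):
    """s = sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))


def slope_standard_error(s, sxx):
    """s_b1 = s / sqrt(Sxx); standard form, assumed here (see lead-in)."""
    return s / math.sqrt(sxx)


def slope_interval(b1, s_b1, t_crit):
    """Confidence interval for the slope: b1 plus and minus t * s_b1."""
    margin = t_crit * s_b1
    return b1 - margin, b1 + margin


def correlation_t(r, r_squared, n):
    """t statistic for H0: rho = 0, r / sqrt((1 - r^2) / (n - 2))."""
    return r / math.sqrt((1 - r_squared) / (n - 2))


def h_sub_i(x, x_bar, sxx, n):
    """h_i = 1/n + (x_i - x_bar)^2 / Sxx."""
    return 1 / n + (x - x_bar) ** 2 / sxx


def individual_interval(x, b0, b1, s, h, t_crit):
    """Interval for an individual response: y-hat plus and minus t * s * sqrt(1 + h_i)."""
    y_hat = b0 + b1 * x
    margin = t_crit * s * math.sqrt(1 + h)
    return y_hat - margin, y_hat + margin


if __name__ == "__main__":
    s = standard_error(SSE, N)
    s_b1 = slope_standard_error(s, SXX)
    h = h_sub_i(3, X_BAR, SXX, N)
    print("s:", round(s, 4))
    print("s_b1:", round(s_b1, 4))
    print("t for slope:", round(B1 / s_b1, 2))
    print("99% interval for slope:", slope_interval(B1, s_b1, T_CRIT))
    print("t for rho:", round(correlation_t(R, R_SQUARED, N), 2))
    print("h_i at x = 3:", round(h, 4))
    print("99% interval for y at x = 3:", individual_interval(3, B0, B1, s, h, T_CRIT))
```

Because the sketch carries more decimal places than a handheld calculator, the last digit or two may differ slightly from the figures read out in the video, which is the same rounding effect the presenter mentions when comparing the two test statistics.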
Info
Channel: Learn Something
Views: 21,949
Rating: 4.9288259 out of 5
Keywords: Simple Linear Regression, correlation, hypothesis testing, least squares line, confidence intervals
Id: NIewQv8uhL0
Length: 26min 16sec (1576 seconds)
Published: Tue May 02 2017