Statistics 101: Multiple Linear Regression, Data Preparation

Video Statistics and Information

Captions
- [Brandon] So in this video, we're going to pick up where we left off at the end of Part 1. Remember, in Part 1, we went over the basics of multiple regression. Now if you have not watched that one, I would highly recommend going back and watching it before proceeding with this one. So at the end of Part 1, we talked about all of the prep work you have to do in multiple regression before actually running the numbers and that's what this video is going to be about. So let's go ahead and reset the problem cause it is a little bit different, I did tweak it a little bit from Part 1. So let's assume that you are a small business owner for Regional Delivery Service Incorporated, or RDS, who offers same day delivery for letters, packages, and other small cargo. You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs. Therefore, some trips will have more than one delivery. Now this is the same thing that UPS, FedEx, the Postal Service does, or any other delivery service. Now as the owner, you would like to be able to estimate how long a delivery will take based on three factors that you have deemed important. One, the total distance of the trip in miles, makes sense. Two, the number of deliveries that must be made during the trip. So we sort of assume that the more deliveries that must be made, the longer the delivery will take. And three, the daily price of gas in U.S. dollars. So, maybe delivery drivers drive slower to use less fuel if gas or petrol is more expensive. So those are the three variables we're gonna look at when estimating the delivery time. So as we discussed in Part 1, conducting multiple regression analysis requires a fair amount of pre-work before actually running the regression. So here are the steps. One, generate a list of potential variables, the independent variables and the dependent variable. 
Now we've already done that as far as me setting up the problem for you but if you are doing your own problem, you would really have to think ahead of time what variables make sense, what potential independent variables explain the dependent variable and so forth. And of course, two, collect data on the variables. Three, and this is most important for this video, we're gonna check the relationships between each independent variable and the dependent variable using scatterplots and correlations. Four, also important for this video, we're gonna check the relationships among the independent variables again using scatterplots and correlations. So we're really gonna focus on steps three and four in this video. And five, which is technically optional, but I would do anyway, and that is conduct simple linear regressions for each independent variable and dependent variable pair. And we'll talk about that more as we go. Six, use the non-redundant independent variables, and we'll talk about what that means, in the analysis to find the best fitting model. And then seven, use the best fitting model we come up with to make predictions about the dependent variable. In this case, the delivery time. So to conduct your analysis, you take a random sample of ten past trips and record four pieces of information for each trip. One, the total miles traveled, that's our first variable. Two, the number of deliveries, that's our second variable. Three, the daily gas price, that's our third independent variable. And four, the total travel time in hours, that's our dependent variable. So you do that and you come up with something like this. So the first column is miles traveled, that's our first independent variable, x1. Then we have number of deliveries, that's x2. Gas price I should have labeled a bit differently because it is added to what we did in Part 1, but gas price is x3, our third independent variable. And then travel time in hours is our dependent variable there in the right hand column.
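The ten-trip sample described above can be sketched as a small table in Python using pandas. Only the first trip's values (89 miles, four deliveries, $3.84, seven hours) are quoted directly in the video; the remaining rows here are illustrative stand-ins, not the video's actual data.

```python
import pandas as pd

# Hypothetical sample of 10 past trips. Only the first row matches the
# figures quoted in the video; the rest are illustrative placeholders.
trips = pd.DataFrame({
    "miles_traveled":  [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],        # x1
    "num_deliveries":  [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],                    # x2
    "gas_price":       [3.84, 3.19, 3.78, 3.89, 3.57,
                        3.57, 3.03, 3.51, 3.54, 3.25],                    # x3
    "travel_time_hrs": [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],    # y
})
print(trips)
```

Each row is one trip, read exactly the way the video reads a record: miles, deliveries, gas price, then travel time.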
So, as an example, for the first trip, we traveled 89 miles, there were four deliveries during that trip, and on that day, the average price of gasoline was $3.84 and then the travel time took seven hours. So that's how we would read each entry or each record, if you want, of the data. So we wanna sketch out our relationships. Now remember that multiple regression is a many-to-one relationship. So in this case, we have our dependent variable, travel time, which we designate as y, and then we have our three independent variables, miles traveled, which is x1; number of deliveries, which is x2; and the gas price, which is x3. So from each independent variable to the dependent variable, we have a relationship. So it's a total of three relationships. But, we're not done there. We also have relationships among the independent variables themselves. So there are three relationships there. You can see them as the dotted lines. So, altogether, we have six relationships we have to analyze during our prep work. So as far as relationships of the independent variable to the dependent variable, we have one, two, three. Three relationships to analyze and we'll go ahead and do that next. So in this section, we're gonna look at the independent variable to dependent variable scatterplots. And what we're doing here is checking the relevancy of each independent variable and as we go you'll see what we mean by that. So here is our first scatterplot. So you can see we have miles traveled on the x-axis, that's our first independent variable, x1. And then on the y-axis on the left, we have time traveled, that's our dependent variable. Now what we're looking for here is a relatively strong linear relationship. Now if you look at the dots in the scatterplot, you can see that it's a fairly strong linear relationship. So we wanna make sure we note that and realize that our first independent variable and the dependent variable do have a strong linear relationship.
Now we'll do the same for the second independent variable. So we have number of deliveries on the bottom and our dependent variable, y, travel time along the y-axis and again it appears that we have a very strong linear relationship between our second independent variable and our dependent variable. And then finally we have our third independent variable, gas price, and we're looking for a linear relationship to the dependent variable over on the left hand side, on the y-axis. But as you can see, there isn't one. The scatterplot is not in any discernible pattern, the data points are all over the place, they don't form a line of any sort. So we can just say, based on visual examination, that gas price, our third independent variable, really does not have any strong linear relationship or linear relationship at all, to our dependent variable, time traveled. So here is a summary of our scatterplots. So of course, in the first case, our first independent variable, miles traveled, that had a strong linear relationship with travel time. Our second independent variable, number of deliveries, also had a strong linear relationship with our dependent variable, travel time. But gas price did not, so in that case, we're gonna put a little x there, just noting to ourselves that the gas price does not have a very strong relationship to our dependent variable. So, here's our scatterplot summary. For dependent variable and independent variables, the travel time, the dependent variable, appears highly correlated with miles traveled. That's our first independent variable. Travel time also appears highly correlated with our second independent variable, number of deliveries. However, travel time does not appear highly correlated with gas price. Remember in that case, the data points were sort of in a random blob all over the scatterplot so no linear relationship there.
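The three independent-variable-versus-dependent-variable scatterplots described above can be sketched with matplotlib. The data values here are illustrative stand-ins (only the first trip's figures are quoted in the video), so treat this as a template, not the video's plots.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Illustrative stand-in data for the ten sampled trips.
miles       = [89, 66, 78, 111, 44, 77, 80, 66, 109, 76]
deliveries  = [4, 1, 3, 6, 1, 3, 3, 2, 5, 3]
gas_price   = [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25]
travel_time = [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4]

# One scatterplot per independent variable against the dependent variable:
# a roughly linear cloud suggests relevance, a shapeless blob does not.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
pairs = [(miles, "miles traveled (x1)"),
         (deliveries, "number of deliveries (x2)"),
         (gas_price, "gas price (x3)")]
for ax, (x, label) in zip(axes, pairs):
    ax.scatter(x, travel_time)
    ax.set_xlabel(label)
    ax.set_ylabel("travel time in hours (y)")
fig.tight_layout()
fig.savefig("iv_vs_dv_scatterplots.png")
```

With data like this, the first two panels form clear upward lines while the gas-price panel stays a blob, mirroring the video's visual check.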
Now since gas price, our third independent variable, does not appear correlated with the dependent variable, we would not use that variable in the multiple regression. So as you can see here, sort of a precursor, or a requirement for even including an independent variable in the multiple regression, is that before we even begin, it has to have some sort of relatively strong linear relationship. Otherwise, it does not make any sense to include it because it's not gonna make any contribution to the prediction. For now, we will keep gas price in and then take it out later, just for learning purposes. So if we were actually doing this regression, we would just go ahead and take out gas price and continue forward. But for learning purposes, I'm going to go ahead and keep it in so you can see how it affects the regression as we go, just keep that in mind. Sketching out some more relationships, remember we have three more we have to consider. And that is the three relationships between the independent variables themselves. So in the independent variable to independent variable scatterplots, what we're checking for is multicollinearity. And remember, that is correlation among the independent variables themselves, which as we talked about in Part 1, can cause some serious problems with your regression. So again, this is a scatterplot of independent variable compared with independent variable. So we have miles traveled, which is our first independent variable on the x-axis. And then we have number of deliveries, which is our second independent variable on the y-axis. Now as you can see, it's very apparent that there is a very strong linear relationship between these two independent variables. So we definitely wanna note that as we go forward. So our next scatterplot, we're looking at our first independent variable, miles traveled, and our third independent variable, gas price, which is x3.
And as you can see, there really is no discernible linear pattern or linear relationship between these two independent variables. And finally, we have number of deliveries, which is our second independent variable and we're looking at that scatterplot with gas price on the y-axis and again, we can see that there is no discernible linear pattern, or linear relationship, between the second and third independent variables. So here's our independent variables scatterplot summary. And again, we're checking for multicollinearity. Now the first graph shows us that we have a potential problem. The first two independent variables, miles traveled and number of deliveries, seem to be highly correlated, have a highly linear relationship, and that is the pure definition of multicollinearity. And we know that including both of those in the regression can cause some serious problems because the regression, or the computer when it does it, is not really sure what coefficients to assign to those two variables if they are so similar. The example I used in Part 1 was that if I'm cooking dinner and I add table salt to my dinner and I add sea salt to my dinner, when I eat it, all I can tell is that it's salty. I can't tell if it's the sea salt or the table salt because they're so similar in taste. And this is what happens when two independent variables are correlated in multiple regression. Now x1 and x3 are not correlated so that's okay. And then x2 and x3 are not correlated, so again, that's okay. So in terms of multicollinearity, we might have a problem, or we're probably going to have a problem with the multicollinearity between x1 and x2. So, the summary of the scatterplots for the independent variables. So number of deliveries appears highly correlated with miles traveled and this is, by definition, multicollinearity. However, miles traveled does not appear highly correlated with gas price and gas price does not appear correlated with number of deliveries.
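Rather than drawing the three independent-variable-versus-independent-variable scatterplots one at a time, a scatterplot matrix shows every pairwise relationship at once. This sketch uses pandas' `scatter_matrix`; the data values are illustrative stand-ins, not the video's actual data.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen
import pandas as pd
from pandas.plotting import scatter_matrix

# Illustrative stand-in data for the three independent variables.
ivs = pd.DataFrame({
    "miles (x1)":      [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "deliveries (x2)": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gas price (x3)":  [3.84, 3.19, 3.78, 3.89, 3.57,
                       3.57, 3.03, 3.51, 3.54, 3.25],
})

# Every off-diagonal panel is one IV-vs-IV scatterplot; any panel that
# forms a clear line is a multicollinearity warning sign.
axes = scatter_matrix(ivs, figsize=(8, 8), diagonal="hist")
```

With data like this, the miles-versus-deliveries panels form a tight line while the gas-price panels do not, matching the video's three separate plots.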
Now since number of deliveries is highly correlated with miles traveled, we would not use both in the multiple regression. They are redundant. Because they appear, at least visually, to be so highly correlated, have a very strong linear relationship, they are called redundant and we would only use one of them in the actual multiple regression. Now again, for now, we're gonna go ahead and keep both in there and then take out one later just for learning purposes. So, again, if we were doing the actual multiple regression, for an assignment or for an analysis with our place of business, we would not include both of those in there, we would take one of them out because they are so highly correlated. So up to this point, we've just used some scatterplots, we've done some visual examination with some lines that I've just sort of eyeballed. So what we want to do is actually go ahead and run correlation analysis among the variables. So I went ahead and went into Minitab because it gives back more information, now you could do this in Excel, but Excel does not give the p-values for the correlations. So I went ahead and used Minitab to do this analysis. So the first thing I wanna draw your attention to is the bottom row. So we have travel time, y. Now remember, that's our dependent variable. And in the columns we have our three independent variables. So if you look at the intersection of travel time, y, and miles traveled, we can see that those two have a correlation of .928. That is very, very strong in terms of correlation. Now the number below that, in the italic, the .000, that's the actual p-value for that correlation. So that means it's actually less than .001. Now remember, our threshold is .05. So if it's below .05, we would say that is statistically significant and of course in this case, it is. Now if we move over, we have the correlation between our dependent variable travel time, y, and number of deliveries.
And again, that correlation is very strong at .916, with a p-value of .000 or less than .001. And then finally in the last column, we have the intersection of travel time and gas price. And that correlation is only .267. And its p-value is .455, which is way above .05. So that is not significant and that just serves as confirmation of our scatterplots. So scatterplots are simply a visual examination, a quick way to look at variables in terms of pairs. We go ahead and actually run the correlation to get an objective measure of the relationship and we can see across the bottom that they do line up or they do confirm what we saw in our scatterplots. So look at the top two rows now. At the first intersection, we have miles traveled, x1, in the column. We have number of deliveries, x2, in the row. That correlation is extremely strong. So .956 and again, a p-value of .000 or less than .001. So that is a very strong red flag for multicollinearity. We have two independent variables that have a correlation of .956. That is extremely strong. So that's one of the things you wanna look out for. Now in the second row, we have miles traveled vs. gas price, that correlation is only .356 with a p-value of .313, again not a strong correlation and not significant. And then finally, we have the intersection of number of deliveries, x2, and gas price, x3. That correlation is .498, again not very strong with a p-value of .143. So these correlations just serve as a confirmation of the analysis we did in our scatterplots and it will help us decide which variables to leave in the regression and which variables to take out when we actually do the analysis. So let's go ahead and do a summary of our correlations. This first graph is miles traveled, x1, and our dependent variable, y, which is travel time.
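The correlation matrix with p-values that Minitab produces can be reproduced with `scipy.stats.pearsonr`, which returns both the Pearson correlation and its p-value for each pair (this is exactly what plain Excel's CORREL omits). The data values below are illustrative stand-ins chosen to behave like the video's sample.

```python
from itertools import combinations
from scipy.stats import pearsonr

# Illustrative stand-in data for the ten sampled trips.
data = {
    "miles (x1)":      [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "deliveries (x2)": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gas price (x3)":  [3.84, 3.19, 3.78, 3.89, 3.57,
                        3.57, 3.03, 3.51, 3.54, 3.25],
    "travel time (y)": [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
}

# Pairwise Pearson correlations with p-values, flagged against the
# .05 significance threshold used in the video.
for a, b in combinations(data, 2):
    r, p = pearsonr(data[a], data[b])
    flag = "significant" if p < 0.05 else "not significant"
    print(f"{a} vs {b}: r = {r:.3f}, p = {p:.3f} ({flag})")
```

Run this and the same pattern emerges: miles and deliveries are each strongly and significantly correlated with travel time (and with each other), while gas price is not significantly correlated with anything.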
So that correlation was .928, the p-value .000, less than .001 so we give that a check mark cause again, in this case, we're confirming that our independent variable is strongly related to our dependent variable cause that is a requirement of even having that independent variable in the regression. So our second independent variable has a correlation with the dependent variable of .916, the p-value less than .001 so we'll give that a green check, that passes the test. And then finally, our third independent variable, gas price, that had a correlation of .267, a p-value of .455 so that does not make the cut, we'll put an x through that one and note to ourselves that we will most likely not include that in our multiple regression. So the first two variables make the cut but the third one, gas price, does not. So let's go ahead and look at a correlation summary of the independent variable comparisons. So again here we're looking for multicollinearity, which is a high correlation between independent variables. So our first pair, miles traveled, x1, and number of deliveries, x2, we have a correlation of .956 with a p-value that's less than .001. This is a problem. Now two independent variables that are that highly correlated, above .95 in correlation are gonna be multicollinear and therefore we cannot include both of them in the regression. So we'll talk about that as we go forward. But this is an example of a problem we wanna look for between independent variables. The second one, we have miles traveled, x1, and gas price, x3, the correlation's .356 with a p-value of .313, no problems there, no risk of multicollinearity. And then the last one we have number of deliveries, x2, compared to gas price, x3, correlation of .498 with a p-value of .143, again, no problem with multicollinearity there as well. So it appears the two offending, or the two problematic variables are x1, miles traveled, and x2, number of deliveries. They have a very high correlation with each other.
And let's step back for a minute. This should make sense in real life. The number of miles we travel is gonna be highly related to the number of deliveries we have on that trip. So if we have more deliveries, we're probably gonna have to drive more. That makes sense in real life. But in the regression, we can substitute one for the other because they are so highly correlated and we will only include one in the regression, as you'll see in the future. Okay, so let's go ahead and look at our correlation summary. So correlation analysis confirms the conclusions reached by visual examination of the scatterplots. So we have some redundant multicollinear variables. So miles traveled and number of deliveries are both highly correlated with each other and therefore, are redundant, only one should be used in the final multiple regression analysis. Now we do have a non-contributing variable. So gas price is not correlated with the dependent variable really at all and should be excluded. So review and conclusion. So in multiple regression, a lot of prep work must be done before ever clicking the "Run" button in your software. Do not blindly mash buttons in stats software. Step back, think about the variables, do some simple scatterplots, do some correlations, look at relationships among all of that, and then decide how you wanna proceed. There are some techniques we discussed: scatterplots, correlation analysis, and then individual or group regressions. Now we did not do that in this video, we will talk about that in the next video but that's another technique you can employ to examine those relationships. So next steps from here. For the sake of learning, we are going to break the rules and include all three independent variables we talked about in the regression at first. Then, we will remove the problematic independent variables, as we should, and then watch what happens to the regression results.
We will also perform simple regressions with the dependent variable to use as a baseline, again for the sake of learning. So we will do a simple regression with the first independent variable and the dependent variable, the second independent variable and the dependent variable, and the third and the dependent variable. And in the end, we will come up with the best fitting regression model. And finally, we will do more examples in future videos that are a bit different from the one in this video. Again, just for the sake of learning. Okay, so we have completed Part 2 of our series on multiple regression. So quick recap, in Part 1, we went over the very basics. Here in Part 2 we talked about the prep work we have to do before actually running the regression in a computer. Now in subsequent parts, we'll talk about actually running the regression. We'll talk about picking the variables using different techniques, using the computer. We'll talk about how to interpret the results we get from the computer and we'll talk about how to use the equation we get to actually make predictions and some of the limitations around those predictions. So we have much more to go in talking about multiple regression. So if you're not a subscriber, please click Subscribe up here in the top right, I would appreciate that very much. I've also included some links down here on the right to the playlist page where you can find all the stats videos, to the playlist for multiple regression, and the playlist for simple regression if you need to go back and touch up on those concepts. So, again, thank you very much for watching, I wish you the best of luck in your work and in your studies and look forward to seeing you again next time. (gentle guitar)
Info
Channel: Brandon Foltz
Views: 351,201
Rating: 4.9675498 out of 5
Keywords: statistics 101 multiple regression, statistics 101: multiple regression, brandon foltz multiple regression, multiple regression brandon foltz, statistics 101 multiple regression (part 2), statistics 101: multiple regression (part 2), statistics 101 regression, multiple regression analysis, multiple regression, statistics 101, brandon foltz, linear regression, Regression Analysis, machine learning, machine learning basics, machine learning tutorial, Multiple linear regression
Id: 2I_AYIECCOQ
Length: 24min 4sec (1444 seconds)
Published: Sun Dec 07 2014