- [Brandon] So in this
video, we're going to pick up where we left off at the end of Part 1. Remember, in Part 1, we went over the basics of multiple regression. Now if you have not watched that one, I would highly recommend
going back and watching it before proceeding with this one. So at the end of Part 1, we talked about all of the prep work you have to do in multiple regression before actually running the numbers and that's what this video
is going to be about. So let's go ahead and reset the problem cause it is a little different; I did tweak it a bit from Part 1. So let's assume that you
are a small business owner for Regional Delivery
Service Incorporated, or RDS, who offers same day delivery for letters, packages,
and other small cargo. You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs. Therefore, some trips will
have more than one delivery. Now this is the same
thing that UPS, FedEx, the Postal Service,
or any other delivery service does. Now as the owner, you would
like to be able to estimate how long a delivery will
take based on three factors that you have deemed important. One, the total distance of the
trip in miles, which makes sense. Two, the number of
deliveries that must be made during the trip. So we sort of assume
that the more deliveries that must be made, the longer
the delivery will take. And three, the daily price
of gas in U.S. dollars. So, maybe delivery drivers drive slower to use less fuel if gas or
petrol is more expensive. So those are the three
variables we're gonna look at when estimating the delivery time. So as we discussed in Part 1, conducting multiple regression analysis
requires a fair amount of pre-work before actually
running the regression. So here are the steps. One, generate a list
of potential variables, the independent variables
and the dependent variable. Now we've already done that as far as me setting up the problem for you but if you are doing your own problem, you would really have to think ahead of time what variables make sense, what potential independent variables explain the dependent
variable and so forth. And of course, two, collect
data on the variables. Three, and this is most
important for this video, we're gonna check the relationships between each independent variable
and the dependent variable using scatterplots and correlations. Four, also important for this video, we're gonna check the relationships among the independent variables again using scatterplots and correlations. So we're really gonna focus on steps three and four in this video. And five, which is technically optional, but I would do anyway, and that is conduct
simple linear regressions for each independent variable
dependent variable pair. And we'll talk about that more as we go. Six, use the non-redundant
independent variables, and we'll talk about what that means, in the analysis to find
the best fitting model. And then seven, use the best
fitting model we come up with to make predictions about
the dependent variable. In this case, the delivery time. So to conduct your analysis, you take a random sample of ten past trips and record four pieces of
information for each trip. One, the total miles traveled,
that's our first variable. Two, the number of deliveries,
that's our second variable. Three, the daily gas price, that's our third independent variable. And four, the total travel time in hours, that's our dependent variable. So you do that and you come
up with something like this. So the first column is miles traveled, that's our first independent variable, x1. Then we have number of
deliveries, that's x2. Gas price, which I added
beyond what we did in Part 1, is x3, our third
independent variable. And then travel time in hours
is our dependent variable there in the right hand column. So, as an example, for the first
trip, we traveled 89 miles, there were four deliveries
during that trip, and on that day, the average
price of gasoline was $3.84 and then the travel time took seven hours. So that's how we would read each entry or each record, if you want, of the data. So we wanna sketch out our relationships. Now remember that multiple regression is a many-to-one relationship. So in this case, we have
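The data table just described can be sketched as a plain Python structure. Only trip 1's values are given in the video, so the other nine records are left as a placeholder for your own sample:

```python
# Each record: (miles traveled x1, number of deliveries x2,
#               gas price x3, travel time in hours y).
trips = [
    (89, 4, 3.84, 7.0),  # trip 1, as read out in the video
    # ... the remaining nine sampled trips go here ...
]

# Reading a record back the same way we read the table:
x1, x2, x3, y = trips[0]
print(f"Trip 1: {x1} miles, {x2} deliveries, gas at ${x3:.2f}, {y:g} hours")
```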
our dependent variable, travel time, which we designate as y, and then we have our three
independent variables, miles traveled, which is x1; number of deliveries, which is x2; and the gas price, which is x3. So from each independent variable to the dependent variable,
we have a relationship. So it's a total of three relationships. But, we're not done there. We also have relationships among the independent
variables themselves. So there are three relationships there. You can see them there as the dotted lines. So, altogether, we have six relationships we have to analyze during our prep work. So as far as relationships
of the independent variables to the dependent variable,
we have one, two, three. Three relationships to analyze and we'll go ahead and do that next. So in this section, we're gonna look at the
independent variable to dependent variable scatterplots. And what we're doing here is checking the relevancy of each independent variable and as we go you'll see
what we mean by that. So here is our first scatterplot. So you can see we have miles
traveled on the x-axis, that's our first independent variable, x1. And then on the y-axis on the left, we have time traveled, that's
our dependent variable. Now what we're looking for here is a relatively strong
linear relationship. Now if you look at the
dots in the scatterplot, you can see that it's a fairly
strong linear relationship. So we wanna make sure we note that and realize that our
first independent variable and the dependent variable do have a strong linear relationship. Now we'll do the same for the
second independent variable. So we have number of
deliveries on the bottom and our dependent variable, y, travel time along the y-axis and again it appears that we have a very strong linear relationship between our second independent variable and our dependent variable. And then finally we have our
third independent variable, gas price and we're looking for a linear relationship to
the dependent variable over on the left hand side, on the y-axis. But as you can see, there isn't one. The scatterplot is not in
any discernible pattern, the data points are all over the place, they don't form a line of any sort. So we can just say, based
on visual examination, that gas price, our third
independent variable, really does not have any
strong linear relationship or linear relationship at all, to our dependent variable, time traveled. So here is a summary of our scatterplots. So of course, in the first case, our first independent
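Scatterplots like the ones just described can be drawn with matplotlib. The numbers below are made-up illustrative values, not the video's data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to view interactively
import matplotlib.pyplot as plt

# Made-up illustrative values -- substitute your own ten-trip sample.
miles = [45, 60, 72, 89, 95, 110, 38, 80, 55, 101]
hours = [3.5, 4.4, 5.1, 7.0, 7.2, 8.4, 2.9, 6.1, 4.0, 7.8]

fig, ax = plt.subplots()
ax.scatter(miles, hours)
ax.set_xlabel("Miles traveled (x1)")
ax.set_ylabel("Travel time in hours (y)")
ax.set_title("Checking for a linear relationship")
fig.savefig("x1_vs_y.png")
```

You would repeat this once per pair, six plots in all for this problem.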
variable, miles traveled, that had a strong linear
relationship with travel time. Our second independent
variable, number of deliveries, also had a strong linear relationship with our dependent variable, travel time. But gas price did not, so in that case, we're gonna put a little x there, just noting to ourselves
that the gas price does not have a very strong relationship to our dependent variable. So, here's our scatterplot summary. For dependent variable
and independent variables, the travel time, the dependent variable, appears highly correlated
with miles traveled. That's our first independent variable. Travel time also appears highly correlated with our second independent
variable, number of deliveries. However, travel time does not appear highly correlated with gas price. Remember in that case, the data points were sort of in a random
blob all over the scatterplot so no linear relationship there. Now since gas price, our
third independent variable, does not appear correlated
with the dependent variable, we would not use that variable
in the multiple regression. So as you can see here,
sort of a precursor, or a requirement for even including an independent variable in
the multiple regression, is that before we even
begin, it has to have some sort of relatively
strong linear relationship. Otherwise, it does not make
any sense to include it because it's not gonna
make any contribution to the prediction. For now, we will keep gas price in and then take it out later,
just for learning purposes. So if we were actually
doing this regression, we would just go ahead
and take out gas price and continue forward. But for learning purposes,
I'm going to go ahead and keep it in so you can see how it affects the regression as we go, just keep that in mind. Sketching out some more relationships, remember we have three
more we have to consider. And that is the three relationships between the independent
variables themselves. So in the independent variable to independent variable scatterplots, what we're checking for is multicollinearity. And remember, that is correlation among the independent
variables themselves, which as we talked about in Part 1, can cause some serious
problems with your regression. So again, this is a scatterplot of independent variable compared
with independent variable. So we have miles traveled, which is our first independent
variable on the x-axis. And then we have number of deliveries, which is our second independent
variable on the y-axis. Now as you can see, it's very apparent that there is a very
strong linear relationship between these two independent variables. So we definitely wanna
note that as we go forward. So our next scatterplot, we're looking at our first independent
variable, miles traveled, and our third independent
variable, gas price, which is x3. And as you can see, there really is no discernible linear pattern
or linear relationship between these two independent variables. And finally, we have number of deliveries, which is our second independent variable and we're looking at that scatterplot with gas price on the y-axis and again, we can see that there is no discernible linear pattern,
or linear relationship, between the second and
third independent variables. So here's our independent
variables scatterplot summary. And again, we're checking
for multicollinearity. Now the first graph shows us that we have a potential problem. The first two independent variables, miles traveled and number of deliveries, seem to be highly correlated, with a very strong linear relationship, and that is the very definition
of multicollinearity. And we know that including both of those in the regression can
cause some serious problems because the regression, or
the computer when it does it, is not really sure what
coefficients to assign to those two variables
if they are so similar. The example I used in Part 1 was that if I'm cooking dinner and I add table salt to my dinner and I add sea salt to my dinner, when I eat it, all I can
tell is that it's salty. I can't tell if it's the
sea salt or the table salt because they're so similar in taste. And this is what happens when
two independent variables are correlated in multiple regression. Now x1 and x3 are not
correlated so that's okay. And then x2 and x3 are not correlated, so again, that's okay. So in terms of multicollinearity, we might have a problem, or we're probably going to have a problem
with the multicollinearity between x1 and x2. So, the summary of the scatterplots for the independent variables. So number of deliveries
appears highly correlated with miles traveled and this is, by definition, multicollinearity. However, miles traveled does not appear highly correlated with gas price and gas price does not appear correlated with number of deliveries. Now since number of deliveries is highly correlated with miles traveled, we would not use both in
the multiple regression. They are redundant. Because they appear, at least visually, to be so highly correlated, have a very strong linear relationship, they are called redundant
and we would only use one of them in the actual
multiple regression. Now again, for now, we're gonna go ahead and keep both in there and
then take out one later just for learning purposes. So, again, if we were doing
the actual multiple regression, for an assignment or for an analysis with our place of business, we would not include
both of those in there, we would take one of them out because they are so highly correlated. So up to this point, we've
just used some scatterplots, we've done some visual
examination with some lines that I've just sort of eyeballed. So what we want to do is actually go ahead and run correlation analysis
among the variables. So I went ahead and used Minitab
because it gives back more information; you could do this in Excel, but Excel does not give the
p-values for the correlations. So Minitab is what I used for this analysis. So the first thing I wanna
draw your attention to is the bottom row. So we have travel time, y. Now remember, that's
our dependent variable. And in the columns we have our
three independent variables. So if you look at the intersection of travel time, y, and miles traveled, we can see that those two
have a correlation of .928. That is very, very strong
in terms of correlation. Now the number below that, in italics, the .000, that's the actual
p-value for that correlation. So that means it's
actually less than .001. Now remember, our threshold is .05. So if it's below .05, we would say that is statistically significant and of course in this case, it is. Now if we move over,
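If you don't have Minitab, the same correlation-plus-p-value pair can be computed with SciPy's `pearsonr`. The numbers here are made-up illustrative values, not the video's data:

```python
from scipy.stats import pearsonr

# Made-up illustrative values -- substitute your own ten-trip sample.
miles = [45, 60, 72, 89, 95, 110, 38, 80, 55, 101]
hours = [3.5, 4.4, 5.1, 7.0, 7.2, 8.4, 2.9, 6.1, 4.0, 7.8]

r, p = pearsonr(miles, hours)
print(f"correlation r = {r:.3f}, p-value = {p:.3f}")
if p < 0.05:
    print("statistically significant at the .05 threshold")
```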
we have the correlation between our dependent
variable travel time, y, and number of deliveries. And again, that correlation
is very strong at .916, with a p-value of .000 or less than .001. And then finally in the last column, we have the intersection of
travel time and gas price. And that correlation is only .267. And its p-value is .455,
which is way above .05. So that is not significant
and that just serves as confirmation of our scatterplots. So scatterplots are simply
a visual examination, a quick way to look at
variables in terms of pairs. We go ahead and actually
run the correlation to get an objective
measure of the relationship and we can see across the bottom that they do line up or they do confirm what we saw in our scatterplots. So look at the top two rows now. At the first intersection, we have miles traveled, x1, in the column. We have number of
deliveries, x2, in the row. That correlation is extremely strong. So .956 and again, a p-value
of .000 or less than .001. So that is a very strong red
flag for multicollinearity. We have two independent variables that have a correlation of .956. That is extremely strong. So that's one of the things
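One way to automate this red-flag check is to compute the full correlation matrix and flag any independent-variable pair above a cutoff. The data below and the 0.9 cutoff are illustrative assumptions on my part, not from the video:

```python
import pandas as pd

# Made-up illustrative values -- substitute your own sample.
df = pd.DataFrame({
    "miles_x1":      [45, 60, 72, 89, 95, 110, 38, 80, 55, 101],
    "deliveries_x2": [2, 3, 3, 4, 5, 6, 2, 4, 3, 5],
    "gas_x3":        [3.1, 3.9, 3.4, 3.8, 3.2, 3.6, 3.5, 3.0, 3.7, 3.3],
})

corr = df.corr()  # pairwise Pearson correlations
CUTOFF = 0.9      # illustrative threshold for "suspiciously correlated"
cols = list(corr.columns)
flagged = [(a, b, corr.loc[a, b])
           for i, a in enumerate(cols)
           for b in cols[i + 1:]
           if abs(corr.loc[a, b]) > CUTOFF]
for a, b, r in flagged:
    print(f"possible multicollinearity: {a} vs. {b} (r = {r:.3f})")
```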
you wanna look out for. Now in the second row, we have
miles traveled vs. gas price, that correlation is only
.356 with a p-value of .313, again not a strong correlation
and not significant. And then finally, we have the intersection of number of deliveries,
x2, and gas price, x3. That correlation is .498,
again not very strong with a p-value of .143. So these correlations just
serve as a confirmation of the analysis we did in our scatterplots and it will help us decide which variables to leave into the regression
and which variables to take out when we
actually do the analysis. So let's go ahead and do a
summary of our correlations. This first graph is miles traveled, x1, and our dependent variable,
y, which is travel time. So that correlation was
.928, the p-value .000, less than .001 so we
give that a check mark cause again, in this
case, we're confirming that our independent
variable is strongly related to our dependent variable
cause that is a requirement of even having that independent
variable in the regression. So our second independent variable has a correlation with the
dependent variable of .916, the p-value less than
.001 so we'll give that a green check, that passes the test. And then finally, our
third independent variable, gas price, that had a correlation of .267, a p-value of .455 so that
does not make the cut, we'll put an x through that
one and note to ourselves that we will most likely not include that in our multiple regression. So the first two variables make the cut but the third one, gas price, does not. So let's go ahead and look
at a correlation summary of the independent variable comparisons. So again here we're looking
for multicollinearity, which is a high correlation
between independent variables. So our first pair, miles traveled, x1, and number of deliveries, x2, we have a correlation of .956 with a p-value that's less than .001. This is a problem. Now two independent variables that are that highly correlated, above .95 in correlation
are gonna be multicollinear and therefore we cannot include both of them in the regression. So we'll talk about that as we go forward. But this is an example of a problem we wanna look for between
independent variables. The second one, we have
miles traveled, x1, and gas price, x3, the correlation's .356 with a p-value of .313, no problems there, no risk of multicollinearity. And then the last one we
have number of deliveries, x2, compared to gas price,
x3, correlation of .498 with a p-value of .143, again, no problem with multicollinearity there as well. So it appears the two offending, or the two problematic variables are x1, miles traveled, and
x2, number of deliveries. They have a very high
correlation with each other. And let's step back for a minute. This should make sense in real life. The number of miles we travel is gonna be highly related to the number of deliveries we have on that trip. So if we have more deliveries, we're probably gonna have to drive more. That makes sense in real life. But in the regression,
we can substitute one for the other because they
are so highly correlated and we will only include
one in the regression, as you'll see in the future. Okay, so let's go ahead and
look at our correlation summary. So correlation analysis
confirms the conclusions reached by visual examination of the scatterplots. So we have some redundant
multicollinear variables. So miles traveled and number of deliveries are both highly correlated with each other and therefore, are redundant,
only one should be used in the final multiple regression analysis. Now we do have a
non-contributing variable. So gas price is not correlated with the dependent variable really at all and should be excluded. So review and conclusion. So in multiple regression,
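The two rules in this summary, drop variables that don't correlate with the dependent variable, then keep only one of any redundant pair, can be sketched as a small helper. The `screen_predictors` name, its thresholds, and the sample values are my own illustrative choices, not from the video:

```python
from scipy.stats import pearsonr

def screen_predictors(X, y, alpha=0.05, redundancy_r=0.9):
    """Sketch of the prep-work rules.
    X: dict mapping variable name -> list of values; y: dependent values."""
    # Rule 1 (relevancy): keep a variable only if its correlation
    # with the dependent variable is statistically significant.
    relevant = [name for name, xs in X.items() if pearsonr(xs, y)[1] < alpha]
    # Rule 2 (redundancy): of two highly correlated survivors, keep only one.
    kept = []
    for name in relevant:
        if all(abs(pearsonr(X[name], X[other])[0]) < redundancy_r
               for other in kept):
            kept.append(name)
    return kept

# Made-up illustrative values -- substitute your own sample.
X = {
    "miles_x1":      [45, 60, 72, 89, 95, 110, 38, 80, 55, 101],
    "deliveries_x2": [2, 3, 3, 4, 5, 6, 2, 4, 3, 5],
    "gas_x3":        [3.1, 3.9, 3.4, 3.8, 3.2, 3.6, 3.5, 3.0, 3.7, 3.3],
}
y = [3.5, 4.4, 5.1, 7.0, 7.2, 8.4, 2.9, 6.1, 4.0, 7.8]
print(screen_predictors(X, y))
```

With these values, the non-contributing variable falls to rule 1 and one of the redundant pair falls to rule 2, leaving a single predictor.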
a lot of prep work must be done before ever clicking the "Run" button in your software. Do not blindly mash
buttons in stats software. Step back, think about the variables, do some simple scatterplots,
do some correlations, look at relationships among all of that, and then decide how you wanna proceed. There are some techniques we discussed: scatterplots, correlation analysis, and then individual or group regressions. Now we did not do that last one in this video, we will talk about it in the next video but that's another
technique you can employ to examine those relationships. So next steps from here. For the sake of learning, we
are going to break the rules and include all three
independent variables we talked about in the
regression at first. Then, we will remove the
problematic independent variables, as we should, and then watch what happens to the regression results. We will also perform simple regressions with the dependent variable
to use as a baseline, again for the sake of learning. So we will do a simple regression with the first independent variable and the dependent variable, the second independent variable
and the dependent variable, and the third and the dependent variable. And in the end, we will come up with the best-fitting regression model. And finally, we will do more
examples in future videos that are a bit different
than the one in this video. Again, just for the sake of learning. Okay, so we have completed Part 2 of our series on multiple regression. So quick recap, in Part 1,
we went over the very basics. Here in Part 2 we talked
about the prep work we have to do before actually running the regression in a computer. Now in subsequent parts, we'll talk about actually running the regression. We'll talk about picking the variables using different techniques,
using the computer. We'll talk about how to
interpret the results we get from the computer
and we'll talk about how to use the equation we get to actually make predictions
and some of the limitations around those predictions. So we have much more to go in talking about multiple regression. So if you're not a subscriber, please click Subscribe
up here in the top right, I would appreciate that very much. I've also included some links down here on the right to the playlist page where you can find all the stats videos, to the playlist for multiple regression, and the playlist for simple regression if you need to go back and
touch up on those concepts. So, again, thank you
very much for watching, I wish you the best of luck in your work and in your studies and look forward to seeing you again next time. (gentle guitar)