- Forward, backward, stepwise, and best subsets. What are these methods, and how do we use them to
build basic regression models? Grab a drink, get comfortable, and let's find out. Hello, and namaste. My name is Brandon, and
welcome to the next video in my series on basic statistics. Whether you are a new viewer
or a returning viewer, I am very grateful you
have chosen to spend some of your valuable time with me. We'll make the most of it. As you're watching, if you like the video, please give it a thumbs up, share it with classmates,
colleagues, or friends, or anyone else you think
might benefit from watching. Also, in the description below, you'll find a link to all of my playlists. It's basically a table of contents, like you would find in the front of a book. And if you haven't already, please hit the subscribe button and that bell notification. So now that we are introduced, let's go ahead and start learning. So this video is the next in
our series on model building. So when we start to build more
complex regression models, it's good to have a
foundation of, in this case, four basic techniques
that many people learn, and those are called forward regression, backward regression, stepwise regression, and best subsets regression. Now this video is going to be an overview of all four of these. There is no math involved,
it's all conceptual. We're gonna be using a lot of visuals, a lot of graphics, and
animations and things like that to get the fundamental ideas across. And then in future videos, we will take a deeper dive into each one and actually conduct each technique, see how things change as
variables are added or deleted, and things like that. So this is a broad overview, so we can set the stage for
more detailed learning as we go. So let's go ahead and get started. So first we will start
with this famous quote. If you're around statistics
or machine learning or data science long enough, you will encounter this
quote, if you haven't already. So this quote is by George Box, who was a statistician at
the University of Wisconsin here in the U.S., and who passed away a few years ago. And his famous quote is,
"All models are wrong, but some are useful." And I think this is a very
good way to start this video. So of course, we're gonna start with the basic building blocks of
regression model building. Now, every model we build
is obviously gonna be an approximation of reality. And as you add more variables, the models get more complicated. So things tend to happen: variables interact with each other. We're trying to mimic a
real phenomenon that exists in the world, and all models
are gonna have problems. They're gonna have deficiencies. They might overfit or underfit. We might not have the best mix
of variables at any one time. However, I want to
focus on the usefulness. So remember, when we build models, what we're trying to do is build something that allows us to make predictions that are much better than random guessing. That's the fundamental idea. So I could go and make
a random prediction, pick a number out of a hat about the high temperature
outside tomorrow. Or I could look at weather models and get a much better
guess or approximation of what that high
temperature will be tomorrow. So we're gonna focus on
building useful models while also keeping in mind
some humility as we go. So this is a very quick
overview of the larger problem we're gonna work on
over subsequent videos. And it has to do with a
"Guess Your Weight" game that is common at many
amusement parks or theme parks around the U.S. and the
world, I would assume. So a theme park analyst, that's you, will use historical data to
develop regression models for a "Guess Your Weight"
game that is designed for children and adolescents. So it's a game where children can walk up, of course, with their parent's permission, and play the game, and the person at the theme
park will try to guess the child or adolescent's weight. And if the theme park gets it correct, the child loses, and
they don't get any prize. And of course, they pay to play the game. If the theme park is wrong, the child or adolescent
wins and they get a prize. That's kind of the way the game works. So the full model is this. We're gonna try to guess
the person's weight based on their biological sex,
their age, and their height. And they have ways to give the
theme park this information, which we'll again
explore in future videos, the details of how it would work. So this is our full model. Now the theme park has
access to historical data on these measurements for
236 children and adolescents. So that will be the data
upon which we make our model, or maybe part of it, as we go. We'll see. We'll talk about that later. We have biological sex, age, and height, and we're gonna use
those in some combination to predict the child's weight. So there are seven possible models here. We could just use the
child's biological sex. We could use just their age. We could use just their height. We could use combinations, like their biological sex and age, their biological sex and
height, their age and height, or all three, their biological
sex, age, and height. So model seven at the bottom
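To make those seven candidates concrete, here is a minimal sketch in Python using statsmodels' formula interface. The park's actual historical data isn't shown in this video, so the `kids` data frame below is filled with made-up numbers purely as a stand-in; the seven formulas are the point.

```python
# A minimal sketch of the seven candidate models using statsmodels' formula
# interface. The `kids` DataFrame (columns: weight, sex, age, height) is a
# purely hypothetical stand-in for the park's historical data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 236
age = rng.uniform(4, 17, n)
height = 85 + 5.5 * age + rng.normal(0, 6, n)                    # made-up cm
sex = rng.choice(["F", "M"], n)
weight = -30 + 0.55 * height + 1.2 * age + rng.normal(0, 5, n)   # made-up kg
kids = pd.DataFrame({"weight": weight, "sex": sex, "age": age, "height": height})

candidate_formulas = [
    "weight ~ sex",                 # model 1
    "weight ~ age",                 # model 2
    "weight ~ height",              # model 3
    "weight ~ sex + age",           # model 4
    "weight ~ sex + height",        # model 5
    "weight ~ age + height",        # model 6
    "weight ~ sex + age + height",  # model 7: the full model
]

# Fit each candidate and show its R-squared, just to see them side by side.
for formula in candidate_formulas:
    fit = smf.ols(formula, data=kids).fit()
    print(f"{formula:<30} R-squared = {fit.rsquared:.3f}")
```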
there is called the full model. It's all of the variables
that we're looking at in our model. Models one through six
are called reduced models because they have fewer
variables than the full model. You'll also sometimes see
them called nested models, because they are nested
subsets of that full model. And again, the terminology
might change a little bit from wherever you are
reading or find information, but it's the same basic ideas.
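And to show what "nested" means in practice, here is a hedged sketch comparing one reduced model directly against the full model with an F-test, reusing the hypothetical `kids` data frame from the sketch above. The partial F-test itself gets a proper treatment in later videos.

```python
# Sketch: comparing a nested (reduced) model to the full model.
# Reuses the hypothetical `kids` DataFrame built in the previous sketch.
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

reduced = smf.ols("weight ~ age + height", data=kids).fit()        # model 6
full = smf.ols("weight ~ sex + age + height", data=kids).fit()     # model 7

# The F-test asks: does adding biological sex explain significantly more
# variance than the reduced model already explains?
print(anova_lm(reduced, full))
```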
we're asking here are which model makes the best predictions while also being the simplest? What do we mean by best in this case? How do we measure which model is best? So the four common techniques. In multiple regression model building, there are four basic strategies we learn at first, usually. Forward regression or addition regression, backward regression or
deletion regression, stepwise regression, which
is kind of a combination of the first two, and then best subsets regression, which is based on all
possible combinations of our variables we're interested in. Now, I should note here that these techniques are
not without controversy. So just like anything
else in complex fields like statistics and data
science and machine learning, there are people who say this technique is not that great, or that
technique is not that great. And that just kind of
comes with the territory. These are no different, but they are great tools for learning how basic models are
created and evaluated. So what do you need to know to
learn about these techniques? Because this is not basic stats anymore. This is not mean, median, and
mode, and stuff like that. This is complex. So you need to have a solid understanding of linear regression,
including what R-squared is. So remember, R-square is the proportion of percentage of variation
in the dependent variable explained by the independent
variable or variables. Interpretations of
coefficients in the output and in the regression equation. The ANOVA F-table that is almost always part of the regression output in Excel, R, or whatever else you happen to be using. The relationship of SST, the total sum of squares, to SSR, the sum of squares due to regression, and SSE, the sum of squares due to error, because remember, in model building, what we're trying to do
essentially is reduce SSE, or the error. What F-statistics are, so
our ratio of variances. P-values, correlation, and then also, eventually as we go, partial correlation. When we get into the nuts and bolts of how these models work, we will be talking about
partial correlations. If you don't know what that
is now, don't worry about it. I'm gonna cover it
probably in the next video in this series. Likewise, partial F-statistics. Again, if you're not sure what those are, we will cover those in probably
the next video as well. And then some familiarity
with stats software, like I said: Excel, R, JMP, XLSTAT, which is an Excel add-in, or Minitab. There are many stat software packages that can do regression.
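If you want to see where those pieces live in software, here is a small sketch that pulls R-square, the sums of squares, and the overall F-statistic out of one fitted model, again using the hypothetical `kids` data frame from earlier. One naming caution: statsmodels calls the residual sum of squares `ssr` and the explained sum of squares `ess`, which is the reverse of the SSR/SSE labels used in this series.

```python
# Sketch: one simple regression and the quantities mentioned above.
# Assumes the hypothetical `kids` DataFrame from the earlier sketch.
import statsmodels.formula.api as smf

fit = smf.ols("weight ~ height", data=kids).fit()

sst = fit.centered_tss   # SST: total sum of squares
ssr = fit.ess            # SSR in this series' notation: explained sum of squares
sse = fit.ssr            # SSE in this series' notation: residual (error) sum of squares

print(f"R-squared        = {fit.rsquared:.3f}")   # equals SSR / SST
print(f"SST = SSR + SSE -> {sst:.1f} = {ssr:.1f} + {sse:.1f}")
print(f"F-statistic      = {fit.fvalue:.2f}, p-value = {fit.f_pvalue:.4f}")
print(fit.params)        # intercept and slope coefficients
```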
So, some general rules and concepts that apply to everything we're gonna do in this video and future ones. When adding variables to a model, R-square will never decrease. So if we have one variable and then we add a second variable, the R-square value
between those two models from one to two variables
will never decrease. The R-square can only
remain the same or increase. So that's fundamental to know. Also, with more than one variable, the variables potentially start influencing each other. That's a very important concept. When we start adding two, three, four, five, or more variables into a model, we have all kinds of interactions, this one influencing that one and so forth, which can make things very complex. So you've got to remember that as soon as you add another variable, the variables could potentially
influence each other. Not necessarily, but they can. Now also different techniques may result in different models. So if we do forward regression with a certain number of variables, and then we do a stepwise regression with those same variables, we may not get the same model at the end of both of those techniques. That's just something to keep in mind, because they're not all aiming towards one ideal grand model. They all have their own
strategies and techniques in how variables enter and exit
the model building process. So it's quite possible that you can get different regression
equations at the end of them. Now, in forward regression,
once a variable is in, it stays in. Once you're part of the party, you don't get kicked out. In backward regression, once a variable is out, it stays out. So once you're kicked out of the party, you cannot come back in. For forward, backward, and stepwise, the analyst, which will be you, chooses the criteria for entry and exit. That's usually a P-value, or an F-statistic or partial F. So there is some subjectivity in how strict or liberal your model can be when it comes to adding
or removing variables. Now, R-square alone is
usually not sufficient to determine the best model, and as the models get more complex, there are other measures we can look to in concert with the R-square to determine which model we think is best. We want to aim for a model that
is simple yet fits the best. So we're trying to thread a needle here. We want a simple, well-fitting model that makes good predictions and is therefore generalizable
to different data. And on that note, including more variables risks overfitting the model. So again, if you've been around
stats or machine learning or anything like that, you
know what overfitting means. So in this case, we just
keep dumping variables into our model, more
variables, more variables. But the problem is the more
variables we enter into it, the more likely the model is
to mold itself around the data we are using to build the model. But then we expose that model
to a different set of data, and it makes really bad predictions, or predictions that are not
nearly as good as it did on our original data. So in machine learning, this is where the train-test split comes in. We train our model on a subset of a larger dataset, and we tune this model to get it just right. We have our different variables in there, and we really try to
get it to fit real well. And we take our test data or
completely different data, try to make predictions
based off that new data, and our model makes terrible predictions because it has fit the
original data so well. So in the end, we want a model that's simple, fits our data well, and makes good predictions without overfitting.
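As a rough illustration of that train-and-test idea, not the park's actual workflow, here is a hedged sketch that holds out part of the hypothetical `kids` data and compares how well the fitted full model does on each piece:

```python
# Sketch: hold out a test set and compare fit quality on train vs. test.
# Assumes the hypothetical `kids` DataFrame from the earlier sketches.
import numpy as np
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
test_mask = rng.random(len(kids)) < 0.25          # hold out roughly 25% for testing
train, test = kids[~test_mask], kids[test_mask]

fit = smf.ols("weight ~ sex + age + height", data=train).fit()

# R-squared on new data: 1 - SSE/SST computed on the held-out test set.
pred = fit.predict(test)
sse = ((test["weight"] - pred) ** 2).sum()
sst = ((test["weight"] - test["weight"].mean()) ** 2).sum()

print(f"Train R-squared: {fit.rsquared:.3f}")
print(f"Test  R-squared: {1 - sse / sst:.3f}")    # a large drop hints at overfitting
```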
So, forward regression, also called forward selection or forward addition. Let's start with our basic idea here. This black bar represents our R-square. R-square goes from zero to one, and it represents the
proportion of the variance in our dependent variable
that is explained by our independent variable or variables. So that's fundamental linear progression. So if we can see this bar
here as like a thermometer or some other gauge, where we're trying to fill
it up as much as possible with explained variance. So the way forward
regression works is this. First, we take each of our variables, one at a time, and regress the dependent variable on it. So in this case, we have three variables: Y regressed on X one, on X two, and on X three. So three simple regressions. Then we look at the R-square. So in this case, let's say X two has the
highest R-squared value of these three simple regressions. So we give it a thumbs up. Now, we only keep it in our
model if it's significant. If it's not, then we stop. Now significance actually has a quantitative scientific
numerical meaning, but we'll get to that in future videos. Even if it has the highest R-square, it does not mean it is significant. So in this case, we
will assume that it is, so we'll fill up our R-square bar. And what we'll do is we'll put an a there. So that is the amount of variation
in the dependent variable explained by the variable X two. It looks like it's about
50% or so, about half. As of right now, our regression
equation looks like this. Y equals B zero, that's the intercept, plus B two, that coefficient, X two. That's our model at this stage. What we do now is we keep going forward. We have kept X two from step one. Our a is still in the same place, but notice there's nothing
in the R-squared bar, and I'll explain why
that is here in a second. X two is already in the model. It's forward regression, so it stays. So it's already in the model, but we have two other
variables to consider now. So remember, R-square will
never decrease below point A as variables are added, but the proportion of R-square explained by each variable can change, and we'll see how that
works here in a second. So we kept X two from step
one of our forward regression. Now we did the same thing
with our other two variables. We evaluate those, which one
has the highest R-square, in this case, it's X three, same process. So we'll say that X three
explains more variance in the dependent variable
so that R-square goes up to point B. Now notice that X two's share isn't in the exact same place anymore relative to point A. So this is what I mean about the proportion of R-square explained by each variable changing. And why is that? Remember what I said earlier, when we start adding
more than one variable, these variables can start
influencing each other because they have a relationship
as well to each other, as well as to the dependent variable. So we're not doing the
actual math here yet, but I just want to get across the fundamental idea: even if we add X three in this case, the overall R-squared explained would never go below point A. Now, luckily for us, we
did explain more variance. So we're up to point B. Now, again, we only keep X three if it offers significant
change to our model, and we will talk
mathematically and numerically what that means in future videos. But we only keep it if
it adds significantly or makes a significant change
to the R-square on our model. So we assume it does. So our model is now Y equals
B zero plus B two X two, plus B three X three. So intercept, coefficient, variable two, coefficient, variable three. So now in this third step, we
have kept X two and X three. So we got thumbs up there. Now we look at X one, we add that in here. Maybe we have something like this. X two has its variance
there in the orange, X three in the purple, and
X one there on the blue. So there are our A and B, and now we have a C, so
point A is where we started in step one, step B is
where we were last time. So remember, in this step,
no matter what we do, if we add X one, our variance
will never go below B. We can only stay at B or
go above B to where C is in this case. We will evaluate X one. So we only keep it if this
change is significant, if not, X one doesn't come
into the model and we stop. So let's say in this example
that the R-squared that's added is not significant, and therefore X one will
not go into the model. And therefore, we will leave
the model at where we were in the previous step. And that is it. So that is the forward regression process. Add a variable, look at
which variable explains the most R-square first, then we evaluate the
rest of the variables, go with the highest R-square there, see if that change is significant, and so on and so on and so on. So that's how forward regression works. We take each variable, see how much variation it helps explain in the dependent variable
on a one-to-one basis. The one with the highest
R-squared is the one we evaluate. We see if that R-squared is significant. In this case, it's gonna
be based on an F-statistic, but again, we'll go
into that further later. And then we repeat the process. It's only based off the
added change in R-square as we go forward. If we add a variable, the
change is not significant, we don't add it to the model and we stop. And that's it. So backward regression,
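Here is a compact sketch of that forward procedure, written as a plain loop rather than any particular package's built-in routine. The entry cutoff of 0.05, like the `kids` data frame, is just an assumption for illustration, and the partial F-test it leans on is the subject of the next videos.

```python
# Sketch: forward regression by hand (hypothetical data and entry cutoff).
import statsmodels.formula.api as smf

def forward_regression(data, response, candidates, alpha_enter=0.05):
    selected = []
    while candidates:
        # Try each remaining candidate; keep the one giving the highest R-squared.
        best = None
        for var in candidates:
            fit = smf.ols(f"{response} ~ " + " + ".join(selected + [var]), data=data).fit()
            if best is None or fit.rsquared > best[1]:
                best = (var, fit.rsquared, fit)
        var, _, fit = best
        # Partial F-test: does adding `var` significantly improve on the current model?
        reduced_formula = f"{response} ~ " + (" + ".join(selected) if selected else "1")
        reduced = smf.ols(reduced_formula, data=data).fit()
        f_stat, p_value, _ = fit.compare_f_test(reduced)
        if p_value >= alpha_enter:
            break                       # no significant improvement: stop
        selected.append(var)            # once in, it stays in
        candidates.remove(var)
    return selected

print(forward_regression(kids, "weight", ["sex", "age", "height"]))
```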
or backward deletion, is the exact opposite,
obviously, of forward. In this case, we start
with all the variables in, and then we start taking
them out one by one. So here is our full model. So X one, X two, and X three are all in, and that's our equation
over there on the right. Now, we take out one variable in each case as if it were the last variable added. So in this first example,
we take out X one, and then we see what happens. What happens to our overall model? What change happens in the R-square? So we keep that in mind. Then we try the same thing with X two. Now, the full model, we take
X two out, see what happens. See what happens with our R-squared. See if the change is meaningful. Do the same thing with X three,
take it out of the model. See what happens with our R-squared, see if the change in
R-square is significant, and then we can go forward. If the change in R-square
is not significant, then we take that out and it stays out. So in this case, if the change in R-square is not significant, then we know that that variable
doesn't really add anything, and therefore we take
it out and leave it out. The exact opposite of what
we did in forward regression. In that case, we added a variable in. If the addition of that variable created a significant change
in R-squared, we left it in. Backward, we take a variable out. If that change is not
significant, then we leave it out. Kind of see how those work
backwards and forwards. So in this case, we would take out X one, 'cause we're gonna assume that it doesn't really add anything. So the change in R-square
when we take X one out is not significant, so we'll leave it out. Now we're down to our two variable model. So the first one we do is test X two. So we take X two out, look at the change in R-squared, see if that change is significant, and if it is, then we don't take it out. If it's not significant,
then we take it out. The opposite of forward regression. Do the same process with X three. We take it out. We say is the change in
R-square significant? If the change in R-square is significant, we leave it in, because
the reduction is too great. So in this case, it would
appear that both are too much. So both seem to have significant
reductions in R-square when we take them out. So both stay in, and we end up again with our two variable model. So stepwise is actually a
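And a matching sketch for the backward version, again with an assumed exit cutoff and the hypothetical `kids` data: start with everything in, and keep dropping the variable whose removal hurts the least, as long as that removal is not significant.

```python
# Sketch: backward regression (backward deletion) by hand.
import statsmodels.formula.api as smf

def backward_regression(data, response, variables, alpha_exit=0.05):
    selected = list(variables)
    while selected:
        full = smf.ols(f"{response} ~ " + " + ".join(selected), data=data).fit()
        # Find the variable whose removal matters least (largest p-value).
        worst = None
        for var in selected:
            remaining = [v for v in selected if v != var]
            formula = f"{response} ~ " + (" + ".join(remaining) if remaining else "1")
            reduced = smf.ols(formula, data=data).fit()
            _, p_value, _ = full.compare_f_test(reduced)   # cost of dropping `var`
            if worst is None or p_value > worst[1]:
                worst = (var, p_value)
        var, p_value = worst
        if p_value < alpha_exit:
            break                   # every remaining variable matters: stop
        selected.remove(var)        # once out, it stays out
    return selected

print(backward_regression(kids, "weight", ["sex", "age", "height"]))
```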
So stepwise is actually a combination of those two. Stepwise regression is like a forward regression and backward regression combined, but with a few modifications. Number one, at each step, all variables that are currently in
the model are evaluated for their unique
contribution to the model. Even when a variable gets
entered into the model, it is there, I won't say temporarily, but it's subject to critique or scrutiny. In stepwise, once a variable
is entered into the model, it might not stay in
the model going forward, because remember, we have one variable, let's say we add a second variable. And then in stepwise
regression, we take a pause. We look at each variable's
contribution to the model, and if one no longer contributes to the explanatory power of the model, it gets removed. So stepwise is a bit more robust in evaluating the model at each step. Now, variables removed at one step could reenter at a later step. So just because a variable is ejected in stepwise regression, that does not mean
necessarily that later on, it could not come back in, because as we add and remove variables, the overall model is changing. It's morphing, it's evolving. And it could be that later on, a variable that was thrown out might be able to explain a
bit more significant variance in the dependent variable than it did when it first came in, see? Build the model, look at each variable, if one no longer explains the variance in the dependent variable
very well, it's kicked out. Then we add other variables
that might be waiting sort of on deck. So we'll look at the next variable, stop, look at all the variables in, see if they all continue to contribute. If any don't, they get ejected. Now some can come back in. So you see the process? And again, we'll do this
mathematically later on. So it's a bit more
involved, but more robust. So this means that how variables
are related to each other and the order they are
entered into the model can change the makeup of the variables included in the model. So the entry order, and how the variables are fundamentally related before we even start, can affect how this model gets built. The rules for entering
and exiting the model are actually set by the analyst, you. Most stat software
packages and most textbooks will have recommendations on what P-value a variable needs to enter the model and what P-value should get a variable kicked out of the model, but in the end, it's really up to you, the analyst. There are no set rules. You can set 0.05 for a
P-value to enter the model. You can set 0.10 to exit the model. It's really up to you, but we'll go more into that when we look at stepwise
regression in detail. So a quick way this would work. Let's say we have the first step, where X two enters the model
and explains the variance up to A. Then we add X three in, and X three explains a ton
of variance in the model, and its entry makes X two
no longer significant. So X two gets bounced out of the model, and we just have X three now. Then, X one is waiting on deck, remember, it's still out there. We put X one in the model. Stepwise regression
looks at each and says, does X three contribute? Yes. Does X one contribute? Yes. So they can both stay. Now let's say hypothetically, we put X two back into the model, and this time at this stage, it does contribute to the
overall model at this point. So it's allowed to stay. So that can happen. So how often that happens is
really dependent on the model. There's no way of knowing, but it is possible. Just because one variable is kicked out of the model doesn't mean it cannot come back later.
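Here is a hedged sketch of that stepwise idea, combining the two loops above: forward entry steps with a backward re-check after each one. The thresholds of 0.05 to enter and 0.10 to exit are just the example values mentioned in this video, and the data frame is still the hypothetical `kids`.

```python
# Sketch: stepwise regression by hand -- forward entry plus backward re-checks.
import statsmodels.formula.api as smf

def fit_model(data, response, terms):
    formula = f"{response} ~ " + (" + ".join(terms) if terms else "1")
    return smf.ols(formula, data=data).fit()

def stepwise_regression(data, response, candidates, alpha_enter=0.05, alpha_exit=0.10):
    selected, remaining = [], list(candidates)
    changed = True
    while changed:
        changed = False
        # Forward step: add the candidate whose entry has the smallest partial-F p-value.
        current = fit_model(data, response, selected)
        entry = {}
        for var in remaining:
            bigger = fit_model(data, response, selected + [var])
            entry[var] = bigger.compare_f_test(current)[1]
        if entry:
            best_var = min(entry, key=entry.get)
            if entry[best_var] < alpha_enter:
                selected.append(best_var)
                remaining.remove(best_var)
                changed = True
        # Backward check: re-test every variable currently in the model.
        for var in list(selected):
            others = [v for v in selected if v != var]
            p_value = fit_model(data, response, selected).compare_f_test(
                fit_model(data, response, others))[1]
            if p_value > alpha_exit:
                selected.remove(var)      # kicked out -- but it may come back later
                remaining.append(var)
                changed = True
    return selected

print(stepwise_regression(kids, "weight", ["sex", "age", "height"]))
```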
So finally we have best subsets, and it's by far the easiest to understand. I'll just let the animation run, 'cause you can see what's happening. So in best subsets regression, what we do is create regression models for every possible combination
of variables that exist. So in this case, we have seven. You have sex on its own, age on its own, height on its own. Those are our three one-variable models. Then we have three two-variable
models, sex and age, sex and height, age and height. And then we have one three-variable model, sex, age, and height,
that's our full model. You're probably already
thinking to yourself, this can get pretty big. So once you add four variables and five and six, the list of possible regression equations grows exponentially: with k candidate variables there are 2^k minus 1 possible models, so three variables give 7, but 10 variables already give 1,023. In a way, with best subsets, you're never gonna miss the best model. However, once you start
adding a bunch of variables, this can get very, very complex, sometimes hard to interpret. It becomes much more
computationally intensive, depending on how many variables you have. And it's more like just
a brute force method for creating regression equations. Now, when we do this method, we can't rely on R-square alone, because we have many regression equations with different numbers of variables, different R-squares, and things like that. So how do we evaluate them? Well, in addition to
R-squares, in this case, there are other statistics we can use to choose our best model. And they are Mallows' Cp, which we'll talk about as we go, and a statistic called the AIC. So I'll save those for the best subsets regression video, but in this case, because
there are so many models, we've got to have other
statistics to evaluate them and ultimately make our choice.
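And here is what that brute-force enumeration might look like in code, with AIC and adjusted R-square reported alongside R-square. Mallows' Cp isn't computed here; we'll save that for the best subsets video. The `kids` data frame remains the made-up stand-in from the earlier sketches. Sorting by AIC rather than raw R-square is what keeps the full model from winning automatically.

```python
# Sketch: best subsets regression -- fit every non-empty combination of predictors.
from itertools import combinations
import statsmodels.formula.api as smf

predictors = ["sex", "age", "height"]
results = []
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = smf.ols("weight ~ " + " + ".join(subset), data=kids).fit()
        results.append((subset, fit.rsquared, fit.rsquared_adj, fit.aic))

# Sort by AIC (lower is better); R-squared alone would always favor the full model.
for subset, r2, adj_r2, aic in sorted(results, key=lambda row: row[3]):
    print(f"{' + '.join(subset):<20} R2={r2:.3f}  adj R2={adj_r2:.3f}  AIC={aic:.1f}")
```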
So, a quick review of the four common techniques. Forward regression, where we
add variables one at a time, see how the overall model changes, and only keep them if the addition significantly improves the overall R-square. Backward is the opposite of forward. We start with all the variables in, take them out one by one, and then we'll get to a point
where taking a variable out is detrimental to the model, and therefore we stop. Stepwise, it's kind of a combination. So we enter a variable,
enter another variable. Then we stop and pause, look at each variable's contribution. If they are both
contributing, they stay in. But if the addition of the second variable, hypothetically speaking,
reduces the amount of variance explained by the first variable, the first variable may get kicked out. And then we add the third variable. Now it doesn't mean that
the kicked out variable cannot come back later, it just means that at that step, it no longer contributes, so it is gone. And then best subsets, again,
all possible combinations. Depending on how many variables you have, the number of models you have to evaluate can get very large. So in future videos, we
will pose our problem, look at each technique in depth, including like numbers, and build and evaluate simple models. And again, remember,
"All models are wrong, but some are useful." So let's go ahead and
make some useful models in our next videos. So that wraps up our video
on the four common methods of regression model building, forward, backward,
stepwise, and best subsets. Now remember, no model is perfect. Each one is gonna have its own flaws, but we are trying to build a model that balances good predictions, simplicity, and interpretability. So once you begin to understand how all these parts fit together, the trade-offs we have
to make here and there, you will become a much better
analyst and model builder. So thank you very much for watching. I wish you all the best in
your work and in your studies, wish you health and happiness, and look forward to seeing
you again in the next video. Take care. Bye bye.