- Forward, backward, stepwise, and best subsets. What are these methods, and how do we use them to
build basic regression models? Grab a drink, get comfortable, and let's find out. Hello, and namaste. My name is Brandon, and
welcome to the next video in my series on basic statistics. Whether you are a new viewer
or a returning viewer, I am very grateful you
have chosen to spend some of your valuable time with me. We'll make the most of it. As you're watching, if you like the video, please give it a thumbs up, share it with classmates,
colleagues, or friends, or anyone else you think
might benefit from watching. Also, in the description below, you'll find a link to all of my playlists. It's basically a table of contents, like you would find in the front of a book. And if you haven't already, please hit the subscribe button and that bell notification. So now that we are introduced, let's go ahead and start learning. So this video is the next in
our series on model building. So when we start to build more
complex regression models, it's good to have a
foundation of, in this case, four basic techniques
that many people learn, and those are called forward regression, backward regression, stepwise regression, and best subsets regression. Now this video is going to be an overview of all four of these. There is no math involved,
it's all conceptual. We're gonna be using a lot of visuals, a lot of graphics, and
animations and things like that to get the fundamental ideas across. And then in future videos, we will take a deeper dive into each one and actually conduct each technique, see how things change as
variables are added or deleted, and things like that. So this is a broad overview, so we can set the stage for
more detailed learning as we go. So let's go ahead and get started. So first we will start
with this famous quote. If you're around statistics
or machine learning or data science long enough, you will encounter this
quote, if you haven't already. So this quote is by George Box, who was a statistician at
the University of Wisconsin here in the U.S., and who passed away a few years ago. And his famous quote is,
"All models are wrong, but some are useful." And I think this is a very
good way to start this video. So of course, we're gonna start with the basic building blocks of
regression model building. Now, every model we build
is obviously gonna be an approximation of reality. And as you add more variables, the models get more complicated. So things tend to happen: variables interact with each other. We're trying to mimic a
real phenomenon that exists in the world, and all models
are gonna have problems. They're gonna have deficiencies. They might overfit or underfit. We might not have the best mix
of variables at any one time. However, I want to
focus on the usefulness. So remember, when we build models, what we're trying to do is build something that allows us to make predictions that are much better than random guessing. That's the fundamental idea. So I could go and make
a random prediction, pick a number out of a hat about the high temperature
outside tomorrow. Or I could look at weather models and get a much better
guess or approximation of what that high
temperature will be tomorrow. So we're gonna focus on
building useful models while also keeping in mind
some humility as we go. So this is a very quick
overview of the larger problem we're gonna work on
over subsequent videos. And it has to do with a
"Guess Your Weight" game that is common at many
amusement parks or theme parks around the U.S. and the
world, I would assume. So a theme park analyst, that's you, will use historical data to
develop regression models for a "Guess Your Weight"
game that is designed for children and adolescents. So it's a game where children can walk up, of course, with their parent's permission, and play the game, and the person at the theme
park will try to guess the child or adolescent's weight. And if the theme park gets it correct, the child loses, and
they don't get any prize. And of course, they pay to play the game. If the theme park is wrong, the child or adolescent
wins and they get a prize. That's kind of the way the game works. So the full model is this. We're gonna try to guess
the person's weight based on their biological sex,
their age, and their height. And they have ways to give the
theme park this information, which we'll again
explore in future videos, the details of how it would work. So this is our full model. Now the theme park has
access to historical data on these measurements for
236 children and adolescents. So that will be the data
upon which we make our model, or maybe part of it, as we go. We'll see. We'll talk about that later. We have biological sex, age, and height, and we're gonna use
those in some combination to predict the child's weight. So there are seven possible models here. We could just use the
child's biological sex. We could use just their age. We could use just their height. We could use combinations, like their biological sex and age, their biological sex and
height, their age and height, or all three, their biological
sex, age, and height. So model seven at the bottom
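To make those seven candidates concrete, here is a minimal sketch in Python using statsmodels' formula interface. The park's actual historical data isn't shown in this video, so the `kids` data frame below is filled with made-up numbers purely as a stand-in; the seven formulas are the point.

```python
# A minimal sketch of the seven candidate models using statsmodels' formula
# interface. The `kids` DataFrame (columns: weight, sex, age, height) is a
# purely hypothetical stand-in for the park's historical data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 236
age = rng.uniform(4, 17, n)
height = 85 + 5.5 * age + rng.normal(0, 6, n)                    # made-up cm
sex = rng.choice(["F", "M"], n)
weight = -30 + 0.55 * height + 1.2 * age + rng.normal(0, 5, n)   # made-up kg
kids = pd.DataFrame({"weight": weight, "sex": sex, "age": age, "height": height})

candidate_formulas = [
    "weight ~ sex",                 # model 1
    "weight ~ age",                 # model 2
    "weight ~ height",              # model 3
    "weight ~ sex + age",           # model 4
    "weight ~ sex + height",        # model 5
    "weight ~ age + height",        # model 6
    "weight ~ sex + age + height",  # model 7: the full model
]

# Fit each candidate and show its R-squared, just to see them side by side.
for formula in candidate_formulas:
    fit = smf.ols(formula, data=kids).fit()
    print(f"{formula:<30} R-squared = {fit.rsquared:.3f}")
```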
there is called the full model. It's all of the variables
that we're looking at in our model. Models one through six
are called reduced models because they have fewer
variables than the full model. You'll also sometimes see
them called nested models, because they are nested
subsets of that full model. And again, the terminology
might change a little bit from wherever you are
reading or find information, but it's the same basic ideas.
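And to show what "nested" means in practice, here is a hedged sketch comparing one reduced model directly against the full model with an F-test, reusing the hypothetical `kids` data frame from the sketch above. The partial F-test itself gets a proper treatment in later videos.

```python
# Sketch: comparing a nested (reduced) model to the full model.
# Reuses the hypothetical `kids` DataFrame built in the previous sketch.
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

reduced = smf.ols("weight ~ age + height", data=kids).fit()        # model 6
full = smf.ols("weight ~ sex + age + height", data=kids).fit()     # model 7

# The F-test asks: does adding biological sex explain significantly more
# variance than the reduced model already explains?
print(anova_lm(reduced, full))
```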
we're asking here are which model makes the best predictions while also being the simplest? What do we mean by best in this case? How do we measure which model is best? So the four common techniques. In multiple regression model building, there are four basic strategies we learn at first, usually. Forward regression or addition regression, backward regression or
deletion regression, stepwise regression, which
is kind of a combination of the first two, and then best subsets regression, which is based on all
possible combinations of our variables we're interested in. Now, I should note here that these techniques are
not without controversy. So just like anything
else in complex fields like statistics and data
science and machine learning, there are people who say this technique is not that great, or that
technique is not that great. And that just kind of
comes with the territory. These are no different, but they are great tools for learning how basic models are
created and evaluated. So what do you need to know to
learn about these techniques? Because this is not basic stats anymore. This is not mean, median, and
mode, and stuff like that. This is complex. So you need to have a solid understanding of linear regression,
including what R-squared is. So remember, R-square is the proportion of percentage of variation
in the dependent variable explained by the independent
variable or variables. Interpretations of
coefficients in the output and in the regression equation. The ANOVA F-table that is almost always part of the regression output in Excel, R, or whatever else you happen to be using. The relationship of SST, the total sum of squares, to SSR, the sum of squares due to regression, and SSE, the sum of squares due to error, because remember, in model building, what we're trying to do
essentially is reduce SSE, or the error. What F-statistics are, so
our ratio of variances. P-values, correlation, and then also, eventually as we go, partial correlation. When we get into the nuts and bolts of how these models work, we will be talking about
partial correlations. If you don't know what that
is now, don't worry about it. I'm gonna cover it
probably in the next video in this series. Likewise, partial F-statistics. Again, if you're not sure what those are, we will cover those in probably
the next video as well. And then some familiarity
with stats software, like I said: Excel, R, JMP, XLSTAT, which is an Excel add-in, or Minitab. There are many stat software packages that can do regression.
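If you want to see where those pieces live in software, here is a small sketch that pulls R-square, the sums of squares, and the overall F-statistic out of one fitted model, again using the hypothetical `kids` data frame from earlier. One naming caution: statsmodels calls the residual sum of squares `ssr` and the explained sum of squares `ess`, which is the reverse of the SSR/SSE labels used in this series.

```python
# Sketch: one simple regression and the quantities mentioned above.
# Assumes the hypothetical `kids` DataFrame from the earlier sketch.
import statsmodels.formula.api as smf

fit = smf.ols("weight ~ height", data=kids).fit()

sst = fit.centered_tss   # SST: total sum of squares
ssr = fit.ess            # SSR in this series' notation: explained sum of squares
sse = fit.ssr            # SSE in this series' notation: residual (error) sum of squares

print(f"R-squared        = {fit.rsquared:.3f}")   # equals SSR / SST
print(f"SST = SSR + SSE -> {sst:.1f} = {ssr:.1f} + {sse:.1f}")
print(f"F-statistic      = {fit.fvalue:.2f}, p-value = {fit.f_pvalue:.4f}")
print(fit.params)        # intercept and slope coefficients
```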
So, some general rules and concepts that apply to everything we're gonna do in this video and future ones. When adding variables to a model, R-square will never decrease. So if we have one variable and then we add a second variable, the R-square value
between those two models from one to two variables
will never decrease. The R-square can only
remain the same or increase. So that's fundamental to know. Also, with more than one variable, the variables potentially start influencing each other. That's a very important concept. When we start adding two, three, four, five, or more variables into a model, we have all kinds of interactions, this one influencing that one and so forth, which can make things very complex. So you've got to remember that as soon as you add another variable, the variables could potentially
influence each other. Not necessarily, but they can. Now also different techniques may result in different models. So if we do forward regression with a certain number of variables, and then we do a stepwise regression with those same variables, we may not get the same model at the end of both of those techniques. That's just something to keep in mind, because they're not all aiming towards one ideal grand model. They all have their own
strategies and techniques in how variables enter and exit
the model building process. So it's quite possible that you can get different regression
equations at the end of them. Now, in forward regression,
once a variable is in, it stays in. Once you're part of the party, you don't get kicked out. In backward regression, once a variable is out, it stays out. So once you're kicked out of the party, you cannot come back in. For forward, backward, and stepwise, the analyst, which will be you, chooses the criteria for entry and exit. That's usually a P-value, or an F-statistic or partial F. So there is some subjectivity in how strict or liberal your model can be when it comes to adding
or removing variables. Now, R-square alone is
usually not sufficient to determine the best model, and as the models get more complex, there are other measures we can look to in concert with the R-square to determine which model we think is best. We want to aim for a model that
is simple yet fits the best. So we're trying to thread a needle here. We want a simple, well-fitting model that makes good predictions and is therefore generalizable
to different data. And on that note, including more variables risks overfitting the model. So again, if you've been around
stats or machine learning or anything like that, you
know what overfitting means. So in this case, we just
keep dumping variables into our model, more
variables, more variables. But the problem is the more
variables we enter into it, the more likely the model is
to mold itself around the data we are using to build the model. But then we expose that model
to a different set of data, and it makes really bad predictions, or predictions that are not
nearly as good as it did on our original data. So in machine learning, this is where the train-test split comes in. We train our model on a subset of a larger dataset, and we tune this model to get it just right. We have our different variables in there, and we really try to
get it to fit real well. And we take our test data or
completely different data, try to make predictions
based off that new data, and our model makes terrible predictions because it has fit the
original data so well. So in the end, we want a model that's simple, fits our data well, and makes good predictions without overfitting.
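As a rough illustration of that train-and-test idea, not the park's actual workflow, here is a hedged sketch that holds out part of the hypothetical `kids` data and compares how well the fitted full model does on each piece:

```python
# Sketch: hold out a test set and compare fit quality on train vs. test.
# Assumes the hypothetical `kids` DataFrame from the earlier sketches.
import numpy as np
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
test_mask = rng.random(len(kids)) < 0.25          # hold out roughly 25% for testing
train, test = kids[~test_mask], kids[test_mask]

fit = smf.ols("weight ~ sex + age + height", data=train).fit()

# R-squared on new data: 1 - SSE/SST computed on the held-out test set.
pred = fit.predict(test)
sse = ((test["weight"] - pred) ** 2).sum()
sst = ((test["weight"] - test["weight"].mean()) ** 2).sum()

print(f"Train R-squared: {fit.rsquared:.3f}")
print(f"Test  R-squared: {1 - sse / sst:.3f}")    # a large drop hints at overfitting
```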
So, forward regression, also called forward selection or forward addition. Let's start with our basic idea here. This black bar represents our R-square. R-square goes from zero to one, and it represents the
proportion of the variance in our dependent variable
that is explained by our independent variable or variables. So that's fundamental linear progression. So if we can see this bar
here as like a thermometer or some other gauge, where we're trying to fill
it up as much as possible with explained variance. So the way forward
regression works is this. First, we take each of our variables, one at a time, and regress the dependent variable on it. So in this case, we have three variables: Y regressed on X one, on X two, and on X three. So three simple regressions. Then we look at the R-square. So in this case, let's say X two has the
highest R-squared value of these three simple regressions. So we give it a thumbs up. Now, we only keep it in our
model if it's significant. If it's not, then we stop. Now significance actually has a quantitative scientific
numerical meaning, but we'll get to that in future videos. Even if it has the highest R-square, it does not mean it is significant. So in this case, we
will assume that it is, so we'll fill up our R-square bar. And what we'll do is we'll put an a there. So that is the amount of variation
in the dependent variable explained by the variable X two. It looks like it's about
50% or so, about half. As of right now, our regression
equation looks like this. Y equals B zero, that's the intercept, plus B two, that coefficient, X two. That's our model at this stage. What we do now is we keep going forward. We have kept X two from step one. Our a is still in the same place, but notice there's nothing
in the R-squared bar, and I'll explain why
that is here in a second. X two is already in the model. It's forward regression, so it stays. So it's already in the model, but we have two other
variables to consider now. So remember, R-square will
never decrease below point A as variables are added, but the proportion of R-square explained by each variable can change, and we'll see how that
works here in a second. So we kept X two from step
one of our forward regression. Now we did the same thing
with our other two variables. We evaluate those, which one
has the highest R-square, in this case, it's X three, same process. So we'll say that X three
explains more variance in the dependent variable
so that R-square goes up to point B. Now notice that X two's share isn't in the exact same place anymore relative to point A. So this is what I mean about the proportion of R-square explained by each variable changing. And why is that? Remember what I said earlier, when we start adding
more than one variable, these variables can start
influencing each other because they have a relationship
as well to each other, as well as to the dependent variable. So we're not doing the
actual math here yet, but I just want to get across the fundamental idea: even if we add X three in this case, the overall R-squared explained would never go below point A. Now, luckily for us, we
did explain more variance. So we're up to point B. Now, again, we only keep X three if it offers significant
change to our model, and we will talk
mathematically and numerically what that means in future videos. But we only keep it if
it adds significantly or makes a significant change
to the R-square on our model. So we assume it does. So our model is now Y equals
B zero plus B two X two, plus B three X three. So intercept, coefficient, variable two, coefficient, variable three. So now in this third step, we
have kept X two and X three. So we got thumbs up there. Now we look at X one, we add that in here. Maybe we have something like this. X two has its variance
there in the orange, X three in the purple, and
X one there on the blue. So there are our A and B, and now we have a C, so
point A is where we started in step one, step B is
where we were last time. So remember, in this step,
no matter what we do, if we add X one, our variance
will never go below B. We can only stay at B or
go above B to where C is in this case. We will evaluate X one. So we only keep it if this
change is significant, if not, X one doesn't come
into the model and we stop. So let's say in this example
that the R-squared that's added is not significant, and therefore X one will
not go into the model. And therefore, we will leave
the model at where we were in the previous step. And that is it. So that is the forward regression process. Add a variable, look at
which variable explains the most R-square first, then we evaluate the
rest of the variables, go with the highest R-square there, see if that change is significant, and so on and so on and so on. So that's how forward regression works. We take each variable, see how much variation it helps explain in the dependent variable
on a one-to-one basis. The one with the highest
R-squared is the one we evaluate. We see if that R-squared is significant. In this case, it's gonna
be based on an F-statistic, but again, we'll go
into that further later. And then we repeat the process. It's only based off the
added change in R-square as we go forward. If we add a variable, the
change is not significant, we don't add it to the model and we stop. And that's it. So backward regression,
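Here is a compact sketch of that forward procedure, written as a plain loop rather than any particular package's built-in routine. The entry cutoff of 0.05, like the `kids` data frame, is just an assumption for illustration, and the partial F-test it leans on is the subject of the next videos.

```python
# Sketch: forward regression by hand (hypothetical data and entry cutoff).
import statsmodels.formula.api as smf

def forward_regression(data, response, candidates, alpha_enter=0.05):
    selected = []
    while candidates:
        # Try each remaining candidate; keep the one giving the highest R-squared.
        best = None
        for var in candidates:
            fit = smf.ols(f"{response} ~ " + " + ".join(selected + [var]), data=data).fit()
            if best is None or fit.rsquared > best[1]:
                best = (var, fit.rsquared, fit)
        var, _, fit = best
        # Partial F-test: does adding `var` significantly improve on the current model?
        reduced_formula = f"{response} ~ " + (" + ".join(selected) if selected else "1")
        reduced = smf.ols(reduced_formula, data=data).fit()
        f_stat, p_value, _ = fit.compare_f_test(reduced)
        if p_value >= alpha_enter:
            break                       # no significant improvement: stop
        selected.append(var)            # once in, it stays in
        candidates.remove(var)
    return selected

print(forward_regression(kids, "weight", ["sex", "age", "height"]))
```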
or backward deletion, is the exact opposite,
obviously, of forward. In this case, we start
with all the variables in, and then we start taking
them out one by one. So here is our full model. So X one, X two, and X three are all in, and that's our equation
over there on the right. Now, we take out one variable in each case as if it were the last variable added. So in this first example,
we take out X one, and then we see what happens. What happens to our overall model? What change happens in the R-square? So we keep that in mind. Then we try the same thing with X two. Now, the full model, we take
X two out, see what happens. See what happens with our R-squared. See if the change is meaningful. Do the same thing with X three,
take it out of the model. See what happens with our R-squared, see if the change in
R-square is significant, and then we can go forward. If the change in R-square
is not significant, then we take that out and it stays out. So in this case, if the change in R-square is not significant, then we know that that variable
doesn't really add anything, and therefore we take
it out and leave it out. The exact opposite of what
we did in forward regression. In that case, we added a variable in. If the addition of that variable created a significant change
in R-squared, we left it in. Backward, we take a variable out. If that change is not
significant, then we leave it out. Kind of see how those work
backwards and forwards. So in this case, we would take out X one, 'cause we're gonna assume that it doesn't really add anything. So the change in R-square
when we take X one out is not significant, so we'll leave it out. Now we're down to our two variable model. So the first one we do is test X two. So we take X two out, look at the change in R-squared, see if that change is significant, and if it is, then we don't take it out. If it's not significant,
then we take it out. The opposite of forward regression. Do the same process with X three. We take it out. We say is the change in
R-square significant? If the change in R-square is significant, we leave it in, because
the reduction is too great. So in this case, it would
appear that both are too much. So both seem to have significant
reductions in R-square when we take them out. So both stay in, and we end up again with our two variable model. So stepwise is actually a
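And a matching sketch for the backward version, again with an assumed exit cutoff and the hypothetical `kids` data: start with everything in, and keep dropping the variable whose removal hurts the least, as long as that removal is not significant.

```python
# Sketch: backward regression (backward deletion) by hand.
import statsmodels.formula.api as smf

def backward_regression(data, response, variables, alpha_exit=0.05):
    selected = list(variables)
    while selected:
        full = smf.ols(f"{response} ~ " + " + ".join(selected), data=data).fit()
        # Find the variable whose removal matters least (largest p-value).
        worst = None
        for var in selected:
            remaining = [v for v in selected if v != var]
            formula = f"{response} ~ " + (" + ".join(remaining) if remaining else "1")
            reduced = smf.ols(formula, data=data).fit()
            _, p_value, _ = full.compare_f_test(reduced)   # cost of dropping `var`
            if worst is None or p_value > worst[1]:
                worst = (var, p_value)
        var, p_value = worst
        if p_value < alpha_exit:
            break                   # every remaining variable matters: stop
        selected.remove(var)        # once out, it stays out
    return selected

print(backward_regression(kids, "weight", ["sex", "age", "height"]))
```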
So stepwise is actually a combination of those two. Stepwise regression is like a forward regression and backward regression combined, but with a few modifications. Number one, at each step, all variables that are currently in
the model are evaluated for their unique
contribution to the model. Even when a variable gets
entered into the model, it is there, I won't say temporarily, but it's subject to critique or scrutiny. In stepwise, once a variable
is entered into the model, it might not stay in
the model going forward, because remember, we have one variable, let's say we add a second variable. And then in stepwise
regression, we take a pause. We look at each variable's
contribution to the model, and if one no longer contributes to the explanatory power of the model, it gets removed. So stepwise is a bit more robust in evaluating the model at each step. Now, variables removed at one step could reenter at a later step. So just because a variable is ejected in stepwise regression, that does not mean
necessarily that later on, it could not come back in, because as we add and remove variables, the overall model is changing. It's morphing, it's evolving. And it could be that later on, a variable that was thrown out might be able to explain a
bit more significant variance in the dependent variable than it did when it first came in, see? Build the model, look at each variable, if one no longer explains the variance in the dependent variable
very well, it's kicked out. Then we add other variables
that might be waiting sort of on deck. So we'll look at the next variable, stop, look at all the variables in, see if they all continue to contribute. If any don't, they get ejected. Now some can come back in. So you see the process? And again, we'll do this
mathematically later on. So it's a bit more
involved, but more robust. So this means that how variables
are related to each other and the order they are
entered into the model can change the makeup of the variables included in the model. So the entry order, and how the variables are fundamentally related before we even start, can affect how this model gets built. The rules for entering
and exiting the model are actually set by the analyst, you. Most stat software
packages and most textbooks will have recommendations on what P-value a variable needs to enter the model and what P-value should get a variable kicked out of the model, but in the end, it's really up to you, the analyst. There are no set rules. You can set 0.05 for a
P-value to enter the model. You can set 0.10 to exit the model. It's really up to you, but we'll go more into that when we look at stepwise
regression in detail. So a quick way this would work. Let's say we have the first step, where X two enters the model
and explains the variance up to A. Then we add X three in, and X three explains a ton
of variance in the model, and its entry makes X two
no longer significant. So X two gets bounced out of the model, and we just have X three now. Then, X one is waiting on deck, remember, it's still out there. We put X one in the model. Stepwise regression
looks at each and says, does X three contribute? Yes. Does X one contribute? Yes. So they can both stay. Now let's say hypothetically, we put X two back into the model, and this time at this stage, it does contribute to the
overall model at this point. So it's allowed to stay. So that can happen. So how often that happens is
really dependent on the model. There's no way of knowing, but it is possible. Just because one variable is kicked out of the model doesn't mean it cannot come back later.
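Here is a hedged sketch of that stepwise idea, combining the two loops above: forward entry steps with a backward re-check after each one. The thresholds of 0.05 to enter and 0.10 to exit are just the example values mentioned in this video, and the data frame is still the hypothetical `kids`.

```python
# Sketch: stepwise regression by hand -- forward entry plus backward re-checks.
import statsmodels.formula.api as smf

def fit_model(data, response, terms):
    formula = f"{response} ~ " + (" + ".join(terms) if terms else "1")
    return smf.ols(formula, data=data).fit()

def stepwise_regression(data, response, candidates, alpha_enter=0.05, alpha_exit=0.10):
    selected, remaining = [], list(candidates)
    changed = True
    while changed:
        changed = False
        # Forward step: add the candidate whose entry has the smallest partial-F p-value.
        current = fit_model(data, response, selected)
        entry = {}
        for var in remaining:
            bigger = fit_model(data, response, selected + [var])
            entry[var] = bigger.compare_f_test(current)[1]
        if entry:
            best_var = min(entry, key=entry.get)
            if entry[best_var] < alpha_enter:
                selected.append(best_var)
                remaining.remove(best_var)
                changed = True
        # Backward check: re-test every variable currently in the model.
        for var in list(selected):
            others = [v for v in selected if v != var]
            p_value = fit_model(data, response, selected).compare_f_test(
                fit_model(data, response, others))[1]
            if p_value > alpha_exit:
                selected.remove(var)      # kicked out -- but it may come back later
                remaining.append(var)
                changed = True
    return selected

print(stepwise_regression(kids, "weight", ["sex", "age", "height"]))
```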
So finally we have best subsets, and it's by far the easiest to understand. I'll just let the animation run, 'cause you can see what's happening. So in best subsets regression, what we do is create regression models for every possible combination
of variables that exist. So in this case, we have seven. You have sex on its own, age on its own, height on its own. Those are our three one-variable models. Then we have three two-variable
models, sex and age, sex and height, age and height. And then we have one three-variable model, sex, age, and height,
that's our full model. You're probably already
thinking to yourself, this can get pretty big. So once you add four variables and five and six, the list of possible regression equations grows exponentially: with k candidate variables there are 2^k minus 1 possible models, so three variables give 7, but 10 variables already give 1,023. In a way, with best subsets, you're never gonna miss the best model. However, once you start
adding a bunch of variables, this can get very, very complex, sometimes hard to interpret. It becomes much more
computationally intensive, depending on how many variables you have. And it's more like just
a brute force method for creating regression equations. Now, when we do this method, we can't rely on R-square alone, because we have many regression equations with different numbers of variables, different R-squares, and things like that. So how do we evaluate them? Well, in addition to
R-squares, in this case, there are other statistics we can use to choose our best model. And they are Mallows' Cp, which we'll talk about as we go, and a statistic called the AIC. So I'll save those for the best subsets regression video, but in this case, because
there are so many models, we've got to have other
statistics to evaluate them and ultimately make our choice.
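And here is what that brute-force enumeration might look like in code, with AIC and adjusted R-square reported alongside R-square. Mallows' Cp isn't computed here; we'll save that for the best subsets video. The `kids` data frame remains the made-up stand-in from the earlier sketches. Sorting by AIC rather than raw R-square is what keeps the full model from winning automatically.

```python
# Sketch: best subsets regression -- fit every non-empty combination of predictors.
from itertools import combinations
import statsmodels.formula.api as smf

predictors = ["sex", "age", "height"]
results = []
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = smf.ols("weight ~ " + " + ".join(subset), data=kids).fit()
        results.append((subset, fit.rsquared, fit.rsquared_adj, fit.aic))

# Sort by AIC (lower is better); R-squared alone would always favor the full model.
for subset, r2, adj_r2, aic in sorted(results, key=lambda row: row[3]):
    print(f"{' + '.join(subset):<20} R2={r2:.3f}  adj R2={adj_r2:.3f}  AIC={aic:.1f}")
```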
So, a quick review of the four common techniques. Forward regression, where we
add variables one at a time, see how the overall model changes, and only keep them if the addition significantly improves the overall R-square. Backward is the opposite of forward. We start with all the variables in, take them out one by one, and then we'll get to a point
where taking a variable out is detrimental to the model, and therefore we stop. Stepwise, it's kind of a combination. So we enter a variable,
enter another variable. Then we stop and pause, look at each variable's contribution. If they are both
contributing, they stay in. But if the addition of the second variable, hypothetically speaking,
reduces the amount of variance explained by the first variable, the first variable may get kicked out. And then we add the third variable. Now it doesn't mean that
the kicked out variable cannot come back later, it just means that at that step, it no longer contributes, so it is gone. And then best subsets, again,
all possible combinations. Depending on how many variables you have, the number of models you have to evaluate can get very large. So in future videos, we
will pose our problem, look at each technique in depth, including like numbers, and build and evaluate simple models. And again, remember,
"All models are wrong, but some are useful." So let's go ahead and
make some useful models in our next videos. So that wraps up our video
on the four common methods of regression model building, forward, backward,
stepwise, and best subsets. Now remember, no model is perfect. Each one is gonna have its own flaws, but we are trying to build a model that balances good predictions, simplicity, and interpretability. So once you begin to understand how all these parts fit together, the trade-offs we have
to make here and there, you will become a much better
analyst and model builder. So thank you very much for watching. I wish you all the best in
your work and in your studies, wish you health and happiness, and look forward to seeing
you again in the next video. Take care. Bye bye.