Statistics 101: Multiple Regression, Best Subsets

Captions
- Hello and Namaste. My name is Brandon, and welcome to the next video in my series on basic statistics. If you are new to the channel, welcome, it's great to have you. If you're a returning viewer, it is great to have you back. If you like this video, please give it a thumbs up and share it with classmates, colleagues, friends, or anyone else you think might benefit from watching. Now that we are introduced, let's go ahead and get started.

So this video is the next in our series on the four common model-building techniques in multiple regression. Up to this point, we have looked at forward selection, backward elimination, and stepwise, and this is best subsets, which is the fourth, and it is quite a bit different. However, we'll keep this concise, conceptual, and at a high level, because once you have the basics of regression model building, you'll understand how this one is different from the others and how to evaluate them. So let's go ahead and dive in.

So, of course, there are the three common iterative techniques I mentioned before: forward, backward, and stepwise. This fourth technique is not iterative. It's called best subsets regression. Best subsets does what its name says: it examines all possible combinations of feature variables. It's a brute-force method that can be computationally expensive depending on how many observations and how many feature variables you have.

Now, you, the analyst, can specify the maximum number of features that you want in your model. So let's say you have 15 feature variables, but you do not want a model that has any more than eight. You can tell the software, most software programs, "Limit my model to only eight variables, even though I have 15 to work with." That's one way to limit the size of your model. You, the analyst, can also request the two or three best models for each number of feature variables. Take that eight-variable example again. You could tell the software, "Hey, give me the three best single-variable models, give me the three best two-variable models, give me the three best three-variable models," and so on and so forth. That helps limit the extraneous output you might get if you looked at all of those models, and I'll show you how big they can get here in a second.

Now, remember that these techniques will not always produce the same best model. Forward, backward, stepwise, and best subsets may or may not produce the same best model. And actually, within best subsets, you might have evaluation criteria that are at odds with each other, so you have to use your judgment. But just keep in mind that all of these processes may or may not produce the same best model.

So remember, the entire goal of model building is to reduce SSE without overfitting. We want to explain the maximum variance in the dependent variable. Up to this point, we have learned about the three other techniques: forward, backward, and stepwise. Technically, these are all stepwise, since the model is evaluated at each step. So JMP from SAS, which I use, calls them all stepwise. But stepwise, the one we've learned about, is actually forward and backward combined; they're all stepwise, technically.

Now, by their nature, the first three techniques will not examine all possible models. Only best subsets does that. In forward and backward, we enter or remove variables one at a time depending on whether or not they meet the threshold value. And in forward selection, once they're in, they're always in.
In backward elimination, once they're out, they're always out. And of course, stepwise can do a little bit of both. However, that doesn't mean the process will look at every variable combination. That's not the way those work. Best subsets does work that way.

In practice, multiple selection methods are almost always used together, since doing so is just a few clicks or a few lines of code. What's common is that an analyst will run their model through forward selection, through backward elimination, through stepwise regression, and through best subsets, look at what each one produces, and then use their judgment to decide what the best final model might be.

Best subsets, before fast computers, was very difficult due to how numerous and complex the computations could be. With each new feature variable added, the number of possible models balloons, since that number is the sum of all possible model combinations of each size. That's a lot of words, so let's see how it works. For example, a full model having only four features would have a sum of all models that have zero features, one feature, two features, three features, and four features, respectively. Mathematically, it's just combinations, right? It's combinatorics. We have four variables and choose zero of them, plus four choose one, plus four choose two, plus four choose three, plus four choose four, and that sums to 16 possible models. Now, for nine features, that number would grow to 512 possible models. See how fast these can get out of hand? Can you imagine computer output that has the evaluation of 512 different models on it, just using nine features? It can get really, really crazy very quickly.

So to manage the output, the analyst often requests the top n models, so the top two or three models of each variable size, which we call p. That's what we have up here: zero variables, one variable, two, three, and all four. We can also limit the number of features. In this case, if we only wanted a model with a maximum of two variables, we could just look at the first three combination terms here and tell the software to drop the last two. So we limit the number of features in the model as one way to limit the output, and we can also limit the number of models we get for each number of feature combinations.
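For reference, here is a minimal Python sketch, not from the video, of both ideas: counting how many candidate models exist for p features, and the brute-force enumeration itself. The dataset, column names, and the max_features / top_n parameters are illustrative assumptions, and adjusted r-square is used here just as one convenient way to rank the models within each size.

```python
import itertools
import math

import pandas as pd
import statsmodels.api as sm

# Counting: with p candidate features there are sum_{k=0..p} C(p, k) = 2**p
# possible models (including the intercept-only, zero-feature model).
for p in (4, 9):
    total = sum(math.comb(p, k) for k in range(p + 1))
    print(f"{p} features -> {total} possible models")  # 16 and 512


def best_subsets(df, response, features, max_features=None, top_n=3):
    """Brute-force best subsets: fit OLS on every feature combination,
    then keep the top_n models of each size, ranked here by adjusted R^2."""
    y = df[response]
    max_features = max_features or len(features)
    results = []
    for k in range(1, max_features + 1):
        for combo in itertools.combinations(features, k):
            X = sm.add_constant(df[list(combo)])
            fit = sm.OLS(y, X).fit()
            results.append({
                "n_features": k,
                "features": combo,
                "r2": fit.rsquared,
                "adj_r2": fit.rsquared_adj,
                "aic": fit.aic,
                "bic": fit.bic,
            })
    table = pd.DataFrame(results)
    # Keep only the top_n models within each model size.
    return (table.sort_values(["n_features", "adj_r2"], ascending=[True, False])
                 .groupby("n_features")
                 .head(top_n)
                 .reset_index(drop=True))


# Hypothetical home-price data with the four features discussed in the video:
# homes = pd.read_csv("home_prices.csv")
# print(best_subsets(homes, "price",
#                    ["sqft", "bathrooms", "exemplary_school", "bedrooms"],
#                    max_features=4, top_n=6))
```

Dedicated best-subsets routines, such as R's leaps package, typically use branch-and-bound shortcuts rather than literally fitting all 2^p models, but the result has the same shape as the JMP table described next: the best few models at each size, with their fit criteria side by side.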
So here is the output using best subsets on the home price model we've been using in this series. What I told JMP to do was give me up to the six best models and up to four terms per model. I did this on purpose, of course, because four choose two is six, so I wanted to make sure I got all the two-variable models in the middle, and I wanted to show all four variables being entered. But I could have said, "Hey, JMP, give me the two best models for up to all four variables entered," or "give me the three best models but limited to three variables," and so on and so forth. You can do whatever you want with the output; it's really up to you. In this case, I made sure to include all of them because it's a very small feature set.

In the Number column you can see how many feature variables are in each model. So it should make sense that in the first block here, we have four single-variable models: one for square footage, one for number of bathrooms, one for whether or not it's an exemplary high school, and one for number of bedrooms in the home. So there are four one-feature models. Then we have the two-feature models, the three, and then the four. See how this works? We find this out just by using combinations. Four choose one is four; that's just a combination problem, and it makes sense, because with four variables there will be four one-variable models. For the two-variable models, that's four choose two, so there are six possible combinations. For the three-feature models, four choose three is four again. And for four choose four, there's only one way to choose all four things, so that's the single model at the bottom. So you can see how the number of possible models depends on how many features we allow.

Then over here, we have the evaluation criteria for all of these models. We have r-square. We have root mean square error. We have the AICc; some output shows plain AIC as well. We have BIC. And then Mallows Cp, which is a C with a lowercase p. These are all ways to evaluate which model is the best. What we're going to do in future videos in this series is look at each one of these individually. Probably not r-square or root mean square error, because those are basic statistical measures and you should be familiar with them by now, but we will look at AIC, BIC, and Mallows Cp as we go forward.

Note that the model with zero features is not shown in this output, but it is a model option. It's possible that none of the four variables makes any difference at all, so we wouldn't actually put any of them in, and then the predicted value would just be the mean, $207,000. I think that was the mean house price. So the output doesn't show that model, but it is an option, and I've seen it happen, so just keep that in mind.

But the question is, which is the best model? How do we decide? How are the best subsets results different from what the other methods, forward, backward, and stepwise, might give us? Again, what we will do in the next videos in this series is look at best subsets and how we evaluate the best models. We'll look at all of these criteria to examine how we might choose which one to use. But I can tell you right now that in this output, the evaluation criteria over here, sort of in the middle, are actually in conflict. One measure is telling us one thing is the best model; another is telling us something else is the best model. And then we, the analysts, have to figure it out using our judgment.

So when so many models are present, we need a way to decide which is the best for our purposes. The same basic goal applies: the simplest model that reduces error the most, so that we avoid overfitting and keep complexity low. Several measures have been introduced to help with selection: Mallows Cp, AIC, AICc (which is a take on AIC, of course), BIC, and of course r-square, which we're familiar with. Usually the output gives us all of these, and we have to look at them and determine, based on what each one does (because they each do something slightly different), which model we will select as the best. That's what we'll get into in the next series of videos. As I mentioned, these methods will not always agree. You as the analyst must use judgment, domain knowledge, common sense, and whatever your goals are to choose the best model.
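Since those criteria come up again in the next videos, here is a small companion sketch, again illustrative rather than the video's JMP output, showing where AIC and BIC come from in statsmodels and how Mallows Cp can be computed by comparing a candidate model against the full model. The variable names (y, X_full, X_sub) are placeholders.

```python
import statsmodels.api as sm


def mallows_cp(sub_fit, full_fit):
    """Mallows Cp = SSE_p / MSE_full - (n - 2 * p), where p is the number of
    estimated coefficients in the candidate model, including the intercept.
    Values of Cp close to p suggest the smaller model has little bias."""
    n = int(full_fit.nobs)
    mse_full = full_fit.mse_resid   # SSE of the full model / its residual df
    p = int(sub_fit.df_model) + 1   # candidate features + intercept
    return sub_fit.ssr / mse_full - (n - 2 * p)


# Assuming y, X_full (all candidate features), and X_sub (one subset) exist:
# full_fit = sm.OLS(y, sm.add_constant(X_full)).fit()
# sub_fit = sm.OLS(y, sm.add_constant(X_sub)).fit()
# print(sub_fit.rsquared, sub_fit.aic, sub_fit.bic, mallows_cp(sub_fit, full_fit))
```

Lower AIC, AICc, and BIC are better, and a Cp near the model's own parameter count is a good sign; because each penalizes complexity differently, they can point to different winners, which is exactly the conflict described above.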
Okay, so that wraps up this video on best subsets regression. I hope you get the basic concept of how these models are built in the software, how the output is given to you, how you can limit the output here and there to make it manageable, and how you'll begin to assess which model reduces error the most while also being the simplest. Again, we'll get into that in future videos. I hope you found this video helpful. Thank you very much for watching. I appreciate your time, and I look forward to seeing you again in the next video. Take care. Bye-bye.
Info
Channel: Brandon Foltz
Views: 2,127
Rating: 5 out of 5
Keywords: statistics 101, brandon foltz, statistics 101 multiple regression, brandon foltz statistics 101, statistics 101: multiple regression, multiple regression, stepwise regression, forward selection, backward selection, regression analysis, best subsets regression, backward elimination, linear regression, multiple regression analysis, machine learning, machine learning tutorial, data science, Multiple linear regression, best subsets, linear regression statistics
Id: 5jHPTC6_21w
Length: 11min 58sec (718 seconds)
Published: Mon May 24 2021