Bias vs. Variance Tradeoff, Cross-Validation, and Overfitting (Part 1)

Captions
Hello, this is Professor George Easton. In this video I'm going to talk about the bias-variance tradeoff, cross-validation, and overfitting in the context of prediction.

As an introduction, the situation we are considering is as follows. Suppose we have a set of data consisting of a dependent Y variable and a set of X variables, just as you would have in a regular linear regression, but in this case we're imagining that our dependent Y variable could be some nonlinear function f of the set of X variables, plus an error term epsilon, which plays the same role as the error term in a regular linear regression. Let's also suppose that we have randomly divided our data into at least training and validation samples, and ideally we would have also included a final test sample as well. Finally, suppose that we have calculated, on the training data, an estimate of this nonlinear function f(X), which we are going to denote by f-hat(X). I want to reiterate that this X, which I've written with a capital X, really represents a set of independent explanatory variables, just as you would have in a regular linear regression.

To make this clear, if our f-hat(X) came from linear regression, then f-hat(X) would be equal to alpha-hat + beta-hat_1*X_1 + beta-hat_2*X_2 + ... + beta-hat_p*X_p, where the coefficients alpha-hat and beta-hat_1 through beta-hat_p are just the usual least-squares fitted regression coefficients. These fitted coefficients, the intercept alpha-hat and the slope coefficients beta-hat_1 through beta-hat_p, depend on the Y values and the X values in the data they are estimated from, and so our f-hat(X) depends on the training data. If I were to use a more complete and thorough notation, I would write f-hat(X) as depending on the Y training data and the X training data, but this notation is cumbersome, so I'm going to suppress the dependency on the training data here; just remember that f-hat(X) really depends on the training data through the estimated coefficients. My example here has been based on linear regression because I expect that you are very familiar with it, but what I'm really thinking about is that f-hat(X) is some nonlinear function of those explanatory X variables, and that it is an attempt to estimate some true underlying nonlinear function that relates those explanatory variables to the dependent variable Y.

Before we get into the bias-variance tradeoff, I want to remind you what bias and variance are. Suppose I have a statistic S (I'm just using the capital letter S to represent some statistic computed from data), and suppose that this statistic S estimates some unknown parameter theta. To be specific to the case of regression, the fitted regression coefficient beta-hat estimates the true slope coefficient beta. The bias of our statistic S is then the difference between the expected value of S and the value of the thing it is estimating: the bias of S is E(S) minus theta. The variance of the statistic S is the expected value of the squared deviation of the statistic from its mean: take S minus E(S), which is its mean, square that difference, and then take the expectation of that squared difference. That is the definition of the variance of S.
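As a quick aside (this sketch is mine, not from the video), these two definitions can be checked by simulation: repeatedly draw data from a known model and approximate the bias and variance of the fitted slope. The specific numbers below (true slope 2.0, sample size 50, 5,000 replications) are arbitrary choices for illustration.

    # Approximate the bias and variance of a statistic S by simulation.
    # Here S is the ordinary least-squares slope estimate, and the unknown
    # parameter theta is the (known, in the simulation) slope beta = 2.0.
    import numpy as np

    rng = np.random.default_rng(0)
    beta_true, n, n_sims = 2.0, 50, 5000
    estimates = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.uniform(0, 1, n)
        y = 1.0 + beta_true * x + rng.normal(0, 0.5, n)    # Y = alpha + beta*X + epsilon
        estimates[i] = np.polyfit(x, y, 1)[0]              # fitted slope beta-hat
    bias = estimates.mean() - beta_true                    # E(S) - theta
    variance = np.mean((estimates - estimates.mean())**2)  # E[(S - E(S))^2]
    print(f"estimated bias {bias:.4f}, estimated variance {variance:.4f}")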
Now applying these definitions to our function f-hat(X): the bias of f-hat(X) is just the expectation of f-hat(X) minus the true f(X) that it is trying to estimate. Similarly, plugging f-hat(X) into the definition of the variance, the variance of f-hat(X) is the expectation of the squared difference between f-hat(X) and its own expectation.

I'm now in a position to give you the formula for the bias-variance tradeoff and to explain each of its terms (the decomposition is written out in symbols just after this passage). The term on the left-hand side, the one we are going to decompose into three terms on the right-hand side, is the expected mean square error of the prediction on the validation sample: the actual realized value Y minus the value predicted by f-hat(X), squared, with the expectation then taken of that squared difference. This expected mean square error on the validation sample is equal to the variance of the fit (the formula I just showed you: f-hat(X) minus its expectation, squared, and then the expectation of that), plus the squared bias (the expectation of f-hat(X) minus the true value f(X), squared), plus the variance of the error term epsilon, which, as usual, we assume is sigma squared.

I have a note here which is really mostly to myself. The Y here is a new observation from the validation sample. f-hat(X) will be computed using the X from the validation sample, but the coefficients that define f-hat(X) will have been estimated from the training sample. Since the coefficients that define f-hat were calculated from the Y's that were part of the training sample, the new observation Y and f-hat(X) are uncorrelated. The X used in f-hat(X) is the new X corresponding to this new Y observation, but our analysis is conditional on this X, so it is considered a constant.

It's not all that difficult to derive this formula, but I'm not going to do that here because I don't think the derivation is all that useful. What is important is the idea that the prediction mean square error on the validation sample equals the squared bias of the fit, plus the variance of the fit, plus the variance of the error. This is sometimes stated in a shorthand way as "the MSE is the bias plus the variance." That is not quite right, because it is really the bias squared, and the variance of the error has been left off, but the basic idea is that the major components of the prediction mean square error correspond to a bias piece plus a variance piece.
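For reference, here is the decomposition just described written out in standard notation, with Y a new observation from the validation sample, X treated as fixed, and f-hat estimated from the training data:

    \mathbb{E}\!\left[\big(Y - \hat{f}(X)\big)^2\right]
      \;=\; \underbrace{\mathbb{E}\!\left[\big(\hat{f}(X) - \mathbb{E}[\hat{f}(X)]\big)^2\right]}_{\text{variance of the fit}}
      \;+\; \underbrace{\big(\mathbb{E}[\hat{f}(X)] - f(X)\big)^2}_{\text{squared bias}}
      \;+\; \underbrace{\operatorname{Var}(\epsilon)}_{\sigma^2}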
Now, on the training sample, the mean square error always goes down as the model f-hat(X) becomes more complex, or has more parameters. This is intuitive: the more flexibility you have in fitting the data, which is really what is meant by saying that f-hat(X) becomes more complex, the closer you should be able to come to a perfect fit of that data. But on the validation sample, things are different. On the validation sample, what generally happens is that the mean square error will usually go down as the model increases in complexity, but only up to a point. The mean square error goes down initially as you increase complexity because the bias is being reduced, but then the mean square error on the validation sample will start to go up, and it goes up because of the phenomenon of overfitting. Overfitting begins to occur when the model complexity gets sufficiently high that random features of the training data are being fit by the model. This is a very important idea: the mean square error will initially decrease but then increase on the validation data, even though the mean square error continues to decrease on the training data. As I say, this is a very important concept to understand.

Now I'm going to make an illustration that shows what I've been talking about over the last few slides. I'm going to begin with an x and a y axis. On the x-axis we represent the complexity of the model we're fitting (remember that complexity often corresponds to the number of parameters), and on the y-axis we put the performance of the estimator, which, since we've been talking about prediction, is in this case the mean square error. On the training data set, our models may start out performing not very well, because they are not sufficiently complex to represent the structure in the data; when the model is too simple for what's going on in the data, it will fundamentally suffer from a substantial amount of bias and won't be able to perform very well. As the model complexity goes up, the performance on the training data will improve, and it will continue to improve as the bias is removed, because the model becomes sufficiently complex to actually fit what's going on in the data. But then it will continue even further, and eventually it is going to fit the training data essentially perfectly. It is typically able to fit the data essentially perfectly when the number of parameters is about the same as the number of data points.

What happens on the validation data? On the validation sample you are very likely to start in essentially the same place if you are fitting a model that is too simple for the structure in the data. Then, as the complexity increases, the performance of the model on the validation data will also improve, but at some point it will start to taper off. This happens when the complexity of the model is sufficient that there is essentially no bias; as you add additional complexity beyond that, the model begins to fit random noise in the data, and as a result the performance on the validation data will decline. So one trace is for the validation data and the other trace is for the training data. The best possible model is the one that corresponds to the minimum of the trace for the validation data: it is the model with sufficient complexity to fit the structure in the data, but not so much complexity that overfitting is beginning to occur.
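The two traces just described can be reproduced with a small simulation. The sketch below is my own illustration, not the professor's: it uses polynomial degree as the measure of model complexity on synthetic data, and you should see the training mean square error keep falling as the degree grows while the validation mean square error falls and then rises once overfitting sets in.

    # Training vs. validation MSE as model complexity (polynomial degree) grows.
    import numpy as np

    rng = np.random.default_rng(1)
    def f(x):                                  # the "true" nonlinear f(X)
        return np.sin(2 * np.pi * x)

    x_train = rng.uniform(0, 1, 30)
    y_train = f(x_train) + rng.normal(0, 0.3, 30)
    x_val = rng.uniform(0, 1, 30)
    y_val = f(x_val) + rng.normal(0, 0.3, 30)

    for degree in range(1, 13):
        coef = np.polyfit(x_train, y_train, degree)              # fit on training data only
        mse_train = np.mean((y_train - np.polyval(coef, x_train)) ** 2)
        mse_val = np.mean((y_val - np.polyval(coef, x_val)) ** 2)
        print(f"degree {degree:2d}: training MSE {mse_train:.3f}, validation MSE {mse_val:.3f}")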
Before I go on, I want to make some comments about the problems of overfitting and model selection in statistics. Traditionally in statistics, the sample sizes we have been dealing with have not been so large that it has been particularly practical to hold out a validation and a test sample. As a result, the field of statistics has developed a number of goodness-of-fit methods that try to prevent overfitting using the training data alone, that is, without a validation sample, so it is all done within sample. One example that I expect you are familiar with is the adjusted R-squared in ordinary least-squares linear regression. As you probably know, the adjusted R-squared can be used to choose between regression models that have different sets of X variables. The idea of the adjusted R-squared is that it penalizes adding X variables that do not contribute anything to the regression; this is different from the regular R-squared, which will continue to get better as new X variables are added to the model. Using the adjusted R-squared to try to pick the best model is intended to prevent overfitting, because the adjusted R-squared does not reward adding variables that are not contributing in a meaningful way to the overall fit and are just fitting random noise.

Another goodness-of-fit measure that is used in model selection and intended to prevent overfitting is the AIC, which stands for the Akaike information criterion; it is named after the professor who proved the main asymptotic result about this measure's behavior. For the AIC, the smaller the value, the better. There is also a corrected version of the AIC that works better in small samples, called the AICc. Like the adjusted R-squared, the AIC can be used to choose between models that have different sets of X variables. I'll add here that the absolute magnitude of the AIC does not have a reasonable interpretation; the usefulness of the AIC is in comparing values between models that have different sets of X variables. The AIC, and the AICc in smaller samples, is probably a better method for model selection and for preventing overfitting than the adjusted R-squared, but overall, having a randomly drawn validation sample is a more direct test of overfitting, and when you do have enough data to randomly split your sample into a training set and a validation set, using the validation sample is probably better than using any of the in-sample measures. Do not interpret this to mean that measures such as the adjusted R-squared, the AIC, or the AICc are not useful, because they are: I think they tend to be quite useful for limiting the set of models that you fit on the training data set, and then you can go on to verify which one of those is best using the validation sample.
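For concreteness, here is a sketch (my own, not from the video) of how these in-sample measures can be computed for a fitted linear regression. The AIC below uses the usual Gaussian-likelihood form up to an additive constant, so only differences between models fit to the same data are meaningful; the function name and the choice of which coefficients to count in k are illustrative assumptions.

    # Adjusted R-squared, AIC, and AICc for an ordinary least-squares fit.
    import numpy as np

    def in_sample_measures(X, y):
        n, p = X.shape                                 # p predictors, no intercept column
        X1 = np.column_stack([np.ones(n), X])          # add the intercept
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares coefficients
        resid = y - X1 @ beta
        rss = resid @ resid
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - rss / tss
        adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
        k = p + 1                                      # intercept + slopes (also counting sigma^2
                                                       # would shift every model's AIC equally)
        aic = n * np.log(rss / n) + 2 * k              # Gaussian AIC, up to a constant
        aicc = aic + 2 * k * (k + 1) / (n - k - 1)     # small-sample correction
        return adj_r2, aic, aicc

    # Usage idea: compute these for candidate models with different X columns and
    # prefer higher adjusted R-squared and lower AIC / AICc.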
I've talked some about overfitting at this point, but now I want to address overfitting specifically. When a model that is fit to a data set has sufficient flexibility (and here flexibility and complexity refer to essentially the same thing) that it begins to fit random features in the data, it is said to overfit the data. When you have such a model, the bias in the model will typically have gone down, because the flexibility allows the model to fit the main structural features in the data, but the variance will increase, because the model is actually responding to, and fitting, the noise in the data. Referring back to the bias versus variance tradeoff: when you are overfitting the data, the bias will be low but the variance will be high. And referring back to our Y = f(X) plus random error model, what is happening is that the fit will be picking up the noise in those error terms, and the residuals will be unrealistically small.

We've talked a lot about the validation data set so far, and we've mentioned cross-validation in passing, but now I want to address cross-validation directly. In statistics, methods that split the data into training and validation data sets generally go under the name of cross-validation, so what we've been doing, which is just splitting the data randomly into a training set and a validation set, is a form of cross-validation. Let me clarify the difference between the first and the second bullet here: many of the methods that go under the name of cross-validation in statistics involve splitting the data into training and validation data sets multiple times. What we have done so far is take our data and split it into a training set, a validation set, and perhaps a test data set as well, but just a single time. Splitting the data into a training and a validation data set randomly a single time is really the simplest kind of cross-validation we might consider doing.

Let me remind you how you generally split data in data science. In that context you are generally thinking about very large samples, and you want to split your data into a training sample, a validation sample, and finally a test sample. Typical percentages for the splits might be 50% for the training sample, 25% for the validation sample, and 25% for the test sample, or you might split 60% for training, 20% for validation, and 20% for test. If you have a relatively small number of observations, say 300, you might forego the test sample and just use a 2/3-1/3 split: 200 observations in the training sample and 100 in the validation sample.

How do you use these samples? The training data is used to estimate, or fit, a relatively small set of models. For example, these models may be several linear regressions that have different sets of X variables, or a collection of regression trees with different numbers of end nodes, or you might compare regression trees to linear regression models. When I say fitting, what is meant is the estimation of the parameters of these models; if one of the models is a linear regression, fitting means calculating the fitted linear regression coefficients. The validation data set is used for a different purpose than fitting the coefficients: it is used to pick which of the models being considered, the ones that were fit on the training data set, is the best. In the context of prediction, best is determined by which of those models has the smallest error in predicting the Y values on the validation data set. So the validation data set is really being used to determine the right level of model complexity, or equivalently, the right level of model flexibility.

The real power of cross-validation is that it generally prevents overfitting the data. Why does that happen? The random noise, or random features, that end up in the training data set and are fit by a too-flexible model are not likely to also occur in the validation set. Different random features may occur there, but those features won't have had any influence on the model fit to the training data set. So a model that overfits the training data set will generally perform badly on the validation data, and because of this bad performance in comparison to simpler models that do not overfit, the overfitting model will not be selected by the validation data.
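The single-split workflow just described might look like the sketch below. It is a hypothetical illustration on synthetic data, using a 50/25/25 split and polynomial degree as a stand-in for the candidate models; none of the specific numbers come from the video.

    # Single random split into training / validation / test (50% / 25% / 25%),
    # fit candidate models on the training set only, then pick the model with
    # the smallest prediction MSE on the validation set.
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, 400)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 400)     # stand-in data set

    idx = rng.permutation(len(x))
    n_train, n_val = int(0.50 * len(x)), int(0.25 * len(x))
    train, val, test = np.split(idx, [n_train, n_train + n_val])

    candidate_degrees = [1, 3, 5, 9]                        # candidate model complexities
    val_mse = {}
    for degree in candidate_degrees:
        coef = np.polyfit(x[train], y[train], degree)       # fit on training data only
        val_mse[degree] = np.mean((y[val] - np.polyval(coef, x[val])) ** 2)

    best = min(val_mse, key=val_mse.get)                    # model chosen by the validation set
    print("validation MSE by degree:", val_mse, "-> chosen degree:", best)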
Now, once you've decided on your model, you can then, if you want, refit the model to both the training and the validation data, which will of course increase your sample size and therefore reduce the variation in your estimates; then, if you have a test data set, it can be used as a final check (a continuation of the earlier sketch showing this step appears below).

What I've been discussing here is one of the most important and central ideas in data science, so I want to summarize it again and make sure I've clearly stated how to think about what's going on. The training data set is used both to discover and to fit (which means to estimate the parameters of) a reasonably small number of models. These models should typically span a range of flexibility, or complexity, because exactly how much flexibility you need to fit the structure of the data without overfitting is typically not clear from examining the training data set alone. The deliverable, the thing you should get out of the training data set, is a reasonably small collection of models, together with their fits (estimates of their parameters based on the training data), that can then be tested on the validation data set. These models, which were discovered and fit on the training data set, then compete on the validation data set by predicting the Y variable. What this means is that the X variables for each observation in the validation data set are plugged into these predicting functions to get a prediction for the Y variable. You do not refit or re-estimate the parameters of these models on the validation data set; you just apply the models to the X variables to get predictions for the Y variables. The validation data set is then used to pick the right model complexity and/or structure, based on how well the models do in predicting the Y values on the validation data set.

In conclusion, it is really hard to overstate the importance of using cross-validation in data science, really any time you have enough data that cross-validation is feasible. This problem of overfitting the data is generally very much underappreciated; it is not especially intuitive, and if you have enough data, cross-validation is easy enough to do, so you should basically always do it so that you get an out-of-sample assessment of whether or not the model you are considering is in fact overfitting the data. Finally, there are other ways to do cross-validation, such as leave-one-out cross-validation or k-fold cross-validation, that I'm going to talk about elsewhere. The purpose of this video is to clearly identify what the bias-variance tradeoff is and to discuss how cross-validation prevents the problem of overfitting. In part two, we're going to look at an example that will give you a very clear idea of what overfitting is and how cross-validation tends to prevent overfitting from occurring.
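Continuing the hypothetical sketch from the splitting example above (and reusing its x, y, train, val, test, and best variables), the refit-and-final-check step mentioned at the start of this passage might look like this:

    # Refit the chosen model on training + validation data, then use the
    # held-out test set exactly once as a final check on prediction error.
    import numpy as np

    x_refit = np.concatenate([x[train], x[val]])
    y_refit = np.concatenate([y[train], y[val]])
    coef_final = np.polyfit(x_refit, y_refit, best)          # larger sample -> less variable fit
    test_mse = np.mean((y[test] - np.polyval(coef_final, x[test])) ** 2)
    print(f"final test MSE: {test_mse:.4f}")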
Info
Channel: ProfessorEaston
Views: 9,274
Rating: 4.9642859 out of 5
Keywords: data science, bias vs. variance, cross-validation, overfitting, regression trees
Id: jiQamxz2ZcQ
Length: 26min 44sec (1604 seconds)
Published: Mon Dec 07 2015