cross validation

Captions
Alright, so in this video I'm going to show you how to do the cross-validation process that's necessary when you use any kind of statistical regression. Again, we're testing for overfitting, so I'll just use the same exact example I did in class.

Really, the first thing to look at is: what are your predictors? I think in class we used this as the criterion, this is visits to healthcare professionals, and then we have these three as predictors: stressful life events, mental health, and then physical health. These are the three predictors we'll enter into stepwise or forward or backward regression, and they're all predicting this.

But before we do that, what we really need to do is select a random sample of the data to build the model, and then we're going to cross-validate that against the other 20% that's left out. So we have to go to Data, Select Cases, and then Random sample of cases, Sample, and we're going to select 80%. You can do however many you like; 80% is a pretty good number, though, or a pretty good proportion. Click Continue, then OK, and what it's done is select a certain number of those cases. Again, that's random. The ones that are crossed out are not selected.

Next, what we need to do is create a new variable. Call this "sample," and then we'll give it two different values: 0 is the 20% sample and 1 is the 80% sample. All we have to do after that is simply copy this filter variable, which is temporary; I'm going to copy that over to this. Now, if you'll remember from class, we have this weird case down here. I'm not sure what's going on with that, so I'll just delete that. Okay, now I've got the filter taken care of, so we're going to cut that and remove it. So that's our sample.

The next thing to do is make sure that the right sample of your data is selected. Go back to Select Cases, go to If condition is satisfied, click on If, and select sample equal to 1. Again, this is just making sure that the 80% sample is selected and not the 20%. Click OK, and then we should be good.

Now we're ready to build our model. We go to Analyze, Regression, Linear. Visits to health professionals is going to be our criterion, and then we're going to have these three as our predictors. We're going to use stepwise this time, and we'll be good with that. We don't really need to select anything up here, so we'll click OK. Alright, so you can see exactly what the results are. Again, they're the same as the ones we went over in class: you've got two predictors in the final model, and you've got everything you need right here to build your equation and then create predicted values.

The thing to do next is go back to Data, Select Cases, and then select All cases, or you can just click Reset and it changes everything back. Click OK, and now we have all of the data that we're looking at at once. You can see over here that nothing's crossed out. We can delete this filter variable again; we don't really need it at the moment.

Okay, so the next thing to do is create the predicted scores. We're going to go to Transform, Compute, and we're going to call it "predicted." All we have to do next is input the appropriate values here. We're going to use the y-intercept first, which is -3.978, plus 1.769, which is this right here, the regression coefficient for physical health symptoms. We have to multiply that by physical health symptoms, and then close the parentheses. Then we're going to add the last predictor, stressful life events, so it's .016, again times stressful life events. We don't need mental health symptoms here because it was excluded from the analysis. Under Excluded Variables, mental health symptoms was never entered into the model, so it's not going to be entered into our equation here. Then we can create predicted values for all of the cases, the 20% and the 80% sample, even though this model right here was built using only the 80% sample. So we're going to click OK.

Now we're back here and we've got the predicted values created, so the only thing to do next is split the data. We're going to go to Data, Split File, and choose Compare groups. What we're going to compare are the samples here, so this is going to compare our 20% and 80% samples. We click OK.

Now we go to Analyze, Correlate, Bivariate, and we're going to compare our predicted values with the actual criterion. You want a high degree of overlap here: you want a pretty large positive correlation coefficient, and then we're going to see if that's about the same for the 20% sample and the 80% sample. So we'll click OK.

As we showed in class, this correlation coefficient is .458 for the actual sample the model was created from, and it's a little bit larger for the 20% sample. It's kind of an odd thing that occurs here, but what it does suggest is that there's not overfitting. If we had .458 here and, like, .258 or something right here, then that would probably suggest overfitting. In that case, what you would want to do is collect a larger sample, preferably from a more generalized group. You want different participants, something that's more generalizable.

So hopefully this has been helpful, and we can obviously talk about it in class if you have any questions.
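
For readers who want to try the same hold-out check outside SPSS, here is a minimal Python sketch of the steps described above: flag a random 80% of cases, fit the regression on that 80% only, compute predicted values for every case, and correlate predicted with actual scores separately in each sample. Several things here are assumptions and not part of the video: the data are synthetic (generated loosely around the coefficients quoted in the captions, with an arbitrary noise level), the helper names `fit_ols`, `predict`, and `pearson_r` are made up for illustration, and the model is a plain least-squares fit on the two retained predictors, not SPSS's stepwise selection.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(x) for x in X]              # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, x):
    """Apply intercept + sum(b_j * x_j), like the Transform > Compute step."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

# --- Synthetic data (assumption: NOT the class data set) ---
rng = random.Random(0)
n = 200
# Columns: (physical health symptoms, stressful life events); scales are made up.
X = [(rng.gauss(2, 0.5), rng.gauss(100, 30)) for _ in range(n)]
y = [-3.978 + 1.769 * p + 0.016 * s + rng.gauss(0, 2) for p, s in X]

# --- Flag a random 80% of cases: sample = 1 (build), 0 (hold-out) ---
idx = list(range(n))
rng.shuffle(idx)
build = set(idx[: int(0.8 * n)])
sample = [1 if i in build else 0 for i in range(n)]

# --- Fit on the 80% sample only ---
coefs = fit_ols([X[i] for i in range(n) if sample[i] == 1],
                [y[i] for i in range(n) if sample[i] == 1])

# --- Predicted values for ALL cases, then correlate within each sample ---
predicted = [predict(coefs, x) for x in X]
r_by_sample = {}
for flag in (1, 0):
    actual = [y[i] for i in range(n) if sample[i] == flag]
    pred = [predicted[i] for i in range(n) if sample[i] == flag]
    r_by_sample[flag] = pearson_r(pred, actual)

print("80% (build) sample r:", round(r_by_sample[1], 3))
print("20% (hold-out) sample r:", round(r_by_sample[0], 3))
```

Reading the output works the same way as in the video: two correlations of similar size suggest the equation generalizes, while a hold-out correlation that is much smaller than the build-sample one points toward overfitting.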
Info
Channel: William Hill
Views: 24,868
Id: I8T9tYpW_lQ
Length: 6min 57sec (417 seconds)
Published: Fri Apr 24 2015