Cross Validation

Captions
Alright, so let me try to get to this concept of cross validation. So, imagine that we've got our data; this is our training set. We can, again, picture it geometrically in the case of regression. And ultimately what we're trying to do is find a way of predicting values and then testing them. So what we imagine is we do some kind of regression, and we might want to fit this to a line. And, you know, the line is good, it kind of captures what's going on, and if we apply this to the testing set, maybe it's going to do a pretty good job. But if we are, you know, feeling kind of obsessive-compulsive about it, we might say, well, in this particular case we didn't actually track all the ups and downs of the data. So what can we do if we fit it with the line and the error's not so great? What else could we switch to, Charles? >> We could just use the test. No, sorry. What, what I mean is, if we fit this to a line and we're sort of not happy with the fact that the line isn't fitting all of the points exactly, we might want to use, uh, maybe a higher-order polynomial. >> Oh, I'm sorry, I totally misunderstood you. >> To fit this better. So we can fit this with a higher-order polynomial, and maybe it'll hit all these points much better. You know, so we have this kind of other shape, and now it's making weird predictions in certain places. So, really, what we'd like to do is, and what was your suggestion? If we trained on the test set, we would do much better on the test set, wouldn't we? >> Yes. >> But that, that's definitely cheating. >> Why is it cheating? >> Why is it cheating? Well, if we exactly fit the, the test set, that's not a function at all, is it? [LAUGH] If we exactly fit the test set, then again that's not going to generalize to how we use it in the real world. >> So the goal is always to generalize.
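The line-versus-polynomial contrast above can be sketched in a few lines of NumPy. Everything here is a toy assumption for illustration: the roughly linear data, the random seed, and the choice of degrees 1 and 9. The point is that a high-enough-order polynomial can hit every training point, driving training error toward zero, which is exactly the "weird predictions" situation described.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 0.3 * rng.standard_normal(10)  # roughly linear data with noise

line = np.polyfit(x, y, 1)    # fit a line (degree 1)
wiggly = np.polyfit(x, y, 9)  # degree 9: enough freedom to hit every point

def training_mse(coeffs):
    """Mean squared error of the fit on the training points themselves."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# The high-order fit scores (near) zero error on the points it was
# trained on -- but that says nothing about how it behaves in between
# or beyond them, which is where the weird predictions show up.
```

Low training error for the degree-9 fit is not evidence of a good model; that is the gap cross validation is about to close.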
The test set is just a stand-in for what we don't know, what we're going to see in the future. >> Yes, very well said. Thank you. >> Actually, that suggests something very important, right? It suggests that, um, nothing we do on our training set, or even if we cheat and use the test set, actually makes sense unless we believe that somehow the training set and the test set represent the future. >> Yes, that's a very good point, that we are assuming that this data is representative of how the system is ultimately going to be used. In fact, there's an abbreviation that statisticians like to use, i.i.d.: we really count on the data being independent and identically distributed, >> Mm-hm. >> which is to say that all the data that we have collected is really coming from the same source, so there's no sort of weirdness where the training set looks different from the test set, which looks different from the world; they are all drawn from the same distribution. >> So would you call that a fundamental assumption of supervised learning? >> I don't know that I'd call it a fundamental assumption of supervised learning per se, but it's a fundamental assumption in a lot of the algorithms that we run, that's for sure. >> Fair enough. >> There are definitely people who have looked at, well, what happens in real data if these assumptions are violated? Are there algorithms that we can apply that still do reasonable things? But for the stuff that we're talking about? Yes, this is absolutely a fundamental assumption. Alright, but here's where I'm trying to get with this stuff. So what we really would like to do is use a model that's complex enough to actually model the structure that's in the data that we're training on, but not so complex that it's matching that so directly that it doesn't really work well on the test set. But unfortunately we don't really have the test set to play with, because that, again, is too much teaching to the test.
We need to actually learn the true structure that is going to need to be generalized. So how do we find out? How can we pick a model that is complex enough to model the data, while making sure that it hasn't started to kind of diverge in terms of how it's going to be applied to the test set? If we don't have access to the test set, is there something that we can use in the training set that we could have kind of act like a test set? >> Well, we could take some of the training data and pretend it's a test set, and that wouldn't be cheating, because it's not really the test set. >> Excellent. Indeed, right, so there's nothing magic about the training set all needing to be used to fit the coefficients. It could be that we hold out some of it as a kind of make-pretend test set, a test test set, a trial test set, what we're going to call a cross-validation set. And it's going to be a stand-in for the actual test data, something we can make use of that doesn't involve actually using the test data directly, which would ultimately be cheating. So this cross-validation set is going to be really helpful in figuring out what to do. Alright, so here's how we're going to do this, this concept of cross validation. We're going to take our training data, and we're going to split it into what are called folds. I'm not actually sure why they're called folds. I don't know if that's a sheep reference. >> Why would it be a sheep reference? >> I think there's a sheep-related concept that is called a fold. Like, you know, we're going to bring you back into the fold. >> Oh. >> It's like the group of sheep. >> You are just chock-full of knowledge. >> Alright, so what we're going to do is train on the first three folds, and use the fourth one to see how we did. Then train on the [LAUGH] second, third, and fourth folds and check on the first one.
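The hold-out idea can be sketched directly. The toy data, the seed, and the 25% hold-out fraction are all assumptions for illustration; the essential moves are shuffling first and never touching the real test set.

```python
import numpy as np

rng = np.random.default_rng(42)
train_x = np.linspace(0.0, 1.0, 100)                 # toy training inputs
train_y = 3.0 * train_x + 0.1 * rng.standard_normal(100)

# Shuffle, then hold out 25% of the TRAINING data as a make-pretend
# test set (the cross-validation set). The remaining 75% is what we
# actually fit coefficients on. The real test set is never touched.
perm = rng.permutation(len(train_x))
n_holdout = len(train_x) // 4
cv_idx, fit_idx = perm[:n_holdout], perm[n_holdout:]

cv_x, cv_y = train_x[cv_idx], train_y[cv_idx]
fit_x, fit_y = train_x[fit_idx], train_y[fit_idx]
```

Shuffling before splitting matters: if the data arrived sorted, a contiguous hold-out slice would not look like the rest of the distribution, violating the i.i.d. assumption discussed above.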
And we're going to try all these different combinations, leaving out each fold as a kind of fake test set, and then average these errors, the, uh, the goodness of fit. Average them all together to see how well we've done. And the model class, so like the degree of the polynomial in this case, that does the best job, the lowest error, is the one that we're going to go with. Alright, if this is still a little bit abstract, let me ground this back out in the housing example.
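Put together, the procedure described here might look like the following sketch. The sine-plus-noise data, the four folds, and the range of candidate degrees are assumptions chosen for illustration; the structure is the lecture's recipe: leave each fold out in turn, average the held-out errors per model class, and keep the degree with the lowest average.

```python
import numpy as np

def cv_error(x, y, degree, k=4):
    """Average held-out squared error of a degree-`degree` polynomial,
    training on k-1 folds and testing on the remaining fold, k times."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for i in range(k):
        held_out = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        preds = np.polyval(coeffs, x[held_out])
        errs.append(np.mean((preds - y[held_out]) ** 2))
    return float(np.mean(errs))

# Toy data: one period of a sine wave plus noise, shuffled so each
# fold sees the whole range of x values.
rng = np.random.default_rng(0)
x = rng.permutation(np.linspace(0.0, 1.0, 40))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(40)

# Pick the model class (polynomial degree) with the lowest average
# cross-validation error.
best_degree = min(range(1, 8), key=lambda d: cv_error(x, y, d))
```

In practice a library helper such as scikit-learn's `cross_val_score` would handle the fold bookkeeping; the explicit loop here is just to mirror the description in the lecture.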
Info
Channel: Udacity
Views: 118,100
Keywords: machine learning, supervised learning, computer science, Georgia Tech, Udacity
Id: sFO2ff-gTh0
Length: 6min 7sec (367 seconds)
Published: Mon Feb 23 2015