Cross Validation : Data Science Concepts

Captions
Hey, welcome back everyone. Today we're going to be talking about a really cool and very important concept in data science called cross validation. Let me jump into an example from my own experience. When I was an undergrad, I was doing research on students in the mathematics department at my college. The big question we wanted to answer was: based on information about a student (we had access to their grades, which courses they took, and their demographic information), could we come up with a model that would accurately predict whether the student would drop out of their major? By "drop out of their major" I mean switch their major to something else. We wanted to identify these students early. Honestly, there are a lot of reasons students drop out of a major. Sometimes their grades are low and they think they would do better in another major; sometimes it's not about grades at all, and they just lose interest in the current major and want to make a transition. Either way, we wanted to identify these students one quarter or semester early so that we could better help guide their transition.

In the spirit of that example, let's say there are a thousand students, and we'll walk through the typical machine learning pipeline, nothing too fancy, and see why there might be something lacking with this basic idea. We have a thousand students, and we do an 80/20 split: 80% of the data for training and 20% for testing. That means 800 students get randomly assigned to training, and the other 200 students are used to test the strength of the model we build. We go ahead and train our model on those 800 students. It doesn't matter what the model is for this video, it could be a random forest, an SVM, whatever, but we have a model M. Then we apply this model M to the 200 test students, which the model has not seen yet, and we get a measure of the strength of the model. Let's keep it simple and say we calculate the accuracy.

That should be the end of the story, right? We see whether the accuracy is high or low and go from there. Well, this is a good start for machine learning, but there are some key flaws with this idea. One is that not every student in this set of a thousand gets included in the testing set. We are only evaluating the strength of our model on these 200 students; we haven't made any attempt to evaluate it on the other 800, and we can't with this model, because that would contaminate the training and testing sets. So we have to come up with a slightly more clever idea. The other big problem is that although our sample size is somewhat big, it's not that big, and especially for sample sizes even smaller than this, you might run into issues where, depending on whatever randomness produced your test set, the accuracy you get might not be a true indication of the accuracy on the overall population. We want something more robust, and that's where the idea of cross validation comes in. Let me show you a concrete example of cross validation, then I'll generalize it a little and show you some extensions and some cautions.
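Before moving on, here is roughly what that basic 80/20 pipeline might look like in code. This is a minimal sketch, assuming scikit-learn and synthetic placeholder data; the real student features and labels from the example are not available here, so random numbers stand in for them, and the random forest is just one possible model choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data: 1000 "students" with 5 made-up features each,
# and a binary label (1 = drops the major). Purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 80/20 split: 800 students for training, 200 held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a model M on the 800 training students and score it on the 200 test students
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```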
Here's the new idea we're going to use to make this example more robust. We take our set of a thousand students and split them into five groups at random, with each group containing roughly the same number of students, so here that's 200 students per group. Now, for each i equal to 1, 2, 3, 4, and 5, we train on all of the groups whose index is not equal to i. To be concrete, the first thing we do is set i equal to 1: group 1, the set of 200 randomly chosen students in group 1, will be our testing set, and we train the model on the other 800 students, who are made up of groups 2, 3, 4, and 5. So far this is exactly what we were doing before: training on a random 800 and testing on a random 200. We build this model and call it M sub -i. You'll see this notation sometimes; it typically means the model was built on everything except set i, which is exactly the case here, since we built this model using sets 2, 3, 4, and 5. So we call this model M sub -1, we get its accuracy on set 1, and we call that a sub 1. So far, that's the exact same procedure we just did.

Now here's where this becomes cross validation and where it gets more interesting. Next we say that set 2 is our testing set. We pretend the model cannot see set 2, and we train a new model on the 800 students made up of sets 1, 3, 4, and 5. That's a different model, just to be clear, because we're using a different set of students to train it. We call that model M sub -2, since it has not been trained on set 2, and we get the accuracy of that model on the students in set 2, which we call a sub 2.
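Here is a sketch of that fold-by-fold loop, carried through all five folds (the next paragraph finishes describing them). It assumes scikit-learn's KFold and continues with the placeholder X, y, and model class from the sketch above.

```python
from sklearn.model_selection import KFold

# Split the 1000 students into 5 random groups of ~200 each
kf = KFold(n_splits=5, shuffle=True, random_state=0)
models, fold_accuracies = [], []

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Train M_{-i} on everything except fold i, then score it on fold i
    m_minus_i = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    a_i = accuracy_score(y[test_idx], m_minus_i.predict(X[test_idx]))
    models.append(m_minus_i)
    fold_accuracies.append(a_i)
    print(f"fold {i}: accuracy on held-out set = {a_i:.3f}")
```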
You can probably see where I'm going with this. The next thing we do is call set 3 our testing set and train the model on sets 1, 2, 4, and 5, and we go from there. At the end of the day we have five different models, each trained on a different subset of the five sets that together contain all of the students. To get our final measure of accuracy, we simply take the average: one fifth of the sum of all of the accuracies we collected in this process.

Now, why is this a good idea? One thing I'll say off the bat is that it's more computationally expensive. If your model took, say, five hours to train because you have lots of students, this is going to take five times longer, because you're building five models. Of course it's computationally more expensive, so why is it still a really good idea, and why is it something people do so often in practice? One of the biggest problems we address, which we didn't address before, is that every student now gets included in testing at some point. Whereas before only one set of 200 students was used for testing to determine our accuracy, now our calculation of the final accuracy tests every single one of these thousand students, because each of the sets gets used as the testing set at some point or other, which means every student in each of those sets gets used as a testing example. Our final accuracy truly combines the strength of our models on all of the students in the data set. That's one big pro.

An even bigger pro is that we're reducing the bias. Something I said earlier is that one set of 200 students, although random, could make the model look better or worse than its true performance on the general population. Here we reduce that bias. Instead of evaluating the model on just one set of 200 students, we build five separate models and test them on five different sets of 200 students, so we start getting a better idea of the true strength of the model we're building.

Of course, now let's say a completely new student comes in, someone you've never seen before. Which model do you use? If you've noticed, we've built five models here, one for each combination of training sets, so which one do we use to predict this next student? Typically you combine the models in some way. Here it's a binary classification problem, so it's relatively simple: we have our five models, M sub -1 through M sub -5, and let's say four of them predict that this new student will drop out of their major in the next semester and the last one says they will not. We can take a simple majority vote and say the prediction for this student is that they will drop out. In a regression example, you could take the literal arithmetic average of the predictions from the five models, and so on. Either way, you combine the models in some way to get your final model.

Now I'm ready to finally put a name on this procedure: it's called k-fold cross validation, and here our k was equal to 5. k is the number of sets you break your whole data set into, so for us it's called five-fold cross validation, because we took our initial set of a thousand students, broke it into five sets, and went from there.
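Continuing the sketch from above, here is one way to compute the final cross-validated accuracy and to combine the five fold models for a brand-new student. The majority vote shown here is just one simple combination rule (ties fall to "does not drop out"); x_new is a hypothetical new feature row, not real data.

```python
import numpy as np

# Final measure of accuracy: the average of the five per-fold accuracies
final_accuracy = np.mean(fold_accuracies)
print("cross-validated accuracy:", round(final_accuracy, 3))

# A completely new student: each of the five fold models casts one vote
x_new = rng.normal(size=(1, 5))
votes = [int(m.predict(x_new)[0]) for m in models]
prediction = int(sum(votes) > len(votes) / 2)   # simple majority of binary votes
print(f"votes: {votes} -> combined prediction: {prediction}")
```

For a regression problem, the analogous step would be to average the five numeric predictions instead of taking a majority vote.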
If we had instead broken the thousand students into ten sets of 100 students each, it would be called ten-fold cross validation. Five and ten are actually the typical numbers people use in practice, but there's no rule; you can use whatever you think is appropriate for your application. It is important to think about the trade-offs, though. If you make k way too big, you split your initial set into many, many sets, which means every time you test a model you're only testing it on a small number of samples. If you make k too small, say k equal to 2, then you're essentially just splitting your entire sample of students in half and only training on half, which might or might not be appropriate. So think about your choice of k; again, it's typically 5 or 10.

A very special case of cross validation is called leave-one-out cross validation, and it's a very interesting case. It basically says: take the sample of a thousand students, leave one student out each time, train the model on the other 999 students, and then apply that model to the one student you left out. Then train the model on a different set of 999 students and test it on the one student left out, and do that a thousand times. One thing to note is that this is very computationally expensive, because you have to train not five but a thousand models, but it is an option available to you when doing cross validation.

The very last thing I'll say, because we talk about time series so much on this channel and because I'm a big fan of time series, is that you have to be careful when applying this cross validation idea to time series. Here, the thousand students weren't ordered in any particular way; it doesn't really matter how we order them. But when you're looking at a time series, say a thousand samples in time, there is a very inherent ordering, and you can't apply this cross validation idea blindly. If you did, you'd run into some disasters, because you'd split your set of 1000 samples randomly into five sets and end up using information from the future and the past to predict things in the middle, and that's not really appropriate for time series. There's a whole different set of cross validation techniques for time series, and most of them basically make sure you take the order into account. So if you're going to do cross validation on time series, make sure you take the order of the samples into account, so that you're not getting the wrong measures of accuracy from your time series models.

All right, so that was the really cool and very widely used technique called cross validation. If you have any questions, please put them in the comments below. Please like and subscribe for more videos just like this, and I'll see you next time.
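For completeness, here is a short sketch of the two variants mentioned above: leave-one-out cross validation and an order-respecting split for time series. It assumes scikit-learn's LeaveOneOut, TimeSeriesSplit, and cross_val_score, and reuses the placeholder X and y from the earlier sketches; a fast logistic regression stands in for the model, since fitting 1000 leave-one-out models with a heavier model would be slow.

```python
from sklearn.model_selection import LeaveOneOut, TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)

# Leave-one-out: k = n, so 1000 folds, each holding out a single student (expensive!)
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())

# Time series: folds respect the ordering, so the model never trains on future
# samples to predict past ones
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(clf, X, y, cv=tscv)
print("time-ordered CV accuracy:", ts_scores.mean())
```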
Info
Channel: ritvikmath
Views: 23,532
Keywords: data science, sample, ai, machine learning, big data
Id: wjILv3-UGM8
Length: 10min 11sec (611 seconds)
Published: Mon Oct 19 2020