Cross Validation Overview with R

Captions
Welcome back to CradleToGraveR. My name is Mark Gingrass, and today we're going to talk about machine learning. But first we're going to talk about a concept called cross-validation. Before we get into it: if you're interested in more tutorials from me, please subscribe, like, and share this post. I really appreciate it, and thanks to everyone who's already been doing that. Let's jump right in.

I'm going to give you a demonstration on the whiteboard first; this tutorial will be entirely on the whiteboard. What I want to talk about is cross-validation, or CV. You've probably heard people talk about k-fold cross-validation, and that's what we're going to learn about right now. In k-fold cross-validation, k is the number of folds. What's a fold? We'll get to that. You've probably also heard that if you have data, you want to separate it into a training set and a test set, and sometimes a validation set. So let's visualize this, and I'll walk you through what's going on.

Picture all of our data in a box: however much data we have, think of it as rows of data. If we want to train an algorithm that will predict new data it has never seen before, we might want to leave some of those rows out, so that later on we can check whether it's making good predictions. So let's do that: I'm going to cut the data and call the upper part 75 percent of the data and the lower part 25 percent. How I cut it is arbitrary; I could have done 80/20.
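The split described above can be sketched in a few lines of base R. Everything here is hypothetical: the `heart` data frame, its column names, and the outcome variable are invented stand-ins, not from the video.

```r
set.seed(42)  # reproducible split

# Hypothetical data: 100 rows of made-up features plus a yes/no outcome
heart <- data.frame(
  exercise      = runif(100),          # invented feature
  food_habits   = runif(100),          # invented feature
  heart_history = rbinom(100, 1, 0.3), # invented feature
  at_risk       = rbinom(100, 1, 0.5)  # the "heart attack risk" answer
)

# Randomly pick 75% of the row indices for training
train_idx <- sample(seq_len(nrow(heart)), size = 0.75 * nrow(heart))
train_set <- heart[train_idx, ]
test_set  <- heart[-train_idx, ]  # the 25% we pretend we cannot see

nrow(train_set)  # 75
nrow(test_set)   # 25
```

Because the rows are sampled at random, the test set is not just the literal bottom of the file, which already guards a little against the "special chunk" problem discussed next.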
I could have done any number of different splits, but let's start with 75/25. The 75 percent is what you have access to; pretend the 25 percent is gone and you cannot see it. Now you want to train a machine learning algorithm. That could be a support vector machine, logistic regression, k-means clustering, anything. Basically, when I get new data I want to classify it, and I want to know whether I classified it correctly or not. How do you measure something like that? Cross-validation.

So let's walk through a couple of scenarios. You have 25 percent of the data that you do not train on. On the rest you build this great algorithm, and let's say at the end of the day it finds a cluster of data here and a cluster of data over there, and it puts a boundary around each one, based on whatever the features are across this data. The features could be anything: do you exercise a lot, what are your food habits, is there a history of heart disease. And there's some answer you're trying to get to. I have all these different features, maybe hundreds of them, and I know the answer for 100 percent of the data, but don't forget, I'm only showing the algorithm 75 percent of it. Here the answer would be: are you vulnerable to a heart attack? Yes or no; it's a boolean. So picture those two clusters as the two answers. The algorithm takes all of the features and tries to figure out the best way to separate the clusters so that it's correct every time.

Here's the catch: when you use just the training set, the algorithm thinks the training set is the only data in the world, and it can train itself to be pretty darn good on it. Of course it can: it has seen that data, and a computer can fit it as precisely as we want. However, now we take that 25 percent and say: run the first algorithm we came up with, call it A1, the one that says yes you're vulnerable or no you're not, on this held-out 25 percent, which is probably somewhat different from the training data. As it plugs those new points in, you might get some errors, and you can count how many times it got it right and how many times it got it wrong. That's the point. Say this whole data set is 100 observations, so the held-out set is 25 observations the algorithm has never seen before. Maybe it classifies 20 correctly and 5 incorrectly. So you get a measure of success: I trained on this data, and here are my correct and incorrect results.

Now, cross-validation. What that means is that you don't always use that same 25 percent at the bottom, because what if that particular 25 percent was a special case, rows that just happened to sit near each other in the data set? Instead, you cut the data up into pieces, folds, and you run the training over and over, each time leaving one fold out. This time I leave a different fold out: algorithm A2 doesn't use that piece for training, but trains on everything else, draws its boundaries, does its clustering for yes and no. Then you take the fold that was left out, run A2 on it, and count another set of correct and incorrect answers: maybe correct = 21 and incorrect = 4.
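The fold-by-fold counting just described can be sketched in base R. Again, everything is hypothetical: the `heart` data frame is invented, and `glm` (logistic regression, one of the options mentioned above) stands in for whichever classifier you actually choose.

```r
set.seed(42)

# Hypothetical data, as before: invented features and a 0/1 outcome
heart <- data.frame(
  exercise      = runif(100),
  heart_history = rbinom(100, 1, 0.3),
  at_risk       = rbinom(100, 1, 0.5)
)

k <- 4  # number of folds; 100 rows / 4 folds = 25 rows held out per pass

# Shuffle the row indices, then deal them into k folds
folds <- split(sample(seq_len(nrow(heart))),
               rep(1:k, length.out = nrow(heart)))

# For each fold: train on the other k - 1 folds, test on the held-out fold
results <- sapply(folds, function(test_idx) {
  fit  <- glm(at_risk ~ exercise + heart_history,
              data = heart[-test_idx, ], family = binomial)
  prob <- predict(fit, newdata = heart[test_idx, ], type = "response")
  pred <- as.integer(prob > 0.5)
  c(correct   = sum(pred == heart$at_risk[test_idx]),
    incorrect = sum(pred != heart$at_risk[test_idx]))
})

results  # one correct/incorrect column per fold, each column summing to 25
```

Each column of `results` is one pass: one fold held out, the rest used for training, exactly the A1, A2, ... bookkeeping from the whiteboard.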
So we're doing better this time. Repeat that over and over again. Each run uses what could be a very similar algorithm, maybe a variation of a support vector machine. What's a support vector machine? We haven't covered that yet, but it's a machine learning algorithm I chose; whatever algorithm draws these boundaries, I test it out over and over on the different pieces I get to train and test on. Then you can compare the results and pick what's possibly the best algorithm. Maybe algorithm A3 gets correct = 24 and incorrect = 1. If that's the case, it's probably a pretty good indicator that this is a good algorithm; or that chunk just happens to fit the training data and every other chunk won't. I doubt that's the case, but it could be; you don't know. These are the things you have to worry about when you're doing machine learning: there are trade-offs, all kinds of trade-offs. Do you want to be very strict and get nothing wrong on the training data, only to get a bunch wrong on the test data you haven't seen? Or do you want to be a little looser in your interpretation? Why is there a gap between the boundaries? Why couldn't I close them in? If you're looser, maybe you're correct more often when it comes to test data. Because if a point lands right in the middle of that gap, the algorithm gets confused. It will probably force the point into one bubble or the other, because the boundary, the algorithm, doesn't know how to handle those points. It would just be incorrect. Anyway, that's what cross-validation is. Now, it can get a little crazier, because you can do something called the leave-one-out method. You
can call it leave-one-out cross-validation. If you have a hundred rows of data, you go through every single one of them: leave that one row out, train on the other ninety-nine, and test on the row you left out. Then leave the next one out and train on everything else, and so on, essentially a hundred times. (Ninety-nine or a hundred cases? I don't know; off-by-one errors all the time, right? Programming.) At the end you can take the average, or you can use the best one. It doesn't matter that much, because really, when you get to use data beyond what you have here, that's your true test of whether it's working, regardless of how you split things up.

Now, some people split up data like this: 70 percent training data, then maybe 15 percent, so that's 70 plus 15 is 85, and then the last 15 percent as a true test, or rather a validation set. So: training, test, and then finally validation. The validation data is truly never seen until the very end. If you're running something like a Kaggle contest, or any data contest, you'd get a data set and break it up between training and test. You'd do everything you can to make the test results as good as possible, no matter which test section you choose within your data set. But ultimately there's data that somebody held back that you've never even seen: you couldn't train on it, you couldn't ever test against it. When you finally have your final algorithm, you send it over to Kaggle, or whoever is running things, and they run it on data you've never seen before, and they give you a true result. So that's pretty much how cross-validation works. There are many ways to measure success, and it's pretty intuitive and simple to
understand. I hope this gives you the basis for our next tutorial, where we'll actually split a data set into training and test sets and do some cross-validation after we run some machine learning algorithms. If you enjoy these tutorials and want to learn more R, please subscribe, hit the Discord button, and join us on Discord; that community is starting to grow. And every single time you share this on social media, you're helping me out. My goal, as I think people know by now, is to get monetized, and I'm getting closer and closer every day. I'm looking forward to seeing what kind of results that leads to and how motivating it will be when I'm done. Anyway, thanks for everything, and I'll see you in the next video.
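As a follow-up to the leave-one-out discussion above: leave-one-out is just the k = n special case, so with 100 rows you fit 100 models, each time holding out a single row. A minimal sketch, again with an invented `heart` data frame and `glm` as a placeholder classifier.

```r
set.seed(42)

# Hypothetical data: invented features and a 0/1 outcome
heart <- data.frame(
  exercise      = runif(100),
  heart_history = rbinom(100, 1, 0.3),
  at_risk       = rbinom(100, 1, 0.5)
)

n <- nrow(heart)

# For each row i: train on the other n - 1 rows, then test on row i alone
hit <- vapply(seq_len(n), function(i) {
  fit  <- glm(at_risk ~ exercise + heart_history,
              data = heart[-i, ], family = binomial)
  prob <- predict(fit, newdata = heart[i, ], type = "response")
  as.integer(prob > 0.5) == heart$at_risk[i]
}, logical(1))

mean(hit)  # fraction of the n held-out rows classified correctly
```

Averaging the hundred single-row results gives the leave-one-out estimate of accuracy mentioned in the video.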
Info
Channel: CradleToGraveR
Views: 565
Rating: 5 out of 5
Id: FcAJYJ2JFi8
Length: 11min 24sec (684 seconds)
Published: Thu Jul 30 2020