Split Data R Caret Training and Test

Captions
Welcome back, everybody, to Cradle to Grave R. My name is Mark Gingras. Today we're going to talk about how to split a data set into a training set and a test set randomly. We're going to build models off of that, and we're going to test the models using predictions. We're going to create all of this using the tidyverse and the caret package, and we'll use the mtcars data set. So let's jump right in, because that's the best way to do it, and I'll give you some nuances of why some of these things are important to know.

Let's load some libraries. tidyverse is the first library, so we can use all the functions the tidy way. If you don't have it, click on Install down here, type in tidyverse, and install. Same thing with the caret package, so library caret, c-a-r-e-t. You're going to see this package in multiple languages besides R; I first learned about the caret package in Python when I was studying that. Again, if you don't have it, install it real quick.

The first thing we're going to do is load the data. We'll create a "load the data" section, call it my data as always, and assign mtcars to it. Now, this is a very small data set: 32 observations with 11 variables. Typically when you run any of these models with small data sets, you're not going to get really good results, so please explore this later with a larger data set. Switch out mtcars for something different, and of course switch out your parameter of interest, the one you want to make a prediction on, as well.

First let's inspect the data. What we'll use is part of dplyr, so part of the tidyverse: sample_n, that is, sample underscore n. It's just saying, hey, give me a random sample of this data. How many do I want? Let's say I want four. Command Enter on that, or Ctrl Enter, and you'll see down at the bottom I have four random rows. That's the beauty of sample_n: it gives you random rows to inspect, because you don't want to inspect the first three or the last three or the middle three. You want random ones, because maybe some of these numbers are outrageous. So now we have the miles per gallon, we have the displacement, et cetera; we have all these different things. We've sampled it, and it looks good. You should also view your data, do a scatter plot of the features you're looking at, and so on.

The next thing we want to do is split the data into a training set and a test set. To do that we're going to set a seed. set.seed will help us reproduce this random assignment, and I'll show you how this makes a difference in just a few minutes. Setting the seed tells the computer: I want random numbers, but I want to be able to reproduce them so that I can troubleshoot. You can't troubleshoot something if you can never get the same thing twice. We'll get the same random numbers every time as long as we keep using set.seed 123, and I picked 123 completely out of my head.

Now let's create training.samples. We're going to create the samples using mtcars miles per gallon. You can do this multiple ways, but we'll just use miles per gallon like this. Command or Ctrl Shift M will give you the pipe operator, so I'm going to pipe miles per gallon into a function called createDataPartition. Let me erase that real quick: if I start typing createDataPartition, you can see the caret package in the curly braces, so that's how you can tell where a function is coming from. It makes things a lot easier when you can use packages, but it does abstract you away a little bit from what's going on, so you still have to be careful and understand what's happening.

So, createDataPartition with p equal to 0.8, or whatever number you want, and list equals FALSE. I don't want it to return a list; I want it to return a vector. If I do Command Enter on that, you can see on the right-hand side training.samples, and it's 1 through 28. If I click on it, you can see all the numbers. Because there are only 32 observations, that leaves us only four observations in our test set, so it's a terrible data set for this example, but it will give you an idea of how to do it.

So we split it randomly. If I go back to training.samples, you'll see it skips some. Let's see: 31, 30, 29, 28, and there's no 27 here. So it took rows randomly, but it will do that the same way every time because I set my seed to 123.
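The steps described so far can be sketched in R roughly as follows. This is a minimal sketch, not the exact script from the video: the object name my_data is an assumed spelling of the spoken "my data", and the number of rows selected may differ from the 28 mentioned depending on your R and caret versions.

```r
library(tidyverse)  # provides dplyr (sample_n) and the %>% pipe
library(caret)      # provides createDataPartition

# Load the data (my_data is an assumed name for the spoken "my data")
my_data <- mtcars

# Inspect four randomly chosen rows
sample_n(my_data, 4)

# Fix the seed so the random split is reproducible
set.seed(123)

# Row indices for about 80% of the data, sampled on mpg.
# With list = FALSE the result is a one-column matrix of indices, not a list.
training.samples <- my_data$mpg %>%
  createDataPartition(p = 0.8, list = FALSE)

# How many rows went into the training set
nrow(training.samples)
```

Running the block twice with the same seed should give the same indices; changing the seed changes which rows are selected.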
If I set the seed to a different number, you'll get a different random set. You'll really notice it with larger data sets, but let's continue on and create the model.

So, build model. We'll say the model is going to be a linear model, lm. We're going to look at miles per gallon, as we always do it seems, then tilde dot. That's just notation that says "based on all the features": what do you expect miles per gallon to be, based on all of these features? Then data is equal to train.data.

Actually, I didn't set that up, so let's back up. I created the random training.samples, but I didn't actually create the data sets, so let's do that first. train.data is equal to my data subsetted by training.samples, comma, all features. That's all the rows that were in training.samples (which, remember, didn't have the 27th row, so it's not going to include that one) and all the columns. train.data is now set. Then test.data: to subset the rest, you use a negative training.samples. Any row except the ones in training.samples, please pull back and create test.data, and again I want all of the columns. It's basically the complement of train.data.

Now we can go build the model based on the training data: data equals train.data, and that's our linear model. Command Enter, and the model is now set.

All right, make predictions, and let's test this bad boy out. We built the model; now we're going to predict using it. I should be using that spin notation, but I haven't memorized it well enough to do it on the fly. Spin, remember from a previous tutorial, is where you can create R Markdown documents from R scripts. We like to practice what we preach, right?

So predictions is the model piped into the predict function, and I bring in my test data. Remember the predict function, which we've used before: you bring in your model and you also bring in your test data. Now, this is dplyr; we're doing things the tidy way. When I pipe something in using the pipe operator, it just takes the place of writing predict and then bringing in the model, then my test data. All the pipe does is say: take this model and pipe it into the predict function. The predict function takes a model as its first argument, and I don't include it here because it's automatically supplied by the pipe operator. So these two lines, 25 and 26, are equivalent right now, although of course I didn't set the second one equal to predictions. It's just a slightly different notation.

In fact, there's an error: test.data not found. Did I not spell samples right? See, without you guys I would be lost. Why did that not work? T-r-a-i-n-i-n-g. Wow, okay, it's morning and I need coffee. Now predictions worked, and I have a set of predictions. You can't click on it because it's just numbers, but I can type predictions down here and it'll give me the numbers, though it's not that clean. Again, there are only four because it's 28 training and four test. So the four are right here, and this is what it predicts: the Cadillac Fleetwood would be 13, the Dodge Challenger 17. So how good did we do? I don't know yet.

Let's continue on and do a compare. This is what I like to do: I'll create a data frame with data.frame, and my actual numbers are equal to the test data miles per gallon. I'm only pulling out miles per gallon; I don't want to compare features, because the features are going to be the same. So miles per gallon is my actual, comma, then predicted equals the predictions that I just created in line 25. Command Enter on that, and now we can look at compare, the data frame, and say: this is the actual and this is the predicted, and you can see how much they're off. My actual for the Cadillac was 10.4 and I predicted 13. My actual for the Challenger was 15.5 but I predicted 17. So I'm off on all of them, but look at the Porsche 914-2: actual 26, predicted 26.51. The Volvo: 21.4 actual, 20 predicted. You can see some of the differences, and I just wanted to show you that.

Now, one way to measure this: you have to go back to statistics and understand what root mean squared error is. In a nutshell, it's basically the average difference between the actual and the predicted across all of the data, but go back and look up RMSE. In fact, we're going to use an RMSE function here. We'll call this section error collection, even though I only have one so far. I'm going to use RMSE, which is from, I believe, caret (yes, caret), and pass in predictions and test.data miles per gallon. What this is doing is taking the root mean squared error of the predicted values compared to the test values. It's similar to what we just did by eye, except RMSE is a function that returns the actual RMSE without us doing the calculations. So our error is 2.56. Let's write that down: error one equals 2.56.

What I want to show you, and what's really important here, is what happens when I rerun the whole thing. I'm going to clear everything out with this little broom, delete everything, so nothing is left. Then Ctrl A and Command Enter or Ctrl Enter, and you'll see that my error is 4.808. Let's do this one more time: select all, run, and it's the same, 4.808. No matter how many times you run that, it should be 4.808. I'd made mistakes earlier and went back and modified things, and I think that's why I got something different the first time. So let's record that correctly down here: 4.808.

Now if I run it again with a different seed, say 100, Ctrl A, Ctrl Enter, you will see that I have 3.803. If I run it again, 3.803; it doesn't change. If I change the seed to 10, or whatever number, Ctrl A, Ctrl Enter: 5.08. So what's going on here? Every time I change the seed, it's a different set of random numbers from 1 through 32, because we're splitting that set of 32.
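Put together, the split, model, prediction, comparison and error steps might look like the sketch below. It follows the names spoken in the video (train.data, test.data, model, predictions, compare); my_data and error are assumed names. The RMSE value you get may not match the figures quoted above, since it depends on which rows your R and caret versions select for a given seed.

```r
library(tidyverse)
library(caret)

my_data <- mtcars   # assumed name for the spoken "my data"

# Reproducible 80/20 split on mpg
set.seed(123)
training.samples <- my_data$mpg %>%
  createDataPartition(p = 0.8, list = FALSE)

# Training rows, and the complement as the test rows (all columns in both)
train.data <- my_data[training.samples, ]
test.data  <- my_data[-training.samples, ]

# Linear model: mpg explained by all other columns
model <- lm(mpg ~ ., data = train.data)

# Piped form; equivalent to predict(model, test.data)
predictions <- model %>% predict(test.data)

# Side-by-side view of actual and predicted mpg for the test rows
compare <- data.frame(actual = test.data$mpg, predicted = predictions)
compare

# Root mean squared error of the predictions on the test set
error <- RMSE(predictions, test.data$mpg)
error
```

caret's RMSE here is the square root of the mean squared difference between predicted and observed values, so it penalises large misses more heavily than a plain average difference would.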
So because it's pulling in different training data, and the models are built on that, you're going to get different error rates. You'll want to find the lowest error rate. I hope you understand what's going on there: every time, I'm getting a different set of training data, and that's the value of cross validation and pulling in random data. If you just took the first 28 and left the last three, you could run into trouble, because maybe they're sorted somehow, and that's not going to give you good results either. That's the idea behind it.

What I want you to do for your homework is create some sort of loop that will run maybe 100 different set.seed numbers and track the RMSEs in the loop, then compare them and find the smallest RMSE. That's the model that you probably want to use, based on the test data. Again, though, this is a very, very tiny data set, so your values are not going to make much sense, honestly. If you can find a data set that's got maybe 10,000 or 20,000 rows, that's the one you want to play with.
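One way the homework loop could be sketched is shown below. This is an illustrative attempt rather than a solution from the video: the names seeds, rmse_by_seed and results are hypothetical, and seeds 1 to 100 are an arbitrary choice.

```r
library(caret)

# Seeds to try (hypothetical choice: 1 through 100)
seeds <- 1:100

# For each seed: split, fit, predict, and record the test RMSE
rmse_by_seed <- sapply(seeds, function(s) {
  set.seed(s)
  idx <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
  train.data <- mtcars[idx, ]
  test.data  <- mtcars[-idx, ]
  model <- lm(mpg ~ ., data = train.data)
  RMSE(predict(model, test.data), test.data$mpg)
})

# Track the results and find the seed with the smallest RMSE
results <- data.frame(seed = seeds, rmse = rmse_by_seed)
results[which.min(results$rmse), ]

# The spread shows how strongly the error depends on the split
summary(results$rmse)
```

On a data set this small the RMSE will probably swing widely from seed to seed, which is the point made above about trying the exercise on a much larger data set.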
Info
Channel: CradleToGraveR
Views: 3,316
Id: IUgeH36Hn-E
Length: 14min 15sec (855 seconds)
Published: Sun Aug 16 2020