Kaggle's 30 Days Of ML (Day-11): Machine Learning Model to Predict House Prices (Intro to ML Ends)

Video Statistics and Information

Captions
Hello everyone, and welcome to my YouTube channel. Today we have day 11 of Kaggle's 30 Days of Machine Learning, and today we don't have any tutorials; as you can see, they just ask you to check out the Housing Prices Competition for Kaggle Learn Users, which is specifically for us. So we'll go take a look at the competition; our task is to create a submission, but we'll start from the competition page itself. Let's take a look at what things mean and how to join this competition.

I have already joined, otherwise you would see a Join button here: click on that, read the rules if you want to, click Accept, and then you are inside the competition. Now you can read the description. What's happening in this competition is that you have to create a model to predict house prices. You have 79 different variables describing homes in the USA, and the challenge is to predict the final price of each home. That's the competition description; all competitions are like this.

Then you have Evaluation. You're generating some predictions, but they have to be evaluated, right? And they're evaluated using some metric; here Kaggle is using root mean squared error (RMSE). You also have to submit a file, called a submission. It's a CSV file most of the time, and it should have an `Id` and a `SalePrice`, and this is for the test data.

What is training data, what is test data? We'll come to that; let's take a look at the Data page. You have `train.csv`, you have `test.csv`, you have a `data_description.txt` file that gives you a description of the data; yeah, that's what is mentioned here, a full description of each column. And you have `sample_submission.csv`, which is a submission file provided to you in the correct format, so that you know what a submission file is. And they have also given you some kind of
brief description of all the different columns you're going to be dealing with; you can see it's a lot of columns, 79 features. Here you can take a look at the data. In the training data you have the `Id` column, you have some other columns, and at the end you have `SalePrice`. So this is your training data: you have all these columns, which are features. Some may have missing values; maybe none of them do, I don't know, so we'll have to check that. And this is your `test.csv` file, which is in the same format except for the fact that it doesn't have a `SalePrice` column. So you have to build your model on `train.csv` and use it to predict on `test.csv`.

When you look at the sample submission file, first look at these ids: 1461, 1462, 1463, and here you have the same ids. So instead of submitting the whole `test.csv`, you just need to submit the correct id and the price your model predicted for that id. Great. What they have done with the sample submission is create a linear regression model on year and month of sale, square footage, and number of bedrooms; using those three or four variables they created a simple linear regression model, and that's your `sample_submission.csv`. So you could also start by creating a simple linear regression model if you want to.

Okay, so this is the competition. There's a discussion forum where you can ask questions, and there's a leaderboard where you'll see your rank when you submit to the competition; don't go chasing this top spot, somebody up there has the solution. And now we can go to our exercise: go to the tutorial, click here to get the exercise, and we just run the first cell as we have
been doing. So let's run this and see what we have to do in this competition. Here it says: some of the code you have written so far, start by running it again. This is something we have been doing for the previous few days: we read one file with pandas and try to predict the house price using `DecisionTreeRegressor`, and sometimes `RandomForestRegressor`. That's what we have to do today too, but now for the competition.

Now you have to load the data and separate the target. You load the data using `pd.read_csv`; they provide the training CSV as an input to this kernel, so you don't have to load it on your own, otherwise you would have to do that. Then `features` is a list of features they have chosen for you, and you create `X` by selecting those features from the `home_data` pandas DataFrame. If you want, you can print the head of the file, but we don't need that. Then you split into training and validation sets with a given `random_state`, as you've done before, and now you run a random forest model. Let's just run a simple `RandomForestRegressor` and see what it gives: a mean absolute error of 21,857. Okay, that's fine.

Now, the thing is, when I looked at this competition I saw in the Evaluation section that they're calculating root mean squared error, so I think that's a better number to track if you're doing it for the competition. But scikit-learn doesn't have a root mean squared error function; it has `mean_squared_error`. So I'll import that, and I'll also `import numpy as np`. Okay, now I've got `mean_squared_error` and NumPy, so I can also print the validation RMSE, and let's do it using our favorite f-strings. Let's take this from
here, remove this, and put it in: instead of `rf_val_mae` it will be `rf_val_rmse`, with `mean_squared_error`. The same expression can be copied here, but one thing we're missing is the root: if you have the mean squared error, you just take the square root of it, and NumPy's `np.sqrt` function will do that for you. Okay, now let's run it again, and now it gives me an RMSE of 31,619.

If I look at the leaderboard, with 31,000 I'd be somewhere in the thousands, right? That's another thing you should do: make your first submission, and then try to improve on it. Let me just refresh the page, because I think this is eating a lot of memory. So with 31,000... wow, you'd be near rank eight thousand if you submitted that model. But anyway, it's for learning, right?

Next we create our model again on the full data. There is a subtlety here: we checked the score on the validation set, and now we're training on the full data, so the parameters for our model may not be optimal, but let's do it anyway for the sake of learning. It's `RandomForestRegressor`, since it's a regression problem; here we fit the model on the full data, which is capital `X` and small `y`, and we should also specify `random_state=1`. Okay... it says "not callable" because you have to use `.fit`: `rf_model_on_full_data.fit(X, y)`. Okay, so now this part is done.

Now we'll make our predictions on the test data. What is our test data? First we have to read the file with pandas, so we can simply do `pd.read_csv(test_data_path)`. Earlier we created a list of features, and we have to use the same features in the same order, so we do `test_X = test_data[features]`. We don't have targets for the test data, otherwise you'd also get a zero RMSE, right? So now we can call `rf_model_on_full_data.predict(test_X)`.
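The RMSE tweak described above boils down to a couple of lines; here is a minimal sketch with made-up validation targets and predictions (the variable names follow the exercise's style, not its exact code):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical validation targets and model predictions.
val_y = np.array([200_000.0, 150_000.0, 320_000.0])
rf_val_predictions = np.array([210_000.0, 140_000.0, 300_000.0])

# scikit-learn gives us the mean squared error; take the
# square root with NumPy to get the RMSE.
rf_val_rmse = np.sqrt(mean_squared_error(val_y, rf_val_predictions))
print(f"Validation RMSE for Random Forest Model: {rf_val_rmse:,.0f}")
```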
And this should work; it should give you the test predictions. So here you've got all the predictions, and now you can check the solution.

Next you have to generate a submission file. Okay, so what does that mean? You have `test_data`, and in `test_data` you have the `Id` column, right? What they do here is create a new DataFrame from a dictionary where one column name is `Id` and the other is `SalePrice`, and this looks the same as your sample submission file. Then you save it to a CSV file without the index. So this is your test predictions, this is the `Id` from your test data (you can also write it like this if you want, it's the same), these are the test predictions you just created here, and then you save the file.

Once you've saved the file, it should appear here under `kaggle/working`; whatever you save shows up there. So I can just download this file and it saves to my PC. Now I go back to the competition, and here I see a button called Submit Predictions. I'll go there, click Submit Predictions, and here I can just drag and drop the CSV file I created; that's one of the easiest ways of doing it, but I'll also show you another way. So I've dragged and dropped the file here, and remember, my RMSE was around 31,000 on the validation data, so I'm expecting a little better than that, but I'm not sure; I'm doing it for the first time. So now I make my submission; it runs, and we got a little bit better than that, as expected. Now you can go to the leaderboard and jump to your position: 7,598, that's my rank. If I click on My Submissions here, I can see how much my model scored.

Another way of making a submission would be to click on Code inside the competition, then New Notebook, and this is going to load
the data for you; everything will come in just like before. And since we're done with our assignment for today, I'll just copy this part and paste it here. Okay... now, I forgot which cell was which... okay, I'll also copy the part for training the full model, the part for making predictions on the test set, and the part for making the submission. Now that I've done all this, I can safely close my exercise window: this was the last exercise of your Intro to Machine Learning.

Here it also says that some features will cause errors because of issues like missing values or non-numeric data types, and "here is a complete list of potential columns you might like to use that won't throw errors". So they have been very kind and provided us with some features we can directly use and experiment with. So yeah, today is the last lecture for Intro to ML, and tomorrow we move to a new course called Intermediate Machine Learning. Okay, let's run this; let's run this part first.

One more thing you should know: you have only around 40 submissions per day. If I click on this one and go to Submit Predictions, it tells me how many submissions I have left for today; it's limited because if you had a lot of submissions you could probe the scores, right? So I have 39 submissions left.

One thing I forgot to tell you is this part here about the public and private leaderboard. The public leaderboard, it says here, is calculated on only 50% of the test data; the final results will be based on the other 50%. Let's say your test data, this `test.csv`, has only 100 samples: they choose 50 of them at random and calculate your score only on those, and they show that to you; when the competition ends, they calculate the score on the other 50.
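The submission-file step copied over above is just a couple of pandas calls. A minimal sketch, with made-up ids and predictions standing in for `test_data.Id` and the model's test predictions:

```python
import pandas as pd

# Hypothetical ids and predictions, standing in for test_data.Id
# and the predictions made on test_X.
output = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [169_000.5, 187_724.2, 182_513.0],
})
# index=False: don't write the extra index column, matching the sample file.
output.to_csv("submission.csv", index=False)
print(output.head())
```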
So the score you see might not be the final score, but if you did everything correctly it will be quite close to your final score. In this notebook we have to load the data in a slightly different way, and then everything should be fine; let's just run everything to check it's all working. Good, so we train the model on the full data; let me remove this, combine these cells, plus the cell for saving the submission file, and now we can train on the full data.

So what else can we do? `RandomForestRegressor` has something called `n_estimators`, which is the number of estimators. If you haven't seen my video on random forests, I'd totally recommend you go take a look at it, and at the one on decision trees too; then you'll understand what `n_estimators` is. If I look at the help for `RandomForestRegressor`, let's see what we get: I see `n_estimators` (default 100), `max_depth`, `max_features`, okay, there it is. So let's see what happens if we change it to 200: my previous RMSE was 31,619, and now with 200 I'm getting 31,567, so a little bit of improvement. What if I do 700 estimators? If you have a lot of estimators it's going to be a little slow, and you might also end up overfitting on the validation set. So now I have 700 estimators; okay, let's just run this. I don't have to change anything here in the test part except this thing: here I can just set `n_estimators=700`, then I click Save Version and it saves the version for me. Now we wait for some time while it saves. So all we've done is take the previous code and change the value of `n_estimators` from 100 to 700.
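The `n_estimators` comparison above can be sketched as a small loop. This uses a synthetic regression dataset in place of the competition's `train.csv` (so the RMSE numbers won't match the video's), but the pattern is the same:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the competition's train.csv.
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Compare validation RMSE for a few values of n_estimators, as in the video.
scores = {}
for n in (100, 200, 700):
    model = RandomForestRegressor(n_estimators=n, random_state=1)
    model.fit(train_X, train_y)
    rmse = np.sqrt(mean_squared_error(val_y, model.predict(val_X)))
    scores[n] = rmse
    print(f"n_estimators={n}: validation RMSE {rmse:,.1f}")
```

More trees usually help a little, at the cost of training time, with diminishing returns past a few hundred.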
Now we go back and see how we perform on the leaderboard. We do see some improvement in the RMSE, so is it going to be reflected in our score? Let's see. After a few seconds or minutes you'll see the run was successful, so I can open it in the viewer and see my notebook: it ran in the Kaggle environment (just ignore the last cell, that's for later use), and here is the `submission.csv` file in the Output section. Now I can just click Submit; since the notebook is associated with the Housing Prices Competition for Kaggle Learn Users, it's going to submit to that competition automatically. It asks if you want to write something about the submission; you could write "n_estimators = 700", but let's leave it. Okay, so we improved our score by about a thousand, and if we improved our score by a thousand, how much rank did we gain? Oh, three thousand! Yeah, we improved our rank by three thousand.

Is there anything else we can improve? They have given us a list of features. Not all features are always good, but what we can do is try these features and see if they make any difference to our score. So let's convert the list into our `features` variable: I've quickly added a comma at the end of every name, closed the bracket here and opened it there, and copied it into `features`. These are our old features; let's keep them in a comment in case we need them later. So `features =` these new features; wow, so many features, right? Okay, if you want, you can format it in a nicer way, which I like to do, so now it's more readable. These are the features we have, and I think we still have some, maybe all, of the old features in here. What I'm going to do now is run it with 700 estimators, obviously, for training, and for the test part it should
be the same, and I'm going to run it on all the features we've got and see what happens. And... my notebook has failed; I got a "failed" message. Now let's look at why it failed. The error is a `ValueError: Input contains NaN, infinity or a value too large for dtype('float64')`. So they gave us the features and told us these features would work, but they didn't tell you that in the test set you can have missing values in those features. Sneaky!

What we're going to do is something very simple: replace all the missing values with a constant. This is tricky, and we'll learn how to handle missing values in more detail later, but for now there's `test_data.fillna`, which you can use to fill all the missing values, say with -1. Let's just run it, look at the kernel logs to make sure everything is working, and also check whether our validation RMSE has improved or gotten worse. It seems to have improved by quite a lot: now it's around 25,000. So now I run it on all the features and the test data, and let's see. The model is training again on the full dataset, which takes a few seconds, maybe several, and now it's done. One more thing I can show you here: you can do `output.head()` and look at the first few rows of your final submission file, just as a sanity check. So we save the version again; it's queued, it runs, and it seems to be done. We open the viewer, click on the Output, and go to Submit. Let's check, because we had some improvement locally; so now we submit, and hopefully it gives us a better rank than before. Okay, it does! Your score has improved. If I go to My Submissions I can see our first submission was 21,000, then
I had 20,000, and now I have 16,000; RMSE is an error, so lower is better. And if I go to the leaderboard now, I have my rank under one thousand.

Maybe we can try something else. This time we tried `n_estimators=700` and some of the features; we could also try all of the features if we want and then see what we get, but instead of doing that we'll do something else: we'll try another model, called gradient boosting, and see how it performs. So here I have `GradientBoostingRegressor`. This was my `rf_model`, and now I'll add a `gbm_model`; don't worry about it, you'll read about it later, but it's nice to see what difference it can or cannot make. Here you also have `n_estimators`; it's based on decision trees, just like random forest. And here I just do `gbm_model`, `gbm_model.fit`, `gbm_val_predictions`, and we can print the same things; or maybe I'll just remove the validation MAE, I don't need it. Here I have `gbm_model`, so this will be `gbm`, copy this, and so on; we didn't have to make many changes. We're still using the same set of features, and we haven't done any kind of exploration of the data, which we should; I'm leaving that task up to you, and you can also choose more features if you want.

Now let's run it and see what happens... `X is not defined`? Where did it go? Somehow `X` got deleted; `X = home_data[features]`, okay. And let's see the RMSE score from both these models: the validation RMSE of the GBM model is 22,000, while for random forest it's 25,000, so we do have some improvement.

Then one more thing we can quickly try, instead of just using one of these models: let me remove all of this (yeah, we do need this), and our final predictions will be a combination of the two, `rf_val_predictions` plus `gbm_val_predictions`. Don't worry about it; I'm just showing you how to
take the average of two different models' predictions, and we're just checking whether it changes the score. Okay, one bracket was missing; now we run it again and see what happens. Okay, so now I get 21,523, which is a little better than either model alone. So what I'm going to do is take this whole thing from here and paste it here: `rf_model` fitted on the full data gives your first set of test predictions, and your second set of predictions will come from the GBM model, which you also need to fit on the full dataset with `.fit(X, y)`; here `test_preds_2` comes from the `gbm_model`. And your final `test_preds` will be `(test_preds_1 + test_preds_2) / 2`. So this is one way of combining the output of two models: we've got RF, we've got GBM, and I hope it works. So we can just save the version, and then we wait. Okay, this is done now: 21,500. We'll just submit this and see how it scores, and with this we'll also end today's video.

If you go back to the machine learning tutorial from today and click Back to Course Home, you'll see you've done everything in the Intro section, and you can then view your certificate. And here: I got 15,334, which is an improvement on the previous scores. So we've improved our score quite a lot just by changing a few things in the model, without even touching the features, and that's what I wanted to show you today. Go dig into the data, look at the features, build some more new features, look at how to handle categorical variables, and see you on the next day. If you liked the video, do like, comment, subscribe, and tell your friends about it. See you next time, goodbye!
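The final blend walked through in the video — fill missing values with a constant, fit both models on the full training data, and average their test predictions — can be sketched roughly like this. Synthetic data and illustrative column names stand in for the competition files, and the constant-fill is the video's crude fix, not proper imputation:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-ins for the competition's train/test frames.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "LotArea": rng.randint(5_000, 20_000, 200).astype(float),
    "YearBuilt": rng.randint(1900, 2010, 200).astype(float),
})
y = 10 * X["LotArea"] + 100 * (X["YearBuilt"] - 1900) + rng.normal(0, 5_000, 200)

test_X = X.iloc[:5].copy()
test_X.iloc[0, 0] = np.nan   # simulate a missing value in the test set
test_X = test_X.fillna(-1)   # crude fix from the video: constant fill

# Fit both models on the full training data, then average their predictions.
rf = RandomForestRegressor(n_estimators=700, random_state=1).fit(X, y)
gbm = GradientBoostingRegressor(n_estimators=700, random_state=1).fit(X, y)

test_preds_1 = rf.predict(test_X)
test_preds_2 = gbm.predict(test_X)
test_preds = (test_preds_1 + test_preds_2) / 2   # simple average blend
print(test_preds)
```

A plain average is the simplest ensemble; weighted averages or stacking are the natural next steps once you trust your validation setup.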
Info
Channel: Abhishek Thakur
Views: 7,161
Keywords: machine learning, deep learning, artificial intelligence, kaggle, abhishek thakur, machine learning intro, first machine learning model, decision tree classifier, decision tree regressor, 30daysofml, kaggle 30 days of ml, random forest, gradient boosting, submitting to kaggle competition, house prices regression, combining multiple models, improving score in kaggle
Id: IsVmI7HZ9DU
Length: 30min 20sec (1820 seconds)
Published: Thu Aug 12 2021