Kaggle Competition - House Prices: Advanced Regression Techniques Part 1

Video Statistics and Information

Captions
Hello all, my name is Krish Naik and welcome to my YouTube channel. Many of you have been requesting me to create a video on Kaggle competitions, so as promised yesterday, I have actually solved a Kaggle problem statement within four or five hours, completed it, and submitted it on the Kaggle website itself. I'm going to show you that problem statement. It is still in the initial stages: I need to hyperparameter-tune it more, apply a proper algorithm, and play with each and every parameter. But apart from that, the feature engineering, feature selection, and a simple model creation have already been done, and the rank is also good, which I'm just going to show you.

The best way to start is with a Kaggle competition named House Prices: Advanced Regression Techniques, which is ongoing; that is the reason why I have taken this particular example. Initially I will introduce you to Kaggle: what Kaggle is all about, how the dataset is presented, how you have to write the programs, and, after writing the programs and getting your output, how you have to submit it so that you get a ranking.

So let us begin. I'm going to take this problem statement called House Prices: Advanced Regression Techniques. First of all, make sure you log in; without logging in they will not allow you to download the dataset. After you log in, this is the description of the project that you will see. All the information is given: what is in this dataset, how the evaluation is done, on what metric they are going to score you, and what the submission file format is (I'll tell you about the submission file format shortly). There are also tutorials and a lot of information about different Kaggle competitions here.

Now the main thing: go to the Data tab, and here you'll be able to see which files the dataset has. There is test.csv, train.csv, and also a data description file if you want more information about what this dataset consists of. Remember, guys, train.csv is your training data, so you have to train your model on this data; after training, you have to predict for the test data. Then, whatever output you get from the prediction, you have to submit in sample_submission.csv. The one common thing you'll find between sample_submission.csv and test.csv is the Id, your unique identifier, which is also present in sample_submission.csv. Let me take an example and show it to you. First of all, let's see what the training data consists of, so I'll click over here. Okay, here is my training data, and you can see the whole of it.
Notice how many columns you have: the first one is the Id, and then you have many different features. About this problem statement, guys: this is House Prices: Advanced Regression Techniques, so you need to find out the price of a house based on the various features you can see here. It has literally around 81 features, many of them categorical, and there are a lot of null values; how you're going to handle that, and a lot of other things, will come up, and you need to do a lot of work here. I have taken this problem statement so that you get a complete idea of how it is done.

After looking at train.csv, you can also see test.csv. Let me show you what sample_submission.csv looks like: it has the Id from the test data itself, plus the output that you have to predict and give to them. After you make this kind of CSV file, you just have to upload it under Submit Predictions; once you upload it, you'll get a score, and based on that score you'll get a rank.

As we already saw in the evaluation section, they say they are going to use root mean squared error (RMSE). If you know about RMSE, it comes out as a decimal value, and you cannot calculate it yourself directly; all you have to do is make the submission file in that format and submit it on this webpage by clicking Submit Predictions, and only then will you be able to see your score. In the leaderboard you can see there are so many people who have got very good scores: 0.05, 0.08, 0.099. My rank is currently 2521, with a score of somewhere around 0.141, and I wrote this code in four hours, guys; understand that I still have not done hyperparameter optimization, and that will take time; it is not just four hours of work. So this was my first trial: I uploaded it and got what I think is a very good score, because around 4,384 people have participated and I was able to get to 2521 just in my first attempt. Again, this is not the first problem statement I have solved on Kaggle; I have solved many, but I'm just giving you an example. I have not applied any hyperparameter optimization; I have only done all the feature engineering work and all the feature selection.

Let me show you the code I have written. Go to the Data tab and make sure you click this Download All button; once you click it, all the files will get downloaded. Now let us start with how we can solve this particular problem statement. To begin with, I have imported numpy, pandas, matplotlib.pyplot, and seaborn. There is a reason I've done this: I'm going to do some analysis later on, but not right now.
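As a minimal sketch of that setup (assuming the downloaded train.csv and test.csv sit in the working directory):

```python
# Standard setup for the walkthrough: load the Kaggle files and peek at them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('train.csv')        # training data: 81 columns incl. SalePrice
test_df = pd.read_csv('test.csv')    # test data: same features, no SalePrice

print(df.shape)                      # (1460, 81)
print(df.head())
```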
In this initial stage, the way I solved the problem is: I handled the missing values, handled the categorical features, and decided which features I have to drop. After this stage, I know what my score is; to make it much better I'll be doing a lot more steps, which I'll show in my upcoming videos, but currently I'm showing you how I got this particular accuracy.

So here it is: I'm loading the train.csv file, and this is my dataset head. There are around 81 columns and somewhere around 1,460 records; let me just have a look. I'll make a cell above... okay, here you can see it is 1,460 records and 81 columns. This line, sns.heatmap(df.isnull(), yticklabels=False, cbar=False), basically helps you see the null values with the help of a heatmap; I'll show you the examples.

Initially we should always check how many null values are present, so I can use df.isnull().sum(). Let me execute it once again so I can show it in front of you... okay, it is giving me an error, "no attribute sum": this should be a function call, so I write it as .sum(). Once I do this, you can see there is a whole lot of features where you are getting many missing values. In LotFrontage, which is a feature, there are 259 missing values; in Alley there are 1,369 missing values; and similarly, going down, there are features like FireplaceQu, GarageType, GarageYrBlt, and GarageFinish with more and more missing values. Here you can see PoolQC, Fence, and MiscFeature have very big numbers: 1,453, 1,179, 1,406. Understand this: the total number of records we have is around 1,460, and if you take these examples, 1,453 or 1,179 or 1,406 missing values is almost everything. So for the features that have a very large number of missing values, I plan to drop the columns: PoolQC, Fence, MiscFeature, because it is not useful to keep them. I'm not dropping FireplaceQu, though; I'm only dropping features where the missing values are more than 50% of the records. I will also not drop LotFrontage, because it has just 259 missing values; I'll try to see what I can do for that.

So initially this was the problem I faced when trying to solve it, and it was not easy, guys; there were 81 features, and with 81 features it is difficult to understand directly, without any domain expert knowledge, which feature to drop and which not to drop. So I have dropped features like Alley, and the other features further down where the missing values are more than 50%; those kinds of features get deleted.
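As a concrete sketch of the null-value inspection just described (the heatmap keyword arguments are my reading of the garbled audio):

```python
# Count missing values per column; LotFrontage, Alley, FireplaceQu,
# PoolQC, Fence and MiscFeature dominate the list.
print(df.isnull().sum().sort_values(ascending=False).head(20))

# Visual check: each light cell is a missing entry.
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
```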
So after that, I'll go and look at the heatmap. You can see that in this heatmap the white lines show where the missing values are. Based on this, you can also tell me: okay, I don't require PoolQC or MiscFeature, because with so many missing values it is definitely impossible to use them for prediction directly; it is very, very difficult. So wherever I have this kind of situation, I'll be dropping those columns themselves.

Let us go ahead. Initially my shape has around 81 features. Now, LotFrontage: if I go and run df.info(), you can see that LotFrontage is a float64 value, and there is also a large number of null values there. What I have done initially, without judging anything and without understanding all the features deeply, is just take the mean: fill all the NA values with the LotFrontage mean, that same column's mean, and that is how I replaced the missing values here.

Now remember, guys, you should go feature by feature. Since we have 81 features, you will obviously get confused, so make sure you first target one feature, like LotFrontage, and then move on to the other features. That is how I have done it; otherwise it will be very, very difficult. One more thing you have to do simultaneously: handle the test data as well. You have test.csv, and remember, all the feature engineering that you do for the training data you also have to do for the test data. So simultaneously I will go and read my test.csv. This is its shape; why is the shape 80 columns? See, my training data has a shape of 81 features, and remember there is SalePrice: if I show you my dataset, at the end of the columns there is something called SalePrice. This SalePrice is the dependent feature; you need to find out this particular value, the price of the house, based on all the other features. In the test data you will not have it. And don't worry about the code, guys; I will be giving you this code on GitHub, so you can do a lot of stuff with it and try more extensive things on your own.

Now I check test_df.head(), and if I go to the last column, you see it does not have SalePrice; in the training data I had SalePrice right after SaleCondition. After that, I'll also check the null values here. See, in my previous step you could observe something: when I checked the null values in training, MSZoning did not have any null values, whereas here in my test dataset MSZoning does have some null values. This is why I'm telling you: you have to work with the test data and train data simultaneously, both of them. Now, for MSZoning, let us understand what this column is all about. If you look, MSZoning is an object type; object basically means it is a categorical feature.
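Before moving on to MSZoning, here is a short sketch of the two numeric-side fixes just described, mean imputation for LotFrontage and dropping Alley:

```python
# LotFrontage is numeric with a modest number of nulls: fill with the mean.
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

# Alley is missing in the overwhelming majority of rows: drop the column.
df.drop(['Alley'], axis=1, inplace=True)
```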
If you look at the dataset head and move over to MSZoning, you can see it has around three to four categories, like RL and so on. If you want to see them, you can just write df['MSZoning'] (type MS, capital S, and press Tab to autocomplete) and then use .value_counts(). Here you can see value_counts() gives you the four or five categories within it. So, to handle the null values here, I have taken the mode of these categories; the mode basically means the most frequent category, and you replace the nulls with that. You'll see that this is my first step after I have done the LotFrontage mean.

I have already shown you I'm going to compute the LotFrontage mean, so I execute it: I'm handling all the NaN values of the LotFrontage column by replacing the missing values with its mean. And I have df.drop(['Alley'], ...); why am I dropping the Alley column? Again, because there is such a huge number of null values in it, I'm just dropping the whole column. Make sure you do this, and simultaneously we go back to the test data and do the same there. Again, see: in the training data we did not have to do anything with MSZoning, but here we perform the mode. To do that, I'm writing test_df['MSZoning'].fillna(test_df['MSZoning'].mode()[0]); mode()[0] basically means I take the mode of that particular column, and it replaces all the NaN values.

And remember, guys, first of all I'm handling the missing values; that is the first step in feature engineering, and only then can I decide anything else. Now after that I go back to my main notebook, and you can see everything I have done. You could write one custom function where you put all of this code, but I was going feature by feature: I was inspecting feature by feature and solving the problem that way. So you will see that wherever a feature was a categorical feature, I just replaced its nulls with the mode, because I am not analyzing deeply in the initial stage. You should never stop there; you should try to understand the data, but later on. See, now I have already submitted and I know my score, so next I have to try to reduce the error, and to do that I'll start exploring each feature more. Currently I have just taken the mode for all the categorical features and replaced the nulls with it. That is why, for all these categorical features, you will find me replacing the null values with mode()[0], and if more than 50% is missing, I simply delete the feature. You can see here I'll execute everything like this directly, but make sure you go feature by feature. Again, guys, the first feature I went to was MSZoning, because it had null values; then I went to LotFrontage; you have to continue feature by feature like that.
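The mode-imputation pattern just described, as a short sketch (the looped column list is illustrative, pulled from features the walkthrough mentions; the video applies the same one-liner feature by feature):

```python
# Inspect the categories, then replace nulls with the most frequent one.
print(test_df['MSZoning'].value_counts())
test_df['MSZoning'] = test_df['MSZoning'].fillna(test_df['MSZoning'].mode()[0])

# The same pattern, repeated for other categorical columns with nulls:
for col in ['FireplaceQu', 'GarageType', 'GarageFinish']:
    df[col] = df[col].fillna(df[col].mode()[0])
```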
For Alley, where the missing values were more than 50%, I dropped it; to drop it, you can see I have written the code over here. Similarly, I am dropping PoolQC, Fence, and MiscFeature with axis=1 and inplace=True; that is done, and now you can see my shape. I'm also dropping the Id column, because Id is a unique identifier and I don't require it. Now, if I check the summation of nulls again, you can see I'm left with a very small number of null values, only in some of the features: here I have around eight records with null values, and for this other feature we also have a few; going down, I don't see any more. So I'll handle these two as well.

What I'll do is create the heatmap again to check, because with so many features I may sometimes miss a feature that has null values, like a feature that is not visible here. So I create the heatmap, and yes, you can see some null values; this color marks the nulls. To handle that, since I know the column where the nulls are present, I again take the mode (again, those are categorical features). Then I look at the heatmap again; there are still a few nulls, but you can observe that very few remain. So I do this for one more feature, and finally, after executing this line, if I click over here you'll see there is a very, very small number of null values left, around eight. So finally I can just drop those records, because there are so few nulls left for those features (I could also locate the records and use the mode similarly). I have done this, and now you can see my shape is 75 columns, and my df.head() is here.

Now see, guys, I have handled all the missing values. My handling is based on the mode for categorical features, and for numeric variables I have used the mean. Why have I done it this way? Because I just wanted to get started with the problem; I still have not done any statistical analysis. We will do that in the next part, because I know my score now and I have to make it better.

After handling the missing values, I have to handle all the categorical features, so I'll write down a comment, "categorical features", so that you'll be able to follow. There are various ways of handling categorical features. First of all, I made a note of all the categorical features, and one nice thing about this competition is that these categorical features mostly have just two to four categories at most. So what I thought is: can I write one function where I take the categorical features, apply pandas get_dummies to convert each categorical feature into dummy variables, and then append those variables directly to my dataframe?
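Before that, here is a condensed sketch of the cleanup steps from this section (the post-cleanup row count is approximate):

```python
# Drop columns with more than ~50% missing values, plus the Id identifier.
df.drop(['PoolQC', 'Fence', 'MiscFeature', 'Id'], axis=1, inplace=True)

# A handful of scattered nulls remain; drop those records outright.
df.dropna(inplace=True)
print(df.shape)   # roughly (1422, 75) after cleaning
```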
That get_dummies plan is exactly what I set out to do. For that, I started by creating a separate list of all the categorical columns; this is my list, and if I check the length, there are 39 categorical features among those 81 columns. Then I created one function which handles all these features and converts them into dummies. This is my function: after defining it, you just have to provide the list of columns, and it will return the whole dataframe concatenated with the encoded categorical features; you can see where I'm concatenating. After you do this, make a copy of your dataframe into a separate variable; that is always good, because the original values should not keep getting changed again and again.

Now I have to run this function. But before running it, I observed one more problem. All these same steps that I have done here, I also have to do for my test data. And when I was doing that and actually looked at some of the features in the test data, what was happening is that some features had categories that did not match the training data. That was a problem; let me give you an example. Suppose one of the columns, say BsmtCond [the exact column name is garbled in the recording], is a categorical feature, and in my training dataset it had just three categories, but the test data had four categories for the same field. Now understand: if I am using get_dummies with drop_first, for three categories I will create two dummy variables, whereas in the other case I'll be creating three. Since we don't know from the training dataset alone whether we have the complete set of categories, what I have done is take the test data and concatenate it with the training data, row-wise. Understand this; it is a very important point.

The reason I am concatenating is that after the concatenation (you can see where I have written the code; I'll explain it in a moment) I apply pandas get_dummies, which one-hot encodes the entire column. If I combine both the training and test datasets, I know that I have all the possible categories within each and every feature, and the count will never increase after that. Now, before combining, I have to perform on the test data all the operations I did to handle the missing values in the training data; that is why I have created this separate notebook file. You can see here I have done all those steps: I have applied the mode everywhere for the categorical features. Why the mode? Because I saw that there are only around three to four categories per feature, with some missing values, so replacing them with the mode is reasonable for now.
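Here is a condensed sketch of that encoding helper; the function name and body are reconstructed from the description, not copied from the notebook:

```python
def one_hot_multi(columns, final_df):
    """One-hot encode each listed categorical column and append the dummies."""
    dummies = []
    for field in columns:
        # drop_first=True: k categories become k-1 dummy columns
        dummies.append(pd.get_dummies(final_df[field], drop_first=True))
        final_df = final_df.drop([field], axis=1)
    # Re-attach the dummy columns to the remaining features.
    return pd.concat([final_df] + dummies, axis=1)
```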
I will have a look later and see whether I can change this or replace it with something better, but currently I've just selected the mode; and for numeric variables I have used the mean. After handling all the null values in the test data, I convert it to a CSV file, so that the cleaned test data is stored after the null handling, because I have to combine this test dataset with my training dataset so that I can apply my one-hot encoding using pandas get_dummies. After executing this, the file gets saved in the same local folder where I'm working.

Now, back in my training notebook: up to here I have already explained everything; I have taken the list of all the categorical columns and created the function which converts a categorical variable into one-hot encoding. Next, I go and read that same saved file, formulatedtest.csv. You can see my cleaned test data has 74 columns and this many rows, and this is how it looks. Now I'll combine my train data, which is in df, and my test_df, row-wise. To do that I just write pd.concat([df, test_df], axis=0), and the result is stored in final_df. When I check final_df.shape, it is somewhere around 2,881 rows, which is the combination of both training and test, and 75 columns. Remember, my test dataset doesn't have a column called SalePrice, so when I concatenate, all the test rows will have NaN values there; keep that in mind.

Now I can easily apply the function which converts all my categorical features into one-hot encoding; I just call it here and give it my list of columns. You can see for which columns it has been performed. After performing this, I now have 235 columns, created from those 75 columns, after applying one-hot encoding. The last thing I want to do is remove all the duplicate columns with this bit of code, because I don't want any duplicates: if duplicate columns exist, that basically means both are internally correlated and carry the same information, so I delete the duplicates. My final_df is now created, and you can see it has around 2,881 records and 175 columns; after encoding I had 235 columns, and now I have 175. So this is how I have converted everything into numeric features. If you want to see the encoded features, let's look at final_df.head(): if you go to the end, you'll see values like 0s and 1s everywhere.

Now I will divide this back into my training dataset and test dataset. I know how many records were in the training dataset, so I'll take that many rows for training, and the remaining rows become the test dataset.
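A sketch of the combine, encode, de-duplicate, and split steps just described (`columns` is assumed to hold the list of 39 categorical column names built earlier):

```python
# Stack train and test row-wise so get_dummies sees every category once;
# SalePrice is NaN for the test rows.
final_df = pd.concat([df, test_df], axis=0)

final_df = one_hot_multi(columns, final_df)

# Encoding can produce duplicate column labels; keep the first of each.
final_df = final_df.loc[:, ~final_df.columns.duplicated()]

# Split back by row count: the first len(df) rows are the training portion.
df_train = final_df.iloc[:len(df), :]
df_test = final_df.iloc[len(df):, :]
```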
From the test portion I'll drop the SalePrice column, because it only has NaN values there, with axis=1 and inplace=True. Now you can see the shape of my test dataset: it is 1459 by 174. Then, from the training portion, I create my X_train and y_train: X_train keeps all the columns apart from SalePrice, and y_train has only SalePrice.

Now that we have our X_train and y_train, the next thing is applying an algorithm. Here I have just selected XGBoost; I could also have selected random forest, and I do want to try random forest and XGBoost with hyperparameter optimization, but initially I just wanted to try it out, so I've used XGBRegressor from xgboost and it has executed. Under my ensemble techniques here, I was also setting up a random forest regressor, which I will do later. Once the regressor is fit, you can also save it as a pickle file, so that you don't have to train it again and again, because the training takes some amount of time. After that, you just use classifier.predict on your test dataset, and you get y_pred; this is your prediction output.

Now remember the sample_submission.csv file. What I do is: first, I convert this y_pred into a DataFrame; then I read the sample_submission.csv file and take the Id column from it. If you remember, guys, that file has an Id and a SalePrice; I don't require its SalePrice, since I have computed my own, so I just take the Id column and put my predicted SalePrice next to it. For that I write these two lines of code, and finally I write this out as the submission CSV.

After your submission file is created, you just have to go over here, click Submit Predictions, and they will ask you for the file; select your submission CSV and upload it. After uploading, write the description and hit Make Submission. As soon as you do, it will compute the score, and you will be able to see your score and your rank in the leaderboard section. Currently you see I'm getting somewhere around 2522. I am definitely sure I will be able to get into the top 100, because there are a lot of things still to do and some best practices I will definitely follow, and I'll give you the solution for how to do it in later videos. But we'll stop here for now. The best thing is that I was able to do this in just four hours, and that was the most motivating thing for me. I had to do a lot of work, you know: understand how the training data looked, how the test data looked. Remember, this becomes difficult as the number of features increases; I did not select only some specific features to perform these operations, I used all the features. So I hope you like this particular video. I'll be sharing the code along with the notebook file, which you can work with yourselves.
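Pulling the modelling and submission steps together, a sketch of the baseline described above (default XGBoost, no tuning; the pickle filename is illustrative):

```python
import pickle
import xgboost

# Features and target from the training portion; drop the all-NaN SalePrice
# column from the test portion.
X_train = df_train.drop(['SalePrice'], axis=1)
y_train = df_train['SalePrice']
df_test = df_test.drop(['SalePrice'], axis=1)

# Baseline model: an untuned XGBoost regressor.
classifier = xgboost.XGBRegressor()
classifier.fit(X_train, y_train)

# Pickle the fitted model so it need not be retrained every run.
with open('house_price_model.pkl', 'wb') as f:
    pickle.dump(classifier, f)

# Predict and assemble the submission in the required Id/SalePrice format.
y_pred = classifier.predict(df_test)
sample = pd.read_csv('sample_submission.csv')
pd.DataFrame({'Id': sample['Id'], 'SalePrice': y_pred}).to_csv(
    'sample_submission.csv', index=False)
```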
But make sure you download the dataset from here and try to upload your own submission; try to do something new with it. So I'll see you all in the next video. Have a great day ahead, thank you one and all, and have a wonderful weekend. Thank you, guys.
Info
Channel: Krish Naik
Views: 154,138
Rating: 4.9037395 out of 5
Keywords: Kaggle competition, greatlearning, Deep Learning, coursera, upgrad, appliedaicourse, machine learning
Id: vtm35gVP8JU
Length: 31min 21sec (1881 seconds)
Published: Sun Sep 08 2019