Gucci Data Science ep. 1 || Kaggle Competition - House Prices: Advanced Regression Techniques

Video Statistics and Information

Captions
Hey Python bros, Python Bobby here, and today we're gonna be doing something just a little different: we're gonna try to solve a data science challenge, the House Prices competition on Kaggle. Stay tuned and let's figure this out together. Hell yeah. Data science. Gucci data science.

All right: House Prices, Advanced (or maybe not so advanced, if you're listening to me) Regression Techniques. Cool. So we have like 79 features or something, and it'll take way too much time to go over each one of them, but basically we have categorical data about houses, stuff like what type the street is; we have ordinal data, such as the quality of different features like the overall material and finish of the house; and we have purely quantitative data, like the square feet of the property, and so forth and so on. So we have different data types, and we have to engineer something that resembles a model that will forecast sale prices moderately accurately. I mean, let's not aim for the skies here. Like I said, this took way too much time to do, so I'm going to be doing a code review to give you the quick and dirty approach of how I solved this challenge, what I did, and what can be improved.

All right, let's start. We have a test set and a train set. We're going to be using the train set to train the model, the machine learning algorithm, and then we're going to be using the test set to forecast the data and see how we deal with the results. I won't stress too much on the first point, because these are just a lot of libraries I've imported in order to do the various operations I needed: NumPy and pandas, those two are a must; SciPy for graphs; sklearn for all the machine learning stuff you need in Python; matplotlib for other graphs; and the rest is minor miscellaneous stuff.

So let's get into the code. We have two CSVs, train and test, which we're gonna load, with rows 1 to 1,460 as our train set, and the rest as our test. Oh, I haven't run it, so you have the distinct pleasure of seeing how my code runs step by step. If we print the head, basically the first five rows, we'll see what we're working with. We have a lot of stuff: right off the bat there are missing values, and it's not properly displaying, so it looks like there are a lot of features. All right, now that we've received some head (giggity), let's do some data exploration.

In case you're wondering how you can get comments in your Jupyter notebook: you just add a new panel, switch it to Markdown, start with three hashtags (yeah, very Instagram), write some BS nobody looks at, hit enter, and it will look like that.

So, data exploration. I started with the correlation of all the features. The idea was to see if there's any multicollinearity, but honestly, what's more important, I wanted to see which factors contribute the most to the sale price. We have a target variable called SalePrice; it's what we're going to be forecasting, so let's see what impacts it the most. I'm taking the squares of the correlations because I only care about how strongly each feature impacts the price, not whether the relationship is negative or positive at this point. We see that the overall quality, the general living area, garage cars and garage area impact it a lot. Garage cars and garage area are two things that are very intercorrelated: if you have a large garage area, you're gonna be able to fit more cars.
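For reference, here's a minimal sketch of those first steps, assuming the standard Kaggle file names (train.csv with 1,460 rows and a SalePrice column, test.csv without one):

```python
import pandas as pd

# Load the two CSVs; train carries the target, test is what we'll predict on.
train = pd.read_csv("train.csv")   # rows 1..1460, includes SalePrice
test = pd.read_csv("test.csv")     # the rest, no SalePrice

print(train.head())  # first five rows: lots of features, missing values right off the bat

# Squared correlations with the target: squaring throws away the sign,
# so we only see *how strongly* each numeric feature moves the price.
corr_sq = train.select_dtypes("number").corr()["SalePrice"].pow(2)
print(corr_sq.sort_values(ascending=False).head(10))
# OverallQual, GrLivArea, GarageCars, GarageArea come out on top
```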
So that means right off the bat our model has multicollinearity, which sucks: it means it's not gonna be very accurate, because the factors are going to be fighting each other for value and for weights, and it's kind of going to mess us up. But we'll see how we deal with that a bit later.

Now let's paint a pretty picture. I'm not going to go too in depth on this function; literally, I just stole it from somebody's code, and it's used to visualize. Basically what it does is it graphs, and that's pretty much all I need to know. I'm going to feed it my target variable and the train set, and let's see what the output is. See, you see the skew here to the left, and how our data isn't a straight line: it has a belly, like the belly you have after you drink like seven beers and you're feeling a little bit bad and wondering what you're doing with your life. And there are also outliers. So our data isn't the best right now, and we can use some techniques to deal with that. Somebody once told me the world is gonna roll me, and we need a normal distribution of at least our target variable, so what we're going to do is use a logarithm (logarithm, logarithm, locker room, logarithm) to convert the data into a more normal distribution.

Now that we've done that, it's basically the same data, but kind of different, but still the same. Let's use the same graph to see what's happened, and as you can see, the data is now centered at around 12, which doesn't matter for now, but we have a normal distribution. We still have the outliers, but given that our data only has about 1,460 rows, it might not be the best idea to remove rows that can be used to actually learn and gain insights for the model. And we see that the regression can now be built a little more linearly. Seeing that, you're probably getting what I'm hinting at: we're probably going to use a regression.

So what's the rage with regressions, like, why am I going to use a regression? Now, I don't know a lot of models, or maybe I do, but they don't want to be associated with me; let's just say a number of models have been through my bed (giggity). So yeah, why a regression? Because it's literally the first thing I learned on my data science journey, and because it represents a relationship of factors to other factors through weights. What's the main challenge of regressions? They only use numeric values; you can't use categories. So you can't use "red", "purple" or "green"; you need 1, 2, 7 bajillion, and so on. And regressions are kind of easy to understand: it's easy to understand what they do, why they do it, and what they use in order to achieve it. And why make life hard, you know? Let's take it easy.

So let's address some of the data challenges that regressions come with. We're going to concatenate our train and test sets in order to have all of our data manipulations take effect on both of them, so that we don't have a mismatch in what we do, and we're gonna drop the sale price in order to not have it impacted by any of our modifications. You're gonna say, well, if we drop it, how are we gonna use it? Well, earlier in the code I stored the logarithm of the sale price in a separate variable for the model. It took me a while to understand that, but I guess that's how Python works. Python be easy, bro. Easy.
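A minimal sketch of that transform-and-concatenate step, continuing from the previous one; np.log1p is an assumption on my part, since the video only says "logarithm":

```python
import numpy as np
import pandas as pd

# Store the log-transformed target first, so dropping the column doesn't
# lose it. log1p tames the right skew and centres the distribution at ~12.
y = np.log1p(train["SalePrice"])

# One combined frame so every cleaning step hits train and test alike;
# the target is dropped so our manipulations can't touch it.
all_data = pd.concat([train.drop(columns=["SalePrice"]), test],
                     ignore_index=True)
print("SalePrice" in all_data.columns)  # False: no sale price here
```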
So we're gonna drop it, and then let's visualize: it's not here. Even if it were here, we probably still wouldn't see it, because it's hidden somewhere in the columns. If we want to check whether it's there, we can call it by calling our data and the column name. See, no SalePrice here.

All right, so, data prep. Let's deal with the missing values and the features that are in there. Analyzing the data, basically manually, I noticed that alley, the basement quality, condition and exposure, garage, and so on and so forth, all these columns had missing values. So what's unique about them? Well, this is purely my idea, so to say, maybe it's bias, I don't know; I won't say it's scientific or anything. But if there's no data about the alley, maybe there's no alley, and the same goes for the garage, pool and fence. So I'm just going to fill all those missing values with "None", so there is none. And by doing that, basically, see, alley is "None" everywhere: all of these properties, they don't have an alley.

Continuing with, you know, data prep, or data cleaning, or whatever it's called: some features consist of numbers that are actually categories, such as the overall condition and quality, or the year built. If you don't change the year built to a category, the linear regression is gonna take the year and multiply it by something to figure out the price, and that's not what we want. We don't want 2,000 times a constant in our model; I mean, it just sounds wrong. So even though years are integers in this dataset, we're going to treat them as categories, and to do that we just convert them to strings, so they're, you know, words. We're doing that for the quality columns, the years, and year sold and month sold, so basically the dates.

I also shamelessly stole a function here for building a table that shows me all the missing values in my dataset. Feel free to steal; I mean, you're just recycling something that's already been done, no need to reinvent the wheel. Here we can see all the columns that have missing values, how many values are missing, what the percentage is, and what the data type is. Now, why did I need the data type? Because if it's a float, then maybe I can take the mean; if it's an object, then maybe I can take the mode, the most common value. Yeah, I mean, fair enough.

Oh, before I continue: these things here I can just delete. They were functions I had to build because I didn't know how things were done in Python, and I thought there weren't functions to remove values from lists or binarize columns and so on and so forth. So, being the dumbass I am, I wrote them manually, by myself. Turns out, if you look long enough, there are solutions that work better and faster, and you don't have to worry about issues in your code. So guys, take my advice: don't be stupid, research. I'm pretty sure everything you want to do has been done, and somebody's going to give you a really, really easy way to do it. No need to spend eight hours writing functions.
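Continuing the sketch, here's roughly how the "missing means none" fill, the year-to-string conversion, and a stolen-style missing-values table might look; the exact column lists are my guess at what the video covers:

```python
# NaN in these columns plausibly means the house simply lacks the feature.
none_cols = ["Alley", "BsmtQual", "BsmtCond", "BsmtExposure",
             "GarageType", "GarageFinish", "GarageQual", "GarageCond",
             "PoolQC", "Fence", "MiscFeature", "FireplaceQu"]
for col in none_cols:
    all_data[col] = all_data[col].fillna("None")

# Numbers that are really categories: convert to strings so the
# regression can't multiply a year like 2000 by a coefficient.
for col in ["OverallCond", "YrSold", "MoSold", "YearBuilt", "YearRemodAdd"]:
    all_data[col] = all_data[col].astype(str)

# A small report of what's still missing: count, percentage and dtype
# (float -> maybe fill with the mean; object -> maybe fill with the mode).
def missing_table(df):
    missing = df.isnull().sum()
    table = pd.DataFrame({"missing": missing,
                          "percent": 100 * missing / len(df),
                          "dtype": df.dtypes})
    return table[table["missing"] > 0].sort_values("missing", ascending=False)

print(missing_table(all_data))
```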
So, going through the data manually, I noticed that there are some ordinal scales. What does that mean? We have values such as "excellent", "good", "typical", things that have a natural order: we can have excellent at the top, good below it, typical in the middle, and so forth. These things are called ordinal scales, and we can order them and transfer them to actual numbers. So here, basically, what I've done is taken the columns that exhibit this type of behavior and run some functions (though like I said, there are easier ways to do this): I'm removing the duplicates, reordering the values based on their natural order, and assigning a numerical value to each. I'll use that in a bit; let me just run this piece of code.

And here comes some actual data cleaning. I've gone through all the columns that are left and filled them either with zero, for the numerical ones, or with whatever value is basically their default. You can see I've left some comments saying what's typical or what's the default value, and where I don't have a default value I've put the mode, the most common value, so that my model doesn't assume or favor outliers. Now that we've spent more time cleaning than I've cleaned my room in my entire life, let's see if we have more missing values. See: no missing data. Cool. The linear regression cannot work with missing values, and that's why we needed to fill everything.

That being said, let's do some feature selection and engineering based on our domain knowledge. Here I've combined the total square feet from the square feet of the basement, the first floor and the second floor. Then I've taken the year built and the year remodeled and basically concatenated them, so I create all the permutations as categories: if it was built in the year 2000 and remodeled in 2011, then the new concatenation is "2000 2011", a new, unique category. And then you have the basement, the first floor and the second floor; I meant square feet here, not the basement, but mistakes happen. Little accidents. Happy accidents. Bob Ross, man. Yeah, OG.

The bathrooms feature is basically the number of bathrooms. Like, I don't know how many bathrooms these people need, do they go 17 times a day or whatever? But, you know, rich people. So let's run that, and that's pretty much our feature engineering, with which we're trying to eliminate some multicollinearity, because these categories are very well connected to each other.

That being said, we're gonna create some dummies, five of them, which are gonna check if the house has a pool, a second floor, a garage, a basement, and a fireplace, and that's it. We're gonna use a very simple method with a lambda, which is gonna aggregate the data for us, and we're gonna create the new categories. Pretty cool.

Now that we have most of the dirty work done, it's time to actually build the model. But the linear regression cannot use categorical values, so what are we going to do? We're going to create dummies: all of these categories are going to be converted to dummies with the pandas get_dummies function. Side note: I actually wrote my own code for this the first time, because I didn't know that function existed, and boy, was that a mistake. Basically, for each category it creates a column and puts 1 or 0 in it based on whether the property has that feature, and I think it's going to be easier if I just show you. In the final features at the end, you're going to see things like: was it built in 2009 and remodeled in 2009? No. See, these are just dummies, they're either one or zero. The concatenation of the year built just exploded the model with categories: we have like 12,000 columns, but we'll deal with that in a bit.
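Here's a compact sketch of the ordinal mapping, the engineered features, the lambda dummies, and the one-hot encoding; the quality map and the engineered column names (TotalSF, YrBltAndRemod, TotalBathrooms, Has*) are my own stand-ins for what the video builds:

```python
# Ordinal scales: quality ratings with a natural order become numbers.
qual_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for col in ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond",
            "HeatingQC", "KitchenQual", "FireplaceQu"]:
    all_data[col] = all_data[col].map(qual_map)

# Total square feet = basement + first floor + second floor.
all_data["TotalSF"] = (all_data["TotalBsmtSF"].fillna(0)
                       + all_data["1stFlrSF"] + all_data["2ndFlrSF"])

# "Built in 2000, remodeled in 2011" becomes one category: "2000_2011".
all_data["YrBltAndRemod"] = (all_data["YearBuilt"].astype(str) + "_"
                             + all_data["YearRemodAdd"].astype(str))

# Count the bathrooms (half baths count for half).
all_data["TotalBathrooms"] = (all_data["FullBath"].fillna(0)
                              + 0.5 * all_data["HalfBath"].fillna(0)
                              + all_data["BsmtFullBath"].fillna(0)
                              + 0.5 * all_data["BsmtHalfBath"].fillna(0))

# Has-it-or-not dummies via a lambda, as in the video.
for src, dst in [("PoolArea", "HasPool"), ("2ndFlrSF", "Has2ndFloor"),
                 ("GarageArea", "HasGarage"), ("TotalBsmtSF", "HasBsmt"),
                 ("Fireplaces", "HasFireplace")]:
    all_data[dst] = all_data[src].fillna(0).apply(lambda x: int(x > 0))

# One-hot encode everything categorical that's left.
# (The video fills per-column defaults/modes first; fillna(0) is a blunt stand-in.)
final_features = pd.get_dummies(all_data).fillna(0)
print(final_features.shape)  # thousands of columns after the year concat
```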
At this point we don't really need the index, so we're just gonna go ahead and drop it, and we're at the point where it's actually worth splitting the data into our train and our test. X is gonna be our train: we take the final features up to row 1,460 (sorry, I misspoke there), and our test is going to be, you know, the rest. We're going to use X to train, and then we're going to use the trained model on the test set to forecast values that we don't have. It's important to say we actually have the sale price for the train set, but we don't for the test, so all of our validation will be at the end, when we upload our result to Kaggle.

I've effectively created a new problem; that's what I'm good at, creating new problems, not solving them. The problem is that I have too many columns, and I don't know which of them actually contribute to the price and which don't. This is where PCA comes into play. What's PCA? Principal component analysis; it basically shows which components account for the most variance in the data. You can learn a bit more about it yourself, because I have a very vague understanding of what it does and just stole some piece of code to use it. Like I said, we don't need to reinvent the wheel, we just need to be able to achieve what we want to do.

So here we're gonna import some stuff (like I said, literally stealing this). Why do we scale? Because PCA wants us to scale. And we're gonna fit X. Basically what we're saying is: we have 12,013 components, and we're gonna fit them in the PCA model. We're gonna get the explained variance, which tells us how much of the variance is explained by each component, and then we're gonna plot it. And I think this is where we're finally getting somewhere; this is part of the model. As you can see, and you probably know the Pareto principle, eighty percent of the variance is explained by twenty percent of the features, and vice versa. We don't really need like twelve thousand features; we're probably gonna get around 250.
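A sketch of that PCA step as described: scale first, fit on the training rows, plot the cumulative explained variance, and read off roughly where 80% is reached:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = final_features.iloc[:1460]          # training rows only

# PCA wants standardized inputs, so scale before fitting.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
pca_full = PCA().fit(X_scaled)

# Cumulative explained variance: the Pareto-style elbow the video plots.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(cumulative)
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()

print(np.argmax(cumulative >= 0.80) + 1)  # around 250 components in the video
```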
That's why I've put it here: the PCA analysis is showing me that around 250 components are going to give me what the price of the property is made of, and we're going to be fairly happy with that, because this took a shitload of time to get to work; this was the devil, seriously. Anyway, around 250 components make up 80% of the variance, and what I'm gonna do with the next piece of code is basically say: take the number of components that make up that variance and just give me a new dataset from them. See, I've taken the loadings, the ones, positive or negative, with the highest variance, created a new column, and told my variance matrix to basically give me the columns that contribute the most to the sale price. And here we have them; there are quite a lot, they end with the years, and we also have the types. Pretty cool. So now we have our features: X_new is our new data, see, it has 250 columns, and they're only the most necessary ones.

We're re-splitting the data, because I was kind of too trigger-happy on splitting it before; turns out we needed some more manipulations. Now we have our train and our test set, and it's time to build the linear regression. Oh yeah, finally: regress me, daddy. All day, every day, I'm regressing more and more. Just some corny jokes.

So we're gonna split our dataset into four pieces: basically, sixty percent is gonna be our train, and the other forty percent is gonna be our test, and this is only on the train data. Why? So we can kind of cross-validate whether our model is doing well before we actually test it on the data we technically haven't seen. You see the shapes are the same, so we can actually use it for the model.

I'm going to take a small break here and tell you: I first used a plain linear regression, which plotted the data quite nicely, but it turns out I missed a lot of the multicollinearity. One of the ways you can deal with that is through a ridge regression. It's the same as linear regression, but basically it limits how much the model can be influenced by any single value, so it deals with the multicollinearity, in a lazy way. I'm sure we can improve on that, but let's just get the job done, right? So why is my alpha 20? Well, because Google told me 20, and that's the extent of my knowledge.
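The video selects original columns via the component loadings; a simpler stand-in in the same spirit is to project onto the first 250 components, then do the 60/40 split and fit the ridge (alpha=20 straight from the video):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Project onto the 250 strongest components (a simplification of the
# loading-based column selection described above).
pca = PCA(n_components=250).fit(X_scaled)
X_new = pca.transform(X_scaled)
print(X_new.shape)  # (1460, 250)

# 60/40 split on the *training* data only, so we can sanity-check the
# model before touching the real Kaggle test set.
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(
    X_new, y, train_size=0.6, random_state=42)

# Ridge = linear regression plus a penalty that shrinks the weights,
# which is the lazy way to blunt the multicollinearity.
model = Ridge(alpha=20)
model.fit(X_train_t, y_train_t)
preds = model.predict(X_test_t)  # predictions are still log prices here
```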
So here we're literally just building the model with the sklearn library. We have our X train and y train from the split, and we're going to do some predictions now, to see what we did, based on the training data. As you can see, we've ordered the values ascending, and we kind of have a good basis for a regression. Now, right off the bat, you can see that these values here, and these, and this one, and this one, are going to mess up our model. These things either should have been removed, or there should be a way to deal with them, but honestly, I don't know how, so at this point I'm gonna shrug and say, I guess we're gonna have to take that into consideration. Let's see how the test split of our train data did, and as you can see, pretty good: I mean, you can draw a straight line through it, so we can do a regression. Pretty cool, pretty cool.

Now that we know we did something, let's evaluate the model. We're going to compute a bunch of stuff, but basically this is what we're going to concentrate on, because it's what Kaggle uses to score the model, and we have around 0.17, you know, inaccuracy, so our model is "83% accurate", which, honestly, come on, come on, come on, that's pretty good. Come on.

So let's predict the values: here we do the prediction on our test set, which is what we were actually aiming for up until now. And let's see what we got... oh no, these are not prices, these are logarithm values. We'll need to convert them back before we submit, because they're still transformed. So here we create a small data frame so that we can export the sale values, and here we do another graph to see what we predicted versus what was in the train set. We can see that we're kind of within what our model had at the start; in our model we had values that are pretty high and some that are kind of low, and we missed basically all of those. So this graph shows you where your model is not the best, but it also shows you the funnel of what you predicted and how it compares to the data you had.

All that's left is to reverse the logarithm on our predicted values, which is done with an exponential. Bam. Assign IDs, and then we export to CSV; let's give it a more unique name. And here's what we'd submit.
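For completeness, a sketch of the scoring and export steps: Kaggle scores this competition on RMSE of the log price, and expm1 undoes the earlier log1p before writing the submission (the file name is just an example):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# RMSE on log prices: ~0.17 on the held-out split in the video.
rmse = np.sqrt(mean_squared_error(y_test_t, preds))
print(f"log-RMSE on held-out train data: {rmse:.4f}")

# Transform the real test rows with the *same* scaler and PCA fitted on
# train, predict, and reverse the logarithm with an exponential.
X_kaggle = pca.transform(scaler.transform(final_features.iloc[1460:]))
submission = pd.DataFrame({
    "Id": test["Id"].values,
    "SalePrice": np.expm1(model.predict(X_kaggle)),
})
submission.to_csv("gucci_submission.csv", index=False)
```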
So let's do it. We're back where we began, on Kaggle's website, so let's submit our result and see how it did. Bam, it's going to take literally a second, and once I make my submission: bam, our score is 0.16. Our model had 0.17 on the train data, so our predicted values are actually a little better than what our model did there. This could be due to the small sample size, or maybe because of the ratio I used for the test and train split on the training data. But nevertheless, we've got a score, and we actually managed to solve the challenge.

So how did we do? Our result of around 0.163 lands us somewhere around 3,600th place, and there are like 5,000 submissions, so that puts us around, come on, come on, the 69th percentile of people who solved this case. I mean, that's not too bad, come on. Definitely not last, and I don't have an error that large, so I managed to do something, you know? That's all right. And if we scroll to the top, we'll see that there are people with models that are literally, oh my god, like a thousand times better than mine. So there are things to improve: I could have done feature selection better, I could have cross-validated better, I could have eliminated multicollinearity better, I could have used better models, and I'm sure I could have done a better job overall, which I will do: I'm sure I'll be able to do a better job in the future, and I'll come back to this example as part of the growth.

And that's kind of what I wanted to end on, you know: focus on growth, focus on competitions and challenges, push yourself to do more and to do better, and don't worry too much about the result. It's all about experience, and I feel like I'm growing on my journey towards data science, and I'm sure you will in time too. I'm Python Bobby; as usual, remember to say gucci, and today I'm happy you are with me on my journey towards gucci data science. Hell yeah, Python bros. That's it for me. Stay gucci, bros. Stay gucci.
Info
Channel: PythonBobby
Views: 1,592
Rating: 4.826087 out of 5
Keywords: Kaggle, Competition, House, Advanced, Regression, Techniques, reddit, learn python 3, Learn Python with Bobby, learn python for data science, machine learning, deep learning, data science, kaggle competition, kaggle competition solution, kaggle competition tutorial, kaggle competition live, house price prediction, house price prediction kaggle solution, Learn Python, learn python in 10 minutes, python, python tutorial, python programming
Id: sOlg5AYk4uA
Length: 31min 19sec (1879 seconds)
Published: Wed Aug 12 2020