Kaggle Top1% Solution: Predicting Housing Prices in Moscow

Video Statistics and Information

Captions
So what's the point of this analysis, what's the value to the bank? Hopefully to mitigate risk. If they could predict housing prices, they would get a better indication of the risk they're taking on when they issue a mortgage. They don't want to lend eight hundred thousand against a house they think is worth a million dollars and then find out it's only worth six hundred thousand, so that all of a sudden they're underwater. That's one of the reasons this would be a great model for them, and it would also give a better indication of how their portfolio of other loans is doing.

One of the issues is prediction: are we actually predicting prices? With this analysis we have the fortune of hindsight. We're using indicators that aren't available at the time, for instance GDP; the GDP numbers don't come out until about ninety days later, so it's tough to say you could use them for prediction. I think we're more estimating prices than actually predicting them, so I don't know that the bank could actually use this analysis to predict housing prices. But anyway.

So, the data. Moscow has about a dozen different districts, and we were fortunate to be able to match the sub-areas up to each of the districts. This is just showing housing prices, which everybody has seen before, with a log transformation; the one-million outliers I ended up just taking out. This is price per square meter. It's a really nice chart for a time series, but it still just shows you the distribution of price per square meter. It does look kind of normally distributed, but that doesn't mean much, because it encompasses different areas that have different prices per square meter.

This is observations by district. You can see we have a lot in what we'll call Novo, and in both the train and the test sets Novo sticks out. It's funny, because Novo is actually one of the lower-priced districts. Novo and Troitsky are kind of outside of main Moscow, and if you just look at the median prices, those two weigh them down as you go further out.

Another thing: missing data. Novo has the majority of the missing data. These plots are on the same scale - this is all missing data and this is just Novo - so you can see that most of the missing data is in that one area. You'll see later on there's another reason that kind of helps explain it. Missing data in Novo again: this is all districts and this is Novo, for the more important variables like life square; the data just isn't there in that area. One of the reasons I found is that Novo is more owner-occupied than the rest of the areas, so most of the transactions are owner-occupier rather than investment; as an investment spot it's not very hot, less than three percent. If you look at a lot of the data, you have ones for number of rooms, you have ones for kitchen square meters, and those problems are heavily concentrated in the owner-occupier transactions. I think the owner-occupier data just isn't there; they didn't record it, they're just not as stringent, or whatever. Another interesting thing is that Novomoskovsky wasn't incorporated as a district until 2012, so maybe that has something to do with it, I don't know.
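As an aside, here is a minimal sketch of the kind of exploration described above: log-transforming the price and counting missing values by sub-area. It is not the speakers' code; the column and file names follow the Sberbank "Russian Housing Market" competition data, and the mapping of sub-areas onto the dozen districts is not reproduced here.

```python
# Hedged sketch of the exploration above: log-transform the target, compute
# price per square meter, and count missing values by sub-area. Column names
# (price_doc, full_sq, sub_area, ...) come from the Sberbank competition CSV.
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["timestamp"])

# Log transformation of the sale price (handles the heavy right skew)
train["log_price"] = np.log1p(train["price_doc"])

# Price per square meter, the unit used throughout the talk
train["price_per_sqm"] = train["price_doc"] / train["full_sq"].replace(0, np.nan)

# Missing values in the "more important" columns, broken out by sub-area
important = ["life_sq", "kitch_sq", "num_room", "max_floor", "state"]
missing_by_area = (
    train[important].isna().sum(axis=1)
    .groupby(train["sub_area"]).sum()
    .sort_values(ascending=False)
)
print(missing_by_area.head(10))  # in the talk, the Novomoskovsky sub-areas dominate
```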
State would have been a great variable to have, because it tells you how good the apartment is, but it turns out it's only usable for the investment transactions. For investments you can see a spread across the different state values, but when you look at just the owner-occupied transactions it's heavily ones. Does that mean all of the owner-occupied places just aren't as good? I don't think so; I think the data is just bad, which kind of throws out the state variable for the owner-occupied side. It's just one of the issues. A lot of the inconsistencies come back to this fundamental thing: for owner-occupied transactions the data just isn't there.

As everybody knows, this is our training data and this is our test data. You can see that leading right up to the end of our training data, oil prices drop, and so do housing prices. I scoured the internet for this: it's the average, or median, square-meter price for Moscow residential property. I was going to use it as a way to adjust prices, but I was told I couldn't, because it's too fundamental to the answer - it's against the Kaggle rules - so I didn't go with it.

These are the prices by sub-region. Central is right around the Kremlin, downtown; those are the most expensive. Down here you have Zelenograd - the one that starts with a Z - which is the cheapest, and then "outside" consists of Novo and Troitsky, which are both at the low end. I wanted to use this too: if I just predicted on the test data with the price per square meter at that time, I scored about a 0.48 on Kaggle with nothing else. I don't know if that's good or bad, but it's something.

Okay, so looking at housing prices compared to oil: these are both indexed at one at the start of our training data, and you see oil drops and then housing prices drop. If you just looked at the median housing price in our training data, you'd see it drop right here in 2012, but that's largely because there's a spike in the Novo transactions. Housing prices didn't drop; it was just the composition of our training data. The same thing here: it looks different just because you see it drop again when Novo jumps back up.

Alright, so feature engineering. I had this idea that maybe I could get at neighborhoods, so I looked at the number of unique observations, and I saw a bunch that had 12,000. It turns out all of these have the same number of observations and they're all in the same sub-area, which made it seem like I could group them into neighborhoods; something like 12,000 of the observations ended up in a neighborhood of five or more. So this is a more localized way you could do KNN, and it's what I did for KNN on some of the missing variables: I tried to impute the data based off whatever the neighborhood said. What led me to find this is that I was looking at the rows that had floor greater than max floor, and I was looking at the Kremlin distance, and I noticed there's a bunch of max floors of seventeen in the Kremlin neighborhood group, while ours with a floor of seventeen had a max floor of like five - that's kind of odd. So anyway, I think these are neighborhoods, and if I had more time I would drill down on that.

More feature engineering: I looked at kitchen size as a share of the living space. This is the unclean data; again there are a lot of ones, and cases where the kitchen size was the size of the entire house, which didn't make any sense, so I imputed those with the median ratio, which was about 0.3. This is the unadjusted version, and this is the cleaned version.
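Here is a hedged sketch of what that neighborhood-style imputation and the kitchen-ratio cleanup might look like. The talk does not spell out the recipe, so the columns used to define a "neighborhood" group and the cleaning thresholds below are illustrative assumptions.

```python
# Hedged sketch of the neighborhood-style imputation and the kitchen-ratio
# cleanup described above. The columns chosen to define a "neighborhood"
# (rows sharing a sub-area plus identical area-level values) and the cleaning
# thresholds are illustrative guesses, not the exact recipe from the talk.
import pandas as pd

def add_neighborhood(df: pd.DataFrame) -> pd.DataFrame:
    # Treat rows that share the same sub-area and the same values on a couple
    # of area-level columns as one "neighborhood" group (hypothetical choice).
    area_cols = ["sub_area", "raion_popul", "green_zone_part"]
    out = df.copy()
    out["neighborhood"] = out.groupby(area_cols, dropna=False).ngroup()
    return out

def impute_by_neighborhood(df: pd.DataFrame, cols) -> pd.DataFrame:
    out = df.copy()
    for col in cols:
        # Fill with the neighborhood median first, then the sub-area median,
        # then the overall median as a last resort.
        out[col] = out[col].fillna(out.groupby("neighborhood")[col].transform("median"))
        out[col] = out[col].fillna(out.groupby("sub_area")[col].transform("median"))
        out[col] = out[col].fillna(out[col].median())
    return out

def clean_kitchen_ratio(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    ratio = out["kitch_sq"] / out["full_sq"]
    # A kitchen recorded as 1, or as large as the whole apartment, makes no
    # sense; rebuild kitch_sq from the median ratio (about 0.3 in the talk).
    bad = (ratio >= 1) | (out["kitch_sq"] <= 1)
    median_ratio = ratio[~bad].median()
    out.loc[bad, "kitch_sq"] = median_ratio * out.loc[bad, "full_sq"]
    return out

# Possible usage, reusing the train frame from the previous sketch:
# train = add_neighborhood(train)
# train = impute_by_neighborhood(train, ["life_sq", "kitch_sq", "num_room"])
# train = clean_kitchen_ratio(train)
```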
Then life square to full square: again you see some of the wrong data - or maybe it's not wrong, we just don't know. If life square equals full square, is that wrong, or does it just mean the apartment is entirely usable space? So we looked at that variable, and we also looked at floor to max floor, to figure out how high up the apartment is in the building. It turns out none of those are really that useful for prediction.

Alright, for features we used XGBoost to find the most important ones. It's cut off here, but we used the top forty, I think, from this. Then KISS - keep it simple, stupid. I wanted to look at the simplest model we could think of, so I did a two-variable multiple linear regression using just full square meters and sub-area. I don't know if that still counts as two variables once sub-area is dummified, but anyway: the R-squared was 0.57, and when I submitted to Kaggle I got a 0.38, I think, which for two variables is not bad. Then we added some more variables; the R-squared didn't improve much, but our Kaggle score did - we ended up getting a 0.346 with those six or seven variables. And with that I'll pass it over.

Okay, so - the plumbers, because I'm going to talk about pipelining a little bit. Who here works in Python? Who has done multiple linear regression in Python? You need to apply a pipeline if you're going to use any kind of multiple linear regression or support vector machine, because you need to scale your data, especially if you're using a lot of different variables with different ranges. What I'm going to talk about is information leakage. If you have information leakage, that's bad; if you have personal leakage, see a doctor - that was my joke.

This is an example of improper processing. On the top you see what happens for cross-validation, and on the bottom is the test-set prediction. You typically split your data into a training and a test set; you throw the test set away, you leave it alone, you're not going to use it until you predict. Then you split your training data into training and validation folds. However, if you scale your data first on the entire training set, you're putting information from the validation fold - which is your test set for cross-validation - into the scaler. Then you fit, say, a support vector classifier on the training folds and predict on the validation fold; that's your cross-validation. Then you go to your test-set prediction: scaler fit on your training data, then support vector classifier fit and predict. This is bad. And this is the example of proper processing: the validation fold is no longer used in your scaling. Information leakage is vitally important because you can start to see correlations that aren't really true, that don't really exist.

So I did a quick example of this in Python - bear with me, there's a lot of code here, so I'll slow it down. You can see that we scale our data according to the training set, we transform both the test and the training data accordingly, and then we do a grid search over different alpha (lambda) parameters using Ridge regression, searching over the parameter grid with five cross-validation folds.
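The code on the slide is not reproduced in the captions; the following is a minimal reconstruction of that "improper" workflow - scaler fit on the whole training set, then GridSearchCV with Ridge over five folds. The random features are a stand-in for the roughly twelve deliberately non-linear features the speaker used.

```python
# Reconstruction of the "improper" workflow described above: the scaler is fit
# on the whole training set before cross-validation, so every validation fold
# has already influenced the scaling (information leakage). Random features
# stand in for the ~12 deliberately non-linear features from the talk.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(500, 12)
y = rng.rand(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The leak: scale the entire training set up front, then cross-validate on it
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}   # the alpha/lambda search
grid = GridSearchCV(Ridge(), param_grid, cv=5)    # five CV folds, as in the talk
grid.fit(X_train_scaled, y_train)                 # the folds see pre-scaled data

print(grid.best_params_)
print(grid.score(X_train_scaled, y_train), grid.score(X_test_scaled, y_test))
```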
You can see here the best parameters - you get an alpha of one - and here's the R-squared on the train and test sets: actually not too bad. Or is it? I used about twelve features in this, and none of them showed any linear behavior - I did that on purpose, I didn't want any features with linear behavior - yet for multiple linear regression, or Ridge regression, I'm seeing a positive correlation. That's bad.

Here is the example with proper pipelining. You use make_pipeline from sklearn.pipeline, which applies the StandardScaler and then Ridge, and you apply the parameter grid to that, so the validation fold is no longer taken into your scaling. What you can see here is that we get the same best parameters - which is not always the case; when I first did this with about forty features, those parameters actually did change - and we no longer have any kind of correlation. The linear model does not fit the data, which is what I expected, so that makes sense. [Audience question about the negative R-squared] Yes, you actually can have a negative R-squared score - look it up on scikit-learn; it just means your model is bad.

Because of that we went away from multiple linear regression and moved on to tree-based models, because we thought that made more sense. We went through the simplest one first: decision trees. Simple decision trees are known to overfit if you leave them unrestrained, so first we looked at an unrestrained decision tree. On the training set the R-squared is 100%, which is what you expect unrestrained, and on the test set it's 22% - an example of unrestrained overfitting. Then we did GridSearchCV. The advantage of decision trees is that you don't have to scale your data, so the pipelining is no longer an issue. We did a parameter search; these are the top three models, all giving around the same R-squared validation score. Plug that in, and the R-squared on the training set is 62% and on the test set 66%, so we're no longer overfitting our decision tree. Kind of cool. I'm going to keep going - I've got two more slides.

Okay, then we went on to random forest, which is known to overfit less. Again using the same features as our decision tree, we ran a random forest unrestrained, with the default parameters - I think that's ten estimators, ten trees - and you see that an unrestrained random forest can also overfit. Then we put it into GridSearchCV. On the x-axis is the number of trees we use in the random forest, as a function of max features, and you can see the R-squareds accordingly. Our best one was 700 trees - so not the maximum - with eleven max features, meaning it can consider up to eleven features at a split, and that gives us a much less overfit random forest. However, this type of grid search is computationally expensive: it took eight hours to run with three of my cores processing. So yeah, that's why you need some serious computing power.

Time to get boosted: XGBoost. This is what everybody does - it's a big black box, and we didn't really try to open it up, but underneath there are nice diamonds. We used the top forty-three features, we removed the timestamp because it made our Kaggle score worse, and we used price per square meter as the predictor. It's fast, it's nice for Kaggle, but it's not very interpretable, so it is what it is - and we got a score of 0.31312.
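For reference, here is a runnable sketch of the make_pipeline pattern described above, using the same stand-in data as the leaky version; it is an illustration of the pattern, not the slide's exact code. The ridge__alpha key is how GridSearchCV addresses a parameter of a step inside the pipeline.

```python
# Sketch of the proper pipelining described earlier: the scaler lives inside
# the pipeline, so GridSearchCV re-fits it on the training folds only and the
# validation fold never leaks into the scaling.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Same stand-in data as the leaky sketch above
rng = np.random.RandomState(0)
X = rng.rand(500, 12)
y = rng.rand(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}  # step-prefixed parameter name

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_train, y_train), grid.score(X_test, y_test))  # R^2; can be negative
```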
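And a sketch of the kind of random forest grid search described a moment ago. The winning values - 700 trees and eleven max features - and the three cores are from the talk; the rest of the grid, the crude numeric-only preprocessing, and the target handling are assumptions for illustration.

```python
# Sketch of the random forest grid search described above. Only n_estimators=700,
# max_features=11, and the three cores come from the talk; everything else here
# is a placeholder rather than the team's actual setup.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

train = pd.read_csv("train.csv")                      # Sberbank competition file
y = train["price_doc"]
X = (train.drop(columns=["price_doc", "timestamp", "id"])
          .select_dtypes("number")
          .fillna(-1))                                # crude stand-in for the real cleaning

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500, 700, 900],        # 700 was the best in the talk
    "max_features": [5, 8, 11, 14],                   # 11 was the best in the talk
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
    n_jobs=3,                                          # the "three cores" mentioned
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```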
Okay, we're close to time - questions?

[On whether R-squared can be negative] So there are different definitions. I had someone question me on this before, and he sent over the scikit-learn documentation showing that R-squared can be negative - for instance, if you fit a linear model to data that isn't linear, scikit-learn can report a negative score. Can R-squared really be negative? Check out the scikit-learn documentation and you'll find that it can; it just means your model is bad. There are different definitions of R-squared; I didn't know that at first either. So there's some homework: learn about information leakage. Look it up - you don't want it. That's my main point.

[On scaling] You need to scale your data accordingly - you can do min-max, or standard scaling, you know, subtract the mean and divide by the standard deviation - just to get everything onto the same scale. But the order in which you do it is important. You can't just scale all of your training data up front; you have to do it within the cross-validation. If you're splitting your data into folds, you don't want the validation fold included in your scaling, just as we don't transform our test data based on the test data. Does that make sense? There's this code here: you do your scaler fit on the training data, and then you transform both the training and test data according to that fit. You wouldn't do a scaler-dot-fit on the test set - otherwise you're changing the shape of your data.

[On a chart label] This right here is the R-squared - I should say R-squared, not accuracy, my bad.

[On how the XGBoost features were chosen - a bit of a chicken-and-egg] When I started doing the tree-based models I used the clean data I was given from my group and put in the top twenty features, and I was satisfied with that for the decision trees and random forests. Then I looked into XGBoost and just started adding more and more features and saw the score get better and better. So it was data-science driven, but also Kaggle driven; our main goal, though, was to truly understand the process.

[On the information-leakage demo] What I did there is I didn't even touch the test data set from the later years. I did a train-test split on the training data and called that split my test data, so I never touched the real test data. The whole purpose of that part was not to do well on Kaggle but to demonstrate the principle behind information leakage. What we're trying to show is that when you do cross-validation you should only be fitting on the training folds and not the validation fold. That's why I think the pipeline package is so important.
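To pin down that last answer about scaling order, a minimal sketch with generic stand-in arrays: fit the scaler on the training data only, then transform both sets with that same fit.

```python
# The scaling order described in the answer above: fit on the training data
# only, then transform both training and test data with that same fit.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in arrays; in the project these would be the training and test feature matrices
rng = np.random.RandomState(0)
X_train, X_test = rng.rand(100, 5), rng.rand(50, 5)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)         # reuse that fit; never call scaler.fit on the test set
```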
Info
Channel: NYC Data Science Academy
Views: 16,371
Rating: 4.8381505 out of 5
Keywords: data science, machine learning, kaggle competition, kaggle solution, top 1%, prediction, XGBoost
Id: W530d2ZdbJE
Length: 25min 46sec (1546 seconds)
Published: Mon Jul 31 2017