Predict housing prices in Austin TX with tidymodels and xgboost

Captions
Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. Today in this screencast we're going to walk through the data set from this week's episode of Sliced, the semifinals (the semifinals during which I was eliminated), but it was an interesting data set on predicting home prices in Austin using lots of different information about the real estate listings. During the episode I started using some of the text information to create a higher/lower price indicator based on the text, and I did not get it done; R crashed on me and I made some mistakes. So in this screencast I'm walking through the model I didn't quite get finished and seeing what kind of results I get. Let's get started.

Okay, let's walk through this data set from the semifinals of Sliced. I've got the training data here, and yes, that is what I saw earlier as well. This is, like I mentioned, a data set of real estate listings in Austin, Texas. We've got information on where each home is, whether it has a spa, when it was built, how big the lot is, and the public schools in the area, and what we're going to predict is the price range. It is not a numeric value; it is a price that has been binned. Is it between zero and $250,000? Between $250,000 and $350,000, and so on? Or is it above $650,000? The bins were picked so they're fairly even, although there are more listings in the middle and fewer at the ends, which you might expect. So this is a multiclass classification challenge, not a regression problem.

Before we get started on the model, let's do a little exploratory visualization. We can take price_range and use the parse_number() function from readr to force it to be a number; it finds the first number in each string, and if we want we can add $100,000 so it sits roughly in the middle of each range. Then we can plot longitude against latitude with z = price_range and use one of my favorite little functions in ggplot2, stat_summary_hex(), to make a hex map. We'll make it a little transparent, use more bins than the default, and use scale_fill_viridis_c(), which is the right scale for a continuous fill. What we get is a nice little map of Austin. Let's bump the bins up to 50.
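Here is a minimal sketch of that first map; the file name train.csv and the column names price_range, latitude, and longitude are assumptions about how the Sliced data is laid out.

```r
library(tidyverse)

# Assumed filename and column names for the Sliced Austin housing data
train_raw <- read_csv("train.csv")

price_plot <- train_raw %>%
  mutate(price_range = parse_number(price_range) + 100000) %>%  # shift toward the middle of each bin
  ggplot(aes(longitude, latitude, z = price_range)) +
  stat_summary_hex(alpha = 0.8, bins = 50) +
  scale_fill_viridis_c() +
  labs(fill = "mean price", title = "Price")

price_plot
```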
That's maybe too many bins; there we go, I like that. We can also change the labels. The fill is the mean, which is the default in stat_summary_hex(); we could pass another argument to change it to, say, the median, though in this case that wouldn't do much because we've got these binned numbers. And we can set the title to "Price". So what we have now is a map of price across these real estate listings in Austin: latitude and longitude, with expensive houses here and less expensive houses over in the east and a little to the south. This distribution suggests the spatial information is going to be really important for the model.

Let's save this plot object and then make a few more maps of the same kind. I'm going to write a function, plot_austin(), that takes a variable and a title. I'll take the plotting code and drop it in; we don't need the parse_number() part because we're not plotting price, and this is where the variable goes. We're going to use the embrace operator from tidy evaluation, {{ }}, to say that whatever is passed in needs to be handled with tidy eval, and the other thing we pass in is the title, as a plain string.

For example, now we can say plot_austin() with... what else do we have in there? Let's remind ourselves. I bet average school rating is pretty interesting, so let's plot that with the title "School rating". You can see this is much smoother, probably because instead of individual houses we're looking at clumps: not school districts exactly, but the areas that schools are zoned for. And if you compare the price map with the school rating map, they're clearly related; school rating is surely going to matter for price.

Let's do a couple more. We'll load the patchwork package and do something like price_plot plus that one, which puts those two on top, and then a couple more: year built, which will probably have a different distribution, and lot size (lotSizeSqFt), with the title "Lot size". That's looking pretty good. Oh, it looks like there are some really enormous lots; those are probably not correct. I mean, it's possible, I guess, but I bet those really enormous lots aren't right. While we're here, let's just take the log of lot size. All right, this looks pretty interesting to me.
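And a sketch of the plot_austin() helper plus the patchwork layout; avgSchoolRating, yearBuilt, and lotSizeSqFt are assumed column names, and price_plot is the object saved above.

```r
library(patchwork)

plot_austin <- function(var, title) {
  train_raw %>%
    ggplot(aes(longitude, latitude, z = {{ var }})) +  # embrace operator for tidy eval
    stat_summary_hex(alpha = 0.8, bins = 50) +
    scale_fill_viridis_c() +
    labs(fill = "mean", title = title)
}

# Two-by-two grid of maps via patchwork
(price_plot + plot_austin(avgSchoolRating, "School rating")) /
  (plot_austin(yearBuilt, "Year built") +
     plot_austin(log(lotSizeSqFt), "Lot size (log)"))
```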
Okay, so price and school rating are super correlated: good schools are where the expensive houses are. Year built is interesting because it's different: the old homes are in the center of the city, and as you move farther out you see newer homes, so instead of the east-west gradient we saw for price, we've got this in-out shape. And then for lot size we're affected by these really small lots, which may be condos and townhomes, or maybe bad data to be honest, but we have a gradient from smaller lots over here to larger ones over there; on the west side there are really big lots, and maybe some problems with the data. I'm not entirely sure, but we could explore that more to learn about it.

These kinds of relationships are what we want to use to build a model to predict price. Remember, the binned price is the thing we want to predict, and we want to do it with this other information: what the schools are like, how big the lot is, when it was built, and so on.

Let's get going. One thing that's in here: if we do slice_sample(n = 5) and select the description column, we have the descriptions from the real estate listings. They tell us things like the address, "custom modern home", "charming remodeled home", "master downstairs", "perfect opportunity to get into a highly desired neighborhood", "stunning inside and out". During Sliced I attempted to use these but sadly ran out of time; R actually crashed on me because I did something that made it angry. So what I want to walk through here is how we can find the words that are most associated with price and then use those to create a dummy or indicator variable. We could use this text data directly, tokenize it, and use the tokens as features, but instead I want to show how to use a separate analysis to identify words I'm most interested in and then use that to create a small number of features, just one or two. This is something I don't think I've shown before, and it's an approach we often might want to take.

To find these words, let's take the raw data, do the same thing we did before to get a numeric center for each bin, tokenize and transform the description text into a tidy, one-word-per-row format, and remove the stop words. Let's call this austin_tidy and look at the most common words. So: home, kitchen, room, austin, new, large, and the counts of bedrooms and bathrooms. Those would be some of the most common tokens if we treated the text directly as features, but instead let's find which words change with price. We'll take the top hundred or so words, doing that same count with sort = TRUE, but I don't want those "1, 2, 3, 4" kinds of words because the number of bedrooms and bathrooms is already in the data set, so let's filter those out. Then we take the 100 most frequent words and call them top_words. With that filtering, these are the top words used in the real estate descriptions in these Austin listings: bathrooms, park, access, entertaining, counters, countertops, neighborhood, et cetera.
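A sketch of that tokenizing and filtering with tidytext; the description column name and the exact digit filter are assumptions.

```r
library(tidytext)

austin_tidy <- train_raw %>%
  mutate(price_range = parse_number(price_range) + 100000) %>%
  unnest_tokens(word, description) %>%      # one row per word per listing
  anti_join(get_stopwords(), by = "word")   # drop stop words like "the", "and"

# Most common words overall
austin_tidy %>%
  count(word, sort = TRUE)

top_words <- austin_tidy %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% as.character(1:5)) %>%  # bedroom/bathroom counts are already features
  slice_max(n, n = 100) %>%
  pull(word)
```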
Now let's count word by price_range, i.e. how often each word is used in each price bin, and complete() in case some combinations are missing, filling n with zero. Then let's group by price_range and compute a price_total, the total number of words used for that price bin, and a proportion, n divided by price_total, and filter so we only keep the words in top_words, since those are the only ones I want to analyze further. Let's call these word_freqs. What does that look like? For the word "access", in each of these price range bins, we have how many times it was used, how many words were used in total to describe homes in that bin, and what proportion of those words were "access"; then the same for "appliances", and so on.

What we can do with this is train a set of little linear models. Let's nest everything except word, and then train a model per word: we'll map over that nested data with purrr's map() and fit a generalized linear model. On the left-hand side goes the number of successes out of price_total, so it's binomial: these are the successes and these the failures, or the total possibilities, and we model how that depends on price_range. Does the proportion of successes depend on price_range? That's what we're modeling. Then I have to get the dots right, since I have a lot of parentheses going, pass the data, and say family = binomial. After a couple of false starts, great: this trains all the models, and it's fast because they're just little linear models. Then I can map tidy() over the models so I have little data frames instead of glm objects, unnest them, and keep only the slopes; I don't care about the intercepts. I trained about a hundred models there, so it's a good idea to adjust those p-values; at 100 models maybe it doesn't make a huge difference, but it's getting to be a lot. Let's arrange by estimate and call this word_mods, for word models.

So I've trained a hundred little models, and these are the words with the biggest estimates, the ones most associated with high price: outdoor, custom, pool, office, suite, gorgeous. If I arrange the other way: carpet, paint, close, flooring, shopping are the words most associated with low price. This is pretty interesting: for a cheaper house the listing says, hey, it's got new carpet and paint, it's close to shopping, whereas for expensive houses you don't say those things; you talk about the pool and how gorgeous it is, and the custom outdoor kitchen, or whatever it might be; I'm not sure.
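Roughly, the per-word models might look like this; note I'm writing the binomial outcome as cbind(n, price_total - n), i.e. successes and failures, which matches the "n out of price_total" framing.

```r
library(broom)

word_freqs <- austin_tidy %>%
  count(word, price_range) %>%
  complete(word, price_range, fill = list(n = 0)) %>%
  group_by(price_range) %>%
  mutate(price_total = sum(n),             # total words used in this price bin
         proportion = n / price_total) %>%
  ungroup() %>%
  filter(word %in% top_words)

word_mods <- word_freqs %>%
  nest(data = -word) %>%
  mutate(model = map(data,
                     ~ glm(cbind(n, price_total - n) ~ price_range,
                           data = .x, family = binomial())),
         model = map(model, tidy)) %>%
  unnest(model) %>%
  filter(term == "price_range") %>%        # keep the slopes, not the intercepts
  mutate(p.value = p.adjust(p.value)) %>%  # adjust for ~100 models
  arrange(-estimate)
```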
Okay, let's make a visualization of this. We're going to make something similar to a volcano plot, if you're familiar with those: the estimate, which is the effect size, goes on the x-axis and the p-value goes on the y-axis. Let's put a dashed line at zero and then add the points in a color I like. Plots like this usually have the y-axis on the log scale, for exactly the reason we're seeing here, and with that this is the shape you get if you've looked at volcano plots before: positive values go with high price, negative values with lower price. Let's pop the words on with geom_text_repel() from ggrepel, with label = word, and make the fonts match what I was already using. They're overlapping a lot; there are too many of them, especially zoomed in this big. But over here we see the words associated with high price and over there the words associated with low price, and it's just very interesting to me how we can see these differences. These are what we're going to try to use in our machine learning model for predicting price: we've used this supporting analysis to identify them, and now we'll incorporate them into the model.

So let's make a vector of higher words: take word_mods, apply a threshold on the p-value, take the top 12 by estimate, and pull out word. And let's do the same for lower words, flipping the sign, so we get words associated with high price and words associated with low price. The lower words are probably describing townhomes and condos, I'm guessing; you only mention new tile in an inexpensive house, which is very interesting.

We can look directly at these changes if we want. For example, let's take word_freqs, filter to the higher words, put price_range on the x-axis and proportion on the y-axis, color by word, and make the lines fairly thick since we only have as many points as we have price bins. Add geom_line(), facet by word with scales = "free_y" so we can see each one, format the x-axis with dollar labels from the scales package and the y-axis as percentages, and, since it's sometimes nice to see where zero is, give them all a zero baseline. Okay, let's take a look. This is pretty interesting: expensive houses are more likely to talk about how many car garages they have (three-car, maybe four or more), custom, gorgeous, pool; "outdoor" increases a lot; see how these all increase.
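A sketch of the volcano-style plot, the two word vectors, and the faceted look at the higher-price words; the 0.05 p-value cutoff is my stand-in for the threshold mentioned above.

```r
library(ggrepel)

word_mods %>%
  ggplot(aes(estimate, p.value)) +
  geom_vline(xintercept = 0, lty = 2, alpha = 0.7) +
  geom_point(color = "midnightblue", alpha = 0.8) +   # any color you like
  scale_y_log10() +
  geom_text_repel(aes(label = word))

higher_words <- word_mods %>%
  filter(p.value < 0.05) %>%
  slice_max(estimate, n = 12) %>%
  pull(word)

lower_words <- word_mods %>%
  filter(p.value < 0.05) %>%
  slice_min(estimate, n = 12) %>%
  pull(word)

# How the high-price words change across the price bins
word_freqs %>%
  filter(word %in% higher_words) %>%
  ggplot(aes(price_range, proportion, color = word)) +
  geom_line(linewidth = 1.5, alpha = 0.8, show.legend = FALSE) +  # facets already label the words
  facet_wrap(vars(word), scales = "free_y") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  labs(x = NULL, y = "proportion of words in listing descriptions")
```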
Now let's look at the lower words, and these tell just the opposite story: carpet, easy, minutes, location. It's interesting to me that the inexpensive houses are asserting how important location is, because it's so clear from the maps how important location is to price. These less expensive homes are asserting in their descriptions how great their location is, how easy and close everything is and how few minutes it takes to get where you need to go, but obviously they're in locations that are less expensive. Pretty interesting, if you ask me.

Okay, so we did it: higher_words and lower_words are what we're going to use in the model. Now let's get started on our modeling. This is going to be a somewhat longer screencast because I did a bit of ahead-of-time analysis to drive the feature engineering. Let's load tidymodels. I'm going to take that training data, and I'm not going to use the city column; I can show you with a quick count() that it isn't useful. While I'm here I'm also going to lowercase the description; I could do that in the recipe as part of the feature engineering, but I'll just do it here for convenience. Then I'm going to do a stratified split. Let's call it austin_split: I'm first splitting between testing and training, and I can use this split to create a training set and a testing set. Let's also create some resampling folds, cross-validation folds, from the training data. Because there's quite a lot of data here, as you've probably noticed, and it's already going to take a while to train, as I experienced during the Sliced episode this past week, I'm not going to create ten folds; I'm just going to do five, not the full ten. Let's call the result austin_folds.
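A sketch of spending the data budget; dropping city and lowercasing description happen here, as described, and converting price_range to a factor for classification is my addition.

```r
library(tidymodels)

set.seed(123)
austin_split <- train_raw %>%
  select(-city) %>%                                     # essentially constant, not useful
  mutate(description = str_to_lower(description),      # lowercase here instead of in the recipe
         price_range = factor(price_range)) %>%        # classification outcome as a factor
  initial_split(strata = price_range)

austin_train <- training(austin_split)
austin_test  <- testing(austin_split)

set.seed(234)
austin_folds <- vfold_cv(austin_train, v = 5, strata = price_range)
```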
All right, so: training set, testing set, resampling folds. Think of this as spending your data budget. You have a certain amount of data and you have to decide how to budget it: you allocate some for training and some for testing, and then you use the training data to create simulated training and testing sets, which we call analysis and assessment sets, to be able to choose between models, tune models, and so on.

Now it's time for the feature engineering part of this model. I'm going to use the glue_collapse() function from glue on that higher_words vector, because that collapses it into something I can use as a regex pattern; let's call that the higher price pattern, and do the same with lower_words for the lower pattern. I'm going to use these in the recipe, so let's start the recipe.

I'm going to predict price_range with basically everything, the whole shebang; just throw it all in, stir it up, and see what we get. Looking at what's in the training data, I'm going to update the role of the uid column because it's not a predictor; you can call the new role anything, and if it's not a predictor or an outcome it won't be used by the model in a workflow. The next thing is the feature engineering step called step_regex() (some people might pronounce "regex" differently). It creates a new dummy or indicator variable based on a regular expression; if you look at the examples in the documentation, about rock or ground cover, it says: if you see the pattern "rock" or "stony", make a new yes/no variable called rocks. That's how this works, and we're going to do it twice: step_regex() on description with pattern = higher_pat, calling the result high_price_words, and then the whole thing again for low_price_words. After that I don't want the description anymore, so I'll remove it; I'm not going to tokenize and weight by tf-idf and all that kind of business, and I'm not treating each token as a feature. I'm just identifying whether a listing has the high price words and whether it has the low price words, and then dropping the text.

Looking at the training data, we've got one categorical variable, home type, and the rest are mostly numeric (the logical columns will just get turned into zeros and ones). Let's do step_novel() and step_unknown() on home type because, and I forgot to show this, there is a little bit of missing data in home type. Then step_other() on home type, with the threshold bumped up, then step_dummy() on all nominal predictors just to catch everything (I don't think it will do more to the has-spa column than turn it into zeros and ones), with one_hot = TRUE because that can sometimes help in tree-based models, and finally a near-zero-variance filter on all predictors. We'll call this austin_rec; let's run it. There we go.
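A sketch of the recipe; uid, homeType, and the step_other() threshold are assumptions about the data, and the two word vectors come from the earlier sketch.

```r
library(glue)

higher_pat <- glue_collapse(higher_words, sep = "|")
lower_pat  <- glue_collapse(lower_words, sep = "|")

austin_rec <-
  recipe(price_range ~ ., data = austin_train) %>%
  update_role(uid, new_role = "id") %>%                 # keep the id around but don't model with it
  step_regex(description, pattern = higher_pat,
             result = "high_price_words") %>%
  step_regex(description, pattern = lower_pat,
             result = "low_price_words") %>%
  step_rm(description) %>%
  step_novel(homeType) %>%
  step_unknown(homeType) %>%                            # handle the small amount of missing data
  step_other(homeType, threshold = 0.1) %>%             # lump rare home types; exact threshold is a guess
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_nzv(all_predictors())

austin_rec
```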
Okay, so our feature engineering is all set; now let's combine it with a model. Once again I'm using xgboost; xgboost tends to perform well in a situation like Sliced. What I'm about to show is, I think, what I was attempting at the end of the episode: a pretty heavily tuned model. I'm going to tune the tree depth, tune min_n, tune mtry, tune the sample size, and tune the learning rate, so I'll tune a lot of things. In some of my other recent screencasts and blog posts I show ways to tune faster if you want to get to a pretty good result quickly, but this is a let's-do-it-very-thoroughly approach, so we're going to tune a whole bunch of parameters. Now let me put these pieces together: let's call it xgb_word_wf, take a workflow(), and add the recipe and the model specification. This workflow is now ready to be tuned.
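A sketch of the tunable xgboost specification and the workflow; the 1,000 trees matches the number mentioned later when we look at variable importance.

```r
xgb_spec <-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    mtry = tune(),
    sample_size = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_word_wf <-
  workflow() %>%
  add_recipe(austin_rec) %>%
  add_model(xgb_spec)

xgb_word_wf
```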
We do have a couple of options when it comes to what grid we tune over, and this time I'm going to show how to create a custom grid, because I actually want to do that. I could just say grid = 20 and it would automatically find 20 possible parameter combinations to try, but here I'm going to specify something. Instead of tree depth going from 1 to 15, which is the default, I'm saying don't go all the way down to 1; I don't want to try stubs, because I at least know from what happened on Sliced that little tree stumps are not going to do me any good. Let's make these trees bigger and more complex: start at 5, and I think 10 would be a good upper bound; I can double-check by prepping the recipe and looking at what we have, since mtry says how many of the columns we want to try. Then for the sampling proportion I don't need to go down to 0.1; let's only go down to 0.5. I'm just not exploring the space as much, which is cheating a little bit because I know something about what will work well from what happened on Sliced, but to be honest that is what happens a lot anyway. So this is a grid where I'm saying try 20 things; let's save it as xgb_grid. With grid_max_entropy() I'm saying: try to cover this space in an efficient way without doing a regular grid, which often works well for this kind of problem.

Now let's get started. I'm going to use finetune again, which has racing methods. The reason is that some of these parameter combinations are not going to turn out well, and I want to throw them away quickly rather than keep evaluating them forever. I'll save the results of tune_race_anova(): I pass in my workflow, my resampling folds, my grid, and some metrics for deciding whether a configuration is good or bad (this particular challenge used multiclass log loss), plus a little bit of logging so it tells me what it eliminated as it goes. This is going to take a little while even with my parallel processing running: for each of the 20 candidate parameter sets I have five resamples, so we're potentially training a hundred xgboost models right here. It won't actually train all 100, because it stops and throws some away when they turn out badly, but it will keep going for a while, so let's pause and come back when it's done.

All right, it's finished. The logging we see here is what we get because of that verbose elimination argument: it tells us that by the time it got to this fold it had eliminated half of the candidates, and by this fold it had eliminated some more. We can view that race by plotting it: these are the parameter sets that are obviously bad, and what tune_race_anova() says is, don't keep going with those, they're no good; only keep evaluating the ones that look promising, which are the ones down here. We can keep working with these results using functions like show_best(), which shows the resampled results on the training set and the parameters that produced them, plus select_best() and so on. That's what I'll do next: take the tunable word workflow and finalize it with the best option by log loss from that result.
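And a sketch of the custom grid plus the race; the exact parameter ranges are my reading of the reasoning above, so treat them as placeholders.

```r
library(finetune)

set.seed(123)
xgb_grid <-
  grid_max_entropy(
    tree_depth(c(5L, 10L)),                   # no tiny stumps
    min_n(),
    mtry(c(5L, 10L)),                          # how many columns to sample; a guess
    sample_size = sample_prop(c(0.5, 1.0)),    # don't explore very small subsamples
    learn_rate(),
    size = 20
  )

doParallel::registerDoParallel()

set.seed(345)
xgb_word_rs <-
  tune_race_anova(
    xgb_word_wf,
    resamples = austin_folds,
    grid = xgb_grid,
    metrics = metric_set(mn_log_loss),         # the Sliced metric: multiclass log loss
    control = control_race(verbose_elim = TRUE)
  )

plot_race(xgb_word_rs)
show_best(xgb_word_rs, metric = "mn_log_loss")
```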
Then I'm going to use last_fit() to fit one time to the training data and evaluate one time on the testing data. This gives us a last-fit object: it updates the tuned workflow, fits one last time on the training data, and evaluates once on the testing data. So after fitting many, many xgboost models during tuning, we fit one final time.

This object contains a couple of interesting things. It contains predictions, and the predictions here are not on the training set, they're on the testing set. If I collect them, what I get is predictions on the test set: we have a predicted probability of landing in each one of these price bins, so I can compute the log loss from the true price range and those predicted probabilities. This is the multiclass log loss on the testing set, which is the closest thing to what's on the Kaggle leaderboard, and it's a little better than what I got during Sliced. It would have been nice to do even better, but for one model this is not bad; pretty good.

Let's explore a little more so we can see what's driving the results. We can compute a confusion matrix from the hard class predictions, and we can also call autoplot() on it, and this is where we start to understand some interesting things about these classes. Notice how right the model is in the highest price bin: it had a pretty easy time predicting the most expensive houses, probably because of where they are, but maybe also because of other things like schools. As we move toward the cheaper bins it becomes harder, and we do a worse and worse job of predicting the price bin; it's harder to get those other bins right. We can see the same thing with an ROC curve: roc_curve(), and rather than autoplot(), which puts the classes on separate panels, let's plot 1 - specificity against sensitivity, color by .level, use geom_path(), add coord_equal(), and set the color legend label to NULL. Wow, that's pretty interesting: the fact that the highest price bin is so much farther into the corner shows again how much easier it is to identify the expensive houses than the less expensive ones.

Let's do one more thing and look at variable importance. I'm going to extract the workflow from that last-fit object, which is a trained workflow, extract the parsnip fit from it, and then compute variable importance using the vip package. This is model-based variable importance, based on all the trees in this model.
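A sketch of the final fit and the evaluation; the probability column names depend on the actual price-range labels, so they are selected programmatically here rather than typed out.

```r
library(vip)

xgb_last <-
  xgb_word_wf %>%
  finalize_workflow(select_best(xgb_word_rs, metric = "mn_log_loss")) %>%
  last_fit(austin_split)

test_preds <- collect_predictions(xgb_last)

# Multiclass log loss on the held-out test set
prob_cols <- setdiff(grep("^\\.pred_", names(test_preds), value = TRUE), ".pred_class")
test_preds %>%
  mn_log_loss(price_range, all_of(prob_cols))

# Confusion matrix on the hard class predictions
test_preds %>%
  conf_mat(price_range, .pred_class) %>%
  autoplot()

# One-vs-all ROC curves, one color per price bin
test_preds %>%
  roc_curve(price_range, all_of(prob_cols)) %>%
  ggplot(aes(1 - specificity, sensitivity, color = .level)) +
  geom_path(linewidth = 1.2, alpha = 0.8) +
  coord_equal() +
  labs(color = NULL)

# Model-based variable importance from the trained xgboost fit
extract_workflow(xgb_last) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point")
```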
It goes through and computes the importance over all of those thousand trees, and this is super interesting: spatial information, lot size, schools, year built, and as we move down, notice that high_price_words is not in here at all. The model is not using the high price words; I bet those high priced homes are just pretty easy to find, so the model doesn't need that information. The model did use the low price words to try to do a better job of finding the lower priced homes, and that's probably why I'm doing a little better here than in the models I trained during Sliced, where I didn't get this working and had my troubles. Notice that low_price_words is about as important as whether it's a single-family home versus a condo or townhome, a little more important than that, and more important than whether it has a spa; so it's on that order, and less important than the variables at the top. I actually think this is really interesting: it shows how models like these have the freedom to choose which features to give more or less weight to and to assign these different relative importances.

I think it was pretty interesting how the model used the lower price words but not the higher price words, identifying the categories that are more difficult to find in this multiclass classification challenge. This model did perform better than the ones I trained during the episode of Sliced, which did not have this text information incorporated; only a little better, but a little better. If you wanted to make a model that performed better still, some other things to try might be balancing the data set, maybe using upsampling so the model can do a better job of recognizing all of the different price categories, and of course ensembling: instead of using one model, using several and weighting them. I hope this was helpful, and I'll see you next time!
Info
Channel: Julia Silge
Views: 8,233
Rating: 4.9370079 out of 5
Id: 1LEW8APSOJo
Length: 51min 53sec (3113 seconds)
Published: Sun Aug 15 2021