Create a custom metric with tidymodels and NYC Airbnb prices

Video Statistics and Information

Captions
Hi, my name is Julia Silge, and I'm a data scientist and software engineer at RStudio. Today in this screencast we're going to use this week's dataset from SLICED, the competitive data science show on Kaggle, to highlight two main things. The first is how to create a custom metric in tidymodels. Yardstick, the package in tidymodels that helps you evaluate the models you're building, has lots of different metrics built in, but it's also extensible, so it lets you build custom metrics that you may need. On SLICED this week, the models were being evaluated on root mean squared log error, because what was being predicted is the price of Airbnb listings in New York City, and those prices are distributed in a very log-normal kind of way: lots of low to medium prices and just a few very high prices. That's not a metric that's in tidymodels by default, but we can show how to make one. You first make a vector representation of the metric, and then you make a method for a data frame, and we'll walk through how to do that. The other thing I want to highlight is how to combine structured data, like numeric and categorical predictors, with unstructured data like text, so that you can use them together in the predictions you want to make; in this case, that's the name of the Airbnb listings. So let's take all that together and walk through how to build a model to predict prices of Airbnb listings.

OK, let's get started modeling this dataset of Airbnb prices. I've got the training dataset here in my working directory (I hope... "train"? Yep, there we go), so I'm going to call it train_raw. Like I said in the intro, this is the dataset from this week's SLICED episode, and I thought it was pretty interesting for a couple of reasons. So let's get started first with a little bit of exploratory data analysis and visualization.
The thing being predicted here is this price variable: the nightly price of these Airbnbs in New York City. Notice that it is super skewed; there are just a few listings that are very, very expensive, but lots down here at lower prices. This is really common in a lot of different domains we might work in, so it's really good to know how to handle it in a lot of different ways. Notice also that there are probably some zeros in there, although honestly I've never seen an Airbnb with a price of zero. Let me stretch this out so we can look at it a little better... yeah, there are some zero-price Airbnbs in there, which is a little strange, but OK; some people will just let you come and stay, isn't that nice of them? There's also a borough variable in train_raw; I think it's called neighbourhood_group. Let's set alpha = 0.5 and make fewer bins; these histograms are all on top of each other now, so let's use position = "identity". You can see how much more expensive the Manhattan listings are than, say, Brooklyn: Brooklyn is where you look if you want less expensive, and Manhattan has the fancier ones, which I think makes sense. We can also see the difference by count; there are far fewer in Queens. If what you actually want to see is density rather than count, we can do that too, which can be useful as well; both are useful, just different views. Let's just clean this up a little more and say fill = NULL and x = "price per night".
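A minimal sketch of that histogram, assuming the Kaggle column names `price` and `neighbourhood_group`; the log x scale and exact bin/alpha settings are my reconstruction, not taken verbatim from the screencast:

```r
library(tidyverse)

train_raw %>%
  ggplot(aes(price, fill = neighbourhood_group)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 20) +
  scale_x_log10(labels = scales::dollar_format()) +
  labs(fill = NULL, x = "price per night")
```

Swapping `aes(y = after_stat(density))` into `geom_histogram()` gives the density view she compares against the count view; note that the log scale silently drops the zero-price listings.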
I think we could do either count or density there, and both of those are useful and helpful, but maybe I'll take the density out, because the count gives us a helpful view of what's in this modeling dataset. We definitely have this logarithmic characteristic to the price that we're going to be modeling, and I think we should take that into consideration both in the modeling and in the metric: if you go to this link, you'll see that the metric they're evaluating on is a root mean squared log error, so we'll make a custom metric for that.

Let's also make a map; it's always fun to make a map of things like this. We have longitude and latitude information, so we can start with a nice basic map: color = log(price), geom_point (a lot of points here), and scale_color_viridis_c() — this one is for continuous color; the others are for different options. This will take a moment... oh no, actually that was really fast. So here's New York, I love it: Manhattan, Brooklyn, you can see the park, very nice. Let's copy and paste and change this a little: instead of color, let's use z = log(price), and change it from points to stat_summary_hex(). I don't exactly remember how it works; looking at the docs for stat_summary_hex, I think we send in a function, like mean, and if we send nothing it'll just take the mean, and that seems like it will work pretty well. And I think we have to change from color to one of the other aesthetics, because it's doing fill now. Let's see if that works... that did not work; it did not do the fill, it just used the default blue color. So what have I done wrong?
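The two maps being built here come out roughly like this; a sketch assuming the column names `longitude`, `latitude`, and `price` (the bin count, alpha, and `+ 1` inside the log are my guesses, and `stat_summary_hex()` needs the hexbin package installed):

```r
library(tidyverse)

# Point map: one point per listing, colored by log price
train_raw %>%
  ggplot(aes(longitude, latitude, color = log(price + 1))) +
  geom_point(alpha = 0.2) +
  scale_color_viridis_c()

# Hexagonal summary map: mean log price per hex bin
train_raw %>%
  ggplot(aes(longitude, latitude, z = log(price + 1))) +
  stat_summary_hex(alpha = 0.8, bins = 70) +
  scale_fill_viridis_c() +
  labs(fill = "Mean price (log)")
```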
Let me check: stat_summary_hex, scale_color... wow, it didn't do the viridis color. What if I do this? No — that one, I think, is for discrete data, and it's not doing it at all. Maybe I have to give it a function like mean or median... oh, come on. Oh, silly, silly: not scale_color, scale_fill! There we go. I like this, but I can't see much, so I can probably take the function out, and let's make the bins smaller so we can see more of the city. That's looking pretty good. Then labs(fill = "Median price (log)")... actually that's not median, that's mean, because I didn't give it any function. This is looking beautiful and I love it: you can still see the park, you can see areas where there aren't very many Airbnbs, and then these areas of very high Airbnb pricing, areas of low pricing, and areas where there aren't any. This is super information-dense and I like it a lot.

All right, now let's move on and build a model, and then we'll talk about how to evaluate a model with a custom metric. Let's load tidymodels. First I'm going to split, because I want to use some testing data. The metric they evaluated on involves the log scale, and I want to train my model on the log scale too: I don't want to train on price at its original scaling, because it is so skewed. So I'm going to mutate(price = log(price)), then do initial_split() with strata = price, so the stratification will be on the log scale. Let's call this nyc_split, and then I'll make nyc_train.
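Collecting the splitting steps here and just below into one sketch; the `+ 1` inside the log anticipates a fix she makes later when the zero prices cause an error, and the seeds, fold count, and object names are assumptions:

```r
library(tidymodels)

set.seed(123)
nyc_split <- train_raw %>%
  mutate(price = log(price + 1)) %>%   # train on the log scale; +1 handles zero prices
  initial_split(strata = price)        # stratify on (log) price

nyc_train <- training(nyc_split)
nyc_test  <- testing(nyc_split)

set.seed(234)
nyc_folds <- vfold_cv(nyc_train, v = 5, strata = price)  # 5 folds instead of the default 10
```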
We'll do the same thing to make some testing data, and then let's make some folds: vfold_cv() on the training data. I'm not going to do all 10 folds, just so this doesn't run forever; if I were training a model to put into production or something, I'd probably do all 10, but for going faster, this is still a lot.

OK, so now let's talk about feature engineering. What's in this data? We've talked about some of the geographical information; let's look at what else is here. We have some text information in name — let's take a look at what we've got: "nice new bedrooms", "two bedroom apartment", "small tidy bedroom and duplex". We've got some IDs, host names, neighbourhood_group (which, as I already showed you, is the borough), and neighbourhood — let's see how many there are of those. In the training set I made there are 212, and some of them have very small counts, so there might be even more in the test set. Then latitude and longitude, which I used to make those maps; room_type, of which there are three; price, which remember is now on a log scale; minimum_nights; number_of_reviews; reviews_per_month; how many listings the host has; and how many nights a year it's available. So let's start putting together some feature engineering. I'm going to do a recipe with price as the outcome — this is just a demo — so: definitely latitude and longitude; neighbourhood, not the borough (and yes, it's spelled as if I were British, even though this is New York City); room_type; minimum_nights; number_of_reviews; that availability variable; and I'm going to try to learn something from the name. What I'm going to demo here is how to combine rectangular, tabular information with unstructured text information that I will turn into rectangular information.
OK, and the data here is nyc_train. So here's what I'm going to do: step_novel() on neighbourhood, just in case there are new levels in the testing set, and step_other() on neighbourhood, because that's too many levels — though I'm going to drop the threshold, because the default is set a little high for this. Now some text processing: I'm going to load the textrecipes package and tokenize the name of the listing, remove stop words from name, and set a token filter on name, meaning I'm just going to keep the top tokens. There are a couple of ways to do it; I'm going to go real low here, like 20... I'll do 30. So: keep only the top 30 tokens after removing stop words, just to get something working so I can demo this. Then I could weight by term frequency or by tf-idf; either one would be a good starting place here. With only 30 tokens we're not really doing that much — we're just getting that information into the model — so I'll weight by term frequency. Let's run this, and just so I can show you what this looks like, let's prep() the recipe. What it's doing now: making sure we can handle any new factor levels, collapsing infrequent factor levels, and then this handling of the text data. And here's what the output looks like: we've got latitude and longitude; we've got "other" levels that have replaced the infrequently used levels in neighbourhood; the other columns look the same as they did before; and then we have all these new variables, about 30 of them.
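A sketch of that recipe; the predictor names follow the Kaggle dataset's columns, and the `threshold` value is my guess at what "dropping the threshold" means:

```r
library(tidymodels)
library(textrecipes)

nyc_rec <-
  recipe(price ~ latitude + longitude + neighbourhood + room_type +
           minimum_nights + number_of_reviews + availability_365 + name,
         data = nyc_train) %>%
  step_novel(neighbourhood) %>%                     # guard against unseen levels at test time
  step_other(neighbourhood, threshold = 0.01) %>%   # collapse rare neighbourhoods into "other"
  step_tokenize(name) %>%                           # split listing names into tokens
  step_stopwords(name) %>%                          # drop stop words
  step_tokenfilter(name, max_tokens = 30) %>%       # keep only the 30 most common tokens
  step_tf(name)                                     # weight by term frequency

# Peek at the engineered training data
prep(nyc_rec) %>% bake(new_data = NULL) %>% glimpse()
```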
They're things like bright, cozy, nyc, room, studio, village, loft. If one of those is in the name of one of these Airbnb listings, the column gets a one — or a two; this one has "home" twice — and if not, it gets a zero. So these are basically indicator variables now, zeros and ones, and that's the data we're going to use as input.

For a model, I'm just going to start really basic with a bagged tree model; I like this for just getting something started. Let me load the package so I've got bag_tree(). I'm going to bump up min_n just to get something going; this is a decision-tree model, and I'm not even going to go very high on the number of bagged trees — I'll do 25 — and set_mode("regression"), because I'm predicting that price, on a log scale. Let me make a workflow here; I'll call it bag_wf: workflow(), then add_recipe() with this nyc_rec, then add_model() with my bagged tree. I'll set a seed, because it's doing bootstrap aggregation for me, and then fit the workflow to my training data one time, calling it bag_fit. This kind of model has a lot in common with a random forest and a lot in common with xgboost models — it's based on trees — but oops, it did not work. Oh right, oh gosh: there are zeros in there, so I'm just going to add a dollar to all the prices before the log. Let me try that again. Since I'm dealing in dollars, adding one isn't that unreasonable a thing to do. Adding one before a log transform when you have zeros is not something the statisticians in your life are going to be very happy about, but I think it's a reasonable choice in this case because I'm dealing in dollars.
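A sketch of the model specification and the single fit, assuming the baguette package's `bag_tree()` with the rpart engine; the exact `min_n` and seed are guesses:

```r
library(baguette)

bag_spec <-
  bag_tree(min_n = 10) %>%
  set_engine("rpart", times = 25) %>%   # 25 bootstrap-aggregated trees
  set_mode("regression")

bag_wf <-
  workflow() %>%
  add_recipe(nyc_rec) %>%
  add_model(bag_spec)

set.seed(123)
bag_fit <- fit(bag_wf, data = nyc_train)
bag_fit   # printing the fit shows per-variable importance from the bagged trees
```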
OK, so the great thing about a bagged tree model is that you get variable importance scores for free, because you can ask how many times each variable was used and how important it was across all the splits. room_type is by far the most important: is it the whole apartment, or just a room? Then we've got where it is: latitude, longitude, neighbourhood. Then information that came from the name: "room", then "private" — these are probably things that drive the price down — then minimum_nights, availability, number_of_reviews, and "studio". Is it a studio apartment? That probably drives it down too; I'm not entirely sure. But this is the information we have on what's most important in these scores.

So we've fit this one time. Now, how do I know how good this model is? If I were competing on SLICED, how would I know how well I was going to do relative to other people? And if I were trying to evaluate this model in the real world, how would I know how good it is? The way I know is by using those resamples I made: fitting to the analysis set of each resample, evaluating on the assessment set, and seeing how it did. So I'm going to fit that same model I just fit one time to all these resamples, these folds that I made — let's call this bag_rs — and then I'll collect_metrics() on it. Let's run this and talk about what's happening. I set up a parallel backend on my computer, so for these folds — five times — it's going to fit the model on the analysis observations and evaluate on the assessment observations, and then I'll have five estimates of performance for my model.
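A sketch of that resampling step; the `doParallel` call is the usual idiom for a parallel backend, and the seed is an assumption:

```r
doParallel::registerDoParallel()   # parallel backend so the 5 fits run concurrently

set.seed(123)
bag_rs <- fit_resamples(bag_wf, nyc_folds)

collect_metrics(bag_rs)   # default metrics: rmse and rsq, here on the log scale
```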
You saw how long it took to fit one time, so it will take about five times that long to go through and do that; this is a model that's pretty quick, but this is a pretty big dataset. After it runs, let's look — I just used the default metrics. So this rmse here is root mean squared error, and since we trained on the log of price, it's root mean squared error on the log scale. That gives me a pretty good idea of how I'd do relative to other people, and certainly, if I changed the model, I could understand whether it got better or worse. The thing is, on SLICED last night they weren't evaluating on root mean squared error of the log of price; they were actually evaluating on root mean squared log error, which is different. So let's get some test data with predictions first, and then walk through how to make a custom metric. I'm going to use augment() on that bag_fit — remember, that's the one I fit one time to all the training data — augmenting onto the testing data; let's call that test_rs. Remember, I'm still on the log scale here: I'm making predictions on the log scale, so I have the price here and the prediction here, and both are on the log scale. That means I can make a graph and show you. Put price and .pred in, geom_point — there are still kind of a lot of points, so I'll set the alpha lower — and let's put a slope-one line on there; make it gray, make it transparent. Let's look at this... oh, that didn't work at all, there are no points. Why did this not work? Oh — I need a plus. OK, now there are points on the line, and they're close to it. That's how well my model is doing, just this first model.
It's an R-squared of 0.603 — that's how well I'm doing based on these predictors I have. But notice these are still on the log scale, the log of price. So I can take these values up to e to those values, and, just so we can see something interesting, let's map color too. Now if I plot this, everything's going to be squished down again, because I've taken it off the log scale, so let's add scale_x_log10() and scale_y_log10(), and labs(color = NULL). OK, now I've put it back on the real price scale, so I can use dollar labels on the axes just to emphasize that: x is the true price and y is the predicted price. You can see we've got more of that Manhattan green up at the higher prices, and the other boroughs down here, so this is looking pretty good. And this shape is showing us some heteroskedasticity — more spread at the higher prices — which is really common in my experience when you train a model to predict something log-normal like a price.

OK, so if I want to know how I'm doing on the specific metric SLICED was going to use: this rmse is close to it, but it's actually not it. If I want to make a custom metric, I can do that in tidymodels. We have an article on tidymodels.org about how to develop your own metric, but I'm going to walk through it here. We do need the rlang package loaded to be able to get through this. The first thing we need to do is make a vector representation of the metric: instead of rmse, ours is called rmsle.
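The finished truth-versus-prediction plot, as a sketch; I back-transform with `exp() - 1` to undo the `log(price + 1)`, and the alpha and line styling are guesses:

```r
test_rs <- augment(bag_fit, new_data = nyc_test)   # adds a .pred column

test_rs %>%
  ggplot(aes(exp(price) - 1, exp(.pred) - 1, color = neighbourhood_group)) +
  geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.7) +
  geom_point(alpha = 0.2) +
  scale_x_log10(labels = scales::dollar_format()) +
  scale_y_log10(labels = scales::dollar_format()) +
  labs(x = "True price", y = "Predicted price", color = NULL)
```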
So we'll have rmsle_vec. Let me look at the rmse_vec function, so we can copy some of this and get it right: ours will have the same inputs, and this metric checking will be the same except with the "l" added. The implementation we're going to write is a function of truth and estimate, and if you click through to the definition, the thing that's different is that instead of truth, it's the log of truth plus one, and instead of estimate, it's the log of estimate plus one. The reason rmse on the log scale isn't the same as this is because of how the squares, means, and square roots work out mathematically. OK, did we change everything? rmsle, truth, estimate... this part is metric_vec_template(); we can make it look a little nicer, and we need the dots here, I believe. So now we've got our vector implementation: if we put in a truth vector and an estimate vector, we get out one number. Next we need to make a generic rmsle: a function that takes data and calls UseMethod("rmsle"). Then we wrap it with new_numeric_metric(), because it's a metric that works on numbers, not classes, with direction = "minimize": smaller is better for this metric, not like accuracy where bigger is better.
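Pulling those pieces together, roughly as in the tidymodels custom-metric article of the time; this uses yardstick's older `metric_vec_template()` API (newer yardstick versions replace these helpers with `numeric_metric_summarizer()`-style functions):

```r
library(rlang)
library(yardstick)

# Vector implementation: takes truth and estimate vectors, returns one number
rmsle_vec <- function(truth, estimate, na_rm = TRUE, ...) {
  rmsle_impl <- function(truth, estimate) {
    sqrt(mean((log(truth + 1) - log(estimate + 1))^2))
  }
  metric_vec_template(
    metric_impl = rmsle_impl,
    truth = truth,
    estimate = estimate,
    na_rm = na_rm,
    cls = "numeric",
    ...
  )
}

# Generic, marked as a numeric metric where smaller is better
rmsle <- function(data, ...) UseMethod("rmsle")
rmsle <- new_numeric_metric(rmsle, direction = "minimize")
```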
Then we just need to make one more thing, and it's going to look very similar to rmse.data.frame — whoops, did I type that wrong? I can just write it out. So I'm going to make one more function, the method for a data frame, and inside it's just metric_summarizer(). I don't need to know what's in metric_summarizer; I just need to know what to call: my metric name is "rmsle", the metric function is rmsle_vec, then data = data, and for truth and estimate we can look at the article to see how to use rlang here so these can be passed in as names — !! and enquo(). Some of the template doesn't apply, because ours isn't a class metric and we're not using the options, and we pass the dots through so that whatever people pass in gets passed along as well. OK, I think this is it; let's see if I did it right. So now I've made a new metric. If I wanted, I could make a metric_set() with rmsle as well as, say, mean absolute percentage error; I can put this in a metric set if I want. However, this is a metric that should not be applied to something that's already on a log scale — that doesn't make a ton of sense. So let's take this test data I made — remember, if I look at price and .pred, those are on the log scale — and change them with across(), so they're back on the real price scale. So now I'm back on the real price scale.
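And the data frame method, again in the older yardstick idiom (`metric_summarizer()`), followed by the back-transform-then-evaluate step; the `- 1` matches the `log(price + 1)` transform used for training:

```r
# Data frame method: lets rmsle() work on tibbles and inside metric sets
rmsle.data.frame <- function(data, truth, estimate, na_rm = TRUE, ...) {
  metric_summarizer(
    metric_nm = "rmsle",
    metric_fn = rmsle_vec,
    data = data,
    truth = !!enquo(truth),
    estimate = !!enquo(estimate),
    na_rm = na_rm,
    ...
  )
}

# Put predictions back on the dollar scale, then evaluate
test_rs %>%
  mutate(across(c(price, .pred), ~ exp(.x) - 1)) %>%
  rmsle(price, .pred)
```

It can also go into a set with built-in metrics, e.g. `metric_set(rmsle, mape)`, as long as the columns it is applied to are on the original dollar scale.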
This one, I bet, is a single room; this is maybe a whole house; and so on. Now that they're back on a price scale, I could do something like mean absolute percentage error, mape(price, .pred) — that's a metric built into yardstick, in relative units, telling me how wrong I am relative to the price. And now my moment of truth: root mean squared log error. Did it work? No. All right, it's almost working; let's see what I did wrong. Let's space this out: sqrt, mean, log... that looks right to me. Function, truth... let's look and see what else I may have missed: rmsle.data.frame, truth, data, data... there we go. Let's try this again... one more time, moment of truth this time: 0.431. So that's the actual metric that was being evaluated on SLICED. Now, these two numbers — rmse on the log scale and rmsle on the dollar scale — are very close together, and we can see exactly how close they are for this particular example by computing rmse(price, .pred) before and after the transformation. The math comes out very close, because you're pushing things on and off the log scale before versus after, but exactly how close depends on how many really high-priced listings you have and what the distributions look like. If you look at SLICED, this is actually pretty competitive with how things went. And as a reminder, I'm not using xgboost; this is a very straightforward, simple, tree-based model that's just bootstrap aggregated.
Nothing super fancy is happening in the feature engineering here, either; probably the best thing is that I'm using some of the text information and demonstrating how to combine rectangular information — numeric and categorical — with unstructured text using feature engineering. All right, so I got the custom metric to work: that was the metric used on SLICED this week to evaluate prices on the original dollar scale, and it's a great example of how, if you're ever in a situation where you have something custom or specific you want to use to evaluate your own models, you can build it. I do want to highlight that I transformed price to the log scale for my modeling, because with that kind of distribution, in almost all situations I've ever been in, the model will perform better if it's trained on the log scale. Then, if you want to make predictions, you go back to the original scale — and in this case it was being evaluated on that scale with a metric that itself involves a log, so there was a lot of back and forth. This particular model would not typically be considered a super high-performing model, but it actually did pretty well compared to what most people got during the competition, and I think that's because of the way I was able to incorporate the text information; that's often a really powerful tool, and in tidymodels our support for that is quite good. So I hope this was helpful, and I will see you next time!
Info
Channel: Julia Silge
Views: 2,348
Rating: 4.939394 out of 5
Id: VwZKK6kldqo
Length: 40min 54sec (2454 seconds)
Published: Wed Jun 30 2021