Max Kuhn | parsnip A tidy model interface | RStudio (2019)

Captions
So, yeah, I'm here to talk to you about a package called parsnip that's been quite a long time in the making. Before I get started: why is this called parsnip? There's this other package called caret — C-A-R-E-T — and I was joking with some people in my group that the first RStudio package I made after joining the company should be called "carrot", like the vegetable. Then someone would say "I have a problem with carrot", and you'd have to ask: which was it, caret or carrot? And then somebody pointed out that a parsnip is basically a white carrot, so we felt that was a good code name for the project, and then we just kept calling it parsnip. So that's its personality.

Alex stole one of my slides, which is cool, but I've been talking about this for what feels like forever. What is frustrating, for me at least, with R packages for modeling is almost never the numerics; it's almost always the user interface. The problem Alex talked about — that struggle — is real: you see an R package and think, "I really want to try that out", and then when you start using it you go, "oh no". Sometimes people, perhaps inadvertently because they don't know much about R, make a package very difficult to use. For example, I was looking at one the other day, which I tweeted about and got really angry about, that expected your predictor data to come in not as a data frame but as a matrix. That by itself is not a deal breaker, but instead of accepting factors or dummy variables for your qualitative predictors, they wanted you to convert them to zero-based integers. That's about the most un-R way of doing things, so I said, no, I'm not going to do that. We do have some loose conventions in R about what your
modeling package should look like, but you don't have to follow them, and to be honest they're not all that specific. As another example, we usually have the formula method, and then the non-formula method where you use x and y as arguments, and you never really know, when you start using a new package, whether you'll have either of those or both. That can be frustrating. One really, really frustrating thing: sometimes you get a prediction back and — I love the ranger package, but the predictions you get out of its predict method aren't actually data frames; they're a specialized ranger object, from which you then have to extract the piece you actually want. I'll talk about glmnet a little more later, but that's another good example: for classification models, a prediction can come back as a vector, a matrix, or a multi-dimensional array depending on the data, and that is very frustrating to program with. So the point is that if you're going from model to model, trying different things, you can get really frustrated. And here's the same thing Alex showed, where just the variation in the `type` argument across predict methods is substantial. We want to solve that: I don't want to have to worry about remembering all this stuff, and all the special cases, when I go to do modeling. I tried solving that — or did kind of solve that — with caret previously. caret is a unified interface to models, written back around 2005, and while that code works pretty well, it is definitely not tidy; it's like the most untidy package. So what I wanted to do was reinvent this kind of model interface, knowing all the things I know now after implementing it for roughly 250 models. parsnip is sort of that part of caret where
we're looking at a unified interface that's really consistent with the tidyverse, and it does some things differently that I've learned to appreciate over time.

One thing we do in parsnip is organize the models a little differently than before. We start by saying what kind of model, generally speaking, you're trying to fit: a K-nearest neighbors model, a random forest, logistic regression, or, say, linear regression. We define the type of model, as opposed to naming lm or glmnet or what have you. Once we have a specification for the model, we can generalize how to fit it. If I say — this is on the next slide, I think — that I'm going to fit a linear regression, that usually means slopes and intercepts, and there are a variety of ways you can estimate those. So in parsnip we organize all these models and their interfaces by what you're trying to do, as opposed to the way you intend to do it. It has a tidy interface, so it's really consistent with the pipe and all the other tidymodels packages we have. And also — this is a really big deal — very much like broom, we spent a lot of time defining what we think a predictable interface would be. Predictable in the sense that if I make predictions on an object, do I know what I'm going to get before I get it? In many R packages you don't. So we spent a lot of time, publicly, coming up with conventions and guidelines — we published all this and asked for feedback — about what return values should look like. If you want, you can follow — I'm not going to click on that — the modeling package guidelines that we started. Some of it is specific to tidyverse things, and other parts are specific to general modeling
ideas, and you can see what our decisions were. It's not completely written in stone, though: if you have opinions, we'd love to hear them — you can file a GitHub issue and we can discuss them there. There's one post about parsnip on the tidyverse blog, and we have another queued up for after the conference that's more about the inner workings of parsnip and how it does what it does the way it does it.

One thing about ggplot2 and recipes is that they defer evaluation. For example, if you write your ggplot code, don't assign it to an object, and hit Enter, what you're doing — whether you know it or not — is invoking the print method on that object. ggplot2 doesn't actually do anything until you explicitly print the plot, whether at the command line or by saving it to an object and calling print on it; that's when all the drawing happens. It's the same with recipes: you define a recipe, but nothing really happens until you prep or bake it. Once we start deferring the evaluation of what we want to do, it opens up a lot of doors to make the workflow a little more sensible.

So let's say you have some data — say, data on cars and their miles per gallon, with, let's just say, 32 data points, as a hypothetical example. We have a meta-package called tidymodels; you can load that and you'll get dplyr and ggplot2 and parsnip and some other things. Now suppose we want to fit a penalized regression model like ridge regression — if you're used to neural networks, this is like weight decay; for the statisticians in the audience, it uses an L2 penalty. What we'll say is
that we want to fit a linear regression, and then we'll add a little bit of detail about the specifics. Let's say we roughly know what the penalty should be — a fairly low value like 0.01. We can define a specification for that model using `linear_reg()` and say that the penalty should be that value. If we print the specification, it doesn't show much detail, because we haven't specified much detail; a lot of the detail is about how we're going to fit this regression. In parsnip, as we speak, we could fit it using lm, glmnet, Spark, Stan, or keras — and that's just what we've implemented so far. What we did was decouple the estimation procedure, and the package used to accomplish it, from the actual specification of the model. So you'll see, for example, that if you want to use glmnet, we can pipe in what we call the computational engine. The computational engine is sort of a mashing together of the type of estimator — is it least squares, is it Bayesian — with the package we would use, so the engine might be lm or glmnet or keras or Stan, that kind of thing. One other thing about engines: they don't have to be in R. With reticulate and everything else, we know how to farm out the computations to a different language or platform — R has always been really good at that. Again, because we're not doing immediate execution, we can set up everything we need in order to get more general results back. So let's say we start with that regression model specification and we say we want to fit it with glmnet. You don't ever really need to use it, but I wrote a function called translate.
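A minimal sketch of that specification (assuming the tidymodels meta-package is loaded; `mixture = 0` for a pure ridge penalty is my assumption — the talk only fixes the penalty at 0.01):

```r
library(tidymodels)  # loads parsnip, dplyr, ggplot2, ...

# Say *what* to fit (linear regression, penalty 0.01),
# not yet *how* to fit it
ridge_spec <- linear_reg(penalty = 0.01, mixture = 0) %>%  # mixture = 0: ridge (my assumption)
  set_engine("glmnet")  # the computational engine, chosen separately

ridge_spec
```

Printing the specification shows only what has been given so far; calling `translate()` on it prints the engine-specific call template.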
What translate does is say: well, okay, you said you wanted to fit this kind of model, and now you're saying you want to do it with glmnet — how would that actually work? It prints out a template, a shell, of what that code would look like. If you look at the output, you can see we're using the glmnet function from that package, and that we don't yet know what x and y are. I should also say that glmnet only has an x/y interface, so if your data needs dummy variables, or you start off with a data frame, you have to do the work of creating your indicator variables, converting to a matrix, and all that, to get glmnet to work. So the underlying code uses that x/y interface for glmnet; lambda is the penalty argument for that particular function; and since we know we're doing linear regression, it automatically sets the family for regression. This is the template of the code it will use when it translates the model specification into the underlying engine code. And notice that, especially with glmnet, we don't usually need the data to make that specification. Up to now I haven't used any data — it could be mtcars, it could be Boston housing; as we know, there are only two data sets. At this point I haven't said anything about the data, and I'll show you a counterexample of that in a minute. Once we have our data, we can actually fit a model to this specification. You can use the fit function: you give it a formula and the data set, and it goes out and fits your glmnet model. If you do want to use the formula method because that's more convenient for you, even though glmnet doesn't have one, parsnip will do the same thing that caret and others do: it does the work of generating all the dummy variables, tracks everything it needs to, keeps those preprocessing objects, and then fits the actual glmnet model, with everything it needs encapsulated in the parsnip object to make predictions in the future. So you can use fit for the formula method, and fit_xy if you just want to give it x and y.

Now, prediction. I shouldn't dog glmnet too much, but it is very frustrating that you get very different data types in different situations. If you're just working with one data set you might not notice, but if you do any programming with glmnet there are a ton of if-then statements: if the number of levels in my outcome is three, I do this; if it's one, I do that. What we have instead is this idea that you ought to get a very formally defined output back when you make predictions. For regression, at least, our first approach follows what broom did: you get a tibble back, and that tibble always has the same number of rows as the input data set — I'll show you in a minute why that matters. And in that tibble, for regression, the column you get is always called `.pred`, no matter what model you used or how the model works; you always get a one-column tibble of `.pred`. In this case I'm fitting the model to the first 29 rows and predicting the last three, so we get three rows back. Why would I make a big deal out of this? Well, a lot of common R prediction functions apply na.omit, either explicitly or silently. So if I have a hundred data points, three of those rows have missing values, and I make a prediction, I get 97 rows back — and then, whoa, now I have to figure out where the three missing rows were before I can merge the predictions into a data set. That's extra work.
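The fit-then-predict pattern above, as a sketch (using `mtcars` to stand in for the 32-row car data; the 29/3 split follows the talk):

```r
library(tidymodels)

spec <- linear_reg(penalty = 0.01) %>%
  set_engine("glmnet")

# fit() accepts a formula even though glmnet itself only has x/y:
# parsnip builds the indicator variables and the matrix behind the scenes
fit_obj <- fit(spec, mpg ~ ., data = mtcars[1:29, ])

# Always a tibble, always one column named .pred,
# always as many rows as new_data (three here)
predict(fit_obj, new_data = mtcars[30:32, ])
```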
We became very, very frustrated with that; hence this idea that you get back the same number of rows you started with. If I induce a missing value in the first data point, fit the same model, and predict, I always get an NA back, so you can always just bind these columns onto your data frame and not have to worry about whether things match up.

One other thing about glmnet, which is really awesome: glmnet has this penalty parameter, and I specified it earlier, but that's kind of an unusual thing to do with glmnet. One really cool aspect of this particular model is that with a single model fit it can compute the entire path through a whole spectrum of lambda values, so all the lambda values for that model are encapsulated in the glmnet object. When I get predictions from glmnet, I could say "give me this lambda, or that one, or this other one" and predict over and over again, but the smarter thing is to not specify lambda and basically get an object that can predict at any lambda at once. That's a really cool aspect of the model. The problem with doing that in that package, though, is that it gives you a bunch of labeled columns, and you have to kind of trust that the lambda values you're predicting at correspond to those columns. The first time I looked at it, I couldn't see it stated anywhere whether they'd come in increasing or decreasing order, so it's a little scary when you first start using it. So: since glmnet can return multiple predictions at different lambdas for the same row, what do we do? I start with an input of three rows and I should come out with an output of three rows — how do we take care of that? We basically produce a list column. In this case there are 80 possible values of lambda in the fitted path.
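The missing-value guarantee described above might look like this (a sketch; the induced NA mirrors his example):

```r
library(tidymodels)

spec    <- linear_reg(penalty = 0.01) %>% set_engine("glmnet")
fit_obj <- fit(spec, mpg ~ ., data = mtcars[1:29, ])

holdout <- mtcars[30:32, ]
holdout$disp[1] <- NA  # induce a missing value in the first row

# Still three rows back: .pred is NA for the incomplete row, so the
# result can be bind_cols()-ed straight onto holdout without realigning
predict(fit_obj, new_data = holdout)
```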
When I make the prediction, then, instead of using the predict method — because not all models have this feature — we use another function called multi_predict. There's a penalty (lambda) argument here for glmnet, and by default it uses all the lambdas. In this particular instance there are 80 possible lambdas, so what you get back is a tibble with one row per observation, and for each row a nested tibble with two columns and 80 rows. If I look at the first row — and remember, the holdout object here had a missing value in its first row — what it gives you is a tibble with eighty missing values for `.pred`, plus all the corresponding lambdas you would have gotten predictions for if you hadn't had missing data. It's maybe more informative to look at the second one: in the first five rows of that tibble you can see our predicted values across the lambdas. You might say, "jeez, I can't easily plot that", but the good news is that the standard tidyr tools will simply make that happen for you, which is a nice little feature. So we want to have these defined standards. Another good example: if you're doing quantile regression, you might want predictions at, I don't know, 10 or 15 or a hundred different quantiles. Rather than you having to program your way around that, you get back a tibble with those values in it, and you don't have to do special things for quantile regression versus glmnet versus something else.

One last thing I'll talk about is this idea of data descriptors. If you think about random forests, which a lot of people have seen, the main tuning parameter is something called mtry: when the random forest goes to build a tree, mtry is the number of randomly selected predictors it considers.
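A sketch of the `multi_predict()` pattern over the whole glmnet path (the exact number of penalty values in the path depends on the fit, so the 80 from the talk is not guaranteed here):

```r
library(tidymodels)

# Leave the penalty unspecified so the whole lambda path is kept
spec    <- linear_reg() %>% set_engine("glmnet")
fit_obj <- fit(spec, mpg ~ ., data = mtcars[1:29, ])

# One row per observation; .pred is a list column whose tibbles
# hold a prediction for every penalty value in the path
path_preds <- multi_predict(fit_obj, new_data = mtcars[30:32, ])

path_preds$.pred[[2]]                    # penalties and predictions for row 2
tidyr::unnest(path_preds, cols = .pred)  # flatten, e.g. for plotting
```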
Those are the candidates for each split. So if you say mtry = 3 and you have 100 predictors, it chooses a random 3 out of that 100 as the candidate variables to split on; then it gets to the next split and chooses another 3. It sounds like a weird thing to do, but there's a very good reason they do it, and it has a big effect on performance. So when I said we usually don't need the data to specify a model, that's really not true for a random forest: mtry is related to the number of columns in the data set. So we started thinking: jeez, how would you write a specification that does involve the data, when you need something that uses the characteristics of the data? We have these things called data descriptors — you can see them listed here — little functions that can capture different aspects of the data, if not the data themselves. If I want to know how many predictors I have, we have a little data descriptor function that will calculate that, either before or after any dummy variables are created. You can use these little functions in the model specification. So let's say I'm going to fit a random forest model and I want to use 75% of the predictors, whether I have ten predictors or a thousand. You can start the specification for a random forest, tell it, say, how many trees to use, and then say: give me the number of predictors before dummy variables, take 75% of those, and make sure I get an integer by using floor. When you run this model specification, of course, there's no data yet, so it just saves that expression — and whether you're fitting this model on the entire training set, or you're resampling with, say, cross-validation, where each resample is about 90 percent of your training set, it does this calculation every time.
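The descriptor-based specification might be sketched like this (`.preds()` is parsnip's descriptor for the number of predictors before dummy variables are made; the 75% follows his example, and the 1000 trees are my placeholder):

```r
library(tidymodels)

# mtry is captured as an unevaluated expression; no data is needed yet
rf_spec <- rand_forest(
    mode  = "regression",
    trees = 1000,
    mtry  = floor(.preds() * 0.75)
  ) %>%
  set_engine("ranger")

# translate(rf_spec) shows ranger's mtry argument with the expression
# still unevaluated. fit() evaluates it against the data:
# mpg ~ . on mtcars has 10 predictors, so mtry becomes floor(10 * 0.75) = 7
rf_fit <- fit(rf_spec, mpg ~ ., data = mtcars)
```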
All right, so if we use translate here, the first thing you see is that it substituted `mtry` — the name you would otherwise have had to remember for the ranger package — in place of our argument, and you still see that expression, not yet evaluated. When I go to evaluate it using my data, I get a value of seven: it runs only at that point, when you invoke the fit command, and substitutes in the value for the data you're actually using. And you can see that we actually get the right value here.

There's a ton more to talk about with parsnip, but I'm going to get the stink-eye from Davis if I keep talking, so just a quick look at the things we're thinking about. We want to think about how we talk about models, because sometimes how we talk about models is related to the way the data are structured. I used to do a lot of work with repeated measures on a particular experimental unit — if you have a database of customers, you might have multiple rows over time per customer, or in a clinical trial you might follow patients over time. If you have a really simple repeated-measures design with a single clustering effect or random effect, like patient or customer, you might fit that with a random-intercept model, or a hierarchical Bayesian model, or a correlated-error model like GEE. What we could do is have the different engines reflect something about the model and the experimental design for it, so you could try all these different things — models that are in some ways functionally identical, or extremely similar in spirit. So it doesn't all have to be
about random forest–style models; sometimes we can fold in the type of data that we're using. Okay, so thanks a lot — I appreciate you coming. [Applause]

[Moderator] I think we have time for one question.

[Audience] Thank you for this really exciting and very useful development — it's really great to see. I'd really like to use it in production. Now, in November you have version 0.0.1 of parsnip and so on — would you recommend that we actually go to the customer with models built in there, or how long do we have to wait?

[Max] That's kind of a tough question... three, four years tops. Five, no more than five. No — okay, yeah, this is a good question. It's the first version, and we want to see what people think and what they encounter. It's been out for a few months and we've had almost no GitHub issues, which hopefully means people are using it and not finding any. But the main thing I'd say is that parsnip and rsample and a lot of other things are pieces of a wider puzzle, and two or three of those pieces don't exist yet. For example, we want to integrate parsnip with recipes and things like that, so we're going to have — I don't know if it's going to be called a pipeline — but we're going to have a pipeline-like object that you can then call fit on. So I wouldn't say parsnip isn't ready for prime time; I'd say you can use it, but you'd still be lacking a lot of things you would get otherwise. I'm hoping that by this time next year, if you were to ask that question, I'd say "yeah, we're good". In my brain, that's the timeline I anticipate.

[Audience] Thank you. [Music]
Info
Channel: RStudio
Views: 3,933
Rating: 5 out of 5
Keywords: Max Kuhn, parsnip, rstudio, data science, machine learning, python, stats, tidyverse, data visualization, data viz, ggplot, technology, coding, connect, server pro, shiny, rmarkdown, package manager, CRAN, interoperability, serious data science, dplyr, forcats, ggplot2, tibble, readr, stringr, tidyr, purrr, github, data wrangling, tidy data, odbc, rayshader, plumber, blogdown, gt, lazy evaluation, tidymodels, statistics, debugging, programming education, rstats
Id: ZFTjroC8bTg
Length: 22min 38sec (1358 seconds)
Published: Mon Sep 23 2019