Robust estimation with tidymodels bootstrap resampling

Captions

Hi, my name is Julia Silge, and I am a data scientist and software engineer at RStudio. In this video we are going to use this week's Tidy Tuesday data about beer production, and we're going to show how to use resampling to make robust estimates of quantities that we're interested in. Specifically, this data set has information on the materials used to brew beer, and we are going to look at the relationship between how much malt beer producers use and how much sugar they use, and use bootstrap resampling to estimate that in a robust way. So this is a great video if you want to understand better what bootstrap resampling is and how it can be used less in a predictive context and more in an inferential kind of context. So let's get started.

We have this data about beer production. Let's call it brewing_materials_raw as we get started, and let's look at what we have here. In this video we are going to use this data about the materials that are used to brew beer, and instead of building a predictive model, we are going to use the data to understand something about the relationships in it. So this is going to be a little bit of a different video. It's a great video for getting started with understanding resampling methods, and resampling methods are important all across statistics and machine learning, so I think this will be interesting to look at.

Let's look at the types that we have. There are an even number of rows for all these things, and we have everything from hops to barley and rice. For each of these months, this column is measured in barrels of whatever these things are. Let's see what is used the most. So these are totals, and then the thing that is used
the most is malt. Malt is used in, I think, almost all beers. I mean, not in gluten-free beer, and maybe there are some other beers that are not made with malt, but I'm going to say it is very common. Obviously it's one of the biggest components here of the brewing materials. The next highest one is sugar and syrups, so sugar and syrups are also used a lot in making beer.

What we want to do here is use this data set to answer the question: how much sugar do beer producers need per barrel of malt? That is, how much of a barrel of sugar do beer producers need for every barrel of malt that they use? This is going to be for the whole data set, which covers many months and years. I'm sure every beer producer has their own recipe and uses different amounts of these different things, like hops and everything, but we can use the data about the brewing materials used to get an overall estimate of how much sugar beer producers use per barrel of malt.

So we've got brewing_materials_raw here; let's start getting it ready. We've got year and month separated out here, so let's make a date column. I'm going to do this in two steps. First I'm going to paste together year with a dash, then month, and then "-01", and that gives me a new column called date, which is a character. Then, in the next step, let me say date equals ymd of date (year, month, day), so now it's a date. Now it's the first of the month for each of these things that I have here. Let's put date on the x-axis and the number of barrels, which is this month_current value, on the y-axis, and let's try some points
here, like so. Let us do a little bit of filtering. Let's filter on type and look at a couple of things: the malt and malt products, the sugar and syrups, and hops. Which one? Dry hops are used more than the extract here. Then let's make color equal to type, like that. OK, nice, this is helpful.

We see the malt and malt products going up and down over the course of a year. I guess they make a lot in the summer and less in the winter: summer, winter, summer, winter. We see the sugar going up and down in the same way. Oh, I might be wrong about malt being used in basically all beer, because it looks like it's trending down a bit over time, so people must be using other things besides malt. Sugar doesn't look like it's trending down so much, but we'll be able to get an estimate here.

There's something weird going on past 2016. It looks like the data is being counted in a different way, or something different is going on there, so let's say year is less than 2016 and take a look. The other really weird thing is in these December months. I think an error has been made in some of these December months. You can't use that much hops to make beer, and it doesn't really make sense that malt and sugar are that low, so I believe those months to be in error. Let's say month is not equal to 12; that would take out all the Decembers. Which years is this? Let's say year is in 2014 to 2015, like so, and let's see, did I do that correctly? Yeah, OK. So we got rid of those points that, to me, pretty clearly look like they were an
error. Did I get rid of all those Decembers? I think I did. Does this do it? No, gosh, OK. Well, in the interest of time, let's try it like this, with not month equal to 12 and year in 2008 to 2014, like so. [Music] Whoops-a-daisy, so now I'm back to 2018, and I think I needed the double ands there. Oh gosh. All right, rather than boring everyone with me hassling with that, which of course is something that can be solved, we're just going to take out the Decembers here and keep going.

All right, so we've got hops down there, and the sugar, which does seem to go up and down, but not as dramatically as the malt. So that's a little bit of an exploratory graph.

Next, let's take this, so this is going to be our brewing_filtered, like so. Great. Let's take this filtered data set and reshape it. Let's select the date column that we made, the type, and month_current, which is the barrels of that material that are used. Now we're going to reshape this with pivot_wider: the names come from the type column and the values come from the month_current column. So now we have, for every month, the malt, the sugar, and the hops. Let's clean up those names with clean_names, like that. All right, great. Let's call this brewing_materials.

OK, so with brewing_materials, let's see, for example, what the relationship is between the malt and malt products and the sugar and syrups. Let's put some points on there and see what that looks like. And we see, you know, kind
of a relationship. It is certainly not a tight correlation, but we do see that in months when beer producers are using more malt they're using more sugar, and when they're using less malt they're using less sugar. So let's put a line on there. Let's just put a straight line, because that doesn't look like I would want to learn anything beyond a straight line from it.

So this is a fit line through the data, and it is representative of a model that we can go through and fit here in a minute. But this kind of model is based on a lot of statistical assumptions that may not actually hold in this real data that we have. So what we're going to do is show how to use a resampling approach to get a better estimate.

Before we do that, let's just fit a simple, straightforward model using our modeling fundamentals in R. Let's say we want to fit a linear model, just using ordinary least squares, for how much sugar and syrups we use for every barrel of malt. I'm going to put a 0 here for the intercept, meaning I don't want to fit an intercept. I'm going to fit this model with the assumption that if, in some month, there is no malt being used to make beer, then we are not going to need any sugar, because malt is the bulk of what most of the beer is made of. The data is our brewing_materials here, and let's call this beer_fit, like so, and fit it. So we can do summary of beer_fit and
see what our results are. Here's our estimate and some standard errors on the estimate. What this is, is how much of a barrel of [Music] sugar and syrups a beer producer needs for every barrel of malt that they have. That's what we're estimating here. We could also load broom; I'll just load tidymodels here, because I'm going to use rsample as well. If we load tidymodels, we can also tidy that beer_fit and get this information out in a nice data frame that I can keep and save, and get at the results that I want. So this would be the way, if we wanted to fit it one time to the whole data set, using the fairly strong assumptions that go into ordinary least squares. But that's not what we're going to do here.

In this video, instead, we're going to show how I can get this estimate, and some kind of confidence interval on that estimate, not by fitting just one time but by using bootstrap resampling. We're going to use a function from rsample called bootstraps. A bootstrap resample is when we take our data set, all this data about beer production, and draw randomly from it with replacement, so that the bootstrap resample is the same size as the original data set but has duplicates in it. It's like a newly created data set. I don't do it one time, I do it many, many times, and we can use these resamples to understand the characteristics of our relationships better.

So we're going to use the bootstraps function from rsample, and we're going to put brewing_materials in, and we're going to make a lot of them. Let's make a thousand resamples. And we're
going to use an argument here, apparent equals TRUE, which adds an extra resample where the analysis and assessment sets are the entire data set. We're going to need to keep this in there for what we're going to do later. Let's call this beer_boot, like so, and let's set a seed so that this is reproducible, and then let's make these bootstraps. OK, so we've got a thousand rows here. In splits we have the information on what is in the analysis and assessment sets, but we're not actually using that. Well, maybe we are. OK, let's go on and see. So this is about our bootstrap resamples, and this is the id here for each of them. So let's keep going.

We've got beer_boot, and now what we want to do is fit a model to all of these bootstrap resamples. We're going to make a new column. You can keep models in list columns: much like this list column contains split objects, our new list column is just going to contain models. Let's call it model, and we're going to use the purrr function map. When you use map, the first thing you say is what you are mapping over, and the second thing is the function that you're applying to the thing you're mapping over. The thing we're mapping over is splits, and the function that we're going to apply is this lm call right here, this whole shebang. I think I got the right number of everything here. So we're going to fit the same thing, but instead of data equals brewing_materials we're going to say data equals dot. I think that's right. Let's see, is that right? Yes, that went really fast. Yeah, actually I think that is right. So now
we're going to make another new column in here. There's the model, and now we're going to make a new column, let's call it coefficients info, where we keep the coefficients. We're going to map again, but this time we're going to map over the model, and the thing we're going to do is tidy those models, so that we can get this out for each of the thousand new data sets that we made. Let's call this beer_models, like so, and run it.

So there are the models. What's happening is that instead of fitting the model one time to the actual original data set, we are fitting the model a thousand times on these thousand created data sets, which are based on the original data set but are created by random sampling with replacement. For each of these thousand data sets we have an estimate for how much sugar you need for a barrel of malt. This is much more robust to assumptions about what's going on in our data: by creating these simulated data sets that are based on our real data set, we can get a more robust estimate of a thing that we are interested in.

All right, let me save this as beer coefficients, like so. So now we have the coefficients, and now the output of our models is data. Let's start to evaluate what we have here. One of the really big benefits of working in the tidymodels ecosystem, even if you're not training some really fancy predictive model but instead doing something more inferential like this, is being able to treat the results of your models as data and handle them in some nice ways. So let's start by, you know, I'm just
kind of interested in the distribution of those estimates. Before, we got one value when we trained one time, but what do we get now when we use resampling instead and do it a whole bunch of times? All right, let me get a better look at this. OK, so this is very helpful. We can see how broad the distribution is and get a sense of where the central part of it is. This set of data, all of these fits, can be used to get a sense of what estimate we can give, and of the variability of that quantity that we're interested in.

We can in fact get confidence intervals from this set of data. The way I'm going to do that right now is with a function from the rsample package for confidence intervals based on bootstraps, which is just what we did. This is why we set apparent to TRUE. What we have to give it is, first, the thing that has the bootstrap resamples, and remember beer_models has the resamples, and then which of the columns has the statistics in it, and that's here in coefficients info. I think this is right. Let's see. Yeah, OK, this worked.

So here are our bootstrap confidence intervals. We have an estimate, and a lower and an upper bound, right here. This is the answer to how much of a barrel of sugar beer producers need for every barrel of malt that they use to make beer. It's like five to one: they need five barrels of malt for every one
barrel of sugar. And we have the bootstrap confidence intervals here as well. So this is a really powerful way to get robust estimates of some quantity that you're interested in.

We can also visualize how different these fits are by going back to this models data frame that we have. What we'll do is make a new column as well. The broom package has three verbs that are used a lot: tidy; glance, which says just tell me stuff about the model; and augment, which says go back to the original data points and add stuff to them. So let's use augment here. We're going to say augmented equals map of model and augment. The nice thing, well, it's nice sometimes and not nice sometimes, is that an lm model actually contains all the data it was trained on, so if you just augment a model it'll give you the predictions from the model there. Then we can unnest here to get all those augmented things. A thousand might be too many, because what we're trying to do here is make a visualization, so let's do 200 instead. Let's call this beer aug, like so, and run it and see what results we get.

So we have the ids here, and remember that the ids are the bootstrap resamples. For every id we have all the training data, all the points that were used to fit the model in that bootstrap resample. So here is, for example, the real value and here is the fitted value: real, fitted, real, fitted, and so on. What we can do with this augmented data set is pipe it to ggplot. We're going to do just like we did when we looked at the scatterplot before, but instead of fitting a line, we
are going to draw the lines from the fitted column. We're going to make a new aes here where x is the same but y is equal to fitted instead of the real value. We want to make a whole bunch of lines, so we have to tell ggplot which points to connect together, and you do that with the group argument. Group is id here, so each group is one bootstrap resample. This is going to be a lot of lines, so let's make these quite transparent, and let's give them some kind of nice color. I'm in a turquoise-y kind of mood right now. So let's see what we have here. Nice. OK, let me make this bigger so we can get a nice view of what it is that we're seeing.

The points are the real points that we got from the beer production data set, and these sets of lines are what you get when you do bootstrap resampling: when you draw with replacement from the data set and then fit a line to it, what kinds of different lines can you get? This set of 200 lines shows you visually where the lines can be, how different they can be in slope, and how much variation we get there. So this is a way to visualize that. Sweet.

So we visualized the distribution of the parameters, we got our confidence interval out, and then we were able to in fact visualize what the fits look like. We did it: we used bootstrap resampling to estimate how much sugar beer producers use relative to malt. And this one seems like it had the right amount, so that's good. Mmm. You can use this approach when you want to get robust estimates of the kinds of quantities that you are interested in. So I hope this was helpful, I hope you enjoy whatever you may be having, and I will see you next time.
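
The resampling idea described in the captions can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in plain Python rather than the R and tidymodels code used in the video, the monthly malt and sugar numbers are made up (they are not the Tidy Tuesday beer data), and slope_through_origin and bootstrap_slopes are hypothetical helper names, not functions from any library. It draws rows with replacement, refits a least-squares slope with no intercept on each resample, and reads a 95% percentile interval off the resulting slopes.

```python
import random
import statistics


def slope_through_origin(xs, ys):
    """Least-squares slope b for y = b * x, with no intercept term."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def bootstrap_slopes(xs, ys, times=1000, seed=123):
    """Refit the slope on `times` resamples drawn with replacement.

    Each resample has the same number of rows as the original data,
    so some rows appear more than once and others not at all.
    """
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(times):
        rows = [rng.randrange(n) for _ in range(n)]
        slopes.append(
            slope_through_origin([xs[i] for i in rows], [ys[i] for i in rows])
        )
    return slopes


# Made-up monthly data (assumption): sugar is roughly one fifth of malt, plus noise.
data_rng = random.Random(2020)
malt = [data_rng.uniform(300.0, 500.0) for _ in range(90)]
sugar = [0.2 * m + data_rng.gauss(0.0, 10.0) for m in malt]

# One fit to the whole data set, as in the single lm fit in the video.
estimate = slope_through_origin(malt, sugar)

# Many fits to bootstrap resamples, then a percentile interval.
slopes = bootstrap_slopes(malt, sugar)
cuts = statistics.quantiles(slopes, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
lower, upper = cuts[0], cuts[-1]

print(f"single fit: {estimate:.4f}")
print(f"bootstrap 95% percentile interval: {lower:.4f} to {upper:.4f}")
```

The interval here is the simple percentile kind: sort the bootstrap slopes and take the 2.5th and 97.5th percentiles. Whether that matches the exact interval method used in the video is not something the captions confirm.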

Info

Channel: Julia Silge

Views: 3,826

Rating: 4.9741936 out of 5

Id: 7LGR1sEUXoI

Length: 32min 30sec (1950 seconds)

Published: Wed Apr 01 2020