Fit and predict with logistic regression for bird bath observations in Australia

Captions
Hi, my name is Julia Silge, and I'm a data scientist and software engineer at RStudio. Today in this screencast we're going to use this week's Tidy Tuesday data set on bird baths in Australia, and we're going to train a model to predict whether we see a bird of a certain kind at a bird bath, and we're going to control for whether the bird bath is urban or rural. This screencast is going to be a good one if you are just getting started with tidymodels, or maybe even getting started with modeling in general, because we're going to focus on some core skills. How do you combine a model specification (you've chosen some kind of model to use) together with a feature engineering recipe, as we call it, a set of preprocessing instructions? How can we build up the data preprocessing recipe to do more if we want to? Then we'll fit, we'll predict, we'll show how to evaluate the model, and how to predict on new data. So let's learn about birds.

Okay, let's learn about bird baths in Australia. This is this week's Tidy Tuesday data set, and it is pretty fun, I think, pretty interesting. Let's take a look at it. We have quite a bit of data here. If we see what variables we have: this was citizen science data on what kinds of birds were observed at these different bird baths, and it was taken in a couple of different years, in urban and rural situations, in different regions in Australia, and then there are lots of different bird types that were observed. So let's look at a couple of things. How many bird baths that are urban and rural do we have? Okay, so these NAs, I think, are summary rows. Yes, they are, and actually this might be helpful: we can use them to find the most important birds, because there are a ton of different kinds of birds, and for
this modeling that we're going to do, I don't think we're going to be able to handle all the birds; we're just going to be able to look at some of them. So let's arrange by minus bird count, like this. These are the birds there are the most of in this data set, and these NA rows, if you go back and look at the Excel file, are summary rows. If we slice by bird count and take the top, let's try 15 rows, these are the birds that are in this data set the most, and if we pull out the bird type then we just have a vector of these birds. Let's call this top_birds; these are the top Australian birds that we can use here.

Now that we have this, let's do a little bit of exploratory data analysis. Now let's do the opposite: we want to leave out those summary rows from the Excel file that this data was created from. Let's also keep the rows with bird type in the top_birds vector that we have. I'm interested in differences by the urban and rural situations, the bird baths that were in cities versus more rural areas, and then by these different bird types, and then let's summarize bird count. You know what, before we do that: do we count them up? I don't think so. Notice here we've got lots and lots of zeros, and then there's a one when they see a bird; I think the count is only ever one. Let's look at this: if we pull out bird count and then do summary, yeah, it's only either zero or one, and it's mostly zeros. Most of the time we don't see any birds of those specific kinds. So if we say mean of bird count, if we did something like this, what would this say? You know what, I'm going to set .groups to "drop", like that. So this
is saying that in rural bird baths where this citizen science survey was done, the probability (yeah, I think it's a probability) that people saw an Australian magpie was about 25 percent over all the times, places, and bird baths that they looked. It's much lower for the crested pigeon, and it's a little bit higher for the eastern spinebill. So we've got a zero if there was no bird there, and it looks like there was a one if there was a bird there. Let's call this bird_parsed, and then let's make a little plot with it. Let's put bird count, which remember is a mean, like a proportion, on the x-axis, bird type on the y-axis, and then let's put points here. Let's make them big, and let's use the aesthetic of color for urban/rural, like so. This is the beginning of a plot, and let's iterate on it a little bit. On the x-axis, just so we remember, this is a percent. We can change the labels: the x is the probability that people saw a bird, and we can take off these other ones, like that.

Okay, notice that these things switch back and forth. The superb fairywren (that is a lovely name; I am in love with that bird's name) was much more likely to be seen in rural than urban areas, whereas the noisy miner (again, an amazing name) was much more likely to be seen in urban than rural areas. We can emphasize that a little bit by putting some segments between the points; I think this will be the best way to do it. I'm going to make these a little see-through, make them gray, and make them a little bit thick, but I'm going to use different data here to draw a line from one point to the other. The data that's being
passed into ggplot right now looks like this, but the data to draw a line from one point to the other will actually need to have different ends, an x and an xend. So I'm going to use pivot_wider(): names from urban_rural and values from bird_count. This is the data that we'll need, so let's say data equals this, like so, and then we will need a different aesthetic for this geom_segment(). I think it doesn't matter what order it goes in: we could say x equals Rural and xend equals Urban, and the y values are the same, y equals bird_type and yend equals bird_type, like that. What we're saying is draw a line from this point to this point, from this point to this point, and so on. Let's see if this works. It does, but it would help if I put the rest of my plot on. Okay.

All right, I like this plot. This is kind of like a lollipop plot, I think people would call it, and you can see that the colors, by flipping back and forth, tell us whether this is a bird that you see more in cities or more in rural areas. Lewin's honeyeater, the grey fantail, and the eastern spinebill are all birds that we see more in rural areas; the rainbow lorikeet, the noisy miner, and the pied currawong are birds we see more in urban areas. This is the kind of information that we're going to use in our modeling.

Let's create a little data frame we're going to use for modeling. It's basically going to be this, like so. Let's see what we have if we do that. Then I am going to change that bird count data: I'm going to say if_else(), if bird count is greater than zero I'm going to say "bird", otherwise it's "no bird", and then I'm going to change these things from characters to factors. In modeling we often really do want factors, not character things. All right, so this is going to be the data set that we're going to use for modeling, and our idea is we're going to model bird
count, were there birds or no birds, based on some of these other characteristics. Actually, we're going to really dial it down: we're only going to use the stuff in that plot we just saw, urban versus rural and bird type, and see what we can learn there.

Like I said in the intro, this is a good video if you're just getting started with tidymodels, and we're going to really step through things from the beginning. The first thing when we deal with modeling is that we need to think about spending our data budget. We have quite a bit of data here, so the first thing we're going to do is split our data into testing and training. By using the function initial_split(), I create what is called a split object. It has an analysis section, that's like training; an assessment section, that's like testing; and then this total is how much data there is altogether. If I call the training() function on bird_split, that gets out the training part; let's assign that to bird_train. If I call the testing() function on the split, that gets me the testing data. Let's run that and we can see what bird_test looks like. Notice it looks a lot like the data we started with, but there's less of it, because we have split it into three quarters of the data going into training and one quarter going into testing. That is the first bit of spending our data budget.

The next part of spending our data budget is that we're going to create some resampling folds. This is big data, so I'm going to do stratified resampling. I'm going to use v-fold cross-validation, and I'm just going to do the default, which is a sensible default in a situation like this: 10-fold cross-validation. Let's call this bird_folds, like so. All right, so what the resampling does is take the training set and create 10 simulated data sets, created by cross-validation from the training set, that we can use to train different models and
compare them and see how they're doing, without ever touching our testing data, which is a precious, precious resource. Okay, so this part here we think about as spending our data budget.

Now it's time for us to talk about setting up our actual model. I am going to use just a basic logistic regression model here. We recently changed the interface for tidymodels so that this is actually all you have to do now for logistic regression, which is kind of nice. This is equivalent to a glm() with family equals binomial, a logistic regression model. Now we can start thinking about what we want to do for feature engineering. I'm going to use a recipe to set up the feature engineering; I'm going to start with something simple and then I'll be able to build it up. I'm going to declare my outcome and my features, my predictors, urban_rural and bird_type, and then I'm going to say what data I'm using. What this recipe does is say: hey, I'm about to do some feature engineering; here's what my outcome is, here's what my two features are, and here is what my data looks like. So I am setting up my feature engineering and my data preprocessing.

A logistic regression model probably doesn't want us to have these factors, but instead should have our data in a numeric form so it can do math on it, so we would probably want to change these to dummy or indicator variables. Here we can take everything that is nominal, which was our urban_rural and our bird_type, and change it to dummy or indicator variables. Let's call this recipe basic, like so, and let's run both of these things. So now we have the two pieces that we need for a model: our way to estimate our model and our feature engineering recipe. Right now neither one of these has been run; none of these have been estimated in any
way. We could do that manually, but instead let's put them together in what's called a workflow. A workflow is a convenient way to put together a feature engineering recipe or formula with a model specification, to connect them together as if they're Lego blocks, so that they're easier for you to carry around. If we do this, now we have something called workflow basic that has these things together: we have a recipe, we have a model, and it tells us a little bit about what is here.

We could fit this workflow one time to our training data, but this is where those resamples come in. We are going to fit it instead to those 10 resamples, and the reason we're going to do that is so that we can get a better estimate of how this model is going to perform. Instead of fitting it one time to the training data and then having to immediately go to the testing data for any kind of comparison, if we fit it 10 times to those resampled folds we are able to get statistics about how the model is doing, and then we can compare this model to maybe the next one we're going to try. Let me add one more thing here, just a control function, control_resamples(), so that I can say save_pred equals TRUE, and this way I can do a little bit more with my results.

So now what we have here looks a lot like the resampling splits that we had at the beginning. We have 10 cross-validation resamples, but now we have metrics for each time that we fit this workflow to this data. The model was fit to this data and then evaluated on this data, fit to this data, evaluated on this data, and so forth. And then there are these predictions in this column over here; let me move this over so you can see it a little bit better. Notice that it has the same number of predictions as are in the assessment set, so we are making
predictions not on the data that was used for fitting; we are making predictions on the data that was held out of each of these sets. That is what goes on when we use resampling. It's a really good way to be able to get more accurate estimates of how models are going to perform.

Okay, there's a function called augment(). What augment() will do is take the assessment sets here, take all the data that we have about them, and then add on a prediction. So it's going to say: we know bird type and bioregions and survey year and all that information we had that weren't even predictors; for each of those rows, let us make a prediction. We have the true value here, and then the predicted probability of seeing a bird, the predicted probability of not seeing a bird, and the predicted class given a default cutoff. augment() is really useful; you can just imagine all the different kinds of analysis you could do from here. A basic one would be to make an ROC curve. We give the column with the true value and the column with the predicted value, and you can compute that, or you can then make a plot, and we can look at what we get there.

Wow, that's not impressive, right? With the data that we're working with here, I'm trying to predict whether you're going to see a bird or not just based on two variables, so maybe we don't expect super results, but just notice how this is barely better than guessing. We're very close to this dashed middle line here; it is not doing super great. But we got to the result pretty quickly by looking at our results here, so that is at least good, to be able to get to that information easily.

I think it's time for us to try to do at least a little bit better, even with this data that we had. If you remember this plot that we made, is it the last one,
yeah, we know it's not linear, right? It's not like, oh yeah, let's make it linear with rural and urban; it's not linear and independent. We know that there are interactions between the birds and whether it's rural or urban: some birds we see more in rural areas and some birds we see more in urban areas. So at the very least we need to add some interaction terms. We're just walking through some of the basic things that we can do with recipes, to give you a starting place so that you can learn from here and keep moving forward.

We can take our basic recipe and continue to add steps, so we can add interactions here. Take this basic recipe that we made: if we were to prep it, we made dummy variables from urban_rural and bird_type. What does prep mean? prep() is the analog, for a feature engineering recipe, of fit() for a model. prep() is to a recipe as fit() is to a model: you estimate, for this feature engineering recipe, the quantities that we need to be able to apply it to new data. Before, we said "all nominal predictors" and we didn't really know what was in there; now, after we call prep(), we know what's in this data. If we want to include interaction terms in the feature engineering recipe, the way that we do that is with an anonymous function kind of situation, and we can do it with the names of variables. In this case we're creating new variables. If I were to keep going here and not just prep but also apply to new data, which is like predict() but in a recipe is bake(), you'd see we have all these new variables: urban_rural Urban, bird_type noisy miner, bird_type red-browed finch, bird_type
satin bowerbird, and these are all zeros and ones because these are dummy or indicator variables. So if I want to set up interactions between all of these, I'll say starts_with urban_rural interacting with starts_with bird_type. The reason why this is nice is because I can just start with my initial variables. I don't need to know what happens in between; let's say I do other stuff where I filter things out or change levels, like more complex factor handling. I can always just say: okay, what do you have right now? Let me deal with that. So this is a really fluent way to be able to make a recipe that works well.

Let's call this recipe interact, like this, and then we are going to make a workflow interact: let's put in the recipe interact and that same glm_spec. And I'm going to do the same thing with fit_resamples(): I'm going to fit this workflow to the folds, and I'm going to say control equals control preds again. Okay, so I'm doing the same thing. I added a step to my feature engineering recipe: instead of just having those things all independent of each other, I now have added interactions. I made a new workflow that has the same model but a new feature engineering recipe, and I'm fitting that to my resamples. Notice I'm not just fitting to my training data; I'm fitting to my resamples so I can have a better estimate of how this model will do.

And that looks way better. Look how much better that looks. We can measure this numerically with the area under the ROC curve, but we can see it visually in a really big way. So this model, I think, is much better. We could keep going, using feature engineering and adding other variables to make this better, but I don't want to keep going forever. So let's just say we just
wanted to use those two features for making our model, and this was good enough for us. What we could do now, instead of fit_resamples(), is fit our workflow with interaction terms to the training data. Let's call this bird_fit, like this. We have now fit our workflow to our training data, and we can do lots of things with this fitted workflow. It now has a fitted preprocessor, an estimated or prepped preprocessor, and it has a fitted model. We can predict with bird_fit on the test data, bird_test, like that: no bird, no bird, no bird. This model might not be super well calibrated; it may be predicting all "no bird" because the probability of seeing a bird is pretty low. We can augment like we did before.

Let's actually make some new bird data. Instead of evaluating the model on the test data, let's evaluate the model on new data. Let's say bird_type is those top_birds, and then let's use crossing() from tidyr with urban_rural, and let's put in those levels that we had in the real data, Urban and Rural, like that. This new bird data has those birds we're interested in, that we put in the model, and then urban_rural, so these are our inputs. We can do a couple of things. We can say augment bird_fit with new_bird_data, like this; it's saying, if you have these things, what do you predict? What's the probability of seeing a bird? augment() is a nice basic "give me the predicted class, give me the probabilities." You can also use predict() with bird_fit and new_bird_data, where you have a little more control, and say type equals "conf_int" to say we want confidence intervals. So we
don't have this data anymore, but we do have, for 95 percent confidence intervals, the lower confidence bound for the probability that we see a bird and the upper confidence bound. So let's put all this together. augment() is a great thing to know how to use, and predict() of course is a great thing to know how to use; you can look up in the documentation all the different kinds of predictions you can do, and the great thing is that they all have consistent inputs and outputs. Let's call this bird_preds, like this, and let us now make a nice plot of it. Let's put the prediction on the x-axis, the bird on the y-axis, and make color urban_rural, like so, and then let's put points on. This will be very similar to the plot that we made up here, so let's copy and paste it; copy-paste, everyone's best friend. So this is now a predicted probability of seeing a bird. But not only do we have points, we have confidence intervals, so let's do it like this: aes(), xmin equals .pred_lower_bird, xmax equals .pred_upper_bird, and then let's make these look pretty nice, from my opinionated stance. Okay, what did I do? Point two, is that pretty good? Yeah, that's pretty good. Let's make them a little wider. Okay, that's pretty good.

So these are now, from our model, the predicted probabilities of seeing a bird of these kinds in these places, urban and rural. We used our feature engineering and our modeling to be able to do this. Let's actually throw this all together: let's call this p1 and let's call this one p2. I am a huge fan of the patchwork package, and we can throw these together and see what we have here. Okay, this is pretty nice. Let's look at a couple of these. For the superb fairywren, on the left-hand side over here we see what we observed: how often, at all these different bird
baths, did we see the superb fairywren in urban areas and in rural areas? And in our model over here (oh shoot, "predicted", there we go), how often does the model predict that we will see it, and what are the upper and lower confidence bounds? We successfully modeled our data, which is always a good feeling; we're glad to be able to see this. We can notice some things. For the satin bowerbird, for whatever reason, we don't see a big difference between urban and rural areas; for the Australian magpie, we don't see a difference. The rainbow lorikeet and the noisy miner are more urban birds, and the superb fairywren and these three down here are more rural birds.

All right, so during exploratory data analysis we saw that there were those interactions between bird type and urban and rural status: we saw different kinds of birds in urban spots than in rural spots. We were able to incorporate that information into our feature engineering recipe, our preprocessing recipe, and it really helped our model. No surprise, right? It really helped the kind of model results that we saw. Then we were able to fit our model to our data and predict on new data, either the testing data or new data that we created from scratch, and we were able to get out those probabilities of seeing different kinds of birds in different kinds of spots. I hope that this was helpful, and I'll see you next time.
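
The modeling steps described in the captions (spend the data budget, specify the model, build a recipe with and without interactions, fit to resamples, then fit and predict on new data) can be sketched roughly as below. This is a minimal sketch, not the code from the video. It assumes the tidymodels package is installed, and it uses a small simulated data frame with made-up sighting probabilities in place of the real Tidy Tuesday survey data so that it runs on its own. The names rec_basic, rec_interact, wf_basic, wf_interact, ctrl_preds, rs_basic, rs_interact, n_obs, and p_bird are illustrative assumptions; the seed is arbitrary.

```r
library(tidymodels)

set.seed(123)

# Simulated stand-in for the survey data (hypothetical probabilities, NOT
# the real bird bath observations), with a built-in interaction between
# bird type and urban/rural status.
n_obs <- 3000
bird_df <- tibble(
  urban_rural = factor(sample(c("Rural", "Urban"), n_obs, replace = TRUE)),
  bird_type = factor(sample(
    c("Australian Magpie", "Noisy Miner", "Superb Fairywren"),
    n_obs, replace = TRUE
  ))
) %>%
  mutate(
    p_bird = case_when(
      bird_type == "Noisy Miner" & urban_rural == "Urban" ~ 0.45,
      bird_type == "Noisy Miner" ~ 0.10,
      bird_type == "Superb Fairywren" & urban_rural == "Rural" ~ 0.40,
      bird_type == "Superb Fairywren" ~ 0.10,
      TRUE ~ 0.25
    ),
    bird_count = factor(
      if_else(runif(n_obs) < p_bird, "bird", "no bird"),
      levels = c("bird", "no bird")
    )
  ) %>%
  select(-p_bird)

# Spend the data budget: 3/4 training, 1/4 testing, then 10-fold CV.
bird_split <- initial_split(bird_df, strata = bird_count)
bird_train <- training(bird_split)
bird_test <- testing(bird_split)
bird_folds <- vfold_cv(bird_train, strata = bird_count)

# Model specification: logistic regression with the default glm engine.
glm_spec <- logistic_reg()

# Basic recipe: dummy variables only.
rec_basic <- recipe(bird_count ~ urban_rural + bird_type, data = bird_train) %>%
  step_dummy(all_nominal_predictors())

# Same recipe plus interactions between the two sets of dummy columns.
rec_interact <- rec_basic %>%
  step_interact(~ starts_with("urban_rural"):starts_with("bird_type"))

wf_basic <- workflow(rec_basic, glm_spec)
wf_interact <- workflow(rec_interact, glm_spec)

# Fit each workflow to the resamples, keeping the held-out predictions.
ctrl_preds <- control_resamples(save_pred = TRUE)
rs_basic <- fit_resamples(wf_basic, bird_folds, control = ctrl_preds)
rs_interact <- fit_resamples(wf_interact, bird_folds, control = ctrl_preds)

collect_metrics(rs_basic)
collect_metrics(rs_interact)

# Fit the interaction workflow once to the training data, then predict
# on every combination of bird type and urban/rural status.
bird_fit <- fit(wf_interact, bird_train)

new_bird_data <- crossing(
  bird_type = factor(levels(bird_df$bird_type)),
  urban_rural = factor(levels(bird_df$urban_rural))
)

bird_preds <- augment(bird_fit, new_bird_data) %>%
  bind_cols(predict(bird_fit, new_bird_data, type = "conf_int"))

bird_preds
```

With the real data, bird_df would come from the Tidy Tuesday file instead of the simulation, and comparing the two collect_metrics() results is one way to see numerically what the ROC curves show visually. The bird_preds columns .pred_bird, .pred_lower_bird, and .pred_upper_bird are the ones a point-and-errorbar plot like the one in the video would use.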
Info
Channel: Julia Silge
Views: 2,893
Rating: 4.9703703 out of 5
Id: NXot3Q0QtGk
Length: 36min 40sec (2200 seconds)
Published: Wed Sep 01 2021