Get started with tidymodels using vaccination rate data

Captions
Hi, my name is Julia Silge, and I am a data scientist and software engineer at RStudio. In this video I'm going to use this week's TidyTuesday data set on MMR vaccination rates at schools to show how to train models, how to look for differences between groups, and how to do a more statsy, inferential kind of analysis. This is a great video if you want to know how to get started with tidymodels, how to get probabilities out of classification models, and how to answer questions like: how can I tell, using the data that I have, whether there are differences, whether those differences are something I can be certain about, and how do I share that kind of information with others in plots? All right, let's get started.

In this video we are going to use TidyTuesday data about MMR vaccination rates. This is going to be a great video if you are just getting started with tidymodels; it's going to be less machine learning oriented and more statsy, inference oriented. Let's get this pasted in and started. This is school-level data about vaccination rates. For every school it has the name of the school, what state it's in, and information about how big the school is. It has an MMR vaccination rate (what proportion of the students, I believe the kindergarten students, are vaccinated) and then an overall vaccination rate, and it looks like when there's no data it's coded as -1.

What we're going to do is train a classification model for the MMR vaccination rates: whether a school is above or below the recommended threshold. The recommended threshold, it turns out, is 95%: do 95% of the kindergarten students have the MMR vaccine? We are going to do this comparing state by state. I'm going to build a model that can say: are there differences state by state, and which states are different from
each other, and can we quantify how different they are? Let's build a data frame that will do that for us. We start from our measles data and filter to the schools that we have MMR data on, the ones that don't have -1 there. Then we're going to transmute and keep only the state, and create a variable called mmr_threshold. I'm going to use dplyr's case_when() and say that when mmr is greater than 95 we call that "above", and the rest of the time we say it is "below". Since we're going to be doing modeling, let's say mutate_if(is.character, factor), so that anything that is a character (in this case just one thing) is changed to a factor and we can do some modeling with it.

All right, so now we have measles_df, which is 44,000 rows. We've got 20-ish states, and we have MMR data for them. Let's do skim(measles_df) real fast to see what this is like. There's not a ton in here, but we can see that California, Illinois, and New York have the highest counts, and it looks like about two-thirds of the data is above and one-third is below.

Since we are doing slightly more inferential work here, let's make a plot showing what relationship we see in the data between the state and this threshold. What percentage of schools in each state are above this threshold? Let's arrange this so we can see what is at the top. Whoops, not like that. So Illinois, New York, and Pennsylvania are at the top; they have the highest rates. Let's see what's at the bottom. Arkansas is really low, extremely low, then Washington State and North Dakota. Is that right, or are
we missing data or something weird? Let's look at Arkansas real fast; let's pull mmr. Okay, yeah, there's just a ton of examples in the 70s and 80s; a lot of the schools are below the threshold, so it looks like that's real.

Let's make a quick visualization so we can see this. Let's put state on the x-axis and this percentage on the y-axis, let's make it colorful because that is always nice, and let's make a bar chart. We don't need to see the legend in this case. That's a proportion, so it's always nice, I think, to make it look like a percent. Whoops, scale label, did I do it wrong? What did I do wrong here? Oh, not the color scale: scale_y_continuous with labels. Oh, and not a double plus. There we go. Okay, it's a good start. These are in alphabetical order right now; let's use fct_reorder() to put them in order of what proportion of the schools are above the threshold, and let's flip it so that we can read some of those labels.

Okay, so Illinois has the highest, then New York and Pennsylvania. Arkansas is incredibly low, then Washington, North Dakota, Maine, and so forth. I live in Utah, and we're kind of in the... I mean, that's shockingly low, half, but that's in the middle of the distribution.

So that is what we are going to use modeling to understand: how certain are we about these values, and how certain are we about differences between states? We're going to use the tidymodels framework to build a model that will help us understand this. This is a classification model: we're going to model which schools are above and which schools are below that threshold. We're using the logistic_reg() function; it's a function from parsnip, and it has mode equals classification. We're going to set the engine for this model using glm, so just a straight-up classic generalized linear model version
of logistic regression, and then we are going to fit it. We're going to fit mmr_threshold explained by state, so this is a very straightforward model, and the data that we're training on is this measles_df. Let's call this glm_fit. It fits, and these are the results that we get. This is the print method that tells us what it's doing, and we can tidy the fit to show us the results here. We can see which p-values are less than 0.05, if we would like to look at that threshold, and see which states are different from the base level, which is Arizona because it's the first in alphabetical order. Which states does the model think are different from Arizona at that threshold? Arkansas has a big value here, really big, so it is likely to be different from Arizona. Washington is also down here. Everything else we see is on the other side, meaning that it has higher values of being above the threshold. So that's good.

But with logistic regression models, what even are these numbers? It's all on the logistic scale, right? What even is this? One of the great things about the tidymodels framework is how consistent and predictable it is to get out different kinds of predictions, and we can use this to talk about results in a way that is much more transparent, especially when it comes to talking to stakeholders. Say you're like, "Wow, look at these great estimates; what am I going to tell people about these numbers?" What we're going to do is make some predictions on some new data. For example, let's use the crossing() function from tidyr, and let's make, for
every state that we have, so unique of the state column in measles_df, and then... oh no, wait, we don't need crossing(). We're just going to say tibble(). So this is our new_schools: a tibble of all the states that we have, because state is the predictor that we had. Now we're going to make some predictions. We want to predict two things: the mean and the confidence intervals, because that's the reason we train these models, right? Help me understand what the variability is: given the data we have, how certain are we about the proportions that are above and below the threshold in these schools?

First let's do our mean prediction. We use the predict() function on this fit that we have, the new data is new_schools, and we say type equals "prob". This gives us, for every state, the predicted probability that a school is above and below. We're just interested in above. So for Arizona, the mean prediction that a school is above is this; the mean prediction in Arkansas is this; the mean prediction in California, which remember is one of the highest ones, is this.

We can also get the confidence intervals. We do predict() on glm_fit, same thing, new data equals new_schools (I should have copied and pasted), and type equals "conf_int". Now we get the confidence intervals: look, the predicted lower bound for above and the predicted upper bound for above. These are the confidence intervals for this class that we have right here. And boy, if you've ever tried to get a confidence interval out of a logistic regression model, this is so much better than the other ways that existed before the development of this tidymodels ecosystem. We can just bind this all together: bind_cols() the mean prediction, bind_cols() the confidence
interval, like so, and let's call this schools_result. Here we go. So now we have, for Arizona, the mean prediction here and then the low and the high ends of the confidence interval there, and we can make a nice little visualization that shows us this. It's going to look really similar to the earlier plot, so let's go up, copy it, and paste it here. So state, and instead of mmr we use the predicted probability of above. This is good. Now let's add error bars on top of the columns. Let's say ymin is equal to the predicted lower bound for above and ymax is the predicted lower... oh, I did lower; I'm sorry, the predicted upper bound for above. I'm probably going to have to give this a different color. Yes, it's black; let's make this color equals gray, I don't know, 70. I guess this is pretty good. Let's zoom in on this and see. Oh, that's too light. All right, where'd you go, plot? There you are.

Okay, so what we have here now is the proportion and then the uncertainty on the proportion, the measure of the variability on the proportion. We can see that we measure California much more precisely than South Dakota. Is the model able to tell the difference between California and South Dakota? We're able to get that from this visualization; that's what we're able to see here.

So that's what we get out of this tidymodels framework. And the pretty thing is that with any type of model you use, you are able to get out the probabilities or the confidence intervals in the same kind of way. In fact, let's try to do that. This is a little fancy, I guess, but let's say, "Oh, GLM isn't good enough." Let's
train a Bayesian model. Let's see if I can do this... is it detectCores()? Yeah, that's what it is, I think: detectCores(). Okay, so what we're going to do is take this model and paste it down here, and instead of glm_fit, let's train a Bayesian model and call it stan_fit, and we set the engine to stan. Everything else stays the same. Well, I guess we should set priors; you usually should set priors. I think there are two things, prior and prior_intercept. So let's make a prior distribution, prior_dist. We could do a normal here. This is where we should talk to a Bayesian expert about what we should put here as our prior; I do not claim to be a Bayesian expert. I'm going to use the student_t() function here, which is from the rstanarm package. It says you need to specify your priors explicitly, even if you're using the defaults. So let's use student_t(), and how wide should it be? A df of 1 would be the widest; let's try, I don't know, 2. Seems good. Then we put this in here: prior equals prior_dist, prior_intercept equals prior_dist, like this.

This will take a moment to fit; let's see how fast it goes. It's setting up, it's using my cores, and it's setting up this Bayesian modeling. It's fitting the same model, and notice that I specified it very similarly: I just had to change the engine, and then I had to tell it some things that are specific to that engine, which are the priors. Now it is going to go through and do its computationally intensive thing. I am going to pause the recording and let it go through, because it actually does take a little while for this to fit; there are all these different states in here that it's going to try to fit. So I'm going to
pause here and then come back once this model is done fitting.

All right, we're back. This model finished training; you can see here we got to the end, to 100%, and it finished its training. We can see what it looks like: here's the median, and it tells me what it did, that it's a binomial and the kind of model that it fit. What we can do now, because we used this tidymodels framework, is use almost exactly the same thing that we did before: instead of glm_fit we put in stan_fit. Let's change the names of these things to Bayes prediction and Bayes intervals, because I hear they're not confidence intervals anymore, they're credible intervals, and let's call this the Bayes result. So now we've got our Bayes results, which are in the same format as our other results. This is where tidymodels really gets you far: when you want to compare different kinds of models, you want to have really predictable results where you know what you're going to get back.

So let's make a little visualization at the end comparing these results. We've got our first result, the schools result; let's mutate model equals "glm", and then bind it to the Bayes result, and for that let's say mutate model equals "stan". Then let's go back to this visualization and paste it here; this should all work. Instead of a column, let's go to a point, and we're going to make the color the model instead of the state. Let's take this off so that the error bars are the same color as well. These are probably going to need to be big to look nice, and let's make this a little wider and a little see-through, because that's usually nice. Let's see what this looks like. Oh, that color: color equals model, like this.
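The two fits and the comparison described above can be sketched roughly as follows. This is a minimal sketch, not the code from the video: the data are simulated with made-up rates for three states, the helper name predict_with_intervals is hypothetical, and the lowercase "above"/"below" labels simply follow the captions. It assumes the tidymodels and rstanarm packages are installed.

```r
library(tidymodels)

set.seed(2020)

# Simulated stand-in for measles_df (made-up rates, NOT the real data)
true_rates <- c(Arkansas = 0.05, California = 0.90, Utah = 0.50)
n_schools  <- c(Arkansas = 200, California = 600, Utah = 100)

measles_df <- tibble(state = rep(names(true_rates), times = n_schools)) %>%
  mutate(
    mmr_threshold = if_else(runif(n()) < true_rates[state], "above", "below")
  ) %>%
  mutate_if(is.character, factor)

# One row per state: the only predictor in the model
new_schools <- tibble(state = unique(measles_df$state))

# Hypothetical helper: class probabilities plus interval bounds, side by side
predict_with_intervals <- function(fit, new_data) {
  new_data %>%
    bind_cols(predict(fit, new_data = new_data, type = "prob")) %>%
    bind_cols(predict(fit, new_data = new_data, type = "conf_int"))
}

# Classic logistic regression
glm_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(mmr_threshold ~ state, data = measles_df)

# Same model specification, different engine, plus engine-specific priors
prior_dist <- rstanarm::student_t(df = 2)

stan_fit <- logistic_reg() %>%
  set_engine("stan", prior = prior_dist, prior_intercept = prior_dist) %>%
  fit(mmr_threshold ~ state, data = measles_df)

schools_result <- predict_with_intervals(glm_fit, new_schools)
bayes_result   <- predict_with_intervals(stan_fit, new_schools)

# Stack both result sets and label which model each row came from
comparison <- bind_rows(
  schools_result %>% mutate(model = "glm"),
  bayes_result %>% mutate(model = "stan")
)

# Points with interval bars, colored by model
p <- ggplot(comparison, aes(state, .pred_above, color = model)) +
  geom_errorbar(
    aes(ymin = .pred_lower_above, ymax = .pred_upper_above),
    width = 0.4, alpha = 0.7
  ) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent_format()) +
  coord_flip()
```

Because both engines return columns with the same names, the same helper and the same plotting code serve both fits; only the set_engine() call differs.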
Okay, we did it. Can we do this so that we can see any difference? It's difficult to see. I could use dodge to put these next to each other, but in the interest of time I'm just going to leave this as is. We have the states along the y-axis here, and this probability that we got out is the same; we can interpret it the same way. We've modeled the proportion of schools that are above the threshold, and this is much easier for me to talk about with stakeholders, right? And I could fluently get it out of my model when I used the tidymodels framework like this. I hope you can see that using the Stan model didn't get us a ton here, probably partly because of the priors I used or whatever, but I'm able to check it and see. If I have information that I would like to incorporate into my priors, I can do that and then compare using ggplot2 to see what the differences are.

So that is what we did today. We used this data on MMR vaccination rates in schools to explore what these differences are state by state, and to learn that, say, states like California have high proportions of schools that are above the threshold and states like Arkansas have low proportions of schools that are above the threshold. And we used the tidymodels framework to try different kinds of models and get consistent, predictable results out. I hope this is helpful, and see you next time.
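For reference, the data preparation and the first descriptive plot from early in the video can be sketched like this. It is a hedged reconstruction from the spoken description, not the author's exact code: the tiny measles tibble below is a made-up stand-in for the TidyTuesday data, and the names state_summary and prop_above are assumptions.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Made-up stand-in for the school-level data; an mmr of -1 means "no data"
measles <- tibble(
  state = c("Arkansas", "Arkansas", "Arkansas", "Utah", "Utah",
            "Illinois", "Illinois", "Illinois"),
  name  = paste("School", 1:8),
  mmr   = c(78, 85, 96, 94, 97.5, 99, 98, -1)
)

# Keep schools with MMR data, then label each as above/below the 95% threshold
measles_df <- measles %>%
  filter(mmr >= 0) %>%                      # drops the -1 missing-data code
  transmute(
    state,
    mmr_threshold = case_when(mmr > 95 ~ "above", TRUE ~ "below")
  ) %>%
  mutate_if(is.character, factor)           # factors are needed for modeling

# Proportion of schools above the threshold in each state
state_summary <- measles_df %>%
  group_by(state) %>%
  summarise(prop_above = mean(mmr_threshold == "above"))

# Bar chart ordered by proportion, flipped so state names are readable
p <- state_summary %>%
  mutate(state = fct_reorder(state, prop_above)) %>%
  ggplot(aes(state, prop_above, fill = state)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(labels = scales::percent_format()) +
  coord_flip() +
  labs(x = NULL, y = "Schools above the 95% MMR threshold")
```

Printing p draws the chart; with the real data, measles would instead be read from the TidyTuesday file for that week.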
Info
Channel: Julia Silge
Views: 7,504
Rating: 4.9448276 out of 5
Id: E2Ld3QdXYZo
Length: 25min 45sec (1545 seconds)
Published: Tue Feb 25 2020