Impute missing data and handle class imbalance for Himalayan climbing expeditions

Video Statistics and Information

Captions
Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. Today in this video we're going to use data from this week's Tidy Tuesday on climbing expeditions in the Himalayas, like climbing Mount Everest and similar peaks, and we're going to train a model using tidymodels to try to predict who survived these expeditions and who died. We'll show how to approach modeling tasks like imputation and how to deal with class imbalance when we do our feature engineering. This is a pretty interesting data set, so I'm excited to get started.

All right, let's start exploring this data about climbing expeditions in the Himalayas. This is this week's Tidy Tuesday data set, and there are several different data sets we have access to here: one on the expeditions, one on the peaks, but the one we'll use is about the expedition members, the people who have actually been on these expeditions to climb the peaks identified by the peak_id and peak_name columns. We have an expedition ID, a member ID, and then a lot of interesting data about the people who climbed these mountains: how old they were when they made the climb, their citizenship, whether they used oxygen or not, and whether they died or not. This is a dangerous thing to do, and what we're going to do with this Tidy Tuesday data set is build a model to predict the probability of an expedition member dying or surviving based on other characteristics of the person and the expedition, which is what we have in this data set: whether they went in autumn or spring, what year it was, characteristics of the person, and so on. So we'll be able to see how well we can predict that. Is the information in this data set predictive of whether someone survived the climb or not, and can we learn something from the model that we build? That's what we're going to do.

First, let's use skimr to see what we can learn about the data that we have.
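Here is a minimal sketch of that setup; the URL is my assumption of where the members.csv file for the 2020-09-22 "Himalayan Climbing Expeditions" Tidy Tuesday data lives.

```r
library(tidyverse)

# Assumed location of the 2020-09-22 TidyTuesday members data
members <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv"
)

# High-level overview: missingness, distributions, proportions of the logical columns
skimr::skim(members)
```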
I'm going to stretch this out so we can see it a little better, because I want to look at a few things. This is how many expedition members we have and the number of columns we have overall. We have the IDs for everyone; we don't actually know what names go with the peak ID for a couple of rows; we're missing sex for a couple, citizenship for a few, and expedition role for a little more. The rest of this information doesn't look like it's missing at all. Then there are the logical columns. hired: was this expedition member one of the people who decided to come and climb the mountain, or one of the people who were hired and paid to come along with the expedition? A lot of these folks are native Sherpa climbers. It looks like about 20% of the people in the data set are hired — checking the counts, yes, hired is FALSE for the rest. How many succeeded? I think success is actually not about whether the person made it to the top but whether the expedition made it to the top. Then: was oxygen used, was it a solo climb (a very tiny proportion), and whether the person died. Notice that died is a very small proportion, which means we're trying to predict something that does not happen very often in this data set. That is going to really impact the choices we make throughout feature engineering, pre-processing, and modeling, and we're going to see how it affects what we can learn. year is the year the expedition happened, and age is the age of the person, which is missing a fair amount — so much, in fact, that I want to demonstrate how to use tidymodels to do imputation. So the two things we're going to look at today are how to handle class imbalance, so we can build a model that does a good job even though we're predicting something that doesn't happen very often, and how to impute missing data, because I would like to see whether age makes a difference for whether someone survives, but we have a lot of missing data there and I would rather not throw away 3,500 data points if I can avoid it. A lot of the other columns are missing so much that we probably won't use them. I don't mind filtering out rows where only a handful of values are missing, but for age we're going to impute. skimr is such a great package for getting a really high-level view of what is in your data.

After that, we can do a little more detailed exploratory data analysis. First let's look at year and see how the number of people dying, and while we're at it the number of people succeeding, changed over time. People didn't go every year, especially near the beginning, so let's round year down to the decade: 10 times the floor division of year by 10. Now we have, by decade, the percent of people who died and the percent of people who succeeded. Let's make a little plot of this. We need to pivot_longer to transform this from wide to long: we pivot those two columns and say names_to = "outcome" and values_to = "percent", I suppose. Now we can plot it, with year on the x-axis, percent on the y-axis, color = outcome, and a line that's a little transparent and a little thicker. I'm not going to spend a ton of time on this plot because I want to keep going, but let's make the labels show that these are percents, with scales::percent_format(). Let's see what that looks like.
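A sketch of the by-decade summary and line plot described above; the exact aesthetics (alpha, line width) are guesses at what is shown on screen.

```r
members %>%
  group_by(year = 10 * (year %/% 10)) %>%   # round year down to the decade
  summarise(
    died = mean(died),
    success = mean(success)
  ) %>%
  pivot_longer(died:success, names_to = "outcome", values_to = "percent") %>%
  ggplot(aes(year, percent, color = outcome)) +
  geom_line(alpha = 0.7, size = 1.5) +
  scale_y_continuous(labels = scales::percent_format())
```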
Nice. So this shows us, over the decades since about the beginning of the 20th century, that success is on its way up, though still not very high — still below 50 percent of the people on expeditions succeeded. (I might be interested to check the number of people per expedition over time, too.) The percent of people who died was up and down a lot back in the day, but it's much lower now than it was in the past, especially in those early decades — although I would still not start a hobby with that percent of people dying. So we have success on its way up and death on its way down over time.

We could do the same thing for age: just swap age in for year and plot that. In terms of who is dying, age does not change much until we get to the oldest age group. I don't know why there are people being rounded down to infants — were there really children on these expeditions? That would be interesting to look into; people who are 10 or teenagers seem a little more plausible. Anyway, for the bulk of the range, the percent of people who die is about the same until the very oldest group. The percent of people who succeed by age, though — wow, that's pretty interesting. The younger you are, the more likely you are to be on an expedition that succeeds; it just goes down and down across 20, 40, 60, 80. The older you are, the less likely you are to be on an expedition that succeeded.

So that's success, death, year, and age. Now let's look at success and died together. Remember, success is about whether the expedition succeeded and died is about whether the individual died, because it is possible to die and have the expedition succeed. Let's count success and died, then group by success and mutate percent = n divided by sum(n). A higher proportion of people died on failed expeditions than on successful ones, so you are more likely to survive if you're on a successful expedition that makes it to the top. It's quite low in both, though — remember we're dealing with an event that happens rarely; fortunately, most of the people who go on these expeditions don't die — but we do see this difference, and looking at the counts helps us understand it a little better.
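A sketch of that count-and-proportion computation:

```r
members %>%
  count(success, died) %>%
  group_by(success) %>%
  mutate(percent = n / sum(n))
```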
This data set has lots of different mountains in it, lots of peaks in the Himalayas, so let's see what we have in peak_name. More people are climbing Mount Everest, but we have these other Himalayan peaks as well — just a ton of them. For our model, that's going to be too many to include, and if I remember from skimr we have some NAs in this column as well. So let's filter out the NAs and then mutate peak_name using a function from forcats called fct_lump, which lumps the less common levels together; I'll use prop so that anything with a proportion less than 5% goes into another category, and then count peak_name and died. Now we're down to ten categories, which is more reasonable: the peaks that have been climbed by the most people, and everything else in Other, which you can see is now the biggest category. I can't look at those raw counts and judge the rates off the top of my head, so let's group by peak_name and compute proportions. It looks like Manaslu — I should look up how to pronounce that — is a bit high; some of these have lower proportions of people dying, and Everest looks higher than some of these other peaks but lower than others. So we definitely see differences across peaks, which will be interesting.

The last thing we'll check before we get started on the modeling is season. Let's count season and died. We've got an Unknown; I'm going to keep only the four known seasons, then do the same group-and-percent computation as before, and then make a little visualization, because it looks to me like winter is worse, and it's always nicer to look at a visualization. I could have done this with any of these variables, but I'll do it here for this last one. Instead of keeping died as a logical, let's turn it into a character: when died is TRUE, so when a person died on their expedition, replace it with "died", and the rest of the time say "did not die". Then I can make a plot: season on the x-axis, percent on the y-axis, fill by season, little bars with show.legend = FALSE, position = "dodge" so they're next to each other instead of on top of each other, and a little transparency. If I do just that it's all lumped together, so I need to facet_wrap on died and say scales = "free", and finally scale_y_continuous(labels = scales::percent_format()) so we can read it. There we go. It's good to see that most people did not die, but we see a pretty big difference between autumn, spring, and summer versus winter: winter is significantly worse than the other seasons. So this, again, is something we will definitely want to include in our model.
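Sketches of the peak lumping and the faceted season plot, roughly as described; the 5% threshold and the plotting details are approximations of what is on screen.

```r
# Lump peaks climbed by fewer than ~5% of members into an "Other" level
members %>%
  filter(!is.na(peak_name)) %>%
  mutate(peak_name = fct_lump(peak_name, prop = 0.05)) %>%
  count(peak_name, died, sort = TRUE)

# Proportion who died vs. survived, by season
members %>%
  filter(season != "Unknown") %>%
  count(season, died) %>%
  group_by(season) %>%
  mutate(
    percent = n / sum(n),
    died = case_when(died ~ "Died", TRUE ~ "Did not die")
  ) %>%
  ggplot(aes(season, percent, fill = season)) +
  geom_col(show.legend = FALSE, position = "dodge", alpha = 0.8) +
  facet_wrap(~died, scales = "free") +
  scale_y_continuous(labels = scales::percent_format())
```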
Speaking of which, let's get the data ready for modeling. We've just been working with the raw data set we started with; now let's think about what we want in the data set we're actually going to use. Let's drop that Unknown season. I'm going to use peak_id instead of peak_name, and keep year, season for sure, sex (we didn't look at it, but let's try it), age, citizenship (to see if there are differences between people of different national citizenship), whether they were hired or not, success (it really looked like there was a difference there), and died, which is the thing we're going to predict. I'm going to filter out the rows that are NA for the columns that were only missing a few values, but I'm not going to filter out the rows that are NA for age, because a lot are missing there and I'm going to impute those instead of removing them. I could impute all of them, but I think the better choice here is to filter out the ones that are missing only a little and impute the one that's missing a lot; that's the approach I'm taking. I'm going to train classification models, and in tidymodels the outcome needs to be a factor, so I can't leave died as a logical vector. I'll do the same thing as before, except let's call the values "died" and "survived" — that's a little nicer than what I wrote up above. Then let's change everything that is a character to a factor. That looks pretty good: we only filtered out a handful of rows, which was the point of filtering only those columns and keeping age. Let's call that members_df; this is what we'll use for our modeling.

Let's get started on that. We'll load tidymodels, the meta-package that has all the packages we're going to use for modeling. The first thing we do is set a seed, because we're going to do an initial split on members_df, and we'll use stratified splitting on the outcome, because we want the split to be balanced — especially in this case, where dying is a pretty rare event. We'll call these members_train and members_test. So the first thing we do is split the data — which fortunately is quite a lot of data — into training and testing. Our training data has almost sixty thousand rows, and our testing data, which we set aside and use to evaluate our model at the very end to get an estimate of its performance on new data, has about 19,000 rows, and both are balanced with respect to who survived.

While we're here splitting data, let's set a seed again and create a set of resamples: we take the training data and create cross-validation folds from it, called members_folds. Each one of these is a cross-validation fold, where one number is the total and the other is the held-out set for that split. When we use a resample, we train on all of the data except the held-out part and then evaluate on the held-out part. We can use resampling to compare models, and we could use it to tune models. The testing data we're saving for the very end, to get an estimate of performance on new data, but these resamples — created, notice, from the training data — we can use during our whole modeling process to make choices.
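A sketch of the data preparation and splitting, assuming the column selection described above; the mutate_if(is.logical, as.integer) line is the fix she adds a bit later, when the SMOTE step complains about logical columns.

```r
library(tidymodels)

members_df <- members %>%
  filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
  select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
  mutate(died = case_when(died ~ "died", TRUE ~ "survived")) %>%
  mutate_if(is.character, factor) %>%
  mutate_if(is.logical, as.integer)   # added later so step_smote() sees numeric columns

set.seed(123)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)

set.seed(123)
members_folds <- vfold_cv(members_train, strata = died)
```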
So we took our data, got it ready, split it, and made resamples. The next step is feature engineering, or data pre-processing. We're going to explain died with everything else we have, and the data we use here is the training data, members_train. Let's call this members_rec, and now we start adding steps to the recipe — the steps we need for the feature engineering.

First, let's talk about imputation. There are different kinds of imputation: mean imputation, median imputation, k-nearest-neighbors imputation. The goal with imputation is just that I don't want to throw that data away. I'm not adding new information; nothing magical is happening. I'm just using the information that's already in the data set to tell me what to put in those missing spots, so that I don't have to throw away observations that have valuable information in the columns other than age. I'm going to take a really simple approach and use the median to impute, which means: find the median age in members_train and put that into the missing age values. There are other options for imputation, but in this case we'll do something straightforward.

I'm also remembering that some of these columns, like citizenship, we haven't really looked at yet: there are about 200 different citizenships, and remember there were three or four hundred different peaks. That is too many levels; if we try to train any kind of model on that, it's not going to be happy — it's not going to be a good choice. What we can do is use a function called step_other, which pools the less frequently occurring values together into an "other" category, much like we did with fct_lump from forcats. I'm going to do this for peak_id and for citizenship; I don't think there's anything else I need it for — everything else should be okay.

There are factor variables here, and I'm going to use a model that can't handle factor variables as-is, so I need to create indicator or dummy variables. I'll say: take everything that is a factor or character, everything that's nominal, and turn it into a dummy variable — except I don't want to do that to my outcome, because I still need the outcome to stay a factor.
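A sketch of the recipe up to this point; in current versions of recipes the median imputation step is step_impute_median() (older code used step_medianimpute()).

```r
members_rec <- recipe(died ~ ., data = members_train) %>%
  step_impute_median(age) %>%                   # fill missing ages with the training median
  step_other(peak_id, citizenship) %>%          # pool rare peaks / citizenships into "other"
  step_dummy(all_nominal(), -all_outcomes())    # indicator variables; the outcome stays a factor
```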
The last thing I need to do for feature engineering is handle the class imbalance. Class imbalance is handled by a separate package in tidymodels that holds some extra recipe steps, and there are several things you can do. We could downsample — throw away a lot of the people who survived and end up with a much smaller data set — but we're not going to do that this time. Instead we'll try upsampling. If you do plain upsampling, what often happens is that a model — especially a more powerful machine learning algorithm like a random forest — will just memorize the few minority-class examples you have. So instead of step_upsample we're going to try a slightly more sophisticated option. There are different ones, like step_bsmote; we'll use step_smote, which uses nearest neighbors: it makes new examples of people who died using nearest neighbors of the examples we do have. We want to upsample the expedition members who died so that we have the same number of people who survived and died, because right now members_train has many more people who survived than died, and for our model we want that to be balanced so it can do a better job of learning to recognize both kinds of examples.

Right now we have defined this recipe but we haven't computed anything: we have not computed the median of age, and we have not gone through all the factors and figured out what levels we need for the indicator variables. The way to do that is to prep the recipe. Oh no, that's not happy — it wants everything to be numeric. You know what it is: it's these logicals, because they're not factors and they're not numeric. Let's go back up and add mutate_if(is.logical, as.integer), which changes them from TRUE/FALSE to zeros and ones — effectively I made my own little indicator variables manually. Let's re-run the split and resamples and set up the recipe again. Now the recipe is defined, and when I prep it, it calculates things: it finds the median for age, it finds the factor levels, it makes the indicator or dummy variables, and it does all the k-nearest-neighbors work for SMOTE. If I want to actually see the data it computed, I bake it — but that data is already in there, so we bake with new_data = NULL, which says: I don't have new data, give me the data you trained on. This is what we get: instead of peak_id we have indicator columns for things like Everest, Other, and the peaks a lot of people died on; notice we don't have autumn because that's the base level, but we have spring, summer, and winter; we have male but not female because female was the base level. And let's convince ourselves the upsampling worked: yes, we have a lot more rows now, because we created new synthetic observations using the SMOTE algorithm via step_smote. That's how you can prep and bake a recipe to figure out what's going on.

We're going to use workflows for the actual model training, because it's easier and a lot of things get handled for us.
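A sketch of adding the SMOTE step from the themis package and then prepping and baking to inspect the result; this assumes the logical columns were already converted to integers in members_df, as above.

```r
library(themis)   # extra recipe steps for class imbalance

members_rec <- members_rec %>%
  step_smote(died)   # synthesize minority-class rows using nearest neighbors

# prep() estimates everything (median age, factor levels, SMOTE neighbors);
# bake(new_data = NULL) returns the processed training data for inspection
members_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(died)          # classes should now be balanced
```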
Let's try two kinds of models — actually the same two I used in the video I did maybe a month ago with the Palmer penguins data set: a basic logistic regression model (I don't know if "simple" is the right word; there's a lot going on in it) and a random forest model. For the random forest, let's make sure there are enough trees, use ranger as the engine, and, since a random forest can be used for either regression or classification, set the mode. Now we've said: here are our models. Next let's set these up to be trained, with a workflow. We create a workflow, add a recipe — this recipe that we want — and call it members_wf. Look at what it says when we print it: preprocessor: recipe, which is the recipe we added, and model: None. A workflow is a way to stick things together, like Legos, that you can then carry around, and right now the spot for the model is empty, just waiting for either one of these specifications. So we can go add the model to it and then keep piping and fit.

Let's add glm_spec and look at what that gives us: now the workflow has two things in it, the recipe (remember, I saved this object with the recipe in it) and a model too, so this is now an unfit workflow. Then we use fit_resamples, because we're going to use those folds we made to fit these two models and see how they did. These are both models that don't need a lot of tuning — either they don't have tuning parameters or they do pretty well with their defaults — so there's not a strong reason to tune either one. fit_resamples means we fit to all 10 of those cross-validation folds and then use that empirically to say which model is better. We say resamples = members_folds, that's what we're fitting to, and I'm going to set some metrics. If I didn't do anything it would compute accuracy and ROC AUC; I like those, so let's keep them, but I have a really rare positive event, so I also want sensitivity and specificity so I can find those easily. And I want to save the predictions, because I want to make some plots with them afterwards. Let's use parallel processing, because that will make this go much faster, and then run it.

What this is doing is taking this tidymodels workflow, which has the pre-processing recipe — imputation, collapsing factor levels, making indicator variables, upsampling — and then fitting the logistic regression model, and it does that whole process for each of the folds. Excellent, that ran pretty fast, because logistic regression — we all love it for many reasons, including the fact that it's fast. Random forest is going to take a minute, so we'll switch out the model specification, keep everything else the same, and get that started.
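A sketch of the two model specifications, the workflow, and fitting to the resamples; the number of trees is my assumption for "enough trees", and doParallel is one common way to get the parallel processing mentioned.

```r
glm_spec <- logistic_reg() %>%
  set_engine("glm")

rf_spec <- rand_forest(trees = 1000) %>%   # 1000 is an assumed value for "enough trees"
  set_engine("ranger") %>%
  set_mode("classification")

members_wf <- workflow() %>%
  add_recipe(members_rec)

members_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

doParallel::registerDoParallel()

glm_rs <- members_wf %>%
  add_model(glm_spec) %>%
  fit_resamples(
    resamples = members_folds,
    metrics = members_metrics,
    control = control_resamples(save_pred = TRUE)
  )

rf_rs <- members_wf %>%
  add_model(rf_spec) %>%
  fit_resamples(
    resamples = members_folds,
    metrics = members_metrics,
    control = control_resamples(save_pred = TRUE)
  )
```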
What's happening here is that, for every fold, that data pre-processing recipe is executed, then a random forest model is fit, and then it's evaluated on the held-out set in that resample; then the whole process is done again on the next resample: do the data pre-processing, fit the model, evaluate on the held-out set. The reason we do this kind of resampling is that it gives us an empirical way to compare models. This is going to take a minute, so I'm going to pause the video and come back when it's done.

All right, we're back. We have random forest results — that definitely took a minute and all the fans on my computer ran, but congratulations to my computer for finishing; I'm very proud of it. Now we have results and it's time to evaluate the models. We can use the function collect_metrics: the metrics live in columns on the results, which are tibbles full of lots of delightful things, and collect_metrics is a helper that gets them out. Let's do it for both kinds of models we fit and take a look.

The random forest accuracy is fantastic — would you look at that, so high — and the logistic regression is much, much lower. On ROC AUC they're much closer to each other, with the logistic regression only a little bit worse. Why? What is going on? We can tell by looking at sensitivity and specificity. For the logistic regression model those two numbers are close-ish to each other, so the logistic regression didn't do badly when you compare how it did on the positive and negative cases. Those aren't fantastic numbers — don't get me wrong, nothing to write a paper or sing a song about — but the logistic regression model was about equally able to recognize positive and negative cases. We cannot say the same about the random forest: it has a very hard time finding the minority class, the people who died.

We can see this in a little more detail with a confusion matrix. For the logistic regression, the number of people who died and were predicted to die is about twice as big as the number predicted wrong, so about two-thirds of the people who died were predicted correctly — which is why the sensitivity is right around two-thirds — and it did a little bit worse on the people who survived. What about the random forest? These are the people who died who were correctly predicted to have died, and these are the people who died who were predicted to have survived. It was very hard for the random forest to predict the minority class, even when we pulled out all the stops and tried the SMOTE algorithm and so on; we were not able to build something where the random forest recognizes the minority class well.
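A sketch of comparing the resampled metrics and building confusion matrices from the saved predictions; tune's conf_mat_resampled() is another option if you prefer a per-fold average.

```r
collect_metrics(glm_rs)
collect_metrics(rf_rs)

# Confusion matrix across all resample predictions (logistic regression)
glm_rs %>%
  collect_predictions() %>%
  conf_mat(died, .pred_class)

# Same for the random forest
rf_rs %>%
  collect_predictions() %>%
  conf_mat(died, .pred_class)
```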
If we wanted to use the random forest, we would really need to look into other options; here we're going to say that the logistic regression did better, and move on. Let's get out the predictions and group by id — remind me what this looks like: we have this id, which is the fold, and then all the predictions for everything that was in it. If we group by id, we can compute an ROC curve, with died as the truth and the predicted probability that they died, so we get 10 ROC curves, and we can pipe that to autoplot. This is the autoplot of the 10 ROC curves for the logistic regression model; let's look at what the random forest model looks like, make it a little bigger, and switch back and forth — wait, which one is which now? Okay: random forest is here, logistic regression is here. The differences we see are related to the differences in sensitivity and specificity and how those cutoffs would work.

So we're going to use the logistic regression model. We've decided to use it, but right now it has been trained on resamples and not yet on all the training data, so let's do that now. We could take that workflow — remember, it's an incomplete workflow — add the model specification again, fit to members_train, and then predict on the test set, but there's a nice little helper function called last_fit. We call last_fit on members_split, and what it does is fit on the training set and evaluate on the test set; it's a convenience function for the thing we usually do at the end of a modeling analysis, which is fit on training data and evaluate on testing data. If I do collect_metrics on members_final, the thing I just made, these metrics are computed on the testing data. They are a little bit worse — not hugely, a little bit; I don't know how big the standard errors were. We can also collect the predictions on members_final; notice the number of them — this is the testing data, the first time we've used it in this whole analysis — and we can do a confusion matrix on this as well, with died as the truth and the predicted class. Of the people who actually died, this many were correctly predicted and this many incorrectly, and the same for the people who survived. Notice that the testing data has not been oversampled: whenever you evaluate on testing data, you do it with the real-life class proportions. And this members_final object — if you were like, okay, I did it, I need to save this to use in the future when I have new data — you could save the whole members_final object as an RDS or something like that.
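Sketches of the per-fold ROC curves and the final fit and evaluation on the test set described above:

```r
# 10 ROC curves, one per cross-validation fold
glm_rs %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_curve(died, .pred_died) %>%
  autoplot()

# Fit the chosen workflow on the training set and evaluate once on the test set
members_final <- members_wf %>%
  add_model(glm_spec) %>%
  last_fit(members_split)

collect_metrics(members_final)        # metrics computed on the testing data

collect_predictions(members_final) %>%
  conf_mat(died, .pred_class)
```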
Another option is to pull out the workflow itself — we have to pluck it out, since it's in a list — and then tidy it, because in this case it's just a linear model, so we can tidy it. This is pretty interesting. Let's exponentiate (exponentiate = TRUE) so we get odds ratios, use kable with digits = 3, and, you know what, let's add one more thing: arrange by estimate. Okay. So the things at this end are the things that increase your probability of surviving: going on a trip in the summer, being on an expedition that succeeds, being a UK or US citizen, and — this effect is getting pretty small — going in a later year (don't go in 1910). The things at the other end make your probability of dying higher, of surviving lower: these peaks like Everest or Other (remember how many of those there were — a lot more people died on those), being one of the hired people on the expedition, being a man compared to being a woman, being a citizen of Nepal. We see evidence here of something there has been a fair amount of press and writing about: this is actually a job for native Nepalese Sherpa people, and it is dangerous, and here we see that, on a per-expedition-member basis, it is more dangerous for those hired folks — at least according to this model — than for people who come in from elsewhere, which is pretty interesting.

Let's make a little plot. Let's not use the odds-ratio scale anymore; put it back on the coefficient scale, and plot estimate against fct_reorder(term, estimate), with points. Let's take out the intercept, because that's not so helpful for us, and then put on error bars, with xmin = estimate minus the standard error and xmax = estimate plus the standard error. I think that's everything; let's see what we get. Nice — and before we put any of this down, let's add a vertical line at zero and make it dashed. Okay, let's call that the end and look at it — oh, re-render — there we go.

So what is on this axis? It's the coefficient from the logistic regression — we didn't exponentiate, so it's back on the log-odds scale. The things over here on the positive side increase your likelihood of surviving, and the things on the negative side increase your likelihood of dying. We kept everything pretty much on its real scale — we didn't normalize, because we didn't need to for these models. So this is citizenship of USA versus whatever the base level was (something with an A — Australian? I forget; we'd have to look that up). Whether the expedition succeeded versus not increases the coefficient by this much; summer versus autumn is way over here and really helps you; winter is down this way; and these peaks over here, down this way, really increase your likelihood of dying.
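A sketch of pulling the fitted workflow out of the last_fit() result, tidying the coefficients as odds ratios, and making the coefficient plot; the exact styling is approximate, and if tidy() cannot be called directly on the workflow in your version you can extract the parsnip fit first with extract_fit_parsnip().

```r
# Odds ratios from the fitted logistic regression
members_final %>%
  pull(.workflow) %>%
  pluck(1) %>%
  tidy(exponentiate = TRUE) %>%
  arrange(estimate) %>%
  knitr::kable(digits = 3)

# Coefficient plot on the log-odds scale
members_final %>%
  pull(.workflow) %>%
  pluck(1) %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(estimate, fct_reorder(term, estimate))) +
  geom_vline(xintercept = 0, color = "gray50", lty = 2, size = 1.2) +
  geom_errorbar(aes(xmin = estimate - std.error, xmax = estimate + std.error),
                width = 0.2, alpha = 0.7) +
  geom_point(size = 2)
```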
So we're able to get all of this out of the model, but the things I'm saying right now are interpretive — why, what's going on — and it's important to keep in mind how closely this model actually fits the data, which is what we were measuring when we looked at the accuracy numbers. This model is definitely better than guessing — we were able to see a meaningful fit to the data — but it does not fit extremely well. We're not able to predict with high accuracy; call it medium to low accuracy. That's probably also why the random forest did so poorly: the predictive power in this data, just this handful of variables, isn't enough to predict with high accuracy whether someone will die. So we're able to learn about some differences, but not build the kind of highly predictive model that a random forest would really be able to take advantage of.

Okay, we did it. We used tidymodels to train models to predict who survived these climbing expeditions in the Himalayas and who did not. We achieved, let's say, medium predictive accuracy, and we used that information when interpreting the results — how much information is there in what we know about who survives and who doesn't. We learned things about which seasons people are more likely to die in versus survive, and we saw in these results a little bit of information about what a dangerous occupation this is for the people who are hired; we even saw that reflected in the citizenship data, where people who are native to Nepal, such as the Sherpa people who are hired to go along on these expeditions, were more likely to die compared to people of other citizenships. I think this is a really interesting data set for practicing skills like building models, imputing missing data, and accounting for class imbalance, because dying, fortunately, is rare compared to surviving. I hope this was helpful, and I'll see you next time!
Info
Channel: Julia Silge
Views: 5,389
Rating: 4.9793816 out of 5
Id: 9f6t5vaNyEM
Length: 61min 10sec (3670 seconds)
Published: Wed Sep 23 2020