OddsPlotty – Visualising Logistic Regression Models from CARET, stats and TidyModels in R

Captions
Hi guys, welcome to Hutsons-Hacks. Today I'm excited to announce that a new package I've published has made its way onto CRAN. The package is called OddsPlotty. I developed it when I worked in healthcare research: we used a lot of logistic regression models, and we wanted a way to visualise those models more effectively in terms of each variable's effect on the thing we're trying to predict. I'm going to show you the implementation for models developed in both caret and tidymodels in R, so let's jump into R and get going.

First of all, you load OddsPlotty the same way you'd load any other R package. If you've never used R before: in RStudio, in the pane with Files, Plots and Packages, go to Install, and OddsPlotty will be pulled straight down from CRAN. I'm going to load it into my library and also open its help page, so I can show you how to use it as we go through; this walkthrough more or less replicates the vignette, with a couple of additions. The package was first developed to work with caret, but I'll show you how it works with both caret GLM objects and tidymodels objects.

I'm also loading the library mlbench, a machine learning benchmarking package that contains common datasets, because we're working with the BreastCancer dataset today. As a classification task, we're going to estimate the probability of a tumour being malignant versus benign, dependent on the variables in the dataset. I'm approaching this from a statistical-inference angle, but you could easily turn it into a train/test or k-fold cross-validation pipeline; see my separate YouTube tutorial on classification modelling in tidymodels for how to do that. We also load caret, tibble, ggplot2, the OddsPlotty package itself, e1071 and ggthemes, because we'll be adding themes to our custom plots later on.

To get the environment ready, I bring in the BreastCancer data from mlbench, filter to complete cases to get rid of null values, and look at the head of the data. I set the class as a factor, because the model needs the outcome as a factor so benign or malignant can be classified. You can pull up the help page for the BreastCancer dataset if you want to see what all the variables mean. Next I recode the predictor columns from character to numeric; you could use purrr or sapply or some other method, but I really like for loops, so I'll probably be told off by Hadley Wickham or someone like that. We now have breast data with cell thickness, cell size, cell shape, marginal adhesion, epithelial cell size, bare nuclei, chromatin, normal nucleoli, mitoses, and the class of whether it's cancerous or not. Because we have labels, this is a supervised machine learning task: we can use those labels to predict benign versus malignant. Here, though, I'm just going to assess significance, so it's more statistical inference than prediction.

Next, training the GLM using caret. train() is a caret function, so the fully qualified call is caret::train. The formula uses Class, which if you remember from the dataset is whether the sample is benign or malignant; data is the breast data; and the method is "glm", a generalised linear model, in this case a logistic regression. Refer to the online literature on how linear regression is transformed into logistic regression through the logit link function, and how the sigmoid curve then helps you classify; I won't go into the statistics in depth, because this is meant to show you the visualisations that OddsPlotty produces. The family is binomial because this is a binary classification, benign versus malignant; you'd need a multinomial model to predict more than two classes on y.

In the summary of the GLM, the intercept is highly significant, which suggests that beyond the independent variables we've captured there is still a lot missing that could better explain whether someone gets cancer or not. The significant variables here are cell thickness, marginal adhesion, bare nuclei, chromatin and normal nucleoli. At the moment we just have the raw coefficients on the log-odds scale; what we're going to do is convert them to something called an odds ratio. Josh Starmer has a really great StatQuest series on how to work with log odds and odds ratios, so I'd direct you there for how each one is worked out.

As I said earlier, the sigmoid function is the link that helps you classify: it's 1 divided by (1 plus e to the minus z), where z is the linear score, the sum of the weights times the inputs plus the bias. Again, this isn't a stats class, but that's essentially how the sigmoid function and the logit link turn a linear model into a classifier, from regression to classification.

Back on task. The summary also reports the AIC. If I were comparing multiple logistic regression models, looking for differences in the variables we include, the Akaike information criterion (I can struggle to say that one) is the statistic I'd use to compare them: the lower the AIC, the better.

So we know the significant variables, and this is where OddsPlotty comes in: odds_plot() is a way to visualise them. I pass in my GLM model (it could be called my_model, my_test_model, my_logistic_regression_model, whatever), and the important list element to expose from the caret object is finalModel; it's in that list somewhere, near the samples. The finalModel holds the fitted statistics, essentially the coefficient table: each independent variable, the raw estimates, their standard errors, the z values, the probabilities, and the significance based on those probabilities. I give the plot a custom title and subtitle, and the returned plotty object contains two elements, odds_data and odds_plot; I'm exposing odds_plot.

What the plot shows is linked to the significance in the table. For cell thickness the odds ratio is about 1.76: for every one-unit change you have 1.76 times the odds of having cancer, so the more this variable increases, the higher your odds. One correction to what I said on the video: odds ratios compound multiplicatively, so a ten-unit increase multiplies the odds by 1.76 to the power of ten, not by 10 times 1.76. Cell thickness is significant because its error bars don't cross the cutoff line at one; if an interval crosses that line, it starts to indicate there may be no effect. The width of the error bars also indicates how much variance there is within that estimate: it's the lower bound we care about here, in case it breaches the cutoff at one, since an odds ratio of one essentially says no effect. Chromatin, bare nuclei and marginal adhesion are significant too; normal nucleoli is borderline, so I'd be a bit careful with that one, and that correlates with the table, where it's only just significant, but it is a significant variable that can be taken into account.

What I tended to do when creating these plots for my publications was include them with data labels (I'll show you how in a second) alongside the results in a data table. The coefficient table is fine for a statistician, who understands that the z value samples the probability from the standard normal distribution and that the standard error tells you how much error surrounds the estimate, but we also want to visualise it in terms of odds ratios, and the odds_plot function returns the data that underpins the visual as well. The conversion to an odds ratio is simply the exponential of the raw coefficient; again, look at Josh Starmer's StatQuest for making stats fun (a bit of a plug there, Josh).

From plotty$odds_data you can see the lower and upper bounds, based on a 95% confidence interval, and that some fall below the no-effect cutoff of one: cell size, in odds-ratio terms, isn't really contributing. Thickness is the most important factor in terms of the odds ratio, alongside mitoses. The trouble with mitoses is that the range between its lower and upper estimate is so large it indicates a lot of variance around the mean, so perhaps we can't trust that estimate: its odds ratio is around 1.71, say, but the interval stretches from about one, meaning no effect, to far higher, so it's hard to trust. So although mitoses has quite a big odds ratio, you also need to interpret it in terms of where it falls on the chart, and the plot and the table can be used alongside each other for that kind of discovery and inference. From the back of this, you could take the fitted estimates from the list element as a data frame and simply merge the data tables together.

Now I'm going to use different themes with those parameters. I'm going to use ggthemes this time, a package created to give plots different looks and feels. I pass in the finalModel again, the important list element from the caret model, with a title, a subtitle, a point colour as a hexadecimal code, an error bar colour (black), a point size, an error bar width and a line style. I take the odds_plot element, save it into an intermediate variable called plot, add ggthemes' theme_economist(), and set the legend position to off. Finally I add a text label to the plot; bear with me on this one. If I go back to my plotty object, I have plotty, then odds_plot, then under odds_plot I have data, and under that sub-list item I have the odds ratio: four levels of hierarchy, and the way to go down the list is to keep using dollar-sign notation to expose the individual elements. I round the odds ratio to two digits and apply a slight horizontal and vertical adjustment. Run all that and you get the odds ratios as labels on the plot. You could also play with the label colour: red, blue (blue might look quite nice), or perhaps the same as the original point colour, or a darker navy. You could add more text, perhaps labels for the lower and upper bounds too, but it starts becoming a little busy; there's already a lot of information on that plot to take in, with the cutoff line, the effects in odds-ratio terms, the variables, the error bars with their high and low bounds, and the mean estimate. But it does let you add a bit more contextual information to the odds plot. I won't go through every option; the help page for the function and the vignette cover them. I'll also set the Edward Tufte theme, theme_tufte(), while we're at it.

Okay, now say I'm a bit more modern than some of our older machine learning practitioners. I've used caret for quite a long time and it's still probably my go-to machine learning package in R, a bit like scikit-learn in Python: if I'm going to model in Python it'll be scikit-learn I always reach for, and here it's normally caret. But I am changing my ways: tidymodels is the future, Max tells me so on Twitter. So, training the model with a logistic regression object from tidymodels. I bring in tidymodels and create what I'm calling fitted_logistic_regression. The actual command is logistic_reg(), which comes from the parsnip package if you really want the full naming convention. Because it's the tidy way, they prefer you to write everything with pipes, so I pipe through (from magrittr), set the engine to "glm", the family of models that logistic regression falls into among others, set the mode to "classification" because that's the task we're doing, and then fit. Can you remember that Class is whether they've got cancer or not? I fit it on all the other independent variables. Going back to our data just to reinforce this: we have the y we're predicting, a dependent variable, and independent variables x1 through x9, all of which are utilised in the equation to create our logistic regression. That creates the fit object for tidymodels, and you can see I now have this fitted logistic regression model.

Here's the important part, and I can't emphasise this enough, the difference between this and caret when working with OddsPlotty: with caret you use the finalModel list element; with tidymodels it's the fit element. I've put that in the vignette as well. So now I do the same visualisations with OddsPlotty as before. I call OddsPlotty::odds_plot (I'm using the full namespace here, but you could just call odds_plot) on the fitted logistic regression from the model I fitted previously, exposing the fit element from that list, save it under a unique name, tidy_model_odds_plot, specify the point colour and the horizontal line colour, add a new theme, ggthemes' theme_wsj(), take the legend off, and also print the odds data. Because this uses the same underlying generalised linear model engine from the stats package, all the results and outputs should be the same; we're just approaching the training differently, through tidymodels rather than caret. If anyone asks you whether caret or tidymodels is better, I'd say they're both using the same underlying engines, but tidymodels is probably easier to understand for tidyverse people, while caret is still really good for base R users who've been doing this a long time. The thing I like about caret, and Max will admit this, is that tidymodels has around 40 or 50 models, whereas caret has 193 different variants of machine learning models you can apply. Max is catching up and developing, but you've still got much more variety; however, with that variety comes the question of how you decide which model to use. There are pros and cons to either, and I don't think either is going away any time soon.

So we create this odds plot with theme_wsj(), and the same underlying data and modelling options come through. If you want to adapt the themes, just browse ggthemes while you're online: theme_base(), theme_excel_new() to make it look right for your Excel users, theme_fivethirtyeight() (Nate Silver's blog), theme_foundation() (I don't like that one), and theme_gdocs(), the Google Docs theme, which is pretty nice actually; I'm going to keep that one. Thanks, guys, for letting me find that one.

So essentially that's what OddsPlotty is. It was developed, like I said, when I was doing research, and it's a way to really visualise the results of this statistically heavy output: the deviance residuals, where things lie in terms of probabilities, how well your model fits. You can also use something called McFadden's R-squared to say how good your independent variables are at predicting whether you've got cancer or not. Please refer to the vignette if you get stuck anywhere, and I'm on Twitter if you've got any questions about how to utilise the package. If you find any issues, go to the supporting GitHub (StatsGary is my handle), raise an issue or submit a pull request there, and it will show you which version is on CRAN. OddsPlotty is now on CRAN with the vignette covering how to install and use the package, and I'm going to amend it to show some of the examples we've utilised today. So thanks, guys, keep watching; in the future I hope to make lots more machine learning tutorials alongside some other cool stuff as well. Stay safe, and please subscribe.
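The caret-to-odds-ratio workflow above ultimately calls stats::glm, so the core conversion can be sketched in base R alone. This is a minimal sketch on synthetic data (the variable names are illustrative, not the BreastCancer columns), so it runs without caret or mlbench:

```r
# Simulate a binary outcome whose log-odds depend on two predictors
set.seed(42)
n <- 500
cell_thickness <- rnorm(n)
cell_size      <- rnorm(n)
z <- 0.8 * cell_thickness - 0.3 * cell_size       # linear score (log-odds)
class <- rbinom(n, 1, plogis(z))                  # plogis() is the sigmoid
dat <- data.frame(class, cell_thickness, cell_size)

# Fit the logistic regression (family = binomial uses the logit link by default)
fit <- glm(class ~ cell_thickness + cell_size, data = dat, family = binomial)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios,
# and exponentiating the Wald confidence limits gives the 95% interval
# on the odds-ratio scale (this is the conversion OddsPlotty performs)
odds_ratios <- exp(coef(fit))
or_ci <- exp(confint.default(fit))
round(cbind(OR = odds_ratios, or_ci), 3)

# Comparing candidate models, as mentioned above: the lower the AIC, the better
fit_reduced <- glm(class ~ cell_size, data = dat, family = binomial)
AIC(fit, fit_reduced)
```

An odds ratio above one means the variable raises the odds of the positive class; an interval that straddles one means you cannot rule out no effect.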
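The sigmoid function described above can be written in one line; it maps any real-valued linear score z = sum(w_i * x_i) + b onto a probability between 0 and 1:

```r
# The sigmoid (inverse-logit) function: sigma(z) = 1 / (1 + e^(-z))
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)   # 0.5: a score of zero means even odds
sigmoid(3)   # ~0.953
plogis(3)    # base R's built-in equivalent of the same function
```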
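For reference, here is a hedged sketch of the OddsPlotty calls walked through above. It assumes caret, tidymodels, ggthemes and OddsPlotty are installed from CRAN and that `breast` is the prepared BreastCancer data frame; only the title and subtitle arguments mentioned in the video are shown, so check ?odds_plot for the full argument list (point colours, error bar width and so on):

```r
install.packages("OddsPlotty")   # pulled straight down from CRAN
library(caret)
library(tidymodels)
library(ggthemes)
library(OddsPlotty)

# caret: train() wraps stats::glm; odds_plot() wants the finalModel list element
glm_model <- caret::train(Class ~ ., data = breast,
                          method = "glm", family = "binomial")
plotty <- odds_plot(glm_model$finalModel,
                    title = "Odds plot", subtitle = "Breast cancer GLM")
plotty$odds_plot                       # the ggplot visual
plotty$odds_data                       # odds ratios with 95% confidence bounds
plotty$odds_plot + theme_economist() + # restyle it like any other ggplot object
  theme(legend.position = "none")

# tidymodels: parsnip's logistic_reg(); odds_plot() wants the fit element instead
tidy_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(Class ~ ., data = breast)
odds_plot(tidy_fit$fit, title = "Odds plot (tidymodels)")
```

The caret finalModel and the parsnip fit element are both plain glm objects, which is why the same odds_plot() call works on either.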
Info
Channel: Hutsons-Hacks
Views: 743
Id: HO0Mm6_LCGE
Length: 27min 0sec (1620 seconds)
Published: Tue Jun 22 2021