Multinomial classification with tidymodels and volcano eruptions

Video Statistics and Information

Captions
Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. In this video we're going to use this week's #TidyTuesday data set about volcano eruptions, and we're going to talk about how to train a multiclass, or multinomial, classification model. A lot of things are the same when you're training a classification model where you have more than just two classes in the label that you're predicting, but we'll talk about some of the things that are different. So let's get started with this data set of volcanoes.

Okay, let's look at some volcanoes. There are several data sets here, but the one I'm going to look at is not eruptions over time but where the volcanoes are. One of the columns we have is the primary volcano type. If we look at this, there's quite a number of types in here, but you can see that there are several main categories: there are stratovolcanoes, there are shield volcanoes (we have a couple of kinds of those here), there are volcanic fields, calderas, and so forth. What we're going to do in this video is use other information that we have about each volcano, for example the elevation, where the volcano is, and the kind of rock the volcano is in, and see how good of a job we can do building a model to predict what kind of volcano it is. That said, this isn't going to be a binary classification model where we only have two kinds; there's a whole bunch of kinds here. So this is an example of a multiclass or multinomial classification problem, where we have not just one class versus the other but three or more classes. 26 is really too many, given that we have fewer than a thousand rows of data here, so we're not going to try to do all of these. So let's start
to build a new variable. Actually, let's use transmute() here. Let's call it volcano_type, and let's use case_when() together with str_detect(). When we detect "Stratovolcano" in the primary volcano type (let me move this over a little bit), let's call it "Stratovolcano", and when we detect "Shield" in the same string, let's call that "Shield". Do we want to try to do any others? Let's start with this for now. So we're not doing binary classification; we are going to do three classes. Instead of one versus the other, we're going to build a model to tell three classes apart. We could keep going, we could maybe try to find all the calderas that are here, but with shields we're already down to 120 or so, so let's just go with three for now.

Then let's think about what else in here we want to include in our modeling. Let's take a look. The number and the name would be identifiers, and this primary volcano type is what we're going to transform into the thing we predict. We've got a lot of information here about location. Instead of things like region or country, let's just use latitude and longitude, this numeric measure of where these things are. Let's keep elevation; I like that idea. Let's keep the tectonic setting and see if we can do something with that. I don't think we'll do anything with evidence category; that's about how sure we are, how we know when it last erupted. Then let's look at these rocks. There are a lot of empty values, but it looks like in the first column we have something for pretty much all the volcanoes, so let's keep that. So let's keep that
volcano number in case we want it as an ID. We definitely want latitude, longitude, and elevation, and then let's get the tectonic settings and the rock, but just that first rock column, because a lot of the other ones are empty, so we'll just keep one. This will be the data that we keep and use in our modeling. This is kind of nice to look at because we've got a mix of some numeric predictors and some things that are currently categorical. Before we go on, let's do a mutate_if(): if a column is character, let's change it to a factor. And let's call this volcano_df. This will be the data we use for modeling: we're going to predict volcano type using the other things.

Before we go on, let's do a little bit of exploration. When you've got spatial information like that, the least you want to do is make a map. So let's see, map_data("world"), yeah. This is all over the world, so let's make a little bit of a map here and see where these volcanoes are, and at least do that much exploration before we get started on our modeling. I am going to make a map that has two layers, and they are going to have different data. Normally when you use ggplot() you put the data right here, but one of the layers is going to be a map and one is going to be points, so the points are going to come from here and the map is going to come from here. I can say data = world like so, and then data = volcano_df like this. I don't think I've demonstrated how to do this before and you may not have seen it, but you can actually combine different data sets in the same plot using ggplot2 in this way. For the map, let's also say map = world, and then you start defining the aesthetics. So, if I remember right,
mostly what we have here, okay. On the x-axis we'll put longitude, on the y-axis we'll put latitude, and then we need to say something called map_id, and it's this region thing, so we'll take region and put it here like so. If I only plotted this, I'll just show you what that looks like. I think that will do something. There we go. So we have the beginning of a map, which is looking pretty good already. We can start to make it look a little bit nicer. For example, let's make the color, which is for the lines, white; let's make the fill light gray; and let's make it pretty transparent, because of the next thing we're going to do. So that changes it to light gray.

The next thing we're going to do is put points on top of it. We're going to say aes() and then longitude and latitude. Let's remind ourselves what's in this data: the two data sets have different column names, so that's why this looks different. Then let's make different colored points for the different volcano types, and let's make them a little bit transparent too, so that if they plot on top of each other we can still see them. Let's zoom in here so we can see this a bit. This is looking pretty good. There is for sure spatial information here. The main thing I know about volcanoes is that they're arranged in a Ring of Fire. The Ring of Fire here is split in half: if you start down here, it goes around the Pacific Ocean on this edge, and then it comes over here and goes back around this way. So you for sure see that, and then we see them sprinkled around
elsewhere around the world. We definitely see some spatial information in how the different types are organized. There are shield volcanoes out in the middle of the ocean and in Antarctica. The stratovolcanoes you see a lot of; that was the most common type, so I guess it makes sense that they're clustered together. And then there are a lot of others. I live in this part of the US, and I think we have volcanic fields and whatnot around here, so some of the other types are pretty strongly clustered. It will be interesting to see if we can use some modeling to learn about some of these spatial differences that we see, or if some of the other things matter too, like elevation or the rock or the other things that we have here. Okay, that looks good, so let's call that exploration for now and start to build a model.

We do not have a ton of data here, so instead of splitting into testing and training data, as we often would if we wanted to build a predictive model, I am going to create bootstrap resamples of this data frame that we made. Let's call it volcano_boot, like this. What this is going to do is create 25 bootstrap resamples, and each one is divided up into two sets. So instead of having training data and testing data, where we train on the training data and evaluate on the testing data, in each of these bootstrap resamples we're going to train on the analysis set, evaluate on the assessment set, and then see what we get and use that as a measure of how well our multiclass model can perform. The thing to keep in mind if you do something like this is that it probably pessimistically biases the estimates that
we're going to get out of it when we have such small data like this. I'm just going to demonstrate doing it this way. So we have our data now, these bootstrap resamples that we have created. The next thing I'm going to do is get ready to preprocess this data. Remember, what's in each of these splits is a bootstrap resample of this data, which is a mix of numeric data and categorical data. One thing I want to draw your attention to, if we take volcano_df and count volcano_type, is that the number of shield volcanoes is quite a bit lower than the other two, and I would like to deal with that class imbalance in this multiclass problem. So I'm going to use a recipe step from the themis package, which is a package with recipe steps just for dealing with class imbalance. The one I'll use is called step_smote(). The SMOTE algorithm generates new examples of whatever minority class you have using nearest neighbors. To do that, we need to convert everything to numeric, so we need to make indicator variables, dummy variables, and then let's also center and scale all of our numeric variables.

Let's get started with this recipe. First, recipe(): we say, hey, I'm making a recipe. I want to predict volcano_type with everything else, and the data that I'm using is volcano_df. The first thing I'm going to do is update_role() for this volcano number, because I want to keep it in my data but it's not a predictor or an outcome, so I need to give it a new role. Okay, what else do I want to do here? How many tectonic settings are there in volcano_df? 11. That's kind of a lot given how much data we have, so let's use step_other() on tectonic settings. What that does is collapse some of the levels that are not used very much together into an "other" category. So it'll keep the big categories and then it
will collapse the very small categories down. I'm just going to use the defaults, but you can set exactly where that threshold is. Let's do the same thing for the rock. How many kinds of rock are there? Ten. Again, I think that's kind of a lot, and some of these are only used two or six times. I don't want to keep that in my model, so let's step_other() the rock as well. Next, I have these categorical variables, but I want to have numeric values, so I am going to do step_dummy() on these two variables that are currently factors. Instead of one column that has all these levels, I'm going to have six or seven columns that have ones and zeros in them. Let's do a step_zv() to remove anything that has zero variance, and then let's normalize all the predictors so they are centered and scaled. Finally, at the end of all this, let's use that SMOTE algorithm to oversample the minority classes so that the classes are balanced. Let's call this volcano_rec, like so.

So, the volcano recipe. What I've done so far is define all the steps, but none of them have been estimated; we haven't gone to the data and, for example, calculated the center and scale that we need. The way to do that is to prep(). When we prep the recipe like this (whoops, not like that), we've gone to all the variables and figured out how many factor levels we need to collapse, what the mean and standard deviation were, and how we need to oversample to get the classes to be balanced. If you want to check out what happened and check on your results, you can juice a
prepped recipe and get back out your results. Notice we have more rows than we did before, and that's because of the oversampling. We can count volcano_type here, and now look, it's even, and that's because of the oversampling that we did. Also notice how many more columns we have now: before we had seven columns and now we have 14, and that's because we made the dummy variables, the indicator variables. Instead of having one column with tectonic settings in it we have one, two, three, four, five-ish, and now we have four or so columns for rock with the different kinds of rock that we have here. Notice that all the numbers are centered and scaled, so this is a good fit for the SMOTE algorithm, which is nice. So we were able to oversample using that nearest neighbor algorithm so that we have even numbers here. Hopefully we can do a good job of recognizing the shield volcanoes, even though there are fewer of them, and not just the stratovolcanoes. Okay, so that's data preprocessing, such an important part of any modeling workflow.

Next, let's talk about the model specification we're going to use. Let's use a random forest. Let's do set_mode("classification"), because we're trying to tell the difference between stratovolcanoes, shield volcanoes, and the rest, everything else, other. And let's use ranger. A random forest using the ranger engine works for multiclass or multinomial classification just out of the box. It just works as is; you don't have to do anything for it. Also, you don't really gain a ton from tuning the hyperparameters as long as you have enough trees, so it's a really nice fit for a problem like this. This isn't really very much data to use with a random forest, so that's maybe a downside here, but it could be a very good fit for
the problem space that we have here. So we're going to make ourselves a random forest model specification. We upped the number of trees, and we're training a classification model; it's just going to successfully do multiclass instead of binary classification for us as is. Then, for convenience, let's put our recipe here, our unprepped recipe, together with our model, this random forest specification, into a workflow. Let's call it volcano_wf. A workflow is a way to carry around our bits of modeling workflow. Think of it as a way to hold together things that stick together like Lego blocks: your recipe and your model stick together, and it's easier to carry them around in your code and in your modeling workflow. And you can fit a workflow much like you can fit a model, so let's do that.

We are going to use the function fit_resamples(). The first argument is the workflow, and the second argument is the resamples, which for us, remember, is the set of bootstrap resamples. The only other thing I want to do here is save the predictions. Let's call the result volcano_res. Maybe let's set verbose also, since we're going to be sitting here watching it go. I'm saving the predictions because I want to look at which individual volcanoes were predicted correctly or not, and if I save the predictions in this way then I am able to get that out, which is convenient. And if we use verbose = TRUE we can see how close to the end we are, which is pretty nice.

All right, we got through all 25 bootstraps. Just to emphasize what's happening here: for any individual bootstrap, first the recipe is estimated, then the model is fit, and then predictions are made using the fitted model for that bootstrap resample on the part of the data that was not used to fit. So that happened 25 times here for our 25 resamples. All right, so let's look at
what we got. Notice the beginning of this looks similar: we have the bootstrap resamples, but now we have metrics; notes, where anything that goes wrong is kept around (fortunately we have zero notes); and then the predictions are here. Each of these is only 350 or so rows because that's what is in the assessment set; you can think of that as like a testing set for each of these bootstrap resamples. So how did the predictions go?

What can we do with this volcano_res? The first thing we can do is just collect the metrics. These are the default metrics. This is probably the biggest place where what we're doing with multinomial classification is different from what we would be doing with binary classification: the metrics that you need to use are different from what you would use for binary classification. The yardstick package, which is part of tidymodels, has excellent, full-featured support for multiclass performance metrics. For example, you can see that fit_resamples() could tell that this is a multiclass problem, a multinomial classification, and instead of just doing regular binary accuracy it did multiclass accuracy. So here is the value that we got for multiclass accuracy. Perhaps nothing to write home about, but this is what we got. And then area under the curve for an ROC curve: this is something where you have to decide how you are going to extend it to the multiclass situation. There are several ways to do this, and the default is this Hand-Till estimate, which is one way to do it. yardstick actually has a great article, a great vignette, that talks about multiclass metrics, and I encourage you to read it. It is very helpful for understanding how these things work.

What else can we do with this? We have the predictions as well, so we can collect the metrics but we also can collect the predictions.
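The multiclass metrics described here can be sketched with yardstick on its own. This is only an illustration under stated assumptions: it does not use the volcano results from the video but hpc_cv, an example data set of resampled multiclass predictions that ships with yardstick, and the object names acc, auc, and auc_weighted are made up for the sketch.

```r
library(yardstick)

# Assumption: hpc_cv (bundled with yardstick) stands in for the volcano
# predictions. `obs` is the true class, `pred` the predicted class, and
# VF, F, M, L hold the predicted probability of each of the four classes.
data(hpc_cv)

# Accuracy: yardstick sees more than two levels in `obs` and switches to
# the "multiclass" estimator on its own.
acc <- accuracy(hpc_cv, truth = obs, estimate = pred)

# ROC AUC: for a multiclass outcome, pass every class-probability column
# (in factor-level order). The default estimator is Hand-Till.
auc <- roc_auc(hpc_cv, truth = obs, VF:L)

# One of the other ways to extend ROC AUC to several classes:
# one-vs-all AUCs averaged with weights proportional to class frequency.
auc_weighted <- roc_auc(hpc_cv, truth = obs, VF:L,
                        estimator = "macro_weighted")

print(acc)
print(auc)
print(auc_weighted)
```

The .estimator column of each result shows which multiclass extension was applied, which is the same column that appears in the collected metrics from the resampling results.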
So let's look at how this is different. This .row is the row in the original data, then we have the predicted class, and volcano_type is the true class that we had from the original data. We have predicted probabilities for each of the classes, and the predicted class is whichever one was highest: this one was Other because Other was highest, and this one is Stratovolcano because it's highest, and so forth. One of the things we can do here is a confusion matrix. For now let's just put everything together, all the bootstrap samples together, and look at a confusion matrix. First we say the truth, so volcano_type, and then we say the predicted class, like so, and we get a confusion matrix.

Let's think about this. Here's the truth up here and here's what was predicted. You can see here why we got accuracy that is okay-ish; we're better than guessing, but maybe not fantastic. Most of the Other volcanoes were classified as Other, but there's quite a lot that were classified as stratovolcanoes. The stratovolcanoes were the easiest to classify, and remember we actually had the most of them in the real data set; even though we used oversampling with the SMOTE algorithm, it was still easiest for us to identify those. For the shield volcanoes, it looks like the proportions are probably about the same. The most common choice was to say shield volcanoes are shield volcanoes, but certainly we're not blowing it out of the water here or anything like that. These metrics were the default ones; I didn't ask for any specific fancy metrics. But if you save the predictions, you can always come back
and collect the predictions, and then you can always calculate metrics after the fact. Say we want to find the positive predictive value: we do the same thing, where we say the truth and then the predicted class, like so. Again, this is a way of weighting these for the multiclass situation. You can also do things like group by the resample, so now we have the estimate for every id. We could look at how that is distributed; you can do a quick histogram of that and understand how this is distributed. Here we only have 25 values, but you can look at things like that. Okay, so that is the predictions.

Something else we would like to know more about is what we can understand about this model. Because this is a random forest model, with all these trees voting together, it can be a little bit more difficult to understand what's driving the predictions, but we can use variable importance to understand our model. The vip package, for variable importance, can help us do this. To do this we have to go back to the model spec and reset the engine. We keep it as ranger, but we have to say that we want to calculate importance this time, because we didn't set the importance up here; it's slower, so we probably don't want to do it every time. Then we fit it. Much like you can fit a workflow, you can just fit a model specification. Let's do volcano_type explained by everything. As for the data that we do this on, we're not going to do it for all the bootstraps; instead we're going to do it on the data set as a whole. We're only
going to do this a single time. So we are going to take the prepped recipe and juice it, like so. We do not want the volcano number, and then let's clean these names because they're kind of a mess, like so. Okay, so this is the data we're going to train on; let's write data = just to clarify for ourselves. So this is a fit: I'm fitting the random forest again, one time instead of a bunch of times. The purpose up here was to find out how good of a job we're doing, which parts of the classifier do a good job and which do a bad job, for which categories. What I'm doing here, by fitting again one time on a single data set, is asking what I can learn about which variables are important. The reason I'm using the juiced data is so that I can compare things like the subduction zone, being in the continental crust versus the oceanic crust, to the importance of latitude and longitude. So I pipe this to the vip() function, and I'm going to say geom = "point" because I think that looks nice. Okay, let's run this. It has to fit the random forest one more time, but this is small so it's fast. Let's take a look at our result.

Okay, so the big two here are latitude and longitude, so the biggest thing impacting the predictions is where the volcanoes are. Interestingly, the next thing, which is not that far behind, is this rock, basalt. I think that's how you say it, basalt; I'm not a geologist. Then we have some things behind there like the continental crust, and then elevation. So we can understand what is having the biggest impact on the predictions for this random forest model, which is always good to be able to do. Given that latitude and
longitude are so important, let's wrap this up by making one more map, and we're going to make the map with the bootstrap resamples. This should be good. As I've said, we seldom have the opportunity to do something a little bit fancy with plotting, so whenever the opportunity lends itself, I think we should definitely do it. Okay, let's take those results and look at the predictions again. There we go. Let's think about volcano_type. You can't quite see this, but when these two columns are the same, the model was correct; when they are different, like here, it was incorrect. So let's make a new column with mutate(), let's call it correct, and it's going to be a logical: volcano_type equals the predicted class, like this. So this is my new column. Overall how did we do? I think we know this from accuracy already: yes, about 60% or whatever. Okay, great.

Notice we have this .row; this is the row from the original data, volcano_df. So on volcano_df we can mutate .row = row_number(), like that, and then we can join these together with a left_join(), like so. Run this. All right, that's an okay join. Now what we have, for every bootstrapped prediction, which is a lot because we did it 25 times, is whether it was correct or not and where it was. Let's save this; let's call it volcano_pred. So what this data frame has in it is every bootstrapped prediction, whether it was correct, and where it was.

Then let's make a map of this. Let's go back up to our map, copy it, and paste it down here, but we're going to change it a little bit. Instead of this geom_point() thing we're going to use one of these stat summaries, stat_summary. So let's see. Let
us read this: "The data are divided into bins defined by x and y, and then the values of z in each cell are summarized with a function." That's exactly what we want to do. We can start with stat_summary_2d() and then we can try a different one. The data that we want are these predictions, and the aes() is the same as before, longitude and latitude. The z here is this correct column (whoops, nope, correct, like this; let's see, what did I do wrong? Okay). The function that we want is mean; we're going to take the mean, like so, so we get what percentage is correct. And this should be somewhat transparent, like so. Let's try that. Whoops, okay, this didn't quite work. I think I need to say as.integer() here to force it to be ones and zeros. There we go, this is right. Okay, nice, very nice, I like it.

So what we've done is take the mean, in each little square, of whether the prediction was correct: when it is more blue, it is more correct, and when it's more dark, it is less correct. So we're able to understand, across the world, where we were more correct and where we were less correct. Let me see if I can make this look decent here for you.

Okay, so those 2D summaries are nice but, like many of us, I love some hexagons, so let's change it to hexagons. I mean, obviously this is better because it's hexagons. I think you can change bins and make more hexagons, and if we have some hexagons, obviously more hexagons is better. That might be too many. I don't know, can you have too many hexagons? So hard to say. Okay, with the fill scale we can change that gradient; for example, you could keep the low color that dark gray and change the high color. The other good thing about using this is that you can change the labels; for example, I want to be more
clear that that's a percent there. Okay, excellent, I think this is really coming along. You can see that here in the West of the US (remember, that was a lot of Other) we did a really good job of saying those were Other, and for these shield volcanoes down here in Antarctica we did a great job of finding those because they were not mixed up with other things. But in some of these areas where things were mixed up a lot, it was much harder for us to say what they were. So you can see across the world where things were and were not easier to predict in this three-class classification model. If you wanted to change the title on the color bar there, you can just put in something with labs(); you can say something like "Percent classified correctly" here. Let's zoom this in one more time and take a look. Let that render. Okay.

One thing that could make this map better (I'm not going to do this, because I always have to look it up every single time) would be a better projection. I think the hexes require a Cartesian projection, so we would have to look into something else if we did that. We have just put the whole earth on fixed coordinates here with x and y, and that's not great: everything in Africa looks tiny and Antarctica looks enormous and whatnot. So a way to make this map better would be to use a better projection, but then we would have to think about how we did the hexes, because I think we could not just use the hexes if we used coord_map() and changed the projection there. But overall I am really happy with this visualization as an output.

Okay, we did it! We just trained a model to predict what type of volcano each volcano was, and this was a multiclass or multinomial classification problem. We used oversampling in our preprocessing
because we didn't have the same number of volcanoes in each category; we used the SMOTE algorithm for that oversampling. When we're working on multiclass or multinomial problems, the metrics that we want to use to evaluate how our models are doing, like accuracy or area under the curve, have to be adjusted for that multiclass situation, and that's one of the things we have to think most carefully about when we're working with problems, data, or models that are not just binary classification, one label versus another, but instead many labels. I hope this was helpful, and I hope to see you next time.
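The overall workflow recapped above (a recipe that ends in SMOTE oversampling, a ranger random forest, a workflow, and bootstrap resampling) can be sketched roughly as follows. This is a minimal sketch and not the code from the video: it runs on a small simulated data set rather than the volcano data, and the toy_* names, the simulated columns, trees = 500, and times = 5 are all assumptions chosen so the example runs quickly without a download. It needs the tidymodels, themis, and ranger packages installed.

```r
library(tidymodels)
library(themis)   # provides step_smote()

set.seed(123)

# Hypothetical stand-in data (NOT the volcano data): an imbalanced
# three-class outcome, two numeric predictors, one categorical predictor.
n <- 600
toy_df <- tibble(
  type = factor(sample(c("Other", "Shield", "Stratovolcano"), n,
                       replace = TRUE, prob = c(0.40, 0.15, 0.45))),
  latitude  = rnorm(n, mean = as.integer(type) * 10, sd = 15),
  longitude = runif(n, -180, 180),
  rock = factor(sample(c("Basalt", "Andesite", "Dacite", "Rhyolite"), n,
                       replace = TRUE))
)

# Preprocessing in the order described in the captions: collapse rare
# levels, make dummy variables, drop zero-variance columns, center and
# scale, then oversample the minority classes with SMOTE.
toy_rec <- recipe(type ~ ., data = toy_df) %>%
  step_other(rock) %>%
  step_dummy(rock) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_smote(type)

# Random forest with the ranger engine; multiclass works without changes.
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")

toy_wf <- workflow() %>%
  add_recipe(toy_rec) %>%
  add_model(rf_spec)

# Bootstrap resamples (5 here to keep it fast; the video uses 25).
toy_boot <- bootstraps(toy_df, times = 5)

toy_res <- fit_resamples(
  toy_wf,
  resamples = toy_boot,
  control = control_resamples(save_pred = TRUE)
)

# Multiclass accuracy and Hand-Till ROC AUC, averaged over resamples.
toy_metrics <- collect_metrics(toy_res)
print(toy_metrics)

# Confusion matrix over all assessment-set predictions.
toy_conf <- toy_res %>%
  collect_predictions() %>%
  conf_mat(truth = type, estimate = .pred_class)
print(toy_conf)
```

Because step_smote() is skipped when new data are processed, the oversampling only affects the analysis set of each resample; the assessment-set predictions in the confusion matrix are made on the original, imbalanced rows.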
Info
Channel: Julia Silge
Views: 5,317
Rating: 4.9583335 out of 5
Id: 0WCmLYvfHMw
Length: 47min 38sec (2858 seconds)
Published: Wed May 13 2020