Build a predictive text model for Avatar: The Last Airbender with tidymodels

Captions
Hi, my name is Julia Silge. I'm a data scientist and software engineer at RStudio, and in this screencast we are going to use data from the TV show Avatar: The Last Airbender to train a model that predicts who the speaker is for individual lines from the show. We'll train a couple of kinds of models, see how they do, and then talk about how to compute variable importance using permutation.

Hello, welcome, let's get started. Today we are dealing with a really fun dataset about the TV show Avatar: The Last Airbender. This show is so fun that I'm going to theme all my plots so they look like they belong to the show. It's a pretty popular show at my house; my kids like it and I like it, although I'll admit they've watched more of it than I have. There's a great, fun package (tvthemes) that gives fonts and plot themes from different TV shows, and fortunately there is one right here for Avatar: The Last Airbender.

Here is the dataset we have. If we take a look at it, there is one row per line of either dialogue or scene description; the dialogue is repeated in both the character_words column and the full_text column. Sometimes a row is words spoken by a character, and sometimes the text is a scene description, so there are a lot of different kinds of analysis we could do here, and there has been so much fun analysis and visualization happening in the TidyTuesday community. There is also an IMDb rating for each episode. The seasons are called books, and there are three of them (Water, Earth, and Fire); the chapters are the episodes, 61 in all, and those are what is linked up with the directors and the IMDb ratings. If I wanted to do some modeling, 61 episodes is just not very much data, and I would really like to demonstrate some modeling and machine learning with this dataset. On the other hand, we do have all of these lines, either spoken by characters or scene direction. So what we're going to do is take the spoken dialogue, like "Katara, settle down" (spoken by Sokka), and see whether we can predict who is speaking from what is being spoken. We're going to try to build a model to do that.

Let's see who the most common speakers are. Scene description is, as we might expect, the most common, and behind that comes Aang, the main character, the last airbender if you will. Then we have his friends Sokka, Katara, and Zuko; Toph comes in later, along with other characters Aang meets during the course of the story. So what I would like to do is build a model that identifies whether a line of dialogue is spoken by Aang or by someone else.
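To follow along, here is a minimal sketch of getting the data and counting speakers. It assumes the TidyTuesday release for 2020-08-11 (the same data also ships in the appa package), so the exact loading call is an assumption rather than something shown in the video.

```r
# Minimal sketch: load the Avatar dataset and count speakers.
# Assumes the TidyTuesday 2020-08-11 release; adjust if using the appa package instead.
library(tidyverse)

avatar_raw <- tidytuesdayR::tt_load("2020-08-11")$avatar

avatar_raw %>%
  count(character, sort = TRUE)
```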
Let's start out by doing a bit of exploration. I'm not going to look at the scene description, only the spoken dialogue, so I'll keep just the rows that have character_words. Let's make book a factor in the order we see it, so that Water comes first, and let's see who the most common characters are so we know who we are lumping together: we can lump the low-frequency levels of the character variable together, keeping the ten most frequent. Then count book and character and make a plot, with n on the x-axis, character on the y-axis, fill mapped to book, a geom_col, and facet_wrap by book with scales = "free". You can tell we have our beautiful fonts and theme for the show, which is great. We don't need that y-axis label, and let's order the bars so we can tell what's going on: after we count, mutate character using the tidytext function reorder_within, reordering character by n within book, and then add scale_y_reordered.

That's looking good, but those colors are not Avatar colors, so let's make some. (If you explore what people involved in TidyTuesday have made, there are much more amazing visualizations than what I'm doing here.) There is a function called avatar_pal that gives you the palettes: you give the name of the palette and then the number of colors you want. For this plot we want one color from each of three palettes, Water Tribe, Earth Kingdom, and Fire Nation, to match the Water, Earth, and Fire books. That looks pretty good; it would be nicer if the first color were red rather than yellow, but that's okay.

So this is how many lines are spoken by these characters in the three books. Perhaps unsurprisingly, when you bin a whole bunch of characters together, "Other" is very common. Beyond that, Aang speaks the most in Water with Katara right behind him, while Sokka actually speaks the most (more than Aang) in Earth and Fire, which is pretty interesting, with Aang behind him. So we can see how often these different characters speak.
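Here is a sketch of that exploration plot. It assumes the tvthemes API (theme_avatar(), avatar_pal() with palettes named "WaterTribe", "EarthKingdom", and "FireNation") plus tidytext's reorder_within() helpers; names and layout are illustrative.

```r
# Sketch of the lines-per-character plot, faceted by book.
library(tidyverse)
library(tidytext)   # reorder_within(), scale_y_reordered()
library(tvthemes)   # theme_avatar(), avatar_pal()

theme_set(theme_avatar())

avatar_raw %>%
  filter(!is.na(character_words)) %>%
  mutate(book = fct_inorder(book),
         character = fct_lump_n(character, 10)) %>%
  count(book, character) %>%
  mutate(character = reorder_within(character, n, book)) %>%
  ggplot(aes(n, character, fill = book)) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  scale_fill_manual(values = c(avatar_pal("WaterTribe")(1),
                               avatar_pal("EarthKingdom")(1),
                               avatar_pal("FireNation")(1))) +
  facet_wrap(~book, scales = "free") +
  labs(x = "Number of lines", y = NULL)
```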
What we're going to attempt here is to identify, from a line of dialogue, whether the speaker is Aang or someone else. Let's build a little data frame for that: take the raw data, drop the rows that are scene direction, and make a column called aang with if_else, so that when character is equal to "Aang" it is "Aang" and otherwise it is "Other". Then keep that new variable, book (in case we need it; I'm not sure we will), and the character_words column as text, and call this avatar. This is the data frame we're going to use for a few things.

Let's look at a few lines spoken by Aang, just to get an idea of what we might be getting ourselves into; let's take a few and pull out the text. "Hi, I'm Aang." Some of these are so short. Oh boy, I don't know if this is going to work at all. I did do a little exploration ahead of time; here is Aang talking to his friends. There is a fair amount of data here, but the lines are so short. This is an amazing, super fun dataset, but notice how short these are, and think about how hard it would be for you as a human being to tell whether a given line was spoken by Aang or by one of the other characters, even if you're someone who knows the show. This is what I've decided to try to make a screencast about, so we'll see whether it's a terrible idea or not.

Before we move any further, let's do a couple more things. Right now the strings are all stored in a column called text; let's convert this to a tidy text format by tokenizing into word from text, and then count the words for Aang versus Other (so now we have Aang saying "aah", for example). Next let's load a package called tidylo. tidylo is a package for weighted log odds using tidy data principles, and we can use a function called bind_log_odds: you give it the set, then the feature, then the counts, and it computes the weighted log odds (look at the documentation for details about the kinds of weightings that are available). We can now arrange by the weighted log odds to find the highest values: Aang is much more likely to say "I" and "I'm" than the other characters are, Aang is more likely to say "Momo", and other characters are more likely to say "welcome". Let's call this avatar_lo and then make a bit of a visualization. Do notice how low these counts are; this is a fair amount of data, thousands of rows, but the per-word counts are small. To pull out the top words in each group we can use slice_max.
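Here is a sketch of that data frame and the weighted log odds computation, assuming the tidylo package's bind_log_odds(set, feature, n) interface; the avatar and avatar_lo names are just the ones used in this walkthrough.

```r
# Sketch: build the Aang-vs-Other data frame and compute weighted log odds per word.
library(tidytext)
library(tidylo)

avatar <- avatar_raw %>%
  filter(!is.na(character_words)) %>%
  mutate(aang = if_else(character == "Aang", "Aang", "Other")) %>%
  select(aang, book, text = character_words)

avatar_lo <- avatar %>%
  unnest_tokens(word, text) %>%
  count(aang, word) %>%
  bind_log_odds(aang, word, n) %>%
  arrange(-log_odds_weighted)
```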
So, slice_max by log_odds_weighted with n = 15 (after grouping by aang, so this is the top 15 in both groups), then reorder word by log_odds_weighted and make a plot: the weighted log odds on the x-axis, word on the y-axis, and whether it is Aang or not as the fill, with a bar chart again and a facet_wrap so that Aang and Other sit next to each other. Good. We don't need that y label, and this time let's use scale_fill_avatar; the default is the Fire Nation palette, but since this is about Aang let's use the Air Nomads palette instead. Nice.

These are the words with the highest weighted log odds of being spoken by Aang versus everyone else. Aang is more likely to say the names Sokka and Katara, and Appa (his sky bison), while the other characters are more likely to say words like mother, princess, prison, men, and escape, which Aang is less likely to say.

One option for building this kind of text model is to use the content of the words themselves, and there is information in the words; that's what we're looking at in this plot. However, I'll suggest that we don't have enough data for that, or at least that it's going to be tough to build such a model with the amount of data we have. We could certainly give it a go, but we saw how small the counts were; we would be building a model whose features had counts like these. So instead I want to show another option that can sometimes work with text: looking at other kinds of features of the text. There's a nice package for this called textfeatures, and we'll walk through how to create these features using the package in its basic form, and then how to do it within tidymodels so it's ready for modeling. The kinds of features we're talking about are not the words but characteristics of the text: the number of characters per word, how many punctuation marks there are, how many spaces there are per document, and so on. These are features of the text that are not connected to the actual vocabulary of the text itself.

So let's load the textfeatures package and use the textfeatures function. We can give it our data frame because it has a column called text, and we'll say we are not computing the sentiment scores and not computing word vectors, and I'm also going to turn off normalization because we're going to do that within tidymodels later. If we run this, it counts up the features in the text, finds the parts of speech, and so on, and what we get is a new table with the same number of rows as our original table. It has some columns our text doesn't use, like URLs, hashtags, and mentions, but you can see how those would be useful if your data had them.
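Here is a sketch of that feature extraction step, assuming the textfeatures() arguments for turning off sentiment, word vectors, and normalization; the argument names follow the package's documented interface but are worth double-checking against your installed version.

```r
# Sketch: extract non-vocabulary text features for every spoken line.
library(textfeatures)

tf <- textfeatures(
  avatar,                # has a `text` column, which textfeatures() looks for
  sentiment = FALSE,     # skip sentiment scores
  word_dims = 0,         # skip word-vector dimensions
  normalize = FALSE      # keep raw counts; normalization happens in the recipe later
)

tf %>% select(starts_with("n_"))
```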
What our data does have, though, includes how many times people speak with first-person pronouns and second-person pronouns, the number of words, the number of unique words, and the number of lowercase characters, spaces, exclamation marks, other punctuation, commas, and so forth. Let's bind this with our original data frame, then group by Aang versus everyone else, and then summarize across all the columns that start with "n_", taking the mean, so that for each of the two groups we get the mean of every feature. (I am really enjoying using across these days; it makes things I often struggled with before feel smooth.)

That result is wide, so let's make it tidy and narrow with pivot_longer: pivot the columns starting with "n_" and send the names to a column called text_feature. There are a bunch of zeros in here, so let's keep only the features whose values are not close to zero. That's looking good; since we counted things up and then took a mean, some of these values are low and some are high.

Now a visualization. Put aang on the x-axis, value on the y-axis, fill mapped to aang, and make lots of little bar charts with position = "dodge" (we won't need the legend), faceting all of these text features with scales = "free". Okay, that worked pretty well. Let's clean it up a little more: spread the facets over a few more columns, drop the x-axis label, and label the y-axis "Mean text features per spoken line". We also need Avatar colors again, so scale_fill_avatar with Air Nomads, since we're still looking at Aang versus everyone else. One more thing: reorder text_feature by value so the largest facets are up in the corner and they step down to the smallest.

So: Aang speaks shorter lines than other people, but the proportions differ. He has more unique characters per line proportionally, the number of exclamations per spoken line is higher, and the first-person pronouns are higher (we saw that with "I" and "I'm"). What we're going to try to do is use these differences between Aang and the other characters in these text features, such as how punctuation is used, how pronouns are used, and how unique words or unique characters are used, to tell the difference between a line spoken by Aang and a line spoken by someone else.
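Here is a sketch of that comparison plot; the bind_cols() pairing, the near-zero cutoff, and the facet layout are illustrative choices rather than exact values from the video.

```r
# Sketch: compare mean text-feature values for Aang vs. everyone else.
tf %>%
  bind_cols(avatar %>% select(aang)) %>%
  group_by(aang) %>%
  summarise(across(starts_with("n_"), mean)) %>%
  pivot_longer(starts_with("n_"), names_to = "text_feature") %>%
  filter(value > 0.01) %>%                        # drop features that are ~zero
  mutate(text_feature = fct_reorder(text_feature, -value)) %>%
  ggplot(aes(aang, value, fill = aang)) +
  geom_col(position = "dodge", show.legend = FALSE) +
  facet_wrap(~text_feature, scales = "free", ncol = 6) +
  scale_fill_avatar(palette = "AirNomads") +
  labs(x = NULL, y = "Mean text features per spoken line")
```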
So we have things like first- and second-person pronouns, to-be verbs, and the overall length of the lines, and we're going to see whether we can build a model that distinguishes a line by Aang from a line by someone else using this kind of information. Let's do it. (All right, that last plot looks hideous; I can't look at it.)

Let's load the tidymodels framework, the set of packages for modeling in R. The first thing we're going to do is an initial split of that avatar data, and we'll make it a stratified split so the sampling happens within Aang's lines and within everyone else's lines. Call this avatar_split, then make avatar_train with training(avatar_split) and avatar_test with testing(avatar_split), and set a seed for reproducibility. Our training data has about 7,500 lines. As a reminder (I don't think we've actually looked at this yet), we have some pretty significant class imbalance: there are many more lines spoken by people other than Aang than lines spoken by Aang. That is something we're going to have to deal with.

While we're here thinking about the data, let's also make a set of resampling folds, a set of cross-validation folds. We do this on the training data, avatar_train, again with stratified sampling, especially now that we've noticed the class imbalance. Call this avatar_folds and set a seed again, because creating the folds also involves randomness. Each one of these folds is like a simulated dataset that we'll be able to use for training and choosing models.
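Here is a sketch of the split and the folds; the seed values are arbitrary.

```r
# Sketch: stratified train/test split and stratified cross-validation folds.
library(tidymodels)

set.seed(123)
avatar_split <- initial_split(avatar, strata = aang)
avatar_train <- training(avatar_split)
avatar_test  <- testing(avatar_split)

set.seed(234)
avatar_folds <- vfold_cv(avatar_train, strata = aang)
```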
Now let's get ready to do our data pre-processing with a recipe. We'll say we want to predict whether a line is spoken by Aang or not using the text. (Looking at avatar_train again, I don't think we're actually going to use book; I probably could have left it out.) So the formula is aang predicted by text, and the data is avatar_train; the reason we give the data here is so the recipe knows the data types it's dealing with. Let's load a couple of add-on packages that have extra recipe steps. First, themis, which has recipe steps for dealing with class imbalance: we're going to use step_downsample on aang. This is a very simple approach where we just downsample, basically throwing away a lot of the "Other" lines so the classes are balanced. Believe it or not, sometimes throwing away a bunch of data gives you better results than keeping it, because it lets the model learn both the positive case and the negative case; if we did not do this, we would build a model that largely learned only the "Other" lines. Next, let's load the textrecipes package, because it has step_textfeature, and we'll apply that to the text column. What step_textfeature actually does is go to the textfeatures package and run all of those counting functions we just talked about, so it does largely what we just showed, which is great because now it's easily available as a pre-processing step inside our cross-validation folds, inside our modeling pipeline. Next, step_zv removes variables that have zero variance, which is nice because it gets rid of all of those columns we don't have any of, like the URLs. And then step_normalize, because at least one model we'll try is sensitive to centering and scaling, and normalizing centers and scales.

While I'm here, let's prep the recipe and juice it, just so you can see what happens. If you're modeling with a workflow, which is what I'm about to show, you don't use the prepped recipe directly, but it can be helpful for understanding. When you declare the recipe you are only specifying what you want to happen; you're not actually estimating anything. You're not finding out which observations get kept and which get thrown away, which predictors actually have zero variance, or what the means and variances are. That happens in prep. Then juice goes back into the prepped recipe and pulls out the training data that was used to compute it, so we can see what actually happened: what we get out at the end, how many features we have, and so on. (For example, if we skip the zero-variance step, normalizing those constant columns just produces NaNs, and the juiced data still contains all of those URL-type columns.)

Okay, let's keep going and talk about the models. Let's try two models with some sensible defaults. First a random forest; let's just make sure there are enough trees, use the ranger engine, and set the mode to classification, since we're asking whether Aang spoke a line or not. Then a support vector machine with the radial basis function kernel, which is pretty flexible. As with the random forest, you can tune values like cost, but let's just set defaults and see what kind of results we get without tuning. We'll use the kernlab engine, and again we need to set the mode to classification, because both of these algorithms can be used for either classification or regression. Then let's start a workflow, avatar_wf: call workflow() and add_recipe with our recipe.
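Here is a sketch of the recipe, the two model specifications, and the workflow; the 1000-tree and cost = 0.5 values are sensible placeholder defaults rather than tuned ones.

```r
# Sketch: pre-processing recipe, model specs, and a workflow holding the recipe.
library(themis)       # step_downsample()
library(textrecipes)  # step_textfeature()

avatar_rec <- recipe(aang ~ text, data = avatar_train) %>%
  step_downsample(aang) %>%           # balance the classes by downsampling "Other"
  step_textfeature(text) %>%          # textfeatures-style count features
  step_zv(all_predictors()) %>%       # drop zero-variance columns (urls, hashtags, ...)
  step_normalize(all_predictors())    # center and scale

avatar_prep <- prep(avatar_rec)       # estimate the steps
juice(avatar_prep)                    # processed training data, for inspection

rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

svm_spec <- svm_rbf(cost = 0.5) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

avatar_wf <- workflow() %>%
  add_recipe(avatar_rec)
```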
If we look at this workflow, notice that it doesn't have a model yet: it has a preprocessor but not a model. A workflow is a way to put together the pieces of your modeling process; they stick together like Lego blocks so you can carry them around. This one has a preprocessor but no model, and we'll add the model when we get ready to do the fitting. Let's also set up parallel processing here so we don't have to sit and wait too long.

So: take the workflow, add the first model, and call fit_resamples. We need to give it the folds we made, and let's add some non-default metrics with a metric set: we can keep the defaults, ROC AUC and accuracy, but also add sensitivity and specificity so we can understand better whether we do a better job predicting Aang versus not-Aang. Let's also save the predictions with control_resamples(save_pred = TRUE). Call this one rf_rs, set a seed, and get it going. While that's running, let's start another chunk down here for the support vector machine: take the support vector machine model specification, keep everything else the same, call the result svm_rs, and change the seed just for kicks.

Now we have our random forest result and can explore it a little. collect_metrics(rf_rs) gives us the metrics, and then let's do a resampled confusion matrix, where we just pass in the results; we can do the same for the other result. First, the random forest. Wow, impressive... not impressive, right? This is barely better than guessing. I mean, it is better than guessing, we have evidence of that, but barely. Also, the random forest does about as good a job on the positive and negative cases; it doesn't do a dramatically better job on one than the other. Here are the correct predictions for Aang and the incorrect ones (it's just okay), and the other way around, the correct Other and the incorrect Other.

Now the support vector machine. The ROC AUC is higher and the sensitivity is higher, but the specificity is worse, and that's because the support vector machine does a slightly better job of identifying the positive cases at the expense of the negative cases. Comparing the two, the support vector machine is doing a better job of finding Aang speaking, at the cost of incorrectly flagging lines that Aang did not speak as his, whereas the random forest is very close to just barely better than random guessing.
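Here is a sketch of the resampling fits and the metric collection described above; the seeds are arbitrary and the parallel backend is one common choice.

```r
# Sketch: fit both models across the folds, keeping extra metrics and predictions.
doParallel::registerDoParallel()

set.seed(1234)
rf_rs <- avatar_wf %>%
  add_model(rf_spec) %>%
  fit_resamples(
    resamples = avatar_folds,
    metrics = metric_set(roc_auc, accuracy, sens, spec),
    control = control_resamples(save_pred = TRUE)
  )

set.seed(2345)
svm_rs <- avatar_wf %>%
  add_model(svm_spec) %>%
  fit_resamples(
    resamples = avatar_folds,
    metrics = metric_set(roc_auc, accuracy, sens, spec),
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(rf_rs)
conf_mat_resampled(rf_rs)

collect_metrics(svm_rs)
conf_mat_resampled(svm_rs)
```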
Okay, so this is pretty interesting. First of all, as you may have guessed when you looked at those example spoken lines, there is just barely any signal here; we were barely able to learn anything from this delightful, fun dataset. But the random forest learned something different than the support vector machine did, and depending on how we tuned these models the behavior might shift around a little. Depending on your purpose, your business use case, and your goals, you may have different needs for how your model behaves, for what your particular trade-offs are, when faced with these kinds of differently performing models.

Let's move forward from here; I guess I'm on to evaluating our model. Because we haven't really done much with support vector machine models yet in these screencasts, let's take that one forward. First, to give you an idea of just how close to the edge of meaningful this is, let's look at the predictions. When we collect them, one column is the truth (who really said the line), one gives the hard class prediction, and then there are the predicted probabilities for each of the classes, and you can see they are so close to 0.5. That's the situation we're in.

Now we can make an ROC curve: give it the truth and the predicted probability for Aang, and then plot it, with 1 - specificity on the x-axis, sensitivity on the y-axis, color mapped to the resample id, and a dashed reference line (linetype = 2). Have you ever seen an ROC curve done in the style of Avatar: The Last Airbender? Because you're about to; that's what you get from watching this. Let's use scale_color_avatar, and I think it's time for the Earth Kingdom palette, plus coord_equal, which is nice for ROC curves. I tried a dark brown for the reference line and it looked terrible, so I'll just make it black and a little transparent, which covers a multitude of sins. This palette only has nine colors, so we're missing one of our folds, but this gives you an idea of the kind of performance we're seeing with this support vector machine model. Except for that one resample, the curves sit above the dashed line: there is some meaningful information in these text features, but barely; we are right on the edge of being able to learn something meaningful about the speaker with the model as we've set it up. So please enjoy this Avatar: The Last Airbender themed ROC curve plot.
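Here is a sketch of the resampled ROC curves for the SVM; the .pred_Aang column name follows the usual tidymodels naming for a factor level called "Aang", and the line styling is approximate.

```r
# Sketch: per-fold ROC curves for the support vector machine results.
svm_rs %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_curve(aang, .pred_Aang) %>%
  ggplot(aes(1 - specificity, sensitivity, color = id)) +
  geom_abline(lty = 2, color = "black", alpha = 0.5, size = 1.2) +
  geom_path(show.legend = FALSE, alpha = 0.7, size = 1.2) +
  coord_equal() +
  scale_color_avatar(palette = "EarthKingdom")
```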
All right, one of the reasons I was excited to do this model is that it gives us the opportunity to talk about variable importance in a new way. We're going to use the vip package for variable importance, as we usually do; it's a really great package. Let's start with the workflow we have, add the model we're using (svm_spec), fit the model to the training data avatar_train, and then pull out the fitted model. After that we're going to pipe it into the vi function.

The vip package has several ways of computing variable importance scores. There is such a thing as model-based importance: if you're fitting something like glmnet or a tree-based model, you typically use model-based importance, because the model itself gives you something that helps you understand what is important to it. In a linear model it's the coefficients; in a tree-based model you can look at all the trees and compute what matters to them. For some models, like a support vector machine, we don't have model-based importance, so we have to go to other options, and today we're going to use permutation-based importance. The way it works: you have a model that is kind of like a black box, and you want to understand what is driving its predictions, at least on a global scale. Permutation-based variable importance, as the name indicates, permutes the variables, shuffling them one at a time, then sees how much the predictions change, and scores how important each variable is based on how much the predictions change when it is shuffled. That's what we're going to do here, and it's different from how I've shown variable importance in previous videos.

So we set method = "permute", and since it has to do a bunch of calculations, we have to tell it quite a lot. We have to give it the target, meaning the outcome; in our case that's aang. We have to give it the metric to use for the comparison: the vip package calls it "auc" rather than "roc_auc", and since we have pretty imbalanced classes and we saw that sensitivity/specificity difference, I don't think we should use accuracy; I think we should use AUC. We have to give a reference class, and to be honest I always get this backwards; I think we want "Other" rather than "Aang" here, and if it comes out backwards we'll flip it. The next argument is pred_wrapper, which is what gets used to make predictions; we could write a little wrapper, but we can just point it at the underlying predict method from kernlab. Finally, the training data: here we actually want the juiced version of the prepped recipe rather than avatar_train, because avatar_train only has the text column, and "how important is text?" is not a useful answer; that's the whole thing we gave the model. We want to know how important the exclamations are, how important the lowercase characters are, and so on, so we want the juiced data. Let's call the result avatar_imp.
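Here is a sketch of that permutation importance call with vip::vi(); the argument names follow vip's vi_permute() interface, and reference_class = "Other" plus the kernlab predict wrapper are the choices discussed above, which may need flipping or adjusting.

```r
# Sketch: permutation-based variable importance for the fitted SVM.
library(vip)

svm_fit <- avatar_wf %>%
  add_model(svm_spec) %>%
  fit(avatar_train) %>%
  pull_workflow_fit()          # extract the parsnip fit (extract_fit_parsnip() in newer workflows)

set.seed(345)
avatar_imp <- svm_fit %>%
  vi(
    method = "permute",
    nsim = 10,                        # repeat the shuffling 10 times
    target = "aang",
    metric = "auc",
    reference_class = "Other",
    pred_wrapper = kernlab::predict,
    train = juice(avatar_prep)        # the feature columns, not the raw text
  )
```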
If we run this once, it does one round of the permutation; if we go back up and set nsim to 10, it does ten rounds, and as you can imagine, with a permutation-based calculation we do in fact want more than one round. Doing it once is not a great idea; we want to do it a bunch of times.

While that is running, let's set up the visualization we want to make. Remove the step_textfeature prefix from the variable names, make variable a factor reordered by importance, and pipe into ggplot with importance on the x-axis, variable on the y-axis, and color mapped to variable. Now we don't have just one value per variable; we have a measure of the center and the spread (a standard deviation), so we can draw points (let's make them big) and add error bars underneath, from importance minus the standard deviation to importance plus it. Let's get rid of the whole legend. It's looking good; we don't need that y-axis label, but we do need some Avatar colors, so let's use the Fire Nation palette to end. Oh, there are only eight colors, so let's take only the eight most important features with slice_max.

So what are the eight most important features for predicting whether a line is spoken by Aang or not? The number of capital characters, punctuation, extra spaces, exclamation points, and lowercase and unique characters. By far the most important appears to be capitalization, followed by punctuation and extra spaces, and then exclamations. This model shows that there is barely anything to be learned from these text features, and the error bars are very wide, but the differences that do contribute the most are capitalization, punctuation, and exclamations.

Well, that's certainly not the best-performing model I have ever built in my life, but that's what happens sometimes. This data is really fun, and I'm really glad I got to work with it, but I think that's just how informative these individual lines are for predicting who spoke them. We were able to build both a random forest model and a support vector machine model, and then we walked through how to compute variable importance even for kinds of models that don't have methods for model-based variable importance, this time using a permutation method. I hope this was helpful, and I will see you next time, when maybe we'll build a model that does a little better.
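Here is a sketch of that importance plot; the "textfeature_text_n_" prefix is the name step_textfeature() is assumed to prepend, and the StDev column is what vi() returns when nsim > 1.

```r
# Sketch: plot the eight most important features with their permutation spread.
avatar_imp %>%
  slice_max(Importance, n = 8) %>%
  mutate(
    Variable = str_remove(Variable, "textfeature_text_n_"),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  ggplot(aes(Importance, Variable, color = Variable)) +
  geom_errorbar(aes(xmin = Importance - StDev, xmax = Importance + StDev),
                alpha = 0.5, size = 1.3, show.legend = FALSE) +
  geom_point(size = 3, show.legend = FALSE) +
  scale_color_avatar(palette = "FireNation") +
  labs(y = NULL)
```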
Info
Channel: Julia Silge
Views: 3,023
Rating: 5 out of 5
Id: wd4MZHx9F9Y
Length: 61min 45sec (3705 seconds)
Published: Tue Aug 11 2020