Build features for machine learning from Netflix description text

Captions
Hi, my name is Julia Silge. I'm a data scientist and software engineer at RStudio, and in today's screencast we're going to use this week's Tidy Tuesday dataset on TV shows and movies on Netflix. We'll train a machine learning model to distinguish between the two, movies versus TV shows, based on how they are described in the Netflix description field. This uses natural language processing and the tidymodels framework for modeling with text features. If you watched my last screencast, this one is quite similar in that we use text features as input to a linear support vector machine, but we're going to create the features in a different way. Let's get started.

Okay, let's start exploring this dataset of movies and TV shows on Netflix. I'll load the data and take a look at what's in it. The type column records whether each title is a movie or a TV show, and then we have other variables like the title, the director, the cast, when the title was added to Netflix, when it was originally released, and so on. Another piece of information is the description field, which is text, so let's look at how long it is and what it's like. Let's take a slice_sample() of, say, 10 example titles and pull out the descriptions so we can read them. These aren't super long, but they are all roughly equal in length, which would be great if we were going to do deep learning. I don't think we are, though; we're just going to build a more traditional kind of machine learning model. So we have descriptions that tell us what these TV shows and movies are about.
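The loading and sampling steps above can be sketched roughly like this; the Tidy Tuesday URL and the object name netflix_titles are my assumptions, since the exact code isn't shown here:

```r
# Sketch of the setup; this URL points at the 2021-04-20 Tidy Tuesday dataset
library(tidyverse)

netflix_titles <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv")

# Peek at a random handful of description fields
netflix_titles %>%
  slice_sample(n = 10) %>%
  pull(description)
```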
Let's look at a few more examples; we can resample over and over to get a feel for what's in here. We have short-ish description fields telling us about the content of the TV shows and movies.

Before we start modeling, let's make a visualization of these description fields. Load the tidytext package, use unnest_tokens() to unnest the description field into a word column, remove stop words, and then count by type and word with sort = TRUE, so we can see what the most common words are in the two categories. For movies we see "life", "young", "new", "man", "family", and then "series" starts showing up for TV shows. Remember that there are about twice as many movies as TV shows in this dataset, which is why the movie counts are so much higher and why "series" doesn't appear until much further down; notice that "series" hardly appears with movies at all. When a description field contains the word "series", it shows up far more often for TV shows. This suggests we may well be able to train a model that distinguishes TV shows from movies based on their descriptions, especially when they mention words like "series" or "documentary". Let's make a quick visualization: group_by(type), slice the top 15 or so words by the count n, ungroup, and then use reorder_within() from tidytext to reorder word by n within type.
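The counting and plotting steps can be sketched as below; get_stopwords() and the exact slice_max() call are my assumptions about details not spelled out in the narration:

```r
library(tidytext)

netflix_titles %>%
  unnest_tokens(word, description) %>%
  anti_join(get_stopwords()) %>%           # remove common stop words
  count(type, word, sort = TRUE) %>%
  group_by(type) %>%
  slice_max(n, n = 15) %>%                 # top 15 words per category
  ungroup() %>%
  mutate(word = reorder_within(word, n, type)) %>%
  ggplot(aes(n, word, fill = type)) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +                    # required companion to reorder_within()
  facet_wrap(~type, scales = "free") +
  labs(x = "Word frequency", y = NULL)
```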
reorder_within() makes word a factor, which is nice for plotting. Then we pipe into ggplot() with n on the x-axis, word on the y-axis, and fill by type, and add a geom_col(); I don't think we need the legend. Because we used reorder_within(), we also need scale_y_reordered(), and then a facet_wrap() by type. That's almost right; setting scales = "free" fixes it. To clean it up a little, the x-axis is the number of uses of each word, i.e. word frequency, and I'll drop the y-axis label since it's obvious those are words.

So this shows the top words in Netflix descriptions by frequency, after removing stop words, for movies and for TV shows. For TV shows we see "series", "life", "world", "new", "friends", "family"; on the movie side we see "life" as well, plus "young", "new", "man", "woman", "love", "documentary". Some words are shared across the two sides and some differ. Our goal in this screencast is to train a machine learning model that looks at the words in the description field and learns to distinguish a description of a movie from a description of a TV show. We'll use a fairly straightforward model and preprocessing approach, but you really could train all kinds of different models for this. Today we're going to use tidymodels, and the first thing to do is split the data into testing and training sets: I take the Netflix titles dataset, keep only type and description, and use initial_split().
For the split I use stratified resampling, because we have some class imbalance: there are a lot more movies than TV shows, so I stratify by type. Then I create my training data from the split, and my testing data from the split. The next thing to make is resampled folds of data: I use cross-validation on the training data, again with stratification because of that class imbalance, and call the result netflix_folds. So now I have training data, testing data, and cross-validation folds.

What is the purpose of these datasets? This section is about spending our data budget. We only have a certain amount of data with which to train a machine learning model. A certain amount goes to training, and a certain amount is held back for testing until the very end; the purpose of the test set is to estimate how our model will perform on new data. Then we take the training data and create many simulated datasets, which we can use to estimate performance, compare models, or tune models. Think of this section as spending your data budget.

Now let's move on to feature engineering, because this text data needs to be fairly heavily preprocessed before it is ready for a machine learning algorithm. We create a recipe where the outcome is type (is it a TV show or a movie?) and the predictor is the Netflix description field; the data given to the recipe, the training set, is what the preprocessing steps will learn from.
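The data budget steps, the split plus the cross-validation folds, can be sketched like this; the seeds and object names are my own choices:

```r
library(tidymodels)

set.seed(123)
netflix_split <- netflix_titles %>%
  select(type, description) %>%
  initial_split(strata = type)   # stratify because of the class imbalance

netflix_train <- training(netflix_split)
netflix_test  <- testing(netflix_split)

set.seed(234)
netflix_folds <- vfold_cv(netflix_train, strata = type)  # 10 folds by default
```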
Now we go through the feature engineering steps for this text data. First I tokenize; for this example I'll tokenize to unigrams, the single words, although we could change that to something like n-grams. I'm not going to do that here; we're going for a quick, basic model. Next I filter the tokens: I won't keep every token in the whole dataset, because that would blow up the memory on my computer, and besides, infrequently used tokens are generally not helpful or predictive in a model. How many should we keep? Each of these cross-validation folds has about 5,000 titles in its analysis set, so let's keep the 1,000 most-used tokens. This is also where we could remove stop words, but I'm not going to; let's see whether the stop words end up being important to this model. Then I weight by tf-idf, specifying which variable I'm weighting. The model I'm going to use needs the variables centered and scaled, so at this point I normalize all numeric predictors. One more thing: let's go back up and load the themis package, which handles class imbalance, and add step_smote(). SMOTE is a data preprocessing step for upsampling: when we have class imbalance, with a whole lot more movies than TV shows, this step generates new examples of the minority class (here, the TV shows) using nearest neighbors of the existing cases. It creates new examples to put into the data so that the classes are balanced when we fit our model. The result is a model that is better calibrated and better able to recognize both the positive and the negative class; if we trained on data with so many movies and so few TV shows, the model would get really good at recognizing movies and not so good at recognizing TV shows. Upsampling to account for class imbalance lets it do a better job at both.

So that's our feature engineering. The next thing we need is the model we'll use for the fitting and training. I'm going to use the same model as in my last screencast, a linear support vector machine. These are so nice for text problems; they work really well in a lot of cases. As of the day I'm recording this video, this model is in the development version of parsnip, so you will need to install it from GitHub rather than CRAN, but it will go to CRAN soon. It has some really nice characteristics for dealing with text, often works very well, and doesn't really have any important tuning parameters, so it typically just works as is. We put the model together with the feature engineering recipe in a workflow, and the result looks like this: our recipe with its five steps, and our linear support vector machine, set up to do classification.
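Here is a sketch of the recipe, model, and workflow described above; the LiblineaR engine is my assumption for what the development-version svm_linear() used at the time, and max_tokens = 1e3 matches the thousand-token choice:

```r
library(textrecipes)  # text preprocessing steps for recipes
library(themis)       # step_smote() for class imbalance

netflix_rec <- recipe(type ~ description, data = netflix_train) %>%
  step_tokenize(description) %>%                      # unigram tokens
  step_tokenfilter(description, max_tokens = 1e3) %>% # keep 1,000 most-used tokens
  step_tfidf(description) %>%                         # weight by tf-idf
  step_normalize(all_numeric_predictors()) %>%        # center and scale for the SVM
  step_smote(type)                                    # upsample the minority class

svm_spec <- svm_linear() %>%
  set_mode("classification") %>%
  set_engine("LiblineaR")

netflix_wf <- workflow() %>%
  add_recipe(netflix_rec) %>%
  add_model(svm_spec)
```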
Now it's time to estimate how well this model does. I set a seed and use fit_resamples() from tune, passing in my workflow and my resamples, the cross-validation folds. I also set a couple of options: instead of the default metrics I want accuracy, recall, and precision, so I can understand how we'll perform for both the positive and the negative cases, and I set save_pred = TRUE so that I can make a confusion matrix afterward. Let's call the result svm_rs, for results. What is happening here? For each of the 10 cross-validation folds, the feature engineering recipe and the model are estimated, or fit, on the analysis portion of the fold and then evaluated on the assessment portion. Think about what goes on inside the feature engineering recipe: things like the tf-idf weights, and the mean and standard deviation used for normalization, are all learned from the analysis set and then applied to the assessment set. Even that last SMOTE step: only the analysis set is upsampled. When we evaluate, we evaluate on the data as we would find it "in the wild", in its original proportions.

Once that's done, we can call collect_metrics() on the results to see how it did. This isn't the world's most fantastic model, as we probably expected, but notice that it's pretty balanced between precision and recall, and that's probably because we did the upsampling. Next, let's call conf_mat_resampled() on svm_rs. Remember that this object contains 10 sets of predictions, so if we say "make a confusion matrix", it's a little unclear what we mean; this function computes a confusion matrix appropriate for resampled results by averaging across the resamples. Called with no extra arguments it returns a tidy table; with tidy = FALSE we get the layout you might be more familiar with, and you can see we're doing medium-okay here. We can also call autoplot(), and there we see the class imbalance again. Truth is on the x-axis, and the biggest block is the movies being predicted as movies, while over on the TV show side more than half of the TV shows are predicted as TV shows. It's tougher to get the TV shows right than the movies, maybe just because of the class imbalance, but that's the kind of result we get from a very first, simple, straightforward model that doesn't involve anything difficult or fancy.

Now let's talk about what you would do if this were the model you decided to move forward with. You would go to the function last_fit(): take our workflow and call last_fit() with the split object. Remember, the split object has both testing and training in it, and we have not used the testing data at all so far, so this is the first time we're going to use it.
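The resampled evaluation can be sketched like this, under the same object names assumed above:

```r
set.seed(123)
svm_rs <- fit_resamples(
  netflix_wf,
  netflix_folds,
  metrics = metric_set(accuracy, recall, precision),
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(svm_rs)

# Confusion matrix averaged across the 10 resamples
svm_rs %>%
  conf_mat_resampled(tidy = FALSE) %>%
  autoplot()
```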
We use the same set of metrics as before, and let's call the result final_fitted. We can call collect_metrics() on it, and we expect roughly the same kind of metrics we got before. What we're doing now is fitting on the entire training set one time and evaluating on the testing set; before, we were using those small simulated resampled datasets. So when we call collect_metrics() here, the metrics are on the testing set, which is the first time we have done that. We can also collect the predictions, as we could have on the earlier object, and that gives us predictions on the testing set; if you notice how many rows are in the result, these are the testing set predictions, not the training set. We can make a confusion matrix here as well, giving the column with the true values (movie or TV show) and then the predicted class, and we get about the same proportions we had before. We could autoplot() that too, and it looks about the same. Consistent results are good news, of course: it means we're not suffering from data leakage and we're not overfitting to our training data.

All right, so say you've evaluated the model and now you ask: where is the fit I would actually use moving forward? Take that final_fitted object; I don't know that I ever showed what it looks like, but it's a tibble containing metrics, predictions, and, among other things, a workflow. That workflow is a fitted workflow, and it's the thing you can save and serialize, for example with write_rds(), and put into production: serve it via an API, put it in a Docker container, and so on. That's what you would use moving forward.

What I'll show next is a bit of variable importance, i.e. how to understand what pushes predictions one way or the other. I take the fitted workflow and pull the parsnip fit out of it. Think of a workflow as containing two things, the feature engineering and the model algorithm; what I just pulled out is the model fit. If you wanted, you could also pull out the recipe, but there's no reason to here. Now we can tidy() that fit. Remember, we trained a linear SVM, so we get linear coefficients out of it, and here they are: the terms are the tf-idf of the words from the description field, listed alphabetically at first ("1970s", "1980s", "about", "accidentally"). Let's arrange by the estimate instead. These are the words that contribute most to driving a prediction toward TV show: "series", "docuseries", "adventures", "group", "world". So I didn't take out stop words, and one of them did end up being important here. Also "crime", "school", and "crimes"; "crime" showing up more for TV shows than for movies is kind of funny. Now the other direction, driving predictions toward movie: "documentary", "biopic", "performance", "how".
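The final fit and coefficient inspection can be sketched as follows; pull_workflow_fit() is my assumption for the extractor used at the time of recording (newer tidymodels versions call it extract_fit_parsnip()):

```r
final_fitted <- last_fit(
  netflix_wf,
  netflix_split,
  metrics = metric_set(accuracy, recall, precision)
)

collect_metrics(final_fitted)      # metrics on the testing set
collect_predictions(final_fitted)  # predictions on the testing set

# The fitted workflow is the object you would save and deploy
netflix_wf_fitted <- final_fitted$.workflow[[1]]

# Pull out the parsnip model fit and look at the linear SVM coefficients
netflix_fit <- pull_workflow_fit(netflix_wf_fitted)
tidy(netflix_fit) %>%
  arrange(-estimate)
```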
"How" is another stop word, but apparently one that is used more in movie descriptions than in TV show descriptions. "Stand" and "comic": I bet those come from "stand-up comic". And "film": people use the word film to describe movies, but not TV shows. Okay, so as the last thing to wrap up, let's take this tidied output and build a visualization of the most important terms on each side. First, filter out the "Bias" term; it means the same as an intercept, and we don't need the intercept on our plot. Then group by the sign of the estimate and take the top 15 terms by the absolute value of the estimate, which gives us the top 15 words on either side; while we're at it, let's give that a name. Then use str_remove() on term to strip off the "tfidf_description_" prefix, since we don't need that, leaving just the words. Next, change that sign value using if_else(): when sign is TRUE, say "More from TV shows", and when it's FALSE, say "More from movies". Now pipe straight into ggplot(): the absolute value of the estimate on the x-axis, term on the y-axis, fill by sign, a geom_col() (we don't need the legend), and a facet_wrap() by sign. The first try doesn't look right; we need scales = "free", and we need to reorder term by the absolute value of the estimate, which I may as well make a column. Finally, fix the labels: the x-axis is the coefficient from the linear SVM model, and the y-axis label can just be NULL.
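That plot can be sketched like so; note that which sign corresponds to which class depends on the factor level order, so it is worth sanity-checking the labels against a known word like "series":

```r
tidy(netflix_fit) %>%
  filter(term != "Bias") %>%                 # drop the intercept
  group_by(sign = estimate > 0) %>%
  slice_max(abs(estimate), n = 15) %>%       # top 15 terms on each side
  ungroup() %>%
  mutate(
    term = str_remove(term, "tfidf_description_"),
    sign = if_else(sign, "More from TV shows", "More from movies")
  ) %>%
  ggplot(aes(abs(estimate), fct_reorder(term, abs(estimate)), fill = sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sign, scales = "free") +
  labs(x = "Coefficient from linear SVM", y = NULL)
```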
This visualization answers the question: which description words are most predictive of a title being a movie versus a TV show? This is pretty interesting, actually: "adventures", "crime", "crimes", "fight", "personal", "chance" come from TV shows, whereas "summer", "biopic", and "violent" are more from movies. That's the kind of information we can get even from this pretty straightforward model. This particular dataset might also be a good fit for trying more sophisticated models, like deep learning models that take word position into account.

So, to recap: we created features for machine learning from the text in the description fields for these Netflix TV shows and movies, and then used those features in a linear support vector machine. This is often a great basic option for modeling with text, and the results we got were pretty decent in my opinion, especially as a baseline to start with or to compare against. Notably, precision and recall were fairly balanced, and that's because of how we handled the class imbalance with upsampling. I hope this was helpful, and I'll see you next time!
Info
Channel: Julia Silge
Views: 4,253
Rating: 5 out of 5
Id: XYj8vyK864Y
Length: 34min 3sec (2043 seconds)
Published: Fri Apr 23 2021