Tidy Tuesday live screencast: Analyzing Netflix titles in R

Captions
Hi, I'm Dave Robinson. Welcome to another screencast, where we'll be using R and RStudio to analyze data I've never seen before. As usual, this data comes from the Tidy Tuesday project, an amazing weekly data project in R run by the R for Data Science learning community. And as usual, if you're tuning in live, please do feel free to join in the chat, say hi, and especially, as we go, ask questions and give ideas for visualizations and analyses that I might not think of. It's one of my favorite parts of doing these screencasts live.

So let's see what we have this week. It's Netflix titles. It looks like it's a list of TV shows and movies available on Netflix as of 2019, and it could integrate with other data sets. All right, I think this is going to be fun. Let's read in the data: library(tidytuesdayR), then use_tidytemplate, created for today. I usually do a few modifications first when I go live; I really should create my own template, shouldn't I?

Let's look at this data. The names are in snake case, the year is a year, show_id starts with an "s", that's fine, and it has type, so let me count type. Cool. Let me try counting year; it's release_year. Okay, it's numbers. I'm just going to do a little bit of checking: min(release_year), max(release_year). This is a clean data set, great. Did they say it came from Kaggle? Yes, it came from Kaggle. So let's do netflix_titles <- tt$netflix_titles, and that's the main data set. We have 7,700 rows. Is this all the... no, let's see. It says that since 2010 the number of movies has decreased by 2,000 while the number of TV shows has nearly tripled. So is this all the movies that are available? Maybe. I don't know if
somebody knows this from the documentation; we can look at the documentation quickly. All right, it looks like maybe it is. It mentions finding similar content with text-based features. Okay, cool. So let's see what we can learn from it.

First, I'm interested in the distribution of release years of both movies and TV. So I can histogram that with binwidth equal to, I don't know, every five years, and I can also say fill is type. And we see, okay, most material is recent. Instead of fill = type, maybe I do a facet_wrap by type, with ncol = 1 for stacking them on top of each other, and scales = "free_y". I'm just curious whether there's a difference in the shift, or something like that. I think the TV shows are even more shifted toward recent years than the movies are. This isn't the perfect way to tell that. I think the best way to do it might be to count by decade, where decade is 10 times release_year truncated-divided by 10: 10 * (release_year %/% 10).
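The decade bucketing described above can be sketched on a few made-up rows. The `titles` tibble is a hypothetical stand-in for the real `netflix_titles` data, holding only the two columns this step needs.

```r
library(dplyr)

# Hypothetical stand-in for netflix_titles (values are made up)
titles <- tibble(
  type = c("Movie", "Movie", "Movie", "TV Show", "TV Show"),
  release_year = c(1968, 2011, 2016, 2019, 2020)
)

# %/% is integer division, so 2016 %/% 10 is 201 and 10 * 201 is 2010
by_decade <- titles %>%
  count(type, decade = 10 * (release_year %/% 10))

by_decade
```

Changing both 10s to 5 or 2 gives the five-year and two-year buckets tried later in the screencast.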
And type. Then I can quickly group_by(type), say percent is n over sum(n), and plot decade against percent with color = type. I'm trying to look at some other ways to view this distribution. This is very similar to the graph above. Basically, we do have more movies that are a little bit older, while TV shows are really overwhelmingly from the 2010 decade. Of course, this most recent bucket is only about two years, 2020 and part of 2021. If I change this for a moment from decade to every five years, do the conclusions change a lot? Well, we really see it; that's so dramatic. Let me try every two years. This is another way we can see it, where this is the most recent, 2020 and 2021. I don't really need decade; I'm going to do more interesting things with this. I just wanted to get a sense of it, so I count release_year and type. Okay, for 2021 there's a breakdown, but I get it: most TV shows came out in the last few years.

All right, so that's what we're learning about release dates. We could also learn about things like country. I wonder what listed_in is going to be; I think it's interesting. Based on the Kaggle description... oh, it's genres. I was thinking maybe it was countries that a title is available in, but no, what I see in listed_in is "Horror Movies", "International Movies". So we definitely can do things with genres, we can do things with countries, and we can do things with ratings. We've done a couple of movie data sets; here we have TV-14, TV-PG. And I see one column we might need to clean, and that's duration. I'm curious about the duration. Let's see, if I took this data set and said separate
duration, with sep as a space, into duration and units, I wonder how that would work out. If I count units, it's always "min", "Season", or "Seasons". Okay, that's cool. And I can throw in convert = TRUE, at which point duration becomes a number. That's handy, because the number is what I really care about, and I'll call the other column duration_units.

Why do I want to do that? Because I might want to know something like: are movies getting longer over time, at least among the ones that are on Netflix? I could filter for type == "Movie" and mutate decade as 10 times release_year integer-divided by 10. I'm a little annoyed because the old decades are going to have very little data. Then ggplot decade by duration with geom_boxplot, and because decade is numeric I'm going to need to add group = decade. So one thing is that the longer movies it has tend to be from the 60s, and this was definitely an age of epics. I think Lawrence of Arabia was in the 60s; maybe not, it might have been the 50s, I can't remember. But mostly duration has settled just short of a hundred minutes.

I could also ask other questions. I could create a function called summarize_movies, knowing that I'm going to do this repeatedly, where I take a table, group by something, and summarize. Hmm, it's not going to be that interesting, because I could do something like average duration, but that's not that interesting. No, I'm not going to create a function yet; maybe I will later. What I can do instead is take netflix_titles and... let's look at something else.
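A minimal sketch of the duration cleaning step, assuming the column holds strings like "93 min" or "2 Seasons". The sample rows are made up; `duration_units` is the column name used in the screencast.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows in the same shape as the raw duration column
titles <- tibble(
  type = c("Movie", "TV Show", "TV Show"),
  duration = c("93 min", "1 Season", "4 Seasons")
)

# Split on the space; convert = TRUE turns the numeric part into a number
cleaned <- titles %>%
  separate(duration, c("duration", "duration_units"), sep = " ", convert = TRUE)

cleaned
```

Because movies are measured in minutes and TV shows in seasons, any summary of `duration` only makes sense after filtering or grouping by `type`.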
Let's look at genres. I'm still bouncing around a little bit, but let's look a little bit at genre. I can do separate_rows on listed_in, with sep as a comma. Now I can count these genres. Here we go; it looks like I have a space in these, so sep is comma-space, and that looks good. So we see international movies, dramas, comedies, and so on. Then I could group by listed_in and summarize: n is always good, and I'm also going to arrange descending n, plus median duration. You know what, I'm going to group by type and listed_in, and I'm going to call it genre = listed_in, and median_duration is median of duration. How's that look? The median is a bit funky when we're looking at TV shows; they'll typically be one season, so one season being the median makes a lot of sense. I'll leave it here anyway, and then I'll add filter type == "Movie". Is this too many to do in a visualization? It's not really too many. So I'll do median_duration by genre with geom_col. "Movies" is a category; that seems like a bad category, with a median duration of 44 minutes. That looks a little off, so I'm going to filter for genre not equal to "Movies". I'm also going to do genre = fct_reorder(genre, median_duration). So stand-up comedy tends to be about an hour, and the classics tend to be longer, but mostly they're in the middle.

I can do more things on duration; I could do the same thing by rating. See, this is where I might start writing that function, summarize_movies, or summarize_titles, because it's both movies and TV shows. I like to have something that grabs this kind of summary, and I'll even add a median.
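The genre split and per-genre summary can be sketched as follows. The rows are made up, and `duration` is assumed to be numeric already (minutes, for movies).

```r
library(dplyr)
library(tidyr)

# Hypothetical movie rows; listed_in holds comma-separated genres
titles <- tibble(
  title = c("A", "B", "C"),
  type = "Movie",
  listed_in = c("Dramas, International Movies", "Comedies, Dramas", "Stand-Up Comedy"),
  duration = c(120, 95, 60)
)

# One row per title-genre pair, then summarize within each genre
by_genre <- titles %>%
  separate_rows(listed_in, sep = ", ") %>%
  group_by(type, genre = listed_in) %>%
  summarize(n = n(), median_duration = median(duration), .groups = "drop") %>%
  arrange(desc(n))

by_genre
```

Note that a title with several genres is counted once in each of them, so the genre totals add up to more than the number of titles.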
So median_year is median of release_year. I like to have this so I can aggregate really quickly, and then I can look and ask, if I'm curious, whether some of these are more recent than others. Okay.

All right, let me think about what to look at next. I'm definitely going to be doing some tokenizing of descriptions and connecting text to genres, things like that. I wonder how many will have a cast; you can find a lot of connections there. Oh, date_added: here's a column that is not that clean yet. What I'm going to do is add a step to our cleaning process. There's a function in lubridate called mdy that should parse this really nicely. I'll say date_added = mdy(date_added) and turn this into a date column.

Okay, let's ask about date added. First, how many were added per year? We're probably missing some: we're not going to have any titles that have been removed, so that's a selection bias we have in this data. But I start with something like count year_added = year(date_added), from lubridate, and filter !is.na(date_added) so we drop the missing ones. And then I can say, oh, not a lot of the movies in this data have been around that long, which makes some sense, because I don't know when exactly the streaming service was introduced. But what were the first ones added? If I arrange by date_added, I get the first titles added that are still present; select type, title, and date_added. All right, not really a very memorable set of TV shows or movies. These are ones that have been on for a really long time; I don't think I've seen any of these. I'm just noticing that of the ones that are still on, most of them were added recently.
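A small sketch of the date cleaning step, assuming `date_added` is stored as text in a "Month day, year" format. The example strings are made up.

```r
library(lubridate)

# Hypothetical raw values in month-day-year order, including a missing one
date_added <- c("August 14, 2020", "January 1, 2008", NA)

parsed <- mdy(date_added)    # Date vector; the NA stays NA
year_added <- year(parsed)   # numeric year, NA where the date is missing

parsed
year_added
```

Rows where `year_added` is `NA` are the ones the `!is.na(date_added)` filter drops before counting.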
In the last few years, so I'm not going to get much from 2015 or before. What I'm going to do is mutate year_added as pmin, you know, pmax of year(date_added) and 2015. And I'm going to count year_added and... what am I going to do this by? Genre? Heck, I'm going to do it by rating. I don't know, I'm trying out a few hypotheses. Oh no, I'm going to start with type. We've seen that most of the TV shows it has are recent TV shows. Is that true of when things were added? Among things added in the last five years, did they change in terms of type over time? What I want to do is year_added by n, fill = type, with geom_area. We've done a lot of area plots recently; it's a good way to visualize this. So, of course, not a lot of things were added in 2021: we're only part of the way through the year, and I'm not sure how recent this data is. But mostly it's just been growing in both. I could represent it as a percentage; I don't know whether I have to in this case.

The next thing I'm going to add is rating. You know what, I'm going to throw this into our cleaning step, just in case: year_added = year(date_added), so I don't have to keep recreating it. Now if I do fill = rating, has that distribution shifted? Now that I think of it, I'm definitely going to do some fct_lump. I can do it inside count: rating = fct_lump(rating, 6), the top six. And I'm going to filter !is.na(rating) so that doesn't get stuck inside. Oh, I neglected to do that step where I said year_added is pmin of year_added and 2015.
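The two clamping functions are easy to mix up, so here is a small base R sketch of the difference on made-up years. `pmax` takes the element-wise maximum, which is what folds everything before 2015 into the 2015 bucket.

```r
year_added <- c(2008, 2013, 2015, 2019, 2021)

# pmax: every year below 2015 is raised to 2015 ("2015 or before")
clamped_low <- pmax(year_added, 2015)
clamped_low  # 2015 2015 2015 2019 2021

# pmin does the opposite: every year above 2015 is lowered to 2015
clamped_high <- pmin(year_added, 2015)
clamped_high  # 2008 2013 2015 2015 2015
```

Clamping keeps the early titles in the counts, whereas a `filter(year_added >= 2015)` would drop them.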
I did pmin again; I meant to do pmax, where I want to say, start our data in 2015 or before. All right, so is this rating distribution changing? We'd have to change these to percentages to find out. So I'd change this to group_by(year_added); this is a thing called a spinogram. percent is n over sum(n), and the plot needs to use percent. You know, I'm going to order these... you know, I should do these in movies and TV separately. Okay, what I'm going to do is group_by(type), mutate rating = fct_lump(rating, 4), so let's do the top four in each, and let's also throw in type. This is broken now. Oh, because I added type: count type, rating, and year_added, then facet_wrap by type. I may need to do a little bit more modification. No, this one seemed to work out. I probably should filter out 2021.

But what we're seeing within movies... why do I only have this many? I grouped by type. One thing I don't understand is what happened to the rest of my ratings. I did fct_lump of rating, 4. I really expected R and PG-13; where are my other movie ratings? I grouped by type right here. Am I missing something? If I filter for type == "Movie", that's odd: I'm getting TV-14, TV-MA, TV-PG among movies. Let's take a quick look at that. That might be a data problem, or it could be TV movies. Seems a bit odd. I think a lot of these are foreign movies. Yeah, it could just be that they use these TV ratings. It does mean this isn't quite as intuitive as I was expecting it to be. For whatever reason, we have these movies that were added that have
TV-MA, TV-14, et cetera instead of the... oh, I guess only American movies would get R, PG-13, and so on. Okay, now I think I understand a little bit better.

There's a question: why use pmax, why don't I use a filter? I wanted to keep all the things from before 2015 in the data set, but I don't know that's the right choice. Let's try filter; it's not going to make a big difference: year_added >= 2015. So I see some shifts. Maybe there are more American movies coming out with R and PG-13 ratings, and fewer with, say, TV-MA. I think it's a little hard to interpret this, so I'm going to keep on moving.

Let's talk duration and country. I haven't done much with country yet, so let's look a little bit at country. I'm still getting a feel for what is interesting here. If I do filter !is.na(country), count country, and let's throw in type because we have movies and TV shows. And what if I did country = fct_lump(country, 6)? Let's make it ten, or nine, so we have ten total including "Other". And we do n by country, fill = type, geom_col, and I'm going to throw in country = fct_reorder of country by n. I love this kind of visualization. We can see something like: what are the countries that contribute the most movies and TV shows to Netflix? I'm going to increase the number of countries to make that "Other" a little bit smaller. There's a really long tail; it looks like a lot of things fall into other countries. This is cool: there are some countries that are mostly movies. India is almost entirely movies, and Egypt is almost wholly movies. Then there are a few countries, Japan and South Korea, that are mostly TV shows; also Taiwan. You know, it's almost like we could throw this onto a map, because it does feel like there's a pattern.
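The lump-and-reorder step for countries can be sketched like this. The rows are made up, the data keeps only the top 2 countries rather than 9 to stay small, and passing `sum` to `fct_reorder` is an assumption about how the bars were ordered, since each country appears once per type.

```r
library(dplyr)
library(forcats)

# Hypothetical titles: a country of origin and a type for each
titles <- tibble(
  country = c("United States", "United States", "United States",
              "India", "India", "Japan", "Egypt", NA),
  type = c("Movie", "TV Show", "Movie", "Movie", "Movie",
           "TV Show", "Movie", "Movie")
)

by_country <- titles %>%
  filter(!is.na(country)) %>%
  # keep the 2 most common countries, lump the rest into "Other"
  count(country = fct_lump(country, 2), type) %>%
  # order the factor by total titles so the bars come out sorted
  mutate(country = fct_reorder(country, n, sum))

by_country
```

From here, `ggplot(by_country, aes(n, country, fill = type)) + geom_col()` would give the stacked bar chart described above.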
Let me see: Southeast Asia and Pacific islands, like the Philippines and Indonesia, are all movies. Then we've got Japan, South Korea, and Taiwan that are all TV shows. Let me make it a little longer to see if there are any other trends you can see here. And Germany: some TV shows, lots of movies. Okay, it's just interesting in terms of Netflix's content acquisition strategy.

Now let me ask about, let's say, duration and country, because that was what I started looking at here. Let's filter for type == "Movie", group by country, and use summarize_titles, the function I created. And it's like, oh, okay... wait, is this right? Yeah, this is right: most of the movies are from the United States, then India, then the United Kingdom. I didn't leave in the "Other" here. Is there a difference in terms of duration? It looks like there is. The movies from India are longer; Egypt's are a little longer, as are Turkey's and the Philippines'. I could do this in a visualization; I'd probably want confidence bounds, though I might not quite need them. It's just one thing I noticed when comparing how long movies from the United States, India, the United Kingdom, and Canada are.

I see a question: do R-rated movies come from specific countries, or are these the countries where Netflix shows the harder stuff? You know, I think these are the countries that the film or TV show was made in, not where it's available. Can I find out where it's available? I'm not sure that anything here says what countries a title is shown in. If I look, for instance, at description, is there some information there? I don't see anything. Nope, I don't see where they're shown, but I can see where they come from. There's TV... but
then the other thing I saw is that the R rating was mostly from the United States. So if I did, for instance, filter rating == "R", most of these movies are from the United States. But that's true overall, and then it seems like some movies from the UK, Canada, and other English-speaking countries get counted in. I also see there are some that are from multiple countries. I was wondering for a second whether these are the countries a title is available in. No, I think this is the country of origin. We can actually check the documentation: the country where the movie or show was produced. Yeah, all right. So this is what R was. But I could say rating in R or TV-MA, and now I do get ones from India, the United Kingdom, and so on. So that comes with television shows, but probably also covers movies that are only on TV.

Oh, I'm actually interested in ratings. We looked at duration: we saw that movies from India and Egypt are longer, and movies from the UK and US tend to be 90 minutes. What if I look at rating and country? I could mutate rating = fct_lump(rating), lumping together TV and movies... no, I'm not going to do that. What I'm actually going to do is something different. I'm going to group by country = fct_lump(country, 9), the top nine countries, and then summarize pct_mature, because I only want to distinguish between R or MA and everything else. Oh, I'll do filter !is.na(rating) and take the mean of rating in either "R" or "TV-MA". Are those right? If I count rating, I see... yep, R and TV-MA. Okay, so I'll do this summary, add n = n(), and arrange descending n. And we see, okay, India is only 25 percent mature rating, Japan 36 percent,
Spain 82 percent. Oh, so we can see some neat variations here. I wonder, if I group by both type and country, and throw in a step to drop the groupings here: United States, 50 percent of movies and 40 percent of TV shows are mature. Oh, I should get rid of country NA: filter !is.na(rating), !is.na(country). Are any of these really small numbers? Yes, some of them are. So I'm going to do something nifty here: I can actually compute confidence bounds for these. Right now I've got the number that are mature. I want a visualization that asks: are the titles Netflix has from some countries more mature than others in terms of ratings? Again, I'm lumping together R and TV-MA. Are there any other mature ratings that I should include? Count rating... TV-MA, and we also have NC-17, which they don't really do anymore, or let's say Netflix doesn't show it. "NR", not rated, may or may not be mature, but I'll skip it.

All right, so the reason I kept n alongside these is that I can do something cool. I can say pct_mature is n_mature over n, but I can also get confidence bounds. Who's seen this trick before, to get a lower confidence bound from n_mature out of n? I can do qbeta, and let me see if I can get this right: 0.025, since I want a 95 percent confidence bound, then n plus 0.5 and n_mature minus n... nope, other way around: n_mature, and n minus n_mature. That's the number of positives and the number of negatives, both adding 0.5. This is called the Jeffreys confidence interval; qbeta is a quantile of the beta distribution. I could have used prop.test, I guess, but this trick is nice. For the upper bound, oops, change it to 0.975. It's a trick for getting confidence bounds, in terms of, okay, I think the average is fifty percent, what are the low and high ends of the interval? Some number is wrong here.
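A minimal base R sketch of the Jeffreys interval trick described above. The helper name `jeffreys_interval` and the example counts are made up for illustration; the formula is the one spoken in the screencast, with 0.5 added to both the positive and negative counts.

```r
# Jeffreys interval for a proportion: quantiles of a
# Beta(successes + 0.5, failures + 0.5) distribution
jeffreys_interval <- function(n_mature, n, level = 0.95) {
  alpha <- 1 - level
  data.frame(
    pct_mature = n_mature / n,
    conf_low   = qbeta(alpha / 2,     n_mature + 0.5, n - n_mature + 0.5),
    conf_high  = qbeta(1 - alpha / 2, n_mature + 0.5, n - n_mature + 0.5)
  )
}

# Same observed rate (50%), very different sample sizes
bounds <- jeffreys_interval(n_mature = c(5, 500), n = c(10, 1000))
bounds
```

Because `qbeta` is vectorized, the same expressions can go straight inside a `mutate()` on the per-country summary, and the small-sample rows come out with visibly wider bounds.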
Oh, this should be sum. Much better. Okay, now I have confidence bounds, and now that I have them I see that a lot of them are pretty wide. Now what I can do is plot pct_mature by country with geom_point, size = n, so I get a point for each of these, color = type, and geom_errorbar. Actually I'm going to do geom_errorbarh, because I can use xmin and xmax: xmin is conf_low, xmax is conf_high. And I can do scale_x_continuous(labels = percent). While I'm at it... I don't really need to expand to zero; there's no reason to believe zero is that special. But I'm going to say expand_limits(x = 0).

So, what countries have the highest percentage of titles that are R or TV-MA, in terms of the content that Netflix has? Spain is largely mature content. For South Korea, the movies are mostly mature, while the TV shows are basically down the line. Egypt may be a little bit under; Canada, kind of. But generally, movies are more likely to be rated R than TV shows are to be rated TV-MA; you can see that across most of these. Last thing: I want these to look a little less like TIE fighters, so I set height.

Now I get a question from Chanini, which is a perfect question: is it possible to see modeling with the description? I agree, I'm really excited about doing some modeling here. We're going to turn this into a prediction problem, and I'm thinking about some of the things we can predict. Just for fun, we could predict the probability something's mature. If I had a quality rating, I'd predict good or bad; we've done that on IMDb before. Here we don't have good or bad; we have
their rating, and we have their date added. I'm not going to predict what year they were added; I don't think that's exciting. I could predict duration. So I've actually got two things I can predict. I'm going to start by predicting rating, because I think it's going to be fun to see what words, genres, or other things predict rating. I haven't looked much at genre yet; we certainly can. But let's get to some predictive models; I think it sounds awesome.

So let's do some text mining first. I want to filter !is.na(description); I think I saw some of them were null. No, maybe none of them were. I'm going to load up tidytext. For more on tidytext, see the book Text Mining with R, published by myself and Julia Silge. Then unnest_tokens(word, description) and count word. What are the most common words in descriptions? Real boring; look at that boring stuff. But I can anti_join with a pre-built set of stop words from tidytext, and now if I count them it's at least a little more interesting: life, family, world, love, woman, friends, series, documentary, school, home, and so on.

So let's ask something small, like what is the difference between movies and TV shows. If I count type and word (oops, I first did title and word), then I can spread type by n with fill = 0. Whoops, that's too big; I meant to do spread(type, n). Before that, I can use the snakecase package: library(snakecase). I like using to_snake_case for this, so type becomes movie and tv_show. Then I could do something like total = movie + tv_show, arrange descending total, take the top hundred in terms of total, and then start plotting movie by tv_show to see which words appear in one but not the other. I've kind of got a scatter plot in mind.
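The tokenizing step can be sketched on two made-up descriptions. `words_unnested` is the name used in the screencast; the sample text is invented.

```r
library(dplyr)
library(tidytext)

# Hypothetical titles with short descriptions
titles <- tibble(
  title = c("A", "B"),
  type = c("Movie", "TV Show"),
  description = c("A family moves to the city",
                  "The series follows a family of friends")
)

# One row per word (lowercased), then drop common stop words
words_unnested <- titles %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words, by = "word")

word_counts <- words_unnested %>%
  count(word, sort = TRUE)

word_counts
```

`stop_words` is the data frame that ships with tidytext; words like "the" and "a" disappear, and "family" remains with a count of 2.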
It's going to have log scales on both axes, and label = word. It's geom_text, not geom_label, right? And word is what I was looking for, not title. This is not that exciting. The problem is that I'm doing this based on totals, and that's cutting this off in a funny way. This is the wrong way, I just realized, to ask what words are more common. I can see, for example, that "series" is much more common in TV shows, but this is not quite the way to look at it.

There's a package for this, actually: tidylo, for tidy log odds, if I want to compare between these two sets. You've seen me use this before; it's a package by Julia Silge. I take words_unnested, and I need to first count type and word, and then I do bind_log_odds where the set is type, the feature is word, and I need to have n. I get the relative ratio. So if I arrange descending log_odds_weighted, I find out that by far the word that most appears in TV shows is "series", but also "adventures", "world", and "docuseries". This honestly makes a whole lot of sense. Words that appear more often in movies include "performance".

We can visualize this. The log odds ratio would be like the log of how frequent a word is in TV shows versus how frequent it is in movies. So I group_by(type), take top_n(10, log_odds_weighted), the top 10 in each type, and ungroup. Then I plot word and log_odds_weighted with geom_col; it could use a little more effort than this. You've seen me do the same for, for example, comparing Taylor Swift and Beyoncé. With scales = "free_y", here we go: what words are most overrepresented in TV series or in movies? For TV shows we see things like series, adventures, world, docuseries, friends.
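A minimal sketch of the weighted log odds step with tidylo. The counts below are made up; in the screencast they come from `count(type, word)` on the tokenized descriptions.

```r
library(dplyr)
library(tidylo)

# Hypothetical word counts per type (made-up numbers)
word_counts <- tibble(
  type = c("Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"),
  word = c("performance", "family", "series", "performance", "family", "series"),
  n = c(30, 50, 5, 2, 40, 60)
)

# set = type, feature = word, n = count; adds a log_odds_weighted column
log_odds <- word_counts %>%
  bind_log_odds(type, word, n) %>%
  arrange(desc(log_odds_weighted))

log_odds
```

A positive `log_odds_weighted` means the word is overrepresented in that type relative to the other, so "series" comes out positive for TV shows and negative for movies in this toy data.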
For movies: performance, concert, bride, stuck. You know what's actually interesting: I can see words that describe situations popping up, things like "friends", and things that suggest a situation, like "stuck". That does make a little bit of sense to me.

That's one quick look. I also see a question from, I'm so sorry, I don't know how to say your name, from Zhao, which is: what if we did clustering? This is a cool idea. We take our words_unnested, and I want to know what words tend to appear with other words. You do that with the widyr package; if you've been watching for a while, you've probably seen me do this. First, I want to remove the really rare words. I'm actually going to first do distinct by type, title, and word, and then add_count(word, name = "word_total"). So "future" occurs 98 times but "slums" only eight. I'm going to say word_total must be at least 20, to get this data set a bit smaller. What's great about widyr is that now I can ask what words often appear with other words, among those that appear in at least 20 titles. It would be pairwise_cor of word by title. So "kong" and "hong" tend to appear together; no surprises there. "middle" and "aged", and "martial" and "arts". We're seeing pairs of words that often appear together.

Now I can take this and find something out. Let's see, what do I like? I like mobster movies. I could ask what words tend to appear with "mobster". "mobster" appears about 20 times; not much luck. What if I say I like crime movies; what words tend to appear with "crime"? Corruption, fighting, boss, lord, framed. That's pretty nifty. What if I just wanted a visualization?
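The word co-occurrence step can be sketched with widyr on a tiny made-up set of title-word pairs. The threshold of 2 stands in for the 20 used in the screencast, since the toy data is so small.

```r
library(dplyr)
library(widyr)

# Hypothetical title-word pairs (already tokenized, made up)
words <- tibble(
  title = c("A", "A", "B", "B", "C", "C", "D", "D"),
  word = c("hong", "kong", "hong", "kong", "martial", "arts", "martial", "arts")
)

word_cors <- words %>%
  distinct(title, word) %>%                    # each word once per title
  add_count(word, name = "word_total") %>%     # number of titles per word
  filter(word_total >= 2) %>%                  # drop rare words
  pairwise_cor(word, title, sort = TRUE)       # correlation of co-occurrence

word_cors
```

The result has columns `item1`, `item2`, and `correlation`, with each pair listed in both directions; filtering on `item1 == "crime"` is how the "what goes with crime" question above would be asked.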
To see what words tend to occur together, I could instead take this and filter. (Thanks, yes.) I can take this and say correlation is greater than, let's say, 0.1. How many connections is that? Too low. Correlation greater than 0.2, and let me up the threshold for a word being included a little bit. Now I've got this data set that I can turn into a graph. So I'm going to use library(tidygraph) and library(ggraph), and take these correlations. I actually use igraph; I think I can get this from tidygraph, but out of habit I still use graph_from_data_frame. Then I can say ggraph with geom_edge_link, which you've seen me do before, and geom_node_point, with aes(alpha = correlation) on the edges and layout = "fr", which I don't know what it stands for, but it's what I tend to use for these visualizations. Let me decrease the correlation threshold a little bit.

All right, so this is trying to get some constellations of what words tend to appear together, what tends to be clustered. On top of this I can add geom_node_text with label = name. Let me try repel... you know, let me turn up the threshold a little bit, and check_overlap = TRUE. This usually takes a little bit of experimentation. I'm going to drop the legend with legend.position, and I'm also going to set the seed so it's the same each time I look at it. Oh, let's try repel = TRUE and see how this looks. Ah, not too bad.

So this is looking like clusters, and can we start to see genres? Talented, aspiring, singer, fame, rise, fall, love: that's a whole story there. Tour, scenes: there's a documentary cluster. We can see a crew, space, earth cluster, and so on. You know what's interesting: another way to do this is that we actually have the genres, so we can look at which words are particularly common with which genres.
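A minimal sketch of the network plot, assuming a table of word pairs with a `correlation` column like the one `pairwise_cor()` returns. The pairs, the threshold, and the seed value are all made up for illustration.

```r
library(dplyr)
library(ggraph)

# Hypothetical correlations between word pairs (made-up values)
word_cors <- tibble(
  item1 = c("hong", "martial", "crime", "crime", "space"),
  item2 = c("kong", "arts", "boss", "corruption", "crew"),
  correlation = c(0.9, 0.8, 0.3, 0.25, 0.1)
)

set.seed(2021)  # arbitrary seed; the "fr" layout is randomized

p <- word_cors %>%
  filter(correlation > 0.2) %>%               # keep only stronger links
  igraph::graph_from_data_frame() %>%         # first two columns become edges
  ggraph(layout = "fr") +                     # "fr" is Fruchterman-Reingold
  geom_edge_link(aes(alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme(legend.position = "none")

p
```

Raising or lowering the correlation threshold is the main knob: too low and the plot is a hairball, too high and the clusters fall apart into isolated pairs.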
genres.

Ooh, there's a question: why pairwise_cor() with word and title, and not type? I didn't want to say it was about the difference in the types; this was just looking at what words appear together in descriptions, some clustering of words that appear together in descriptions. But let's actually do this by genre. So this is cool, I can explore it and see things like mom, single, mother, daughter. But actually I'm starting to see boss, crime, corruption, government, police. I'm seeing genres here, and I want to see that explicitly.

So what I'm going to do is take our words_unnested, and I'm actually going to unnest it even farther: I'm going to separate it by genre. Remember, I've got this separate_rows() of listed_in, and I'm going to rename it to genre, and count genre and word. First, a distinct(type, title, word), so each word only occurs once. Oh, nope, I want that before the genre. I have to throw this in first: I make sure each word appears only once for each title, so a word that pops up four times in a description is not going to be in the data set four times. Then I separate out our genres. I don't need this next step anymore. Oh yeah, sep =. Here we go. Now I separate out the genres, and now I've got it repeated across all three of these. And now I can actually use tidylo: I count genre and word, and now we can get those log odds in really interesting ways and say what words are very specific to each genre.

I might also want to just remove the rare genres: filter(fct_lump(genre, 10) != "Other"). Oh, I need this after the separate_rows(). With this line I just want to remove the rare ones. Now I count genre and word. Let's make it nine, because then it'll be a grid when I look at it in a facet. And now, instead of doing it by type, I'm going to do these words by genre. So I can now do a
bind_log_odds(), with genre being the set and word being the feature, and then n. And now I can say, for every one of these words and every one of these genres, how much more frequent is it than the overall rate? So "007" is more common in Action & Adventure than we'd expect by chance. Having said that, I probably only want these for fairly common words, so I'm going to do add_count(word); this is the number of distinct titles it occurs in. word_total is greater than, I don't know, 50. Let me try to make this a little more... I'm kind of just experimenting around, but the idea is I'm saying: only things that have appeared at least 25 times across all genres. I don't want too many really rare words. So now I can call this word_genre_log_odds; I'm finding some links between words and genres.

I can take this and visualize it: group_by(genre), then top_n() by log_odds_weighted, top 10 within each, so I have 90, 10 from each of these genres. Then I ungroup(). tidytext has reorder_within(): I can say word = reorder_within(word, log_odds_weighted, genre), because I want to reorder within every one of these, and that matches up with scale_y_reordered(). Then ggplot() with log_odds_weighted and word, geom_col(), facet_wrap() by genre, and, really important, what I forgot to do is scales = "free_y". That's why this is taking a long time to render; otherwise every word would show up in every facet. So this is a question of what words tend to be specific to what genres. Let me clean this up a tiny bit before we look at it. Log odds... this is pretty nifty. I don't need this, and I'm going to say no legend, and fill = genre. I like how it looks.

What do we learn from this? It's very easy to tell documentaries: documentary, interviews, examines, footage are all terms that are much more common in documentaries than everything
else. In Action & Adventure you can see words like protect, action, squad, rescue, terrorist. Ooh, this is probably what I'm going to build: I can build a genre identifier. That's pretty cool. I could do a [unclear]... no, let's leave it, let's do that later. Let's think about that for a second. But yeah: dramas, independent movies, international movies, et cetera. And TV dramas: chronicles. We can see the things that show up in TV dramas (chronicles) and thrillers (protect; protect appears in multiple). All right, so these are some things that are specific.

All right, question from Morgan: is there a way to filter pairs of words grouped together, like "New York" and "car accident"? One way, and I'm not going to do it right now, is that you can tokenize by bigrams using unnest_tokens(). Where does it pop up here... here it is: unnest_tokens() with token = "ngrams", and then n = 2 or n = 3. We could tokenize adjacent words, not just words that appear in the same passage. That's one way to start going down that path.

Let's build a lasso regression. I've done this in a few other screencasts; I think it's a really fun thing to do. Let's predict whether something has a mature rating, an R or TV-MA rating, based on, say, the words that it uses. All right, so I take words_unnested and do a distinct(): I'm starting just with type, title, rating, and word. Let's start with that. And let's make it a count, in case the word appears multiple times. So: word_ratings. Yeah, why not. The other thing I actually should have checked was: are there multiple... count... no, there are not. There are no duplicate titles, because they have some way of de-duplicating them. All right, that's good to know.
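The bigram answer to Morgan's question can be sketched roughly as follows. This is a minimal illustration, assuming tidytext's `unnest_tokens()` with `token = "ngrams"`; the two descriptions are made up for the example and are not taken from the Netflix data.

```r
library(dplyr)
library(tidytext)

# Hypothetical descriptions, invented for illustration only
descriptions <- tibble(
  title = c("Title A", "Title B"),
  description = c(
    "A car accident in New York changes everything.",
    "After a car accident, a detective leaves New York."
  )
)

# Tokenize into pairs of adjacent words (bigrams) instead of single words
bigrams <- descriptions %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2)

# Count how often each adjacent pair occurs across descriptions
bigram_counts <- bigrams %>%
  count(bigram, sort = TRUE)

print(bigram_counts)
```

Setting `n = 3` would give trigrams instead; either way the pairs are adjacent words, not merely words that share a description.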
You know, let me see, can I make it simple? All right, so let me add one column, called mature: rating %in% c("TV-MA", "R", "NC-17"). And I'll do filter(!is.na(rating)). This would be the data set that we want. Let's say we want to do a prediction: based on whether a word occurs or doesn't, does the title get a mature rating? I can imagine things like documentaries probably rarely get mature ratings, but I can certainly imagine that terms like sex, or terms around violence, would tend to appear in mature ones. I'm a little curious what we would find from this.

You know, sometimes I use the tidymodels collection of tools, but today I'm going to use glmnet, because tidytext has this function I really like. I'm going to take our word_ratings, and I'm going to do one other thing: add_count(word), which I did before, named word_total. Let's start with: a word must appear in at least 30, and we'll work our way down from there. I don't want to include any words that don't occur in at least 30 movies.

The way I get this to work with glmnet is cast_sparse(), where each row is a title, each column is a word, and the value, let's start with the value being the number of times it occurs. So if it occurs multiple times, it'll get multiple points. So this will be a title-by-word matrix, what we'd call a document-term matrix. And it'll be 566 words, the words that occurred at least 30 times across everything, and 7,748 movies. Did we lose any in that filter? We lost only a handful: those are ones that didn't have any words that occurred at least 30 times. I'm going to go ahead and let them go. But the question we have is: what are words that predict that a show is mature? So what I need to say is... this is the part that always bugs me
a little bit: I need to do rownames() of this. I don't have a super easy way to do this. I just match these row names up to word_ratings, up to the first time they occur in terms of the title. I could have connected it just to netflix_titles and then subset them. Oh, actually I need to do it on word_ratings: word_ratings$title, word_ratings$mature. So this is some non-tidyverse tooling where I say match() the row names to the title, and use that. So that's my y, the thing I'm going to predict, which is TRUEs and FALSEs for whether or not each title is mature. This is the step where I wish there was a good way to do this. This is a quick thought; if someone knows one, let me know.

But the way that I go about doing this is: let's predict with cross-validation, cv.glmnet(), with the word matrix, y, and family = "binomial". Am I forgetting anything here? Quickly reminding myself... yeah, family = "binomial". We're applying a lasso regression model. So this is a cv.glmnet, and I can actually check: did this work? What this is doing is fitting a model with a linear term for each of these words to predict, yes or no, does the title end up being mature?

What is this? This is showing what happens as my penalty parameter changes. At a very large penalty parameter, all the coefficients get driven to zero. At a very small penalty parameter, we're approaching unpenalized regression, and we start overfitting, because we have 566 predictors here; we're going to overfit if we try that many. But here's that sweet spot, where we set a lambda that pushes our error down. And what I can do is take that mod$glmnet.fit, that giant thing, pass it to broom's tidy() function, and now we've got the terms. So we can see, if we only had one term... okay, comedian and stand (probably stand-up) are two of the terms that most push things toward mature, especially stand. That makes sense; comedy routines
makes a whole lot of sense. But I can do this and filter just for the terms where lambda is mod$lambda... let's do lambda.1se. This is that choice of the parameter lambda within one standard error of the bottom. And now I've got a bunch of terms that are positive or negative: the word estranged makes a title somewhat more likely to be rated mature, the word government less likely.

So let's try visualizing the top terms. I say top_n() by estimate... oh, by abs(estimate). And let's visualize estimate against term: what terms are most strongly predictive? geom_col(), and let me do a little reordering. You've seen me make this graph before if you've been watching for a while. It's a really neat graph. That's what I love about a linear model: it's so interpretable. Horror, drug, comedian, stand, comic, gay, Paris, sex, 1980s, violent, and terrifying. Oof, yes, that definitely sounds like something you potentially don't want to show your eight-year-old. But magical, animated, magic, adventures, monsters, compete, science, scientists, creatures... compete is probably about things like documentaries or reality shows, which are unlikely to get a mature rating. So on this side you see things that are maybe for kids, or documentaries; scientists and creatures could be too. Let me add a couple more. This is pretty cool: what words are most predictive, and least predictive, of being mature? Oh man: murder, French, prison, hates, violence, online, brutal, thriller, Spain. We saw before that movies from Spain tend to be more mature than others, and movies from India less likely. So this is pretty cool.

You know, if you stay for five more minutes, I'm going to show one other trick here, which I really like about this kind of model: I've got these words, but I could also bring in other
features. I've done this in a couple of other screencasts. If I want to predict whether something is mature or not, I could bring in other features. Here's what I'm going to do: I'm actually going to move this mature step up to the filter, up to the cleaning, so that I can just work with it on netflix_titles. And I've got my word_ratings, but this part I don't need. Let me see... yep, and now my matching still works the same way. Okay, why am I doing this? Because I actually want to add a couple more features.

What I'm going to add is: take our original netflix_titles, with director, cast, and listed_in. Are these all... ah yes, they're comma-space separated. So the great thing is now I can do gather(): I gather feature_type and feature from director, cast, and genre. But there could still be multiple values within each, and I do filter(!is.na(feature)). It's still going to be multiple within each, but not for long, because I do separate_rows(feature, sep = ", "). And finally I do unite(). This is a lightning round right at the end. I unite feature_type and feature, separated by, let's say, colon and space, so these look like "director: this", "director: this". And so now it's director, cast, and genre. I'm going to throw in one more thing, where I say feature_type = str_to_title(feature_type), because I want it capitalized.

All right, and one more. Then I take this: these are my other features besides description, and I've tokenized them. I've got them in this format: title, feature. And on the words I do a mutate(), where feature is a paste() with word. Oh, and let me throw in add_count(feature, name = "feature_count") and a filter: don't include anything that's really rare. Actually, I probably need a different threshold than that, because some directors aren't going to have a lot; most directors aren't going to have more than 10.
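The gather / separate_rows / unite steps just described can be sketched on a toy table. This is an illustration under assumptions, not the code from the screencast: the column names `director`, `cast`, and `listed_in` follow the data set as described, while the two rows and every value in them are invented.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for netflix_titles (values are made up)
titles <- tibble(
  title = c("Show A", "Movie B"),
  director = c(NA, "Jane Doe"),
  cast = c("Actor One, Actor Two", "Actor Two"),
  listed_in = c("Kids' TV, TV Comedies", "Dramas")
)

other_features <- titles %>%
  rename(genre = listed_in) %>%
  # One row per title and feature type (director, cast, genre)
  gather(feature_type, feature, director, cast, genre) %>%
  # Drop missing values, e.g. titles without a listed director
  filter(!is.na(feature)) %>%
  # Split comma-space separated lists into one row per value
  separate_rows(feature, sep = ", ") %>%
  # Capitalize the type and prefix it, e.g. "Cast: Actor One"
  mutate(feature_type = str_to_title(feature_type)) %>%
  unite(feature, feature_type, feature, sep = ": ")

print(other_features)
```

Prefixing each value with its type keeps a director and a cast member with the same name from collapsing into one column once everything is cast into a single matrix.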
Okay, and now I say bind_rows(): I'm going to combine this with the other features. Now that I've got this, I can do count(feature) and ask what the most common features are: "Genre: International Movies", description... why are there two spaces here? Oh yeah, because paste() adds a space. And now I've got my common features, and I can cast that, I can sparse that out. Oops. No, this didn't work, because after bind_rows() the features don't have quite enough... I would have expected more. Maybe the problem is my filter here was too high. What I mean by "didn't work" is that the matrix didn't get any bigger. Oh, there it is: I needed to do it as feature. That's way bigger. I'm going to turn this threshold back up. I'm just kind of playing around with it, experimenting. Yeah, now I have 924 features, and the point is I threw all these into a bucket. I'm going to call this a feature matrix. Oops, I did a word matrix again. Feature matrix. Did I have something missing? One second. Oh yeah, the n; the n is no good. I'm going to leave it binary when I do cast_sparse(): I leave it as a yes or no, does it include it, not how many times does it occur. Just looking at what's popping up here: yeah, we're getting a binary matrix now, where each column is a feature. Okay, I need to fix some of that up.

Now, I've nearly doubled the number of features here, but I still get a lot of success out of this: if I can set some lambda, I can still get a good predictor. So let's find out what the top terms are. These are terms that make things more or less likely to be rated mature. One thing that's clear is that things that are Kids' TV, Children & Family Movies, or Faith & Spirituality are much less likely to be rated mature. But we also see some cast members. We see the genre "Movies", which, as we saw, was kind of a junk one. We do see some
descriptions pop up. This is pretty good, but after tidying it I'm actually going to separate the feature type back out from the feature, using that delimiter, colon space. It's actually called term. I got some missing values; I don't know quite what happened there. Oh, I see. Hmm. Well... yeah, so we can see that some genres pop up, but also some cast members in each direction.

You know, I didn't throw in country. Let's throw country in real quickly, because I think that's a lot of what we're seeing in these actors: we see some foreign names that could really be representing that India is less likely or Spain is more likely. So if I throw in country, now I've just added it as a feature. This is what I love about this: I just threw in all those terms, and now we've got those in the bucket. I'm not keeping my intermediate models, which is not good hygiene for fitting a lasso. But if you've seen my board games screencast, I do some very similar things there.

Here we go. Okay, so here I can see there are genres, there are countries. What are things that tend to be in one direction? John Paul Tremblay, or stand-up comedy, a couple of names. Spain is pushed strongly in this direction. We see words like drug. It's a mix: here's the feature and here's the feature type. That is pretty cool. I'm going to clean this up so you can say: how much does this coefficient make a title more likely to be TV-MA or R? All right, that's fun. That's what I love about lasso models: we get a really quick interpretation. And you can see, okay, Science & Nature TV, Samuel West, David Attenborough: things driving documentaries and such
much less likely. We can actually see the effects, and it's no surprise that Kids' TV, Children & Family, and Faith were really very strong. You know what's interesting: I would have expected horror to be brought up. Here it is: horror, as a word that shifts in the positive direction. So we do see some of these. That's pretty interesting, and we can really pull out of this: here are the actors, here's anything that goes in either direction.

All right, that's it for today. Let's take a quick look through the stuff that we did. I did a lot of exploration: when did these movies get released? Then I looked at whether duration has changed over time, and I looked a little bit at things like duration by genre, and the data. But I think I ended up getting a little bit interested in the rating, which is why we ended up focusing on it later on. We eventually got to the question of rating by country, so we learned a new trick for getting lower and upper confidence bounds, called the Jeffreys confidence interval. Oh, this should have been width, now that I think of it. That's the one, yes. Now we get the TIE fighters. And we see these differences.

Then I finally did some unnest_tokens(), and also used Julia Silge's tidylo package to get log odds, and pulled out words that are more indicative of movies or TV shows, and then later words that tend to appear together. That gave me the idea that I really was curious about genres and words, and I used word_genre_log_odds as a separate visualization. My network's not showing up, but you saw it earlier. And then finally, the really nifty thing we did was tokenize the words, tokenize the directors, the cast,
the genre, the country, put them all together into a bag, and find out what things made a title, movie or TV show, more likely to be rated R or TV-MA, and get some kind of interpretable visualization of that.

All right, if you liked this, please do remember to like and subscribe. I hope you had fun; I certainly did. I'll see you next week.
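The final modeling step recapped above (cast the title/feature pairs into a sparse matrix, fit a cross-validated lasso, tidy the coefficients) can be sketched as follows. This is a minimal sketch on simulated data, not the Netflix data and not the exact code from the screencast: the vocabulary, the number of titles, and the rule that "horror" signals a mature rating are all assumptions made for the example.

```r
library(dplyr)
library(tidytext)
library(glmnet)
library(broom)

set.seed(2021)

# Simulated title/feature pairs (hypothetical; three draws per title)
n_titles <- 200
vocab <- c("horror", "kids", "love", "space", "crime", "family")
features <- tibble(
  title = rep(paste0("title_", seq_len(n_titles)), each = 3),
  feature = sample(vocab, n_titles * 3, replace = TRUE)
) %>%
  distinct(title, feature)

# Simulated outcome: titles containing "horror" are mature,
# with roughly 10% of labels flipped at random as noise
outcomes <- features %>%
  group_by(title) %>%
  summarize(has_horror = any(feature == "horror")) %>%
  mutate(mature = xor(has_horror, runif(n()) < 0.1))

# One row per title, one column per feature, 1 where the feature is present
feature_matrix <- features %>%
  cast_sparse(title, feature)

# Line the outcome up with the matrix rows by title
y <- outcomes$mature[match(rownames(feature_matrix), outcomes$title)]

# Cross-validated lasso (glmnet's default alpha = 1) for a binary outcome
mod <- cv.glmnet(feature_matrix, y, family = "binomial")

# Nonzero coefficients at the lambda.1se penalty, largest effects first
coefs <- tidy(mod$glmnet.fit) %>%
  filter(lambda == mod$lambda.1se, term != "(Intercept)") %>%
  arrange(desc(abs(estimate)))

print(coefs)
```

Omitting the value column in `cast_sparse()` gives the binary present/absent matrix described in the transcript, and `match()` on the row names is the step that aligns the outcome vector with the matrix rows.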
Info
Channel: David Robinson
Views: 3,183
Rating: 5 out of 5
Id: 3PecUbnuYC4
Length: 69min 42sec (4182 seconds)
Published: Tue Apr 20 2021