Tidy Tuesday screencast: analyzing ratings and scripts from The Office

Captions

Hi, I'm Dave Robinson, and welcome to another one of my screencasts, where I'll be using R to analyze data I've never seen before. As usual, the data comes from the Tidy Tuesday project, an amazing data project run by the R for Data Science online learning community. Every week they release a new dataset, so let's see what they have this week. This week we have The Office. I've seen this sticker before; is that an R package, schrute? Huh. So it looks like the schrute R package and ratings of each episode. This is gonna be great. Okay, we're gonna install the schrute package. Oh, and the tidytext package, which was developed by me and by Julia Silge, and the tidy Text Mining with R book. That's cool, to have an advertisement for it. Let's see. All right, then we're gonna get the data from the schrute package. So let's start by installing the schrute package. There we go. Oh wait, I'm gonna see what's in that package. Schrute is the name, look, if you don't watch The Office. I'm a big fan of The Office, so I'm really excited about this. In schrute there's a dataset called theoffice. All right, as_tibble of theoffice; I'm gonna do library(tidyverse), as_tibble. Okay, we have... oh man, this is really cool. We have index, season, episode, episode name, the writer, the character, the text, and the text with directions, things like "on the phone" and such. All right, and we also have the ratings. Okay, so we have that. Let's see, I'm gonna say transcripts and ratings: office_transcripts, office_ratings. All right, let's see what we have. First of all, I'm gonna start with the ratings before we move on to text data. In general I'd want to start with something like ratings, because you don't have to pull apart words until you've looked at this. The data says that there are a hundred eighty-eight episodes across, if I recall correctly, nine seasons. So the easiest thing we can do is summarize things like: what's the average rating on IMDb? And it looks like, yeah, it looks
like it goes sort of down over time. I'm sort of a fan of the show, and I would more or less agree that the eighth season was probably the worst; that was the first one after Steve Carell left. It bounced back a little in the ninth season. The sixth season was also a weak season. You have your Carell years, you know. I'm gonna clean this up a little bit: say scale_x_continuous, breaks = 1:9, and that will keep all nine seasons. And yeah, it basically peaks around the third, fourth season. I actually love the second season. All that makes sense to me. But maybe we don't just want to look at season averages; maybe we want to look across all episodes. So maybe what I'll do is say... let me see. I'll say season_episode is paste of season and episode, and I'll do fct_reorder. I want to basically have the season and episode put together. Let me see. Now here's how I'm gonna actually do it: I'm going to take the episodes and say title is fct_reorder; I'm gonna reorder the title based on the season plus some tiny number times the episode. Why do I do that? To make sure that they're ordered in the correct order. And what am I doing? I could have just done fct_inorder of title. That's a good trick, because then I can say: let's do the title by the IMDb rating as a bar plot. Now, it's gonna be too crowded, and we're definitely way too crowded right there. If I do a theme, it's still not gonna be amazing, but I do a theme, axis.text.x, and let's see, element_text, angle = 90, hjust = 1. This is how I rotate text by 90 degrees on the x-axis. Still way too crowded. What I think I'm going to do is try... let's see. There are a few ways that I can try, and, yeah, nothing that includes text is gonna work. No matter how much I zoom in, there's a hundred eighty-eight episodes. That's just too many.
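A minimal sketch of the steps described so far: averaging the IMDb rating by season, then locking episode titles into broadcast order with fct_inorder so a plot keeps them in sequence. The tibble is a toy stand-in with invented values, not the real Tidy Tuesday data, and the column names (season, episode, title, imdb_rating) are assumed from what is read out in the screencast.

```r
library(tidyverse)

# Toy stand-in for the ratings table; the rating values are invented
ratings <- tribble(
  ~season, ~episode, ~title,              ~imdb_rating,
  1,       1,        "Pilot",             7.6,
  1,       2,        "Diversity Day",     8.3,
  2,       1,        "The Dundies",       8.7,
  2,       2,        "Sexual Harassment", 8.2
)

# Average rating per season
season_avg <- ratings %>%
  group_by(season) %>%
  summarise(avg_rating = mean(imdb_rating))

# Sort by broadcast order, then freeze that order into the factor levels
ratings_ordered <- ratings %>%
  arrange(season, episode) %>%
  mutate(title = fct_inorder(title))
```

fct_inorder only works here because the rows are already sorted by season and episode; the arrange call makes that assumption explicit.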
But maybe I can do... you know, it's a little silly, but I'm gonna actually try element_blank. Now they're all like this, and I'm going to throw in a fill = factor(season). These are all the seasons. You know, I don't like a column, because the lowest rating we have is like, what is that, a 6-point-something. So what I'm going to do instead is a line plot. But that means the fill is not going to work on a line plot. You know what, maybe I throw in on top of it a geom_point with color = factor(season). Oops. And here, a line plot; oh yeah, we want a group = 1 to get one continuous line across them. Here we go. We see a couple of things here. One is that the last few episodes have higher ratings. But I'd really like to know... I've got some guesses, just 'cause I'm a fan of The Office. This looks like it's season seven; I bet this is the last episode Michael Scott was in. There were a few other good episodes from that season. "Goodbye, Michael" is probably this episode. What is this? This is season four, and then I think this is season five. I don't know what episode this is in the middle of season five; I think there was a Super Bowl one. Well, you know how I'm gonna know which is which? I have a couple of options. I'm gonna start with a geom_text. But really importantly, I don't want to add text to every single one of these; I want to know, for the really extreme ones, what it is. So here's what I'm going to do: I'm gonna say label = title, but check_overlap = TRUE. Zoom out a little. Yeah, that's gonna be better. That's pretty good. "The Banker" was a clip show episode, so it makes sense that it would be the lowest-rated show. I don't know, season 8 was kind of lame. I don't remember what was going on here with these episodes. Yeah, there were some pretty lame episodes around there.
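The plot built up in this stretch can be sketched roughly as follows: one continuous line across all episodes, points colored by season, and title labels that are dropped wherever they would overlap. The data is again an invented toy table, so this illustrates the layering rather than reproducing the screencast's actual figure.

```r
library(tidyverse)

# Toy stand-in for the ratings table; the rating values are invented
ratings <- tribble(
  ~season, ~title,          ~imdb_rating,
  1,       "Pilot",         7.6,
  1,       "Diversity Day", 8.3,
  2,       "The Dundies",   8.7,
  2,       "Casino Night",  9.3
)

p <- ratings %>%
  mutate(title = fct_inorder(title)) %>%
  ggplot(aes(title, imdb_rating)) +
  geom_line(aes(group = 1)) +                  # group = 1 joins all episodes in one line
  geom_point(aes(color = factor(season))) +    # color works on points where fill would not
  geom_text(aes(label = title), check_overlap = TRUE) +  # skip labels that would overlap
  theme(axis.text.x = element_blank())         # 188 axis labels would be unreadable

# print(p) draws the plot
```

Without group = 1, ggplot2 treats each level of the discrete x variable as its own group and draws no connecting line.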
Finale, yes; the last couple of episodes go up a lot. Yep, so "Garage Sale" is the episode where Michael gets engaged. "Stress Relief", yes... no, was that the... I'm gonna stop trying to test my memory, or whatever. There's the season three finale, the season two finale; I think this one won some Emmys. Yeah, okay. So one thing we're getting here is that a lot of this actually lines up with my recollection of watching these: some episodes that were particularly funny, some particularly low rated. But yeah, all right. Last thing I'm gonna do is throw into the text an hjust. You know, not the last thing I'm gonna do; I'm also going to theme_set(theme_light()). I'm not crazy about the gray theme. That's pretty good. Hmm, I also don't really like these vertical lines; they're making it look really crowded. So I'm also gonna add panel.grid.major.x = element_blank(), and I'll throw in a panel.grid.minor.x... do I? I don't, because when it's a factor I don't need it. All right, that's actually pretty good. I don't think I need a legend on this one, but I'm gonna throw in, actually, one point that we haven't considered, which is the number of total votes. So what I'm gonna do is say size = total_votes. I don't think I actually need a legend on this one; I'm going to throw in a legend.position = "none". Looks a little bit better. Now notice I did the hjust. Yeah, this is pretty good, because our eyes get drawn to the right spots here. It's a shame that some spikes have text that overlaps with others; not a lot. We could have done a geom_text_repel. Let's actually give that one shot. I'll change this to a geom_text_repel; I'm actually gonna comment out this line. Well, this is gonna look... oh, looks terrible. Truly terrible. And I could make that force factor smaller, which causes them to repel each other less. Nah, it's just way too much text. But what I actually like about this is that if
you're an Office fan, you can kind of fill in the blanks yourself; you start to feel like, oh, this is when this happened, this is when that happened. So it's still pretty reasonable. Okay. And yep, this is the two-parter where Jim and Pam get married. Lots of cool things. I said "clip show": a clip show, for people who don't know from TV, is when an episode mostly consists of clips of previous episodes. They're generally very unpopular among fans of TV. Okay, all right. I'm actually gonna add one more thing: I'm gonna add a geom_smooth. No, not a linear smooth. What was the general path across these? Oh, it's not enjoying this geom_smooth because it's on a factor. Hmm. I'm gonna give this one try: x = as.integer... Oh, what if I just did this with an overall episode number, like a regular person? Why did I need this whole title thing? What am I doing here? Regular person. So let's say episode_number = row_number(). Here we go. I don't necessarily need to remove the text anymore. There it is. And now I can probably get rid of these. Yeah, okay, that's better. All right, there's a lot I like about this. It shows the general shape; the pilot is less popular in general. But yeah, it shows the general shape that goes across these.

All right, so that's popularity of The Office over time. I'm gonna clean this up with a couple of things: x is "Episode number", y is "IMDb rating", and let's see, title is "Popularity of The Office episodes over time", subtitle "Color represents season, size represents number of ratings". So why is the number of ratings relevant? One reason is that it's a rough measure of popularity; we can see that they're less and less watched over here. But another reason is that you can say, okay, there's a lot of noise, but not a lot of people were rating them
all right, let's say, over in this area. Okay, this looks pretty good. Last thing is I probably want to be able to read this side of the screen. I'm gonna throw in an expand_limits(x = -5), not because there are negative 5 episodes. Maybe -10. Yeah, that's pretty good, and now we can read across all the episode titles. All right, that's popularity of The Office over time. I could also have looked at it as: what are the most popular episodes? I could have said arrange by descending IMDb rating. And I would have wanted to combine these together, so I want to say something like title is season, dot, episode, space, title; just thinking of what will make this a little bit more readable. And look at the top 20, because I could have asked: what are the 20 most popular episodes? IMDb rating, geom_col, coord_flip. I neglected one step here: I want to say title is fct_reorder, title by the IMDb rating. You know, look at me, I did it again: I did geom_col when I really want a geom_point. Something's up here. Hmm, perfectly tied; it's a little weird. I don't know, I guess they're always given in like 0.05 increments? Let me throw in a color = season. Yes, okay, they're all in 0.1 increments; that's something we learned here. "Most popular episodes of The Office". All right, so I don't think I agree with some of these. I think there's a lot of nostalgia that goes into the last couple of episodes; I don't think they're up there with the earlier seasons or anything. I think they're being compared to the rest of the season. But yeah, I'm pretty sure "Stress Relief" is the Super Bowl episode in season 5. Yeah, I have a good memory for TV; it's not just The Office. But let's see. And then, yeah, these are some classic episodes: the season 3 finale,
the season 2 finale, the episode where Michael gets engaged. I'm a big fan of this episode, really big fan of this episode. So these are some of the most popular episodes of The Office. I could have also said size is the number of ratings... what was it? total_votes. Yeah, you know, this is fine. We see that, okay, some of them didn't get a lot of ratings individually. All right, I kind of like this; I like this graph a lot more. I'm glad I started with this one. If I wanted just one graph out of this dataset, this would be the one that I would do.

Okay, so that was the office ratings by themselves. Now let's begin with the transcripts. All right, so we have office_transcripts. These look a little different. One thing is I'm gonna want to clean this up right away: parse_number of season. parse_number comes from readr. Does as.integer work with the zeros in there? Let's find out the hard way. Yeah, it works fine. Okay, so as.integer of season and episode; I just want to clean that so they can join together. I could have joined on title too. All right, so we could ask: what are the characters that lead to better episodes, worse episodes? We'd have to control for confounding factors (better or worse in terms of the ratings), but we're gonna get to that in a minute. In the meantime, I'm actually gonna put that parsing up here, along with making it a tibble. Let's look at words. Okay, if you haven't used the tidytext package before, created by myself and Julia Silge, it works based on unnest_tokens into words. I'm gonna remove the text with directions for this analysis. All right. Yeah: all right Jim, your quarterlies look very good, how are... Okay, it's split into one line for each word: your quarterlies look very good,
how are things at the library? Oh, I told you, I couldn't close it. So you've come to the master for guidance. Yeah, so those are the first lines of the first episode. You can imagine if I just read through it for the rest of the screencast. Yeah, I'm probably not supposed to; that would probably not be fair use. So one thing I could ask is: what are the most common words? I'll call this transcript_words, and then say what are the most common words. I love count; I'll count all day. And I also love anti-joining with our set of stop words that comes with tidytext. "Yeah", "hey", "Michael"... So I'm actually gonna filter out a few additional stop words, like "yeah" and "hey". This is just because I don't want these to show up; I'm not completely crazy about those words popping up here. I could keep going, but whatever. I'm just gonna say filter, not word %in% blacklist. Here we go. And now we have removed some common stop words, and we have names, of course, and a couple of other common ones. I'm not gonna create a bar plot; I could absolutely create a bar plot. But yeah, the idea is that these are things that aren't quite stop words but are really common things that could pop up in a modern sitcom. All right, so we have our words, but I'm actually going to do the anti-join and the filter up here; that is to say, I'm gonna keep our transcript_words like this. Okay. And now I'm gonna say transcript_words... All right, here's what I'm curious about: how much does each character say each word? Oh, some characters have a little bit of funky parsing. I really should have, before I did anything... here's the thing I hadn't thought of: I should have taken this and counted the character. These are the major cast of The Office. I'm going to scroll down and look for any weird cases. There are gonna be some quote
situations. Now, they don't actually matter. Wait, hmm. Well, here's what I'm going to try: I'm going to take this and say mutate character = str_remove_all of character, removing all the quotes from that, just in case those were parsed incorrectly. Rerun this. Yeah, so notice these aren't important characters; they're the ones that come first alphabetically. I just happened to notice the quotes and it was annoying me a little bit. All right, so counting character and word: I'm not going to do this quite yet. What I'm first going to do is take my words... here we go. Index is going to be one per line. Okay, yeah, so in office_transcripts, index is one per line, not one per episode. So what I'm gonna do is actually say group_by character, num lines... no, I'm gonna do it up here, now that I think of it. I'm gonna add_count the character: add_count gives an n that shows the number of times a character speaks. Michael has ten to eleven thousand lines across all the episodes; Jim has 6,300. I only want characters that have at least ten lines total. All right, I think at least ten is a good start. How many characters does that leave? I'm now gonna count. Okay, 178; that's too many. Do I want to do ones that are in at least two episodes or something? I think that I do. Honestly, here's what I'm gonna do: I'll say n must be greater than ten lines, and n_distinct of title is greater than one. I'm doing some arbitrary things. I don't know who this is; I'm not trying to remove anybody in particular, I'm just trying to make this a little bit more reasonable. I'm doing something wrong... group_by character. Oops, look at me go. What am I doing? In n_distinct, it's not called title, it's called episode_name. Cool. All right, so there are eighty-one characters, with however many lines they have. Some of them aren't meaningful, like "waitress", "crowd", "woman",
"everybody"; those are not meaningful characters, those are kind of generic terms. But yeah. So what if I said you must have... Bob Vance! I want to keep Bob Vance. All right, this is now called the Bob Vance cutoff: you must have at least 30 lines. All right, I'm just messing around a little bit there; we can always change the filters later. But mostly we know we're gonna keep the main cast: Michael, Dwight, Jim, Pam, etc. All right, so why have I been doing this? Well, what I want to know is: are there words that are specific to particular characters? I bet Dwight says "Michael" a lot; I bet that Jim and Michael say "Dwight" a lot. So a lot of them are gonna be names of other characters. We might at some point want to remove those; we're probably going to. But how am I going to find this out? What I'll do is bind_tf_idf. TF-IDF is term frequency times inverse document frequency. That's going to be seeing what words are common for particular people but uncommon across characters. I'm going to treat each... oh, I'm gonna have to count word and character first, and then I'm going to say bind_tf_idf of word, character, and n. And now we arrange: what words are most specific to which characters? "Group" says "shabooya". Shabooya... is that like, I don't know, a chant? I guess so. So that's charming, but it's kind of weird. Yeah, so what we have here is things that everyone says all at the same time. "Everyone" and "All" are a little bit weird. I'm gonna say blacklist_characters: "Everyone", "All", "Both", "Guy", "Girl" (those are not characters), "Group". And I also need to do the filter here. Didn't I do this? What did I do? There it goes. Okay, yeah, so Val's a character, and she just talks with someone, Brandon, and so on; that's in season eight, I think. All right, so what we see then are words that maybe only one person says; that's a lot of these. All right, but whatever. I
might also want to keep only words that have a certain frequency. So maybe I say add_count of word, filter n greater than... it must have twenty uses overall, for example. All right, so now we have, like, Hank uses the words "chairs" and "copier" three times, Helene, and so on. We have kind of plot-specific things and character-specific things. Honestly, we should only keep characters that have more than some number of lines. Yeah, I'm trying to get a core set of characters. How many characters do we have now? Otherwise, the characters that appear in only one or two episodes have like one plot line, and the words are specific to those plot lines. It would be hard to argue there are more than 30 characters. All right, David Wallace is now the cutoff, not Bob Vance. And this is the number of words each of them has said. Okay. And then Robert often says "California", often says "Andrew", I don't know; Jo talks about printers; Darryl says "Mike" a lot; Jan says "Michael". I would have expected it to be Dwight saying "Michael" a lot; that's interesting. Oh, it'd be neat to say what is specific to one character. I could say, for example, I want to know all about Dwight: what is Dwight's thing? Oops, not the count; I'm looking for the character_tf_idf. It's only common words, common characters. character == "Dwight". He says "barn" or "farm" a lot, but... yep, I thought it would. There's his name, "Schrute"; "Mose" (that's his cousin); "Michael"; "farm" is up there, "beet" is up there, "idiot" is up there. Oh man, if you're not a fan of The Office, I don't know if this is fun for you at all, but it's really fun for me. So we can see then, here are this specific character's signature words. Reorder word by tf_idf. How's that look? Need a coord_flip. "Dwight", "Schrute", "Mose", "Michael", "assistant", ha, "farm", "idiot", "war". Yeah, so the idea is like... I don't know...
"Pum"? What is "pum"? What even is that? Sometimes it's worth looking back at the text: filter, str_detect of text, "pum". Presumably... I'm gonna actually say "pum" with a space after it. Okay, it's singing. So one of the problems we have here is when someone uses the same word over and over; this is just singing "The Little Drummer Boy" in one episode. We could distinguish it somehow, I don't know; it's not a big deal. All right, so these are some characters and the things they say. Let's actually look at a couple of characters. Let's look at Dwight, Jim, Michael... well, characters I like: I like Holly, I like Darryl. Let's see. We have to facet_wrap by character, with scales = "free". It's not going to work yet. Oh, not head(20); instead I need a group_by character, top_n 20 by tf_idf. Oops, it's not gonna work: ungroup. Last thing we need is a reorder_within. So I'm going to say reorder_within, which comes from tidytext, and we say reorder the words within characters. Oh, and I need to add something. Notice this is to ensure that I can order within each of these facets. If I just had reorder, all the words are ordered overall, but because the orders are different between each of these facets, they all end up in kind of different orders. reorder_within fixes that problem, but it requires a step where I say scale_x_reordered. Yeah, that's the one. And then we switch this to only have ten from each. Okay. Oh, look at me: how far am I in? Almost 40 minutes in and I haven't saved yet. Isn't that charming. So what I'm gonna name this is... this is office. So what we see is that Michael talks about Jan a lot. This is not the words they use the most; these are the ones that are most specific to that character. "Wimoweh" is, I guess, a song, like "a-wimoweh". Okay, this is "The Lion Sleeps Tonight" song. Yeah, all right.
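The TF-IDF steps narrated above can be sketched on a tiny invented example, treating each character as a "document". The two toy lines below are made up for illustration and are not from the transcripts. The sketch also uses slice_max where the screencast uses top_n, and puts the words directly on the y-axis with scale_y_reordered rather than combining scale_x_reordered with a coordinate flip; both swaps are choices made here, not the screencast's exact code.

```r
library(tidyverse)
library(tidytext)

# Toy stand-in for the transcripts; these lines are invented
toy_lines <- tribble(
  ~character, ~text,
  "Dwight",   "beet farm beet sales",
  "Jim",      "pam sales pam prank"
)

character_tf_idf <- toy_lines %>%
  unnest_tokens(word, text) %>%        # one row per word, lowercased
  count(character, word) %>%           # n = times each character says each word
  bind_tf_idf(word, character, n)      # term = word, document = character

top_words <- character_tf_idf %>%
  group_by(character) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>%
  ungroup() %>%
  # reorder_within orders words separately inside each facet
  mutate(word = reorder_within(word, tf_idf, character))

p <- ggplot(top_words, aes(tf_idf, word)) +
  geom_col() +
  scale_y_reordered() +                # strips the suffix reorder_within appends
  facet_wrap(~ character, scales = "free_y")
```

In this toy data "sales" is said by both characters, so its inverse document frequency is log(2/2) = 0 and it drops out of each character's signature, which is the behavior the screencast relies on to filter out words everyone uses.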
And I would see Andy sing; it's a... and there's Val, Darryl's love interest in later seasons. Yeah, lots of sound kind of things pop up that one person might say a couple of times. I'm wondering if we should say distinct: don't count the same word multiple times in the same line. That would help a little. I don't know; I don't have the strongest sense here. Cece is Jim's daughter, and Pam is his love interest; he often calls his own wife "Beesly", Pam Beesly. That's cool. "Assistant", "regional"... so there's a running gag where Dwight will call himself the assistant regional manager and Jim will say "assistant to the regional manager"; this is why he says "assistant" so often. All right, this is pretty fun, this is really cool. Can we add any characters? One I want to add is Holly. Does Holly hit this threshold? Holly is Michael's love interest. All right, cool, cool, cool. Yep, here's a whole bunch of things that we can see. A lot of them are names of characters; we could have removed the character names. So these are signatures of characters. I'll add things here: "TF-IDF of character". I could add more text, but the idea is that these are signatures that are specific to each character. I think this is fun. I'm gonna make one change here: I'm not gonna keep this axis free, because notice Darryl has some words that are very specific to him; others don't have words that are as specific to them, and that is kind of worth noting. Michael doesn't have a lot that's specific to him; what if I replaced him with David Wallace? David Wallace, Michael's boss. Hmm, I was expecting "Michael". No? Yeah, right, he has a product in later seasons called Suck It; that's why the word "suck" appears here. Oh man, I have a lot of Office knowledge rattling around in my head that I did not realize I had.

Okay, so that's some things about just text analysis. Now I want to combine the two datasets: I want to combine text and episodes. I don't
actually want to use text to predict the rating of an episode. Well, I don't, because I'm skeptical that if an episode uses a particular word a lot, it's going to be unusually popular. That doesn't feel plausible to me. It feels more plausible that it'll be confounded either with which characters are speaking and which characters are popular, or with time. So that's why I'm actually gonna want to instead estimate, with machine learning, what affects the popularity of an episode: is it the season, or position in time, etc. There's the season, there's the director, there's the writer. Notice what we actually have in this office_ratings dataset... wait, nope, it's not office_ratings, it's office_transcripts. We have director, we have writer, and lines per character is gonna be the remaining one. So I'm really kind of interested. We'll say office_transcripts, and we'll say count episode_name, character. All right. Man, this is annoying; I think that they were split up in the other one. I wonder, if I did an anti-join... if I do a distinct episode_name and I do an anti_join. What is annoying? I'll tell you what's annoying: by episode_name = title, some of them are not gonna line up. A lot of them don't; some 26 don't line up. I'm thinking about how this does not line up. In office_transcripts, are they just different? Like "Get the Girl", episode name... Okay, yes: some of them, like this one, have a dot; this one doesn't have a dot. Don't love that. Okay, I've got a guess here. Let's see something: distinct. See, I think it might be better to use episode numbers, but maybe not, not with parts one and two. Okay, now we're going to use episode numbers; it's just going to be a little bit less messy, even though here, for the part twos, the ratings of part two
don't end up included. All right, I think that's pretty reasonable. What I'm gonna do is then ask: are there any that do not have the season-episode pairing? I'm just trying to get a feel for this. Okay, there are some that don't have ratings: season four, episodes 15, 16, 17. Oh man, joining these datasets together is gonna be harder than we thought. I'm thinking of whether there's some easy way to manage this. So first of all, let's take a quick look at what's going on here. If I do filter season == 4, we have 1, 3, 5... all right. And then if I do ratings, it has 1, 2, 3... The ratings think there's a 14-episode season, and the transcripts... I'm going to throw in title here, and episode_name, as it's called here. All right, so this thinks there are 14. Aha, okay. You know, if I get rid of parts 1 and 2, I'd probably do a lot better on joining all the titles, because these aren't really part 1 and part 2 episodes, are they? Okay, here's what we're gonna try. You see, when you're trying to join datasets, you have to do this kind of thing before you do anything exciting. What we're gonna do is say episode_name, in the transcripts, is str_remove from episode_name: space, anything like this, just remove it. And now ask: are there any that aren't in the ratings? By episode_name. All right, and some of them are still going to be... oh, oops. Oh right, this needs to be like "Part". Okay, I see: it's a capitalization thing; standardization is a problem here. Okay, what I'm gonna do is say str_to_lower. I'm gonna have to standardize both, and it's a little bit lame. name is str_remove, str_to_lower... I'm gonna remove that, and I'm also gonna remove periods. So here I'm removing periods and I'm making it lowercase, and here
we go, and I'm doing the same thing to the ratings. Now I'm joining by name. All right, and some of them are still off. I could try and get the last handful; it's probably just a punctuation thing. But let me see, I do wonder what happens if I flip these two. All right, these are ratings. Okay, here's what I'm going to do: from the ratings, I'll remove the parts. I'm doing quite a complicated little regular expression here, but it's actually just saying "part"; remove any of those. And then I'm gonna do something: I'm gonna take my ratings and group by name. This will mean some will have multiple rows, like the ones that have two parts. So what I'm actually gonna do is group_by name, summarize imdb_rating = mean of imdb_rating: average across the parts. All right, we got pretty close. There are punctuation issues here now, but I'm gonna stop trying. I could join the last six, but it's a small fraction. The reason I needed to do this is that now they're not counted as multiple episodes. Okay, I didn't need to do that. And now I'm gonna do an inner join on the office transcripts. All right, ratings_summarized. Here's what I do: take our office_transcripts; as I said, we can count our characters and the name, the episode name. Let us say name; "name" is lame. All right, so now I have our name, and again I've counted our characters. group_by character, filter, I don't know, n greater than 100... that's 100 separate... sum of n greater than 100. Okay, now at 30; I'm just doing something quick to say, okay, there are 30 characters now that I'm looking at. And there are a couple of things we can try here. One is to say: what is the average rating of a character in an episode? And then inner_join on ratings
summarized, by character, oops, by name. ratings_summarized, by name; I got there eventually. Now I do a summarize, and I'm still grouped by character from up here, and I ask: what is the average rating for each character? This is going to be confounded by factors like when the character was introduced, and we're going to get to that in a minute, but I just want to start with the simplest thing: what's the average rating of the episodes that a character is in? I bet you Andy's is going to be pretty high, simply because he was introduced in the third season, when the ratings were peaking. But he was also an even bigger character later on. OK, I'm going to arrange these, and there's the number of episodes the character was in. Now there are 31 characters here. character_lines_ratings: this is a more complicated dataset that includes the character, the name of the episode, and the rating of that episode. All right, so the quote-unquote best character, and this actually lines up for me, is Charles. That's Charles Miner, played by Idris Elba, who's in only six episodes. He's in a number of episodes in season four... five that are really popular, and then in the season five finale, I think. Where is he? Nope, that's season four. Here he is: he goes from "New Boss" over here. Basically he's only in these episodes, not in anything lower, so that's why his average is very high. Karen is really only a character in season three, and season three is one of the most popular, which explains why Karen is popular. Holly is in a lot of the best episodes of the show, as well as being a popular character. Jan was more of a character in these seasons. So there's a
mix of popular characters and characters who were around only for a very specific set of episodes. Who are the characters with the lowest average rating? I bet they're the ones that are in almost every episode, like Dwight. No: Deangelo, who is played by Will Ferrell, is only in two episodes, and they're not very good ones. That's a mistake, in that he just happens to have more than a hundred lines total. So I'm actually going to say you must be in at least three episodes... at least fifty lines... at least five. I'm not trying to hack the results, but I really don't want to count characters like that. Brian, who is this? Who's Brian? I have no recollection of this. All right, so we see there are some characters in a handful of episodes that are popular. Robert California is only in season eight, and these are characters introduced near the end of the show, which explains why they're low. The exception is Todd Packer. I never liked Todd Packer's character, but he appears across the entire series, and that actually might be meaningful: if he's in an episode, it gets a little bit lower of a rating. So that's where we probably want to, in some sense, control for the season that you're in. All right, how are we going to do that? What I'm going to do is try a machine learning model to predict the rating of an episode based on factors such as how many lines each character has, the season, and the director and writer. OK, so what I'll do is take our transcripts and do distinct(name, director, writer). Look at that for a second. I need to separate rows, so here's what I'm going to do: it looks like there are some that have multiple writers, and I want to give each of them a credit. Not a lot have multiple writers, but some. Oops, who's
there... director. OK, and what I'm going to do, and I think I've done this in my screencasts before, is a generalized linear regression, a lasso regression model. I'm really excited to see which directors push the rating up and which directors push it down. To do that I've got to gather the type and the value of director and writer, then unite type and value into one column called feature, so "director" and sep equals this. But before I do that I want to separate rows: I want to separate the value on these semicolons, so separate_rows() with sep equal to a semicolon. I did a lot right there. What I want is to make sure there aren't any semicolons left; rather, they're counted multiple times, as multiple separate writers. So now every writer gets their own line. Anything weird? Any empty strings, anything else? No, this is generally kind of working. All right, so these are my director and writer features. What I'd also need is to have the season in there along with director and writer. Oh, that's right, I wanted to use the character lines. Let me see... yeah, this is pretty good. I need to use character_lines_ratings. All right, I forgot to filter out the blacklisted characters. How did we do that earlier? I thought that was pretty good. What I do is filter for character not in blacklist_characters, and OK, there are 39 characters. And now I have the number of times each of these appeared. What I'm doing is tying together these features; every one of these is a feature. Though I might want to add a count,
filter n greater than or equal to three: I only want to keep the ones that appear at least three times. Oh, add_count(feature): keep only the directors and the writers that appear at least three times; otherwise we probably don't have enough data to use them. OK, so that's step one. Then I'll take our character lines, where we have our name and the feature. The feature is the character, and the value is the number of lines: name, feature equals character, value equals the count. This is our character_line_features. I'm trying to combine these together to get something interpretable. You might have seen me do a lot of these lasso regressions before. And finally, I want to add the season as a dummy variable; that is, I want a separate number for each season as a group, so we have an overall season effect. That'll handle the change over time, but also kind of handle the non-linearity, where ratings go up and then go down. So I'm going to add season_features. What I'll say is office_ratings, distinct(name, season), because there are actually multiple rows, and then transmute(name, feature = as.character(season), value = 1). Let me change this a little bit. What am I doing here? I'm trying to make all three of these line up to have the same shape. I'm going to call the feature paste("season", season). And now I'm going to combine these three together with bind_rows(): director_writer_features, character_line_features, season_features. Now, they're still grouped by feature. What are they currently looking like? All right, so what we have then is, oops, character_line_features. Yes, so this is how many lines, and each one of these will be like a term. I'm
actually going to change this and take a log of n, log2, because I suspect the distribution of character lines is going to be log-normal. Yeah, it definitely is: see how stretched out it is. This is the number of lines each character has in each episode. The log is going to be a little more normal than what we have with this skew, where some people have only one or two or three lines. So we'll use the log of the number of lines they have. Here we go: character_lines_ratings, transmute(), that's what I was looking for: name, feature. Oh yeah, ungroup() this. character_lines_ratings is still grouped. Lame; I shouldn't have left things grouped. I'm really having a fun time. Yeah, I'm out of practice with my screencasts. Here we go: we have our features, including season 1, season 2, and so on. Here are our most common features; here are all of our features. OK, so now that I have this, which features are predictive of quality? How can I answer that? What we can do is take features, and tidytext has cast_sparse(): we'll say every row is one episode, one name, every column is one feature, and we're casting the value. So now we have a sparse matrix of how many lines each character has, whether this person is the director, and which season it is as a dummy variable. Now that I've done that, we have episode_feature_matrix, with a hundred and ninety-two rows. I thought there were fewer than that. One problem is that the episodes don't end up lining up: there are different episode names between the datasets, since I'm not joining them, so it's ending up being a problem. What are we going to do about that? I'll add one last step: I'm
going to join with the office transcripts. OK, here's what I'm going to do: I take our features and semi_join() with the office ratings and the office transcripts. This is a very messy feature-cleaning process, and I think the tidymodels package has some better approaches that I haven't bothered using. OK, we're using 80, 87 features to predict 120, 78 ratings. What are those ratings? Let's see, we can get them right out of ratings_summarized, the imdb_rating. This is the biggest hack we're doing here: we need to match the row names of the episode feature matrix to the names in ratings_summarized, so these are the ratings lined up with the episode feature matrix. They'd actually end up being in basically the same order: this is the first, unpopular episode, and so on, and it ends on a strong episode. All right, so that's how we order these. Now we have my sparse matrix and a vector of ratings. And again, for machine learning these days, I bet Julia Silge is going to make an amazing screencast that does this prediction in a much better way, using the tidymodels package. I'm just doing something quick and interpretable, off the top of my head. I'm going to use the glmnet package, with the episode feature matrix, to predict the ratings, and I'm going to do it with cross-validation. So here's my cross-validated model. All right, I've done lasso regression in a lot of previous screencasts: you can see my board games screencast or my wine ratings prediction screencast. We've done a lot of these before, but I'll explain quickly what this is. This is showing: if I just did linear regression, how well would it do, and then as I start adding a lambda penalty term that
causes the coefficients to be smaller, to be regularized, to not allow large coefficients, how much better does the prediction get? And the answer is it gets a lot better. It gets better as we add penalty, down to about this optimum. Maybe I want to go all the way here. How can I then interpret this? I can use the broom package. What are the coefficients? I can tidy() the glmnet fit, and what I can see for starters is that the more lines Michael has (this is the first coefficient added), the higher the episode's rating. That's mostly saying that seasons 8 and 9 don't have him. That's if we look at the first few steps. How do I pick a lambda? This is a very high penalty value that removes all the coefficients. I say filter where lambda is equal to our model's lambda. Should we go for the minimum? Let's do one standard error: I'm using a stricter lambda, one standard error from the minimum. OK, so this shows the things that have a positive or negative impact on the rating even after our regularization. I'll also filter for term not equal to the intercept; the intercept is not that interesting, everything starts at an 8 or whatever. And we'll create a graph: term and estimate, geom_point(), coord_flip(), and mutate term to be fct_reorder(term, estimate). I make graphs like this so often that I know what I want. Oh, this needs parentheses around "Intercept". All right, this shows what has a positive and negative impact on the rating, so this is very interpretable. We could label it "Estimated effect on the rating of an episode". You know, instead of a geom_point() I can do a geom_col() with fill equal to estimate greater than zero. Yep. And then throw in a few other things: a theme to remove the legend. And one thing is that with coord_flip() you
switch this here. Here it is. All right, so what does this show? Simply being in season eight makes the rating 0.2 lower; we saw that in the general trend. Season one is also a little bit lower. And there's a big bonus for being written by series creator Greg Daniels; he wrote, I think, a lot of the big finales, things like that. The more lines Jim has, the better. Paul Lieberstein, who also plays the character Toby, has a positive impact, as do more lines for Michael and Angela. Holly has a positive impact, but smaller than I would have expected. That's interesting. This is a very conservative model: it's trying to winnow the set of things that improve an episode down to as small a set as possible. What if I make a small change and use a less conservative estimate? This is, at the very least, more fun: we get more interpretable things, even if some of them might be due to noise. So who are the best directors? Steve Carell, Tucker Gates. Best writers: B.J. Novak. The more lines Jim has, the better. Carol's a popular character. Helene, I think, is Pam's mother, who was in a few popular episodes, especially in season five. And yep, the negative ones are as expected: Todd Packer is not a great character, season six is below average, Rainn Wilson as director. We see some things like: which episodes did Allison Silverman write? I bet it's later-season ones. So ratings_summarized... no, let's look at office_ratings, filter... oh no, it's hard to do this because the ratings are in a different dataset than the transcripts. But yeah, I really do like regularized linear models, because, though some of this plot is probably noise, it's generally showing, controlling for other confounding factors like the season, what has a positive impact and what has a negative impact on an episode's
popularity. I think that's pretty interesting. We see some writers and directors at the bottom as well as at the top. Seasons three, five, six, eight, and one are the ones that pop out as being unusually above or below, so there's a time effect. That's overall pretty cool. More lines from Jim and Carol make episodes more popular. Carol was Michael's girlfriend in season two, I think; seasons two and three. OK. All right, so that was some machine learning. Again, if you're really interested in glmnet and you want to see more, check out the board games dataset or the wine ratings dataset. Both of those use text... no, they're not both text tasks, but they both use regularized lasso regression to build predictive models. And yeah, we really learned a lot today. We saw the ratings of episodes over time, we saw some of the most popular and least popular episodes, and we saw each character's signature based on tf-idf of character-word pairs: which words each character used the most. I feel like we learned a lot about one of my favorite comedies. OK, that's all the time we have. I hope you had a great time; I certainly did. See you next time.
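
The modeling steps described in the screencast (long feature table, sparse matrix, cross-validated lasso, tidied coefficients) can be sketched in R roughly as follows. This is a minimal sketch on made-up toy data, not the code typed in the video: the episode names, the three characters' line counts, and the simulated imdb_rating values are all assumptions for illustration, and only the shape of the pipeline follows what is said in the captions.

```r
library(dplyr)
library(tidyr)
library(tidytext)  # cast_sparse()
library(glmnet)    # cv.glmnet()
library(broom)     # tidy() method for glmnet fits
library(ggplot2)

set.seed(2020)

# Toy episodes (hypothetical): 3 seasons of 20 episodes each
episodes <- tibble(name = paste("episode", 1:60),
                   season = rep(1:3, each = 20))

# Character features: log2 of a made-up number of lines per episode
character_line_features <- crossing(name = episodes$name,
                                    feature = c("Michael", "Jim", "Todd Packer")) %>%
  mutate(lines = rpois(n(), 20)) %>%
  filter(lines > 0) %>%
  transmute(name, feature, value = log2(lines))

# Season features: one dummy (value = 1) per episode
season_features <- episodes %>%
  transmute(name, feature = paste("season", season), value = 1)

# Same shape for every feature type, so they can be stacked
features <- bind_rows(character_line_features, season_features)

# One row per episode, one column per feature
episode_feature_matrix <- features %>%
  cast_sparse(name, feature, value)

# Simulated ratings (assumption): season 3 is rated 0.5 lower on average
ratings_summarized <- episodes %>%
  mutate(imdb_rating = 8.5 - 0.5 * (season == 3) + rnorm(n(), sd = 0.2))

# Line the ratings up with the matrix rows by episode name
ratings <- ratings_summarized$imdb_rating[
  match(rownames(episode_feature_matrix), ratings_summarized$name)
]

# Cross-validated lasso (glmnet's default alpha = 1)
cv_fit <- cv.glmnet(episode_feature_matrix, ratings)

# Nonzero coefficients at the stricter lambda, one standard error from the minimum
coefs <- tidy(cv_fit$glmnet.fit) %>%
  filter(lambda == cv_fit$lambda.1se, term != "(Intercept)")

# Bar chart of estimated effects; built here but not printed
p <- coefs %>%
  mutate(term = forcats::fct_reorder(term, estimate)) %>%
  ggplot(aes(term, estimate, fill = estimate > 0)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x = NULL, y = "Estimated effect on the rating of an episode")
```

Swapping cv_fit$lambda.1se for cv_fit$lambda.min gives the less conservative version mentioned in the video: more terms survive, at the cost of some of them being noise.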

Info

Channel: David Robinson

Views: 3,923

Rating: 5 out of 5

Id: _IvAubTDQME

Length: 71min 30sec (4290 seconds)

Published: Mon Mar 16 2020