Tidy Tuesday Screencast: Analyzing horror movie profits in R

Captions
Hi, I'm Dave Robinson, chief data scientist at DataCamp. I'm excited to do this second screencast, where I'll be live-analyzing a dataset I haven't seen before. Each dataset comes from the Tidy Tuesday project, and I'm excited about this one. All I know about it is the title, but I know that it's on topic for the end of October: we're looking at horror movie profits. So I'm really excited to dive into that dataset, see what it's all about, and see what we're going to learn together about it.

All right, let's see. I'm going to start by finding out... oh, it's all about "are scary movies the best investment in Hollywood?" Looks like we've got eight columns, and the raw data is here for it: movie_profit.csv. Looks like that'll be the raw data. What I like to do with raw data is click it, click Raw, and select the URL. Now I'm going to go into my new project, an R Markdown file, where I'm going to load in the tidyverse package and then load in our spooky dataset of movie profits. I'm going to call it movie_profit.

Okay, I've got some parsing failures. We can look at those in a bit. I'm going to explore the data first and see if it looks reasonable. All right, I click this and view it: about 7,500 rows. Looks like one row per movie, probably. We have a release date, but it's not in a great format, so I'm going to need to parse that. We have the distributor, the MPAA rating, the genre (and we're probably going to find horror movies among those), and the domestic and worldwide gross. Hmm, looks like some of them are zero. That's something we're probably going to need to clean out, probably the ones that haven't come out yet. We have production budgets, domestic and worldwide gross. All right, not too many columns. MPAA rating: I know we're going to have to control for that, since R-rated movies tend to make less money. Genre, title. All right, and it looks like that first column is a row name. So how would I parse this? I'd probably just say row names... what do I say?
Do I say row names equals TRUE? That's not even a thing in read_csv. What I'll do here is just select it out. It's called X1, not an important column. That's good.

I'm also going to parse the release date as part of this cleaning process. What do I use? I use parse_date_time. I'm going to want the lubridate library, for dates and times, and I call it on release_date. I use this all the time, but I keep forgetting the particular pattern for parsing a month/day/year. Let's see: month looks like m exclamation point, and day is... all right, I like that. These are the kind of thing I just have to try out and experiment with a bit: percent d, percent, I think, capital Y if it's all four digits. We're going to find out the exciting way together. And the column is called release_date. That looks pretty good.

The one exception is that, oh, it's parse_date_time, and I just want parse_date, because otherwise it gives me midnight on every day. It does not like that at all. Let me figure out what I did wrong here. parse_date, hmm, parse_date, format equals... maybe I got the wrong parse_date. Oh boy, look at me not doing the best job of cleaning this data. I could say, all right, I'm going to cheat and do as.Date of parse_date_time, because that worked. Maybe I'll come back later and clean that up, but right now I've turned it from a character string into a date. I could have done better at that cleaning step.

So we have date, movie, production budget. Right, whenever I see these numeric columns, production budget, domestic gross... actually, I want to filter one more thing. I'm going to filter on release date. I'm going to say let's skip this whole year, because this year more movies could still be released, and the ones that are out are probably still making more money. So I'm going to say up to 2017. Will we come back and look at the 2018 data? Probably not. All right, now I've got eight columns.
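A minimal sketch of the cleaning steps described so far: drop the row-name column, parse the release date, and keep only movies released before 2018. The rows and movie names are hypothetical, and the month/day/year string format is an assumption. The exact format string tried in the video isn't recoverable from the captions, so this sketch uses lubridate's "mdy" order instead.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the raw data; the real file has about 7,500 rows.
# The "month/day/year" string format is an assumption.
movie_profit_raw <- tibble(
  X1 = 1:3,
  release_date = c("6/22/2007", "7/28/1995", "11/9/2018"),
  movie = c("Movie A", "Movie B", "Movie C")
)

movie_profit <- movie_profit_raw %>%
  select(-X1) %>%                                   # drop the row-name column
  mutate(release_date = as.Date(parse_date_time(release_date, orders = "mdy"))) %>%
  filter(year(release_date) < 2018)                 # skip the incomplete year

movie_profit
```

Note that the filter comes after the date is parsed; comparing the unparsed character column against a year would not behave as intended.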
Looking at the columns (release date, movie, production budget, domestic gross), I've got three interesting numbers that I want to take a look at. Whenever I want to look at a numeric column, I look at a histogram. I'm going to look at production budget, with ggplot2. I like to set my theme before I run it. And here's a histogram of production budgets. Almost any time you see money, you're going to want to put it on the x-axis on a log scale. We see a sort of log-normal distribution here. Now, it's not perfectly log-normal. For the labels, I'm going to throw in the scales package (I'm going to separate a setup chunk from our data processing chunk) and say dollar_format. I like that: a million dollars, ten million dollars. Looks like the modal budget was between ten and thirty million dollars.

I have a suspicion that the reason it's not log-normal is that we have different distributors here, some of which are much larger than others. How many distributors do we have? Well, let's look at count distributor, sort equals TRUE, to explore the data a little more. 298. Whenever I have this many distributors, I'm always going to want to lump the ones that aren't among the most common. I really think everyone's going to recognize these first six, up to Walt Disney. I don't even know what 1.0 means under that. That's not a studio name; I don't know what that's doing here. But we're going to lump everything besides the first six.

This is making me realize I want to separate my steps. I want movie_profit_raw to be read in, and then I want to create the movie_profit dataset from movie_profit_raw afterwards. This is going to give me a lot of leeway, so that I can rerun this movie_profit step later. What other cleaning do I want to do? I want to take distributor. I'm going to do it in the same mutate, up here, but I'm going to move the mutate after the filter; I like my filter to come before I make changes to columns. I'm going to make my change to distributor.
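The "money goes on a log scale" advice can be sketched as follows, assuming a data frame with a production_budget column. The budget values here are made up for illustration.

```r
library(ggplot2)
library(scales)

# Hypothetical budgets spanning several orders of magnitude.
budgets <- data.frame(
  production_budget = c(5e5, 2e6, 1e7, 1.5e7, 3e7, 1e8)
)

# Histogram with a log-scaled x-axis and dollar-formatted tick labels.
p <- ggplot(budgets, aes(production_budget)) +
  geom_histogram(bins = 10) +
  scale_x_log10(labels = dollar_format())

# The labeller turns raw numbers into readable dollar amounts.
label_example <- dollar_format()(1e7)
```

On a linear axis, the handful of very large budgets would squash everything else into the first bin; the log scale spreads the bulk of the distribution out.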
Distributor equals fct_lump (really happy I used it in the last screencast) of distributor, and what is it, n equals, I think I say six. It's going to be the six most common along with Other. So now if I count distributor, sort equals TRUE, it gives me seven, and then the missing values. Hmm, I ask myself whether there's a way to handle that in fct_lump. There isn't. That's okay; some of them have missing values. So we have six main ones. The six main ones don't look the same as when I counted before. Let me try removing this for a moment. I saw, what was it, Sony, Universal, Paramount... oh, this was the six most common, oh, and Walt Disney. I think since I did the filter, something might have changed here. And 1.0: what are these numbers?

I'm going to take a quick look at the original data. I'm curious: we have Walt Disney, and where can I find 25.0? All right, this is some strange data that's coming in. It says here the distributor is a company called PolyGram, here it's 45.0, and this one is 2.0. I'm now noticing something interesting about this data. There are actually two rows here for Mighty Joe Young: one says 25.0, the other says Walt Disney. And I'm noticing Little Nicky: we have two rows for it. Yeah, this is some strange data. It looks like it's usually duplicated. I'm going to take a quick look at our dataset and count by movie, sort equals TRUE. Yeah, we have a lot of duplicates here.

Let's check the documentation and see if it mentions anything about it. Otherwise I'm probably going to be just directly removing that data, but I want to see if it can tell me why we have these duplicate rows. Hmm, it's not telling me anything about that. I think this looks like not great data. Oh, unless there are sequels. Is this about sequels? Or maybe it's about multiple movies with the same name, like, for example, multiple Halloween movies. Let's take a look. Since this is Halloween, it's important that we figure this out.
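A small sketch of the lumping step, assuming the forcats function fct_lump with its n argument (the function the captions render as "FCP lump"). The distributor names and counts are hypothetical, and n is 3 rather than 6 to keep the toy data short.

```r
library(dplyr)
library(forcats)

# Hypothetical distributors: three common ones, two rare ones, one missing.
movies <- tibble(
  distributor = c(rep("Studio A", 5), rep("Studio B", 4), rep("Studio C", 3),
                  "Tiny 1", "Tiny 2", NA)
)

# Keep the 3 most common distributors and lump the rest into "Other".
counts <- movies %>%
  mutate(distributor = fct_lump(distributor, n = 3)) %>%
  count(distributor, sort = TRUE)

counts
```

As observed in the video, missing values are not folded into "Other"; they stay as their own NA row in the count.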
What do Halloween and Carrie each represent? If I filter for movie equals Halloween to figure out what's going on with that data, I actually get two Halloween movies, each duplicated and each having a bunch of distributors. Okay, and we have multiple MPAA ratings. This is some not-so-clean data. When I see data this bad, oof. In this case I probably want to keep one of these rows and drop the rest. Hmm. Now let me take a look at Carrie: 80.0, 6.0. Okay, there were two Carries, but there are eight Carrie rows here. And before, I saw that Mighty Joe Young popped up twice in the dataset with the same data, except here the MPAA rating was incorrectly stated as 1998, and here it's stated correctly, so this is the better row. So in a lot of these cases there's a good row and a bad row, and the good row comes second.

All right, I'm going to look at one of those other ones that was counted, just to get one more feel. The Ten Commandments came out twice. Those are two different versions, which is totally fine, except that, yep, sometimes the MPAA rating gets a year, which it shouldn't be getting, and sometimes it gets the actual rating. I understand that the data doesn't have to be perfect for me to explore it, but I'm probably going to spend about five minutes cleaning it before I jump any further.

So let's take one more look at the raw data. There were a lot of warnings when I read this in, so I'm going to take a quick look at the problems associated with movie_profit_raw. "No trailing characters"... well, domestic gross, it actually looks to me like it did fine parsing those. All the problems are with domestic gross; that's not what's popping up as a problem here. If I look at the original data and search for, say, Mighty Joe Young, yeah, I'm seeing it.
Ah, it had two genres. Okay. And if I take a quick look at Halloween (a lot of movies have Halloween in the name), yeah, I see action and horror rows for the movie Halloween, and multiple distributors even within each date. Okay, hmm. I'll tell you what I'll do now. I'm going to want only one row for every movie that pops up multiple times, and I'll often want the last one. It looks like for Halloween the last one here (Universal, R, Horror) is one of the better ones. Often it's got a bad one first and a good one after. Okay, The Last House on the Left pops up three times, and the last one is the best: that has Universal, R, Horror.

I'm going to go through a data cleaning step here, and then I need to make Jenny Bryan happy and save my file as I go. Here it is: I'm going to do a distinct. First I'm going to reverse these: I arrange in descending row_number(), which is a way of reversing the order of my rows, and then take distinct on every combination of movie (that's the title) and release date, with .keep_all equals TRUE. So what I'm doing is keeping the last version of each, by first sorting in descending order and then keeping the first one. Now if I count by movie, a lot of these are movies that I know came out twice, like Halloween, or like Carrie, which had the remake. I do feel better about this. And if I count by both movie and release date, it never has duplicates.

All right, I'm not sure if this data is good enough, but I'd like to start exploring it. We may come back and grab a better version of some of these. It looks like there may be some corrupted data within this dataset, which is very typical when we're starting to analyze something. We still have these 30.0s and 1.0s that I was getting; I might come back to that. I'm going to leave in that line where I did mutate distributor equals fct_lump of distributor, 6. Now if I count this, I like this data more.
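The "keep the last row of each duplicate" trick described above can be sketched like this. The titles, dates, and ratings are hypothetical; the pattern of a bad row (a year in the rating column) followed by a good row mirrors what the screencast describes.

```r
library(dplyr)

# Hypothetical raw rows: "Film A" (1998) appears twice, with the bad row
# first and the good row second. A later remake shares the same title.
raw <- tibble(
  movie = c("Film A", "Film A", "Film B", "Film A"),
  release_date = as.Date(c("1998-12-25", "1998-12-25",
                           "2000-11-10", "2013-10-18")),
  mpaa_rating = c("1998", "PG", "PG-13", "R")
)

deduped <- raw %>%
  arrange(desc(row_number())) %>%                    # reverse the row order
  distinct(movie, release_date, .keep_all = TRUE)    # keep first = original last

deduped
```

Because distinct() keeps the first row it sees for each movie and release date pair, reversing the rows first means the last occurrence in the original file is the one that survives. The remake is kept because its release date differs.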
A few are missing, a lot are Other, and then there are six big distributors. Let me take this histogram and turn it into a box plot. I want to know, based on the distributor, which distributors spent more money than others. The y-axis goes on a log scale in the box plot. What does it look like? Aha, we definitely see that. I'm going to coord_flip so this is a lot easier to read. What we see is that movies with missing distributors had way less money; makes sense. The ones in the Other category show a wide variety but generally a lower budget. And then all our major studios, Warner Brothers, Universal, Paramount Pictures, with a couple of exceptions, have high production budgets. Well, time is definitely going to play another role here. We're going to need to see that when we explore it. That's good to know.

Okay, that was production budget. I'm going to copy and paste the exact same box plot, but this time look at the worldwide gross. Similar: we see that missing makes the least. The ones where we don't know the distributor make the least money. There aren't a lot of those; we could probably filter them out, it's not a big deal. And then the ones from 20th Century Fox through Warner Brothers generally made a lot. This is all on a log scale, and it's all compressed across time. We would need to return to that.

But we've been looking at distributors so far, and it sounds like what we're really interested in is which genres make the most money, or, probably more interesting, which make the most profit from an original investment, not just the most overall box-office gross. I'm going to start by finding out how many genres there are. I count the genre: just five. That makes it really easy; we don't need to do any lumping. I'm going to copy my box plot logic and change it to genre, starting with production budget. Adventure costs the most on median, then comedy, drama, action, horror. I'm going to throw in a step here where I mutate the genre to be the genre reordered by production budget. It's nice; we saw this in the last screencast.
With that, I can sort this. All right, this is on a log scale, so a small difference up here can make a big difference in real money. This might be doubling the median budget. All right, that's interesting. We're also not looking at time at all, and that's something we really have to explore, especially if these aren't adjusted for inflation, which I don't know that they are. I can take a quick look at the data description: does it mention anything about it?

All right, so what I'm going to do now is take a look at budgets over time. We've got release date here, and in particular I'm going to group it by decade. We have a date, so we can get the year of a release date and we can round it. Actually, I'm going to round it down, which is a bit of a trick: I can do 10 times the floor of the year divided by 10. Now if I count the decades, we actually get the decade that starts before each date, rather than the one after it.

Now I see 2020 here. I thought that I filtered that out, so I'm going to take a quick look. If I say filter decade equals 2020, I see this movie with a release date in 2020. I said release date is less than 2018, so what is that? I don't know why this movie, Singularity, has that release date. If I take this and actually say filter release date less than 2018-01-01, it's gone. So I don't know why I was getting it before. Let me try running this line one more time. Did I possibly just not run that line? No, I ran that line. Filter release... oh, that was silly of me. I filtered on the release date before I turned the release date into a date. All right, now we can run this. I like that a lot more.

Now if I count by decade, what I see is one movie in the 1910s, another from the 1920s, and generally more and more movies from each decade. Most of our data, we notice right now, is from 1970 onward or so. Okay, that's really good to know.
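A worked sketch of the decade trick, 10 times the floor of the year divided by 10, on a few made-up release dates:

```r
library(dplyr)
library(lubridate)

# Hypothetical release dates, already parsed as Date values.
movies <- tibble(
  release_date = as.Date(c("1978-10-25", "1999-07-30",
                           "2007-09-14", "2013-07-19"))
)

# Round each year down to the start of its decade, then count per decade.
by_decade <- movies %>%
  mutate(decade = 10 * floor(year(release_date) / 10)) %>%
  count(decade)

by_decade
```

For example, 1999 divided by 10 is 199.9, the floor is 199, and multiplying back gives 1990, so a late-1999 release lands in the 1990s rather than being rounded up to 2000.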
If I group by decade, it's really going to be tough to work with time when we're looking mostly from 2000 onwards, but I'm going to just get a quick sense of it. I'm going to summarize each of the columns from production budget on. I mean, with summarise_at I have to say all the way from... I'm actually not going to... no, I'll do it. I'm going to take from production budget all the way to worldwide gross, and take a median of each of those columns. And we need to say na.rm equals TRUE, since some of them may be missing. This lets me find the median domestic gross, worldwide gross, and production budget all in one go.

So one thing I find here: maybe the data is adjusted for inflation, because the median production budget of movies in our dataset has actually been decreasing a little bit since the 1990s, as has the domestic gross, and the worldwide gross has been pretty steady. This is making me think the data is adjusted for inflation. I think it's very unlikely that would be the case if we were looking at unadjusted data.

Another way I can check whether the data is adjusted for inflation is to find the top money-making movies of all time. Let's arrange by descending worldwide gross: Frozen, Minions, Jurassic Park, Star Wars: The Phantom Menace. Maybe that's not adjusted for inflation; a lot of those are newer movies. But we definitely only have a limited dataset; a lot of these are coming from the 90s and 2000s. All right, what if I sort by domestic gross instead? Phantom Menace, A New Hope. Yeah, the fact that A New Hope is up near the top is making me guess these are adjusted for inflation. I know that Star Wars is one of the highest-grossing movies of all time if you do adjust for inflation. All right, that's interesting. I wonder if I can find that out anywhere. If I actually go to the article (though I'd like to avoid the article) and search for the word inflation... hmm, not so sure. Okay.
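A minimal sketch of the per-decade medians described above, using dplyr's summarise_at over the range from production_budget to worldwide_gross. The numbers are invented, and newer dplyr code would typically use across() for the same job.

```r
library(dplyr)

# Hypothetical rows; column names follow the ones read out in the screencast.
movies <- tibble(
  decade            = c(2000, 2000, 2000, 2010, 2010),
  production_budget = c(10, 20, NA, 30, 50),
  domestic_gross    = c(5, 15, 25, 40, 60),
  worldwide_gross   = c(30, NA, 50, 80, 100)
)

# Median of every column in the range, ignoring missing values.
medians <- movies %>%
  group_by(decade) %>%
  summarise_at(vars(production_budget:worldwide_gross), median, na.rm = TRUE)

medians
```

Without na.rm = TRUE, any decade containing a single missing value would get NA for that column's median.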
Okay, for now I'll stick to taking it at its word. I think at some point I'm probably going to want to filter for movies that are only in, say, the last 30 years, or even just the last 10 years. But for now, just to get this idea across, I can make a really neat graph. Let's gather, or, since it's Halloween, we'll do a ghastly gather, and later perhaps a spooky spread. Here we gather everything besides decade and plot all three of these in a great line plot. I love doing my plots over time. We say color equals metric, and now you get a sense of typical values of these metrics over time. I'm going to quickly throw in scale_y_continuous with labels equals dollar_format.

All right, this is a line graph where we see what a typical production budget is (median production budget, that is) and what a typical gross is. I should probably have filtered out the really rare decades. I didn't, but it's not too crazy down here. Generally, it looks like the domestic gross peaked in the 70s, which definitely makes me think that this data is adjusted for inflation; otherwise that would be pretty strange. And worldwide gross has been flat. All right, I'll assume it's adjusted for inflation, which is also going to mostly let me ask my questions about differences between genres without worrying too much about change over time. Later I might explicitly filter for movies in the last 10 years and see if that changes my conclusions.

Okay, there is so much variation among these five genres. I looked at production budget for each of these, and there's so much variation within each genre. A lot of that is probably due to distributor, so I can combine both of these into one graph. There are only five genres, so I can use facets to represent another variable, the distributor, and I'm also going to filter for the cases where we know the distributor.
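The "ghastly gather" step can be sketched as follows: reshape a wide table of per-decade medians into one row per decade and metric, then draw one line per metric. The median values are hypothetical. tidyr's gather() is what the screencast names; pivot_longer() is its newer equivalent.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)

# Hypothetical per-decade medians in wide form.
medians <- tibble(
  decade            = c(2000, 2010),
  production_budget = c(15e6, 20e6),
  domestic_gross    = c(15e6, 30e6),
  worldwide_gross   = c(40e6, 60e6)
)

# Gather everything besides decade into metric / value pairs.
long <- medians %>%
  gather(metric, value, -decade)

# One line per metric, with dollar labels on the y-axis.
p <- ggplot(long, aes(decade, value, color = metric)) +
  geom_line() +
  scale_y_continuous(labels = dollar_format())
```

The long form is what lets color = metric work: ggplot2 needs the metric name as a column value, not as three separate column names.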
Keeping only the cases where we know the distributor, there are now going to be seven facets. Not my favorite number of facets, but at least I can actually tell what some of the differences here are. All right, so I see, for example, that across all six of these major studios, action always makes more money, pardon me, always costs more money. Action and adventure generally cost more. And horror... ooh, Walt Disney is probably producing very few horror movies. It looks like horror usually costs the least. Okay, that's interesting.

What if I change this? What if I said, let's look at worldwide budget, pardon me, worldwide gross? We're absolutely going to be comparing those two, looking at the difference between them: how much did each movie make over its budget? Before we get to that, let's look at the typical gross. All right, we probably don't want to filter out Other, but we might need to look at Other separately. It's a whole different story. We see here there's a longer tail, especially on the lower end. It's rare for a movie made by a major distributor to gross very little money; it's not that strange for an independent distributor. So we might sometimes be looking at just these big distributors.

Okay, let's start talking about profit. Let's say we're talking about the difference between a worldwide gross and a production budget. I'm going to move my "What are typical budgets over time?" section down here and start a section on which genres (because we're going to be looking at genres) have the biggest payoff. We're looking at a ratio of worldwide gross to production budget. Here I can tell you we have to be really careful: there's going to be some movie in here with a $1 production budget that makes a million dollars, and it's going to throw off everything we do. So when I analyze this kind of question, I like to start by adding a column. I'm just going to call it profit ratio; I think that kind of gets the idea across.
So I'm going to say profit ratio equals worldwide gross divided by production budget. All right, I can see, for example, in the first ten movies we happen to be looking at, we have movies like Cube that made 36 times its budget, and movies like I Married a Strange Person and November that lost a little of their money. If I arrange in descending order of profit ratio, we're going to get some infinities here, from a production budget of zero; we're probably just going to want to ignore those. But let's start by looking at... ah, Paranormal Activity and The Blair Witch Project. Those make a lot of sense. We've got some horror movies here that are making way more than their original budget. Yeah, this is really getting us into the Halloween spirit. For the original Halloween we don't have a domestic gross, but we have a production budget. All right, this is actually not nearly as strange data as I was thinking. A few movies making back a few hundred times their budget makes a lot of sense.

This isn't what the article looked at, but I'm also interested in the biggest busts. And these I'm suspicious of. We have these movies... yeah, 12 Angry Men is a classic. There's no way 12 Angry Men made $0. None of these movies did, I think. I think we've just got to filter these out; these are bad data. Maybe there's a way to fix it from the original data, but I'm going to go ahead and remove it. I'm actually going to remove it all the way up in our cleaning step. We know we don't have every movie of all time, so we shouldn't get too precious about throwing out some movies when we want to. Here we have some movies I've never heard of, which is really kind of the point. These are going to be some that really made very, very little money.

All right, what's the general distribution of profit ratios? This I definitely want on a logarithmic scale. There isn't even a question: when you have ratios, they go on a log scale. Otherwise they're very misleading.
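A small sketch of the profit ratio column, with the suspicious zero-gross rows removed as described. The titles and dollar amounts are hypothetical. Dropping zero budgets as well is an assumption, based on the remark that the resulting infinities should be ignored.

```r
library(dplyr)

# Hypothetical movies, including a zero gross and a zero budget.
movies <- tibble(
  movie             = c("Film A", "Film B", "Film C", "Film D"),
  production_budget = c(15000, 1e6, 2e6, 0),
  worldwide_gross   = c(3e6, 5e5, 0, 1e5)
)

ratios <- movies %>%
  filter(worldwide_gross > 0,       # a $0 gross is treated as bad data
         production_budget > 0) %>% # a $0 budget would give an infinite ratio
  mutate(profit_ratio = worldwide_gross / production_budget) %>%
  arrange(desc(profit_ratio))

ratios
```

A ratio of 200 and a ratio of 0.5 sit very unevenly on a linear axis, while on a log scale "made back 10 times" and "made back one tenth" are the same distance from break-even, which is why ratios are plotted on a log scale.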
Here we have some movies that make back a thousand times their budget, 10 times, one tenth, one thousandth. I wonder how much the median movie makes back. I'll use summarize for that: median profit ratio. Probably above one... 1.7. All right, the median movie makes back a little more than it put in.

Okay, let's look at some distributions. I'm going to look at genre against profit ratio, because I'm pretty sure this is what the article was really getting at: that horror movies have unusually high ratios. Here we go. Yeah, horror definitely is the highest, but it's so hard to tell with these distributions. When I look at these, a box plot is kind of misleading, because we're spanning several orders of magnitude here. This is the range between losing nine-tenths of your money and making 10 times your money back. That's a huge gap, but it's only a small part of our graph. I'm going to instead look at the median profit ratio. So I'm going to do median profit ratio equals median of profit ratio, and arrange in descending median profit ratio. All right, the ones that make their money back the most are horror, at two and a half times for the typical one, then adventure, comedy, action, drama. That's pretty interesting.

If I had to stop right now and make one graph (I'm about halfway through, but if I had to stop), I would probably make a bar plot. I would definitely coord_flip, and I'd put median profit ratio on the y-axis. For the y-axis labels, I can't remember if there's a function for this, but I'd write one myself: paste your number with an X, because I want it to be 1X, 2X. That's pretty good. So these are movies that go bust, and these are movies that make back exactly what they cost. Here we go: reorder genre by median profit ratio. (I used a plus instead of a pipe.) There: that's which genre makes the most money back.
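The bar plot described above can be sketched as follows, assuming a profit_ratio column already exists. The ratios are hypothetical, and the labelling function is the hand-written "paste an X" idea from the video, not a function from any package.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical profit ratios per movie.
movies <- tibble(
  genre        = c("Horror", "Horror", "Horror", "Drama", "Drama",
                   "Action", "Action"),
  profit_ratio = c(1, 3, 5, 0.5, 1.5, 2, 2)
)

by_genre <- movies %>%
  group_by(genre) %>%
  summarise(median_profit_ratio = median(profit_ratio)) %>%
  arrange(desc(median_profit_ratio)) %>%
  mutate(genre = fct_reorder(genre, median_profit_ratio))  # order the bars

# Flipped bar plot with tick labels such as "1X", "2X".
p <- ggplot(by_genre, aes(genre, median_profit_ratio)) +
  geom_col() +
  scale_y_continuous(labels = function(x) paste0(x, "X")) +
  coord_flip()
```

Summarising to one median per genre sidesteps the problem with the box plot, where the interesting differences occupy only a small slice of a log axis spanning several orders of magnitude.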
That's one way I could look at it. Do these conclusions change if I look within a particular period of time? I'll start very roughly, and I'm probably going to undo this. What happens if I say release date since 2010? Nope, pretty similar ratios. What if I say only (remember, I filtered out 2018) movies released last year? Still pretty steady. Notice I'm keeping genre all the way through. I like this graph; I'm going to keep it. I'm even going to promote the profit ratio up to my original data cleaning, so I need to rerun the data cleaning step.

If I want to see whether this changed over time, instead of grouping by genre I could group by genre and year. I know as soon as I do this that it's not going to work for every year. What I would do is change this to genre and year and include the number of movies in each year. I need to define year; year is a lubridate function: year of release_date. I can tell I'm not going to get anything out of knowing about the one action movie released in 1934; it's not useful here. I could filter for years greater than... let's look at the new millennium, only movies since 2000. Most genre-year combinations in this aggregation are fine: all the action ones have at least 40 or 50 movies. I feel better about that. If I arrange by number of movies, I can see whether any of them have relatively few. In horror, it's worth considering that it's going to be noisier. I still feel kind of good about making a graph out of it, even with horror in 2000 and 2001 having only eight movies.

Okay, I'm going to go ahead and use this to ask the question. Let's see, most of this is all good. I just change this to a geom_line, put year on the x-axis and color equals genre, and I need to ungroup after the summarize, otherwise I can't keep editing the genre variable. I should have dropped the coord_flip, and I forgot the filter for year greater than or equal to 2000; I had neglected to put that back in the code.
All right, aha, now I see a story. The story is that for a long time, horror made a little bit more money than the other genres in terms of the ratio of making back its money, but there was a phase change around 2012. 2013 would be the first year that the median horror movie made six times as much money as it cost. Isn't that something? I wonder if the budgets changed or the money it made back changed. I also wonder how this relates to distributors. This is so interesting: horror movies started being more profitable around 2013. All right, my favorite horror movie was The Cabin in the Woods, which came out in 2012. I don't think it was too much of a box-office smash, but that predated this kind of profit renaissance. 2013... was that when Paranormal Activity came out? Maybe that makes some of the difference.

All right, I want to look by distributor. One of the problems when I look by distributor is that we're going to have way too little data. There were eight horror movies in some of these years; splitting by distributor is going to narrow it down way more. So if I want to look by three variables, genre, distributor, and something about time, I'm going to need to do it by decade. So I'm going to promote my earlier decade analysis. Where did I say decade equals? I'm going to grab this code and put it into my original mutate, to ensure that I always have decade when I want it. And I'm now going to group this dataset by genre, distributor, and decade. I may be able to filter only for 1990 onward; I'm going to find out. I typed 1999, pardon me, 1990, and this needs to be decade. Okay, if I arrange in ascending order of number of movies, I still have decades with only one movie, in action and adventure. But, pardon me, distributor NA is particularly uninteresting. I'm going to throw that into my filter: not is.na of distributor.
I don't even care: if we don't know the distributor, forget about it. So generally it's studios that produce very little horror. That makes sense to me. In a moment we're going to find out which of these distributors started cranking out horror movies, but first we're going to look at their relationship to median profit ratio. For that we have to add another variable to this analysis: we're going to facet by distributor. I seem to be missing something here. Oh, decade: the x-axis needs to be decade, not year. I realize we have seven facets. Ooh, this is a pretty clear story. I'm going to get right back to you, Paranormal Activity.

One thing is I've got to drop Walt Disney out of this. If I'm looking at horror movies, it doesn't make sense to keep including Walt Disney, which had so few horror movies in this period of time. That means when I do fct_lump, I probably want to change the fct_lump to only keep five, because I just happen to remember that Walt Disney was the sixth. We can drop it out. The graph looks better; everyone looks happier. Except we still have this situation going on. All right, isn't that something? Oh, that is going to be Paranormal Activity. If I arrange in ascending order of number of movies, we find out that in two or three cases there were only two or three horror movies in a whole decade. Oh, this is maybe The Blair Witch Project, now that I'm thinking about it. We'll get back to Paranormal Activity.

All right, I can't do much with this movie-by-distributor data, at least in terms of numbers. If I look at a median of only a handful of numbers, I'm going to need to start adding confidence intervals, and the confidence interval is going to be too wide. I'm not crazy about it. I'm going to stop asking myself which distributors made the most profit, and instead just ask the question: where are the horror movies coming from?
coming from? All right, this is actually a more basic question than our profit question. So I'm going to say: what are the most common genres over time? I should have looked at this first. That's a pretty typical move of mine: I dive into a question I'm interested in before I ask these questions about counts. So I'm going to take our movie_profit, count decade and genre (in a moment I'll probably do year), and ask myself, by decade, colored by genre, how many movies of each genre are coming out. What I really see is that most of our data is relatively recent; it's from the 2000 through the 2010 decade. And is the ratio changing? To ask that, I'm going to change this from being a graph of lines of number over time to grouping by decade, adding a percent column where it's n divided by sum of n. Change this to percent and throw in a scale_y with a percent format, and start asking what movies make up a different percentage. All right, so now we have which movies make up each decade, and it turns out that in our old data we mostly have action. I'm not going to say there used to be more action movies, because we know this dataset does not include all movies, only 7,500. Most of our old movies are action, and for the last three decades the breakdown of horror, comedy, drama, adventure really kind of has been staying steady. So it's not necessarily about the number, but perhaps about the distributor. So I'm going to ask what distributors make the most of each genre, as a separate question: count by distributor and genre. And here this is a really classic case for a faceted bar plot. We have two options of what to put on the x-axis, the genre or the distributor. I'm going to put genre on the x-axis, because I want to be able to look at one distributor and say, what was that distributor's breakdown of movies? So I make it a bar plot, and basically I was doing a coord_flip. I
think I neglected... oh, I've neglected a filter: not is.na of distributor. And I'd probably like to put scales equals... let's put the x-axis on a free scale so they can each be... here we go. All right, one thing we learn from this... it's actually a little bit easier to read this if I give each of them a fill and then I tell it not to draw a legend. All right, most of the Other... we have our action, the vast majority. I actually had expected most horror movies to come from Other, expected Other to be shifted towards horror movies, but I was completely wrong. It looks like a lot of horror movies come from Sony and Universal, okay, and a little more rarely from 20th Century Fox. Other than that, not a huge difference. The action ones, and perhaps the adventure movies, are usually from major distributors; that makes a lot of sense. Okay, knowing about those counts was pretty useful. Knowing horror is always the rarest, but especially rare in Other, is really useful. Okay, now I'm going to move back to the questions about payoff. So horror has, on median, the biggest payout, especially in the last couple of years. Again, we can't divide this down by distributor. I'm going to ask the question of what were the highest money-making... this is actually a graph I probably should have made first. In a blog post I would say: wow, horror movies have been very profitable in the last few years; what were some of those profitable horror movies? I'm going to take it and say let's filter down, filter for genre is... and I just realized how I'm going to be showing this over time. This is going to be a pretty neat graph, but let's find out. If I filter for horror... I really can't see the names of these. Arrange in descending profit ratio, quickly pipe this to View. What I see really is Paranormal Activity, Blair Witch Project, Halloween, movies from all over our spectrum, and I'm
interested: are a lot of these recent, or what are the recent ones that are driving this? I'm not seeing a lot of them. I thought maybe there'd be a lot of movies around 2012, 2013, 2014 that had huge ratios; I don't think that's necessarily what's happening. Let's graph this a bit. I'm going to actually start by saying let's have a dataset called horror_movies. Let's take the top ten and let's just plot that: movie and profit ratio, make it a bar plot, coord_flip, and I know I'm going to need to reorder that movie. That's a pretty good-looking graph. You know, it's Halloween, so I'm probably going to want to do a little bit of coloring on this. I'm going to say fill equals orange; makes sense to me. And let's make this a blog-post-ready graph: I'd say ratio of worldwide gross to production budget. There, a spooky orange. It's not the most spooky orange, but it'll do. I'm going to bring in my scale_y, I'm going to bring in that X; I kind of like that, something such as "such-and-such times". And a title: I'd say "What horror movies have most outgrossed their budget?" That's a reasonable title; maybe I could think of a better one. And yeah, we get a top ten from this. They range from 100 times to 300 times. I could put this on a log scale; it would be reasonable to do that, but when we're looking at the top few it is kind of important to know, it's important for a studio to know, that Blair Witch Project and Paranormal Activity made far more than any others. I made this orange, but as an alternative I could have said let's make color equal distributor, so I can say: ooh, what distributors made all this money? I actually want to use fill equals distributor. Which distributors made all the money from these movies? We see a combination: Paramount for some of them, Universal for others. All right, so that's somewhat interesting. Now, I just realized I filtered for horror movies in this graph.
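The top-movies bar plot described here might look something like the sketch below. The data frame, movie names, and ratios are invented for illustration, and it takes the top three rather than ten only because the toy data is tiny; the axis-label function that appends an "X" is one possible way to get "such-and-such times" labels, not necessarily the one used on screen.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Invented horror movies with made-up profit ratios
horror_movies <- tibble(
  movie = c("Low Budget Hit", "Found Footage", "Slasher", "Sequel"),
  distributor = c("Paramount", "Other", "Universal", "Universal"),
  profit_ratio = c(400, 300, 70, 3)
)

top_horror <- horror_movies %>%
  arrange(desc(profit_ratio)) %>%
  head(3) %>%
  # Order the bars by profit ratio instead of alphabetically
  mutate(movie = fct_reorder(movie, profit_ratio))

p <- ggplot(top_horror, aes(movie, profit_ratio, fill = distributor)) +
  geom_col() +
  coord_flip() +  # horizontal bars keep long titles readable
  scale_y_continuous(labels = function(x) paste0(x, "X")) +
  labs(
    x = "",
    y = "Ratio of worldwide gross to production budget",
    title = "What horror movies have most outgrossed their budget?"
  )

p
```

Replacing `fill = distributor` with a fixed `geom_col(fill = "orange")` gives the single-color Halloween version mentioned in the narration.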
But I might want to work backwards. I know I already saw that some of the most money-making movies, especially of recent times, were horror, but I could make a variation of this plot where instead of graphing just horror, I graph all movies, the most outgrossing movies. I'm going to put this in the section where I'm already talking a little bit about what genres make the most money, and I do it with all movies. I say movie_profit, arranged in descending profit ratio, head of ten, and I say fill equals genre, and I ask myself the question: of these top movies, what tends to be over-represented? We see five of them are horror movies and three of them are dramas. Adventure movies don't make this list, probably because they tend to have high budgets: high budget, high return, but you won't get these same crazy ratios. All right, whenever I make a graph like this I sometimes realize we'd probably want 20 items on it. All right, so, you know, we saw that horror had a higher median, but not by a crazy amount. Having said that, I notice a lot of these are recent; a lot of the horror movies are recent. Hmm, we could look at this only in a recent period of history, since we saw a bit of a change in 2013. I'm going to just keep it as it is. We could say "What movies have most outgrossed their budget?" This is a pretty solid graph. I would say fill equals, I'd capitalize it, Genre. This could go in a blog post. I'm into this graph. I found out some interesting information; like, I definitely did not know Bambi was this giant hit. That's cool. All right, I learned some things from this graph. And then, since we saw that horror stands out, we could absolutely make the horror-specific version of this, extend it to 20 movies, and share this version, and then color could represent distributor. That's one way that we can represent this. I'd probably also add to this movie... I notice I keep talking about time, so I probably want to throw in
the release year for every one of these movies. So I want to say let's take movie and year of release date. That's a good-looking graph. You see these kinds of things a lot, where people throw in a year. Not a particularly unusual number of recent movies, especially given that's what our dataset is biased towards, but yeah, that's a solid-looking graph. I could also scatterplot this. So what I can do is... I'm going to maybe start by doing it for all horror and maybe then do it for all movies; I'm going to kind of play it by ear. I'm going to make horror movies its own chunk, and then I'm going to take this, put release date on the x-axis, the profit ratio on the y-axis. I'm absolutely going to scale_y_log10, but that's me doing things a bit out of order. I'm adding geom_point, and I'm probably going to filter: we have so few movies before 1990, I really think it's not a great idea to include all that time period. So say horror movies since 1990: what has been standing out in terms of profit ratio? And one more thing, hmm, just in horror movies. All right, I'm going to try adding in titles. When you add titles onto a crowded graph like this, you absolutely need check_overlap equals TRUE, and I say label equals movie, and I adjust them to not overlap each point. Here we go. And I notice this upward trend. When I see an upward trend, really any kind of trend, I like to throw in a geom_smooth and see how it's changed. Last thing is, I really like my earlier idea of pasting an X onto each label, so I'm going to throw that into here, on the axis. It doesn't seem to work. Oh, I see: I need to put the labels into the same scale_y_log10 that I'm already using. All right, this is an informative graph. What I notice here is that we used to sometimes see, like, tenfold making back your budget, or maybe four- or fivefold. So here, I Know What You Did Last Summer, a
big summer hit. But then we saw Blair Witch Project change a lot, and then we saw more and more movies in the 2000s and 2010s that made back more of their budget. In fact, when I have the graph like this, I realize that I need breaks. And wow, there's some wasted space down here. I'm just going to go ahead and filter: profit ratio needs to be more than 0.01. Breaks: maybe 0.1, 1, and 10; I'm going to start here and, yeah, let me see. All right, and 100. Yeah, that's a better-looking graph already. What I can say here is: look at that, we have an upward trend. Some of these movies are really making a lot more than they used to, and some of these might be driving that difference where horror movies are making more profit than other types. All right, if I made this for all movies and I colored it by genre, how different would it look? I say I'm going to have only one trend; I don't love the idea of a trend for every genre. I say color equals genre. I might come back to a trend for every genre; I'm going to play it by ear a little bit here. What happens is we're way too crowded. Oh, that's actually exactly the thing I was interested in: overall it's flat, even though for horror it was going up. I'll try getting one trend line for each of our genres. Still too crowded a graph, but it's going to help us pull out what we might be interested in saying. Let me see... oh, for the one trend line, it's going to have to go right here: color equals genre. Too crowded. I might start faceting. Ooh, look at that, horror being the one that goes up. I'm going to go ahead and facet: facet_wrap by genre, drop color (I might put it back in). This might need to be an interactive graph, because we're getting really crowded here. Action, adventure, comedy, drama, horror. Ha ha, see, we're very much seeing this story: horror movies have started a trend of making back a lot more than they put in. In fact, dramas might be negative.
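The faceted scatterplot built up over this stretch can be sketched along these lines. The data frame and all its values are invented, and `method = "lm"` is an assumption chosen for the example; the narration only says that a smoothing layer was added.

```r
library(dplyr)
library(ggplot2)

# Invented movies; the ratios are made up for illustration
movies <- tibble(
  movie = c("Old Slasher", "Found Footage", "Haunted House",
            "Space Epic", "Flop", "Quiet Drama"),
  release_date = as.Date(c("1985-10-25", "1999-07-14", "2009-09-25",
                           "2012-05-04", "2015-03-01", "2016-11-18")),
  genre = c("Horror", "Horror", "Horror", "Action", "Action", "Drama"),
  profit_ratio = c(20, 400, 430, 3, 0.001, 1.5)
)

recent <- movies %>%
  # Few movies before 1990, and near-zero ratios only waste space on a log axis
  filter(release_date >= as.Date("1990-01-01"), profit_ratio >= 0.01)

p <- ggplot(recent, aes(release_date, profit_ratio)) +
  geom_point() +
  # Trend line per facet; a linear fit is an assumption made here
  geom_smooth(method = "lm", formula = y ~ x) +
  # check_overlap drops labels that would collide with earlier ones
  geom_text(aes(label = movie), vjust = 1, hjust = 1, check_overlap = TRUE) +
  # The "X" suffix has to be set on the same log scale, not a second scale
  scale_y_log10(
    breaks = c(0.1, 1, 10, 100),
    labels = function(x) paste0(x, "X")
  ) +
  facet_wrap(~ genre)
```

With real data each facet would hold many points; the toy data here is far too small for the trend lines to mean anything, so the plot object is built but not printed.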
Maybe not; it's very hard to tell. That's really cool; I think that's pretty informative. Hmm, I'm going to try making this interactive. Let's save this as a graph, g, and library ggplotly... pardon me, it's library plotly, and ggplotly on this graph. Let's see how this looks. I wonder if ggplotly supports check_overlap. It does not; that's really good to know. Ooh, am I going to crash RStudio? I hope not. What happened is, it looks like ggplotly may not support check_overlap equals TRUE; I've never tried it before. And that's a bit better. The reason I want to make it interactive is that people may want to zoom in on particular movies. Right now if they zoom in, it's not going to tell them much; it'll just tell them the release date and the profit ratio. But if I throw in one more aesthetic... I saw this last week and I need to remember what it is. I'm not a hundred percent sure on this; I think if I say label equals movie... nope, that's not going to work. Oh, maybe if I do it up here, perhaps it'll work. Oh, I like it better. And I make this interactive. This would be one way of expressing what we learned. We could say: I notice each of these is flat in terms of the release date and the ratio of the worldwide gross to the production budget, but horror has been going up, and a lot of it is driven by these extremes, like Paranormal Activity, like The Devil Inside, and Unfriended, and Split, and Get Out: movies with fairly low budgets that became blockbuster hits. I'm going to also look through some of the others. We learn that movies like Open Water, which... oh wait, that was the shark one, yeah. So some of these low-budget movies, Chasing Amy, Open Water, Pokémon 3, Napoleon Dynamite, The Full Monty, some of these movies did it too, but they kind of are flat through time, whereas in horror they've been kind of cracking open this formula of the low-budget blockbuster. That's really interesting. All right, well, that's
all the time we have. There are a lot of things I didn't explore. If I added a future work section, I'm going to throw it all the way at the end here. I didn't look at domestic versus worldwide budget, pardon me, domestic versus worldwide revenue. I bet we'd find that there are different ratios of worldwide to domestic, and that, for example, action movies and horror movies may play unusually well overseas, while drama tends to be more focused in the US. It's possible. We didn't look at MPAA ratings. It's well known that R-rated movies tend to make less money because they restrict who can enter. I wonder how that plays into horror: have these horror movies that make all this money cracked open some of the ways to make a highly money-making R-rated movie? And we haven't tried a predictive model. I suspect if we combined genre, release date, studio, and production budget, we'd be able to get some sense of predictions in terms of the expected amount of profit. Having said that, I think from the data we have here there's very little we could do to say what's the next blockbuster. We'd probably want to look at a lot of other things: we could look at the script, we could look at subgenres, we could look at the actors. That would be a more interesting and useful model if we wanted to say let's predict the next blockbuster. Having said that, I'd say we definitely got a start on telling a story about how horror movies have become increasingly big money makers in the last couple of years. So thanks very much for joining me. I hope you watch the next screencast, where I'll be analyzing next week's Tidy Tuesday data, and happy Halloween.
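As one possible starting point for the first piece of future work mentioned above, the worldwide-to-domestic comparison could be sketched like this. The screencast never ran this analysis, so everything here is hypothetical: the toy values are invented, and the column name `worldwide_to_domestic` is made up for the example.

```r
library(dplyr)

# Invented grosses (in millions); not real figures
movie_profit <- tibble(
  genre = c("Action", "Action", "Drama", "Drama", "Horror", "Horror"),
  domestic_gross = c(100, 50, 40, 0, 20, 10),
  worldwide_gross = c(300, 200, 60, 0, 50, 40)
)

overseas <- movie_profit %>%
  # A zero domestic gross usually means no release yet, so skip those rows
  filter(domestic_gross > 0) %>%
  mutate(worldwide_to_domestic = worldwide_gross / domestic_gross) %>%
  group_by(genre) %>%
  summarize(
    movies = n(),
    median_ratio = median(worldwide_to_domestic),
    .groups = "drop"
  ) %>%
  # Genres that play best outside the domestic market come first
  arrange(desc(median_ratio))

overseas
```

A higher median ratio would suggest a genre earns a larger share of its revenue overseas, which is the hypothesis floated in the narration; the toy numbers say nothing about whether it holds.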
Info
Channel: David Robinson
Views: 4,600
Rating: 5 out of 5
Keywords: rstats, data science
Id: 3-DRwg9yeNA
Length: 62min 55sec (3775 seconds)
Published: Tue Oct 23 2018