Tidy Tuesday live screencast: Analyzing Super Bowl ads in R

Hi, I'm Dave Robinson, and welcome to another screencast. We'll be using R and RStudio to analyze data I've never seen before. As usual, the data comes from the Tidy Tuesday project, an amazing weekly data project in R run by the R for Data Science online learning community. And as usual, my favorite part of doing this live is that I get to interact with people who are watching. So if you're watching this live, I definitely recommend you join in the chat. I'm able to read everything: give your ideas and visualizations, ask questions about the code. It's something I find really exciting about this process.

So let's see what we have this week. This week is Super Bowl ads. I'm not a huge sports fan (what, no, Dave?). No, I'm actually not a huge football fan, and I'm also not, I would say, a huge ad fan. But the few times I have seen it, I've enjoyed the ads more than the show, and I think that's pretty typical. So let's find out what we have with our Super Bowl ads.

I'm going to go here. I usually start with tidytuesdayR's use_tidytemplate, which loads in the data, along with theme_set(theme_light()) and library(scales), bringing in all the code that I usually like. Let's see what data we have this week. We have youtube.csv, and that's the only one. So it's about ads: is this all the ads from YouTube? OK, these are things like the most popular ones. I don't know anything about it; I haven't seen anything from here. It's: what's the brand, what's the YouTube URL, is it funny, a bunch of logical ones. It looks like they've been coded. There's also the YouTube ID, the view count, and things like that. These are already snake-cased, so I'm going to save this as youtube, and let's see what we have from the data. We have 247 entries, and most of them have URLs. We have
our brands; funny; show product quickly (do they show the product quickly?); are they patriotic; do they use celebrities. So we could ask questions like: do ads that use sex have a higher view count, higher like count, higher dislike count, a higher like rate (if you define the like rate as likes divided by the view count), more comments? There's definitely no question based on this that some get a lot of comments or views and some get very few. We also have some missing data here in terms of the view count and things like that, definitely at the bottom. Favorite count looks like it's usually zero, maybe always zero. OK, favorite count looks like maybe it's a bug. Comment count looks like it's mostly meaningful. Published at: I'm curious about the date that things are published. Are they often published years later? That might affect some of our data, if something was only published very recently despite being an old commercial. But maybe not. And there's a channel and a category ID. Do we have a category table? "YouTube content category." OK, I can't see it here; we can find it later.

All right, some of the questions I would start by asking. I almost always start with a count; in this case I'm going to start with count(brand, sort = TRUE). And then I can do something like, you've seen me make this graph a hundred times if you've seen any of my screencasts, n, brand, geom_col, and mutate brand to fct_reorder(brand, n). I really need a shortcut for this; I do it all the time. So what are the top brands? Oh, how many brands is it? It's only 10 brands that we're looking at. OK. So we could definitely do some modeling here, although modeling on like 247 data points doesn't excite me all that much. You can't really do, like, cross
validation and stuff, at least for machine learning modeling. But let's ask some questions. What years do we have? Let's make this a geom_col, and all right, these are our years. Throw in a fill equals brand. Heck, change this and make it a geom_bar; we don't even need an n. We just say here's our number by year and by brand; geom_bar does the adding up by itself. It looks like a lot of them are pretty consistent, but it's very hard to tell on this one, so do it by brand, with theme(legend.position = "none"). I just want to see: are there some that are more recent than others? Not hugely. Toyota, it looks like we don't have anything pre-2004. E-Trade is maybe a little bit rarer than it used to be. Hyundai only in the last 12 to 15 years. So mostly a lot of them are spread out over the 20 years of data that we're looking at.

OK. For anything I use in terms of the number of views and things like that, let's take a look. Number of views is what I'm most interested in, so view count is going to be interesting. I'm going to histogram it, and I can already tell you it's going to need a log scale: I saw some huge numbers and some small numbers. I might want a bigger bin width; in the histogram I say binwidth, here we go. So views have a log-normal distribution, and on the scale I can add labels = comma, so the axis reads 100, 10,000, and so on. They peak around a hundred thousand views, and are sometimes as low as ten and sometimes as high as a hundred million. So let's label x as number of views.

You know, I'm thinking about this, and I actually want to apply this approach to everything. I often use gather, even though I'm supposed to use pivot_longer: if I gather into metric and value everything that contains count, or ends with count, then I can do the same thing and
facet_wrap it by metric. Favorite is going to be a silly one; it's going to be all zero. Yeah, I don't want favorite count. I could honestly just remove that in the cleaning step, so I never even attempt to use it. It looks like that was included by mistake. Remove favorite count.

Why do I do this? I want to get a sense of the scale. OK, comments are relatively rare: the median is like 80 comments. Dislikes are relatively rare. Dislike to like: I could do things like a ratio of likes to dislikes, or a percentage of dislikes to view count. Those are some of the ways I could examine this closely, and I'm probably going to use some of them. I kind of like percentages, like a dislike percentage or like percentage. Ratios too: we can do likes over dislikes, or dislikes over likes plus dislikes. These are some of the ways I'm thinking about how we can say which are the good or the popular ads.

Let's start just with view count. For starters I'm interested in some box plots, where I want to see view count by brand: geom_boxplot. We know we're going to need a log scale, and we know we're going to want to reorder. Look at it not work. Why didn't this work? Was it grouped? It's not grouped. Brand is fct_reorder(brand, view_count), and brand is a factor. Oh, "non-finite values". OK, the missing values. That makes sense: I have to filter for not is.na(view_count). If it's not grouped, then it's usually missing values; it wasn't able to find a median.

Cool. So the story is that Hyundai tends to have a lower view count, and NFL and Doritos have the higher view counts. I know Doritos ads tend to try to be funny; I doubt the car commercials try to be funny as often. Let me throw in my labels = comma. So that's one possible explanation, and I'd test that right now by saying fill equals funny. And actually this isn't quite confirmed. Within a single brand, maybe the ones that are funny tend to be higher? I don't really think so. I guess the few Kia and Hyundai ads that are funny, however many this represents, do seem to be a little bit more popular, but we're looking at pretty small numbers here for each of them. At least that was my expectation; maybe car commercials just aren't as shareable.

The other explanation we really need to consider here is time. So I'd do something like look at year against view count, with group equals year. I want to make sure that it's not that older ones have been around for longer and get more views, or that newer ones are spread more by social media. It doesn't seem to be that clear a trend. But actually I take it back, because there is a trend here. Look at this: in 2014 the median was about 10,000.
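A minimal sketch of the brand box plot and the missing-value fix described above. The data frame is a toy stand-in with invented numbers, not the Tidy Tuesday file, and `ads`, `ads_clean`, and `p` are hypothetical names:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy stand-in for the youtube data; brands and counts are invented.
ads <- tibble(
  brand = c("Doritos", "Doritos", "Doritos",
            "Hyundai", "Hyundai", "Hyundai",
            "NFL", "NFL", "NFL"),
  view_count = c(2.5e5, 1.2e6, NA, 9e3, 4e4, 1.5e4, 3e6, NA, 8e5)
)

# fct_reorder() orders brands by the median view count (its default summary).
# Missing view counts can break that ordering and cannot be drawn on a log
# scale, so drop them before reordering.
ads_clean <- ads %>%
  filter(!is.na(view_count)) %>%
  mutate(brand = fct_reorder(brand, view_count))

# Box plot of views per brand on a log scale with comma-formatted labels.
p <- ggplot(ads_clean, aes(view_count, brand)) +
  geom_boxplot() +
  scale_x_log10(labels = scales::comma) +
  labs(x = "# of views", y = NULL)
```

Printing `p` draws the plot; brands appear from lowest to highest median view count.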
If I drop this log scale, woof. Well, I definitely don't want to drop the log scale: this one was just out of control, whatever this is, 150 million. But I think I may want to look at the median views per year, and I probably want a filter. This will be good for a line plot with an extra geom_point, which has size equals n. All right, so this is the median number of views each year. It does look like in, what is this, 2007, 2008, the median was a really high number of views. It dropped again around 2013, 2014; it's almost like nothing went viral that one year. In 2015, 2016, 2017 it was a lot higher, but that was also a relatively small number of ads. Maybe this is just one ad; can't tell from here. So I would add scale_y_continuous(labels = comma). It's not a straightforward before and after, nothing like earlier versus later. The relationship between time and views is more complicated than that.

Let's label it: median number of views. I'm also going to say you need at least three to be counted. All right, so there are at least three here. What if I said at least five? OK, no, every year has at least five points. That makes me feel better about this spike: it's really these popular ones, but it's at least five ads. Let's actually dive into that year for a second; I'm really curious what that year was. Filter: year is 2017.
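As a rough sketch of the median-views-per-year summary with a minimum-count filter: the numbers below are invented, the threshold of 2 is only there to suit the toy data, and `ads` and `views_by_year` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Toy data: three years with different numbers of ads (values invented).
ads <- tibble(
  year = c(2012, 2012, 2012, 2013, 2013, 2013, 2017),
  view_count = c(1e5, 3e5, NA, 2e4, 5e4, 8e4, 2.8e7)
)

views_by_year <- ads %>%
  filter(!is.na(view_count)) %>%
  group_by(year) %>%
  summarize(n = n(), median_views = median(view_count)) %>%
  # Use filter(), not mutate(), to drop years with too few ads.
  filter(n >= 2)

# Line plot of the yearly median, with point size showing how many ads
# each median is based on.
p <- ggplot(views_by_year, aes(year, median_views)) +
  geom_line() +
  geom_point(aes(size = n)) +
  scale_y_continuous(labels = scales::comma) +
  labs(y = "Median # of views")
```

Here 2017 is dropped because its median would rest on a single ad, which is the situation the minimum-count filter is meant to guard against.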
Only five. Arrange by descending view_count, and View. OK, so that was a year with one at 28 million, then eighty-eight [unclear], 860 thousand, 310 thousand, 87 thousand. Really a lot of variation here; within these we have big differences. Ooh, I didn't see there's a text field. So the brands are Budweiser, some NFL, Kia, Hyundai. We know that Budweiser and NFL tend to be the more popular ones; maybe, controlling for brand, this was a year where we only had these in the data set.

This is a little hairy, but what if I say only years with at least... I'm missing something. Why is the data still sticking around there? I'm saying n is greater than or equal to... ah, because I'm doing a mutate; I meant to do a filter. I can start by presenting any year with more than seven, or I can make it at least ten. That's not so cool. I don't know. I can do something like this and say only ones with at least ten: so this spike is probably real, there are at least ten. I don't know about all the other ones. Oh, literally seven. We can look at this a few ways. The story does look like there was at least a peak here, and then maybe it goes down a little bit. Other things I could do: I could round to the nearest three-year segment. I'm not quite going to do that yet. All right, so we're saying time has some kind of effect, but it could also be the brand having an effect through time. I'm not so sure about that.

OK, let's look at the classifications. What do we have? We have funny through use_sex, and we might be interested in all of these. I'm going to want a box plot based on all of these. Really, I'm going to want to gather, and we'll say boolean, or we'll say category, and value. I don't have a great name for this. funny through use_sex. And let's try plotting it where we say category, value, box plot. Oops,
it's not value we care about on the axis; it's fill equals value. It's the distribution of view count that we care about, with scale_y_log10. What's up with this? OK, the story is that the data here is repeated: each ad appears in every single one, and the question is just which of the two groups, true or false, it ends up in. I don't know that that's the best way to show this data. The story is that whether or not it contains animals has no correlation with view count, but whether or not it's patriotic (is that the one? yeah) does have a correlation with view count: patriotic ones do have the higher view counts. So this is not an amazing way to look at this. What I think I might prefer is to start just by asking about the correlation. Just thinking, one sec. Yeah, I know what I'm going to do: I'm going to group by category and value and summarize, and I'll show it without a filter. What I want to check is that all of these have at least a few ads. Yeah, there's no side of any of these pairings that has fewer than like 40 here. So then I can start by visualizing category against median view count: instead of the distribution, just look at the median, with geom_col, fill, and position dodge, and ask whether there are differences here.

So here we can say: OK, the ones that involve, quote, danger or, quote, patriotic do seem to have higher view counts, and the ones like animals, celebrity, and use sex have lower view counts. We wouldn't necessarily have expected that; with "sex sells", which is a bit of a cliché, you'd think it would have higher YouTube view counts. And the ones that are funny are a little more common, but mostly there's a patriotic gap. That's interesting. I wonder how it relates to things like brand. The other way we could ask that question is to do this gather and then
group by category and summarize with a correlation: the correlation between the value, the yes or no of this category, and the view count. Now, the story here is: don't use a Pearson correlation on the raw counts, because they're log-distributed. So I want a correlation between value, the true/false, and the log of the view count, and I'll do plus one just in case there are any zeros. I could also have used a Spearman correlation, which is what you'd call non-parametric and less likely to be thrown off by outliers. But I definitely can't use a regular old correlation, because we saw the view count is log-normally distributed, really wide relative to normal. So danger is higher. Patriotic, oh, it's only a little higher; the median's a lot higher, but overall... If I said method equals "spearman", would I get different results? A little bit, not hugely. I'll keep the correlation with the log. OK, so the general story is danger and funny a little bit higher, and animals and use sex maybe a little bit lower. That's one way we can look at this.

That estimates each of the correlations independently. The other way we could look at this is a linear model of the log (I like to use log2; it's interpretable) of view count explained by patriotic plus funny plus show product quickly plus celebrity plus animals plus whether it uses sex, with data equals youtube, and it will remove the missing data for me. Once you put these together, we actually get that none of them have a significant trend. That could be just because we're testing so many of them, and maybe there are some correlations among them. I'm curious: if I just tested danger and patriotic (you shouldn't do this)... data equals youtube... if I just use danger... yeah, the correlation was just too weak. OK, so this is one other way to look at it: not sure we really see a trend here, at least just based on these. OK, so it
does look like there's a danger and patriotic difference, but it doesn't really hit the significance boundary, and when we throw them all in a model together we just don't have enough data to really see something here.

OK. You can also ask questions like: are these getting more common over time? Is it more likely over time that they'll be funny or use sex? So I do a gather and then a group by (I don't need this): group by category and year, and summarize. And I'm actually going to round the year to the nearest two. Remind myself how that works... oh yeah, I can use floor here. What is it? It's 2 * floor(year / 2). And pct is the mean of value, true or false, and it's always good to have an n. Now it's 2000, 2002, 2004. So I can plot year, pct, color equals category. I have a hard time seeing a huge trend. There is actually one trend I see, and that's in celebrity. Let's facet_wrap this by category, and while we're at it let's make the category more readable: str_to_title, and str_replace_all on category, underscore with a space. facet_wrap by category; I do not need a legend. And this is our percentage of ads; I grouped together pairs of years to make the percentages a little bit less shaky. x is time, rounded to two years; y is percentage of ads coded with this. I should call it a quality rather than a category.

All right, so then we can ask some questions. Actually, I definitely do see trends here. They're not more likely to use animals over time. They're much more likely to use celebrities in, like, 2019-2020. Danger is the same. Funny has been decreasing, at least in this data set of 247. Patriotic is increasing, and what's interesting is I would have expected patriotic to be at its peak around 2001 or 2002, right after 9/11.
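The two-year binning and per-quality percentages described above can be sketched as follows. This is an illustration on invented toy data, and it swaps in `pivot_longer()` for the `gather()` call used in the screencast; `ads` and `by_period` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy data: one row per ad, one logical column per coded quality (invented).
ads <- tibble(
  year = c(2000, 2001, 2002, 2003, 2003, 2004),
  funny = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  celebrity = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

by_period <- ads %>%
  # Long format: one row per ad and quality.
  pivot_longer(c(funny, celebrity),
               names_to = "category", values_to = "value") %>%
  # 2 * (year %/% 2) is the same as 2 * floor(year / 2): it maps
  # 2000 and 2001 to 2000, 2002 and 2003 to 2002, and so on.
  group_by(category, year = 2 * (year %/% 2)) %>%
  # The mean of a logical column is the share of TRUE values.
  summarize(pct = mean(value), n = n(), .groups = "drop")
```

Keeping `n` alongside `pct` makes it easy to see how many ads each percentage rests on before reading anything into a trend.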
Show product quickly has been steady: they do or they don't show the product quickly, and it's been steady. And the amount that use sex has been decreasing over time, it looks like.

All right, I'm not fitting a model here for each of these, but I can. How would a model work? I'd do a glm. Let's do it on just one category, then I'll show how to do it on all of them. If I want to know whether the proportion that use animals has changed over time (it looks like no, but can I check that?), I would say animals explained by year, data equals youtube, family equals... this will be a logistic regression. Pipe that to a summary and try again. Formula missing? No, I've got the formula right: animals by year. Oh, family equals binomial is the word I was looking for. And summary, this is what I was looking for, the built-in summary. So then I can ask: is there a trend of animals over time? Do a summary, look at our intercept and our slope. P-value: nope. Makes sense, looks real flat to me. What if instead of animals I was looking at celebrity? Yes, there's a statistically significant trend, assuming a logistic regression model, where we want to ask whether the rate of celebrity has been increasing over time.

So I could fit a model for each of these terms, just one by one or in a loop. If you know me, you know I like to use the broom package for these. So what I like to do is gather category and value. I'm going to call value true/false, just to really try and make it as clear as I can. Should we have said quality or something? I'm going to call it value because I've been calling it value all along, but I know that it's not the world's clearest way to put this. What I'll do then is nest by everything except for category. So now it's like, oh, I've got seven tables
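A small base R sketch of the "one by one or in a loop" logistic regressions mentioned above, before the broom version. The data are invented so that one quality is flat over time and the other rises; `ads`, `fit_trend`, and `trends` are hypothetical names, and this is an illustration rather than the code typed in the screencast.

```r
# Toy data: two ads per year from 2000 to 2019 (values invented).
ads <- data.frame(year = rep(2000:2019, each = 2))
# "animals": one TRUE and one FALSE every year, so no trend.
ads$animals <- rep(c(TRUE, FALSE), times = 20)
# "celebrity": never early on, half the time mid-period, always late.
ads$celebrity <- c(rep(FALSE, 10), rep(c(FALSE, TRUE), 10), rep(TRUE, 10))

# Fit quality ~ year as a logistic regression (family = binomial) and
# return the slope on year together with its p-value.
fit_trend <- function(quality) {
  model <- glm(reformulate("year", response = quality),
               data = ads, family = binomial)
  coef(summary(model))["year", c("Estimate", "Pr(>|z|)")]
}

# One row per quality, columns "Estimate" and "Pr(>|z|)".
trends <- t(sapply(c("animals", "celebrity"), fit_trend))
trends
```

On this toy data the `animals` slope is essentially zero, while the `celebrity` slope is positive with a small p-value, mirroring the flat versus increasing pattern discussed above.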
and in each of those, value means something different. So now I can apply this. In fact, I don't even need to nest or anything; I can actually summarize and create a model right away. So I'll say model is, you need to put it in a list to get it out, and it'll be this expression. I don't need data, because it'll be within the context of this once it's a group_by and a summarize. Oh, and instead of celebrity it's now value. Here we go, and just like that we just fit seven different models, one to each of these terms. So now, tidied is map(model, broom::tidy): the broom package turns each of these into a tidy two-row data frame. Unnest the tidied model, and now we've got our terms. We don't care about the intercept: the intercept is like where it would start at year zero, which isn't even meaningful. So I say term is not equal to "(Intercept)", and now we've got our year terms. I can arrange in descending order of the estimate. And what we can see is, OK, we ideally want to correct for multiple hypotheses here, in which case maybe celebrity doesn't make it... no, it's like a p-value of 0.01. What it looks like based on this is that patriotic has been increasing over time, use sex has been decreasing, and funny has been decreasing, according to a logistic regression, at least in terms of presence in this data set. This is not a complete set of all the Super Bowl ads, and we don't know if there's some kind of selection bias. But for these trends, it's not random noise fooling us on celebrity, funny, patriotic, and use sex, although it does look like these three, show product quickly, animals, and danger, aren't meaningful.

What that means is I actually might want to take this and call these coefficients, the year term coefficients, and then I might revise the visualization that I just did: I might want to only show the ones that show a trend over time. So I
can say, OK, where we did this, now inner_join coefficients by category. Oh, I still have this thing, that's no good: I need to move this mutate out here, and I need to get all my parentheses in order. There's the extra parenthesis, and I'm still missing things; let me see, next parenthesis, here it is. And I can now do a filter where I say p.value must be less than .01. I could do some adjustment, but this is mostly going to work, because I guess it would have been multiplied by seven for Bonferroni. Honestly, I feel pretty good about this one: even though we could use more formal methods, we can say, OK, these are the ones that show a trend over time, and here's what the trends are. So this is neat. You notice we jumped into the modeling universe and did a little tidy modeling of seven different models, we used that to get our terms, and then we used that to make our graph better. We could have just filtered for these four, but now there's kind of a principled reason for it.

All right, so that was us taking a look at the frequency of some of these types over time. Now what else would I want to look at? I'm actually a little interested in how brands use these qualities. Man, I've been using this gather all day. I should probably call it something like gathered_categories. Category is not an amazing term. I just want to stop repeating the same logic over and over, so I'm doing a little bit of refactoring. Cool. So my last question might be: I want to group by brand and category and summarize how often each brand uses each type of ad, how often each brand falls in each kind of category. So for
example, Bud Light is almost always funny and almost never patriotic. Makes a lot of sense to me; that matches my understanding of that beer brand. I can facet this by... no, I don't want to facet this by category; I hate faceting when there are seven, never any fun. But pct, brand, geom_col, and let's do a little bit of reordering. I'm going to reorder within. So I ungroup, and with the tidytext package we can say brand is reorder_within(brand, pct, category). Whenever I do that, I need to do two things: add scale_y_reordered from tidytext, and set scales = "free_y". So this is a way we can start by asking how brands map to these categories. And I may want to do that renaming on the categories. You know what, I'm going to do this renaming way back here, because there's no reason it needs to be snake case through all of this. It looks a little bit better now when I visualize each of these. It's not amazing. Actually, now that I think of it, this one is better if I flip it around a little. Still not what I'm going to keep for my final version, but I just want to show some of these. This code won't work... but you know what, actually this one would work because it's not on the gathered version. OK, this one and this one work, and yep, they have the right names. That was a little bit of refactoring, realizing I could put this operation back up there.

All right, but the interesting part is here: what percentage of this brand's ads have this quality? That's the kind of question. So for instance, NFL never uses animals; Coca-Cola and Budweiser often do, like the toads and
the polar bears, is what I'm thinking of in terms of commercials. NFL and Pepsi often use celebrities. E-Trade never has danger, but Bud Light and Doritos use danger. Sure. Doritos almost always tries to be funny; NFL almost never does. You have ones in the middle, like Toyota and Kia. Budweiser and NFL are often patriotic, but Bud Light and Doritos are rarely patriotic. Doritos usually shows the product very quickly; Kia tends to wait. And Kia tends to use sex, Pepsi uses sex; Coke and NFL don't. So we're getting a sense of some of the traits of these brands.

Notice that I could have flipped these two. Instead of examining what's at the top and the bottom of each of these qualities, I could have done a brand profile, which, if I were in advertising, I certainly would be interested in. There I would have said something like reorder category within brand, put category on the y-axis, and facet by brand, kind of flipping these two around. I could do this with reordering; you could imagine it without a reorder. This is kind of like a brand fingerprint; I'm actually going to use that name. What is each brand's fingerprint in terms of the types of ads? So if somebody asked me what's the difference between Coca-Cola and Pepsi in terms of their Super Bowl ads, I would say: they both show the product quickly; Pepsi tries to be funny and uses celebrities, and that's less Coke's thing; Pepsi often uses sex, Coca-Cola almost never. Those would be some of the breakdowns.

Let's see. Oh, Rub suggested facet_grid. Another way we could have done this fingerprint graph, instead of bar plots (I'm not sure how I feel about this), is with a tile. The tile would look like this: I take this, grouped by brand and category, so I use the same by-brand-and-category summary. I'm just
always be refactoring: work on ways to shorten and reuse your code. I'm just thinking about the facet grid suggestion. I'm not sure I would use a grid for this; I think I would use a geom_tile. I'd say let's put brand on the x-axis and category on the y-axis... I actually think I like it the other way around, category then brand, with fill equals pct, and a geom_tile. And also a scale; I don't like that fill. I kind of like scale_fill_gradient2, low equals blue, high equals red, midpoint is 0.5. And I probably want to do a little bit more ordering here. This is what we call a heat map. I do generally want to order things so the correlated ones are together; funny and show product quickly look like they might correlate. That's called seriation, and I don't have a super fast approach to it.

I could make this a base R heat map. Do we want to? That's kind of a hassle. I know the base R heatmap does do seriation. Yeah, why not; it does so many things. I think I've done this before, but I actually don't know a way to do this better than reshape2, which has not been used in a long time: acast, brand by category, value.var is "pct". Yeah, old school, pre-tidyverse code. Now I can try heatmap on that. Check this one out. R has built-in heat maps, and there are a lot of packages out there for heat maps too, but this actually does give a sense of how I might break this down. I don't know enough about base R graphics to extend this at the bottom. But the story is that funny and show product quickly are correlated, and use sex and danger are at least a little bit correlated. And in terms of their fingerprints, Pepsi, Kia, and NFL look like they might be correlated. Coca-Cola's
is over here with Budweiser, as they're both drinks. But you can still see differences: Budweiser tends to be patriotic more often than Coca-Cola. NFL is the most patriotic of all, often uses celebrities, and shows the product quickly, whatever that definition is here. And then you can see things like the funny cluster: Bud Light, Doritos, and E-Trade all like to be funny. So this is one way of looking at these clusters in terms of these fingerprints. Back when I was in molecular biology, everyone loved this kind of heat map, and you can get some things from it that you can't get from ggplot2. These are your items, these are your features: how do they split up?

All right, so we looked at the relationship between brands and categories, and that tells us some things. We broke it down a couple of ways. This one is the most detailed, but it doesn't really give you the idea of what things cluster together; it doesn't communicate that. But it does show us things like NFL never uses animals, or Pepsi ads tend to be funny and use celebrities, and things like that. So we got some clustering and a heat map. I think that's enough with the qualities.

I think I want to look at what people liked and didn't like. I think I'm going to land on dislike percentage; I'm really interested in what the most disliked ones are. dislike_pct is, let's start with, dislike_count over view_count. And I'm going to start just by grabbing brand, year, title, description, and dislike_pct. I just want to get a couple of things we can look at for a minute without drawing any kind of conclusions, because actually I
haven't watched any ads or recognized any i kind of want to do that a little bit arrange descending view count these percentages are so low that it just feels silly to use a dislike pct even for the most unpopular ones here maybe i want to do the dislike ratio dislike count over like count and kind of keeping these numbers in place because i don't like dislike count over like count so this dislike ratio what's the idea here it's like your dislikes over likes or your likes over dislikes you know what i'm going to call it a like ratio because i'm in charge here and i'm setting the terms i'd do it as like count over dislike count so some of them are liked by three times as many people as dislike them others like the nfl super bowl commercials are liked by 23 times as many people and also there are ones that have no dislikes that's not very exciting i mean it's not exciting because these are probably the ones that just have not a lot of items in them i'm gonna drop the dislike percentage say like count dislike count if i sorted by like ratio it's like okay 74 likes zero dislikes get out of here i don't care you know what does it mean so what i'm gonna do is i'm gonna filter for like count plus dislike count must be greater than or equal to a thousand and now at least the ratios mean something and yeah i could do some empirical bayes shrinkage i'm not going to go into that in this case could just add some number to the top and bottom i'm gonna stick with this idea of okay only ones that got at least a thousand so then we can say which ones were liked by a lot more people than disliked them well here's a hyundai commercial that a thousand people liked and only 11 people disliked same with some other cars some pepsi things things like that and then i can scroll all the way down and say
okay doritos is much more polarizing these have a ratio of only like three likes to one dislike like this one is 275 000 likes and 92 000 dislikes so like doritos sling baby i hadn't heard of this one top out of 20 i'd heard that it sounds stupid honestly a lot of doritos ads kind of sound stupid to me and coca-cola cat star in any the i don't know so some of them just get a lot of dislikes none of these ones that had a lot of data a lot of data points none of them were disliked more than they were liked but there's a big difference between 94 and 2 so i'm going to try viewing this as view count like ratio geom_point scale_x_log10 scale_y_log10 is this the way to do it i don't know i'm going to call it like dislike total this is a really important thing you don't want to look at those ones that have really low numbers so at least the like dislike total is a thousand drop the selection i'll leave it and i'll say likes plus dislikes likes over dislikes so here's what i'm doing i'm asking i generally expect the most interesting ones and i know these are just points i'm gonna put in some text in a second the most interesting ones are gonna be like okay these are ones that got tons of likes and dislikes yeah we can see these ones got tons of likes and dislikes and still weren't very positively viewed the ones down here these are ones that a lot of people disliked so it's like only a three to one so one thing this is showing us is that polarizing things often were the ones that got the most likes and dislikes that means it's not just random noise this actually got a ton of feedback of likes and dislikes it like went viral but it was polarizing you know what i'm going to explicitly try putting the view count on the x-axis instead of
likes plus dislikes because those two are linked and this will be total views you know what this kind of suggests is it could be a story of polarizing is good for sharing like that's kind of a story you can say it's like these are the ones that are really polarizing these are the ones that are pretty unambiguous like everybody just likes them and yeah it's like a hundred times as many likes as dislikes i could do this as a percentage dislikes over likes plus dislikes maybe that ends up a little clearer yeah i'm going to try doing this i'm still deciding what the right metric is i'm wondering what is intuitive here i can say mutate dislike pct is dislike count over like count plus dislike count and use the dislike pct on the y-axis drop the log scale it's not a ratio anymore and yeah i kind of like this i like this approach i'm not looking at any specific ads yet i just want to tell this story about polarizing ads and here we can see okay these are the four most polarizing commercials in terms of their likes and dislikes and you notice they're among the most viewed like none of these top four got fewer than a million views and three of them are in the top four despite being very disliked so it's like polarizing is good for being shared that could be a lot of things it could be people hate watching it because they heard about how bad it was it could be that the kind of thing that some people find really funny other people find really tasteless it could be that the things that are shared really widely get to an audience that tends to dislike them more there's a lot of it's not necessarily a causal explanation of yes you want
your ads to be disliked because that'll make them popular that's not what i'm getting from this but i think the last thing i add is geom_text aes label equals title and check_overlap is true check_overlap true will make sure that doesn't keep all these in i'm also going to some amazing graph right here less than amazing but should i truncate any strings yeah i'm going to truncate the string too i say str_trunc this to 60 let's find out how that looks didn't affect a lot and maybe i want to keep most of it and keep the whole born the hard way title all right and try sticking to this and i also think what i want to do is labels equals comma i really have had this image i didn't know what the graph would look like but i knew that i wanted to tell the story about polarization and popularity and this is actually a much cooler story than i may have expected at first that text size is too small could also geom_text_repel that might not look good here so here we're saying like okay three of the four most polarizing ads of all time coco the co-commercial catch the budweiser born the hard way the doritos sling baby i'm not gonna absolutely oh should i watch these and i'm like no i don't want to watch a famously disliked ad what am i doing here and there's like a jordan helmet there's other exceptions like jordan martin scorsese i don't know i hadn't heard of this one but like up featuring during her mother's wastay looks like it's pretty unpopular and not that watched it's like 200 000 it's on the lower end of things being watched i think the nfl ones are out here plenty of people watching them not that controversial not that polarizing we could actually throw in a color equals brand somebody ramana pointed out is there a oh yeah i totally missed this but i can do brand is fct_recode
brand has a spelling error i can say hyundai is spelled hynudai here and let's go here oh but yes i was going to say color equals brand and i only want to do that in the geom_point i don't like colored text it makes it harder to read but this starts to give us a sense of like okay can we see a trend well i don't know that i can see a trend from here like just looking at this budweiser has some popular nobody dislikes the wasab commercial really i'm not sure and yeah if i want to actually look at this i probably want to make a table and if i want to look at the relationship between brand and polarization i probably want to do something more like this dislike pct by brand geom_boxplot and let's throw in a reorder aha wow coca-cola the most polarizing got all those pepsi fans going in there maybe so i say mutate brand is fct_reorder brand dislike pct all right so where's kia what happened to kia maybe kia didn't have any that had enough like dislike total i wonder if hyundai only has like one hyundai might have one that made the cut e-trade has two if i decrease this a tiny bit just to try this out ah well then kia just pops up with one i'm gonna leave it like this even so the scale x but i am going to do the filter like this i need to do it here okay yeah i'm gonna do it here so i still want to keep this kind of graph so what brands are most polarized in terms of the ads they make a problem with this graph is that notice hyundai has a lower median but has this like maybe one or two that are really high so maybe i want to reorder by mean which i'm not crazy about doing but we can do this so what brands tend to produce polarizing ads is the way we can ask this oh this should be
dislikes over likes plus dislikes little fixes yes so this is telling a story that i would not have gone in predicting i really would have predicted let's say doritos bud light pepsi with all their funny stuff are at the top i would not have predicted that coca-cola has the highest this should be x y is brand but i don't need that because coca-cola has never produced an ad where the dislike percentage was less than about four and the median one was like 16 all right and so this is only ads where they have at least 500 likes plus dislikes all right so yes these are the ones that are most disliked these are the ones that are least disliked like e-trade made two ads and none of them were all that disliked and budweiser made a bunch of ads and only a few kind of have these outliers all right so yes there's a lot more that we could do with this i'm interested in like maybe likes over view count or dislikes or something like that but this was kind of intuitive to me in terms of dislikes over the total all right so yeah let's review what we did today so the first thing is i just looked at okay we're only looking at 10 brands here i looked at did they change over time not hugely it looks like i was also interested in the distribution not just of the view counts but of all of the counts on a log scale comment count dislike count like count et cetera i said okay they're all best analyzed on a log scale but i also noted at the time i probably want to take these two and look at their ratios which i do later so then i was interested in breaking things down by these categorizations like funny and not funny and such oh but yes i did want to look at changes over time so okay there was one
year with a lot of really popular ones or really highly viewed ones but then i started looking at our categories this is not the visualization i loved i really want to look at something like okay the median patriotic ad gets more views than the median non-patriotic one but they're not terrible doesn't seem like any of them were statistically significant the correlations were fairly weak then i could look at a model and look at whether the frequencies of those not related to the view count but the frequency of those are changing over time the answer is yes more often using celebrities less often trying to be funny or using sex to sell the product then i looked a little deeper in terms of the relationship between brand and category and i used three graphs for that one where we focused on the categories we said what brands most exemplify this like coca-cola and budweiser being most likely to use animals then we examined the brands and said let's look at the fingerprint of each of those brands the same data but faceted differently funny more often or celebrity less often things like that and then i looked at a heat map which was kind of a compromise between those two you can look at it either way in exchange for losing a little bit of precision in terms of knowing what the actual numbers are you can say okay it's very common to be funny and show the product quickly and that's true across a lot of these you can see a little bit of clustering nfl is kind of the outlier in terms of the fingerprint from the others which makes sense to me yeah and that's how we would use a heat map for visualizing this and finally we looked at polarization i'm a big fan of looking at percentages and ratios in terms of likes and dislikes
and as long as we make sure we only keep the relatively popular ones we saw that some of the most popular ads were also the most polarizing which kind of makes sense i don't know that i would have gone in expecting exactly that but it's remarkable how we see this cluster and then kind of everything else and i was curious how that related to brand so we also looked at the relationship between brands and polarization things we didn't look at i didn't look at comments i bet they're pretty correlated with views and maybe likes but maybe they're also correlated with dislikes i'm not sure i did no text analysis of the title or the description i suspect mostly this would have popped up the same brand effects so i don't know that i would have gained anything from parsing these apart into words i didn't use when it was published and i didn't use the category because i wasn't sure what to map it to but it'll be interesting to see what that was linked to all right that was today's analysis a bunch of graphing some modeling some explanation even a heat map if you enjoyed today's screencast please be sure to like the video and subscribe to the channel i hope you had fun i certainly did i'll see you next week
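The polarization steps described in the recap could be sketched roughly as follows in R. This is a minimal sketch under stated assumptions, not the code typed in the screencast: the data frame youtube_toy, its brand labels and every number in it are invented for illustration, and the column names (brand, title, view_count, like_count, dislike_count) are assumed from how the columns are spoken of in the video. The 1,000-vote cutoff is the one mentioned when the filter is first added.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(stringr)

# Toy stand-in for the real data; all brands, titles and counts are made up.
youtube_toy <- tibble(
  brand         = c("Brand A", "Brand B", "Brand C", "Brand D", "Brand E", "Brand F"),
  title         = c("Ad one", "Ad two", "Ad three", "Ad four", "Ad five", "Ad six"),
  view_count    = c(5e6, 2e6, 3e5, 1.5e5, 4e6, 900),
  like_count    = c(30000, 9000, 2300, 1100, 20000, 74),
  dislike_count = c(10000, 4000, 100, 11, 5000, 0)
)

min_votes <- 1000  # assumed cutoff: drop ads with too few likes + dislikes

polarization <- youtube_toy %>%
  mutate(
    like_dislike_total = like_count + dislike_count,
    like_ratio  = like_count / dislike_count,           # Inf when there are no dislikes
    dislike_pct = dislike_count / like_dislike_total    # share of votes that are dislikes
  ) %>%
  filter(like_dislike_total >= min_votes) %>%
  # order brands by mean dislike share, as in the "reorder by mean" step
  mutate(brand = fct_reorder(brand, dislike_pct, .fun = mean))

# Views against dislike share, with truncated titles as non-overlapping labels
scatter <- ggplot(polarization, aes(view_count, dislike_pct)) +
  geom_point(aes(color = brand)) +
  geom_text(aes(label = str_trunc(title, 60)),
            check_overlap = TRUE, vjust = 1, hjust = 1) +
  scale_x_log10(labels = scales::comma) +
  labs(x = "Total views", y = "Dislikes / (likes + dislikes)")

# Distribution of dislike share per brand
boxes <- ggplot(polarization, aes(dislike_pct, brand)) +
  geom_boxplot()
```

Printing scatter or boxes draws the plots. Filtering on the vote total before ordering the brands keeps ads with a handful of votes (such as the invented 74-likes, zero-dislikes row) from dominating either end of the ranking.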
Info
Channel: David Robinson
Views: 1,966
Rating: 5 out of 5
Id: EHqFDXa-sH4
Length: 61min 22sec (3682 seconds)
Published: Tue Mar 02 2021