Model the Bechdel test for the Uncanny X-Men with bootstrap resampling

Captions
Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. In today's video we're going to use a dataset from the Claremont Run project, an academic project studying a run of the comic book series The Uncanny X-Men. The data counts up how many times different characters speak, think, and are portrayed in different ways, and we're going to use bootstrap resampling to train models that help us understand some things about these issues. We'll train two different models to show how the same approach can answer different questions. First we'll look at the locations of the issues and ask whether issues that take place in the X-Mansion tend to have different characteristics in how characters are portrayed. Then we'll look at the Bechdel test, a measure of how women are portrayed in media, and see whether how characters think, speak, act, and are portrayed differs depending on whether an issue passes the Bechdel test. So let's get started.

Okay, here we are. This week's data is about a long-running comic book series, The Uncanny X-Men as it was called then, and the Claremont Run project has collected and curated quite a bit of very interesting data about these comic book issues, so there's a lot to explore. One dataset, character_visualization, has many rows; not every issue is in it, but for the issues it covers (roughly issue 100 through issue 300 or so) it records how characters are portrayed and what we see them doing. If we call distinct() on the character column we find 25 characters: Wolverine, Cyclops, Storm, and the other most important characters. For each of them, in every issue, both in costume and out of costume, it records how many times they speak (speech bubbles), how many times they think (thought bubbles), how many narrative statements are made about them, and how many total depictions there are. Some issues turn out to center on characters other than these main 25; if we filter to depicted less than one we can see which issues barely show them at all. So that's one dataset: per issue, per character, how do we see them portrayed, are they thinking or speaking or doing something.

There were actually several datasets in this week's Tidy Tuesday, but we're going to work with three. The next one is the locations dataset, which lists, for every issue, all the locations that appear in it. The first issue here takes place in space, in a dream, in the X-Mansion, in the present, and in a few other places. If we count them up we can see the top locations across these issues: the X-Mansion appears the most, perhaps unsurprisingly, but there are lots of other locations as well.
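Here is a minimal sketch of the setup and first checks described above; the TidyTuesday URLs and the `character` and `location` column names are assumptions based on the 2020-06-30 dataset rather than code shown on screen.

```r
library(tidyverse)

## TidyTuesday 2020-06-30 (Claremont Run / Uncanny X-Men); URLs assumed
tt_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-30/"

character_visualization <- read_csv(paste0(tt_url, "character_visualization.csv"))
locations <- read_csv(paste0(tt_url, "locations.csv"))
xmen_bechdel <- read_csv(paste0(tt_url, "xmen_bechdel.csv"))

## Which 25 characters are tracked per issue?
character_visualization %>% distinct(character)

## Which locations show up most often across issues?
locations %>% count(location, sort = TRUE)
```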
Now let's take the locations data, group by issue, and summarize a new variable, mansion, that asks whether the X-Mansion is in the location column for that issue. The first few issues do take place in the X-Mansion, but issue 100 does not; if we filter the locations data to issue 100 we see they're in space, so they are definitely not in the X-Mansion. And what happens in issue 101? They are in a lot of airports, just traveling around, and they end up in a hospital, oh no. So this dplyr statement tells us which issues take place at least partly in the X-Mansion and which don't, and we'll use that for one of the models we're going to train.

The other dataset we'll come back to at the end is the one on the Bechdel test. Notice that each of these datasets covers a somewhat different set of issues, so we'll be doing some inner joining. For each issue it records whether it passes the Bechdel test, which is a heuristic for assessing gender representation in fiction: a work passes if there are two characters who are women and they talk to each other about something that is not a man. It was originally tongue-in-cheek, surely some kind of bare minimum for gender representation in fiction, but it's an interesting way of assessing what kinds of stories are being told, and it's a fairly common, if heuristic, way of talking about representation in fiction.

What we're going to do is train some models using bootstrap resampling, because this dataset is small. I could train a model on a per-issue basis, but there are not that many issues here, and I would like to use the most robust statistics I can so that I understand how sure I am when I get some kind of result; bootstrap resampling lets me do that. So let's come back to the character_visualization dataset and remind ourselves what it looks like. It has many rows because it's aggregated at the per-issue, per-character, per-costume level, and I'm going to aggregate it all up. What I want to know is, for every issue, how many speech bubbles, thought bubbles, narrative statements, and depictions there are. Remember, this is for the top 25 characters included in this dataset. So I'm going to use a group_by() and summarize().
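A sketch of the dplyr step just described, with the exact location string "X-Mansion" assumed:

```r
## One row per issue: is the X-Mansion among that issue's locations?
x_mansion <- locations %>%
  group_by(issue) %>%
  summarise(mansion = "X-Mansion" %in% location)

## Spot-check an issue that takes place in space rather than the mansion
locations %>% filter(issue == 100)
```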
To summarize those four portrayal columns (speech, thought, narrative, depicted) I'm going to use the across() function that's in dplyr 1.0: across these four columns, take the mean, and then I think I need to ungroup(). Let's save this as per_issue; I had something that was per issue, per character, and even costume versus non-costume, and now it's aggregated per issue. My first try didn't work, and my mistake was with how to use across() — I'm fairly new to across(), as you might be too. The function goes inside of across(): the first argument to across() is the columns and then you put in the function. So per_issue is now a data frame that has, for all the issues in this dataset (just shy of 200), how many speech bubbles, thought bubbles, narrative statements, and depictions there are of these 25 most important characters.

Let's call the location summary x_mansion — it's a logical, true or false for each issue, does it or does it not take place in the X-Mansion — and then join things up: per_issue inner-joined to x_mansion, saved as locations_joined. So locations_joined has the issue, those four columns, and mansion: does that issue have the X-Mansion as one of its locations or not.

Now let's make a visualization so we can see what's going on. Starting from locations_joined, I'll mutate mansion with if_else(): if it is the mansion, call it "X-Mansion", because this is probably going on the x-axis and I don't want TRUE/FALSE there, I want some words. The data is currently wide and I want it all on one plot, so I'll pivot_longer() those same four columns; let's call the new name column visualization, since the dataset was called character_visualization. Then I can pipe this to ggplot(): mansion on the x-axis, the value on the y-axis, fill by visualization, start with a boxplot, drop the legend, and facet_wrap() by visualization. That's a good start, but we don't need the y label or the x label, and the scales are unfairly shared across facets, so let's set the scales free. Now we've got boxplots and we can kind of see where the medians are: depiction is a bit higher, speech is hard to see. Because these are small numbers — this is actually small data — it's nice to see where the data really is, so I like to switch the boxplot out for a dot plot, which I don't use every day because it's really best for when you don't have a ton of data.
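Roughly how the per-issue aggregation, the join, and the first boxplot could look; the four column names and the join key (`issue`) are assumptions:

```r
## Aggregate speech/thought/narrative/depicted to one row per issue
per_issue <- character_visualization %>%
  group_by(issue) %>%
  summarise(across(c(speech, thought, narrative, depicted), mean)) %>%
  ungroup()

locations_joined <- per_issue %>%
  inner_join(x_mansion, by = "issue")

## Reshape to long form and compare issues with and without the X-Mansion
locations_joined %>%
  mutate(mansion = if_else(mansion, "X-Mansion", "No mansion")) %>%
  pivot_longer(c(speech, thought, narrative, depicted),
               names_to = "visualization") %>%
  ggplot(aes(mansion, value, fill = visualization)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~visualization, scales = "free_y") +
  labs(x = NULL, y = NULL)
```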
For the dot plot I think I have to set the bin axis to y and stack to the center — let's see if that does it — yes, nice. I like a nice dot plot; it's a nice way to be able to see the actual data. It's kind of off the edge, though, and I feel like there's one more thing: binpositions, what does that do? Setting binpositions to "all" bins them all together, and yes, that's what we want. There we go, very nice. Now we can see where the actual comic book issues fall: the ones that take place in the X-Mansion are over here, the ones that are not in the mansion are over there, and we can see a shift in the depictions. It's hard to judge the other measures because the numbers are low, and this isn't something I want to just guess at from a plot; I'd like to use some more robust statistical methods. So let's do it.

Let's load tidymodels — we're going to use the bootstrap, and loading it gives me my precious autocomplete. I'm going to use the bootstraps() function on that locations_joined data. We want a lot of resamples, so let's say a thousand; we could probably go up to two or five thousand, but this is enough for now. We also want to set apparent = TRUE: if we look at the help for bootstraps(), apparent asks whether to add an extra resample that is just the entire dataset, and we need that because we're going to do some robust estimation that requires the whole dataset in there at the end. Let's call the result boots, and let's set a seed, because this involves random resampling. As a reminder, a bootstrap resample is the same size as the original data: if locations_joined has 183 rows, each bootstrap resample will also have 183 rows, but drawn with replacement, so there will be duplicates — issue 103 might end up in there three times and issue 102 might not end up in a given resample at all. We make a thousand different bootstrap resamples of this dataset, and if we fit models to all of them, we get a more robust estimate of the effects we can see. That's what we're about to do.

We're going to take our bootstrap resamples and use mutate(): let's call the new column model, and we're going to use map() from the purrr package. What are we mapping over? If we look at boots — oh, I never printed it — it has these splits, and if I look at the first split there are 183 rows, the same number as the original. If I want to see what the analysis set looks like, I can pull it out with the analysis() function, and notice that the rows are all scrambled in order and there are duplicates.
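A sketch of the dot-plot variant and the bootstrap resamples, using the geom_dotplot() arguments and bootstraps() options mentioned above; the seed value is arbitrary:

```r
## Dot plot version: better for seeing every individual issue in small data
locations_joined %>%
  mutate(mansion = if_else(mansion, "X-Mansion", "No mansion")) %>%
  pivot_longer(c(speech, thought, narrative, depicted),
               names_to = "visualization") %>%
  ggplot(aes(mansion, value, fill = visualization)) +
  geom_dotplot(binaxis = "y", stackdir = "center",
               binpositions = "all", show.legend = FALSE) +
  facet_wrap(~visualization, scales = "free_y") +
  labs(x = NULL, y = NULL)

library(tidymodels)

## 1000 bootstrap resamples, plus the "apparent" resample (the full data),
## which the percentile-interval function needs later
set.seed(123)
boots <- bootstraps(locations_joined, times = 1000, apparent = TRUE)
```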
Some rows will also be missing from a given resample, and those end up in the assessment set. So in the map(), the first argument is the thing you're mapping over — the splits — and the second argument is the function you're going to map with. We're going to build just a little model; we could write this with parsnip from the tidymodels framework, but I'm just going to write it with plain old glm(). Mansion, the TRUE/FALSE, is going to be predicted by speech plus thought plus narrative plus depicted: which of these makes an issue more likely to take place in the X-Mansion, and which makes it less likely? I set family = binomial, and the data is analysis(.), where the dot is the thing being passed in, because I'm using the little tilde shorthand for an anonymous function. That will train all my models. Then let's do another map to get some coefficients out: call it coef_info, and map tidy() over the model column. Let's save all of this as boot_models. My parentheses were unhappy for a moment, but now they're happy, so let's see if it works. It's now training a thousand logistic regression models; fortunately any one individual model is not very big. The models are now in one column and the coefficient info is in another, and we can unnest() the coefficient info if we'd like to see it: for every term — speech, thought, narrative, depicted — we have an estimate and a standard error from every single model.

Now, rsample, the tidymodels package that supports all this resampling, has functions for bootstrap confidence intervals; I think the one I use is int_pctl(). For bootstrap percentile confidence intervals, you pass it a data frame containing the bootstrap resamples, which is what we have, and tell it where the statistics are — for us, that's coef_info. Here are the results, and we're interested in the lower and upper bounds: depicted is positive, narrative is around zero, speech is around zero, and thought is around zero, so it looks like depicted is really the only thing that is very important for us to look at here.
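A sketch of fitting a logistic regression to every resample and pulling out bootstrap percentile intervals, as described above:

```r
## Fit one logistic regression per bootstrap resample and tidy the coefficients
boot_models <- boots %>%
  mutate(
    model = map(splits,
                ~ glm(mansion ~ speech + thought + narrative + depicted,
                      family = binomial, data = analysis(.))),
    coef_info = map(model, tidy)
  )

## All coefficients from all the resampled models
boot_coefs <- boot_models %>%
  unnest(coef_info)

## Bootstrap percentile confidence intervals for each term
int_pctl(boot_models, coef_info)
```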
We can also plot the distributions of all those coefficients across the resamples: plot the estimate as histograms, make them a little transparent, say 25 bins, drop the legend because we're about to facet_wrap() by term, and set the scales free. I don't need the intercept, so let's filter to term not equal to the intercept, and let's add a vertical line at zero — gray, dashed, a little transparent, and kind of thick — so we can see where zero is. I had too many commas for a moment, but there we go; let's look at what we have. For these four kinds of things that can happen in comic books — the 25 most important characters can speak, think, be portrayed in the narrative sense, or be depicted — the strongest effect, both in effect size and in how sure we are that it's real, is depictions, and it's positive: the more depictions there are, the more likely an issue is to have a location in the X-Mansion. I think this makes sense; instead of people going off on little solo adventures or with a small group, they're all together in the X-Mansion, so we see lots of depictions of characters. We're less certain about the speech effect, but I'd still be willing to say I think there's something real here, and it's negative: the more speech bubbles there are in an issue, the less likely it is to take place at the X-Mansion. So those two effects are in opposite directions, whereas for narrative and thought — look at where zero sits relative to the peaks of those distributions — we do not see any impact from thought bubbles or from narrative statements. So we used bootstrap resampling to measure the impact of these four ways characters are portrayed on whether the X-Mansion was one of an issue's locations or not.

That's great; now let's do the whole same thing, but for the Bechdel test. Let's copy the join, joining the Bechdel data instead, and I'm going to turn the pass column from yes/no into TRUE/FALSE with the same kind of if_else() — "yes" becomes TRUE, otherwise FALSE — so that everything else stays much the same. Let's call the result bechdel_joined, and for the plot label the two groups "Passes Bechdel" and "No Bechdel". Zooming in on the same kind of dot plot, we can again look at where the data is; it looks like there's maybe more speech for the issues that pass, but again it is hard, with this small a number of issues, to see it concretely, and I would prefer to use the bootstrap to be more sure I'm not fooling myself with these particular plots. Still, it's very important to look at our data this way. Now let's copy the modeling code, and instead of predicting mansion we're going to predict pass_bechdel. It trains another thousand models, and here are those bootstrap confidence intervals again.
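Roughly how the coefficient plot and the Bechdel version of the workflow could look; the `pass_bechdel` column name, its "yes" coding, and the seed are assumptions:

```r
## Distribution of the bootstrapped coefficients, with zero marked
boot_coefs %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(estimate, fill = term)) +
  geom_vline(xintercept = 0, color = "gray50",
             linetype = 2, alpha = 0.7, size = 1.2) +
  geom_histogram(alpha = 0.8, bins = 25, show.legend = FALSE) +
  facet_wrap(~term, scales = "free") +
  labs(x = "Coefficient estimate", y = NULL)

## Same idea for the Bechdel test: join, resample, refit
bechdel_joined <- per_issue %>%
  inner_join(xmen_bechdel, by = "issue") %>%
  mutate(pass_bechdel = if_else(pass_bechdel == "yes", TRUE, FALSE))

set.seed(234)
boots_bechdel <- bootstraps(bechdel_joined, times = 1000, apparent = TRUE)

boot_models_bechdel <- boots_bechdel %>%
  mutate(
    model = map(splits,
                ~ glm(pass_bechdel ~ speech + thought + narrative + depicted,
                      family = binomial, data = analysis(.))),
    coef_info = map(model, tidy)
  )

int_pctl(boot_models_bechdel, coef_info)
```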
It looks like depicted is negative now, we've got a few things that are again very close to zero, and one that looks like it might be interesting. Let's copy the plotting code and make the same plot we made before — this should all be the same, except that the data is now about predicting the Bechdel test. Okay, this is interesting. The clearest effect here is speech, and it's positive: unlike with the X-Mansion, the more talking there is, the more likely an issue is to pass the Bechdel test. The more the characters are talking in speech bubbles altogether, the more likely two characters who are women are to talk to each other about something that's not a man. I like that result; I think it's good. What else do we have? Depicted is negative: the more depictions of characters there are in an issue, the less likely it is to pass the Bechdel test, so issues with just tons of characters in them are less likely to have two women talking to each other about something that isn't a man. That's kind of interesting. For narrative, I'm certainly not going to stake any claim that there's a strong difference. For thought — am I sure there's some kind of difference there or not? I don't know, maybe; if so, the more characters are portrayed as thinking in thought bubbles, the more likely an issue is to pass the Bechdel test. So we were able to use the same approach — bootstrap resampling and then training lots of models — to answer two very different kinds of questions.

All right, we did it. There are not a whole lot of issues in the Claremont run of the Uncanny X-Men, so we used bootstrap resampling to make robust estimations with the data we have: to understand the relationships between, for example, how often these characters are depicted and whether an issue takes place in the X-Mansion, or how many speech bubbles there are and whether an issue passes the Bechdel test. Bootstrap resampling and other resampling methods are powerful tools to have in our toolkit as people who work with data, and I really liked being able to walk through and show how to use them. I hope this was helpful, and I will see you next time.
Info
Channel: Julia Silge
Views: 2,030
Rating: 5 out of 5
Id: EIcEAu94sf8
Length: 36min 13sec (2173 seconds)
Published: Tue Jun 30 2020