Impute missing data for historical voyages of captive Africans using tidymodels

Video Statistics and Information

Captions
Hi, my name is Julia Silge and I am a data scientist and software engineer at RStudio. In this week's screencast we are using this week's Tidy Tuesday data, which recognizes Juneteenth, which is also this week. All the datasets focus on the transatlantic slave trade: historical datasets for understanding the experiences of the people who were involved. I'm really excited to participate and to explore one of these datasets more deeply. As a white woman in the United States, my own education has been sorely lacking in this area. There is so much of this part of the history of my own country, and of the experiences involved in it, that I was not taught or have not learned about, so participating in this is something that I'm really glad to be able to do.
The dataset that we're going to explore is the African Names dataset, which is from the Slave Voyages database. It is a dataset of individuals who were freed from slavery during the last couple of decades of the transatlantic slave trade, when slavery was abolished or restricted by certain European nations. It has some missing data in it, so we are going to talk about how to do imputation for missing data and what it is that we're doing when we do imputation, and then use it to learn something about how the population of captive people who were then freed changed over those couple of decades. Okay, let's get started.
I am really pleased to be working with this dataset this week. All of the datasets this week from Tidy Tuesday are related to understanding more deeply the transatlantic slave trade and the experiences of enslaved people. The particular dataset that I'm going to be working with here is the African Names dataset from the Slave Voyages project, the Slave Voyages database.
I did a little bit of research on this dataset and where it comes from. Let's use skimr to look at it a little bit here. Where this comes from is that there were hundreds of years of people in Africa being taken captive, undergoing forced transport, and being taken from their home countries across the Atlantic to the Americas and other places. When we got into the 1800s, some European countries either prohibited or restricted slavery, and some courts were given the authority, when ships were suspected of being slaver ships, to liberate the people on the ships and take the ships away from the people who were using them. So this is a dataset of people who were liberated from these ships. It's a really unique and amazing dataset that I am happy to get to explore here, because it has all these names. This is a population of people for whom, because of the way history is recorded and written, we don't often have data, so it is really valuable to be able to explore it this way.
We can see here that the name, for example, is complete, because it's a dataset of names: who do we have names for? The year is also complete, so we know when these people were liberated in these courts. We can look at, for example, the port where they disembarked. This is where those courts were; there were courts or tribunals in these port towns. You see that most of these people were liberated in Freetown, Sierra Leone, which is in Africa, so this is on the eastern side of the Atlantic, before the journey would have been made.
These suspected slaver ships were seized and the people on them were liberated. The second biggest one is Havana, Cuba, so that's on the western side, after the Atlantic journey would have been made. The other ones are much, much smaller. We can also see where these ships embarked; these would have been the ports where the slaver ships began their journey, where the captive people would have begun these forced journeys.
What we're going to be doing is building a model to understand a bit about the characteristics of the people who underwent this forced transport. The time period that we have in this dataset is from about 1810 to 1860 or so. Let's see what the distribution is. Let's make a histogram with about 20 bins. Okay, this is incredible. These are the people who were liberated or freed, and there are tens of thousands of people in these peaks. The reason this distribution of years looks like this is that these are the years when the courts or tribunals were running and freeing people. We have just a few people up here after 1860, but mostly this happens before about 1850.
What I would like to do during the course of the screencast is to look at where we have missing data. In any kind of modeling, missing data is a problem, and when we start to talk about historical datasets it's also a big thing to have to figure out how to deal with. Then, after we impute missing data using the tidymodels framework, I want to use a simple linear model to understand what changed.
Were there changes during this time period, from say 1810 to 1850 or so, in some of the characteristics of the people who were captive and then freed? Can we see any changes in who was there? Let's do a little bit of exploratory data analysis before we get going on that. We've already looked at where these people started their forced journeys and where they were liberated.
I am interested in the age. Let's filter to just those years that are before 1850, then group by year and find the mean age. Remember from when we looked at skimr that there was missing data in age, so that's why we need to put na.rm there. Let's put a line here that is a little bit thicker, like this. Whoops, classic. On a line plot like this I often find it helpful to make sure zero is on the axis. So we've got the line of the data here; let's put a smoother on it, and then let's zoom in here. It looks like there is a bit of a shift over time between 1810 and 1850 to older ages. First of all, this is shockingly young. I'm not sure what I was expecting, but this is the mean age, and it is below 20. The mean age of the people who were transported was below 20, so that's pretty shocking. But we're seeing here maybe a slight drift to older ages.
If we look at that skimr output again, just to have it here, we've got age and height; height actually has not very much missing data. Gender is something that has a little bit more missing data in it, so let's look at how that is encoded.
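The grouped summary described above (keep years before 1850, group by year, take the mean age while dropping missing values) is done in R in the screencast, and the code itself is not in the captions. As a rough sketch of the same computation in plain Python, assuming hypothetical records with `year_arrival` and `age` fields and `None` standing in for NA:

```python
from collections import defaultdict
from statistics import mean


def mean_age_by_year(records, before=1850):
    """Mean age per arrival year, skipping missing ages (the role na.rm plays in R)."""
    ages = defaultdict(list)
    for rec in records:
        # Keep only years before the cutoff and rows where age was recorded.
        if rec["year_arrival"] < before and rec["age"] is not None:
            ages[rec["year_arrival"]].append(rec["age"])
    return {year: mean(vals) for year, vals in sorted(ages.items())}


# Invented toy rows for illustration only, NOT records from the dataset.
toy = [
    {"year_arrival": 1820, "age": 12},
    {"year_arrival": 1820, "age": None},  # missing age is ignored
    {"year_arrival": 1820, "age": 20},
    {"year_arrival": 1835, "age": 18},
    {"year_arrival": 1862, "age": 30},    # dropped by the year filter
]
by_year = mean_age_by_year(toy)
print(by_year)
```

Without the `is not None` check, a single missing age would make the mean for that year undefined, which is why the missing values have to be dropped explicitly.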
Let's make a box plot. Let's put gender on the x-axis and year of arrival on the y-axis, and let's color that box plot; we don't need the legend because the color is just for visual clarity. Okay, I'm definitely seeing some shift here. Let's notice a couple of things. Gender is coded as boy, girl, man, woman, and then here we see the NA values. Also, it looks like for those examples from after 1860 there's a lot of missing data. They're in the dataset, but we don't know much else about them, so we're probably just going to exclude those altogether. But we have this shift with time. If the ages are right, do we see again that shift to older, or is that the opposite direction? I don't know; we'll fit the model and find out.
Let's also check how the ages are distributed. Okay, so there are children who are coded as man and woman, and there are people who are clearly adults who are coded as boy and girl. So we probably want to recode this gender variable so that we have boy and man together, and girl and woman together. This is a bit of a challenge. The historical records are viewing gender in a very binary way, and I don't think these NAs are here because of a broader-than-binary definition of gender in the way these historical records were kept. But how am I, as the analyst, going to approach this? My personal belief is that gender is not some simple binary thing, but what am I going to do with this historical record? I'm going to tell you right now that I am probably going to recode this into man and woman categories so that I can understand what the changes are.
That is something that I have to look at and face: I have to understand that it's a choice I'm making when I'm doing my data analysis, that it is something to note, and perhaps to think about whether there are other choices that I could be making.
I think one of the most important, beautiful, and valuable parts of this dataset is the actual names, so let's make a visualization with the names. I'm going to group by name and then summarize. First, let's count how many people have that name, then let's find the mean age and the mean year of arrival, and let's make a scatterplot. Let's filter first to names that more than 100 people have. Year of arrival goes on the x-axis, age goes on the y-axis, and let's make some points, with the size of the point equal to n. Let's make those a little transparent. Let's make that threshold go down to 50; let's make it go down to 30. And let's just emphasize that the size is the number of people.
So let's think about the size of these dots, these bubbles. Notice that the bigger dots are near the center, because we're taking a mean of more people, and when we take the mean of more things they go to the middle. But that actually seems more true for age than it does for year, which is interesting to notice: there's more diversity along the year than there is along the age.
Let's put the names on there so we can see them. I think I'll put them on top of the points with geom_text_repel. I didn't load ggrepel, which is super useful for labeling points. We want to say aes(label = name), like this. To get the fonts to match, I have to put my same font here.
At the beginning I set my theme, so I have to use the same font down here if I want them to match. All right, let's see if that works. Let's make this big enough that I can see something. Oh, that didn't work at all, did it? Let's bump the size down here a bit. Take a look at that; yeah, that looks nice. This is just a simple scatter plot, but I really like this view, where we can see, for people with these names, when on average they were liberated or freed and what their average age was. We can see what the most common names are. This is pretty incredible. We have "boy" here and "unknown"; I'm guessing these were probably very young children. I also see some examples of names on quite opposite sides of the plot that have different spellings, just with an H; that's interesting to notice. This is an incredible amount of information to absorb in terms of what it represents in people, and I like representing it in this way.
Okay, so that is a bit of exploration of this valuable and interesting data. Now let's talk about what we're going to do to try to impute some of the missing data. We've got this African Names dataset. Let's start by filtering out those few records that don't have much data in them, and then let's recode gender like we talked about, with case_when: when the gender is equal to boy, let's call that man; when the gender is equal to girl, let's call that woman; and all the rest of the time it keeps the value of gender, which includes all the NA values that we have. Then let's do a mutate_if, is.character to factor, because that is helpful for some of the modeling functions that we're going to use.
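The recoding rule just described (boy becomes man, girl becomes woman, everything else, including missing values, is left alone) is written with case_when in R in the screencast. A minimal Python sketch of the same rule follows; the exact category spellings ("Boy", "Girl", "Man", "Woman") are an assumption, since the captions do not show how the values are capitalized.

```python
# Assumed spellings of the category labels; adjust to match the real data.
RECODE = {"Boy": "Man", "Girl": "Woman"}


def recode_gender(value):
    """Collapse the four recorded categories into two, leaving other values untouched."""
    # dict.get falls back to the original value, so "Man", "Woman" and
    # missing values (None) pass through unchanged.
    return RECODE.get(value, value)


recoded = [recode_gender(g) for g in ["Boy", "Girl", "Man", "Woman", None]]
print(recoded)
```

Keeping the missing values as missing at this stage matters, because they are exactly what the imputation step is meant to fill in afterwards.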
Let's call the result liberated_df. This is going to be the dataset that we work with for the modeling that we're going to do. We're actually just going to use recipes today, not all of tidymodels; recipes has steps for imputation of missing data. We're going to start by specifying our recipe. I guess we can just write it without an outcome: I want gender plus age plus height. The idea here is that I'm going to impute these things using each other, and the data here is liberated_df. Let's call this the impute recipe. Height is something that was recorded about these people in a somewhat, in my opinion, dehumanizing way, but I can actually use it, together with age, to impute some of the missing data about gender and age. Let's load recipes. There are several different kinds of imputation you can do; we're going to do mean imputation on the height.
You know what, let's just do one more plot first so we can see this; it'll help us. Let's use the naniar package on the things that I'm interested in. Let's take the African Names dataset, select gender, height, and age, and call gg_miss_upset. Okay, so this is an upset plot, which is about sets. We have bars over here: gender has the most missing values, height is next, and then age. And then this tells us how many people there are that have each combination of things missing. This many people only have their gender missing.
This many people only have their height missing, this many people have their gender and their height missing, this many people have their height and their age missing, and so forth. The question that I'd like to answer is: what are the changes in, say, age and gender over this time period in who the people were who were liberated from these slaver ships? But we have these missing data, and I would like not to throw away this data. I would like to keep it, because this is valuable, hard-won data that tells us something important, and that seems even more important in this case.
So what I'm going to do here is impute the height using the mean. Out of the 90,000 or so people, we don't have that many that are missing height, so I'm just going to impute the mean. Then I'm going to use a nearest neighbors model to impute gender and age from the data that is there. So that is step_meanimpute for height and then step_knnimpute, and we're going to say all_predictors(). I've now set that up like this: mean imputation for height and then k-nearest neighbor imputation for all predictors.
Then what I can do is prep it. What prepping the recipe does is estimate the quantities that I need to actually compute those things. It will estimate the mean of the height so that we can put that number in, and then it will train a nearest neighbors model so that we can say what the values for gender and age are for everything. The first one is really fast; the second one is not so fast and will take a little bit longer. So that estimates those things, and then I can juice to get that data back out. If I wanted to apply those same values to new data, that would be a different function.
For new data I wouldn't use juice; I would use bake. Juice is just a shortcut for bake when you already have the data, like the training data. Notice I didn't split training and testing data here, and that's because I'm not really training a machine learning algorithm where I want to have high predictive accuracy. Instead I'm working more in the inference regime. I am using a machine learning algorithm, k-nearest neighbors, to do my imputation, but in the end my goal is not predictive accuracy but rather inferential in nature.
So now here's my imputed data. Oh darn it, I did want to put year of arrival here. All right, let's run this one more time, because to get the data back out I need the year of arrival there; that's what I'm going to train on when I do that very simple linear model that I told you I was going to do.
So let's test out what happened here. For example, I'm interested in gender, and we're going to see what we had before and what we have after imputation. Notice that before we had all these NAs, and we don't have the NAs afterwards, because we used this imputation. The distribution looks about the same, which is what we would hope if we'd done a pretty reasonable job of imputation; we wouldn't expect that to change a lot. Let's look at how the distribution of age changed. We had about a thousand NAs before and we have none afterwards; we imputed all the values. And notice that very little of this changed: the distribution stayed about the same, and our median and mean and everything stayed very, very close to the same.
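To make the two imputation steps concrete, here is a simplified sketch in plain Python of what "estimate the mean, then fill from nearest neighbors" can look like. It is not the recipes implementation: as an assumption made for brevity, neighbors are found by distance in height alone, and all function names and rows below are hypothetical. Estimating the mean is kept separate from applying it, loosely mirroring the prep and bake split described above.

```python
from collections import Counter
from statistics import mean


def estimate_mean(rows, column):
    """Estimate step: mean of the observed (non-None) values in a column."""
    return mean(r[column] for r in rows if r[column] is not None)


def fill_with(rows, column, value):
    """Apply step: replace missing values in `column` with a stored value."""
    return [{**r, column: value if r[column] is None else r[column]} for r in rows]


def knn_fill(rows, target, feature, k=3):
    """Fill missing `target` from the k rows closest in `feature`.

    Numeric targets get the neighbors' mean; text targets get the most
    common neighbor value. `feature` must have no missing values.
    """
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: abs(d[feature] - r[feature]))[:k]
            values = [d[target] for d in nearest]
            if isinstance(values[0], str):
                guess = Counter(values).most_common(1)[0][0]
            else:
                guess = mean(values)
            r = {**r, target: guess}
        filled.append(r)
    return filled


# Invented toy rows for illustration only, NOT records from the dataset.
rows = [
    {"gender": "Man", "age": 25, "height": 66.0},
    {"gender": "Man", "age": 30, "height": 67.0},
    {"gender": "Woman", "age": 22, "height": 60.0},
    {"gender": "Woman", "age": None, "height": 61.0},
    {"gender": None, "age": 8, "height": 48.0},
    {"gender": "Man", "age": 9, "height": 49.0},
    {"gender": "Man", "age": 28, "height": None},
    {"gender": "Woman", "age": 7, "height": 47.0},
]

height_mean = estimate_mean(rows, "height")      # estimated once, reusable on new data
imputed = fill_with(rows, "height", height_mean)
imputed = knn_fill(imputed, "age", "height")
imputed = knn_fill(imputed, "gender", "height")
```

Because `height_mean` is stored separately, the same value could be applied to rows that were not used to estimate it, which is the distinction between getting back the processed training data and processing new data.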
That's good. So we did it; we imputed values. I'm going to take a moment and say what we did when we took our missing data and imputed values for it. We didn't add any new information; we don't have any more information than we had before. In fact, if you ask me how sure I am about any of those individual imputed values, I'm not going to tell you I'm super sure about any of them. Instead, what missing data imputation does is work statistically, as a whole. For the observations where some of the variables are there and some are not, we can keep the information that is there, and we use the information overall that's in the dataset to get the values for those missing data as right as possible. Then we are able to keep the information that we do have and not throw it away. I'm especially glad to be able to show how to do that with data like this, where the value of it is so high and seems so clear.
Okay, so the model that I'm going to fit here is so, so simple. I'm just going to fit year of arrival with gender plus age, and I'm going to use that imputed data. Let's call this fit_lm. This is super basic. We can do a summary, or we can tidy it, like this. So let's think about what it is that we're seeing here. Like we saw in those exploratory plots, we now see the evidence in the output of our modeling. We see evidence for some shift with gender as time went on during this period, towards the end of the transatlantic slave trade: in the earlier years there were proportionately more women.
In the later years there were proportionately fewer women, so we saw that shift in the population with gender, women and girls. It is good to keep in mind what the mean age was. Speaking of age, we also see a shift in age, small in effect size: as time passed, the age was changing as well. Oh, that's interesting. With every increase in year, the age is shifting down. So I guess, when we made that plot, yeah, we've got a little bit of a Simpson's paradox thing going on here. We made this plot where the overall age is going up, but the trend is not the same if you divide out by gender. So this is a Simpson's paradox kind of situation. The girls' mean year of arrival is later than the women's, and the boys' than the men's, but the men and boys arrive later than the girls and women, so we have that switch in sign. This is something we were able to see because we used the modeling, which really demonstrates the value there. Let's look at that tidy output one more time. As time went on, the age is going down: so younger people, and fewer women.
Okay, I was really glad to get to spend some time with this dataset and to participate in this way. I'm really open to feedback and to learning how to engage on these topics better. What we did today was to take this dataset and explore it, and we imputed missing data in the dataset.
Then we used it to understand how the population of these captive people who were then freed changed over time. We ran into a caught-in-the-wild example of Simpson's paradox, which was pretty cool to be able to see, and to see how that can inform our understanding of what we know about who it was that was impacted and what the experiences were that individuals were having during this part of history, which has informed so much of what has happened in my country, for sure. So thank you so much for watching. I hope this was helpful, and I will see you next time.
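The Simpson's paradox situation mentioned in the captions, where the pooled trend has one sign and the within-group trend has the other, can be reproduced with a few made-up numbers. The sketch below is only an illustration under stated assumptions: the (age, year of arrival) pairs are invented, the two groups are hypothetical, and a plain least-squares slope stands in for the linear model fitted in the screencast.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


# Invented (age, year_arrival) pairs, NOT values from the African Names data.
# Within each group, older people arrive in earlier years...
group_a = [(10, 1822), (12, 1821), (14, 1820)]
group_b = [(20, 1842), (22, 1841), (24, 1840)]

slope_a = slope(group_a)                  # negative within group A
slope_b = slope(group_b)                  # negative within group B
# ...but group B is both older and later, so the pooled slope is positive.
slope_pooled = slope(group_a + group_b)

print(slope_a, slope_b, slope_pooled)
```

Adding the grouping variable to a model is what lets the within-group sign show up, which is why the effect only became visible once gender was included alongside age.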
Info
Channel: Julia Silge
Views: 2,332
Rating: 4.964602 out of 5
Id: z4oQh_5YMVk
Length: 37min 30sec (2250 seconds)
Published: Wed Jun 17 2020