Tidy Tuesday screencast: Analyzing incarceration data in R

Captions
Hi, I'm Dave Robinson, and welcome to another screencast where I'm gonna be looking at a dataset that I haven't seen before, exploring it in R, and seeing what conclusions I can draw. As usual, I'm gonna be working with a Tidy Tuesday project: each week Tidy Tuesday releases a dataset, and each time I take the newest one and see what I can learn from it. For this dataset, I just know the topic is incarceration trends. One of the most important things to note about this dataset is that it'll be important for me, for us, to use our best judgment and be respectful and careful when reporting on trends seen here. So remember that I haven't seen this data before, I'm working within one hour, and I could make mistakes: I could have bugs, I could run into issues, and I might not give it the full, detailed treatment that a topic like this deserves. Please keep that in mind when you're watching this screencast. Having said that, I hope there's a lot that we can learn from the dataset, and learn about statistical analysis, by analyzing an important dataset like this one. So if we take a look, this is looking at questions about justice, particularly county-level jail data and prison data over the last couple of decades, particularly looking at racial and social discrepancies, in honor of Martin Luther King Jr.
Day. So we're gonna take a look at this. Okay, it's five datasets. All right, let's see: county level... I probably want the full raw data, but it might be a lot, so I'm going to look at the processed versions. Let's start with the prison summary, the first one. If I click this, do I get to the raw version? All right. So I go into my Rmd. I'm gonna do library(tidyverse), and prison_summary is read_csv of the data, and I'm going to take a look at the data. We have time, and urbanicity, which is probably rural, small (is it small/mid?), suburban, and urban, it looks like, in the dataset we have. We also have population categories. These include race, in terms of White and Black; we have gender, in terms of male and female; White, Black, and Native American; it looks like we have totals, we have Other, and it looks like we also have Asian as well. Okay, so we have a couple of races, and this looks like incarceration rates. Let me take a look at the data description here: this is a prison summary by year, race, and gender, a rate within a category of prison population per 100,000 people. One thing I don't know is what this means. I'm guessing it means, in any given year, at any point in time, what fraction of people are in prison, as opposed to, say, a lifetime probability of going to prison. I'm gonna assume that's the meaning here. Okay, rate per 100,000. So, taking a look: we have county level and prison level. Is this per prison? No, no, we're not looking per prison; this is across all prisons. Let's see, where's the county level? I don't think we have data on individual prisons here. Oh, I guess that would be county level. Yes. So this is overall, the entire population, by type of county. Okay, I'm gonna start with this to get a general sense of trends, but I'm probably gonna move quickly to be
looking at the prison population, the more detailed data, rather than the summaries. Okay, so we could start by saying: we have relatively few urbanicities. It's not necessarily meaningful, but I'm going to take a look at rate per 100,000, I'm going to color it by urbanicity, and I'm going to facet it by population category. This isn't super kosher, because population category (I neglected to add geom_line) mixes together different types of categories, for example races, genders, and so on. And it looks like Other only existed up to a certain point, at which it was split up; it looks like it was split up into Latino, Native American, and Asian, and we don't see the Other category anymore. All right, and Total will be overall. Again, it doesn't make sense to facet at all these levels; I just want to get a sense of the data that we're taking a look at. We could start by looking at just the races. To do that, we'd filter just for pop_category in, and we could hard-code that we're looking at White, Black, Latino, and Asian. Did I miss Native American? Did I miss any? I could count pop_category to find out. Oh yeah, there was a Total. I'm skipping Other because the data was available only at the very beginning, so it's worth, I think, leaving out. Okay, we might want to look only since 1990, when we have all the data. We can make some different conclusions here. The first thing to notice is that clearly there's an enormous racial discrepancy in terms of the percentage of the population in prison: by far the highest in the Black subpopulation, followed by, it looks like, Native American, Latino, White, and Asian. There's also a large gap in urbanicity, mostly in terms of suburban versus the other types of populations. Let's see... yeah, I'd think about how these ought to be sorted out, because in particular they don't...
Besides making a better theme (I'm gonna theme_set(theme_light())), this is a way to communicate a lot about the overall population. I don't have much that I would add to this graph from this data. The one thing worth noting is the gap in terms of suburban versus other types of (what is it called, regions? county types): the difference between suburban and other types of regions. That gap grows tremendously within the Black subpopulation, and it really sees a big gap as of 2000. In other populations, it looks like it's mostly constant over time. So we could do some statistical testing of that. One thing: I've been trying to do a little more statistical testing, but we're not controlling for a lot of confounders here. For example, populations don't stay in place over time; the makeup of populations across types of counties is going to be changing during this time. So there's really a lot we can't say here. In fact, I don't think I'm gonna do a statistical test yet; there's just so little that can be controlled for. What I was thinking of as a statistical test is that we'd see an interaction term, where in general suburban populations have lower incarceration rates, but the relationship between that and the other urbanicities has changed over time, particularly within one race. So that's one quick look at the summary data. I could also take a look at the pretrial data. Actually, I'd rather dig a little deeper into the prison population data, and we'll look at it by county. So what I'm going to do is take a look at the next dataset. I'm guessing the prison summary is an aggregated version of the prison population dataset. So I'm going to download this one; I'm gonna do it in the same step here, as prison_population. Again, I could tidy this up a little bit, but there isn't much besides changing
the labels, there isn't much I would add to this graph. This is a lot more detail. This looks by year, but it also looks by state and by region; I'm guessing that's the region of the state, so really just four large regions. If we want to do a graph, that would be another way we could divide this data down. And then there's the overall population and the prison population, so not just the rate. That's useful; certainly if we're doing statistical tests, that'll be a central part of it. Okay, I'm gonna start by looking at today's data; I'm not gonna look at the change over time just yet. I'm going to take prison_population. We have a choice: we have this many variables, and we choose what to segment on. I'm going to look at the most recent year we have, which is 2016, and I'm going to aggregate it by state. I could summarize the total population, and (oops, do I need na.rm = TRUE? Probably) I can summarize the incarcerated population, prison_population. Hmm, that's not what I like to see. I wonder if this has missing data. I certainly wouldn't expect 0 in any state, let alone multiple. Okay, we have loads of missing data; that's definitely important. How can I see how much of my data is missing? I could group by year and summarize the sum of is.na(prison_population): this is how many rows are missing in each year. Actually, instead of sum I'm gonna do mean, to say what fraction we are missing. So we used to never have data on prison population. It's been improving in terms of getting more data, but it looks like there's still lots missing. It could be counties that don't have their own jail, but I'm guessing more likely it's just missing data that we don't have. One thing that surprises me...
I wonder if the codebook mentions that. One thing that I'm suspecting... here I'm looking at one county. Aha, in 1983 we started recording it. Okay. And then I'm wondering if you just filter out any case where prison_population is NA. Of course there's a challenge there, but I think that's probably the approach. What's important to notice then is that we're not looking at the total population of the state. First, I can't do it in 2016: we're missing prison data in 2016. But really, this is a surprising amount of missing data to me, that 45 percent. If I look at prison_population in 2015 and I view it, it's saying that not far from half of the rows (this is only 2015, one row per county) will be missing data. Okay, it looks like, for example, we're missing data from all of Arizona. And if I go to California, we're missing data from specific counties; it could be smaller counties. Sometimes, when we're missing data, it's worth trying to get a sense of why it's missing, and there's no question that some states are just fully missing data. Let's go ahead and remove those for a moment. So remember, the question is what kind of data we are missing; this is a data cleaning step that we're working on. I'm gonna take prison_population, I'm gonna group by state, and I'm gonna filter for the sum of not is.na(prison_population) being greater than zero; that is, you have to have at least one non-NA observation. It probably would have been easier to say any of not is.na on this value. So now, if I count state: it used to be 51, and now we only have data from 29 states. Reasonable; we can often work with that. Within that subset, what fraction of data are we missing? I can ungroup and ask for the mean of is.na of prison_population, and I could say we're missing 14 percent: smaller. The other question I would have is: are we
more likely to be missing the data for smaller counties? If that's the case, it's definitely worth remembering as a bias, and it's also worth understanding: is that because they don't have a prison (maybe people are sent to another county), or is it because, for the smaller counties, we just haven't collected that data? That'll make a difference in any conclusions. All right, so if I want to answer that question, I can probably bin the population. Yeah, that's probably the way to do it. What I'd do is take prison_population and cut the population... oh, I just remembered, I completely forgot: very critically, there's a pop_category here as well. So a really important question is: it could be one subcategory; it could be a tiny town with, let's say, no Native American people in prison, and that's why it's NA. So one of the critical questions is: do we have any case where prison_population equals zero? Yes, we do. So that means the NAs are probably not zero; the NAs are probably missing data. All right, how much population are we missing? I'll group by state. I'm really dwelling on the missing data, and that's because you can get a lot wrong if you don't realize how you're dropping out missing data. If I group by state, I could summarize the total pop missing prison: what I'm looking for is the sum of the population, not across the whole state, but only in the cases where prison_population is NA, and then the total pop. I could also divide that by the sum of population; it's kind of like a weighted mean. I'm also getting all this wrong, because if I don't have the population I can't really do anything with that, so I filter for not is.na(population). Okay, and I arrange by descending total pop missing prison. So what this is showing is in Nevada, I think... no, nope, North Dakota, we're
missing about ten percent: about ten percent of people live in a county where we don't have the prison data. In a couple of other states it's up to seven or nine percent, but as we move down, in most states only a few percent of the population lives in a county where we're missing this data. So that's one way to ask the question. Another way to ask it: I'm gonna save this as our non-missing states, and I'm going to ungroup back here. In the non-missing states, we can group by a category of the population. I like the cut function; it's built in. I can cut the population by zero, ten, a hundred, a thousand, ten thousand, one hundred thousand, up to infinity. I already filtered for not is.na(population), and then I summarize the fraction of counties missing: I'd say percent missing is the mean of is.na(prison_population). I also want the number of observations, n. All right, the zero to ten is probably a little aggressive, and we have very few above a hundred thousand, so I'm just widening these a little bit. And NA, I guess, could be population zero. Well, that's annoying: we have four counties with population zero. What this is showing is that we have NA prison population in anywhere from 20 to 35 percent of the counties that fall in this lower range, these low-population counties, but for the high-population counties we almost always have the data. So this looks almost paradoxical, because remember: most people live in a county where we have the prison population data, but many counties do not have prison population data. It's worth remembering that distinction. Let me say: smallest,
pretty small, kind of small, and then large. And in these large counties, more than ten thousand people, which make up a good chunk of our counties and a lot of the population, we're usually not missing the data. Okay, that's a little exploration of missing data. If I were doing proper statistical testing, and if I had more than an hour, I would spend more time digging into that and ensuring we understand the reasons why. In the meantime, I'm mostly gonna be working with the non-missing states. So we have 51 states including the District of Columbia; in the non-missing states, distinct state gives 29, so we're dropping 22 states where we never have prison population data. That's one thing worth knowing, but we keep all the data otherwise. I'm going to add notes here: we're looking at the year 2015, the most recent with prison data, and we're dropping 22 states. We're probably going to look over time; I just noticed that by looking at 2015 we got a better sense of what data is missing. And we're dropping, let's see... in my original data, how often am I missing the population column? 20% of our rows. I would put this in a percent here, with scales percent, so that I can actually say: 20% of our observations are ones where we don't have overall population data. Again, missing doesn't mean it's zero. Okay, so let's start digging into the data we do have. I do need to filter for not is.na(prison_population), which is even less data. And if I want to do it as a state aggregation, I want to aggregate both population and prison_population. I've removed the missing data, so I can actually do a summarize_at for population and prison_population and say sum. It's pretty
handy for aggregating multiple columns. And now I can add incarceration rate: incarceration rate is prison_population divided by population. If I'm missing something, or if it should be a lifetime risk, just note that I don't have as much experience with this kind of data. So now I'm looking at 29 states and their incarceration rates, at least in terms of the part of the counties we have data on, and I can arrange by descending incarceration rate. Okay, so the incarceration rate basically never goes above 1%. Its highest, I believe, is Arizona and MS, Mississippi. So if I wanted to get a sense of these trends, I could start just with incarceration on a state level. I have some things to set up for that. I have map_data("state"), which you always want as a tbl_df. These are my regions: this is actually data on every region, with its longitude and latitude, because I'm going to be creating some choropleths of state-level incarceration rates. So, to remind myself quickly: how am I going to turn the state codes into region names? Why is it lowercase? I really didn't think it was lowercase last time I looked. That's funny. All right, well, there is state.abb, which is every abbreviation. Is there a state.name? There is. I feel like that should be in a dataset somewhere; I'm trying to remember if there's a table sitting around that has the states. I think these are all built into R, but I really wanted them in a table. Ah, yes. All right, this has some identification: we've got an abbreviation, we've got a name. In the map, the polyname is crazy annoying; that's like a whole joining thing, and I don't quite know how to mess with that. But do I have anything else? Wait, all of these are built into R. Yeah, they're built in. Isn't that something. It really is.
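The state-level aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code typed in the screencast: the toy data frame is made up (hypothetical states "AA" and "BB"), the column names follow the ones mentioned in the captions, and the category label "Total" is an assumption about how the data spells it. The pop_category filter is included because the rows otherwise mix the overall total with subgroups.

```r
library(dplyr)

# Hypothetical toy data standing in for the county-level table;
# column names follow those mentioned in the screencast.
prison_population <- tibble(
  year = 2015,
  state = c("AA", "AA", "AA", "BB", "BB"),
  pop_category = "Total",  # assumed spelling of the overall category
  population = c(1000, 2000, 500, 4000, NA),
  prison_population = c(10, NA, 5, 20, 3)
)

by_state <- prison_population %>%
  # One year, overall category only, and only counties where both counts are known
  filter(year == 2015,
         pop_category == "Total",
         !is.na(population),
         !is.na(prison_population)) %>%
  group_by(state) %>%
  # Sum both columns in one step
  summarize_at(vars(population, prison_population), sum) %>%
  # Rate as a fraction of the population in counties with data
  mutate(incarceration_rate = prison_population / population) %>%
  arrange(desc(incarceration_rate))

by_state
```

Because counties with missing values are dropped before summing, the resulting rate only describes the part of each state for which data exists, which is the caveat raised in the captions.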
Okay, I'm gonna teach you a trick. I've got abbreviations, and I want to turn them into names. I have my abbreviations; I have my names. Has anyone seen the match function? Back before inner_join was all the rage, we in R used to use match. We would take our by-state summarized data, call it by_state, and mutate: name equals state.name, bracket, match, where the value of the state column is matched to the state.abb vector. This works because state.abb and state.name are the same length and they correspond one-to-one. So I just added the state name as a column. I could have done this other ways; I could have created a dataset (I feel like there's one out there). I'm also gonna lowercase this: I'm gonna do str_to_lower, from the stringr package in the tidyverse, and I've got our lowercase state names. Why did I go to all that effort? Because I need to inner join it with our map data at the state level. Can I do it by state name? I'm gonna actually call this region; it makes it so easy to do this join. Every one of these joins is gonna be a little bit different, but you've probably seen me do these choropleths before. Now what I have is our variable of interest, incarceration rate, alongside our map data, so now it's a snap to make a choropleth. We feed this into ggplot and say I want long on the x-axis, lat on the y-axis, group is group, and I want geom_polygon. This creates the states. Oh, I realize, let's left join... oh, I meant right join, because I want to keep all 48 of our continental states. That's our map. I can actually add theme_map. Oh, it's in ggthemes, I think. I do not have ggthemes installed; I'm gonna quickly install that, because theme_map is pretty great. I don't know who created ggthemes; it might be Bob Rudis, but I could be wrong in
it. And I'm creating a choropleth. Okay, I also usually want to use coord_map; this will enforce a particular projection. All right, so far it's just a map, but I add fill equals incarceration rate, and that's where we get to create this choropleth. Now what we can see is that the missing data is not completely random. It feels like we're missing data in the Northwest, and in some particular areas like the mid-Atlantic states here, and a couple of other states. These are ones where we never have prison data. It also looks like incarceration rate varies a lot, and we get a sense of how it varies regionally: notice the South has the highest incarceration rates and New England the lowest. Again, this isn't looking over time or anything like that, and it's not looking by county; it's aggregating at the state level. But it's a way we can start to get a sense of the distribution here. Another way: we could look at this at a county level. We actually have it originally, before we did our aggregation. I'm not gonna look at non-missing states; I'm gonna look at the original data, prison_population. This is by county. Let me see, where can I get a US county map? I haven't actually made this graph before, so I'm gonna ask: where can I find a map of the counties? Hold on, this might be right. What I need is a county-level dataset of the US. Here we go. Oh wow, I think this might be easier than I thought it was. What I do is map_data("county"), as a tbl_df. There are my counties: region and subregion. So I've got my region already named. Region comes from state. I'm gonna take prison_population and I'm gonna add region. Oh wait, does it already have region? Well, this is region, but it's not the right thing. Oh, but I do have state, so I'm gonna replace region here
because I want it to be easy to join. Now it's alabama, etc.; it's lowercase. And the subregion is going to be... let's take a look at Autauga County. Ah, I'm trying to remove the word County. So if I look at prison_population and I count county_name (well, I don't really need to count it), it looks like it always has the word County in it. I wonder if it ever has a word like Township, or ones like Census Area; they're probably relatively rare. I can actually check that really quickly by saying filter for not str_detect of county_name with the word County. If I create a column called subregion that is really the county, and I say str_remove... oh, I just checked this: how many counties are missing? We have parishes, we have boroughs, we have a lowercase city. I'm gonna go ahead and remove Parish, since Parish looks a little common, and I might as well remove City. What I'd say is: take your county name, turn it to lowercase with str_to_lower inside, then remove the word county, or the word parish, or city. I need to fix it up a little. I'm not sure how census areas are going to match up, and we're not going to match everything, but now I can do something a little closer to a left join on the county data. Let me also add incarceration rate, which is prison_population divided by population. I'm keeping the percentages rather than out of a hundred thousand; I tend to think a little easier in percentages. And we join it by our two columns in the county data. So this is how you do choropleths: you find the data, in this case the county data, and then you add your region and subregion. It's a huge join, isn't it? Oh, I just remembered... oh no. I didn't divide it down at
all: I need to look only at pop_category Total. I didn't do that earlier either, which messed up my state-level results. That was a mistake. What I need to do is, in by_state, filter pop_category; I'm starting with Total. Okay, it's very similar results, but a different approach. And here I'd look only at pop_category equals Total. Oh, I also need to look only at 2015. I could leave it for an animation, but I'm not doing that quite yet. Take this data, and now if I join it to the county data... okay, now I feel I've got it right. So, I've been thinking a little bit ahead; I've been working faster than I've been speaking. This is our prison data: I've added region, with a lowercase state name, a subregion, and incarceration rate, and now I'm saying I want to join that to our county map data. The reason is that now I can make a choropleth. I can use the exact same code I used earlier, with latitude, longitude, and fill. And here we go: we have a way more detailed map. One of the things I notice if I take a look is that we have an outlier here that I think is messing up the color scale in general; there's one county. Let me quickly call this county overall: it's only the total population, not any of the subgroups we looked at. And if I try taking this and arranging by descending incarceration rate, we have, it looks like, a Nevada county, where it shows the total as being 32 out of 303 people in prison. I have no idea if that's Thurston County; I have no idea if there's something about it. It could be real; it could be a mistake. I don't see anything saying it's a town that has a particularly important prison. But notice that's a 10% incarceration rate, and the next highest we have is three. One of the things we can do is filter for a minimum number of people. Or we can just
limit it, if it's one case. This is not amazing, but I could say I want to ensure the incarceration rate is less than 0.05, given that we know it's only removing one county. Notice we've made a more informative graph already, because there's more of a distribution. I probably want to change the scale: scale_fill_gradient, or rather I like scale_fill_gradient2, where I would say low equals blue, high equals red, and I need to give it a midpoint; let's say midpoint is 0.02. And I also really want to say this is a percentage: labels equals percent_format. I'm gonna put the midpoint a little lower, the point where blue shifts to red. Oops, I need to put this filter before the right join; I need to leave some data gray. Okay, so we notice our big blocks of missing data. That's one of the reasons it's great to make a map: you understand it early. We were wondering, why is the missing data here and not here? It's because it's really geographically structured. So I'm taking a look through here, and we see a lot of counties with high incarceration rates, including in Texas, and there are some very high ones here in the Southeast. So we're learning a little bit about this distribution at a county level. I can also do this as an animation, to see how this trend has been changing over time. This is county overall for 2015. I'm gonna be a little lazy and just copy it, and call it county overall across time. I'm still filtering just for Total. I'm not gonna filter the incarceration rate; I'm gonna have to take a quick look at this. So here's our county overall across all time. I'm also gonna say filter for not is.na of incarceration rate (oh, I need that in the step after), and it's sorted. Okay, what I'm seeing is that McPherson has been an unusual county. Wow, it used to be a way higher
percentage of people in jail. It could be a data entry bug; it could be that there's something a bit unusual about this county. Okay, I may just remove it. If I had not set an explicit midpoint, it might not be as bad. I'm trying to think through what the animation will look like; that's what I'm mentally working on. I'm concerned that this county is gonna kind of blow out the whole scale. I'm going to remove it; it's not that important to the map overall. I'll note: removed a county with an unusually high rate throughout history. And I'm gonna load up gganimate. I'm gonna remove this filter, and I'm going to move the right join here. Actually, I take it back: I like the right join as part of the plot, because otherwise the graph is not so useful. What I'm going to do is add transition_manual. I'm going to say I want a separate frame for every item here: transition_manual by year. I'm gonna look over time from 1983 (let me quickly check: filter, count year; yep), from 1983 to 2015. I'm going to be looking at the distribution of incarceration, again so far just looking geographically, not looking at race or gender. I am going to say transition_manual(year). I'm also going to make this slightly easier: I'm gonna say year mod 5 equals 0, so every fifth year, to try and make the graph a little faster to render. Heck, I'm gonna do it every decade, one frame for each of a few data points, so it's pretty fast. I just wanted to see the animation really quickly. Here's our change. All right, great, that was fast enough. I'm gonna look at every even year. It'll take a minute to render. What we'll be looking at here is an animation of the changing data. I'm trying to figure out why
where's the missing data? All right, so this is a distribution of changing incarceration rates over time. One thing is we're having a similar problem as before, of washing out, where we have our 10 percenters making everything else look strange. So I probably need... I'm remembering how scale_fill works: something like limits. Yeah, I think I can say limits equals c(0, 0.04), maybe; I don't quite remember. I'm going to do it without a transition, and I'm gonna filter for one year, so that I don't spend a bit of time rendering it. Yeah, so it looks like limits is right; that's good to know. So if I say 0.03... oops, I need to take out the year filter, and I'm gonna put the transition back in. I can create an animation of the incarceration rate. So this is incarceration rate per county over time. It'll take a minute to render; it's 30-something years. I used to think choropleths were something where you had to really work a lot with geodata, but I've been so excited with the tools ggplot2 allows for visualizing them. What the maps also remind you of is regional confounding. The missing data is switching places. One thing I discover here is that gganimate handles missing data differently than the default plot would. You see some regions turn... ah, look at that! Did you see Texas turn red like that? Watch Texas: watch the bottom of the map, at the center bottom of the map. There are years where it's missing data, but you can really see it. I may want to filter just for Texas and try that visualization. Okay, you know what, I am going to save a joined version. Here it is. I've got this graph, the overall graph, and I'm gonna look at just Texas. So note I am doing some copy-pasting of code, which is really not my favorite approach. But I can say region is texas (recall that texas is lowercase), and now I'll
get just that section of the map. There'll be less data, so it'll render a little bit faster. Here's a choropleth of just Texas. No promises... ah, there it is. So I really would need a title here to say what year it is; it's very frustrating not to know when it turns to the 1990s. It looks like there are some years with missing data, and then, well, what happened there? And that missing data, woof. This actually makes me think that over time, by state, there are gonna be some interesting trends. I'm looking by county here, but we were looking by state before. And if I said, let's see, it's not missing states? No, that's just 2015, am I right? Yes, it's just 2015. If I look at prison population and I filter just for... notice I keep going back to the original data and creating new clean versions. I filter for region... no, state is TX, Texas. No, I'm not going to do that, my bad. I'll filter for not is.na(population) and not is.na(prison_population). Again, I'm going to keep looking at totals; I'm certain I won't have time to look at subgroups. There's a lot that can be done with this data. So if I look at population, here's our total. I can group by both year and state and summarize_at the population and prison population variables, taking the sum of both. Similarly I can add incarceration rate. And now I have it by state and year. Remember, this is not the total population; it's the population of the counties for which we have data. That's certainly gonna be a complication. I realize, in fact, that with Texas it might be a nightmare. Let's take a look at what Texas looks like over time, because for that brief time it looked like it could be passing through 1%, but I think it's missing data. So let's take a look at this. If I say by_state_year, filter state is Texas, I should have one row for each year. I'm gonna throw an ungroup after that summarize; I don't want to leave it grouped. And if I ask what the population of the state is... oops,
um, year, population, total population. Here we go: we have enormous amounts of missing data. So we see a trend of increasing population, except for a region where we have missing data. That would mess up everything, especially because in those years the population... I'm gonna put in expand_limits(y = 0). In particular, remember that the counties that are missing are not a random subpopulation; it could be counties that have unusually high or low incarceration rates. Yeah, I'm thinking about how one would deal with it. I think I would start by having a summary in terms of the amount of missing data. All right, I'm gonna approach it a bit differently. I'm gonna group by year and state. I'm gonna say population is the sum of population, but removing NAs; prison population is the sum of prison population, but removing NAs; and then I'm going to add one called missing counties. I'll say it's just the fraction of counties that are missing. The fraction of population in those counties might be more useful... in fact, no, of course I wouldn't necessarily have that for the missing ones. So it's mean(is.na(prison_population)). So now if I look at Texas... I'm missing something. Oh, here's population. Oh, right, it starts at 1970. It shows 0 for missing prison. I would have expected this, mean(is.na(prison_population)), to be 1. I'm a little confused. When the prison population is NA, I would expect mean(is.na(prison_population)) to be the fraction of counties that are missing. Is it ever not zero? Is it sorted in some strange order? It never says missing prison. What if I did this without the group by, what would happen? I don't know what I'm missing in this mean. Oh, I do. The problem is that... wow, that's a very frustrating problem. I need to put it first, because it was
using another column as the prison population column: the one I had just defined. There you go. Oops, why did population... ah, here, 100% is missing for prison but our population is now steady. Oh, I'm not filtering for Texas. Ah, there it is. We used to miss 100% of the data, and suddenly we miss 85%, and then we miss 2%. So that's where missing data will really kill you. In fact, I'm probably gonna want to filter to remove any point where, looking here, more than 10 percent of counties are missing. It's very sloppy, but it'll get us in the right area: let's make that graph, but let's not include those years. Because if I wanted to graph incarceration rate, exactly the wrong kind of thing would happen: suddenly some counties are missing prison data and they're treated as zero. What I need to do is filter only for where not is.na(prison_population) in this summary. It takes a bit of work. Whoops, I have not saved once, and I'm in the wrong directory. That's OK, I'll get it later. Here it is. Ah, that's actually a little better. What I'm actually doing is saying: of the counties where we have a prison population, what fraction of the population is incarcerated? That's a better graph. It's still not perfect, because the counties that are missing are not missing at random. I probably still would say filter for missing prison of at most 10%, so that we have most of the data for each year. Oops, percent. So this would jump over those years where we didn't have as good data from Texas, but this, I think, would be a place to start looking at change in incarceration rate. Instead of saying state is Texas, I could compare a couple of states: I'd say Texas, New York, California, Massachusetts, and I could say color is state. Now I can start looking at trends over time, while also understanding there's missing data all
throughout this. We see some areas, like California, that have been increasing and then decreasing. We could take a look at Arizona, which topped our current level of incarceration. We could take a look at what else was near the top; I think that's Mississippi. Yeah, we start to get a sense of some of these trends. Unfortunately I'm out of time for today, but this is just a start in terms of how I'd start exploring the data. I could have sliced it a lot of ways and asked more questions. Unfortunately I wasn't able to look at race or gender within the prison population data. This is really interesting and relevant data, but it's also data we have to be careful with, not just from a sensitivity perspective but from a data quality perspective. We are missing some data, and that can really throw off our conclusions. So to some extent I am glad that I was able to spend time really digging into what kind of data was missing and how it was distributed, both geographically and across time, and that I was able to create some animated choropleths where we got a sense of the changing face of incarceration within the United States. One thing I've already seen here, which I definitely would have looked at if I had more time, is that it looks like you have a changing trend. I know New York, and New York City in particular, has famously gotten safer in the last few decades, and my understanding is that the incarceration rate has generally gone down too. But you see an opposite trend in states like Arizona. It wouldn't surprise me if there were social and racial justice components to that that are really important. What happened in Mississippi between the '90s and today is somewhat extraordinary. Yeah, so I'm glad I got to take a look at this dataset. I'm sorry I didn't get to dig into it a little deeper and a little more carefully, but I hope you were inspired to take a look yourself, or to honor Martin Luther King Jr.
Day in whatever way you see fit, with data science or otherwise. So thanks very much for joining me. I'm David Robinson, and I'll see you next week.
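The missing-data summary described in the screencast can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code typed on screen: the toy `counties` data frame and all of its values are made up for illustration, and the names `missing_prison`, `by_state_year`, and `reliable` are assumed from the spoken description. It also reproduces the ordering pitfall mentioned above: inside a single `summarize()` call, later expressions see columns that were redefined earlier in the same call, so the missing fraction has to be computed before `prison_population` is overwritten with its sum.

```r
library(dplyr)

# Toy county-level data (made-up values, for illustration only).
counties <- tibble(
  year = c(1980, 1980, 1980, 1981, 1981, 1981),
  state = "TX",
  county = rep(c("A", "B", "C"), 2),
  population = c(1000, 2000, 3000, 1100, 2100, 3100),
  prison_population = c(NA, NA, 30, 11, 21, 31)
)

# Pitfall: prison_population is replaced by its sum first, so is.na()
# then sees a single non-missing number and the fraction is always 0.
wrong <- counties %>%
  group_by(year, state) %>%
  summarize(
    prison_population = sum(prison_population, na.rm = TRUE),
    missing_prison = mean(is.na(prison_population)),
    .groups = "drop"
  )

# Compute the missing fraction first, and only count population in
# counties that actually report a prison population.
by_state_year <- counties %>%
  group_by(year, state) %>%
  summarize(
    missing_prison = mean(is.na(prison_population)),
    population = sum(population[!is.na(prison_population)]),
    prison_population = sum(prison_population, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(incarceration_rate = prison_population / population)

# Keep only state-years where at most 10% of counties are missing.
reliable <- by_state_year %>%
  filter(missing_prison <= 0.1)

print(wrong)
print(by_state_year)
print(reliable)
```

The 10% cutoff is the rough threshold suggested in the screencast; as noted there, it is a sloppy rule, and the counties that remain are still not a random sample of the state.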
Info
Channel: David Robinson
Views: 2,568
Rating: 4.9466667 out of 5
Id: 78kv808ZU6o
Length: 58min 27sec (3507 seconds)
Published: Fri Jan 25 2019