Tidy Tuesday live screencast: Analyzing wealth and income in R

Captions
Hi, I'm Dave Robinson, and welcome to another screencast where I'll be using R and RStudio to analyze data I've never seen before. As usual, this data comes from the Tidy Tuesday project, an amazing weekly project in R run by the R for Data Science online learning community. And as usual I'll be streaming this live, so if you're watching, feel free to comment and share questions or suggestions for the next graph that I make. One of my favorite parts of doing these screencasts is having people give their suggestions and engaging with them.
So let's see what we have this week. I know it's about wealth and income inequality over time; that's all I know so far. All right, so there are nine charts of racial wealth inequality, and articles, and a ton of different data sets, all within the scope of wealth, income, or debt over time and by race. There's cleaned as well as some raw data. Just reading through: some are more appropriate for summary plots and comparisons. Looks like a lot of data here, that's cool.
Okay, then I'm going to use the tidytuesdayR use_tidytemplate function to create my Rmd. I also like to do library(scales) and theme_set(theme_light()), some things that I kick off my analyses with, and load up my Tidy Tuesday data. I usually don't use the rest of the code here. And now I've got the readme present and I can work with any of the data sets: lifetime earnings, student debt, retirement, homeowner, race wealth, income time, income limits, income... wow. I'm just getting a zoomed-out look. Do all of these have a year? I think all of them have race... no, not income time, it looks like; that's percentiles. Income limits does, income aggregate does, income distribution does, and income mean. Each fifth and top five percent: each fifth, each quintile. Okay, well, this
really, I think there's a lot we can do here across a lot of data sets, and we can go in a lot of orders. These are summary plots, like comparison data sets with a racial breakdown. So in terms of summary plots versus comparisons, let's see what graphs we make out of a couple of these. Let's take them one by one. I'm thinking about ways that I can combine across them; an example might be through a Shiny app that lets you choose what you're examining, and we might do some of that, but let's take a look first.
Lifetime earn: this does not have year, but student debt does. So here we can make a graph, but this is only one graph we can make here that I can think of, basically. Let's make one graph of each data set; that's actually a pretty solid place to start. Here you can see basically one graph that I can make. I can say lifetime earnings, and I probably want to do race on the y-axis, fill = gender, geom_col, and, sorry, what's really important on a graph like this: I would need position = "dodge". So it's a summary plot that you can make, and I can clean it up a little bit: labels = dollar. There are three races and two genders, generally men having higher lifetime earnings than women, and the highest levels were white, then Hispanic, then Black Americans. Right, so that's one breakdown here.
Let's continue making one graph for each of these. Let's create one for student debt. It's got year, loan debt, and loan debt as a percentage. The percentage one is what I'm pretty interested in, because I can do year, loan_debt_pct, fill = race. If this is what I think it is, then it's interesting because I can set this up... no, see, I expected... oh, percentage, I
misunderstood. I thought it was a percentage of a total, but it looks like it might be... where was that? Looking at student debt, here it is: share of families with student loan debt. Well, in that case I don't want a stacked plot, I want a line plot, and I probably want to do a little more adjustment on it: start it at zero, and y is percentage of families with student loan debt. Last week we looked at college graduation rates by race and over time, and one thing we found that's interesting is that rates among Black students have gone up, but so have rates of student loan debt. Labels is percent.
Similarly, I might want to do the loan debt in dollars. What is that one, is that average loan debt? Let's check. I bet it is; I don't know if it's average or median or what. That's the way we can do these two graphs. They share a fairly similar story: the percentage of families with student loan debt and the average loan debt are rising, generally with Black families having the highest level of loan debt, then white, then Hispanic. This is a case where I'd want to reorder a column, but it's still a fairly simple aggregation. These are mostly the only two graphs that I think I would make out of those. Ah, there it is, the nine charts. I bet we are making fairly similar ones to Nine Charts about Wealth Inequality in America.
Let's do one more adjustment here. I see, the ordering tripped me up the first time: Black, white, Hispanic. A way that you can fix that is to say race is fct_reorder of race by, actually, negative loan debt; otherwise it'll be in the wrong direction. Black, white, Hispanic, yeah. Wrong direction in the sense that I want it to line up with our three lines so we can easily read it. All right, so that's three graphs we've made so far. Let's look at the next one. Next is retirement.
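The line plot and the legend reordering described above can be sketched roughly as follows. This is a minimal illustration, not the code typed in the screencast: the `student_debt` table here is a tiny stand-in with made-up numbers, and the column names `year`, `race`, and `loan_debt_pct` are assumed from the narration.

```r
library(dplyr)
library(ggplot2)
library(forcats)
library(scales)

# Toy stand-in for the student debt data; the values are invented for illustration.
student_debt <- tibble(
  year = rep(c(1989, 2001, 2016), times = 3),
  race = rep(c("White", "Black", "Hispanic"), each = 3),
  loan_debt_pct = c(0.10, 0.15, 0.34,
                    0.18, 0.20, 0.42,
                    0.06, 0.11, 0.22)
)

# Reorder race by the negative of the plotted value, so the legend
# lists the highest line first (fct_reorder summarises with the median by default).
debt_ordered <- student_debt %>%
  mutate(race = fct_reorder(race, -loan_debt_pct))

p <- debt_ordered %>%
  ggplot(aes(year, loan_debt_pct, color = race)) +
  geom_line() +
  expand_limits(y = 0) +                 # start the y-axis at zero
  scale_y_continuous(labels = percent) + # show 0.34 as 34%
  labs(x = "Year",
       y = "% of families with student loan debt",
       color = "")
```

Printing `p` draws the chart. Without the minus sign the legend would run from the lowest line to the highest, which is the "wrong direction" mentioned above.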
All right, what is that, an amount of money? Average, yeah. You know, looking at each of these, I probably want to give them more useful labels. I'll say, here we go: average family student loan debt for the age group, 2016. It's always useful to have this kind of stuff in the graph itself. And then retirement: I want to know average family liquid savings. We have retirement: year, race, retirement, and I'll just paste this in. Oh no, I got this one wrong. Here we go. So one thing we find is that white families have much larger retirement savings than Black or Hispanic families, and that gap has been growing over time, going up a lot over time, even adjusted for inflation. So we learn some things from this about racial income inequality; well, this would be a case of wealth inequality.
Let's see what we've got next. So that's retirement; now I've got home ownership. Let's look at tt$home_owner. Again three columns. Here's something I'm learning: a lot of these have basically three variables: year, race, and something else. Okay, here's what I'm going to try doing; I'm going to do a shortcut. I'm noticing that whenever you start writing the same code again and again, you probably want to write a function. So I'm going to say plot_line, and at this point I'm going to say data and a column, and the idea is: which column am I visualizing? I'm going to call it plot_by_race; that's what it's doing here. And now do by negative column. Notice I'm doing this tidy eval embracing. That means I get to pass this in as a bare name. I also want labels; let's do a default of dollar, since most of these have been dollars. And this will
this I can pass as a bare name, and it passes that bare name along as if I was sticking the actual name in here. This is a useful way to create your own tidyverse-like functions. I almost always want to start at zero, so that's pretty good, but I am going to give them different y-axis labels, so I'm just going to take that out of the function.
Now let's try refactoring the plots that I already made. So I'd say plot_by_race, and I'm plotting loan_debt_pct by race, and I'll say labels = percent. What just happened? Oh, I meant to do a plus, not a pipe, on this. So here we go: plot_by_race, plus labs on this. And notice now I can just change one thing: I can change this to loan debt in dollars and put in my y-axis label. Here we go. And this is also dollars. So it does all these steps, the reordering and the... you know, I said I did the reordering, but let me check this. Black, white, Hispanic: that order makes sense. White, Black, Hispanic... ah, so it's doing the reordering, but the problem is it's doing it based on the median. I'm actually going to do it based on the last value. So fct_reorder takes a function that it uses to do the aggregation, and it's easiest to read if it goes by the last value of this.
So down to plot_by_race: I'll need retirement, and look at me, always forgetting to do this, and now we have this as dollars. Another thing I can do: there's always year on the x-axis and the color is race, so I can pull those out into the function, even though I'm not going to pull out the y-axis label. So now we have another graph. Why am I doing it this way? Because now I can say plot_by_race, home_owner_pct, labels = percent, and I'll copy this as home ownership percentage. And we've got a home ownership percentage over time.
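A helper along the lines described above might look like the sketch below. It is a reconstruction from the narration, not the exact function written on screen: the name `plot_by_race` and the `labels` default come from the transcript, while the `arrange(year)` step and the toy `home_owner` table (with invented values) are assumptions added so the example runs on its own.

```r
library(dplyr)
library(ggplot2)
library(forcats)
library(scales)

# Plot one column over time, colored by race.
# `column` is passed as a bare name and forwarded with {{ }} (embracing).
plot_by_race <- function(data, column, labels = dollar) {
  data %>%
    arrange(year) %>%  # assumption: sort so last() really is the latest year
    # Order the legend by each race's final value rather than its median
    mutate(race = fct_reorder(race, -{{ column }}, .fun = last)) %>%
    ggplot(aes(year, {{ column }}, color = race)) +
    geom_line() +
    expand_limits(y = 0) +
    scale_y_continuous(labels = labels)
}

# Toy stand-in for the home ownership data; the values are invented.
home_owner <- tibble(
  year = rep(c(1980, 2000, 2016), times = 3),
  race = rep(c("White", "Black", "Hispanic"), each = 3),
  home_owner_pct = c(0.70, 0.72, 0.71,
                     0.45, 0.48, 0.42,
                     0.40, 0.44, 0.47)
)

p <- plot_by_race(home_owner, home_owner_pct, labels = percent) +
  labs(y = "Home ownership percentage")
```

In this toy data the median would put Black ahead of Hispanic, while the last value puts Hispanic ahead of Black, which is the kind of mismatch with the line endings that motivated switching the summary function.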
That's a case in which there hasn't been a ton of change over time, but there is definitely a racial gap. So that's one thing we can see. All right, so what we've made here is one bar plot and then four line plots. These have a very similar shape, which is why I was able to pull them out into a function.
And now we've got race wealth. This, I think, is a little bit different. Yep, we've got average and median. It's nice to have both average and median, especially for data like wealth that tends to be what we'd call log-normally distributed, with a really long tail. So in this case I would have to adjust this. Here's what I'm going to do: I'm going to make this a little bit more flexible. plot_by_race can take an extra argument of a dot-dot-dot, an ellipsis, which then gets passed on to the aes. That means extra aesthetics. So now if I say plot_by_race, I know that I want y to be wealth_family, but then I also want linetype = type. So now I've got the average and the median. Something that's interesting: I could also add facet_wrap by type. I actually think I might like this more, because you don't compare the average and median directly, necessarily. And now we have, oh I see, it's non-white up to this year and then it splits off. Yeah, I think I like this variation.
Let me remind myself what the documentation for this was: family wealth, 2016... So I'll say family wealth, 2016 dollars. We have a choice here: do we free the scale? I'm not 100% sure. I can free these up, and then it makes it look like they're the same when they're on different scales. I actually think this might be the right
approach, because it lets us focus on how the shapes are similar, and then as a secondary thing we notice this difference, that the average is a lot higher than the median. Why is the average higher than the median? Because it's long-tailed; it has what we call a log-normal distribution. If you throw extremely rich people in here, it's going to push the average up a lot without pushing the median up.
All right, I just noticed that I had a couple of questions that I hadn't seen, so let's go through a couple of questions on these first graphs. Can you add the labels to the ends of the lines? Oh, like put the labels here. Yes, it just takes a second. Let's try this with one of these graphs. So imagine I want to put race at the end of each of these lines. What I'd have to do is take my data and... there are a few ways I could do that. I need to grab the last. Actually, I'll do group_by race, and I'll do top_n with descending year, one, to just grab the highest year for each race. Oops. I really should not name this data; it's not a good habit (naming it table is not any better). So what am I doing? I'll say tt$retirement. Okay, so the question is how can I add labels here, here, and here. So what I'm going to do is... oh, I did top one descending; I meant top one. Take my last year, and now add a geom_text, aes label is race, and hjust is, I think, zero; I think that's left-justified, though I always say that and I actually always forget, so let's find out. And what I'll say is, actually I can keep the other ones and say data = last_year. So now it pops up these three at the end of the lines. Now
this is not perfect yet, I can tell, for a couple of reasons. First, I'd do expand_limits to, let's say, x is 2025 or something like that, or 2022; this seems fine. The other is I might actually not want to do separate colors for each, and then I can skip this... oops, I cannot skip that; I need color = race here. But I can skip the legend if I put this on the end of the graph, because I've got this information. There's my plus; I was missing that plus. Do we want to keep the color? I'm going to skip the color. Minakashi points out that I could use annotate. That is helpful in terms of going off the side of the screen. I like putting it here because it's very easy to figure out where the y is. Annotate is another strong approach.
A couple of questions we have. Could we check the proportion of students by race? I think that's not in the data we have so far; if that's in an upcoming data set, we should definitely visualize that in that way, to say Hispanic students constitute x percent of student debt.
Other questions: two people asked for the difference between sym and the brackets. The brackets are just the latest, and I find them easier to remember. Another way I could have done it would have been sym, I think. Let me double check that... oops, nope, doesn't like that. Maybe I could have done, what is it, as.symbol? The truth is I've gotten pretty accustomed to using the brackets, and they sort of, quote, just work in a way that I really like. Is it bang-bang as.symbol? See, I'm not getting it. I'm getting, like, as.symbol of column; it's telling me it can't find this, like it's trying to evaluate it. So I don't know, it doesn't seem to be going
through here. I just really like these double braces, and the recent guides to non-standard evaluation in the tidyverse that I've seen do recommend using the double braces. Just really nice. So let's keep this approach. One thing I'm going to do: I don't like how it's right there on the line, so besides hjust I'm also going to add nudge_x is 0.1, 0.2; I'm just moving it a little to the right. Yeah, that looks pretty good. So I add this in, and here's another one, and here's another one. There are other approaches; we could have kept the legend. I think this is neat. As it happens, we can facet it. One problem with the faceting is we'd probably need to expand limits a bit more, because that adjustment isn't going to work when it's all squeezed together. So that's one quick thought. Alternatively, we could have put them on top of each other; that would have worked pretty well. So this is what we tried in terms of creating a plot for this.
All right, let's look at the next set. We've created four plots out of three data sets, and then we did two from student debt... nope, I take that back: we did one, two, three, four, five plots out of four data sets. So what's the next one we might want to take a look at? We did race wealth, and let's take one more look at that, between average and median. And then I've got income time. Oh, this is cool, this is a different visualization. First, it doesn't involve race, but secondly, here's what I think is pretty neat: I'd want to do income_family, color = percentile. It's three percentiles. All right, so I'm going to show two ways of doing this. One, we color by the percentile. Now we see this visualization. I'm probably going to need expand_limits y = 0, and a couple of other things I can
throw in: a scale_y_continuous labels = dollar, and a little bit of labeling: year, and y is family income; the color is percentile. All right, so I can show these three. I probably also want to reverse the order of the percentiles, which I can do the same way I did earlier: percentile is fct_reorder of percentile by negative income_family. Fine, but not perfect. I would actually rather show this a different way: I'd rather show it as a ribbon, because this is showing a median, a 90th, and a 10th percentile, this band of typical incomes. So I don't love that it's three separate lines.
Here's how I would visualize it instead. First I'd spread them. You could also use pivot_wider; I happen to just find spread really easy to use. I can spread percentile by income_family, and now I've got the 10th, 50th, and 90th percentiles. So I could visualize this again, but say year, and I'm going to plot 50th (notice I need these backticks because this is not a legal R variable name), and then say ymin is 10th, ymax is 90th. So I needed to reshape this data so that I could do a geom_ribbon. geom_ribbon lets us do an x and y but also a ymin and a ymax. A ribbon generally needs to be transparent to be readable. I neglected to throw in a geom_line to represent the median. Maybe make this a little less... here we go. Family income: median with 10th and 90th percentiles. That's a better visualization. See, now I don't need a legend at all. I could have had a title or something like that, but I'm trying to fit this all in the y-axis label. So this is the visualization I'd show for that. Notice I did that spread, then that plot with a ymin and a ymax, which is how I represent that kind of ribbon. Next, that was income time.
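The reshape-then-ribbon step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `income_time` table is a toy stand-in with invented values, and the column names `year`, `percentile`, and `income_family` are taken from the narration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)

# Toy stand-in for the income-over-time data; the values are invented.
income_time <- tibble(
  year = rep(c(1970, 1990, 2010), each = 3),
  percentile = rep(c("10th", "50th", "90th"), times = 3),
  income_family = c(10000, 50000, 100000,
                    11000, 58000, 140000,
                    12000, 62000, 180000)
)

# One row per year, one column per percentile
# (pivot_wider(names_from = percentile, values_from = income_family) is the newer equivalent).
income_wide <- income_time %>%
  spread(percentile, income_family)

# Backticks are needed because `10th`, `50th`, `90th` are not syntactic R names.
p <- income_wide %>%
  ggplot(aes(year, `50th`, ymin = `10th`, ymax = `90th`)) +
  geom_line() +                # the median
  geom_ribbon(alpha = 0.25) +  # transparent band from the 10th to the 90th percentile
  expand_limits(y = 0) +
  scale_y_continuous(labels = dollar) +
  labs(x = "Year",
       y = "Family income (median with 10th and 90th percentiles)")
```

Because the band and the line are self-explanatory once the y-axis label names them, no legend is needed.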
Now income limits. What's income limits? This is by race and quintile. So could you go back to a ribbon on this? This is really cool. I've got some thoughts on how I might visualize this. I might start by spreading income_quintile by income_dollars, so that I can do lowest, second, fourth, et cetera, and then show the top fifth. What kind of ribbon do I want to have here? I'm trying to look at the income quintiles. Third quintile, that's like the 60th? I'm actually trying to note what the lowest quintile is: is that the 20th percentile, then 40th, 60th, 80th, 95th? I think that's what's happening here. Income limits: okay, it's the top of that limit. All right, that makes sense. I've also got dollar type; current meaning, like, at the time. I'm just going to filter for dollar_type equals 2019 dollars; I really don't want inflation to have an impact on this. Okay, so we're going to look at a distribution.
Oh, I just saw a question from Matt: could you do a box plot over time? In general yes, not for this data though, because this data doesn't have a 75th or a 25th percentile, it doesn't have outliers; it only has three data points for every one of them. If I did a box plot, it would be very misleading. This is a great question, so I want to make sure I answer it. If I did a box plot here of income_family (I also need group = year), it's in some ways similar, but it makes it look as if there are all these percentiles. Not a silly question at all, it was a great question; it just happens not to fit with this summary data.
That's a great question. All right, so let's take a look at... oh yes, we're doing it at 2019. I'm going to start with the geom_ribbon, similar to the last one we did. Actually, no, I'm going to start without spreading it. Can I do, yeah, I can actually do plot_by_race the same way we did before. Here's what I could do: I could say filter income_quintile equals lowest. I can start with just one and then plot it by race, with number... so this is, what, the 95th quintile of income? Something's up with this. Wait, that was lowest. Oh, number is just wrong; number is total people. Silly me, that was a number of people. Okay, income_dollars is what I want to visualize. This makes more sense. I'll need to expand it a little wider.
All right, it looks like we potentially have duplicate data at a couple of these. So if I filter for, say, race is Black alone, for example, and income quintile is lowest, do I have duplicate data? I don't see the duplicate data. I see income dollars... Let me check with count race, sort = TRUE. Black alone or in combination; Black alone. All right, yes, I probably want to remove the "or in combination" ones. I'm not quite sure what they mean; they also look really close to these, so presumably it's some small technical difference. So I'm going to do not str_detect race, "or in combination". That was close, but why am I still seeing a duplicate data point here? Let me see. Oh, this is Black alone... I'm seeing, like, duplicate data here. Is one of them... nope, they both look like Asian alone. That's a little strange.
Hmm, I do not understand. This jagged effect usually happens when there's duplicate data, when there are two rows for each year. So if I say filter race equals Asian alone... nope, looks like one for each year. Yep, it's still duplicating. What am I doing wrong? Count race... that doesn't make any sense; there's only one. In plot_by_race I did an fct_reorder; that must have somehow split it up. That's a little bit odd. I am stuck in a bug. Race by income dollars... two for 2018? Oh, is there a duplicate year? No, I only see one for 2018. This is fun; I'm actually just a little confused. Say I take this and then say ggplot year by income_dollars, and let's see, and if I say color equals... oh, I just figured out what happened. So wait, first of all, why is this jagged? There are actually two different issues. First is that I did top_n one, like a jerk, because the problem with top_n is that if there's a tie, it keeps all the tied rows. So there are duplicates, because if there's a tie here then there must be duplicates. And if I do count year... oh yeah, there are duplicates. I don't know why, but anyway, that's what's happening here: there are duplicates even within this data. That's so frustrating. So what I'm actually going to do is add in distinct race, year, and income_quintile, with .keep_all = TRUE to keep all the other columns. The reason it got duplicated in the plot was that I was doing that trick with showing the text at the end, which I had completely forgotten about. All right, that was charming, I'm sure, to watch. So there is duplicated data on Asian alone. All right, so these are
like the lowest quintile. I think what I want to do is the lowest and the fourth, that's 20th to 80th percentile; I want to do a geom_ribbon. Actually, I can start with the top five percent. I think that's one visualization I can make here: filter for income_quintile is top 5%. Top five percent income level. So the story is: across all races, someone's in the top five percent of income, in 2019, if their annual income is about 260,000. To reach that point among Black Americans it would be a little under 200,000, whereas among Asians it would be more like 350,000. So those are some things we can learn from this visualization.
I can also ask in terms of a ribbon. And if I wanted a ribbon, here's the fun thing: I'm going to copy that earlier ribbon that I did, and I'm going to add a few things. I'm going to say fill is race. I don't actually have the medians here, so I'm going to drop the geom_line, and I don't need a y at all. I want lowest versus fourth, I think it was called; I don't need the backticks anymore. And this one's going to be called income_limits; it's not called percentile, it's called income_quintile, and this is called income_dollars. Here's me learning how to get all these right. Oh, and my mistake here was that I need fill = race to be inside the aes. Moment of debugging. Something's up with this; I neglected to do a few things. I need to do this filter and this distinct, that kind of data cleaning that I need to do on income_limits. So now we can take a look at the ribbons. And I don't have a y-axis label anymore; I need to say y is 20th to 80th income quantiles.
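The cleaning steps and the per-race ribbon described above can be sketched like this. It is an illustration under stated assumptions rather than the code from the stream: the `income_limits` table is a toy stand-in with invented values, and the exact category strings ("2019 Dollars", "or in Combination", "Lowest", "Fourth") are assumed spellings.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)

# Toy stand-in for the income limits data; values and category spellings are assumptions.
income_limits <- tibble::tribble(
  ~year, ~race,                           ~dollar_type,      ~income_quintile, ~income_dollars,
  2018,  "Asian Alone",                   "2019 Dollars",    "Lowest",          30000,
  2018,  "Asian Alone",                   "2019 Dollars",    "Lowest",          30000,  # duplicate row
  2018,  "Asian Alone",                   "2019 Dollars",    "Fourth",         150000,
  2019,  "Asian Alone",                   "2019 Dollars",    "Lowest",          32000,
  2019,  "Asian Alone",                   "2019 Dollars",    "Fourth",         155000,
  2019,  "Asian Alone or in Combination", "2019 Dollars",    "Lowest",          31000,
  2019,  "Asian Alone",                   "Current Dollars", "Lowest",          32000,
  2018,  "Black Alone",                   "2019 Dollars",    "Lowest",          10000,
  2018,  "Black Alone",                   "2019 Dollars",    "Fourth",          80000,
  2019,  "Black Alone",                   "2019 Dollars",    "Lowest",          11000,
  2019,  "Black Alone",                   "2019 Dollars",    "Fourth",          85000
)

cleaned <- income_limits %>%
  # Keep inflation-adjusted dollars and drop the "or in combination" categories
  filter(dollar_type == "2019 Dollars",
         !str_detect(race, "or in Combination")) %>%
  # Drop repeated race/year/quintile rows but keep the remaining columns
  distinct(race, year, income_quintile, .keep_all = TRUE)

# One column per quintile, so the ribbon can run from Lowest (20th) to Fourth (80th)
limits_wide <- cleaned %>%
  spread(income_quintile, income_dollars)

p <- limits_wide %>%
  ggplot(aes(year, ymin = Lowest, ymax = Fourth, fill = race)) +  # fill goes inside aes()
  geom_ribbon(alpha = 0.25) +
  expand_limits(y = 0) +
  scale_y_continuous(labels = dollar) +
  labs(y = "20th to 80th income quantiles")
```

Without the `distinct()` step, the duplicated row is what produces the jagged lines and the doubled end-of-line labels discussed earlier.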
So now what we can see is, okay, for Asian alone the entire income distribution is shifted up a little bit, and it looks like it might be lowest for Black alone. I probably want to do that reordering that we've been doing, where I'd say mutate race is fct_reorder race by, let's say, negative... let's do it by the top, for example. We could do it by the top or the bottom, or go halfway in between. I don't have the median in this data set, otherwise I'd put a line for the medians, but this lets us look at the ranges over time, so I think that's pretty handy. And we see Asian alone at the top, then white alone not Hispanic, white alone, all races, Hispanic, and Black alone. So mostly it's similar to what we get from some of the other race comparisons, but these are some quantiles, and this is me showing it as a ribbon.
I'm going to show another way we can do it, without doing the spread. We could have done year, income_dollars, color = income_quintile. I would definitely have needed to do income_quintile is fct_reorder income_quintile by income_dollars. They're not going to overlap, so I'm not worried about that. And then geom_line and facet_wrap by race. So here would be that kind of thing that I decided not to do when I landed on a ribbon. scale_y_continuous labels = dollar, and say income quintile. I always forget to do the negative here; notice the legend is in the opposite direction, and I want to fix that. All right, so these are two ways of looking at our income quintiles. This has more information, in that it shows lowest, second, et cetera. Also, now that I'm looking at it, it's not that hard to compare across these,
especially... so this might be the visualization that I would end up landing on. Lowest, second, third: I'd probably rename these 20th, 40th, et cetera, but I'll go with their names. The last thing I'd note here is that in a visualization like this, because it has so many individual numbers, I'd probably make it an interactive plot, and I can do that really quickly with ggplotly. Why interactive? Because somebody might want to zoom. You might want to know what the limit is for, like, the top five percent in this particular year, and it's just really nice to be able to zoom in on it and get a particular number. When I see this many numbers on a page, I usually like to do it in plotly. And see how fast that was: I just did library(plotly) and wrapped this up in the ggplotly function. That was kind of neat.
The last question is: can you facet_wrap on quintile instead of race? Sure. If I facet_wrap by income_quintile, the first thing is I'll need to flip the ordering, because I want to go one, two, three from lowest to highest, and then color based on race. Oops, I'm just thinking about what is going to make this plot work off the bat. This always means that I put a pipe somewhere that needed a plus, or something like that; in this case, an extra parenthesis. So I can facet based on quintile. I might at this point want to make the labels free. I don't want to put the text in the graphs, because that would be a lot of repetitive text, but I could probably do this and then put the labels inside. This is a question of: do you most want to examine the distribution within one race at a time, or do you
want to compare across the races and choose, one at a time, which quintile you want to take a look at? Those are some of the ways you could approach this problem. All right, so how many graphs is that on income quintiles? Four total. We could just look at one... and you know what, I'm going to skip this. I can just delete that one because it's included within some of the others. Here's the 20th; I could look at just the 20th to 80th; I could look at the income distribution for each of these; or I could look at the races and how each quintile has changed. I'm going to comment this out, but it's an idea.

All right, so I said I would make a graph on every one of these, and we should be able to do that. I'm going to go to the tt object again. We looked at income_limits; now let's look at income_aggregate. So this is a share. This, I think, is cool, because I'm going to guess that if I group by... we have total population, which I haven't looked at yet... we can group by year and look at income_share. I think I'm going to interpret this wrong. Does it add up to 100? It's not going to add up to 100, or maybe it is. Oh, I bet it is if we skip the top 5%. Let's find out. If I do sum of income_share... nope, it adds up to well over a hundred percent. Oh, I forgot to do it by year and race. And what if I filter, because I bet the top five percent overlaps with the highest quintile? So if I say income_quintile is not equal to "Top 5%", leaving these five, there it is: it adds up to 100, or pretty close, and rounding error will cause it not to be exact. I feel strongly about percentages stored on an out-of-100 scale, so I'm going to rescale them. It also makes it easier to visualize. And the cool thing I can do is... oh yeah.
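The sanity check described above, whether the shares add up to 100 once "Top 5%" is excluded, might look something like this. It is a sketch under assumptions: the six share values are illustrative only, and the column names are taken from what is read aloud in the screencast.

```r
library(dplyr)
library(tibble)

# Hypothetical single year/race slice; the shares are illustrative only.
income_aggregate_toy <- tibble(
  year = 2019,
  race = "All Races",
  income_quintile = c("Lowest", "Second", "Third", "Fourth", "Highest", "Top 5%"),
  income_share = c(3.1, 8.3, 14.1, 22.7, 51.9, 23.0)
)

# "Top 5%" overlaps with "Highest", so drop it before summing,
# and sum within each year and race rather than over the whole table.
share_totals <- income_aggregate_toy %>%
  filter(income_quintile != "Top 5%") %>%
  group_by(year, race) %>%
  summarize(total_share = sum(income_share), .groups = "drop")

share_totals
```

With these toy numbers the total comes out slightly above 100, which is the kind of rounding error mentioned in the screencast.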
This is a really cool graph. I'm going to do a filter; we already did the not-top-five-percent one. All I'm going to do is ggplot with year, income_share, and fill equals race, and here's the key: geom_area. geom_area is great for this. facet_wrap by race. Why is geom_area... oh, it's a complete nightmare. This didn't work at all. Oh, I did fill equals race; I meant to do fill equals income_quintile. That's the one. All right, what this is showing... the term for it, I think, is something like a spinogram, or an area plot. What this is showing is how the income distribution has changed over time.

I also realize I need to reorder income_quintile. The challenge is: lowest, second, third, fourth, highest, how can I make sure they're in the right order? Well, notice that they're in order in the data as it happens, so I can use fct_inorder of income_share... oh, I did income_share; I meant income_quintile. fct_inorder orders the levels by whichever one appears first. Boom: now we have lowest, second, third, fourth, highest. And now labs: y is percentage share of income (not of wealth, gotcha), and scale_y_continuous, labels equals percent. I'm going to drop the "in combination" ones; they're confusing and I'm not 100% sure what they mean. Also, that gives us six panels, which is really helpful. And one more thing: the fill label is income quintile. What this does is show the income breakdown over time, and share over time is really intuitive as an area plot. We see that in basically every race the share of income held by the top 20 percent has been increasing, and in most of the races it's been around 50% recently: the top 20
percent earn 50% of total income. The lowest quintile, the bottom 20 percent, is fairly steady, though it looks like it might even be shrinking. I wonder which of these is shrinking; where is that coming from? You can kind of see the third quintile is shrinking. I've heard there's a shrinking middle class, and in some ways we can examine that here. So this is looking at the percentage share of income held by each quintile. It's a pretty cool graph. I don't need the axis to say year, so let's title it "Income distribution over time". This is pretty cool. The other trick we used was fct_inorder, just because I happened to see that the first five rows were lowest, second, third, fourth, highest. I guess I could have reordered by total and it would have come out the same. So that's how we set this up, and geom_area, if you haven't seen it before, is really helpful.

Now, what other ways could I have examined this? One is I could have asked about the top five percent, so I'd say income_quintile equals "Top 5%". I couldn't include that in the area plot because the shares wouldn't have added up to a hundred. Here we go: this filter, and now I can go right back to plot_by_race with income_share. And the answer here is that it's pretty similar across all of them. Oops, I need to say labels equals percent, and I'll need to do mutate income_share is income_share over a hundred (in this case I probably don't). Now they're so crowded this doesn't look amazing. I'd probably want it as a separate legend, but truthfully, to me it doesn't look like that informative a plot. There's a big shift in the top five percent. I wonder what that
had to do with... I'm not sure, welfare reform or something like that in the 90s. But you see a shift in income share towards the top five percent. Otherwise it doesn't look like there are notable differences here. Someone mentioned that we could use geom_label_repel here. I could, but I'm nervous: we can do it, but the labels could still end up off the screen. Where's geom_label... we could use ggrepel. Yeah, sure, let's do it: geom_text_repel. That's actually my big issue here; I'll show you what it would look like on, say, this graph. See, the trouble is it doesn't know to go on the other side of it, and the nudge is just a little bit harder to control in that sense. I'm going to skip it in this case, but it's a good suggestion. I'll rerun that graph. So notice that even in cases where the data is a little bit more complicated, we can still use plot_by_race for some of these visualizations with income_share. All right, that was income_aggregate.

I have two more. I said I would make a graph of every single one, and we have about eight minutes left, so let's look at income_distribution. Oh man, we've got income median, income mean, margins of error, and income brackets. So this is broken down by bracket. There are a lot of ways we can visualize this. One is income distribution by bracket. Oh, I kind of like that. It doesn't say whether it's corrected for inflation. Within a race, yes, we can do something really similar to what we did over here, and things are going to change. You're going to say, why did I group? What
are they doing here? It's really similar, except that the column is now called income_distribution instead of income_share (I'm not sure why), and income_bracket is this one. Here we go: now I can say income_distribution, income_bracket, and no more filter here. I'm trying to create a similar area plot. Yeah, this is pretty neat. Note this is not what percentage of income is taken by each group; this is showing the breakdown within each race in terms of people earning under $15,000, up to $200,000 and over. This is a cool graph, I think. You could zoom in on a lot of these. Having said that, I think you want more detail than this. I'm not gaining a ton from this; I'd rather have some specific idea, like the change in the median over time, which we were already looking at.

But something that's neat about this kind of graph is that I can also look at the totals with very little change to this code. Here I was doing it based on the distribution. If I skip the distribution and work based on the number of people... where is that? It's called number. If I use number as the y-axis and a comma format instead of a percent format, now I've got stacked area plots. That was actually the number of households. I don't think this is a good plot, but it does draw your eye a little bit more to the larger ones. I think if I were going to show a graph like this, I would add one other step where I'd say race is fct_reorder of race... oh no, not reorder, I'd do fct_infreq. We should put them in order of frequency. That should
have worked. Oh, fct_infreq weighted by... wait, does that work? Nope, doesn't work. Oh well. But I can do a real fct_reorder of race by number, using sum to aggregate them. And it's still not working. Absolutely nothing; that didn't work one little bit. Oh, it's probably because... oh, it's missing values. Missing values, yeah, that would probably do it. Yeah, that's pretty good. And I'm going to fct_rev. Anytime reorder doesn't work, sometimes it's grouped tables, sometimes it's NAs. fct_rev reverses that order, and look at that. So this is another way I could lay out my plot, by number of households.

This number of households is not the number of households, no way. Why is that 120 million? What am I doing here? Why is number so high? I don't really understand this madness. We could free the scales, but... is this global? I assumed it was the United States, but I never checked. Is it global? No, it's in America. I do not understand how we have 1.2 billion households across all races. Oh, households wasn't broken down by bracket; we can't do this graph at all. Oh no. Thank you for pointing that out, Jamie.

All right, the last visualization (that one was a mistake) is income_mean. Also notice I could have done things with median and mean; we've looked at that before. This is the mean income received by each fifth and the top five percent, so I'm going to do this by quintile. We can do this similarly, asking a question of income_mean. This is actually really similar to what we were doing for limits earlier, so I'm going to go back to that one and see what we did with income_limits. What we saw there is where your thresholds are. This is very similar to that.
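The household-count problem noted above can be reproduced on a toy table. This sketch assumes, as the screencast concludes, that `number` is a per-year, per-race household count repeated on every bracket row; the values are invented, and using `distinct()` here is one possible way to count each group once, not something shown in the video.

```r
library(dplyr)
library(tibble)

# Hypothetical toy table: `number` repeats on each bracket row of a group.
income_distribution_toy <- tibble(
  year = 2019,
  race = rep(c("Group A", "Group B"), each = 3),
  number = rep(c(1000, 400), each = 3),
  income_bracket = rep(c("Under $15,000", "$15,000 to $24,999",
                         "$200,000 and over"), times = 2),
  income_distribution = c(10, 20, 70, 30, 50, 20)
)

# Stacking or summing `number` across brackets counts every household
# once per bracket, which inflates the totals.
naive_total <- income_distribution_toy %>%
  group_by(year, race) %>%
  summarize(households = sum(number), .groups = "drop")

# Keeping one row per year and race recovers the intended counts.
households_once <- income_distribution_toy %>%
  distinct(year, race, number)
```

Here `naive_total` is three times too large because there are three brackets per group, which is the same effect that made the stacked totals in the screencast look implausibly high.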
It's actually close to identical. We just take the exact same thing and say income_mean instead, and actually maybe the column names are the same. Yep, so the data share the same shape. The difference is that there I was looking at the quintile limit, as in what threshold puts you into the top five percent, and this one is saying what's the average within the top five percent. Notice it's therefore higher. This might be a good place to say scales equals "free_y", and if I do a free y here, I'm definitely going to want expand_limits, y equals zero. Why do that? Because it lets you show that within every race there's a similar pattern appearing. Does the gap in the average grow over time? Maybe. Is it higher for bigger income levels? Yes, I think probably through the nature of the distribution. So there are some ways we can look at it. And I notice I put this up here with income_limits.

All right, so we made a lot of graphs; we made graphs based on every single one. Let's run through them really quickly. For some of them there's basically one graph we can make: here it's a bar plot looking at lifetime earnings by race and gender. Then we realized there's a whole bunch of plots that fit such a similar shape that we kept them in one function, plot_by_race. We can look at student debt: percentage of families with student debt over time, or average family student loan debt over time, showing similar conclusions. We can look at average retirement savings. What we see is that inequality exists across all these metrics, but the inequality and the trend are different in each of them, so it's giving us different sides of this picture. We looked at home ownership, which is an area with plenty of inequality, but the inequality hasn't really been changing over time. We can also look at family wealth, a case where
inequality has been changing over time, and this was a case where we looked at both average and median because they helpfully gave both to us. Then we took a look at income over time. We showed this graph, and I said I wasn't crazy about it: it showed these three lines, but I prefer to have it as something like a ribbon because that's more evocative. In fact, I'm going to go ahead and drop this one. Why not? If I look back later, I'd probably just want to keep this one, which shows the median and the shifting 90th percentile. As noted earlier on, the 10th percentile and median have been pretty steady; it's really the 90th percentile that has been going up. That's definitely a story in terms of increasing inequality.

The next one is looking at quantiles by race. I can look at the 20th to 80th percentile. That was a combination of a spread, like I did in the earlier one, and then a geom_ribbon. Then we took a look at the income quintile limits. We looked at two different ways of viewing the limits: one was by race (here are all the limits), another was by quintile (here are all the races). Depending on which way we want to slice this data, they could both be informative. There's also a very similar data set with the average within each of those quantiles rather than the limit that puts someone in that quantile. Personally, I think the limit is a little more intuitive: people might want to know what quantile they're in and get a threshold for that, or what threshold someone has to reach to hit a given quantile. But your mileage may vary. Then, finally, we took a look at the income quintile distribution, which was one that I got stuck on and had that whole bug. It was up here, because I
needed a distinct. I zipped right on by that in my summary, but that bug took a while. Then we looked at income distribution over time, and I pointed out that geom_area is a great way to show this, and fct_inorder is a great way to make sure these quintiles are in an interpretable order. I also said we could just look at the top five percent. I don't think this one is very informative, because the share of income held by the top five percent, unlike the threshold for the top five percent, doesn't differ a lot between races. And we took a look at the income brackets over time. What I'd note here is that I probably would look at something else: something like what percentage is earned by the top, and how that is changing. I'd probably zoom in on one of these brackets and look at its share over time; that would be a different way to examine this. This is definitely the table that I spent the least time exploring, and income_distribution has a lot of information, like the median and the margin of error for each of these, not to mention number, which I never really looked at all that much.

All right, so what did we learn from the big picture today? One is that you can make a lot of graphs, and you can make them fast if you have a little bit of practice. Also, when you make a lot of graphs that are similar, it sometimes helps to put them into functions like this, which you'll notice I kept going back to whenever I could to make a visualization with relatively little boilerplate. If I went back and refactored my code, I could probably find a few more places where I could reduce that boilerplate. And I'd be excited to look later at those nine charts that were described. I really wonder what...
Yeah, some of these are similar. See, I did a ribbon while this plot did three points, but some of them are pretty similar. The bar plot I organized a little differently in terms of fill and axis. And they do some, I think, really insightful analysis, I'm sure, while I was being very shallow with my interpretation of these graphs. All right, so that was at least nine graphs, and I think I could do more on this data set. I hope you had fun; I certainly did. If you enjoyed it, please be sure to like the video and subscribe to the feed, and I'll see you next week.
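For reference, the stacked area plot of income share by quintile discussed in the screencast can be sketched along these lines. This is an approximation under assumptions, not the code from the video: the shares are invented for two years and one race, and only the column names and the `geom_area()`, `fct_inorder()` and `labels = percent` steps come from what is said aloud.

```r
library(dplyr)
library(tibble)
library(ggplot2)
library(forcats)
library(scales)

# Hypothetical toy shares (out of 100), with "Top 5%" already filtered out.
income_share_toy <- tibble(
  year = rep(c(1990, 2019), each = 5),
  race = "All Races",
  income_quintile = rep(c("Lowest", "Second", "Third", "Fourth", "Highest"),
                        times = 2),
  income_share = c(3.8, 9.6, 15.9, 24.0, 46.6, 3.1, 8.3, 14.1, 22.7, 51.9)
)

area_data <- income_share_toy %>%
  mutate(
    # Rescale to 0-1 so that scales::percent labels the axis correctly
    income_share = income_share / 100,
    # Keep the levels in the order they first appear in the data
    income_quintile = fct_inorder(income_quintile)
  )

share_area <- ggplot(area_data, aes(year, income_share, fill = income_quintile)) +
  geom_area() +
  facet_wrap(~ race) +
  scale_y_continuous(labels = percent) +
  labs(y = "% share of income", fill = "Income quintile",
       title = "Income distribution over time")

# print(share_area) draws the plot
```

`fct_inorder()` only gives a sensible order here because the rows happen to run from lowest to highest; if they did not, the levels would need to be set explicitly.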
Info
Channel: David Robinson
Views: 3,143
Rating: 5 out of 5
Id: WxKSauhOY4g
Length: 65min 30sec (3930 seconds)
Published: Tue Feb 09 2021