Explore demographic employment data with k-means

Video Statistics and Information

Captions
Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. Today in this screencast we're going to use this week's Tidy Tuesday data set on employment from the Bureau of Labor Statistics and how employment varies with demographic info. We're going to use k-means, one of the best-known and most widely used unsupervised machine learning algorithms, to understand which kinds of employment, like industries and occupations, are most like each other. We're going to tidy k-means output, use the sum of squares within clusters to understand how we should choose k, and then explore the output using an interactive visualization.

All right, let's get started with this week's Tidy Tuesday data set on demographics and employment. This is a good screencast for folks who are getting started with k-means clustering. So let's take a look at this employment data. It's from the Bureau of Labor Statistics, and we have information on industry and major occupation, and then if we go over here, minor occupation: things like sales, protective services, [Music] farming and fisheries, working in an office, and so forth. The minor occupations are a subset of the major occupations, and people can work in these different occupations in the different industries.

It looks like there are some small problems with how the data was originally scraped; some of the employment numbers are NA. So for starters we can, whoops, that's not what I meant to do. For starters we will only keep rows that have a non-NA value for the number of people employed. Then, I'm interested in looking for demographic differences in who works in these different kinds of occupations, and I'm specifically interested
in the combination of industry and occupation, this minor occupation. So I'm going to group by occupation, and I'm going to say paste, and paste together industry and minor occupation, which makes a new column here that is just those two things pasted together. I'm also going to group by this race and gender variable, and then I'm going to summarize to say what the mean number of people employed in those categories is. This data set has information over a number of years, but I'm not interested in looking for changes over time. I just want to say, for people who work in, say, healthcare or education or mining in these different kinds of roles, what are the kinds of demographic differences that we see? Can we explore that using an unsupervised machine learning algorithm? So then let's ungroup. This data set was already pretty tidy, but we just tidied it a little more, so let's call this employed_tidy. We cut it down; now it only has about 1,400 rows.

Now we can start to get this reshaped a bit to get ready for k-means. For example, we can pull out the total in each of these categories by filtering in this way to say total here. But let's actually do basically the opposite: let's filter for women, for Black or African American folks, and for Asian people who are in those categories and work in these occupations. Now let's reshape this. We have kind of a long, skinny data set now, and we're going to pivot to a wider one using pivot_wider(). The names for the new columns come from the race and gender category, and the values come from
n, and we'll fill them with zero. Notice we reshaped here, and now it is wider. Let's use the janitor package and clean up those names so they're easier to handle. Now let's use a left join and attach the totals onto it so that we can start dividing and find the proportions. We don't need the race and gender variable for the total, because we know it's the total, and we can rename it to total from n here. So what we're doing is taking our pivot_wider() results and then joining on this total over here.

These are effectively the four variables that we are interested in using. For all these different occupations, like agriculture and mining and whatnot, how are these similar along these four axes: how many people work in them, what's the proportion of women, what's the proportion of people who are Black or Asian?

We do have a little bit more prep work to do, though. For example, we need to find the proportions. We want to divide this number by this number, this number by this number, this number by this number, all the way down. We can do that using mutate() with across(). The first thing that you put into across() is which variables you want to apply your function to, and you can pass in a vector of variable names, so we say Asian, Black or African American, and women. The second thing is the function that you want to apply. In this case we just want to divide by the total, so I'm going to use the tilde for an anonymous function, and I say dot, which means the variable we have (Asian, Black or African American, or women), divided by total. So what
this will do is take, remember, before we had these numbers, and now we have proportions. So here, 20 percent of this category is women; for this one, about two percent of this category is Asian. Let's also think about how big these numbers are. Look, that's enormous. This is the total number in these categories, and they range from literally zero to, gosh, millions of people. So first off, let's just put a hard filter on this, because we don't need to be looking at anything under, I don't know, ten thousand; this is across all of America. Then also let's take the log, so total equals log of total, because you can tell this is going to be like a log-normal distribution. Let's look at that now. Whoops, I put this in the wrong place; it's not called total until right here. Okay, great, there we go. So we're down to 211 or so categories that we're going to be comparing, which I think is good for what we're looking at.

Now, we did take the log here, which I think is appropriate because these are proportions and these are counts distributed across a really broad range. But k-means is sensitive to how these numbers are distributed, in terms of how they are centered and scaled. So a good thing to do here is to center and scale, to force each variable to be centered at zero with a standard deviation of one, or just do the default, center equals TRUE and scale equals TRUE. We can do that again here in this mutate
again using across(). Here we want to do it to everything that's numeric, so instead of giving a vector of variable names we can say is.numeric, and then here let's do scale. I think this does something weird. Yeah, if we do it like this, see how it shows these as matrix columns? That's annoying. So let's do it like this, as.numeric, and if we put it like this, that changes it back: instead of being a matrix, we're making it be a numeric vector. Okay, so now these numbers are all ready for k-means.

The last thing I'll do before we actually do k-means is clean up the names of the occupations. I'm going to use the snakecase package and say to_snake_case() on occupation so that they'll all be more consistent. Let's call this employment_demo, for employment by demographics.

Let's explore this a little bit. Let's arrange by women. These are the things that had the lowest representation by women, so we see mining, construction, transportation. If we look at the things with the highest: education, [Music] office workers, private households. So this probably aligns with what we would have expected if you live in the U.S. like I do. What we're going to do is use k-means clustering to see which occupations are most like each other in terms of the demographic representation among these categories that we have, and this total, which, remember, is how many people work in them altogether.

All right, we got all ready, and now it's time to actually do it. The function for k-means is kmeans(), and we put in our data frame, except we have to take out the column that has the occupation in it, because that doesn't go into the clustering. Then we put in an argument
for how many centers. We don't know ahead of time what the right number of centers is; you can probably tell that's the thing we're going to finish up by doing. So let's just start with three, and let's call this employment_clust. That's not what I called it; there we go. We can do something like summary() on this object to see what's in there. It's like a list of stuff; there are various things in there. But, no shock here, I like using tidyverse and tidymodels principles to deal with things, so I'm going to load the broom package. We can do things like tidy() the employment_clust object, and what we see here are the centers. In this four-dimensional space that is Asian, Black or African American, women, and total, this is where the centers of the three clusters that we said to make are, plus some other info. So if we needed the centers, we could do that.

We can also do other things, like augment() employment_clust. When we augment it, we can give it this data set, remember this is the real data, and what that gives us is the data that we had plus which cluster our clustering algorithm decided each row belongs to. So we can plot it now, and we say color equals cluster and geom_point(), and they might be on top of each other, so let's do it like this. This now is our first attempt at clustering.

We can see that cluster one is bigger occupations that have more people working in them, with more women; two is kind of in the middle; and three is generally lower women. It looks like cluster three stretches up to pretty high totals, but lower women. You can tell there's some overlap. We could change now and put Black or
African American on the y-axis and see what that looks like. Oh, that's pretty interesting. That separates more: down here is lower African American, up here is higher African American, and this axis, remember, is whether more or fewer people work in that occupation and industry. Great, so that looks pretty interesting.

But the thing is, we don't know that three clusters was the best number of clusters. You can tell already by looking at the data that this isn't something where there is clean separation between the clusters. This is continuous; there aren't occupations that are cleanly separated from each other. So is clustering the right approach for this? Maybe, maybe not. Maybe we're learning as we look at this that a better unsupervised algorithm for this would be one that doesn't depend on having clusters; maybe we want something that has a more continuous approach. But we can at least go through how we would decide on the best number of clusters.

So let's try k from one through nine: we're going to try one cluster through nine clusters and see which one is the best. We're going to do a whole bunch of things. First we're going to cluster them, and we're going to use map() from purrr. We say kclust equals map, and we map over k, and we say kmeans just like we did up here, and instead of three it's going to be .x, which is going to be one, two, three, four, five, all the way to nine. Then we are going to tidy it. Actually, I don't think we need that, but just so you know how to do this, you can tidy it, and that means we
map over the clusters that we just made and we tidy them. We can glance at them, which gives a one-row summary of how the clustering did; these are measures of how well the clustering is doing in terms of fitting the data. So we can glance, and then we can augment, which is the thing that we just used on this data. We do the same thing where we say map over kclust with augment, and the thing that we send in is the data that we have. So we do all this on all of our data. We've probably seen this before, where instead of columns of data like strings or numeric values we have lists, and in these lists we have little tibbles, so little tibbles inside of other tibbles.

Now I want to take these kclusts and unnest. In this case I just want the glanced one, and if I unnest glanced like so, I have all these values here that tell me how well each of these did. What I want to plot is k on the x-axis and this value here, which gives us an estimate of how well the clustering is fitting the data.

All right. So we were here, one, two, three; the first one we did was here. Notice it keeps going down, which of course it's going to keep going down. If we were in a situation where we had a data set that really had strong clustering, we'd probably see an elbow, and this doesn't have a strong elbow; it just kind of keeps going down. We could maybe keep going, but the variables in this just kind of change smoothly. If I were to look at this and try to pick a best number, I might pick five; I might say five is best here. That's the kind of thing we can see here, as
this within-cluster number drops, telling us what the measure within the clusters is, and how much it is changing as it goes down. So I'm going to say five; five seems like a good number for us here. Let's go back here and make our centers five and do this again. But instead of making a static plot for our final plot, let's make an interactive one so that we can look at what the actual things are in our final result. We have color as cluster, and let's make a name and call it occupation so that we can look at a result and see what we've got here. Let's make this a bit bigger.

Remember that the x-axis is size and the y-axis is representation by Black or African American workers. These things at the top have a high proportion of Black or African American workers: things like protective services occupations, a lot of them are protective services, and transportation. If we go down to the bottom, we're seeing things like business and financial operations. So we can scroll around using this interactive plot and see this. It goes from high to low, and it goes from big occupations with lots of people to small in this direction.

Let's put Asian here so we can explore that briefly. The shape here is quite different. Notice that the things with a high proportion of Asian workers are bigger; this is different than for Black workers. We have professional and business services, wholesale trade, and then very low for construction and whatnot. And then finally let's put in women for our final plot that we look at here. Okay,
so again, high women up here, low women down here; big occupations with lots of people over here, small ones over here. So service occupations, not including protective, like police and whatnot: private households, office and administrative support, health services, leisure and hospitality. Then we go down here and we've got installation, maintenance, and repair occupations and whatnot. And just to emphasize again, we don't see clear separation into clusters, which we could use to understand and evaluate how appropriate an algorithm like k-means is to start with for this kind of problem.

All right, we did it. We used k-means to explore this demographic information on employment from the Bureau of Labor Statistics, and we used that total within-cluster sum of squares information to choose k. We were able to see that these kinds of occupations are more like each other in terms of what proportion of women, or people who are Asian, or people who are Black work in them. Being able to use interactive visualization in a situation like that is such a nice step to be able to do. So I hope this is helpful, and I will see you next time.
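The data preparation described in the captions (pivot wider, join the totals, divide to get proportions, filter, log, then center and scale) can be sketched roughly as follows. This is a minimal sketch on invented toy data, not the code from the screencast: the column names `occupation`, `race_gender`, and `n`, the intermediate name `employment_prop`, and all numbers are assumptions for illustration, and a plain `rename()` stands in for the janitor name cleaning.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the tidied data: one row per occupation and group.
# All values are invented for illustration.
employed_tidy <- tibble::tibble(
  occupation = rep(c("Mining Sales", "Education Office",
                     "Construction Repair", "Tiny Niche"), each = 4),
  race_gender = rep(c("TOTAL", "Women", "Black or African American", "Asian"),
                    times = 4),
  n = c(20000, 4000, 2000, 400,
        90000, 63000, 13500, 4500,
        50000, 2500, 4000, 1000,
        500, 100, 50, 25)
)

# Totals per occupation, renamed so they can be joined back on.
totals <- employed_tidy %>%
  filter(race_gender == "TOTAL") %>%
  select(occupation, total = n)

employment_prop <- employed_tidy %>%
  filter(race_gender %in% c("Women", "Black or African American", "Asian")) %>%
  pivot_wider(names_from = race_gender, values_from = n, values_fill = 0) %>%
  # Assumption: manual rename in place of janitor::clean_names().
  rename(
    women = Women,
    black_or_african_american = `Black or African American`,
    asian = Asian
  ) %>%
  left_join(totals, by = "occupation") %>%
  filter(total > 1e4) %>%  # hard filter on very small categories
  mutate(across(c(asian, black_or_african_american, women), ~ .x / total))

# Log the totals, then center and scale every numeric column.
# as.numeric() turns the one-column matrix from scale() back into a vector.
employment_demo <- employment_prop %>%
  mutate(total = log(total)) %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))
```

Splitting the proportions and the scaling into two steps is only a convenience here, so that the unscaled proportions in `employment_prop` can be inspected before they are standardized.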
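The first clustering attempt from the captions, fitting k-means with three centers and then using broom to tidy and augment the result, might look something like this. It is a hedged sketch on random toy data rather than the Bureau of Labor Statistics data, so the clusters themselves are meaningless; the object name `clustered` and the toy values are assumptions for illustration.

```r
library(dplyr)
library(broom)

set.seed(123)

# Toy stand-in for the scaled data: four numeric features plus a label.
employment_demo <- tibble::tibble(
  occupation = paste0("occupation_", 1:60),
  asian = rnorm(60),
  black_or_african_american = rnorm(60),
  women = rnorm(60),
  total = rnorm(60)
)

# The occupation label does not go into the clustering.
employment_clust <- kmeans(select(employment_demo, -occupation), centers = 3)

# One row per cluster: center coordinates plus size and withinss.
centers <- tidy(employment_clust)

# Original rows plus a .cluster column with the assigned cluster.
clustered <- augment(employment_clust, employment_demo)
```

Note that broom names the assignment column `.cluster`, so a plot of `clustered` would map color to `.cluster`.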
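Choosing k by fitting one through nine clusters and comparing the within-cluster sum of squares, as described in the captions, can be sketched as below. Again this assumes random toy data, and the names `features` and `elbow` are hypothetical; the value plotted against k is taken here to be `tot.withinss` from `glance()`.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(2021)

# Toy numeric features standing in for the scaled employment data.
features <- tibble::tibble(
  asian = rnorm(80),
  black_or_african_american = rnorm(80),
  women = rnorm(80),
  total = rnorm(80)
)

# Fit k-means for k = 1..9 and keep a one-row summary of each fit
# in a list-column.
kclusts <- tibble::tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~ kmeans(features, centers = .x)),
    glanced = map(kclust, glance)
  )

# Unnest the summaries; tot.withinss against k is the elbow curve.
elbow <- kclusts %>%
  unnest(cols = c(glanced)) %>%
  select(k, totss, tot.withinss)
```

Plotting `tot.withinss` against `k` gives the curve discussed in the captions; with data that has no strong cluster structure, like this toy example, it tends to decline smoothly rather than show a sharp elbow.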
Info
Channel: Julia Silge
Views: 3,402
Rating: 5 out of 5
Id: opHDQzhO5Fw
Length: 27min 47sec (1667 seconds)
Published: Wed Feb 24 2021