Dimensionality reduction of United Nations voting patterns

Captions
Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. Today in this screencast we're going to use this week's Tidy Tuesday data set on voting in the UN, and we're going to use tidymodels for dimensionality reduction. A lot of times when we talk about tidymodels we talk about how to use it for supervised machine learning, but we can use it for unsupervised machine learning too, and that's what we're going to focus on in this screencast. I'm looking at how we can use the patterns of voting in the UN to see which countries are similar to each other. Let's get started.

Okay, let's explore this data on voting in the United Nations. I'm going to read in two of the three data sets that are available. This data set from this week's Tidy Tuesday, the UN votes, has four columns: the roll call ID (so what vote it is), the country, the country code, and then how that country voted. And this issues data frame has, for each roll call ID, what issue it is about. These are the two that we're going to work with here.

In this tidymodels screencast we're going to show how you can use tidymodels approaches for unsupervised machine learning. Most of the screencasts that I have done have been focused on supervised machine learning, where you take some column, treat that as the outcome, and then predict it using the other variables. Instead, here we're going to treat this data as unlabeled data and then use an unsupervised machine learning model, in this case to do dimensionality reduction. This is going to have a lot in common with a screencast I did a while back on the cocktail data set from Tidy Tuesday last year, so if you want, you could compare and contrast, see what's the same and what is different, and get even more
familiar with this.

So in this UN votes data set I'm going to take the country, the roll call ID, and the vote, and then I'm going to do a little bit of munging here. I'm going to make vote a factor where the levels are no, abstain, and yes, and then I'm going to change vote to the underlying numeric value. Notice that now, instead of yes, no, yes, it says 3, 1, 3. So we've converted it to a factor and then gotten the underlying numeric value, because what I want is a whole bunch of roll call IDs as the features, and what the vote was as a numeric value. These roll call IDs are going to be the column names, and they're numbers right now, so I am going to make this a little bit better for a column name: I'm going to say paste0, "rcid_", rcid. So now it looks like this, and these can be column names.

Now I'm going to pivot wider. I'm going to say the names come from rcid and the values come from vote, and it's going to be really wide. There are 200 countries and so many votes. Look at all those votes, that's a lot of votes. Notice we don't have data for some votes. Oh, and that's a tough call. We don't have data for them, so it's not a yes, it's not a no; we don't have a record for how that country voted on that. If I was not doing a screencast right now, I might look more into why that data is missing, but I think what I'm going to do is fill it with a 2, like that. What is happening here is that I'm treating this like it's an abstention, so when I do this modeling, an NA is going to be treated like an abstention. I'm only sort of comfortable with that decision, but I think, on the fly, that's probably the best choice. It might be worth doing some more research to understand why some of these
votes are missing, because these are all here and then these are not, but it's not recorded as an abstention. I think we're just going to move on, so that's what I'm going to do for now. This is going to be my wide data frame. I could glimpse this if I wanted, and it will tell me, look at all that numeric data that I have created. All right, so it's ready.

I'm going to do two kinds of unsupervised machine learning here. The first one is our good friend principal component analysis, and the second one is UMAP. We need a kind of unsupervised machine learning method which is good for numeric data, and PCA is that way. We actually could just use recipes, because all we're going to use is functions from recipes. The functions for unsupervised machine learning, which you can use for feature engineering and data pre-processing to go into a supervised model, are in recipes, but we can also just use them on their own for unsupervised machine learning.

So what we do is we make a recipe. We don't have an outcome, so we just make a formula that looks like that, and then we'll say data equals unvotes_df. Oh gosh, that's not going to help, let's just print it out here. It has this country column, and that is not one of the... oh, let me update_role. Okay, so we're going to say country is not one of the predictors, not one of the variables that we're using in our dimensionality reduction. So we're going to say new_role equals, and I could put anything in here, I'm going to call it "id", like this. Then we're going to normalize, because PCA needs variables that are centered and scaled, so I will center all my predictors, and then I will do step_pca for all my predictors. With step_pca the default is that we're going to find five
components. We could change that and go up higher, to num_comp of six or eight. There's quite a number of variables here, so we could. I'm going to keep the default, but that's how you would change it if you wanted to go higher. Let's call this the PCA recipe, like this.

What I have done so far is I have just defined the recipe, like this. What it's telling me here is that I have not yet gotten it ready. I have just defined it; I have not done any execution yet. If I want to estimate something, I need to prep it. So now that's what's happening: it's going to all my data and it is prepping. If I look at it now, notice the differences here. There are no differences in the variables in this case; however, now it says it has actually looked at the data and knows how many observations there are. It knows there's no missing data, it computed the mean and the standard deviation to center and scale, and it found principal components. Notice it has not done that yet up here. So that's what prep does: it finds that from whatever training data you have put in there, which is this. Okay, so that part's done.

Now we can start looking at the output. We defined the recipe, we prepped it, and now we're going to bake it. Often we will bake it with new data, like testing data or training data, but in this case we're not doing a supervised approach; we're just dealing with this one data set. So if I say new_data equals NULL, like this, it will give it out to me for the same training data that I used to start with. We've got our five principal components here, and this is the result that we want for our 200 countries.

I can, for example, now plot this. I can put PC1 and PC2, and I think I'll put some labels on
there with the country. I am going to make some points. Let's make them a nice color, let's make them sort of transparent and big, and then we're going to put the text on there. 200 is kind of too many to label all the points. Sometimes I'll use geom_text_repel from the ggrepel package, I use that a lot, but with 200 that's really too many. So I'm going to say check_overlap equals TRUE, so it will just not print some when they overlap, and then if I do hjust equals "inward", like this, I think that will work. All right, so that worked.

Let's look at this a little bigger. This is a principal component analysis dimensionality reduction of the UN voting records in this data set for these countries. PC1 is the direction that accounts for the most variation in these countries, and we can see the United States is all the way over by itself. It is the most different from these guys over here: Egypt, Ethiopia, Nepal, Mexico, India, Cuba, very far over here. So these are very different from the United States. Then up and down is PC2, and that accounts for the second most variation, at least of these five components that we trained here. We can see here's an Australia, Canada, Netherlands, United Kingdom kind of cluster down here. I'd be interested to know what else is in that cluster, actually. Okay, so we've got all these interesting clusters of countries that are spread out in this projection of the five-dimensional space that we created. Pretty interesting if you ask me. I like it, it looks good.

Now let's do UMAP. We take this whole thing and we're just going to paste it. UMAP is in an add-on package. You know what, let's take a step back, because I've shown how to use step_pca before, but there are a lot of
other steps, just so you know. You can do ICA, you can do kernel PCA extraction, where you can say what kind of kernel you want and all these kinds of things. So we have many different options for this kind of dimensionality reduction, depending on what is appropriate to your use case.

Now let's go down here to embed and move on. We are going to change this to the UMAP recipe and umap_prep, and it's step_umap. The default for UMAP is two components. UMAP is based on geometry, like fancy manifold geometry, projections and whatnot. You could increase this if you wanted, but I don't think we really want to match it per se, because it's a very, very different algorithm than PCA, which becomes very obvious once we look at the visualization.

All right, so we just have two components here. We could of course make that more by upping the number of components. Notice how different it is. All we have here is two components, so we're seeing the whole space in two dimensions. That's unlike principal component analysis, where we had five dimensions that we were projecting down. This is just two dimensions, and we can see the United States is over here with these, and we have this far away cluster of things like El Salvador, Dominican Republic, Haiti down here, and Canada, Sweden, France, Greece. So we can see what the United States is similar to and what these other clusters are like. You can notice what a different thing UMAP is doing compared to PCA. It's just a very different algorithm. So those are these two kinds of dimensionality reduction that are very different approaches.

While I'm
here, you know what, let's go back up, because we can bake but we can also tidy. Let me say tidy pca_prep, and it is the second step. The PCA is the second step, so when I tidy it I get out the components. If I did the first one, I'm getting the means, and the standard deviations are down there too. But I want the principal components, and let's say I just want to look at the first four. So I'm going to say filter component in PC1 through PC4, like that. So now I've just got the first four components.

Then let me left join with these issues. Where am I? Here I am. Oh, it's called something else, it's called terms. Okay, so issues, mutate terms, and I also need to paste, because remember how I pasted "rcid_" and rcid, like that, so now they match up here. Notice we've got a lot of NAs. These are votes for which we do not have an issue. That's interesting. So now I have a choice. What I wanted to do is tidy this and then say what issues accounted for the most variation in countries, for example, what issues were most important to PC1, PC2, and whatnot. But I have a lot of NAs. I could add an issue that says "other", or I could filter them out. This is a real judgment call; I think I'm going to filter them out. The other option would be to replace all those NAs with "other" and then say how important "other" is compared to other things, but I think I'm just going to focus on the issues that we have records for.

Then let's group by component and let's do slice_max. Let's order by the absolute value of value, take the top n, and then ungroup. So what we have now, whoa, it's acting crazy, for PC1 here are the issues associated with
the top terms, which were, remember, roll call votes. And that's eight, so that's right. Okay, great. Let's call that pca_comps, and then let's make a visualization of that. Let's put value, terms, fill equals issue, and let's do a geom_col and a facet_wrap with component. They're all different votes on the y-axis, so let's do it like that.

All right. Okay, this is pretty interesting. One, two, three. Okay, so some of the votes have more than one issue, and we left joined, so now there are two rows. This is super squashed, so let's say position equals "dodge", so these are next to each other. Let's change the labels a little bit: y equals NULL, fill equals NULL. Let's try that. All right, this is getting somewhere. Some are positive and some are negative, which of course happens with principal component analysis. Let's just take the absolute value of value. Well, I think first of all let's notice that these are in the opposite direction to these. I don't get a whole lot of info from that, but okay. Then let's make sure to note on the x that this is "absolute value of contribution", like that. Zoom. All right. I could clean this up a little more and reorder this and all, but I think we pretty much get the idea here.

Principal component one focuses on economic development and human rights. There's one thing in there that's arms control, but it is mostly about economic development and human rights. So what accounts for the biggest variation in countries is where they fall on the spectrum of their votes on economic development and human rights. Principal component two is mostly about colonialism, with a little bit of human rights. So that says that how countries vote on the issues of colonialism and a little bit of human rights
accounts for the second most variation. And remember that when we did principal component analysis, these labels were not even on those votes, so it's very interesting that the way that the countries voted aligned so closely that we were able to see this. Principal component three is economic development and colonialism. So we see these different combinations of voting issues and how the different countries in the United Nations are voting together. And then principal component four is arms control and human rights together. So this is a pretty interesting way to combine the voting data with the issue data, and I'm glad to be able to see this result with the unsupervised machine learning approach that we were able to take.

All right, so we used two different kinds of dimensionality reduction. We used principal component analysis and we used UMAP to show how we can take all those variables about all those different roll call votes in the United Nations, do dimensionality reduction, and then see in these new spaces which countries are similar to each other. Remember that there are many other algorithms available to do this kind of dimensionality reduction in tidymodels, and we can do this both for exploratory data analysis purposes or as input to a supervised machine learning model. So there are lots of options out there to use these kinds of approaches. I hope this was helpful, and I will see you next time.
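The data preparation, PCA recipe, and tidy step described in the captions can be sketched in R roughly as follows. This is a minimal sketch rather than the code typed in the video: the data is a small made-up stand-in (30 fictional countries, 12 roll calls, invented issue labels), and the object names toy_votes, toy_issues, toy_wide, and pca_scores are assumptions chosen for illustration. The top-n value of 3 is also an assumption that suits the toy data; the video keeps eight terms per component.

```r
library(dplyr)
library(tidyr)
library(recipes)

set.seed(123)

# Made-up stand-in for the UN votes data: 30 fictional countries, 12 roll calls.
# Roughly 10% of rows are dropped to mimic missing voting records.
toy_votes <- expand_grid(
  country = sprintf("Country %02d", 1:30),
  rcid = 1:12
) %>%
  mutate(vote = sample(c("no", "abstain", "yes"), n(), replace = TRUE)) %>%
  slice_sample(prop = 0.9)

# Made-up issue labels; roll calls 10 to 12 deliberately have no issue.
toy_issues <- tibble(
  rcid = 1:9,
  issue = rep(c("Colonialism", "Human rights", "Arms control"), times = 3)
)

# One row per country, one numeric column per roll call
# (no = 1, abstain = 2, yes = 3). Missing records are filled with 2,
# i.e. treated as abstentions.
toy_wide <- toy_votes %>%
  mutate(
    vote = as.numeric(factor(vote, levels = c("no", "abstain", "yes"))),
    rcid = paste0("rcid_", rcid)
  ) %>%
  pivot_wider(names_from = rcid, values_from = vote, values_fill = 2)

# Define the recipe: no outcome, country is an id, then normalize and run PCA.
pca_rec <- recipe(~ ., data = toy_wide) %>%
  update_role(country, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5)

pca_prep <- prep(pca_rec)                      # estimate means, sds, loadings
pca_scores <- bake(pca_prep, new_data = NULL)  # scores for the training data

# Loadings from the second step (the PCA), joined to the issue labels.
pca_comps <- tidy(pca_prep, 2) %>%
  filter(component %in% paste0("PC", 1:4)) %>%
  left_join(
    toy_issues %>% mutate(terms = paste0("rcid_", rcid)),
    by = "terms"
  ) %>%
  filter(!is.na(issue)) %>%
  group_by(component) %>%
  slice_max(abs(value), n = 3) %>%
  ungroup()

print(pca_scores)
print(pca_comps)
```

With the scores in hand, the plot described in the captions would map PC1 and PC2 to the axes and label points by country, using geom_text with check_overlap = TRUE and hjust = "inward". Because the toy votes are random, the components here carry no real meaning; only the shape of the workflow matches.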
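The UMAP variant mentioned in the captions swaps step_pca for step_umap from the embed add-on package. The sketch below is an illustration under stated assumptions, not the code from the video: the wide data is randomly generated (40 fictional countries, 10 roll calls), the names toy_wide and umap_coords are made up, and neighbors = 10 is chosen only because the toy data is small. It also assumes the uwot package, which embed relies on for UMAP, is installed.

```r
library(dplyr)
library(recipes)
library(embed)  # add-on package that provides step_umap()

set.seed(234)

# Made-up wide data: 40 fictional countries, 10 roll calls coded 1, 2, or 3.
toy_wide <- as.data.frame(
  matrix(sample(c(1, 2, 3), 40 * 10, replace = TRUE), nrow = 40)
)
names(toy_wide) <- paste0("rcid_", 1:10)
toy_wide$country <- sprintf("Country %02d", 1:40)

# Same structure as a PCA recipe, with the final step replaced by UMAP.
umap_rec <- recipe(~ ., data = toy_wide) %>%
  update_role(country, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 2, neighbors = 10)

umap_prep <- prep(umap_rec)
umap_coords <- bake(umap_prep, new_data = NULL)

print(umap_coords)
```

The result has one row per country with the id column plus two embedding columns, which can be plotted the same way as the PCA scores. UMAP coordinates are not ordered by explained variation the way principal components are, so the two plots should not be read axis by axis against each other.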
Info
Channel: Julia Silge
Views: 2,995
Rating: 5 out of 5
Id: ZByO3D7faPs
Length: 25min 13sec (1513 seconds)
Published: Wed Mar 24 2021