Data Analysis 6: Principal Component Analysis (PCA) - Computerphile

Video Statistics and Information

Captions
Principal component analysis is perhaps the most widely used data reduction technique on the planet. Everyone uses it, but here's the thing: it doesn't actually do data reduction. Principal component analysis is the idea of trying to find a different view of our data in which we can separate it better, and I'll show an example on a piece of paper. The idea is that we want to try to reframe our data, maybe move it around, so that we can better separate things out, better cluster things; perhaps it's better for machine learning. Now, as a side effect of this, in PCA we also order our axes from most to least useful in some sense, so then we can perform a separate data reduction step later by taking the slightly less useful axes away, or in this case dimensions or attributes of our data. PCA is commonly pitched as a data reduction technique; actually, it's a data transformation technique. It just makes our data more amenable to reduction later.

So let's imagine we have some attributes, and we know that some are correlated and some are not correlated. The problem is that maybe we don't want to just delete some of the attributes. Maybe a 0.65 correlation is, I mean, it's good, but it doesn't mean we definitely want to delete attribute two and keep only attribute one. On the other hand, maybe we do need to reduce the number of dimensions we've got, or maybe we just want to try to make our data more amenable to things like clustering. So let's look at a quick example.

Typically PCA is done over many dimensions, when we've got lots and lots of attributes. I'm just going to show two, because obviously it starts to break down when I try to draw that many on the page. So we have two attributes, and what we want to try to do is work out what the contribution of each of these is to our data set: which of these is useful, which of these is not useful. And obviously if we had many dimensions, you know, seven hundred, ten thousand, we can still apply the same technique. So maybe we have some data that's like this: we have some data points over here, and perhaps we have a little gap and some data over here, and in general our data is kind of increasing like this. This means that attribute one and attribute two are positively correlated to some extent, but maybe the correlation is not so strong that we just want to delete attribute two. What we want to try to do is transform our data into a view where these axes are more useful. Imagine that you've got some data that looks a bit like this, but if we rotate our data, we take a different view, we can see there are actually two objects, and then we can separate them out; and maybe if you rotate again you could see there were four objects, and so on. This is the idea: what PCA is going to do is find new axes for this data that separate it better.

For PCA to work, what we would start by doing is standardizing our data, so all of our dimensions, attribute one, attribute two, all of the attributes, are going to be centered around zero and have a standard deviation of one. PCA will really not work at all
if you have widely different scales for your data.

So what we want to try to do is find a direction, or an axis, through these two attributes that separates out our data better than the individual attributes do. Let's see how this data looks from attribute one alone. If we trace down this way, you can see that it's got this amount of spread in attribute one, and the points are dotted around like this, and they just go all the way along, so you can't really see anything meaningful here about these two groups. And of course, the more dimensions you have, the more of a problem this could be. Similarly for attribute two: if we trace along here, it goes from this range to this range. This is the variance of attribute two, like the range, and we can see that, roughly speaking, the data is as spread out in attribute one as it is in attribute two. That spread is about the same, and both of them are kind of useful for looking at the data, but not really, because again we have an equal distribution of points all the way along here, so that's not hugely useful.

All right, so if we look at just attribute one, that's not hugely helpful, and if we look at just attribute two, that's not hugely helpful either. So what can we do? Well, what we want to try to do is find a new axis, some new attribute, that fits through this data like this and can really separate everything out, because the spread of this data is actually diagonal in some sense, not this way or this way. So what principal component analysis is going to do is find this principal component, this axis through our data, such that when we look at the spread of the data, it's maximized, so the data is as spread out as we can find it. And this is going to happen over any number of attributes: attribute one here, attribute two, attribute three, attribute four, all the way to attribute n, when we've got maybe 700 or 800 or 1000. At the moment we're just fitting one principal component, one line through our two-dimensional data; there are going to be more principal components later. But what we want to do is pick the direction through this data, however many attributes it has, that has the most spread.

So how do we measure this? There are really two goals, which turn out to be exactly the same. One is to maximize the variance, so we find a direction for this line such that these points at the very edge are farthest apart. The other is that we minimize the error, so we take this error from here, this distance from all these points to our new axis, and we minimize it. You can imagine that if we do this for all our points, we can get the sum of the squared distances from these points to this line, and then as we move this line around, sometimes it's going to be better, sometimes it's not. If we have a line that goes like this, some of these distances are going to be very large, and that's going to be a higher amount of error. So what we'll find is that our first principal component will sit through whichever direction in the data minimizes these distances and, by definition, maximizes the spread, which makes this axis super useful. If we use this axis now as our new x and we rotate the whole page, all our data is lovely and separated, and actually we have two distinct clusters in this data set.
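As a rough illustration of that maximize-the-spread idea, here is a minimal R sketch on made-up two-dimensional data (my own illustration, not code from the video): both standardized attributes have a variance of one, but the projection onto the first principal component has a larger spread than either attribute on its own.

```r
# Minimal sketch on synthetic data (not the video's data set):
# the first principal component is the direction of maximum spread.
set.seed(1)
n  <- 200
a1 <- rnorm(n)                        # attribute one
a2 <- 0.8 * a1 + rnorm(n, sd = 0.5)   # attribute two, positively correlated with one
X  <- scale(cbind(a1, a2))            # standardize: mean 0, sd 1 per column

pca   <- prcomp(X)                    # fit the principal components
proj1 <- X %*% pca$rotation[, 1]      # project every point onto PC1

apply(X, 2, var)                      # each attribute has variance 1
var(proj1)                            # the spread along PC1 is larger than either
```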
So that's what we're going to do. Now, as I mentioned, PCA doesn't typically reduce the number of attributes from two to one just like that. We're going to have another principal component, which represents the second largest amount of variance, orthogonally, so at ninety degrees; that's going to be this one here. We find the first principal component, which maximizes variance, and then we find the next one along, which maximizes variance in the next direction. Now, if there were more dimensions, we'd keep applying this process: we keep finding new axes for our data that systematically show more and more of the spread of our data, but crucially we're ordering them by the amount of variance that they represent. So this is PC1, or principal component one, and this is principal component two, and principal component one is always going to have the most varied data in it, principal component two the next most, three the next most, all the way to the end with the least.

So a natural side effect of this process is that we're going to have new axes through our data, and we're going to have the same number of axes as there are original dimensions in our data, but they're going to get less and less useful in terms of the variance of our data as we go forward. PC1 is going to be the most important: most of our data is spread out across PC1, PC2 is a little bit less spread out, PC3 a little bit less still, all the way down to PCn at the bottom. If you wanted to perform dimensionality reduction, because you felt you had too many dimensions in your data, you could, for example, just keep the first 10 principal components, project your data into that space, and still retain most of the information.

We won't go into the mathematics of how to calculate these principal components, because you can find that out very easily online, and R has a lovely function to do it for us; I wanted to focus on what PCA does intuitively. But how we actually project these points onto these new axes and rotate the whole thing is this: each of these principal components is going to be a weighted sum of all the attributes. So, for example, PC1 is going to be some amount of attribute one added to some amount of attribute two. In this case, because it goes off at roughly 45 degrees, it's going to be about the same of each, but you could imagine that if your data was like this, it would be mostly attribute one and a little bit of attribute two, and if it was like this, it would be mostly attribute two and a little bit of attribute one. Now, of course, for n-dimensional data, where we have many more dimensions than I can draw on the page, the principle is exactly the same: some amount of attribute one, attribute two, attribute three, and so on, all the way to the end. And that's going to project our points straight onto this line through the data.

So when we talk about minimizing the error, you can imagine rotating this line about the center of these points, and as you do this these red lines are going to change in length, and it's going to settle on the very center line where these lengths are minimized. As it happens, that also maximizes the variance of these points, because this mathematics is based around eigenvectors and eigenvalues. PC2 is always going to come out orthogonal, or in this case at 90 degrees, to PC1, and this is true however many dimensions you've got.
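To make that weighted-sum idea concrete, here is a small continuation of the sketch above (again my own illustration, not the video's code): each principal component score is just the standardized attributes multiplied by the loadings and added up, and the loading vectors come out orthogonal.

```r
# Continuing the toy example: PC scores are weighted sums of the
# standardized attributes, and the loading vectors are orthogonal.
w1 <- pca$rotation[, 1]   # weights (loadings) defining PC1
w2 <- pca$rotation[, 2]   # weights defining PC2

sum(w1 * X[1, ])          # PC1 score of the first point, built by hand...
pca$x[1, 1]               # ...matches prcomp's score (up to floating point)

sum(w1 * w2)              # dot product of the two loading vectors: numerically zero
```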
Every single new axis that appears, every new vector, every new principal component, is going to come out orthogonal to the ones before, until you run out of dimensions and you can't do it any more. We've already reached the most we can fit on this two-dimensional plane: we've got one here and we've got another one orthogonal to it, and there are no other lines I can draw for which that would be true. But obviously if we had more attributes, there would be.

The reason it's so important to scale your data appropriately is that you're trying to find the direction through your data that maximizes the variance. Now, if one of your dimensions is much, much bigger than the others, of course that one is the one that's going to maximize the variance. If you've got salary, which is between nought and 10,000, and all your other attributes are between nought and 1, then your first principal component is going to be predominantly salary, because that's the most important thing as far as it knows. This is why it's so important to standardize your data first.

We're going to continue to use our music data set for this video. For those of you who have forgotten, this data set is a set of music files that are freely available online, where we've got the metadata, the genres and titles, for different tracks, and for those tracks we've also calculated some features about the actual audio: for example, temporal features, how loud they are, how fast the music is, how upbeat it is, whether you could dance to it, this kind of thing. Apparently danceability is a measurable trait. These features have been generated by two different libraries: one is called librosa, which is freely available online, and the other is Echo Nest, whose features are at the core of Spotify and how it does its music recommender system and its playlists.

So let's load the data set. I'm going to read it in; it takes quite a long time to load. It would probably be faster if it wasn't in a CSV: you've got to remember that if your files are in CSV, you've got to actually parse them all and work out whether they're numerical or text, you know, for every cell. Okay, so we've got 13,000 instances, or rows, in our data, and we've got 751 attributes, or dimensions. These include features from both librosa and Echo Nest, and the other metadata for these tracks. We're going to select just the Echo Nest features for this part, just because it's a little bit easier to have fewer dimensions to look at; this would work just as well on all the other features, as long as they're numeric. So we're going to set echonest equal to the music data frame, all of the rows and just the Echo Nest columns, which are 528 to the end, and then we're going to standardize all this data: we're going to center it around 0, a mean of 0 and a standard deviation of 1, using the scale function. That will take a minute to finish, and then we can check to make sure that our variance and our mean are what we expected. So we're going to apply, over dimension two, so that's over all the columns, the variance function, and find out what the variances are, and you can see they're all one, which is exactly what we want. Let's have a look at the mean: the mean should be centered about 0. It won't be exactly 0, just because of, you know, floating point errors and so on. So there we go: about 1.5 × 10⁻¹⁷, very, very small, close enough to 0, perfectly fine.
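A rough R sketch of those loading and standardization steps is shown below. The column range 528-to-end comes from the description above, but the file name and variable names are assumptions for illustration, not the video's actual code.

```r
# Sketch of the loading and standardization steps; the file name and
# exact column range for the Echo Nest features are assumptions.
music <- read.csv("music_data.csv")       # slow for a big CSV, as noted above

echonest <- music[, 528:ncol(music)]      # keep only the Echo Nest feature columns
echonest <- scale(echonest)               # center to mean 0, scale to sd 1

apply(echonest, 2, var)[1:5]              # variances should all be 1
apply(echonest, 2, mean)[1:5]             # means should be ~0 (up to floating point error)
```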
So the function we're going to use is the prcomp function in R; this is going to perform the principal component analysis. For those of you who are interested in learning more, what it's going to do is create a covariance matrix and then use singular value decomposition to find the eigenvectors and the eigenvalues, and those are the things we actually want from a PCA. So we're going to run that now. It doesn't take too long, but this is still a reasonably sized data set; it will slow down quite a lot if you have a very, very large data set, although it still might be worthwhile. What it's done is found the directions through our data that maximize the variance, and it's projected, or transformed, our data into that space. At the moment the dimensionality of our data is exactly the same, completely unchanged; no dimensionality reduction has happened yet.

So let's perform a quick summary. There'll be a lot of stuff on the screen, but I'll point towards what's important. What it's showing us is the list of all the components: their standard deviations, so the spread in that direction, and also how much of the variance each one accounts for. You can imagine that the spread of your data over all the different dimensions is this much, but in one direction it's just this much: what is the percentage of the spread, or the variance, that that principal component accounts for? This is very easily quantified, and you can see in here we've got the proportion of variance for PC1, which is 0.1169, about eleven point seven percent. So out of all the 224 Echo Nest features, this weighted sum in principal component one, this direction through our data, represents eleven percent of the spread, which is not too bad; actually, I think that's pretty good. Then principal component two is another eight percent, so the cumulative proportion of these two principal components is going to be twenty percent, and at principal component three twenty-five percent, and so on. So what we're saying is that, by the time we get to principal component three, if we represent our data in this three-dimensional space around the axes PC1, PC2 and PC3, we're keeping 25 percent of the spread of the data that we had before, but that's three dimensions instead of two hundred and twenty-four, so that's not too bad.

Now, one important thing to look out for is where our cumulative spread starts to get towards a hundred percent: where is it in our data set that we can say, you know what, these later dimensions, these later principal components, are not really adding anything? So we scroll down and we find here 95 percent, scroll down a bit further, 98 percent, 98 percent, and here we go: at principal component 133, the cumulative proportion of variance explained by all of the components from 1 all the way to 133 is 99 percent. If you're going to perform dimensionality reduction, stopping at 99 percent of the variance is very common. What we're saying is that we can delete our data from principal component 134 all the way to the end and still keep 99 percent of the spread, or information, from our data set. If you want to use PCA for data reduction, then what you're going to have to do is decide what your cutoff is going to be, and 99 percent is a good number to use.
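The sketch below, continuing the hypothetical setup above, shows how the prcomp step and the 99 percent cutoff described here might look; the specific component counts and percentages quoted come from the video's own output, not from running this code.

```r
# Sketch of the PCA step and the 99% cut-off; 'echonest' is the
# standardized matrix from the previous sketch.
pca <- prcomp(echonest)                    # data is already centered and scaled

imp <- summary(pca)$importance             # standard deviation, proportion, cumulative proportion
imp[, 1:3]                                 # e.g. PC1 explains roughly 11.7% of the variance

# Smallest number of components whose cumulative proportion reaches 99%
n_keep <- which(imp["Cumulative Proportion", ] >= 0.99)[1]
n_keep                                     # 133 in the video's run

reduced <- pca$x[, 1:n_keep]               # projected data with the trailing components dropped
```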
So what does that actually mean? What it means is that if we plotted the different principal components going this way, against the amount of variance that they explain, the amount of the spread of the data that they're responsible for, it's going to decrease like this; this would be a bar chart, actually, like this. Principal component one is always going to have the most variance explained, because that's how the mathematics works: they're ordered that way. Principal component two is less, three is less, four is less, and so on. We're going to keep going down until 99% of the variance has been explained, and then we can remove everything else. That's what we're going to do. So 99% is one option, 95%, something like that. Removing any number of principal components is going to delete some of your data, equivalent to removing dimensions, but because they're ordered in this way, from most useful to least useful, it just makes that job a little bit easier: instead of saying it was tempo or feature five that we didn't want, we're saying it isn't principal component one or principal component two, it's principal component one hundred and thirty-four onwards that isn't that useful to us.

Let's have a look at one of our principal components and see what it is. So we're going to type the PCA result followed by rotation, and we're going to select just the first column, because otherwise it's going to be too much information. This is how much of each of our 224 dimensions PC1 needs to create this weighted sum and project our data, so you can see, for example, it's minus 0.01 of one of the temporal features, minus 0.02 of the next one. One thing to remember is that these are now arbitrary axes through some massively dimensional space, so it's very difficult to know exactly what this means. You can start to look, based on these weights, at which of these features is more useful, but that's kind of a second step. So, for example, if temporal feature nought has a weight of about 0.028, we're going to take 0.028 times temporal feature nought, whatever that value is, plus this much of the next one, plus this much of the next one; we add them all up, and that is the projection of our data point into this new space. We can do this for our entire data set; as it happens, R calculates this for us, but you could calculate it using a matrix multiplication if you wanted. So these are all our points transformed into this new space, and hopefully we can see them better.

Then we're going to start plotting different genres of music in these principal components to see if the separation is any better than it was before. So let's have a quick look. This is a scatter plot of principal component 1 versus principal component 2 for every single song in our data set, and you can see it's a bit of a higgledy-piggledy mess, and it would be, because there are some 13,000 songs here. But you can see that maybe some of these songs group over here and some of these over here. Let's just look at a few genres to narrow it down and make our figure a little bit clearer. I'm going to select just the rock, electronic and classical genres; I don't know, they seem like they'd be slightly different. So let's run that: we're going to take just those genres, plot them in the same scatter plot, and see where they are in this space.
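Below is a rough sketch, in the same hypothetical R setup as above, of inspecting the loadings and plotting a few genres in the space of the first two principal components. The genre column name and the genre labels are assumptions for illustration.

```r
# Sketch of inspecting the loadings and plotting genres in PC space;
# the 'genre' column name and the genre labels are assumptions.
pca$rotation[, 1]                          # weights PC1 gives to each of the 224 features

scores <- as.data.frame(pca$x)             # every track projected into the new space
scores$genre <- music$genre                # hypothetical metadata column

keep <- scores$genre %in% c("Rock", "Electronic", "Classical")
plot(scores$PC1[keep], scores$PC2[keep],
     col  = factor(scores$genre[keep]),    # one color per genre
     xlab = "PC1", ylab = "PC2",
     main = "Tracks in the first two principal components")
```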
So how did this data get into this form? For every individual track, we had a number of features in our 224-dimensional space, and each of these principal components is a weighted sum. So, for example, for, let's say, track 516, we'd have taken temporal feature 1, multiplied it by its part of principal component 1, the loadings, added that to the next bit, and the next bit, and the next bit, and worked out where it sits in terms of principal component 1, this new axis. We'd have done the same for principal component 2, and that puts it down here.

Now, there's quite a lot of overlap, but you can start to see we're teasing apart the electronic music from the rock music: the rock music is sitting over here on the right, the electronic music is sitting on the lower left, and the classical music is up the top here. Now, these axes don't mean, you know, that the music is faster or slower, or more or less upbeat, because without looking into the weightings, the loadings, for these principal components, it's impossible to say for sure. But what we can say is that they're starting to come apart, and there are some differences in our data set. The fact that they still overlap means that two dimensions is probably not enough to satisfactorily separate out all these things. If you wanted to pass this projected and transformed data into a machine learning algorithm, you'd probably need to pass in more than two dimensions, and in this case, given that 99% of the variance is explained by principal component 133, those 133 dimensions are probably what you'd use. You can actually use the entire output of PCA, the same number of dimensions you had before, to just show a better, rotated version of your data to a machine learning algorithm; you don't have to remove any dimensions if you don't want to. But because the dimensions are ordered from most variance to least, you can get a good gauge of where you should cut off and remove data that way.

This kind of data reduction, along with the ones we looked at before, is going to form part of this data cleaning, data transformation and data reduction approach that we're going to iterate through until our data is as small as we can get it and we can extract as much knowledge as possible in the easiest way. Once we're done with this, our data will be ready for clustering, for machine learning, for classification, for regression, for anything else that we want to do.

Today we're going to talk about clustering. Do you ever find, when you're on YouTube, you'll watch a video on something and then suddenly you're being recommended a load of other videos that you hadn't even heard of that are actually kind of similar? This happens to me. I watch some video
Info
Channel: Computerphile
Views: 90,206
Rating: 4.9658604 out of 5
Keywords: computers, computerphile, computer, science, University of Nottingham, Computer Science, Data Analysis, Data, Dr Mike Pound, Dr Mercedes Torres Torres, PCA
Id: TJdH6rPA-TI
Length: 20min 8sec (1208 seconds)
Published: Tue Jul 09 2019