Robust Principal Component Analysis (RPCA)

Video Statistics and Information

Captions
Hi everyone, welcome back. I'm really excited in this lecture to tell you about one of my favorite modern algorithms, robust principal component analysis (RPCA). Regular principal component analysis is one of the absolute most important algorithms of the last hundred years. You see it everywhere; it is right up there with the fast Fourier transform, a cornerstone of statistics and of modeling big data. If you're interested in machine learning and big data, you have definitely seen principal component analysis before, and you have probably used it. What we're going to talk about today is how we can use sparsity to make principal component analysis more robust to outliers, to corruption, and to bad measurements. This is based on a 2011 paper by Candès, Li, Ma, and Wright that beautifully lays out the theory and shows how to use it.

As a hook to get you through the math, here is the basic problem. Suppose we have a picture of a person. We know it is a person, but it has a big funny mustache, or maybe the person is wearing a hoodie or otherwise trying to disguise themselves and hide from a security camera. With modern robust statistics, machine learning, and optimization, can you figure out what is under that mask and subtract off the mustache or the hood? Can you decompose an image into the true underlying measurement plus the mustache, fake glasses, and hood? Surprisingly, and slightly alarmingly, in many cases the answer is yes, you absolutely can. Based on recent advances in sparsity and robust statistics, you can no longer hide from the cops behind a cartoon fake nose, mustache, and glasses; you can actually start to figure out what is under disguises. Very cool, and a little bit alarming. There is no magic happening here: this only works because I know I am looking for pictures of a person. I'll walk you through why it works, how it works, and some of the underlying math in this lecture. This follows Chapter 3 of the book by Nathan Kutz and me, Data-Driven Science and Engineering; you can download the PDF, and all of the code in MATLAB and Python is on databookuw.com and on GitHub, so you can walk through these examples yourself. I think you should be using this in your daily life and in your research: anytime you would use principal components, you should be using robust principal components.

As a little motivation, we have seen this robust regression idea before. Suppose I want to fit a line to some blue data points with a least-squares regression, but one measurement is a really bad outlier, way outside the distribution; maybe someone messed up the instrument or kicked the thermometer, while everything else is fine. There is a true slope, a true regression fit, for the bulk of the data, but a least-squares (l2-minimizing) regression is strongly skewed by that outlier. It tilts the entire regression line toward that point, because the square of that one large error dominates the squares of all the other small errors. Standard least-squares regression is therefore very fragile and sensitive to outliers, and we will see that exactly the same is true of principal components. We have also seen that if you use an l1 regression instead, penalizing the sum of the absolute values of the errors rather than the sum of their squares, you get a much more robust and faithful estimate of the original distribution, even with the outlier present. The l1 error norm has this outlier-rejection property; a small sketch of the comparison is shown below.
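As a rough illustration (this is not the book's code, and the data here are synthetic), the following Python sketch fits a line to data containing one gross outlier, once with an l2 objective and once with an l1 objective via cvxpy:

```python
# Sketch: compare an l2 (least-squares) fit with an l1 (least-absolute-deviations)
# fit on noisy linear data containing a single gross outlier.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + 0.5 * rng.standard_normal(50)
y[45] = -40.0                                  # one corrupted measurement, far outside the distribution

A = np.column_stack([x, np.ones_like(x)])      # design matrix for slope and intercept

# l2 fit: minimize the sum of squared residuals (ordinary least squares)
theta_l2 = np.linalg.lstsq(A, y, rcond=None)[0]

# l1 fit: minimize the sum of absolute residuals (robust to the outlier)
theta = cp.Variable(2)
cp.Problem(cp.Minimize(cp.norm1(A @ theta - y))).solve()

print("l2 slope/intercept:", theta_l2)         # skewed toward the outlier
print("l1 slope/intercept:", theta.value)      # close to the true slope 2 and intercept 1
```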
That is exactly what we are going to do with robust principal component analysis. Again, you have some cloud of data with an outlier, and there is some true distribution that you assume is well modeled by a tilted, stretched-out Gaussian, which is the standard PCA assumption. But when you do vanilla, off-the-shelf PCA, the whole fitted distribution gets tilted down toward that outlier; that one point tilts, stretches, and distorts the distribution. I cannot stress this enough: regular principal component analysis handles small amounts of white noise, but outliers and corruption that are well outside your distribution, really bad measurements, will completely skew your PCA. So PCA is also very fragile and sensitive to outliers. What we will see is that robust principal component analysis, just like robust l1 regression, compensates for the fact that there are outliers: it uses an l1-penalized formulation, it fixes the principal component fit, you get much closer to the true distribution, and you get far more robustness to corrupt and outlying measurements. That is the cartoon; now I'll walk you through the math.

Going back to the original RPCA paper by Candès, Li, Ma, and Wright, we are still talking about the same basic idea. Say I have a high-dimensional measurement. I always think of pictures as really high-dimensional measurements: a megapixel image is a million-entry vector in my world. We have this high-dimensional measurement, and we know that some of the pixels are corrupted. This is not an example from their paper (they have more serious examples); it is a silly example from our book where I add a goofy mustache, so that something like 20% of the pixels are completely wrong. I have obliterated them and turned them into black pixels. That is what I mean by an outlier or a corruption: it is not my face with a little white noise on it; I have erased that information, and it lies outside the distribution of what a face looks like. The question is: how can we split the data X into a component that is true and correct, plus a sparse matrix containing all of the outliers?
If you look at that sparse matrix, you can see a faint ghost of the face, but it is small in magnitude; the matrix is almost entirely zeros, and the mustache is a sparse collection of all of the corrupt, outlying measurements from the original data. So how do I split my data like this, and why can I split my data like this? That is what I'm going to talk about, and the basic idea is that you could not do it without a large library, a large body of statistical knowledge about what a human face should look like. You could not do this if you did not have a training data set; in machine-learning terms, you need a large training set to learn the statistical features of a human face, and then you can do robust principal components.

The example here uses the Yale Faces B data set: 36 individuals, and for each person 64 images of that face under different lighting conditions. It is actually a really cool experiment: they built a geodesic dome with all of these lights and a camera, and in about two seconds they would flash the lights and take 64 images to capture all of the different lighting conditions. So it is a massive library of what human faces look like under different illumination. The idea is that if you have that data set, then even if one of your images is badly corrupted, even if I black out a half or a quarter of a face or put a big curly mustache on one of them, the statistics of the data set let you separate the data into a low-rank component that is well described by principal components and a sparse component that contains all of the outliers and corrupt measurements. There is a lot more depth here, so I encourage you to read their paper and Chapter 3 of our book, or check out our code; I'm just giving you the highlights.

Concretely, you take all of these faces, 64 times 36 of them, and reshape each one into a really tall, skinny column vector, so you get a massive matrix of size number-of-pixels by number-of-images. Robust principal component analysis then splits that big data matrix into a low-rank matrix, one that is well described by a few principal components (if you take the SVD of the faces you get the so-called eigenfaces, and the low-rank matrix is built from them), plus a sparse matrix containing everything that is not in the distribution of what a face should look like, such as that weird curly mustache I added. A sketch of the data-matrix and eigenface setup follows.
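As a rough sketch of that setup (the image-loading step is assumed; `images` would be a list of same-sized grayscale arrays, e.g. from the Yale B set), the eigenfaces are just the left singular vectors of the mean-subtracted data matrix:

```python
# Sketch: stack face images as columns of a data matrix and compute "eigenfaces"
# with the SVD.  The loading step is schematic -- `images` is assumed to be a
# list of 2D grayscale arrays, all the same size.
import numpy as np

def eigenfaces(images, r=20):
    """Return the mean face and the first r eigenfaces (left singular vectors)."""
    X = np.column_stack([im.reshape(-1) for im in images])   # pixels x images
    mean_face = X.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
    return mean_face, U[:, :r]    # each column of U reshapes back to an image
```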
The decomposition problem itself is very simple to write down: we want to decompose X into the sum of two matrices, X = L + S, where L is low rank, meaning it is well described by a few patterns, by a few principal components, and S is a sparse matrix containing the corruption, the outliers, and the bad measurements. But you will notice this is not a well-posed problem at all: there are infinitely many ways to split a matrix X into a sum of two other matrices. It is as if I had n knowns and 2n unknowns, so the problem is completely underdetermined and has infinitely many solutions.

But we are getting really good at that. It has become a common theme in optimization, statistics, sparsity, and machine learning in general that you can solve these ill-posed inverse problems if you regularize, if you add a penalty term that promotes the single solution with the physics or behavior you want. In this case, the properties we want are that L should be low rank, well described by principal components, and that S should be sparse, because we do not believe that much of the image is corrupt. So the ideal problem is

    minimize rank(L) + ||S||_0 subject to L + S = X.

It is always good to write down what you want to solve, but you cannot solve this in any computationally efficient way: it is a highly non-convex optimization problem, because both the rank and the zero-norm are non-convex, and there is no guarantee you would find the solution before the universe goes cold. What the authors of this paper did was introduce a convex relaxation, which is what we normally do: we replace these hard-to-work-with norms with proxies that are easier to handle. The proxy for rank is the nuclear norm, the sum of the singular values, and the proxy for sparsity is the one-norm, the sum of the absolute values of the entries. So instead we solve

    minimize ||L||_* + lambda * ||S||_1 subject to L + S = X,

which is a convex optimization problem that can be solved in a scalable, relatively efficient manner, and with high probability the solution has the low-rank L and sparse S we wanted. That is how you actually solve the problem, by running this minimization. There are all kinds of good details in the paper and in our Chapter 3 about how to compute this and how to write your own code, using alternating-direction methods or augmented Lagrange multiplier methods, to find a solution to this ill-posed inverse problem in which L contains all of the patterns in the data and S is a sparse matrix of all of the outliers. A compact sketch of that iteration is given below.
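For concreteness, here is a compact Python sketch of that alternating-directions / augmented-Lagrangian style iteration: singular-value thresholding on L, elementwise soft thresholding on S, then a dual update. The step-size and tolerance choices are common heuristics rather than tuned values, and this should be read as a sketch, not as the paper's reference implementation:

```python
# Sketch of principal component pursuit: minimize ||L||_* + lam*||S||_1 s.t. L + S = X.
import numpy as np

def shrink(X, tau):
    """Elementwise soft threshold (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(S, tau)) @ Vt

def rpca(X, max_iter=500, tol=1e-7):
    n1, n2 = X.shape
    mu = n1 * n2 / (4.0 * np.sum(np.abs(X)))   # common heuristic for the penalty parameter
    lam = 1.0 / np.sqrt(max(n1, n2))           # weight on the sparse term
    L = np.zeros_like(X, dtype=float)
    S = np.zeros_like(X, dtype=float)
    Y = np.zeros_like(X, dtype=float)          # Lagrange multiplier
    for _ in range(max_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)      # update the low-rank part
        S = shrink(X - L + Y / mu, lam / mu)   # update the sparse part
        Y = Y + mu * (X - L - S)               # dual update
        if np.linalg.norm(X - L - S) <= tol * np.linalg.norm(X):
            break
    return L, S
```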
So how well does this work? In the original paper they applied it to this faces data set, and they were able to remove shadows from people's faces. It is a little hard to see, but some of these images have pretty large shadows, and from the limited information in the picture, the low-rank part, they could subtract off the shadow and show what the person's face would look like under full illumination, while all of the shadows go into the sparse matrix S. That is a really cool demonstration. You can also use it for things like surveillance: if you have a license plate with lines through it, so that some of it is occluded, you can possibly back those out, or figure out what is under someone's mask. All kinds of interesting, and maybe questionable, technology can come from this. But again, it all rests on solving a regularized optimization problem, and on the fact that there is a convex relaxation, so that it scales to really big data problems.

In the last part of this lecture I want to point out that you can also apply this to fluid flows. Fluid flows, and physical, engineering, and scientific systems in general, are very high dimensional, but they often have low-dimensional patterns. In fluid dynamics we call principal component analysis proper orthogonal decomposition (POD); it is the same basic idea, where you can decompose a very complicated system like a fluid flow into a sum of only a few basis flows, or eigenflows, so that maybe you only need ten principal components to describe the flow. What Isabel Scherl has shown (Isabel is a graduate student working with Brian Polagye and me) is that even if your fluid flow data is massively corrupted, not with white noise but with salt-and-pepper corruption, you can use robust principal component analysis to decompose it into a low-rank fluid, a cleaned version of the flow, plus a sparse matrix of all of the outliers and corruption. It is important to point out that regular principal components works quite well for white noise: if you have a little bit of white noise everywhere in your data, regular PCA does a fine job. Robust PCA is for when you have big corruption and outliers. It is a little hard to tell here because the images are high resolution, but these little salt-and-pepper errors are not additive white noise; we set those values manually to something like minus ten or plus ten, so they are really bad outliers, and the original data has been obliterated at the corrupted pixels. After this robust filtering you can compute the principal components, the POD modes. Comparing the true modes, POD on the noisy data, and POD after robust principal component analysis, Isabel showed that you get a much better decomposition of the flow when you run this robustifying technique first. You can also use it to build models, like dynamic mode decomposition (DMD) models: for these periodic flows, all of the eigenvalues of the system should lie on the unit circle in discrete time, but before the robust filtering step you get terrible eigenvalues, whereas after RPCA almost all of the eigenvalues are where they should be, on the unit circle. Really great work by Isabel; you can read about it in her arXiv paper, and she will have a video walking through many more examples soon; I'll post a link when it is up. A rough sketch of this cleaning-before-POD workflow is shown below.
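The following is a rough sketch of that workflow, not the paper's code: it builds a synthetic low-rank "snapshot" matrix standing in for flow data, corrupts a fraction of its entries with large salt-and-pepper outliers, cleans it with the `rpca()` routine sketched earlier, and compares the leading POD modes via principal angles.

```python
# Sketch: salt-and-pepper corruption, RPCA cleaning, then POD mode comparison.
# Reuses rpca() from the sketch above; the "flow" matrix here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_pix, n_snap, r = 500, 100, 5
X = rng.standard_normal((n_pix, r)) @ rng.standard_normal((r, n_snap))  # rank-5 stand-in for snapshots

Xc = X.copy()
mask = rng.random(X.shape) < 0.10                          # corrupt 10% of the entries
Xc[mask] = 10.0 * rng.choice([-1.0, 1.0], size=mask.sum()) # large +/-10 outliers

L, S = rpca(Xc)                                            # low-rank flow + sparse outliers

U_true  = np.linalg.svd(X,  full_matrices=False)[0][:, :r]   # true POD modes
U_noisy = np.linalg.svd(Xc, full_matrices=False)[0][:, :r]   # modes from corrupt data
U_rpca  = np.linalg.svd(L,  full_matrices=False)[0][:, :r]   # modes after RPCA cleaning

# Cosines of principal angles with the true mode subspace (closer to 1 is better):
print("noisy:", np.linalg.svd(U_true.T @ U_noisy, compute_uv=False))
print("rpca :", np.linalg.svd(U_true.T @ U_rpca,  compute_uv=False))
```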
The last thing I want to point out is something I actually talked about in Chapter 1, on the singular value decomposition, when I discussed the Netflix Prize and correlations in general, and I thought I would revisit that example in the context of robust principal component analysis. We had the Netflix Prize and matrix completion, and the basic idea, as a thought experiment, is this massive matrix (imaginary in the hypothetical sense, not the square-root-of-minus-one sense) in which, for every movie in the database and every person on the subscriber list, there is an entry recording that person's preference. Here I have simplified it so that an entry is plus one if you like the movie and minus one if you do not. People have different preferences, and of course we believe there are principal components of movie preferences, and of movies themselves: The Terminator and Conan are probably more similar to each other than The Terminator is to Fantastic Mr. Fox, so they live in a principal component together, and so do different groups of people. The key point is that the vast majority of this Netflix matrix is empty: people have not rated every movie they have watched, and they have not watched all the movies, or else Netflix would not be in business, so most of the entries are unknown. What Netflix and other recommender systems would love to do is infer the missing entries: what would Alice probably think of Pulp Fiction, or what would Bob probably think of Fantastic Mr. Fox? They want to fill in all of those missing values, and that is very closely related to the RPCA problem.

As a thought experiment, look at two people who have a lot of overlap in their data: they both like Pulp Fiction, they both like Fantastic Mr. Fox, one of them dislikes Conan and the other dislikes Dazed and Confused. Because they have similar preferences in some places, you can guess, and it is only a guess, that they may have similar preferences elsewhere. You can do the same thing down a column, say The Big Lebowski: if Bob disagrees about Pulp Fiction, maybe he will also disagree about The Big Lebowski. This is an example I filled in by hand, so it is a little silly, but you can imagine that for a much larger and much sparser matrix those correlations are still there and you can still cross-reference everything. It is almost like a really high-dimensional Sudoku puzzle, where you cross-reference who likes what to figure out what some other person would like. With these robust statistical techniques, regularized solutions of ill-posed inverse problems, and the fact that patterns exist in your data, there are ways of automating the solution of this problem, and recommender systems are getting quite good at inferring the entries of a mostly unknown matrix, under the assumption that it has low-dimensional patterns. A toy sketch of this matrix-completion idea appears below.

So, robust principal component analysis is super useful and really important. Definitely check out the original paper and Chapter 3 of our book, where we talk a lot about sparsity, robustness, and robust statistics in general, and download the code and try this yourself. This is really powerful stuff. There is no magic happening here; it seems like magic because the inverse problem is ill-posed, but with a judicious choice of regularization you can do pretty amazing things like robust signal decomposition. Thank you.
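As a toy illustration of the matrix-completion idea mentioned at the end of the captions: the ratings below are made up, and this nuclear-norm formulation is one standard convex way to pose the problem, not what any production recommender actually runs.

```python
# Toy sketch: fill in missing ratings by finding the minimum-nuclear-norm matrix
# that matches the observed entries (a convex surrogate for "low rank").
# Ratings are made up: +1 = liked, -1 = disliked, NaN = not rated.
import numpy as np
import cvxpy as cp

R = np.array([[ 1.0,    1.0, np.nan, -1.0],
              [ 1.0, np.nan,    1.0, -1.0],
              [np.nan, -1.0,    1.0,  1.0],
              [-1.0,   -1.0, np.nan,  1.0]])
mask = (~np.isnan(R)).astype(float)        # 1 where a rating was observed
R0 = np.nan_to_num(R)                      # zeros in the unobserved slots

X = cp.Variable(R.shape)
problem = cp.Problem(cp.Minimize(cp.norm(X, "nuc")),
                     [cp.multiply(mask, X) == cp.multiply(mask, R0)])
problem.solve()

print(np.round(X.value, 2))                # predictions appear in the NaN positions
```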
Info
Channel: Steve Brunton
Views: 42,873
Id: yDpz0PqULXQ
Length: 22min 11sec (1331 seconds)
Published: Fri Jan 29 2021