COS 302: Singular Value Decomposition

Captions
In this video we're going to talk about the incredibly important topic of singular value decomposition, or SVD. SVD is an incredibly important tool, not just for general factorization, but it's kind of the right way to solve a lot of different kinds of under-constrained and over-constrained linear systems.

Our basic setup is that we have some matrix A that is not necessarily square, that is in general m by n, and the idea is that we can always write such a matrix as the product of three pieces: A = U Σ V^T, where the pieces have different special properties. In particular, U is a square orthonormal m by m matrix, V is an orthonormal n by n matrix, and Σ is an m by n matrix that is zero everywhere except along its main diagonal. Just to remind you, orthonormal means that U U^T = U^T U = I and V V^T = V^T V = I. Σ is the one that's a little bit weird: it is m by n, and it only has non-zero entries along the diagonal, Σ_ii = σ_i; everywhere else it is zero. We refer to those diagonal entries σ_i as the singular values. The singular values are always non-negative, so they can be zero, but otherwise they're positive, and conventionally we write them in decreasing order. This decomposition is nearly unique: Σ is unique, but it's possible that some of the σ_i have identical values, and in that case U and V will not necessarily be unique. The number of non-zero singular values is the rank of the matrix A.

One thing you can see is that we can wind up with a lot of sort of useless pieces of this SVD. This is what we would call the full singular value decomposition, but if, for example, m is greater than n, then Σ has a chunk of zeros below its diagonal block, and in that case we could instead think about the reduced SVD. The idea of the reduced SVD is to make the Σ matrix square in the smaller of the two dimensions, m or n. In this case (m greater than n), U is now m by n rather than square, Σ is square and n by n, and V^T is n by n. This eliminates those extra zeros and the useless extra columns of U that would have been multiplying by those zeros.

The sort of hand-wavy explanation for why this is interesting is that you can think of U and V as just being rotations: they're orthonormal transformations, so they don't change the length of anything. What's happening is that we're trying to find a transformation such that all of the scaling and shrinking is happening along the coordinate axes associated with the singular values. It's like we're finding a basis in which all of that scaling effect is happening along the coordinate directions.
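As a concrete illustration (this code is not part of the lecture itself, and it assumes NumPy), here is a short sketch of the full versus reduced SVD: np.linalg.svd returns the full factorization with full_matrices=True and the reduced one with full_matrices=False.

```python
import numpy as np

# A sketch of the full vs. reduced SVD using NumPy.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))

# Full SVD: U is m x m, Vt is n x n; s holds the singular values (non-negative, decreasing).
U_full, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U_full.shape, s.shape, Vt.shape)        # (5, 5) (3,) (3, 3)

# Reduced SVD: U is m x n, so the columns that would multiply the zero block are dropped.
U_red, s_red, Vt_red = np.linalg.svd(A, full_matrices=False)
print(U_red.shape)                            # (5, 3)

# Both reconstruct A; for the full SVD we embed s in an m x n Sigma.
Sigma_full = np.zeros((m, n))
np.fill_diagonal(Sigma_full, s)
assert np.allclose(U_full @ Sigma_full @ Vt, A)
assert np.allclose(U_red @ np.diag(s_red) @ Vt_red, A)

# U and V are orthonormal.
assert np.allclose(U_full.T @ U_full, np.eye(m))
assert np.allclose(Vt @ Vt.T, np.eye(n))
```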
One thing that's confusing about the singular value decomposition is that it looks a lot like eigendecomposition, in that it produces a diagonal matrix, but it's a thing that we're applying to rectangular matrices rather than square matrices. The diagonalization is one thing, but with eigenvalue decomposition we also have that invariance property of the eigenvectors. What's going on with SVD? Let's talk about the equivalent property for a second.

As before, A is an m by n matrix, and we're decomposing A into the product of an orthonormal matrix U, a matrix Σ that is zero everywhere except along the main diagonal, and the transpose of an orthonormal matrix V. The columns of U we refer to as the left singular vectors, and the columns of V we refer to as the right singular vectors. Remember that with eigenvalue decomposition, the idea was that if I hit an eigenvector with the matrix, all that would happen is that it would get scaled. We get something very similar with the right and left singular vectors.

Let's imagine, for example, that we take the vector v_j, the j-th right singular vector (the j-th column of V), and ask what happens if we hit it with the matrix A. Writing out the decomposition, A v_j = U Σ V^T v_j. Let's take a minute and talk about V^T v_j. V^T is a matrix whose rows are the right singular vectors; v_j is orthogonal to all of those rows except one, which is itself, and it has unit length, so that entry is a one. So this multiplication slices out the j-th entry: we get a vector that is zero everywhere except for a one in entry j. Now, what happens if we take that slicing vector and multiply it by Σ? Σ has σ_1, σ_2, and so on along the diagonal and zeros everywhere else (we can imagine the reduced form here; it doesn't change anything), so multiplying it by the slicing vector creates another vector that is zero everywhere except that, instead of a 1, entry j now holds σ_j. Finally, we hit that with U: multiplying the matrix of left singular vectors U by a vector that is zero everywhere except for σ_j in entry j slices out only u_j, the j-th left singular vector, scaled by σ_j. So A v_j = σ_j u_j.

This is very similar to what happens with eigenvectors, except that when we multiply by a right singular vector, we slice out the corresponding left singular vector and multiply it by the corresponding singular value. The singular value σ_j is still playing the role of a scaling, except that it's not scaling v_j; it's scaling u_j. It's also worth convincing yourself that the equivalent thing happens with A^T and a left singular vector. If we compute A^T u_j, we get V Σ^T U^T u_j (Σ is diagonal, so the transpose doesn't matter). Applying the same argument, U^T u_j slices out the j-th entry, Σ^T turns that into σ_j in entry j, and multiplying by V slices out the j-th right singular vector, so A^T u_j = σ_j v_j. Same idea: hitting the left singular vector u_j with A^T has the effect of scaling by σ_j, but scaling v_j instead of u_j.
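Here is a small numerical check of this property (again a NumPy sketch, not from the lecture), confirming that A v_j = σ_j u_j and A^T u_j = σ_j v_j for every j:

```python
import numpy as np

# Verify the singular-vector property numerically.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for j in range(len(s)):
    v_j = Vt[j]        # j-th right singular vector (row j of V^T, i.e. column j of V)
    u_j = U[:, j]      # j-th left singular vector
    assert np.allclose(A @ v_j, s[j] * u_j)
    assert np.allclose(A.T @ u_j, s[j] * v_j)
print("A v_j = sigma_j u_j and A^T u_j = sigma_j v_j hold for every j")
```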
Now let's talk about how to compute a singular value decomposition; this is going to have interesting connections to eigenvalues and eigenvectors as well. Again we have a matrix A that we're taking to be m by n, and without loss of generality let's assume that m is greater than n: if it weren't, we could transpose A, compute the SVD of that, and then transpose it back.

We're going to write the SVD of A as we've been doing, A = U Σ V^T, but of course all we have is A; we don't actually have these matrices yet. Our starting point is to write A^T A in terms of the SVD components: A^T A = V Σ^T U^T U Σ V^T = V Σ^T Σ V^T, because U is orthonormal, so U^T U is the identity. If we did the full SVD, Σ would be m by n with the singular values along its diagonal and zeros below, and Σ^T would be n by m with the singular values along its diagonal and zeros to the right. The product Σ^T Σ is therefore a square n by n matrix whose diagonal entries are the squared singular values σ_1², σ_2², and so on.

Recall from a previous video that symmetric matrices are orthogonally diagonalizable, and what we've found here, V times a diagonal matrix times V^T, is exactly an orthogonal diagonalization of the matrix A^T A. Connecting back to that previous lecture, the σ_i² are the eigenvalues of A^T A. Note that they are all non-negative because they're squares, which fits with our understanding of positive semi-definite matrices, which must have all non-negative eigenvalues.

So now we have a useful tool for computing the SVD by computing eigenvalues and eigenvectors. If we compute the eigenvalue decomposition of A^T A, we get V D V^T, and so the eigenvectors of A^T A are the right singular vectors of A. Let me say that again: the eigenvectors of A^T A are the right singular vectors of A, and the eigenvalues of A^T A are the squares of the singular values, so if we take the square roots of those eigenvalues, we get the singular values. Just by doing the eigenvalue decomposition of A^T A, we have the right singular vectors and the singular values. Now we just need U, the left singular vectors. The way we get those is to take our matrix A and hit it on the right with the V we just computed: A V = U Σ V^T V = U Σ, because V^T V is the identity. We also have Σ, since we just computed it via the eigenvalue decomposition, and it's a diagonal matrix, so (assuming the singular values are non-zero) we can easily compute its inverse. Multiplying on the right by Σ^{-1} gives A V Σ^{-1} = U Σ Σ^{-1} = U. To recap: take the eigenvalue/eigenvector decomposition of A^T A to get the right singular vectors and the singular values, then apply V and Σ^{-1} to A in order to compute the left singular vectors U.
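Here is a minimal NumPy sketch of that recipe (not part of the lecture, and it assumes A has full column rank so that every singular value is non-zero and Σ is invertible; this is for understanding, not how library SVD routines work internally):

```python
import numpy as np

# Compute an SVD of A via the eigendecomposition of A^T A.
rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))

# Eigendecomposition of the symmetric matrix A^T A.
eigvals, V = np.linalg.eigh(A.T @ A)          # eigh returns eigenvalues in ascending order

# Reorder to match the SVD convention: singular values in decreasing order.
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(np.clip(eigvals, 0.0, None))  # eigenvalues of A^T A are sigma_j^2
U = A @ V / sigma                             # U = A V Sigma^{-1}, column by column

assert np.allclose(U @ np.diag(sigma) @ V.T, A)
assert np.allclose(U.T @ U, np.eye(n))        # left singular vectors are orthonormal
```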
One of the ways that singular value decomposition is really useful is that it allows us to talk about the best rank-k approximation to a matrix. In order to talk about approximations, though, we need to talk about distances and norms as they relate to matrices rather than vectors. There are a lot of different norms for matrices, and they have different properties; one of them in particular, the Frobenius norm, comes up quite a lot and is one of the ways we can think about how SVD works.

Just to remind you, when we talked about norms for vector spaces, we wrote two bars on either side of a vector x to indicate that we were reasoning about the length of x. A matrix norm is very similar; it's just that now we have a matrix instead. The norm that kind of makes the most sense, and is easiest to reason about in some ways, is the Frobenius norm, which we write with double bars and a subscript F: ||A||_F. You can think of it as the natural Euclidean or L2 norm you would get by unrolling all of the elements of A into one long vector: the Frobenius norm sums the squares of all the elements of A, over all rows and columns, and then takes the square root. So it looks a lot like a distance if you imagine taking A and unrolling it into a big long vector.

One of the things that's really interesting about the Frobenius norm is that we can write it a couple of different ways, including in terms of the singular values. In particular, one of the ways we like to write it is as the square root of the trace of A^T A. It's worth thinking about how these two expressions relate, so let's look at the entries along the diagonal of A^T A. The (i, i) entry is a sum over the inner dimension, which is m, and because the first copy of A is transposed, we sum over its first index: (A^T A)_ii = the sum over j from 1 to m of a_ji times a_ji, which is of course the sum over j of a_ji². The trace then sums over the i's as well, so the trace of A^T A is the sum over i and j of a_ji², and since i and j are just dummy indices, this double sum of squares inside the square root is exactly the sum of squared entries from the definition above.

Now let's imagine further that we actually have the SVD of A, A = U Σ V^T, and plug that into the trace of A^T A. As we've seen before, A^T A = V Σ^T U^T U Σ V^T = V Σ^T Σ V^T, and as we discussed a minute ago, Σ^T Σ is a square matrix with the squares of all the singular values on its diagonal. So what is the trace of V Σ^T Σ V^T? Due to the cyclic property of the trace (or just remembering that a change of basis doesn't change the trace), the V and V^T become the identity, and we're left with the sum of the squared singular values. So we can think of the Frobenius norm as the square root of the sum of squared entries, kind of like the Euclidean distance, or equivalently as the square root of the sum of the squares of the singular values. I find this to be kind of a surprising connection: if I just add up the squares of all the entries, I get the same number as if I sum the squares of the singular values.
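Here is a quick numerical check (a NumPy sketch, not from the lecture) that the three expressions for the Frobenius norm, the entrywise sum of squares, the trace of A^T A, and the sum of squared singular values, all agree:

```python
import numpy as np

# Three equivalent ways to compute the Frobenius norm of A.
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

entrywise = np.sqrt(np.sum(A**2))
via_trace = np.sqrt(np.trace(A.T @ A))
via_svd   = np.sqrt(np.sum(np.linalg.svd(A, compute_uv=False)**2))

assert np.allclose(entrywise, via_trace)
assert np.allclose(entrywise, via_svd)
assert np.allclose(entrywise, np.linalg.norm(A, 'fro'))  # NumPy's built-in Frobenius norm
```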
One of the remarkable uses of the singular value decomposition is to construct an optimal rank-k approximation to another matrix. That is, if I have an m by n matrix A and I want to come up with some approximating matrix of rank k less than the rank of A, then what we can do is a truncated SVD, and that will be the best approximation according to the Frobenius norm we just talked about, and also according to something called the spectral norm.

The way to think about this is, of course, via the SVD of A, but now we're going to write it slightly differently: as a sum over rank-one matrices constructed from the left and right singular vectors. We can write A = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ... + σ_r u_r v_r^T, where r is the rank of A, u_i is the i-th left singular vector, and v_i is the i-th right singular vector. What we're doing here is a little bit confusing, because we're taking two vectors and multiplying them, but rather than the transpose being in the middle to make this an inner product, the transpose is on the right side, making this what we sometimes call an outer product. Rather than collapsing into a scalar like the inner product would, each term expands into an m by n rank-one matrix, and we weight each of those matrices by the associated singular value and sum up a big stack of them. This summation is exactly the same thing as writing U Σ V^T; it's just different notation for the same idea, but it lets us see how we're constructing the matrix one rank-one piece at a time.

The appeal of this is that it gets us the rank-k approximation via what we might call a truncated SVD. We write that rank-k matrix as A_k = σ_1 u_1 v_1^T + ... + σ_k u_k v_k^T, so instead of summing all the way up to the rank of A, we only sum up to k, and because the σ_i are in decreasing order, this sums the outer products associated with the largest singular values. It turns out that this truncated SVD, truncated because you're throwing away or zeroing out the singular values beyond k, gives you the best approximation to A among rank-k matrices. This is the Eckart-Young-Mirsky theorem: if we consider the minimum over all m by n matrices B of rank k of the norm of A minus B, that minimum is achieved by B = A_k. So if you want the best rank-k approximation, you can take your Σ with its singular values, zero out all the ones with an index larger than k, put the matrix back together, and you have an A_k that achieves this minimum.
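Here is a short NumPy sketch of the truncated SVD (not from the lecture): it builds A_k from the k largest singular values, checks that this matches the sum of rank-one outer products, and checks that the Frobenius error is the square root of the sum of the discarded squared singular values, which follows from the Frobenius-norm connection above.

```python
import numpy as np

# Best rank-k approximation via the truncated SVD.
rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]                  # truncated SVD

# The same matrix, written as a sum of rank-one matrices sigma_i * u_i v_i^T.
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
assert np.allclose(A_k, A_k_sum)

# Eckart-Young-Mirsky: no rank-k matrix is closer to A in Frobenius norm than A_k.
# The error A_k achieves is the sqrt of the sum of the squared discarded singular values.
err = np.linalg.norm(A - A_k, 'fro')
assert np.allclose(err, np.sqrt(np.sum(s[k:]**2)))
```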
It turns out this is also true under the spectral norm, not just the Frobenius norm, but you don't need to worry about that detail. It's not too hard to prove, but it requires thinking through the relationship between the Frobenius norm and the spectral norm a little bit, so we won't do that here. Mostly what I want you to know is that this is an important feature of the SVD: it allows you to get the best rank-k approximation.
Info
Channel: Intelligent Systems Lab
Views: 404
Rating: 5 out of 5
Id: JUYGohQY41U
Length: 24min 56sec (1496 seconds)
Published: Mon Mar 01 2021