Lecture 37 - Principal Component Analysis (05/03/2017)

Captions
Okay, so what I want to do today first is step back a little bit and review what we have been doing for the last two weeks. The topic we have been looking at is what we call unsupervised learning, and one primary way to look at unsupervised learning methods is through what we call latent variable models. Most unsupervised learning methods can be posed this way - not all of them, but a lot of them. This is a very probabilistic approach, so if you do not have a probability model things will look a little different, but the idea behind these LVM models is that you observe some X. So far we have been looking at X as a collection of vector observations, but it could be something else - it could be text, it could be strings, whatever. You observe some X, but instead of directly modeling the probability distribution of X, what you typically do is model the probability distribution of X given some latent variable Z and some theta, the parameters of your model, and you also have a prior distribution over Z. In the generative form, the data is generated through a two-step process: you first sample Z from this prior distribution and then use that Z to sample X. The machine learning problem is that we are only given X - just some training data, which is a bunch of X's - and we need to figure out theta, the parameters of our model, and also what the most likely Z is. Eventually what we would like is: what is theta, which is the learning part, and what is p(Z given X), which is the inference part.

Then we saw that clustering can be thought of as an instantiation of latent variable models, because in clustering we are interested in a hidden quantity, the cluster index Z, which denotes which cluster X belongs to. In clustering we do not have an explicit probability model, but the philosophy is the same. We also saw that if Z is categorical, which means it takes one out of K values, then latent variable models become mixture models, where Z indicates which mixture component the data came from. The question then becomes how we learn the parameters of these mixture models, and we looked at one algorithm for that called expectation maximization, or EM. EM is an optimization algorithm that lets you maximize the joint, or rather the marginal, likelihood of X given theta, summed over all possible values of Z, and if you look back at our discussion in the past couple of classes, we covered why this is important. We then saw that one version of EM, which we call hard EM, is exactly k-means: k-means clustering is an algorithm that solves the clustering problem, but it is EM with a constraint that says at each step you force every object to belong to exactly one cluster, so p(Z given X, theta) is one for a single cluster and zero for everything else. That becomes k-means clustering. Finally, we saw that if Z, instead of being categorical, belongs to R^L - a vector of L numeric values - then what we get is what we call factor analysis.
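To make the two-step generative story concrete, here is a minimal sketch (not from the lecture; all parameter values are made up) of how data from a mixture model with a categorical latent variable Z would be generated. In the learning problem only x is observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-component 1-D Gaussian mixture (all values made up)
pi = np.array([0.4, 0.6])        # prior p(z = k): mixing proportions
mus = np.array([-2.0, 3.0])      # component means
sigmas = np.array([0.5, 1.0])    # component standard deviations

def sample_mixture(n):
    """Two-step generative process: sample z from its prior, then x given z."""
    z = rng.choice(len(pi), size=n, p=pi)     # step 1: latent component index
    x = rng.normal(mus[z], sigmas[z])         # step 2: observation given z
    return x, z

x, z = sample_mixture(500)
# In the learning problem we only observe x; z and (pi, mus, sigmas) must be inferred, e.g. with EM.
```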
Factor analysis is what we did last class. We saw, for the special case of factor analysis where everything is Gaussian, how to find p(X given Z) and how to figure out p(Z given X); everything stayed nice and Gaussian because of the nice properties of the Gaussian distribution. I also mentioned that if you want to estimate the parameters of a factor analysis model, you can use expectation maximization there as well. We did not go through that algorithm - you would have to derive it yourself - but you could apply it in principle. So that is where we were.

Just a quick review of the factor analysis setup. We assume that Z, the hidden variable, comes from a normal distribution with some mean and covariance; in most factor analysis models we assume the mean is zero and the covariance is the identity matrix. Then we say that p(X given Z, theta) is also a normal distribution, with mean WZ + mu and covariance matrix Psi, and typically we assume that Psi is diagonal. That was the setup for factor analysis, which we covered in that class, so I hope you know what I am talking about here.

In the end I went through a couple of different examples, and basically the effect of this model is that it lets you understand some hidden dimensions. It is kind of like a dimensionality reduction method: you are given some X, and if you knew W, Psi, and mu - the parameters of the model, which you can get from EM - then you can find p(Z given X, theta) just by applying Bayes' rule, and it turns out that is also a Gaussian. What does this mean? It means that if I give you some data point X, which is in D dimensions, I can get what Z would be - actually I get a distribution over Z, but typically we just work with the mean of that distribution. So X belongs to R^D, say three-dimensional space, and Z belongs to R^L, and typically we assume L is less than D, or much less than D. This is what lets us do dimensionality reduction: even though the original data was in D-dimensional space, we can represent it in a smaller L-dimensional space.

But factor analysis was not proposed as a dimensionality reduction method; it was proposed more to understand what the latent variables are. In factor analysis we do not actually worry about Z as much as we worry about W, because W is the factor loading matrix. W tells us how the features that you observe are connected with each other. The idea is that maybe you are looking at data in twelve-dimensional space, but six of those features are actually very similar to each other, and the other six are similar among themselves. The W matrix from factor analysis will tell you, okay, these six go together and those six go together - it tells you what hidden factors actually generate the data. It is used a lot when you are dealing with things like psychometrics, where you take a survey of people and try to understand their behavior. The survey might have 300 questions, say, but not all of the questions are independent of each other, and factor analysis lets you extract those hidden factors. That was the idea. But in the end I also mentioned that you could use this for dimensionality reduction - nobody is stopping you from doing that.
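As a rough illustration of the factor analysis setup just reviewed, the sketch below fits scikit-learn's FactorAnalysis to synthetic data; the dimensions, loading matrix, and noise level are all invented for the example and are not from the lecture.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data: d = 12 observed features generated from l = 2 hidden factors
n, d, l = 300, 12, 2
Z_true = rng.normal(size=(n, l))                       # latent factors, z ~ N(0, I)
W_true = rng.normal(size=(d, l))                       # true loading matrix (unknown to the learner)
X = Z_true @ W_true.T + 0.1 * rng.normal(size=(n, d))  # x = W z + noise

fa = FactorAnalysis(n_components=l)
Z_hat = fa.fit_transform(X)      # posterior mean of p(z | x) for every data point
W_hat = fa.components_.T         # estimated d x l loading matrix: how features load on the factors
```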
If this Psi matrix is just sigma squared times I - so Psi is still a diagonal matrix but every entry is the same - then factor analysis is actually the same as what we call probabilistic principal component analysis. Some of you might be familiar with PCA; that is what we are going to look at today. PCA was an explicit method meant for dimensionality reduction: given data in D-dimensional space, the purpose of PCA is to find a smaller number of dimensions such that when you convert the data into those smaller dimensions it retains most of the information. That is what PCA does. The original PCA had no probabilities associated with it - it is more like a spectral decomposition - but if you look at the factor analysis view, then PCA and factor analysis are related, and that connection is why this variant is called probabilistic PCA.

Another thing to note is that in factor analysis we are not making any assumptions about W; W is just some loading matrix. So what is W? W is a D-by-L matrix, because we multiply W with Z and that gives us a D-by-1 vector. The effect of W is that when I multiply W by Z, it gives me the mean of X, which means it is essentially converting an L-dimensional vector into a D-dimensional vector. Think of it this way: Z is a column vector with L entries, z_1 through z_L, and X has D features, x_1 through x_D. To get the first entry of the mean of X, you take the first row of W and take an inner product with Z. So you can think of each row of W as a way of transforming the latent vector into a single value; the second entry comes from the second row, and so on. If you look at the values in a row, they tell you which of the hidden variables that observed feature draws on - if some of them are zero, that feature is not using those factors. So W tells you how much each factor is loaded onto your X, and that is why we call it the loading matrix.

When you solve factor analysis you get a W with no special structure. But if we force W to be orthonormal - which means that if I take any column of W and take an inner product with any other column I get zero, and if I take the inner product of a column with itself I get one, so that W transpose W is the identity - then what we get is principal component analysis.
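A tiny sketch of the two properties just described, with a toy matrix rather than one learned from data: each row of a D-by-L loading matrix W mixes the latent values into one observed coordinate, and orthonormal columns satisfy W^T W = I.

```python
import numpy as np

# Toy loading matrix W: d = 3 observed dimensions, l = 2 latent dimensions (illustrative values)
W = np.array([[0.6,  0.8],
              [0.8, -0.6],
              [0.0,  0.0]])

z = np.array([0.5, -1.2])                # a latent vector in R^l
x_mean = W @ z                           # each row of W takes an inner product with z -> one observed value

# Orthonormal columns: W^T W is the l x l identity matrix
print(np.allclose(W.T @ W, np.eye(2)))   # True
```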
The reason we want W to be orthonormal is that when we are doing true dimensionality reduction, the new space into which you transform your features should itself be a proper coordinate system. Think of it geometrically: when we represent a point in a 2D space and say this is x and this is y, we assume x and y are orthogonal to each other, and in three-dimensional space we assume x, y, and z are orthogonal. Now if I transform this into a new space - say I transform a two-dimensional coordinate system into another two-dimensional coordinate system which is maybe just rotated - then for it to be a true coordinate system the new axes should also be orthogonal to each other. That is the property of a coordinate system: all of the basis vectors that represent it should be orthogonal to each other. That is why, for true dimensionality reduction, we force W to be orthonormal. You can still solve factor analysis with EM under this constraint and get the answer, and that is what we call probabilistic PCA, or probabilistic principal component analysis. But today we are going to look at just principal component analysis, the way it was originally proposed in 1901, and then we will see the connection between that and PPCA as well. Any questions so far?

That is a good question: how do I determine this L? This L, again, is the same story - it is a hyperparameter of the model, and it typically comes from some domain expertise. Going back to the survey example: let's say you have collected data where each data point is the set of responses to 300 survey questions, and you are trying to model the personality traits of people. Some psychologist might tell you that typically there are five personality traits, so you use five. But there is no universally good way of estimating it. I have used factor analysis to study countries in the past, using data from the CIA World Factbook where every country has many parameters, and a domain expert told me that a country has three basic kinds of factors - economy, demographics, and natural resources - so that meant I should look for three factors. Something like that. Any other questions?

Okay, so let's move to principal component analysis. Let's introduce PCA with an example. Say I have five data points represented in a two-dimensional space, and what I want to do is embed them in one dimension. Let's say I do not have space to store five times two values - every point has two coordinates, but I only have space to store five values - which means I want to embed the data into one dimension. How can I do that? I can do it in many ways. For example, I could just store their x-axis values - 0.2, 0.3, 0.5, 0.65, and 0.7, say - that is one way. Another way is to store just their y-axis values. Those are two ways to represent the data, but what is the best way? The first question is how you even measure "best." What has been shown - this is a result from statistics and information theory - is that what you want to find is a line onto which you can project your data. For example, earlier I said I want to put the data in a one-dimensional space along the x-axis; that means I draw a line along the x-axis and project everything onto it. Another choice is the y-axis. But really, given my data, I want to be able to draw any line and project all of my data points onto that line - this will be the first point, second point, third point, fourth point, fifth point - and we know geometrically how to do this projection.
So what we want to find is: what is the best line, such that the projected points are spread as far apart from each other as possible? That is the goal. For example, if I draw a line along the x-axis and project my points, they land in certain places; if I project onto the y-axis they land somewhere else; and I can draw any arbitrary line and project the data onto that. Our task is to find the line that gives us the best separation between the data points, because when you do dimensionality reduction you lose information - earlier your data was in two-dimensional space, and now you want to represent it in one-dimensional space, so clearly there will be some loss of information, and you want to minimize that loss.

What people have agreed upon is that the best line is the one that maximizes the variance of the projected data. Whatever data you get after doing the projection, you want its variance to be maximal, because that gives you the best separation between the data points: in the original space the data points are nicely separated, and you want the one-dimensional embedding to keep them nicely separated too. Let me take another contrived example: say your data lies along a diagonal line. If I project the points onto the y-axis they get squeezed together; likewise if I project onto the x-axis; but if I project them onto the line they actually lie on, they look exactly like what they are. If you compute the variance of these possible projections, you will see that the last one has the largest variance - and variance is just the average of (x minus the mean) squared. In this contrived example there happened to be a line you could fit exactly, but in general, when you have data in a higher-dimensional space, how do we find an embedding that maximizes the variance? That is what PCA does.

The driving principle behind PCA is that you want to find the direction in which you lose the least information, and you quantify that loss of information by saying: I want to maximize the variance; if I maximize the variance, I lose the least amount of information. So the question becomes how we find this direction of maximal variance, and we can actually do it with a little bit of mathematical optimization, which of course we love. Here is the idea. We do not know the direction, so let u - a unit vector, a vector with unit magnitude - denote the direction of the line onto which I want to project things. First I want to compute the projection: if I have any data point x_i, then x_i transpose u gives me the projection, that is, where the point lies along this line. This is just geometry: u is a unit vector along the line onto which you want to project your data, and if I have a point and want to see where it lies on that line, I just take the inner product between the point and this direction vector.
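Here is a quick numerical version of the "which line is best" question, using a handful of made-up 2-D points: project them onto different unit directions and compare the variance of the resulting 1-D values. (The data is mean-centered first, as the derivation that follows assumes.)

```python
import numpy as np

# Five made-up 2-D points; mean-center them first
X = np.array([[0.2, 1.0], [0.3, 1.4], [0.5, 2.1], [0.65, 2.6], [0.7, 3.0]])
X = X - X.mean(axis=0)

def projected_variance(X, u):
    """Variance of the 1-D projections x_i^T u for a unit direction u."""
    u = u / np.linalg.norm(u)
    z = X @ u
    return z.var()

print(projected_variance(X, np.array([1.0, 0.0])))   # project onto the x-axis
print(projected_variance(X, np.array([0.0, 1.0])))   # project onto the y-axis
print(projected_variance(X, np.array([1.0, 4.0])))   # a direction roughly along the data spreads them out more
```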
That inner product gives us a new point - this is the embedding - and let's call it z_i, so z_i = x_i^T u. What we want to do is maximize the variance of these z_i. In this whole discussion I am going to assume that all of our data has been mean-centered. If your data is not mean-centered you can always make it so: if X is my data matrix, with rows x_1 transpose, x_2 transpose, and so on, you compute the column means and subtract them from every vector, and that gives you mean-centered data. So let's agree that our data is mean-centered, with zero mean, which means the variance of my embedded data is just (1/N) times the sum over i of z_i squared, since each z_i is a number and the mean of the z_i is zero. This is just the formula for the variance (it could be 1/(N-1) instead of 1/N; that does not really matter here). Now I can replace z_i with x_i^T u, so z_i squared is (x_i^T u)^2 = u^T x_i x_i^T u, and if I open it up the variance becomes (1/N) times the sum over i of u^T x_i x_i^T u. The u parts do not depend on i, so I can bring them outside: u^T [ (1/N) sum_i x_i x_i^T ] u. Now remember that we assumed the data is mean-centered, which means the term in brackets is nothing but the sample covariance of the original data - the sample covariance is usually written as (1/N) sum_i (x_i - mu)(x_i - mu)^T, and with mu equal to zero it reduces to exactly this. I am going to write that matrix as S, our sample covariance matrix.

So if we want to find the u that gives us the embedding with maximum variance, we want to find the u that maximizes u^T S u, and we also want u to be a unit vector. That gives us an optimization problem: find the u that maximizes u^T S u, subject to the constraint that u^T u = 1. The constraint is needed because if you just maximize u^T S u with no constraint, you can make the entries of u arbitrarily large and blow the whole thing up to infinity; you want the unit vector u that maximizes this quantity. This is actually easy to solve, because we can bring in our favorite tool, the method of Lagrange multipliers: this is a constrained optimization problem with an equality constraint, which means I can instead solve the problem of maximizing u^T S u + lambda (1 - u^T u). Any questions so far? Let's see what this gives us - it actually gives us a very interesting result. To find the maximum, I compute the derivative with respect to u of this whole thing and set it to zero, which gives 2 S u + 2 lambda u = 0 (the derivative of u^T S u with respect to u is 2 S u; this is something you can get from the Matrix Cookbook as well). I am going to switch the sign on the lambda term to a minus; that does not really matter, because it gets absorbed into lambda. Rearranging, we get S u = lambda u.
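Written out compactly (same quantities as above: S is the sample covariance, u the unit direction, lambda the Lagrange multiplier), the problem and its stationarity condition are:

```latex
% Maximize the projected variance over unit-length directions u
\begin{aligned}
&\max_{u}\; u^{\top} S u \quad \text{subject to} \quad u^{\top} u = 1,\\
&\mathcal{L}(u,\lambda) = u^{\top} S u + \lambda\,\bigl(1 - u^{\top} u\bigr),\qquad
\frac{\partial \mathcal{L}}{\partial u} = 2 S u - 2 \lambda u = 0
\;\Longrightarrow\; S u = \lambda u,\\
&\text{and at such a stationary point}\quad u^{\top} S u = \lambda\, u^{\top} u = \lambda .
\end{aligned}
```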
So we have S u = lambda u, together with the constraint u^T u = 1, and if you plug the first into the objective you get u^T S u = lambda. Those are the two key facts. Now let's look at this equation a little carefully. We want to solve this system: S is a D-by-D matrix, u is a D-by-1 vector, and we want a u such that multiplying S by u gives us some lambda times u. You know where I am going with this: the solutions to this equation are exactly the eigenvectors of S, so you can get candidate u's simply by computing the eigendecomposition of S. But the next question is which eigenvector to take: if I do the eigendecomposition of S, a D-by-D matrix, I get D eigenvectors, so which one should I choose? Remember we are maximizing u^T S u. If u is an eigenvector, the objective value is u^T (lambda u) = lambda u^T u, and because of the constraint u^T u = 1, the objective is just lambda. So we want to choose the eigenvector whose lambda is largest - the lambdas, of course, are the eigenvalues. This means the best embedding direction is the eigenvector with the largest eigenvalue. When you solve this with an eigendecomposition, the eigenvectors give you the candidate u's and each comes with a corresponding eigenvalue, and you choose the eigenvector that has the largest eigenvalue. That is the direction to use if you want to embed your data along just one direction: take the eigenvectors of the covariance matrix, and project the data onto the one with the largest eigenvalue. That is the idea behind principal component analysis. Now somebody might ask: what if I want to project into two dimensions, not just one? The answer is that you then choose the two eigenvectors that correspond to the top two eigenvalues, and so on - if you want to project into K dimensions, you project using the top K eigenvectors. Any questions?

So that is the idea: the first principal component, the first direction in which you embed the data, is given by the eigenvector with the largest eigenvalue; the second principal component by the eigenvector with the second largest eigenvalue; and so on. And when you do this, the lambda_i that comes along with each eigenvector is useful too. It turns out that lambda_i gives you the variance along that principal component - we just computed that. So if I use any principal component, any eigenvector, to embed my data, then the variance captured along that direction is exactly its eigenvalue lambda_i.
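A short numerical check of this conclusion, on synthetic correlated 2-D data (the covariance used to generate it is invented): the eigendecomposition of the sample covariance gives the directions, and the variance achieved along the top eigenvector equals its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated, mean-centered 2-D data
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=500)
X = X - X.mean(axis=0)

S = np.cov(X, rowvar=False)              # 2 x 2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric; eigenvalues come out in ascending order
order = np.argsort(eigvals)[::-1]        # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                        # first principal direction (largest eigenvalue)
print(eigvals[0], u1 @ S @ u1)            # the achieved variance u^T S u equals the top eigenvalue
print(np.isclose(u1 @ eigvecs[:, 1], 0))  # the two principal directions are orthogonal
```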
So both things are useful: when you take your covariance matrix and do the eigenvalue decomposition, you can use the eigenvectors to embed your data, and you can use the lambdas to understand how much variance is captured by each direction. Going back to the picture: the first principal component was the first eigenvector, and when you embed the data along it, the variance of the embedded data is just the eigenvalue associated with that eigenvector. With this you can decide to take the top L eigenvalues, take the corresponding eigenvectors, and embed your data into those L dimensions, so instead of D dimensions you get L dimensions. You can then compute the ratio of the sum of the top L eigenvalues to the sum of all the eigenvalues, and this tells you what percentage of the variance you have captured - it tells you how good your embedding is. If you take all of your eigenvectors, so L is actually D, then you capture 100% of the variance, but you have not really done any dimensionality reduction; all you have done is rotate your D-dimensional data a little, so the variance stays the same, but of course you have not gained any reduction. The idea is to find an L that gives you real reduction without too much loss of information - the variance is a proxy for how much information you have kept. Any questions?

Someone asked: what if the data points were arranged in a circle - would this work? Well, you can always run the procedure. This is the PCA algorithm: the first thing you do is mean-center the data - and if your data is sampled uniformly on a unit circle it is already mean-centered - then you compute the sample covariance matrix, and then you compute the eigenvectors and eigenvalues. You can do all of that; the point is that you might not get a good reduction, which means that if you only use the first direction it might capture only about half of the variance. You can actually try it later - I will show you a notebook with the algorithm, so you can generate data on a circle and try to fit it. So you can do it, but it might not give you a good reduction. The other point is that the principle PCA rests on is that there is some correlation between your dimensions. If there is no correlation - if the variables carry no information about each other - then you cannot do any dimensionality reduction this way, and PCA will not work. There are other methods that could work here, because data lying on a circle does mean there is some relationship between the two coordinates, but it is not something PCA can capture - I suppose you could call it a failure case for PCA, if you want to look at it that way. Any other questions?

Another question: why are we doing this at all? Because it gives us a reduction - we are doing it for dimensionality reduction.
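The "percentage of variance captured" quantity just described, computed for some hypothetical eigenvalues (the numbers are made up):

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted in decreasing order (made-up numbers)
eigvals = np.array([5.2, 1.1, 0.4, 0.2, 0.1])

L = 2
captured = eigvals[:L].sum() / eigvals.sum()   # sum of the top-L eigenvalues over the sum of all of them
print(f"Top {L} components capture {100 * captured:.1f}% of the total variance")
```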
The point is that even though your data may live in a high-dimensional space, we may not want to keep it there. Remember your neural network assignment, where the data was in 784-dimensional space: the complexity of the algorithm depends on how big that vector is, because your weight matrices have to be that big. What if you do not have the computational resources for that? What if you can only process data that is, say, ten-dimensional? Then you have to do dimensionality reduction, and you will incur a loss, but hopefully not too much. That is one reason. Another is visualization: you can only visualize things in two or three dimensions, so there is no way to directly visualize high-dimensional data. The third reason goes back to the question asked earlier: if the data has dependencies - if the variables are correlated rather than independent of each other - then what PCA does is come up with new dimensions that are uncorrelated with each other, so any algorithm that gets confused by correlated features will perform better. Things like regression: if the data set has correlated variables, regression gets confused, and typically we use something like ridge regression to address that. But what you can also do is run PCA on your data, reduce the dimensionality, and be assured that you have removed the linear dependence among the features, and then your algorithm will work better. In fact there is a method called principal component regression that does exactly that: instead of doing regression directly on X, it first uses PCA to reduce X to a smaller-dimensional space and then runs the regression, and it has been found that in many practical settings this actually improves performance, because PCA removes some of the noisy relationships. So that is the idea.

So how do we use PCA? It is pretty easy. First you do the eigenvalue decomposition of the covariance matrix computed from your input data. Then you use only the first L columns of W, where W is the matrix that contains your eigenvectors - you create a new W matrix in which each column is one eigenvector, keeping the top L. Then you multiply X with W: W is D-by-L, X is N-by-D, so when I compute XW the result is an N-by-L matrix, and each row of it is a data point in the new, reduced space. So each input vector is replaced by a shorter L-by-1 vector. That is how it works, and it works pretty well; I will show you some examples. What you can also do is reconstruct your X. Suppose I give you W and I give you a z, the lower-dimensional embedding of a point. If I multiply W by z - remember W is D-by-L and z is L-by-1 - I get an X back. It will not be exactly the same X, because of course we have lost some information by using only the first L eigenvectors in W, so the reconstructed x might not be exactly your original x, but it will be pretty close. If you compute the error between your original x's and the reconstructions you get by doing this process, that error is what we call the average reconstruction error.
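Putting the recipe just described in one place, here is a minimal NumPy sketch (not the lecturer's notebook) that mean-centers the data, eigendecomposes the covariance, keeps the top L eigenvectors, embeds, and reconstructs:

```python
import numpy as np

def pca_fit_transform(X, L):
    """Minimal PCA sketch: mean-center, eigendecompose the covariance,
    project onto the top-L eigenvectors, and reconstruct."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # mean-center the data
    S = np.cov(Xc, rowvar=False)                  # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues in decreasing order
    W = eigvecs[:, order[:L]]                     # d x L matrix of the top-L eigenvectors
    Z = Xc @ W                                    # n x L embedded data
    X_hat = Z @ W.T + mu                          # reconstruction back in the original space
    return Z, X_hat, W

# Usage on synthetic data (the numbers are arbitrary; only the shapes matter here)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Z, X_hat, W = pca_fit_transform(X, L=3)
print(Z.shape, X_hat.shape)                       # (100, 3) (100, 10)
```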
And there is a theorem - it has a longer formal name - which says that PCA gives you the smallest reconstruction error among any method you can come up with that gives you an L-dimensional embedding of this kind. If you do PCA - start with the covariance matrix, take the first L eigenvectors, and do this kind of reconstruction - PCA gives you the solution with the minimum reconstruction error. Think about it: you could construct this W in many ways; I constructed it using the eigenvalue decomposition, but you could build a W out of any L orthogonal unit vectors. Whatever way you choose, this theorem assures you that PCA gives the least reconstruction error, and that is why PCA is so popular in general. Any questions so far? Yes - are the columns of W guaranteed to be orthogonal? Yes, that is a property of the eigenvalue decomposition: the eigenvectors you get are orthogonal to each other, and they satisfy the equation S u = lambda u. Any other questions?

Okay, so now let's look at another demonstration, which should help you understand this a little more. Here I am using scikit-learn's PCA, but you could always do it yourself: all you need to do is mean-center the data, compute the covariance matrix, and do the eigenvalue decomposition. (That is what PCA conceptually does; scikit-learn actually does something slightly different, using the singular value decomposition, which we will discuss in the next class.) Let us look at the figure: this is my toy data, in two dimensions, and you can clearly see there is some correlation between the two variables. If I run PCA on it - here I am actually doing the eigenvalue decomposition - the output is two eigenvalues and two eigenvectors, and you can see the eigenvectors are orthogonal to each other. What this tells me is that one direction has the larger eigenvalue: if I project my data onto that eigenvector, the spread along that axis will be the largest, and the second largest spread is along the other one. First let us look at the eigenvectors themselves: I use basic plotting to draw them, and since they are orthogonal to each other they form a little rotated coordinate frame - what they are saying is, project the data onto this axis and onto that one. Here I have scaled the eigenvectors by their eigenvalues, so the picture tells me that if I project the data onto the first direction I get the biggest spread, and onto the second direction a smaller spread. If I project onto both dimensions I am not getting any dimensionality reduction - all I am doing, if you think about it, is rotating my axes: earlier my axes were just x and y, and when I multiply by the eigenvectors I am just rotating them. So even though I am not reducing the dimensionality, at least I am making the new features uncorrelated: if you look at the data in the rotated space, you will see the coordinates are no longer correlated with each other. But of course what we really want is dimensionality reduction, so we project along the first principal component only, and this is the spread we get, and because PCA explicitly maximizes the variance, we can be assured that this gives us the maximum possible spread.
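The lecturer's notebook is not reproduced here, but a rough scikit-learn approximation of the 2-D demo might look like this (the toy covariance used to generate the data is invented):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Correlated 2-D toy data, roughly in the spirit of the demo described above (synthetic)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=300)

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

print(pca.components_)                              # the two (orthogonal) principal directions
print(pca.explained_variance_)                      # the corresponding eigenvalues / variances
print(np.round(np.corrcoef(Z, rowvar=False), 3))    # the rotated features are uncorrelated
```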
All right, any questions so far? Let me do one more example, and then we will resume on Friday. Let's look at the digits data that we have been looking at - you have grown to love it by now. I load my data matrix; here the dimensionality is larger - I am taking a subsample of the handwritten digit images, with 784 features each - and if I look at any one image it looks like, say, the number three. If I perform PCA on this and then look at this plot - and this plot is very important whenever you are doing PCA; it is what we call a scree plot - it tells you how much variance is captured if you use some number of principal components. If I use only one principal component, it gives me the ratio of variance captured, which tells you the loss you would incur if you used just one principal component. As you increase the number of principal components - and you can have up to 784 of them here - the captured variance goes up, and remember these principal components are sorted by their eigenvalues; they are not in any random order. The idea is to look for a knee in this curve and say: okay, if I use 50 principal components, the loss is not too much and I get a lot of reduction. Instead of representing my data in the whole 784-dimensional space, I would be using a much smaller space. That is the benefit in terms of dimensionality reduction.

Let's try it: say I create my Z by setting L to one, so I am keeping just one dimension, and then I look at my data. Instead of a vector of length 784, I have compressed each image into a single value. Then let's take one image: this is x, this is the z, and this is the x-bar that I get back - remember we can compute the reconstruction. So I do all of this, take the first example, get its z, and then reconstruct my x-bar. This is the x-bar I get back: think about it, this reconstruction is based on just one feature - I reduced my image from a 784-length vector to one value - and when I recover the image using W, this is what I get. It is not even close; you cannot make out the digit, and you can see the difference. But if you use a slightly larger value of L - say I take 30 - you start to see something like a digit, and if you keep increasing L - say I take 120 - then the recovered images get closer and closer to the originals. So even when you represent the data in only a 30-dimensional space, you are able to capture most of the information and the loss is minimal. And the theorem tells us that for any other orthogonal transformation you could come up with, this reconstruction difference would be at least as large - PCA gives the lowest. So if we had covered PCA earlier, I could have asked you, in that neural network assignment, to first run PCA and then work only with data that is around 30-dimensional, and it would have run much faster. That is one benefit of doing PCA.
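A rough stand-in for this digits demo (the lecture uses 784-feature MNIST images; here scikit-learn's built-in 8x8 digits, with 64 features, keeps the example self-contained): plot the cumulative variance captured and compare reconstruction errors for a few values of L.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in for the 784-feature MNIST data used in the lecture: sklearn's 8x8 digits (64 features)
X = load_digits().data

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Scree-style plot: cumulative fraction of variance captured vs. number of components
plt.plot(np.arange(1, len(cumvar) + 1), cumvar)
plt.xlabel("number of principal components")
plt.ylabel("cumulative fraction of variance captured")
plt.show()

# Reconstruct one image from only L components and measure the reconstruction error
for L in (1, 10, 30):
    p = PCA(n_components=L).fit(X)
    x_hat = p.inverse_transform(p.transform(X[:1]))
    err = np.mean((X[:1] - x_hat) ** 2)
    print(f"L={L:3d}: mean squared reconstruction error {err:.2f}")
```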
Any other questions? All right, next time we will see how we can use PCA for analyzing faces. Any questions? If you are submitting your project report, hand it in here, and then I will see you on Friday. Thank you very much.
Info
Channel: ubmlcoursespring2017
Views: 1,897
Keywords: CSE474/574, PCA
Id: MzGXTqZ73ik
Length: 52min 2sec (3122 seconds)
Published: Wed May 03 2017