Mod-01 Lec-31 PCA -- Model Adequacy & Interpretation

Captions
Good morning. We will continue with PCA; today we will discuss PCA model adequacy and interpretation. In the last class we discussed the extraction of principal components: first we solved |S − λI| = 0 to get the eigenvalues, and then, for the first eigenvalue, we found the first eigenvector a_1 from (S − λ_1 I)a_1 = 0 subject to a_1^T a_1 = 1. Similarly, you can find a_2 from (S − λ_2 I)a_2 = 0 with a_2^T a_2 = 1. In this manner you will be able to find all the eigenvectors.

Once you estimate the eigenvalues and eigenvectors, you can write down the component equations. In the example we considered, profit and sales are two interrelated variables, and we extracted two components, Z_1 (PC 1) and Z_2 (PC 2). The component loadings, that is the elements of the eigenvectors, are 0.19 and 0.98 for the first component, so Z_1 = 0.19 x_1 + 0.98 x_2, and similarly Z_2 = 0.98 x_1 − 0.19 x_2. You see what happens: the loadings are simply reversed, with one sign change. I ask you to compute this in detail using the covariance matrix S = [1.15 5.76; 5.76 29.54], work up to this level, and you will see the reason for this type of relationship.

Now, under the adequacy test, we first identified λ_1 = 30.66 and λ_2 = 0.03, so the total Σ_{j=1}^{2} λ_j = 30.69. We extracted these from the S matrix, and the trace of S is the sum of its diagonal elements, 1.15 + 29.54 = 30.69. In general, if S is a p × p matrix with diagonal elements s_11, s_22, ..., s_pp, which you can also write as the variances s_1^2, s_2^2, ..., s_p^2, then, although there are off-diagonal values as well, the trace is trace(S) = Σ_{j=1}^{p} s_j^2. It can be proved that Σ_{j=1}^{p} λ_j = Σ_{j=1}^{p} s_j^2: if you extract the maximum number of eigenvalues, their sum equals the sum of the variance components of the original matrix.

We have also already found that Var(a_j^T x) = a_j^T S a_j. So if I want to know the variance of Z_1, that is Var(a_1^T x) = a_1^T S a_1, and it should be λ_1. How? In this case a_1 = (0.19, 0.98)^T and S = [1.15 5.76; 5.76 29.54]; just do the multiplication a_1^T S a_1. The dimensions are (1 × 2)(2 × 2)(2 × 1), so the result is 1 × 1, and it equals 30.66. Since S a_1 = λ_1 a_1, we can also take λ_1 outside and reach the same answer.
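(As a quick numerical check of these relationships, here is a minimal sketch in Python/NumPy, not from the lecture itself, using the 2 × 2 covariance matrix quoted above; the variable names are illustrative.)

```python
import numpy as np

# Covariance matrix of profit (x1) and sales (x2) from the lecture example
S = np.array([[1.15, 5.76],
              [5.76, 29.54]])

# Eigen-decomposition: eigenvalues lambda_j and unit-norm eigenvectors a_j
eigvals, eigvecs = np.linalg.eigh(S)           # returned in ascending order
idx = np.argsort(eigvals)[::-1]                # sort descending: lambda_1 >= lambda_2
eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
print("eigenvalues:", eigvals)                 # approx [30.66, 0.03]
print("first eigenvector a1:", eigvecs[:, 0])  # approx (0.19, 0.98), up to sign

# Sum of eigenvalues equals trace(S), the total variance
print("sum of lambda_j:", eigvals.sum(), " trace(S):", np.trace(S))

# Var(Z1) = a1' S a1 = lambda_1
a1 = eigvecs[:, 0]
print("a1' S a1:", a1 @ S @ a1)                # approx 30.66
```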
Basically, suppose you create a diagonal matrix Λ = diag(λ_1, λ_2, ..., λ_p). Then you can write S = A Λ A^T, where A is the matrix of eigenvectors, so trace(S) = trace(A Λ A^T) = trace(Λ A^T A). Since A is orthogonal, A^T A = I, so this is trace(Λ I) = trace(Λ), the sum of the diagonal elements, that is Σ_{j=1}^{p} λ_j. Note that it is λ_j, not λ_j squared; the square appears if you go for a singular value decomposition. So now you understand the relationship.

Next, as I was telling you, λ_1 = 30.66 and λ_2 = 0.03 are point estimates, because we have computed them from a 2 × 2 covariance matrix based on a sample; in our case n = 12 observations on two variables. Beyond the point estimate, we consider the expected value: E(λ_j) = θ_j. That means if you carry out principal component analysis on the population, that is, if you know everything about the population, you will get the exact eigenvalue θ_j for that population; λ_j comes from the sample, θ_j from the population. Further, Cov(λ_j, λ_k) = 2θ_j²/(n − 1) when j = k, and 0 when j ≠ k. So for two eigenvalue estimates λ_j and λ_k, when j = k this is simply the variance component, and when j ≠ k the covariance is 0. We can therefore assume that λ_j approximately follows a normal distribution with mean θ_j and variance 2θ_j²/(n − 1). Whether this distribution is exactly right is certainly a question, and a complicated one, but several statisticians have developed this result, so we stick to it: λ_j is approximately normal provided n is large.

When this is the situation, you can find a confidence interval for θ_j. We know from basic statistics that (λ_j − E(λ_j))/SE(λ_j) follows N(0, 1), so we bound this quantity between −z_{α/2} and +z_{α/2}. Here E(λ_j) = θ_j and, from the variance 2θ_j²/(n − 1), the standard error is θ_j √(2/(n − 1)). So the inequality can be written as −z_{α/2} < (λ_j − θ_j)/(θ_j √(2/(n − 1))) < z_{α/2}. With a little mathematical manipulation you can solve this for θ_j.
So θ_j comes out of the inequality and ends up in the denominator on both sides, and the final form is

λ_j / (1 + z_{α/2} √(2/(n − 1))) ≤ θ_j ≤ λ_j / (1 − z_{α/2} √(2/(n − 1))).

If α = 0.05, then α/2 = 0.025 and z_{α/2} = 1.96. Our eigenvalue is λ_1 = 30.66 and n = 12, so n − 1 = 11, which gives 30.66/(1 + 1.96√(2/11)) ≤ θ_1 ≤ 30.66/(1 − 1.96√(2/11)). This yields 30.66/1.836 ≤ θ_1 ≤ 30.66/0.164, which leads to 16.70 ≤ θ_1 ≤ 186.95 (we are considering j = 1 here). But do you agree with this? Look at the original sample data: the total variability is only 30.69, yet the upper limit is 186.95. This huge difference arises because the number of observations is very small. So for all practical purposes I stick to the point estimate, but you can see how much uncertainty there is about the population eigenvalue. The interval rests on the normal approximation, which is valid when n is large; in our case n is very small, so we cannot say the interval is accurate, but this is the way you find a confidence interval for a population eigenvalue.
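(A minimal sketch of this eigenvalue confidence interval, assuming the normal approximation above; the numbers are the lecture example. The function name is illustrative.)

```python
import numpy as np
from scipy.stats import norm

def eigenvalue_ci(lam, n, alpha=0.05):
    """Approximate CI for a population eigenvalue theta_j,
    assuming lambda_j ~ N(theta_j, 2*theta_j**2/(n-1)) for large n."""
    z = norm.ppf(1 - alpha / 2)           # z_{alpha/2}, about 1.96 for alpha = 0.05
    half = z * np.sqrt(2.0 / (n - 1))
    return lam / (1 + half), lam / (1 - half)

lo, hi = eigenvalue_ci(30.66, n=12)
print(lo, hi)   # roughly 16.7 and 187 -- very wide, because n is small
```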
Another result concerns the eigenvectors. Using S we found a_j, where Z_j = a_j^T x. This a_j is the sample eigenvector; the corresponding population eigenvector may be denoted α_j, just as θ_j corresponds to λ_j. There is a sampling distribution for the eigenvectors as well, although again it is a complicated one: the computed eigenvector â_j approximately follows a multivariate normal distribution with mean α_j and covariance matrix

T_j = (θ_j/(n − 1)) Σ_{k ≠ j} [θ_k/(θ_k − θ_j)²] α_k α_k^T,

where the α_k are the population eigenvectors (please check the j and k notation here yourselves, just for clarification). Since â_j is multivariate normal with mean α_j and covariance T_j, you can construct a 100(1 − α) percent confidence region; for a p-variable case it is a region, not an interval. That confidence region takes the form

(n − 1) a_j^T (λ_j S^{-1} + λ_j^{-1} S − 2I) a_j ≤ χ²_{p−1, α},

where, if S is p × p, then I is the p × p identity matrix; this quantity follows a chi-square distribution with p − 1 degrees of freedom, and you can verify this confidence region yourself.

Since these are confidence statements, you may also be interested in simultaneous confidence intervals; those are more complex, so I will not simplify them further here. The point I am making is that whatever you apply in your PCA comes from a sample. Now, what is the guarantee, for example in the texture analysis you are doing, that what you find from the sample reflects the population? What is your sample size, is it quite large? Very large. Then I think you can still check the confidence intervals for the λ_j values; the dispersion must be checked, and if n is large it should be within reasonable limits. You mean the image is of size 128 × 128, so n is 128 × 128 observations? That is quite large. As you work on this, you may find papers where the dispersion of the eigenvalues, that is, the uncertainty, is not considered; you can propose that the adequacy will be much better if it is considered, and also study how the fit changes with different sample sizes. That can be a good theoretical contribution.

Now let us look at the example: λ_1 is as above, and this is our confidence region. You have seen the exponent of the multivariate normal distribution; the same idea applies when I say that â_j is multivariate normal, and for the two-variable case an ellipse is formed, so you obtain the confidence region for your eigenvector.

Next I will show you the model adequacy test in terms of Bartlett's sphericity test, then the different criteria that can be used to decide how many PCs should be retained, and finally a hypothesis test. What is the issue there? Suppose there are p x-variables and you extract m PCs, with m much less than p. For example, I take twenty variables and compress them into two dimensions using PCA; that means you are excluding 18 dimensions, which is possible only if your data are very highly correlated. In that case the 18 discarded dimensions are not required: only 2 eigenvalues will be significant and the others almost negligible. But the PCs you are not considering may still mislead you: if you eliminate principal components by a purely subjective criterion, it may give you wrong results. That is why we want a hypothesis test for the subset of principal components being removed: we want to see whether the discarded components contribute to explaining the variability of x, or whether their variability is negligible. What is Bartlett's sphericity test? Bartlett's sphericity test is interesting.
For example, look at a scatter of data on x_1 and x_2. If the data are random, with no relation between x_1 and x_2, the scatter will resemble a circle; if we go to p ≥ 3 variables, it becomes a sphere. So for p = 2 it is a circle, and for p ≥ 3 it is a sphere. When you see a circle or a sphere, it means the variables are essentially random, uncorrelated, scattered without any systematic component. If each variable is uncorrelated with the others, how many principal components can you usefully extract? Suppose I rotate the axes and call one direction Z_1 and the other Z_2; if I take the correlation matrix of variables with no correlation between each other, the diagonal elements are one and all the off-diagonal elements are 0. So if the variables are truly uncorrelated, what is the point of principal component analysis? There is no need for it. In that case R = I, and that is exactly what Bartlett's test examines: the null hypothesis H_0 is that R = I, that the correlation matrix is an identity matrix. If instead the scatter shows high correlation, R cannot be the identity; the off-diagonal elements will be substantial. So the alternative hypothesis is R ≠ I, and Bartlett developed the test statistic as well. The statistic to be tested is

−[n − 1 − (2p + 5)/6] ln|R|,

where |R| is the determinant of R, and it follows a chi-square distribution with p(p − 1)/2 degrees of freedom. In the example we have 12 observations and two variables, and the determinant |R| works out to about 0.0258, so the statistic becomes roughly 9.5 × 3.66, almost 34.77. The degrees of freedom are 2(2 − 1)/2 = 1. Let α be 0.05; if you look at the chi-square table, the critical value is very low, nothing like 34. So the computed value of 34.77 falls far beyond the critical value, and for our example we reject H_0: R = I. This is obvious if you look at the R matrix, which is [1 0.987; 0.987 1], a very high correlation. So when the correlation is high, Bartlett's null hypothesis is rejected, and we say you can go for principal component analysis. In fact this test can be done well before the principal component analysis: first compute the correlation matrix, then run Bartlett's test on it. If you find sphericity, there is no need to go on to principal components; why put in so much effort? If Bartlett's test rejects the null hypothesis, you can go ahead and extract the PCs.
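(A minimal sketch of Bartlett's sphericity test for the two-variable example above, assuming the statistic just stated; SciPy is used only for the chi-square tail probability, and the function name is illustrative.)

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    """Bartlett's sphericity test of H0: R = I (variables uncorrelated)."""
    p = R.shape[0]
    stat = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    p_value = chi2.sf(stat, df)
    return stat, df, p_value

R = np.array([[1.0, 0.987],
              [0.987, 1.0]])
stat, df, p_value = bartlett_sphericity(R, n=12)
print(stat, df, p_value)   # stat approx 34.7 >> chi2(1) critical value 3.84 -> reject H0
```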
Now, how many PCs should you keep, that is, the number of PCs to be retained? There are several criteria. One criterion is the cumulative percentage of variance explained. We have seen, and proved, that the total variance of the original data matrix equals Σ_{j=1}^{p} λ_j. So what you do is this: list the eigenvalues λ_1, λ_2, ..., λ_p in descending order and, alongside each, compute the cumulative value. In our example the first eigenvalue is 30.66 and the second is 0.03, so the first cumulative value is 30.66. As a percentage, 30.66 divided by the total 30.69 (we have only two eigenvalues in this example), multiplied by 100, is almost 99.9 percent. You go on like this: λ_1 over the sum of the λ_j, then (λ_1 + λ_2) over the sum of the λ_j, each multiplied by 100, and so on, and then you put a cut-off, say 90 percent. You keep as many components as are needed to reach that cumulative percentage of total variance explained; that is the first criterion.

With this criterion you may find that to explain, say, 90 or 95 percent of the total variability of x you have to retain a certain number of components, while the eigenvalues from some index m + 1 up to p are all almost equal and negligible. If I plot the eigenvalues λ_j against the component number, PC 1 to PC p, the first value is high, the second lower, the third lower still, and so on, and the 90 percent cumulative cut-off lands at some particular point; you also make a cumulative percentage column. Looking at the contributions, beyond some point the values are not significantly different from each other; each remaining component adds only a small contribution. In that case you may not insist on explaining 90 percent of the total variability; you may sacrifice a little, settle for some other total, say 80 percent, and keep correspondingly fewer components. There are other methods for doing this more systematically.

Another method is the average root. The average root is simply λ̄ = (1/p) Σ_{j=1}^{p} λ_j, and you keep those principal components whose eigenvalue is greater than λ̄. So if p components are extracted and λ̄ is computed, you check how many components, say k, have eigenvalues greater than λ̄.
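(To make these first two criteria concrete, here is a small illustrative sketch, not from the lecture, using the example eigenvalues.)

```python
import numpy as np

eigvals = np.array([30.66, 0.03])            # lambda_1, lambda_2 in descending order

# Criterion 1: cumulative percentage of total variance explained
cum_pct = 100 * np.cumsum(eigvals) / eigvals.sum()
print(cum_pct)                               # approx [99.9, 100.0] -> PC1 alone explains ~99.9%
keep_cumpct = np.argmax(cum_pct >= 90) + 1   # smallest number of PCs reaching the 90% cut-off
print("PCs to retain (90% rule):", keep_cumpct)

# Criterion 2: average root -- keep PCs with lambda_j greater than the mean eigenvalue
lam_bar = eigvals.mean()
keep_avg = np.sum(eigvals > lam_bar)
print("PCs to retain (average-root rule):", keep_avg)
```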
Those many will be kept, understood? There is another method known as Kaiser's rule. Kaiser's rule says that instead of S you use the correlation matrix R for extracting the PCs. The correlation matrix has all diagonal elements equal to one, with the correlations off the diagonal, so if it is p × p its trace is p: if I use the correlation matrix instead of the covariance matrix, the total variability equals the number of variables. And what is each variable's variability once you go for standardized variables? When you use the correlation matrix, what you are really doing is transforming each x_ij to (x_ij − x̄_j)/√(s_jj); if you then compute the covariance matrix of these standardized variables, it equals R. For every standardized variable the mean is 0, the standard deviation is 1, and the variance is 1, so each variable contributes a variability of exactly 1. Therefore, if any eigenvalue λ_j obtained from the correlation matrix is less than one, there is no need to keep that component, because it cannot explain even a single variable's worth of variability. So the rule says: keep those principal components whose eigenvalue is greater than or equal to 1. That is another criterion.

Student: I have one question regarding the average root. We are arranging the eigenvalues in descending order, so λ_1 is always greater than or equal to λ_2; it is then obvious that λ_1 must be greater than the average eigenvalue. It has to be. So is one component sufficient according to that criterion?
No, that is not the criterion. You take all those components whose eigenvalue is greater than λ̄, not just the first one; it is every component whose eigenvalue exceeds the average eigenvalue.
Student: OK.

Another criterion is the broken-stick method. Here you first find a quantity l_j = (1/p) Σ_{k=j}^{p} (1/k). The idea is like a stick broken at random into several pieces: the pieces can come out in any size, and that is why these 1/k terms appear. Then for each λ_j you compute its proportion of the total, λ_j / Σ_{j=1}^{p} λ_j, and you keep those components whose proportion is greater than l_j. Again the eigenvalues are in descending order, λ_1, λ_2, and so on. For the first component the sum runs over k = 1 to p; if l_1 is less than the proportion of variance explained by the first component, keep it, and similarly for the next ones. As you go forward, k changes: for the second component the sum runs from 2 to p, and so on. That is the broken-stick method, another criterion.
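(A sketch of the Kaiser rule and the broken-stick rule, again illustrative and not from the lecture. For the Kaiser rule the eigenvalues must come from the correlation matrix R; for the two-variable example with r = 0.987 they are 1 ± 0.987. The simple stop-at-first-failure loop for the broken stick is my own reading of the rule.)

```python
import numpy as np

def kaiser_rule(eigvals_R):
    """Keep components whose correlation-matrix eigenvalue is >= 1
    (each standardized variable contributes variance 1)."""
    return int(np.sum(np.asarray(eigvals_R) >= 1.0))

def broken_stick(eigvals):
    """Keep leading PCs whose proportion of variance exceeds the
    broken-stick threshold l_j = (1/p) * sum_{k=j}^{p} 1/k."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = len(eigvals)
    proportions = eigvals / eigvals.sum()
    thresholds = np.array([np.sum(1.0 / np.arange(j, p + 1)) / p
                           for j in range(1, p + 1)])
    keep = 0
    for prop, thr in zip(proportions, thresholds):
        if prop > thr:
            keep += 1
        else:
            break
    return keep

print(kaiser_rule([1.987, 0.013]))   # eigenvalues of R for the 2-variable example -> 1
print(broken_stick([30.66, 0.03]))   # -> 1 component retained
```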
One more popular criterion is the scree plot. In a scree plot you put the principal component number on one axis and the eigenvalue on the other, and plot each eigenvalue. Ultimately the curve takes an elbow-type shape, and the elbow is the point of interest. Think of the standard sitting posture: the forearm and the upper arm make roughly a 90 degree angle. You find the point where that kind of bend occurs, that is, where the elbow lies, and you take the number of principal components up to that elbow level; this is the usual practice. The reason is that beyond the elbow the curve becomes almost parallel to the horizontal, like the forearm: the remaining points all contribute about equally, and there is no real improvement from adding another principal component; if you added one you would have to add them all, because they sit at the same level. In our case there are only two variables, so two principal components were found, and only one component is enough; we do not require more. Any questions up to this point?

Student: With so many criteria, what should we use?
Which one do you want to follow? There are many, but ultimately each is talking about almost the same thing in one way or another. The cumulative percentage is the one you probably cannot ignore, because the original data set has a certain variability, a covariance structure, and what you are doing with PCA is explaining that covariance structure through transformed dimensions, transformed variables. If I am not able to explain the majority of the variability present in the original data, then it is not a good model. So the cumulative percentage is the first thing to look at, and then you have to decide what percentage of the total variability you must explain. You will sometimes find it very difficult to explain even fifty percent of the variability. The question is a trade-off: if I require twenty variables to explain 90 percent of the variability, but I can reduce to five dimensions and still explain 80 percent, then I will certainly go for the five dimensions and accept that loss; from 90 to 80 is only a 10 percent reduction, and that is acceptable. Then why are all the other criteria there? Basically, you run all of them and see which one favours you; but whatever you use, broken stick or Kaiser's rule or the scree plot, finally you have to see how much of the total variability you are able to explain. The scree plot is basically a visual representation of this.
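(Going back to the scree plot described above, a minimal matplotlib sketch, illustrative only.)

```python
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([30.66, 0.03])                 # eigenvalues in descending order
pcs = np.arange(1, len(eigvals) + 1)

plt.plot(pcs, eigvals, "o-")
plt.xlabel("Principal component number")
plt.ylabel("Eigenvalue (lambda_j)")
plt.title("Scree plot -- retain components up to the elbow")
plt.xticks(pcs)
plt.show()
```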
Another important question is whether we should use S or R, that is, the covariance matrix or the correlation matrix. If you use S and if you use R you will get different results, different regions in the transformed space; it all depends on what you want. Suppose you are measuring the variables in different units and the variabilities are widely different. In that case, if your interest is in the pattern of the relationships rather than their strength, I would ask you to go for the correlation matrix, meaning that whatever axes and dimensions you obtain are in terms of the correlation matrix. The eigenvectors will then be created from R, and they reflect the pattern of the relationships, not the strength. If the strength is equally important, then you go for the covariance matrix, because the covariance stays in the original domain; when you go for R you are working with standardized variables. So you have to decide which you want.

The final one, as I told you, is the hypothesis test that Bartlett developed in 1950. Suppose that by the traditional methods, the scree plot, broken stick and the rest, you retain m of the p components, that is, you discard p − m components. I want to test whether the discarded components collectively contribute significantly to explaining the variability of the original data matrix. In that case the null hypothesis is λ_{m+1} = λ_{m+2} = ... = λ_p, which means the tail of the scree plot is horizontal; that is what you are testing. The alternative hypothesis is that they are not all equal, that is, at least one pair differs. The statistic is of the form D = n[(p − m) ln λ̄_m − Σ_{j=m+1}^{p} ln λ_j], where λ̄_m is the average of the discarded eigenvalues, and this quantity approximately follows a chi-square distribution with the corresponding degrees of freedom; you find the value from the chi-square table and take your decision accordingly.

Student: Should we then simply accept all the λ's that are greater than λ̄_m?
No, it is not that straightforward. If the discarded eigenvalues really were equal, λ̄_m would equal each individual eigenvalue; but in a sample you will not get exact equality, there will be small differences. If you just check whether, say, λ_{m+1} is greater than λ̄_m, you are going purely by the point estimate. What the test says is that you should check whether the difference is significant: if λ_{m+1} − λ̄_m is significantly different, then you keep that component; otherwise it can be discarded. That is why the sampling distribution is so important. Using this distribution, you first compute the statistic, then, knowing the chi-square degrees of freedom, go to the table and find the critical value; if the D value is greater than it, you reject the null hypothesis H_0 that all the discarded eigenvalues are equal.
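(The lecture only sketches this last statistic, so the function below follows the common Bartlett-type form D = n[(p − m) ln λ̄_m − Σ_{j>m} ln λ_j] with ½(p − m + 2)(p − m − 1) degrees of freedom; treat the exact constants and degrees of freedom as an assumption to be checked against a textbook. The eigenvalues in the usage line are made up purely for illustration.)

```python
import numpy as np
from scipy.stats import chi2

def bartlett_equal_tail_eigenvalues(eigvals, m, n, alpha=0.05):
    """Test H0: lambda_{m+1} = ... = lambda_p (the discarded eigenvalues are equal),
    using D = n * [(p - m) * ln(mean of tail) - sum of ln(tail)]."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = len(eigvals)
    tail = eigvals[m:]                       # the p - m discarded eigenvalues
    lam_bar = tail.mean()
    D = n * ((p - m) * np.log(lam_bar) - np.sum(np.log(tail)))
    df = 0.5 * (p - m + 2) * (p - m - 1)
    crit = chi2.ppf(1 - alpha, df)
    return D, df, D > crit                   # True -> reject H0 (tail not all equal)

# Illustrative (made-up) eigenvalues: test whether the last three are equal
D, df, reject = bartlett_equal_tail_eigenvalues([6.2, 2.1, 0.4, 0.35, 0.3], m=2, n=100)
print(D, df, reject)
```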
So these are the tests, and there is nothing problematic about them: they are all simple to calculate, although some, such as Bartlett's tests, are difficult to derive; from the application point of view they are very simple. When we come to the multivariate domain, you will find that a large number of statistics are used to test the same thing. The reason is that all of these statistics are based on certain assumptions, and in raw data there will be deviations from reality; when you collect data, you will rarely find that they exactly follow the assumptions of the statistics or the models you are using. So you use several such statistics and see whether all of them, or at least the majority, favour the same conclusion; if the majority favours H_0, go with H_0, otherwise do not. Yes, Rahul, any question? No, sir. So I think we will stop here, and in the afternoon I will show you, on the given data, how to carry out multiple regression and multivariate regression, and then principal component analysis; we will also work through a very simple example data set, and once I show it, things will click and it will be easy for you to use the software. Thank you very much.
Info
Channel: nptelhrd
Views: 13,626
Keywords: PCA -- Model Adequacy & Interpretation
Id: nu2boMTKOFA
Length: 58min 46sec (3526 seconds)
Published: Fri May 09 2014