Understanding and Applying Factor Analysis in R

Captions
Hey everyone, welcome back to my channel, where I talk about all things tech and finance. In this video I'm going to go over the theoretical and applied aspects of factor analysis. Principal component analysis (PCA) and factor analysis are really quite similar, so throughout this video I'll also be comparing and contrasting the two methods.

Factor analysis essentially tries to model the variance and covariance within your overall data set — you are essentially trying to find the relationships within your data. This is slightly different from PCA in that factor analysis can do the same things principal components can, but not vice versa. While PCA is a bit more exploratory, factor analysis requires more assumptions and is more stringent about them, such as normality and linear relationships among the variables. The general rule of thumb when using factor analysis — and, in fact, many other models out there — is to have at least 100 observations in your data set, but in general more data is better. The primary assumptions of a factor analysis model are the following: the factors are centered; the factors are independent of one another; the factors and the errors are independent; and the factors are randomly distributed. While PCA captures variance pretty well, it does not do a great job of identifying the covariance within your data set, and this is essentially where we can use factor analysis.

You should be aware of a few terms and their meanings when working with factor analysis. Communality ranges from 0 to 1 and is sometimes interpreted like an R-squared measure. A pattern is an estimate of the weights in a factor analysis; you can think of a pattern as the coefficients in a regression-type model. A loading, otherwise known as a factor loading, is basically the correlation coefficient between a variable and a factor; it shows the variance in that variable explained by that particular factor. Most loadings on any factor should be small, whereas a few should be relatively large — a "large" factor loading is often taken to be above 0.7, but this is mostly subjective. Any specific row of the loadings matrix should display non-zero loadings on only a few factors, and any pair of factors should have different patterns of loadings. Factor scores are derived from the factor loadings by applying an estimation method; the values can be found using a variety of methods, such as the regression method, the Bartlett method, or the Anderson-Rubin method. More about these methods and how they are calculated is in the description down below.

The rotation step is actually incredibly important when you are working with factor analysis. Rotation does not change the distance metrics within your overall data set; all it really does is redistribute the loadings to give you a different view of the data while maintaining the variance. Similar to PCA, factor analysis shares many rotation techniques, such as quartimax, equamax, oblimin, and so on — there are many other rotation methods out there for factor analysis. Since factor analysis is based on eigenvalues and eigenvectors, the same principles from PCA apply, as seen in my PCA video (the link is in the description). There are also a few diagnostics unique to factor analysis, involving chi-square tests and model comparisons such as AIC or BIC values, and in my demo I will briefly go over these topics in more depth.
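Before the demo, here is a minimal sketch of those scoring methods using the psych package on simulated placeholder data; the data, factor count, and rotation here are assumptions for illustration, not taken from the video:

```r
library(psych)

set.seed(1)
dat <- matrix(rnorm(100 * 6), ncol = 6)   # placeholder: 100 observations, 6 variables
colnames(dat) <- paste0("v", 1:6)

# Fit a 2-factor model; scores = "regression" requests regression-method scores
fit <- fa(dat, nfactors = 2, rotate = "varimax", fm = "ml", scores = "regression")
head(fit$scores)

# The same fit scored with the other two methods mentioned above
head(factor.scores(dat, fit, method = "Bartlett")$scores)
head(factor.scores(dat, fit, method = "Anderson")$scores)  # Anderson-Rubin
```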
So this demo is largely going to be an interpretation rather than an execution, just based on the amount of output that is generated. The data I'll be using comes from a book called the Places Rated Almanac by Richard Boyer and David Savageau — sorry if I mispronounced his name — and I'll make sure I put the link where I retrieved the data into the description for you to download.

I have a few libraries here: the psych package has the factor analysis function, and readxl has the read_excel function, which I will use to load in my data. Let's check that out. This is the data set I will be working with. There are a few features that will be taken out — I'll be dropping the non-numeric columns, because I want my data set to be largely a numeric matrix. Let's go ahead and do that, and then do a head on the data. This is what we will be working with. Notice that the id column here is just a sequential value iterated through however many rows there are, so we don't really have to take it into consideration when we interpret the factor analysis output.

I'll be passing the data into the factor analysis function as the r argument. The r argument can take many different values: your raw numeric data matrix, or the correlation or covariance matrix computed from that raw matrix — the output should largely be the same either way, so you don't really have to worry about that. There are three factors that I want to extract — three loadings — and I'll be using the varimax rotation. There are a variety of rotation methods you could use instead, such as quartimax, bentlerT, equamax, and so on, so definitely check out the different rotation methods. There is also a specific factoring method to choose; in this case it's going to be principal axis factoring, and while there are other factoring methods available, I'll largely just be comparing principal axis against maximum likelihood.

Let's run these two fits real quick and get the output. One thing I automatically go check is the cumulative variance. For the principal axis fit I have 0.45; for the maximum likelihood method, the cumulative variance is 0.48. Since the cumulative variance of the maximum likelihood method is larger than that of principal axis factoring, I'll largely just be working with the ML fit. Once we've established that we'll be working with one specific method — and note, if you're working with your own data set, whether in industry or academia, I highly recommend you check out both methods, because there's always a pro to the other's con.

All right, so let's actually check out the loadings. ML2, ML1, and ML3 are the loadings, corresponding to the number of factors we requested, and one thing we want to get out of this factor analysis is to identify which features correspond to which loading. I went ahead and already identified which specific features are assigned to each loading — or, in general, whichever grouping makes the most sense — so that we can assemble those features into one named factor that represents them.
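Here is a hedged sketch of that setup, assuming the psych and readxl packages; the file name and the numeric-column filtering are placeholders, since the exact workbook isn't shown:

```r
library(psych)    # fa() performs the factor analysis
library(readxl)   # read_excel() loads the workbook

places <- read_excel("places_rated.xlsx")               # placeholder file name
dat <- as.matrix(places[, sapply(places, is.numeric)])  # keep the numeric columns

# Three factors, varimax rotation, two factoring methods to compare
fit_pa <- fa(r = dat, nfactors = 3, rotate = "varimax", fm = "pa")  # principal axis
fit_ml <- fa(r = dat, nfactors = 3, rotate = "varimax", fm = "ml")  # maximum likelihood

# Cumulative variance explained sits in the Vaccounted table
fit_pa$Vaccounted["Cumulative Var", ]
fit_ml$Vaccounted["Cumulative Var", ]
```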
For instance, let's begin with ML2. I've identified health, education, arts, and transportation as belonging to this particular loading, and I named it the quality-of-life factor. Let's check them out: health is 0.98, education is 0.5, arts is 0.84, and transportation is 0.44. The method behind this madness, in terms of identifying which features belong to which loading, is to scan across each variable's row and see which value is the highest across all the factors. Since 0.98 is the largest value in health's row, we can assign health to ML2; similarly, transportation's 0.44 is its maximum here, so we associate it with ML2, and so on and so forth. I went ahead and did the same with ML1, the second factor, and with ML3, which is the crime factor — crime loads at 0.9. In this way you can assign each loading a specific category or name that explains its features. As you might know, this can be highly subjective — both which variables get associated with which loadings and what name or categorization you attach to each loading.

One thing to keep in mind when deciding which features are assigned to which factor loadings is that the factors have to explain different things, and they can't be directly contrasting views: you can't have one factor explain happiness while another explains sadness. You also want to make sure your factors are not explaining similar things, such as happiness and something very close to happiness; you don't want that either. So you want each factor to describe something distinct, but not a direct opposite of another factor. This is where a little bit of your subject matter expertise comes into play, and where the data scientist's own views and opinions start shaping the interpretation of the data set.

Let's go on to the next step: the h2 and u2 columns. Very simply, the h2 value is how much of a variable's variance is explained by the given factors. House here is 0.995, which means a lot of its variance is explained by these three factors. Conversely, u2 is just the complement of h2 — one minus h2 equals u2 — and it is the amount of variance not explained by our factor loadings. Given this output, what can we say is the feature best explained by the factor analysis? That would be house, because house has the highest h2 value at 0.995; a very close second is 0.982, but house is the variable most fully explained by our factors. So how do you find the best-explained feature? Really quite simply: you look at the h2 terms, and whichever one is closest to one is the best-explained term — in this case, house. (A short sketch of pulling these columns out of the fit follows below.)
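Continuing the sketch from above (fit_ml is the assumed maximum likelihood fit), here is one way to inspect the loadings, communalities, and uniquenesses:

```r
# Print the loadings, hiding small values so the variable-to-factor
# assignments are easier to eyeball
print(fit_ml$loadings, cutoff = 0.4)

# For each variable, which factor carries its largest absolute loading?
apply(abs(unclass(fit_ml$loadings)), 1, which.max)

# Communality (h2) and uniqueness (u2); note that u2 = 1 - h2
round(cbind(h2 = fit_ml$communality, u2 = fit_ml$uniquenesses), 3)
```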
Conversely, if you want to check which feature is not explained by the factors at all, that is the contrasting question: the highest u2 is 0.987, but we know that the id column is largely just an index, so let's take the second highest, 0.857, which is economy. The factors do not explain a good chunk of the variance in economy, so that is one feature we would take out of our model, because that fit is pretty horrendous.

Last but not least we have the com term, which stands for complexity. The com term indicates how many factors contribute to a specific variable, and if we want one factor to underlie one variable, then the desired com value is very close to one.

As we scroll down the output, we have the explained variance, the mean item complexity I mentioned earlier, and some other statistics related to the chi-square values, which largely help with whatever hypothesis testing we want to complete. We have the root mean square of the residuals, close to 0.05, the degrees-of-freedom-corrected root mean square residual, and so on and so forth. One thing to note here: for the BIC value we can take the absolute value, which would be 29.72, and if we compare this with the PA fit — the principal axis factoring — we have 3.25. So this can be a different justification for using principal axis factoring instead of the maximum likelihood method: you don't necessarily have to focus only on the cumulative variance, because based on this statistic the PA method is going to be a better fit for your given data set. You would then do exactly the same thing when assigning a specific category to each loading, and then go on with your statistics and test your hypotheses. (A final sketch of extracting these fit statistics follows the wrap-up below.)

All right, that pretty much wraps up what I have for factor analysis. I went in depth into how you can interpret many of these terms, outputs, and uses — real-world stuff, really quite interesting, really interpretable, and really, really fun. I hope you enjoyed this video; make sure you leave a like and subscribe, hit that notification button, and I hope to see you in the next one. Thank you so much for watching!
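As referenced above, here is a last hedged sketch of extracting the complexity and BIC statistics from the two assumed fits (fit_pa and fit_ml from the earlier sketch); smaller |BIC| is read as the better fit, as in the video:

```r
# Complexity per variable: values near 1 mean a single factor dominates it
round(fit_ml$complexity, 2)

# Compare the two factoring methods on BIC
bics <- c(pa = fit_pa$BIC, ml = fit_ml$BIC)
bics
names(which.min(abs(bics)))  # method with the smaller |BIC| wins by this criterion
```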
Info
Channel: Spencer Pao
Views: 3,954
Rating: 5 out of 5
Keywords: Rstudio, Factor Analysis, Theory, h2 and u2, Complexity, Model Decisions, FA
Id: kbJMz0KzMnI
Length: 14min 37sec (877 seconds)
Published: Sun Feb 28 2021