Applied Principal Component Analysis in R

Video Statistics and Information

Captions
Hey guys, Spencer here. In this video we'll be going over the theory and application of principal component analysis. So what is principal component analysis? It's essentially a dimension reduction technique that condenses your features down while capturing as much variance as possible with as few features as possible. I wrote a mini slide deck on what principal component analysis is, so without further ado, let's get down to it.

First things first: PCA, or principal component analysis, maximizes the variance explained while minimizing the number of features, or columns, that you might have. It uses a variety of statistical and mathematical procedures. One of the more popular ways of condensing your data is eigendecomposition, using eigenvalues and eigenvectors, and it uses these to find a linearly uncorrelated feature set, also known as the principal components. You want to do this to identify features that are not relevant to the overall model, and to make sure that most of your features, if not all of them, are linearly uncorrelated, so there's no multicollinearity. A more robust technique is the SVD, or singular value decomposition; it's just another way of decomposing your data and is considered a bit more numerically robust. PCA essentially takes your features and finds a new coordinate system for them: it projects the observations, known as scores in this case, onto a grid whose axes are the principal components, and tries to find the best fit. You're essentially fitting a new coordinate plane to your data.

So what are the goals of PCA? One of the primary reasons we use it is to drastically reduce our data: it takes a high-dimensional data set and gives it a lower-dimensional representation while losing as little variance as possible. Essentially, you're trying to maximize the variance explained while minimizing the number of features, to make the model more explainable and less complex. It extracts the important features of the data and finds a few components that explain most of the variance.

In this tutorial I'll be going over three simple ways to conduct a PCA. I went ahead and already loaded my data set, teeth, so let's take a look at what it contains. It's not really that much: I have mammals and the types of teeth they have, top incisors, bottom incisors, canines, premolars, molars, and so on, plus the total number of teeth for each animal. I'm going to call the feature subset teeth.pc; it just extracts the corresponding variables. We have ten columns, and we won't be including the totals, just top incisors through bottom molars, so columns 2 to 9. Let's take a look at that; this is the data set we'll be working with. Okay, so there's a really neat function, princomp; I believe it's built into R, so I don't have to load in anything extra.
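A rough sketch of that setup in R (the file name teeth.csv and the exact column layout are my assumptions here; adjust them to match your own copy of the data):

    # Load the teeth data: column 1 = mammal name, columns 2-9 = tooth counts,
    # column 10 = total number of teeth (assumed layout).
    teeth <- read.csv("teeth.csv", stringsAsFactors = FALSE)
    head(teeth)

    # Keep only the eight tooth-count features; drop the name and the total.
    teeth.pc <- teeth[, 2:9]
    str(teeth.pc)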
But yeah, let's compute the principal components. We'll call the result pc.teeth and run princomp() on the data set teeth.pc, shrinking these eight features down to eight principal components, and we'll set cor = TRUE so the analysis is based on the correlation matrix. Let's take a look at pc.teeth: it shows the components and the standard deviation associated with each one. Now let's check what information is available in this output: names(pc.teeth) lists the standard deviations, the loadings, which are essentially the eigenvectors, the center, scale, number of observations, scores, and the call. Each of these is a piece we can pull out of pc.teeth.

Let's do a quick summary and see what it looks like. We get the proportion of variance for each component and the cumulative proportion, which just accumulates the variance and should add up to one, since the full variance is explained by all eight components. Later on we'll figure out what the "right" number of components is for a reduced data set.

Last but not least, we can get the eigenvalues and eigenvectors. If you're taking a class, for instance, you'll probably be asked what the eigenvalues and eigenvectors of your model are, so this is how you would do it: the eigenvectors are pc.teeth$loadings, and the eigenvalues are pc.teeth$sdev multiplied by itself, so the standard deviations squared. Note that the loadings are scaled so that their sums of squares equal one.

Now we're going to find the correlation between our original data and the principal component scores. The scores come from the principal component axes: we're essentially fitting a new grid onto the projected data, and the scores are where each observation lands on that grid. So we take teeth[, 2:9], compute the correlation against pc.teeth$scores, and round the result to three decimal places. This correlation matrix between the original variables and the principal components is incredibly useful for spotting any relationship between the components and the actual data: the closer a value is to one, the more closely that principal component is related to that variable in the data set.
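Here is roughly what those princomp() steps look like in code (a sketch of the base-R workflow described above, assuming the teeth and teeth.pc objects from earlier):

    # Principal components on the correlation matrix
    pc.teeth <- princomp(teeth.pc, cor = TRUE)
    pc.teeth
    names(pc.teeth)      # sdev, loadings, center, scale, n.obs, scores, call
    summary(pc.teeth)    # proportion and cumulative proportion of variance

    # Eigenvectors and eigenvalues
    eigenvectors <- pc.teeth$loadings
    eigenvalues  <- pc.teeth$sdev^2

    # Correlation between the original variables and the component scores
    round(cor(teeth[, 2:9], pc.teeth$scores), 3)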
One of the really neat tools for deciding how many components to keep is the scree plot. This method is somewhat subjective, because we look at the graph for an elbow: the point where the curve bends is roughly the number of components we'll want to use, and identifying that elbow is very much a judgment call. There is a more objective rule for reading a scree plot, in fact, but we won't go into the details of what that is. For the scree plot we call screeplot() on pc.teeth, draw it as a line graph, and set the main title to "Scree plot for teeth data". This is what I was talking about in terms of the elbow to look for. Let's also add abline(1, 0) to draw a horizontal reference line at a variance of one, color it red, and make it dashed with lty = 2. Looking for the elbow, there's one right about here, or here, and it's pretty much either one, so you can decide how many components to carry into further analysis, whether that's principal components regression or something else. In this case I'll use two components, which looks like a great elbow; you could also argue for three, but in general we aim to keep the components whose variance is around one or above.

Let's plot the components we have: this is a scatter plot of the scores on principal component one versus principal component two, and I added text labels so each point shows the mammal associated with that observation. It's really cool to visualize how everything is being reduced.

Another way of conducting the analysis is to use a different function. We'll call the result pc.fit and use prcomp(), which essentially does everything we have already done. prcomp() is a really nice function because there's no dependent variable to reduce on; we're just trying to minimize the number of features, so we pass in the features from the original data set and, of course, scale them. Last but not least, the eigenvalues here are pc.fit$sdev multiplied by itself, and the eigenvectors are pc.fit$rotation. The rotation becomes more interesting when we go deeper into how the axes are fitted to the projected data; there are a variety of methods, and we'll get to that very soon. A quick summary(pc.fit) gives you the principal components with the corresponding variance, showing how much of the variance is explained by however many components you choose to use. That's essentially the point of principal components: figure out how many components you need to map your data with less complexity and a faster runtime.
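The scree plot, score plot, and prcomp() steps might look like this (a sketch; the plot labels and the assumption that the mammal names sit in the first column of teeth are mine):

    # Scree plot with a horizontal reference line at variance = 1
    screeplot(pc.teeth, type = "lines", main = "Scree plot for teeth data")
    abline(1, 0, col = "red", lty = 2)

    # Scores on the first two components, labelled by mammal name
    plot(pc.teeth$scores[, 1], pc.teeth$scores[, 2],
         xlab = "PC1", ylab = "PC2", main = "Teeth data scores")
    text(pc.teeth$scores[, 1], pc.teeth$scores[, 2],
         labels = teeth[, 1], pos = 3, cex = 0.7)

    # Same analysis with prcomp(); scale. = TRUE standardises the variables
    pc.fit <- prcomp(teeth.pc, scale. = TRUE)
    summary(pc.fit)
    eigenvalues  <- pc.fit$sdev^2    # eigenvalues
    eigenvectors <- pc.fit$rotation  # eigenvectors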
The last method I'll show uses the principal() function from the psych package, so let's load that library really quick; it takes barely a second. We'll call the result pc2, why not, and run principal(), if I can spell it correctly, on our teeth.pc data. A really neat argument we can pass is nfactors, the number of components we want returned, which can't be greater than the number of features in the original data set. Keep in mind that teeth.pc has, let me count, one, two, three... eight features, so within that limit we can ask for two factors to be returned, or three, or four, you name it; it's essentially the same output as what we had before, that's all there is to it. Another really cool parameter, which here I'm setting to "none", is rotate. It has many options, and it's essentially a different way of fitting your coordinate system onto the projected values, where the projected values are the calculated scores. The rotation methods vary and you can use a variety of them: one of the more popular ones is varimax, but there are also methods like quartimax, oblimin, simplimax, and so on. You can read more about these on your own by running ?principal, which lists all the rotation methods you can use to orient the axes and get the best scores. Let's run this and look at pc2: it shows PC1 and PC2, since we set nfactors to two, along with the variance explained, and you'd typically just look at the cumulative proportion here. With only two components we're missing about 22 percent of the variance in our data. You can always change that: set nfactors to three and you get three components instead, which explains about 87 percent of the variance in total.
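A sketch of the psych::principal() calls described above (the nfactors and rotate values shown are just the ones discussed; see ?principal for the full list of rotation methods):

    library(psych)

    # Two components, no rotation (try rotate = "varimax", "quartimax", "oblimin", ...)
    pc2 <- principal(teeth.pc, nfactors = 2, rotate = "none")
    pc2    # loadings plus proportion / cumulative variance explained

    # Three components capture more of the total variance
    pc3 <- principal(teeth.pc, nfactors = 3, rotate = "none")
    pc3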
Info
Channel: Spencer Pao
Views: 4,527
Rating: 4.9375 out of 5
Keywords: Principal Component Analysis, PCA, Rstudio, Statistics, Applied, Theory, Walkthrough, Guide
Id: uNJBBpyss50
Length: 15min 31sec (931 seconds)
Published: Sun Sep 27 2020