Principal Component Analysis (PCA): Illustration with Practical Example in Minitab

Video Statistics and Information

Captions
Hello friends. In the last video on multivariate analysis, we saw an introduction to multivariate analysis, some of the important concepts used in it, and an overview of its various tools and techniques. In this video, we are going to learn the first tool in multivariate analysis, using Minitab software and a practical example for easy understanding and better clarity. So, let’s begin…

Principal Components Analysis:
Principal components analysis is used to identify a smaller number of uncorrelated variables, called "principal components", from a large set of data. With this analysis, you create new variables (principal components) that are linear combinations of the observed variables. The goal of principal components analysis is to explain the maximum amount of variance with the fewest principal components.

For example, a bank requires eight pieces of information from loan applicants: income, education level, age, length of time at current residence, length of time with current employer, savings, debt, and number of credit cards. A bank administrator wants to analyze this data to determine the best way to group and report it. The administrator collects this information for 30 loan applicants. Here, the administrator performs a principal components analysis to reduce the number of variables and make the data easier to analyze. The administrator wants enough components to explain at least 90% of the variation in the data.

Data considerations for Principal Components Analysis:
To ensure that your results are valid, consider the following guidelines when you collect data, perform the analysis, and interpret your results. In the case of principal components analysis, there is only one data requirement: you should have at least two variables, and the measurements for each variable should be recorded in separate numeric columns.

Example of Principal Components Analysis:
Let’s continue with the same example.
A bank requires eight pieces of information from loan applicants: income, education level, age, length of time at current residence, length of time with current employer, savings, debt, and number of credit cards. A bank administrator wants to analyze this data to determine the best way to group and report it. The administrator collects this information for 30 loan applicants. Here, the administrator performs a principal components analysis to reduce the number of variables and make the data easier to analyze. The administrator wants enough components to explain at least 90% of the variation in the data.

Conduct Principal Component Analysis (PCA) in Minitab:
To conduct a principal components analysis in Minitab, please follow these steps:
1. Enter or copy the data into a Minitab worksheet, with the data for each variable in its own column, as shown in the picture.
2. Select Stat > Multivariate > Principal Components.
3. In Variables, enter C1-C8.
4. In Number of components to compute, leave the field blank. This field is where you enter the number of principal components that you want Minitab to calculate. If you have a large number of variables, you may want to specify a smaller number of components to reduce the amount of output. If you do not know how many components to enter, you can leave this field blank, as we do here.
5. In Type of Matrix, keep the default selection, Correlation. This option sets the type of matrix used to calculate the principal components:
• Correlation: Use this when your variables have different scales and you want to weight all the variables equally. Our example falls in this category.
• Covariance: Use this when your variables use the same scale, or when your variables have different scales but you want to give more emphasis to variables with higher variances.
6. Under Graphs, select the graphs you want to see for the analysis:
• Scree plot: Use a scree plot to identify the number of components that explain most of the variation in the data.
• Score plot for the first 2 components: Use the score plot to look for clusters, trends, and outliers in the first two principal components.
• Loading plot for the first 2 components: Use the loading plot to visually interpret the first two principal components.
• Biplot for the first 2 components: Use the biplot to look for clusters, trends, and outliers through the interpretation of the first two principal components. The biplot overlays the score plot and the loading plot on the same graph.
• Outlier plot: Use the outlier plot to identify outliers in the data.
7. Click OK in each dialog box to get the results. The results of the analysis appear in the Session window and in the Graph window.

Interpretation of Results:
In these results, use the cumulative proportion to determine the amount of variance that the principal components explain. Retain the principal components that explain an acceptable level of variance. The acceptable level depends on your application. For descriptive purposes, you may only need 80% of the variance explained. However, if you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the principal components. This is the case in our example. The first four principal components explain 90.7% of the variation in the data. Therefore, the administrator decides to use these components to analyze loan applicants.

You can also use the size of the eigenvalue to determine the number of principal components. Retain the principal components with the largest eigenvalues, that is, eigenvalues greater than 1.

The scree plot orders the eigenvalues from largest to smallest. The ideal pattern is a steep curve, followed by a bend, and then a straight line. Use the components in the steep curve before the first point that starts the line trend.

The loading plot visually shows the results for the first two components.
Age, Residence, Employ, and Savings have large positive loadings on component 1, so this component measures long-term financial stability. Debt and Credit Cards have large negative loadings on component 2, so this component primarily measures an applicant's credit history.

Use the outlier plot to identify outliers. Any point that is above the reference line is an outlier. Outliers can significantly affect the results of your analysis. In these results, there are no outliers. All the points are below the reference line.

The first principal component accounts for 44.3% of the total variance. The variables that correlate the most with the first principal component (PC1) are Age (0.484), Residence (0.466), Employ (0.459), and Savings (0.404). The first principal component is positively correlated with all four of these variables. Therefore, increasing values of Age, Residence, Employ, and Savings increase the value of the first principal component.
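
For viewers who want to try the retention rules outside Minitab, here is a minimal Python sketch of a correlation-matrix PCA. It is not the Minitab procedure itself and is not expected to reproduce the numbers quoted in the video. It assumes NumPy is installed and uses randomly generated stand-in data (30 rows, 8 columns), because the bank data set is not included in this transcript. The function names `pca_correlation` and `n_components_for` are hypothetical and chosen only for illustration.

```python
import numpy as np


def pca_correlation(X):
    """PCA on the correlation matrix of X (rows = observations, columns = variables)."""
    R = np.corrcoef(X, rowvar=False)               # correlation matrix of the columns
    eigenvalues, eigenvectors = np.linalg.eigh(R)  # eigh returns ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]          # reorder from largest to smallest
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]          # columns hold each component's coefficients
    proportion = eigenvalues / eigenvalues.sum()   # share of total variance per component
    return eigenvalues, eigenvectors, proportion, np.cumsum(proportion)


def n_components_for(cumulative, target):
    """Smallest number of components whose cumulative proportion reaches target."""
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative))


# Stand-in data only (an assumption): 30 "applicants", 8 correlated numeric columns.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.5 * rng.normal(size=(30, 8))

eigenvalues, eigenvectors, proportion, cumulative = pca_correlation(X)

k90 = n_components_for(cumulative, 0.90)    # rule 1: explain at least 90% of the variance
k_kaiser = int(np.sum(eigenvalues > 1.0))   # rule 2: keep eigenvalues greater than 1

# Scores: standardize each column, then project onto the component coefficients.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores = Z @ eigenvectors

print("Eigenvalues:          ", np.round(eigenvalues, 3))
print("Cumulative proportion:", np.round(cumulative, 3))
print("Components for 90%:   ", k90)
print("Eigenvalues > 1:      ", k_kaiser)
```

Working from the correlation matrix mirrors the Correlation choice in step 5: every variable is standardized, so the eigenvalues sum to the number of variables. The two rules can suggest different numbers of components, which is why the acceptable level depends on your application.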
Info
Channel: LEARN & APPLY : Lean and Six Sigma
Views: 27,710
Rating: 4.9089432 out of 5
Keywords: Multivariate Analysis, Multivariate Tools, Multivariate Analysis in Minitab, Principal Components Analysis, Principal Components Analysis in Minitab, PCA, Principal Component Analysis with Example, PC
Id: f0_UWY3R8CY
Length: 9min 35sec (575 seconds)
Published: Sat Feb 29 2020