SPSS for newbies: Exploratory factor analysis (principal components)

Video Statistics and Information

Captions
Hi everyone. How I've ended up making this video is that, leading up to Christmas, I had a guest stay with me who turned out to be an online dating expert. Well, she'd written thousands of pieces for the media, and she dropped the names of a few online dating sites that she thought were good. So I checked them out, and what I found is that some of these dating sites have personality questionnaires that you fill in, and then they match you by your responses. Some of the people on these sites talked about themselves as being easygoing, funny, intelligent, articulate, romantic and even kinky. But what did the questionnaires say about them?

So this is a case where we can use exploratory factor analysis. The idea is that I've got measurements, which may come from questionnaires or may not, but let's focus on questionnaires because I think a lot of newbies will be using it for questionnaires. These questions are there to probe aspects of people's personality, and the questions, otherwise known as items, will be interrelated, some of them. So some of them can be concerning intelligence, others could be concerning the degree of romanticism, and so on. So you've got loads and loads of questions. What factor analysis can do, then, is summarize those loads and loads of questions into a few factors. A factor from this personality questionnaire could be something like intelligence, or degree of articulateness, and stuff like that. So that's what it can do: summarize lots and lots of questions by a few factors, which represent the interrelatedness, the correlation, among these items.

Once we've got these factors, we can create what are called the estimated factor scores for each of the attributes, like degree of intelligence, degree of being funny, romantic and so on. So we can classify people by degrees of these kinds of traits. So that's one use,
and I think that would be the major use for newbies. We can then take those scores and just do summary stats on them, like report the mean and standard deviation or do plots, or we could take them into other types of deeper analysis such as regression or ANOVA.

That's not the only use of factor analysis. We can use it to validate a scale. We can use it to check the unidimensionality of a scale, which is necessary for running Cronbach's alpha. But I'm not really concerned with those other things I've just mentioned, which to me is just lots and lots of jargon. I'm really focusing on finding these factors, and then you can use these factors to do whatever you wish.

When you're reading around, you're going to find there are two types of factor analysis: exploratory and confirmatory. We are going to focus on exploratory. I don't need to spell out what these types are; I'm just telling you so that, when you're reading around, you can forget anything that's talking about confirmatory factor analysis for the purpose of this video, which is on exploratory factor analysis (EFA).

Another reason, really, why I'm making this video on EFA is that there are a lot of murky things about it, a lot of gray areas that are vague. Sometimes it's a bit like playing with tarot cards: you get what you want, especially in how some people teach it. But being newbies, we'd appreciate a recipe, simple steps that we can follow to steer as clear as we can of all these murky things. So the steps are what I'm going to discuss next, and then finally I'm going to do an example of EFA on personality data.

OK, the steps are laid out in my flow chart, which I know some of you appreciate. Just like with some other kinds of analysis, we have conditions and assumptions that we need to check first. If I don't tell you about these assumptions, or the more important ones, it's a bit like selling you a dodgy car, all
right? I just show you the good bits and I don't tell you about the faulty bits. So while we're together the car might be fine, but when you drive off you might find that it doesn't start. So I don't feel like I can tell you about EFA without telling you a bit about the assumptions, otherwise I'm a bad salesman, aren't I?

So the classical exposition of factor analysis is based on the idea that, if we still think about the questionnaire, your variables are continuous and normally distributed, or you could just say jointly normally distributed, because that implies they must be continuous. Obviously with questionnaire data it's not going to be like that. We're going to have mixtures: we could have continuous, we could have ordinal because we could have Likert scales, and we could have nominal as well. Now if you look around the literature, the statisticians will tell you that you cannot apply EFA, as SPSS does it, to ordinal and nominal data. But if you read outside the stats literature, like how psychologists use it, they'll just say this is where we pretend the ordinal variables are continuous. So being newbies, let's accept that. But if you're going to do a more serious analysis, just know that if you treat ordinal as if it were continuous, then you are likely to come up with more factors than are actually present. If you bear that in mind, then even if you have ordinal data you can just cut down on the factors, or be a bit more cautious about choosing more factors, because you know that if you use Likert scales you are more likely to come up with more factors than are actually there. And this has been shown in simulations.

Nominal: this is where the variables, at that level of measurement, have no ordering. If it's just yes or no answers, what's called binary or dichotomous, that's accepted, but not if it's on more than two levels. So for
example, religious belief: Christian, I don't know, Buddhist... no, it's a bad example... and another one. OK, so three levels; then this wouldn't work. There are ways around that, but for a newbie, just know that hopefully you've set it up so that you've only got continuous and ordinal, and maybe nominal if it's just two categories.

So the next thing we have to know is that factor analysis is a large-sample kind of procedure. This is where you get different numbers depending on who's teaching it to you. Some say a minimum of 50 respondents, or what are called cases; in other words, 50 people have responded to the questionnaire. Some people say you want at least 100. I guess the number really depends on how many factors you're going to try to get out of the thing. But just know that we want our sample size to be large.

The other thing is that we've got to have enough questions as well to make up these factors. Again, depending on what you read, some people say you should have three to five variables minimum per factor. So if you expect there to be a factor measuring the degree of trust that somebody has in a partner, hopefully you'd have at least three to five questions relating to trust. So it's really about how you are constructing your questionnaire. You've got some hypotheses, you've got some of these factors in your head, so you'll make sure you've got enough questions concerning those factors. And then there are a couple more conditions that we're just going to come across when I look at SPSS.

Right, the next step we can safely go to is to determine the number of factors. The whole point is that we're supposed to reduce this number of items so that they are represented by a small number of factors. So if we've got 50 questions, we're not after 50 factors, all right, because
that way we haven't reduced anything. I'd like to have those 50 questions, 50 items, reduced down to, say, three or four or five, some very small number of factors, and those factors should carry some kind of meaning which relates to a set of questions in your questionnaire.

Once we've determined the number of factors, the next step is to rotate. This thing called rotate is just jargon at the moment. The whole point of rotating is to get a sharper distinction between the factors, meaning that I want a clear distinction that this set of questions is about degree of intelligence, and this set of questions is about the degree of somebody's sportiness, or how outgoing they are, the degree of being extrovert.

Then, flowing down this chart, what happens next is we drop the poor factors. By poor I mean maybe it's got a poor interpretation, we don't have a clear interpretation of what the thing means, or it's not related to enough questions. Like I said, we need three to five questions per factor; if we had only one question for that factor, it's pointless keeping that factor. And also we tend to drop variables that are associated with more than one factor. What do they call it? Cross-loading, yep, cross-loading. I am NOT an expert in factor analysis, by the way. A good point about that is that I'm asking the simple kinds of questions that hopefully you'd be asking.

You can see that this factor analysis has a cyclical nature. If we had to drop anything, we cycle back and re-estimate with a fixed number of factors, then rotate again and see what we get. Once we're happy with that, that means we've got the result. At this stage your teachers might stop, and you're thinking: why have you just stopped? What's the point, then, of doing factor analysis? So once you've actually gone
through that, the next obvious step for newbies is to go ahead and calculate, or estimate, the factor scores: an estimate of their degree of intelligence, an estimate of their degree of trustworthiness, all this kind of stuff. Or, if we are moving on to calculate Cronbach's alpha, we will have verified the unidimensionality of the scale, or we could have validated the scale, meaning that we've checked that all the questions are measuring what they're supposed to measure.

Already that sounds long enough, so let's jump straight into the application. I've got the personality data here. I did not go online and gather up all this data from a dating website; it's just a personality questionnaire that I've got. It has 44 questions (I'll just delete this, because it's something I did before I started). So 44 questions, and how many respondents? That's well over 100; that's 459. So cool, I can run this thing, except that we know all these variables are on the Likert scale. They're all ordinal; I've got no nominal and no continuous. So basically this means that if I'm going to run it in the standard SPSS package, then I am likely to overestimate the number of factors, which I might say in my report to show my supervisor that I've been reading around and actually know something more than books usually tell you.

So the next thing is to determine the number of factors. Before I go into this, because there are so many buttons and options that you can use in EFA, we're just going to assume that all the roads lead to the same place. If we just bear in mind that at the end of the day what we want as newbies is factors that are interpretable, then we don't care which of these buttons we're pressing. Because, depending on who you're taught by, some professors have some kind
of liking for some methods rather than others, and I'll point them out as we're going along. But if you're on your own, you haven't been taught FA and you're doing a big dissertation, just bear in mind that so long as you get some result that is meaningful, you don't need to spell out exactly how you got there. They'll just be happy with the factors that you end up with at the end of the day. If you don't bear that in mind, you're going to get anxious about pressing all these buttons and what they mean, and that's pointless; you don't need it.

So we'll go to Analyze. It's a dimension reduction method, so we'll go down to Dimension Reduction, and it's Factor that we click. I'm going to reset this because I did it before. You select the variables and push them over into the Variables box. Now all these come from the same questionnaire, so I'm putting them all in. Then, for the conditions and assumptions, we'll go up to Descriptives, and I just click on KMO and Bartlett's test of sphericity. So they give you two things that we can check.

To get the initial number of factors we need to click on Extraction, so that's to extract the number of factors. We've got a choice of method, and this is where it's up to you. Statisticians apparently prefer the maximum likelihood method (ML) and principal components, whereas social scientists tend to talk about using least squares or principal axis factoring, and so do psychologists. There are assumptions for these things, depending on which one you use. Maximum likelihood, for example, kind of assumes joint normality, which, for a lot of the items, if you look at them, they've got Likert scales, you might say: well, maybe not. OK, or
maybe yes. But if we bear in mind that, so long as we get an OK result, we don't care what we use, let's just use something. So we're going to use principal axis factoring.

Analyze: most people analyze the correlation matrix, i.e. pick out the factors from the correlation matrix, so we're going to stick with that. Don't go for the covariance matrix; if you do, the method for determining the number of factors that I'm going to show you is not going to work. So just leave it on correlation matrix. Display: unrotated factor solution, leave that checked. Then there's this thing called the scree plot. Most people check it, so I'm going to check it; it's not great, but we're going to check it. Extract: based on eigenvalue, and we'll leave that at 1. Continue, and for now that's all we need, so let's just click OK.

We start off with this box that we asked for, the KMO and Bartlett's test. This KMO is a number that measures the proportion of variance in the variables that might be explained by underlying factors. It's a proportion, so that must mean its value is between 0 and 1, and the closer it is to 1 the better. This is 0.841. Some books say bigger than 0.6 is fine for social science, some say 0.7; we are higher, so that's good. That's basically saying that about 84% of the variability in the variables can be explained by some underlying factors. That is not a test, that is just a number.

Next thing: Bartlett's is a test. Being a test, it must have a null and an alternative. The null is basically that there is no correlation among the questions. Do we reject that? Go to the p-value, which is labelled Sig. We want that to be less than 0.05. This is tiny, so good: it means there is structure there, there are factors that we can pull out, there are interrelated questions. OK, then I'm going to go
down to the Total Variance Explained box. There are three sets of columns: the factor number, the initial eigenvalues, and the extraction sums of squared loadings. Because we're not doing a deep kind of analysis here, we want to go straight to the extraction columns, because they tell you the number of factors that have been pulled out. You could show that the working comes from the initial eigenvalues side, but we just want the results, so you can ignore that. If we look down to the bottom number here and go along, it's 10. So we've got 10 factors from these 44 questions. If we go down to the bottom of the table, that's 44, right: 44 questions have been reduced down to 10 things, and those 10 things will hopefully summarize most of the variability in the 44 questions.

Now how did I end up with 10? This is because we told SPSS to base it on eigenvalues, keeping the factors with eigenvalues bigger than 1, so that's what it's done. If you slide along here, the tenth one is 1.044. It's bigger than 1, and that's why you've got a number there: it's been taken across to this side. Whereas for the next one, do we need an eleventh factor? Well, the eigenvalue there is 0.966, which is less than 1, hence it's gone. And that's it.

Now if you look at the scree plot, it's basically a plot with the factors running from 1 to 44 on the x axis, plotted against the eigenvalues. So long as you already know that we're basing it on eigenvalues bigger than 1 to say it's a factor, you don't need the scree plot. The scree plot is just an alternative, but some people like to look at something visually, so we can look at that. Just know that the scree plot is showing you what you've already seen. If you've decided on the eigenvalue-bigger-than-1 criterion, stick to that, because with the scree plot it's not always clear what you do. The
way they teach it is, they say: suppose there is a kink, or they like to use the word elbow; where that kink is, that's the cutoff. But if we've determined that an eigenvalue of 1 is the cutoff, then we'd select whatever that number is. So you could just put this in a report and say you've based it on eigenvalues bigger than 1. Those two are not the only ways, but they're the most popular ways of determining the number of factors.

So let's say we stick with 10. Now we want to see if we can get a more interpretable meaning for these 10 factors, and usually this is where we rotate. So we go to Analyze again, Dimension Reduction, Factor, and this time go to Rotation. How many options are there? One, two, three, four, five options. The most widely used ones are varimax and promax. The distinction between the two is that varimax says our ten factors are orthogonal to each other, whereas promax says our factors can be correlated with each other. Orthogonal means they're independent of each other, nothing to do with each other, whereas promax says that our factors, for example in our case intelligence and being funny, could be correlated, if intelligence came out to be a factor and funny came out to be a factor. So some people say go for promax, because why do we assume orthogonality when that's a very strict thing? Basically, orthogonality in maths is very nice because it leads to neat results. So let's just go for promax. But again, bear in mind that it doesn't matter what we do; so long as we're going to get something that's interpretable, then we're fine. And we say OK.

Now, just to make the output clearer and easier to interpret, we go to Options, and we go down here where it says Coefficient Display Format, and we go: sorted by size, suppress small coefficients. Suppress small coefficients
just means do not show those loadings (I haven't used the word loading yet, I'm going to get there) that are below a certain value, and what the textbooks and so on tell you is 0.3. So we won't show anything below 0.3; we want something meaningful, in other words we don't want anything too small. Missing values: if you have missing values in your data set, which in reality people do, we're just going to leave it on the default, exclude cases listwise. And then we're going to go OK.

Right, cool. What we've got here is a factor matrix, a pattern matrix, a structure matrix and a factor correlation matrix. The pattern matrix shows us what are called the loadings, and these loadings are what people report. The structure matrix reports the correlation between the variables and our ten factors. Then you've got the factor correlation matrix, which shows you the correlation among our ten factors. So let's go to the pattern matrix, because that's what we're really interested in. We've got our questions going down the first column, and then we've got these other columns, one for each factor. And here's the thing: we're trying to interpret these factors.

We've got ten factors, so let's look at them. Let's go to the tenth one. Now I want to see if I can chop anything out, because we are down here on the chart: do I drop poor factors, or variables that are loading on more than one factor, that kind of thing. Let's go part way to doing this. For the tenth, let's just have a flick down: a count of three items with that. Only three, so we might want to chop that out. How about the ninth factor? How many questions are related to that? Remember, we're after a minimum of three to five. That's got one, two, three, four. And the eighth one, it's only got three as well. OK, well,
knowing that we're dealing with Likert scales, ordinal data, we're likely to have overestimated, so why don't I just chop them? Go straight down: the seventh factor has got one, two, three, four, five. Cool, so we could keep that. Why don't we go straight to seven? In other words, I've dropped the poor factors, because for the last three I'm going to just say they are not loaded onto by enough items. So let's just do that. We're going to go again to Analyze, Factor, and now we're going to go to Extraction. We don't need a scree plot, so that we don't get cluttered with so much rubbish on our screen, so let's get rid of that, and let's get rid of the unrotated solution as well, because we're not interested in any of that stuff. Fixed number of factors: I said we're going to drop eight, nine and ten, so that means I only need seven factors. I'm telling it to extract only seven, and we don't touch anything else. I still want it rotated, and click OK.

Then we want to go down to the pattern matrix again. Right, see, now it's got seven columns. Let's look at factor seven: well, my seventh one has gone down, it's now three, and my sixth has got three, four, five questions. So I'm going to cut one more. I'm going to Analyze, Dimension Reduction, Factor (you can have a go at me, by the way, about what I'm doing, because I'm no expert in this), Extraction, six. I'm going to come up with something that's meaningful. Continue, OK. Pattern matrix: click, that's the fast way to get to it. Six, let's see the sixth: one, two, three, four, five. OK, I'm satisfied with that now.

So let's see if we can do some interpretation. Let's look at the first factor and the adjectives here. Disorganised: it's got a minus. Minus means that it's basically the opposite of this. In other words, if it's minus, it means that... no, let me just rewind. This is what I had a hard time finding out, and you should know this: the pattern matrix shows you the
loadings. The loadings take values between minus 1 and plus 1. Because when I'm saying small or big, how the heck do you know what is small and what is big if you don't know what the minimum and maximum values are, if there are any? So I'm telling you that the minimum value is minus 1 and the maximum value is plus 1. Now this is minus 0.75. It is not a correlation, it's a loading, but this minus 0.75 is closer to minus 1 than it is to 0, isn't it? So it must mean it's quite negatively related; that's what this sign means. So when you look at each of these loadings, you look at the sign, which, like a correlation, tells you whether it's positively or negatively related, and then at the magnitude.

I'm just interested in the sign at the moment. The sign is negative, so it means that this factor is to do with a more organized person, because the more negative it is, the more it is the opposite of this adjective. So this person is organized; this factor is to do with being organized. The second one is positive: does a thorough job. So it means they do a thorough job. Efficient: that's a positive one. A negative one: careless, so it must be the opposite of that, meaning this factor is to do with someone who is more careful. Lazy: that's a negative one, which again means the opposite, so it's a person that is not lazy. Perseveres is positive. It's quite interesting, this. And reliable, yep. Sticks to the plan: positive. Anything else? No.

So altogether we've got to think of one adjective or one phrase that summarizes what this factor is measuring. Well, first of all, it's a positive quality, isn't it? You want to find one adjective that summarizes everything that I've just said there. How about just all-round goodness? You can probably find something more succinct, but, you see,
the thing with factor analysis is you've got to come up with these phrases. One thing to say about having pluses and negatives is that it's because of how you phrased some of these questions. So if you wanted them all to be positive instead of a mix of positives and negatives, especially when you come to calculate these factor scores, you are going to have to recode, reverse code, this, so that it is talking about the degree of being organized rather than the degree of being disorganized.

I'm going to do this one more time. The second factor: it's positive on quiet, so it's positively related to quietness. It's positively related to reservedness and to shyness. It's negatively related to talkativeness, which means it must be positively related to not-talkativeness, if that's a word. And it's negatively related to outgoing, and negative must mean it's the opposite: the degree of not-outgoingness. So this could all be a characteristic like introvertness, isn't it? Basically like me. This one here, oh, nervousness, and so on.

So if we go back to our flow chart, we've dropped the poor factors. We haven't dropped any variables, and I'm not going to do it, but I'm just going to show you something. We want to get rid of things like nervous, not because we don't like the word, but because it's related to more than one factor. Factors two and three both contain nervousness; that's called cross-loading, and we don't want that. In this case it also means the factors are not unidimensional.

So I have determined the number of factors. I've rotated them to get a sharper image, the interpretability of these factors. I didn't do all of them for you there, but you can see what I was doing. I've dropped the poor factors, and I've told you how to drop the poor variables, the ones associated with more than one factor. And then we re-estimated and, yep, we've got something of
that sort that we're happy with now: six factors. Now 44 questions can be represented by six factors that each happen to have a meaning, and now we can do the following: we can estimate the factor scores. I know that newbies like to do this, because once you've got the factor scores you can compute things like the degree of introvertness, and then report that, or calculate a mean and standard deviation of that. That's good.

So let's go over to this thing here. There's nothing to do with kinkiness here, and these questions have nothing to do with romanticness or anything. It's not a very comprehensive questionnaire on personality, is it?

So say we want to create a score; we've got these six factors. Well, one thing you can do, which I did in another video, is simply take these questions. Say for the second factor you want one number for each person that measures their degree of introvertness. You add up their scores for quietness, reservedness and shyness. Now talkativeness you have to reverse code, so it's not-talkativeness, so that all these signs are the same. Outgoing you've got to reverse as well, so it's talking about not-outgoingness. In other words, once they're all talking in the same direction, you can add them up. That's one way to do it, which I've shown you how to do in another video, and it gives you what's called the unit-weighted scale score.

Another way is that we can use a weighted scale, and we can do that in various ways. So we just go down to Dimension Reduction, Factor, and this is where it says Scores. Again we've got more than one option here; I'm just going to bear in mind that we just want to
score, all right? There are three methods: regression, Bartlett and Anderson-Rubin. Just note that all of them calculate the scores so that the average score is zero. So if somebody gets a score above zero it means they're above average, and if it's negative it means they're below average. If you want the clear distinction between them, click on the Help button, which will tell you more about the differences between these three. But the main thing to note is that they calculate the scores with a mean value of zero. OK, let's continue. What's going to happen now is that it will create an additional column for each of our factors with the score.

All right, here we go. If you look down here, can you see one, two, three, four, five, six? So yeah, I did six, not seven. Remember, the second one was to do with degree of introvertness, wasn't it, so degree of introversion. So if we look at factor two, why don't we even give it a proper name here: introvert. Click on this again, and it takes us over here to our introvert column. The first person got a score on the introvert scale of minus 0.157. That is negative, which means they are less introvert; the more positive it is, the more introvert. So less introvert than average, because zero is the average. The next person: 1.23, that's above average.

Now this is where the web dating sites, for each person, like for this second person here, will have, I don't know, a bar chart. A bar chart is a good way to show it, showing you their score for each one of these, so that the taller the bar for that particular factor, the more you are of that thing. It's easy for non-mathematicians to interpret.

So I think that's it, really. I've tried to make this really clear. I've given you a kind of recipe, something to follow. I've told you, kind of emphasized,
this thing about the conditions and assumptions, which is often just not mentioned or skipped completely, but it's actually very, very important. So just know that, because with questionnaires we probably have ordinal data, if we use SPSS we are likely to overestimate the number of factors. SPSS does have an add-on module called, what is it called, CATPCA, something like that, categorical PCA, that deals with categorical variables, these two guys, the dichotomous and ordinal questions. But I think you have to pay for it, and I'm not going to pay for that, because other software like Stata and Mplus already have that built in. Something like that should be standard by now. Or you could use R as well, which is free. So if you are doing more serious research, you want to use methods which recognize the deficiencies in the standard method. If you've got categorical variables, like you do in a questionnaire, you want to be using some package that takes account of that fact, rather than just assuming that they're continuous, as we have done using this SPSS package.

OK, well, you can write loads of comments; I'll be interested in reading what your thoughts are. OK, cheers guys, and good luck with your dissertation.
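
The captions describe a unit-weighted scale score: reverse code the items that load negatively on a factor, then add up each person's responses. The Python sketch below illustrates only that idea and is not taken from the video. The item names, the 5-point Likert range, the keying signs and the three respondents are all hypothetical, and the final standardization step is an added assumption that merely mimics the "average score is zero" property the captions attribute to SPSS's saved factor scores; it does not reproduce the regression, Bartlett or Anderson-Rubin methods.

```python
from statistics import mean, pstdev

# Assumption: every item is answered on a 1-5 Likert scale.
LIKERT_MIN, LIKERT_MAX = 1, 5

# Hypothetical keying for an "introversion" factor:
# +1 = item loads positively, -1 = item loads negatively (needs reverse coding).
INTROVERSION = {"quiet": +1, "reserved": +1, "shy": +1, "talkative": -1, "outgoing": -1}


def reverse_code(response, lo=LIKERT_MIN, hi=LIKERT_MAX):
    """Flip a response so that 1 becomes 5, 2 becomes 4, and so on."""
    return lo + hi - response


def unit_weighted_score(responses, keyed_items):
    """Sum one person's responses, reverse-coding the negatively keyed items."""
    total = 0
    for item, sign in keyed_items.items():
        value = responses[item]
        total += value if sign > 0 else reverse_code(value)
    return total


def standardize(scores):
    """Rescale scores to mean 0 and (population) standard deviation 1."""
    m = mean(scores)
    sd = pstdev(scores)
    return [(s - m) / sd for s in scores]


# Made-up respondents, for illustration only.
people = [
    {"quiet": 2, "reserved": 3, "shy": 2, "talkative": 4, "outgoing": 4},
    {"quiet": 5, "reserved": 4, "shy": 5, "talkative": 1, "outgoing": 2},
    {"quiet": 3, "reserved": 3, "shy": 3, "talkative": 3, "outgoing": 3},
]

raw = [unit_weighted_score(p, INTROVERSION) for p in people]
z = standardize(raw)

for person_number, (r, s) in enumerate(zip(raw, z), start=1):
    print(f"person {person_number}: raw={r}, standardized={s:+.2f}")
```

With the standardized version, a positive value reads as "more introvert than average" and a negative value as "less introvert than average", which is the same reading the captions give for the SPSS scores.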
Info
Channel: Phil Chan
Views: 80,314
Rating: 4.7675543 out of 5
Id: x4GFIzKzf2E
Length: 38min 51sec (2331 seconds)
Published: Fri Dec 26 2014