SPSS Tutorial for data analysis | SPSS for Beginners

Captions
Welcome to SPSS: An Introduction. I'm Barton Poulson, and in this course we're going to look at the statistical program SPSS and some of its basic functionality, and give you an idea of what it can do and how well it might work in your own data work.

Now, SPSS: the name deserves a little bit of explanation. Once upon a time it stood for Statistical Package for the Social Sciences. Now it's just SPSS, but that's its origin. One important thing to know is how popular SPSS is. Here's a chart that comes from the excellent website r4stats.com, and what it shows is the number of scholarly articles published in 2015 using various statistical packages and languages. We can see that right at the top is SPSS Statistics; SPSS is number one by far in terms of scholarly research. You can also look at jobs. Here's another chart, also from r4stats.com, and what this shows is analytics job listings in 2015 on Indeed.com, one major source of tech jobs. SPSS is on the list, but this time it's actually a lot lower: it's number six. So there is a difference here between academic publishing and employment in analytics. Really, what this tells you is something about the population, or the audience, for SPSS: the primary audience of SPSS is academic researchers, especially in the social sciences but also in other fields like business.

Now, there are some reasons that SPSS is popular in these fields. Number one, it's user friendly: it's got a point-and-click interface, which allows you to assemble code really quickly. You can save that code as what's called a syntax file, and then you can reuse it, adapt it, and share it with others. Also, SPSS is really well adapted for data from experiments, where you're comparing means via t-tests and analysis of variance, with several important options like effect sizes and power analysis built in. So those are some of the reasons for SPSS's popularity, especially within academic research. In sum, we can say a few things. Number one: SPSS,
despite being developed about 40 years ago, is still popular. It's got an easy-to-use interface, and it's easy to save and reuse the syntax, giving you a code basis for the work that you do within SPSS.

The first thing we need to talk about in SPSS: An Introduction is setting up and getting ready to do the work. To do that, however, we need to take a minute and talk about versions, editions, and modules, which all refer to different kinds of things in SPSS. The choices really make me think of an overwhelming plethora of possibilities ahead of you, and it's nice to break it down a little bit. So the things we're going to talk about are versions, which are the release updates (the old version 1, version 2); editions, which vary according to what's included in a particular purchase; and modules, which are extra functions that you can get to add on to the abilities of SPSS.

We'll start by talking about versions. Version 1 came out in 1968, and at that point it was called Statistical Package for the Social Sciences. SPSS version 24 came out in 2016, and now it's called IBM SPSS Statistics; they act like SPSS doesn't stand for anything now. For this course I'm using version 22 on a Macintosh computer. Fortunately, there haven't been any extraordinarily major changes between 22 and 24, and everything I'm going to show you in this course will work just fine in almost any other version of SPSS. Now, it is possible that you've heard of something called PASW at some point. SPSS was briefly called Predictive Analytics Software during a trademark dispute after SPSS got bought by IBM. That only lasted for a year or so, and it got resolved. The important thing to know is that no matter what version you're using, the files are generally highly compatible between versions, and so code that you created in version 16 is probably readable in version 24. There are some backwards compatibility issues for advanced functions like automatic modeling and so on, but most of it is consistent all the way through.

Now, we also
need to talk about editions of SPSS, and there are a few major choices here. There's the Base edition, the Standard edition, the Professional edition, and the Premium edition. They differ by price, and they differ by the functions that are included with each edition. So, for example, in Base you get basic statistics, linear regression, clustering, and factor analysis. Standard adds on to that logistic regression, generalized linear models, and survival analysis; it also adds drag-and-drop interactive tables. The Professional edition adds to that data prep, forecasting, decision trees, and imputation methods. And then finally, the top-of-the-line Premium edition of SPSS adds bootstrapping, complex sampling, exact tests, and structural equation modeling. So each one adds on a number of other functions.

Now, this is the product pricing as of August of 2016, and you see, for instance, that SPSS starts in Base at $1,170 per year per person, so it's an annual license, and it goes all the way up to nearly $8,000 per user per year. So it gets really expensive. However, I want to say this: don't panic. There are other ways to get SPSS, aside from having to, you know, sell your house. Number one, there is a free trial: you can download SPSS and try it for 14 days, and during that time the best thing to do is see if you can make a business case and get somebody else to buy it for you. There is also academic pricing: student pricing for SPSS starts at $35 for a six-month license. It's not the super-duper version, but it is absolutely sufficient for doing the majority of academic research.

Now, we also need to talk about modules. These are the components that add extra functionality to SPSS, and they're primarily the things that differentiate the different editions. The available modules include advanced statistics, bootstrapping, categories, complex samples, conjoint, custom tables, data preparation, decision trees, direct marketing,
exact tests, forecasting, missing values, neural networks, and regression. So that's 14 additional modules, and this sounds like a lot, but if you compare it to the 9,000 packages that are available for R, there's a difference there. The other major difference is that these packages cost money, so you need to work that into your budget. On the other hand, there are also free plugins that make it possible to use code in R, Python, Java, and the Microsoft .NET framework within SPSS, and so there are abilities that you can add depending on what you need. In sum, we can say this: SPSS has a long history as far as statistical software goes. There are several variations, and additions that you can make to it by adding extra modules. On the other hand, it can be very pricey, so it's something to consider when you're doing the cost-benefit analysis of SPSS.

The next step in SPSS: An Introduction and setting up is simply taking a look at SPSS and seeing what the program's like, and the easiest way to do that is to just open it up. When you first open SPSS, you'll get this introductory splash screen that gives you an opportunity to open up some files, including recent files, and learn more about various things that you can get from SPSS. If you want to, you can click on this box, "Don't show this dialog in the future," and then you won't have to deal with it again. You can also just press Cancel, and that brings you to the data window in SPSS, which has a lot in common with a spreadsheet. It has these rows and columns, where you have one row per case and one column per variable. But there are some very important differences between SPSS and a spreadsheet. To demonstrate this, I'm going to open up a data set that I've used recently, and when this opens up you see that it does resemble a spreadsheet: we have the variable names across the top, we have row numbers down the side, and we have data in the middle. Now, one important difference between SPSS's data window and
a spreadsheet is this: you have a Data View, but you also have something called a Variable View. It's the same data set, but if we click on it, we see it in a different way. Each of the variables has metadata associated with it. So, for instance, for age it tells you the type of the variable. These are mostly numeric, and there's a string variable, but you can see there are a lot of choices here: numeric, comma, dot, and so on. You can also specify the width of the variable and the number of decimal places. And then a really important thing that makes SPSS different from most other programs is the use of labels. This column right here shows variable labels. The idea is we have a short, one-word variable name over here on the left. If you used a very old version of SPSS, names were limited to 8 characters, and you sometimes ended up with very cryptic names. You don't have quite the same restrictions anymore, but what's common is to give a short name to the variable and then to give it a label that is more descriptive. In addition, you can have value labels. So let's come here to marital and click on this; this is a way of telling SPSS that in that column a zero means unmarried and a one means married. Obviously you can make them whatever you want. When you come back here, you have the option of seeing them, so I'm going to come right up here and click on this button, which will show the value labels, and you see how they've appeared. Now I can have them go away. There are the variables, and if I just hover over one, then I see the longer name.

Going back to Variable View, you can also specify values for missing values, you can give the width of the column and the alignment, and then you can specify the level of measurement. Now, SPSS uses three values: scale, which is an interval or ratio level measured variable; ordinal, which is ranked data; and nominal, which is categories. You also have the option of specifying whether something is an input
variable, a target variable, or both. There are certain functions that use those, but most of the time that's not a big deal, and you see that in this demonstration data set those haven't been changed at all.

So the first window in SPSS is this data window, but there's more to SPSS than that. For instance, let's make a very quick graph. I'm just going to make a simple chart here: I'll come and make a histogram of age and hit OK. So you see I have a graphical user interface with drag-and-drop menus that allows me to assemble my commands this way. I hit OK, and then what we get is another window that opens up. It's super tiny up here, so I'm going to make it much bigger. This is the output window, and it's a separate window: the data is in one window, and when you do an analysis you get a separate output window. You can actually have multiple output windows. What this one does is hold the graphs or any statistical analyses we do. It also has a table of contents over here, where you can collapse things or expand them. And an important thing is that I've got it set so it shows the code that SPSS generates behind the scenes to create this analysis. The neat thing about that is you can actually use that code and manipulate it directly. This code is called syntax in SPSS.

Now, by default SPSS opens up only a data window and an output window, but you can get a syntax window as well. In fact, let me do that: I'm going to come up to File, New, and Syntax. This is a very blank window, but it's one that you can type in, or you can also use the drop-down menus to put a command in there. So I'm going to come back here to the recent command, where I did a histogram. I could press OK again, but now what I'm going to do is press Paste, and what that's going to do is get the code for that chart and put it right here. In fact, this is the part that we use, and if I select that, I can hit Run. I can also do Command-R or Control-R; it runs the selection, and
you'll see we get the output window again, and it's done exactly the same thing a second time, but this time it did it from a window where I'm able to have the text. Now, a lot of people are uncomfortable with syntax, and they like the drag-and-drop menus. But a really important thing about syntax is that it allows you to save your analysis so you can repeat it without having to go through all the menus. You can simply paste the syntax from the dialogs into a syntax file, and then you can repeat it as many times as you want. It's also really easy to modify things when you do it that way. Syntax files are just plain text files: they're saved with a .sps extension, but they read just like plain text files.

Now, these are the most important elements of the SPSS environment: the data window, with both the Data View and the Variable View; the output windows; and the syntax windows that allow you to save the commands. This is what gives SPSS both some of its flexibility and its power. As you become more comfortable moving back and forth between these various windows and seeing what you're able to do, both with the drag-and-drop menus and by typing text, you'll discover there's a great amount of flexibility and power in SPSS that can allow you to do the analyses you need to do and get the insight you want from your data.

We'll continue our introduction and discussion of setting up in SPSS by taking a look at the sample data that comes as part of the SPSS application. The really nice thing about that is it allows you to get started now, start working with things, and see how SPSS works. The hard part, however, is that it's totally hidden, and so you need to know where to look in order to use the sample data. If you're on a Macintosh, like I am, then it's going to be in your Applications folder, under IBM, SPSS, Statistics, 22 (or whatever version you're using), then Samples, and then English; then you'll have them. In Windows it's a little bit different: it's going to be C:, Program Files, IBM, SPSS,
Statistics, 22 (or whatever version you have), Samples, and then English. So you have to navigate to that manually in order to be able to find those, but when you do, you'll see a bunch of files there. There are a few kinds in particular that are important. There are the .sav files: these are data files in the proprietary SPSS format, and usually they can only be opened up in SPSS. There are also .sps files, and these are SPSS syntax files: they're text files with the commands that can run a number of analyses, graphs, and other functions in SPSS. Now, you can try it in SPSS on your own computer by opening up a file called demo.sav, but let me show you how it works.

When you navigate to the folder with the SPSS sample files in it (again, it's several hidden layers down), these are the files that you'll find. These are the .sav data files, and these are the .sps syntax files. Now, there are other things in there: there's something called a CSA plan, which is an analysis plan, there is an XML file, and there are a few other things in there. But the only ones that we're going to deal with are the .sav files and possibly the .sps files. Let's scroll down here until we find demo.sav. Please note there are a lot of other demo files around that, so you want this one in particular, demo.sav, because that's the SPSS data file. I'm going to double-click on that, and SPSS opens up the file. Now, you can set SPSS so it has only one data file open at a time, or you can have multiples. I'm going to close this empty file right here, but here is our demo file, and this allows us to start working with a lot of the analyses and see how they work. In fact, I'll be using this file all the way through this entire course, because it allows you to do a number of analyses that require specific kinds of data, and this has it all set up. So I'll show you a very quick one. I'm going to come up to Analyze and to Explore,
and I will get level of education and put that in. So I have a long list of variables that I can work with; these are all the same variables. I'll just hit OK, and that opens up my output window. Again, it opens it up microscopically here in the top corner, so I'm going to make it bigger. And now I'm working with my sample data, and that allows me to get some hands-on experience, to see how the functions work in SPSS, and to try some of the options and see how they affect things.

Our next step in SPSS: An Introduction is to look at basic graphics, because those are always a good first step in analysis. The easiest way to do that in SPSS is with something called Graphboard templates. Really, you can just think of these as graphs made easy. The idea here is that if you set the levels of measurement in SPSS, then SPSS can suggest graphs that would be appropriate for those variables. Now, in terms of level of measurement, remember SPSS uses three: number one is nominal, for different categories; number two is ordinal, for ranks; and number three is scale, which is for interval or ratio level measurements. Then, when you're in the Graphboard templates, you have two basic choices. You have basic graphs, where you first choose the variables that you want to graph and then SPSS shows you suggested graphs, and you can see what you want to do with them. There's also an option for detailed graphs, where you choose the graph style first and then choose the variables that go into it. These aren't exclusive; you can bounce back and forth between the two tabs. It'll be easiest to see how it works if we just go to SPSS.

If you're logged into datalab.cc, then you should be able to download the exercise files from the same page that this video is on. Open up this file, SPSS01_03_01_Graphboard.sps. It's a syntax file, and let's see what it looks like. The syntax file that you've opened looks kind of complicated, but this is really because I want to have a written
record of the same things that we're going to do with the drag-and-drop menus in the Graphboard. We do need to open a data set, and as I mentioned before, the path to the data sets is a little bit different depending on whether you're on a Macintosh or on a Windows computer, and also depending on the version you're using. I'm using 22, so if you're using something else, change that number right there; most of it should be the same. Then you can run this command to open up the data set and activate it. Now, I've already done that, so I'll show you my data set. Right there is demo.sav, and we can come down here to Variable View and see the levels of measurement that SPSS has assigned to these variables. Most of them are scale, and we have a few that are ordinal. We have only one variable in this data set that's truly coded as nominal, and that's gender, which is actually a string variable in this case. I'll go back to the syntax.

Now, I have some rather complicated syntax here, but what you'll see is that when we use the menus, it's actually pretty simple. The first thing we're going to do is make a chart of age. I'm going to come up here to Graphs, to the Graphboard Template Chooser, and when I come to that, you see I'm in this tab of basic graphs, and this is where I choose a variable. I'm going to choose age right here, and it recommends three different kinds of charts: a dot plot, a histogram, and a histogram with a normal distribution. We'll take the very first one that's available, the dot plot, and hit OK. It puts it in the output window, which I have to maximize, and there it is. It's a dot plot, and it looks a lot like a histogram of age in years. It goes down to 18 years, it looks like it goes up to about 77 or 78, and it's an easy way to get a feel for the distribution that we're dealing with. Again, the command in text, in syntax, is complicated, but the graphical interface makes this very easy to do. I'll go back to the syntax for a moment. If you were to paste the syntax for that command, this is what you
would see right here, and that's a way of saving it; you can modify it manually if you want. Now we'll do a histogram of age with a superimposed normal distribution. I'll come up to Graphs, to the Graphboard Template Chooser, and this time all I have to do is come over to the right, click Histogram with Normal Distribution, and hit OK. I expand the output window, and it's really simple.

Now, both of those charts were with age, which is a ratio level variable, or a scale variable in SPSS terminology. We can also do this with categorical variables. I'll use gender and make a bar chart. I come back up to Graphs, to the Graphboard Template Chooser, and when I come down to gender you'll see that the recommended charts change, because this time it knows it's a categorical variable. Now, if I had GPS data I could put that in here; I could do a bunch of different things. I'm just going to do a bar chart, because that's the easiest to deal with. I'll hit OK and make the output window bigger, and you see that in this particular data set we have an almost exactly equal number of men and women, or data on them.

Now, those were the basic charts, where you choose the variable first and SPSS recommends particular graphs. You can also do detailed charts; these are ones where you choose the style of chart first and then you fill in the variables. I'm going to do this again for a dot plot of income, and then show you that it's really easy to modify it. I'll come up to Graphs, to the Graphboard Template Chooser, and this time I'll go to the Detailed tab. I click on that, and I'm going to make a dot plot. I scroll through this, and you see we have a lot of choices. I choose Dot Plot, and then it's going to ask what I want to make a dot plot of. I'm going to click on this and scroll to income; the one that I want is right here, household income in thousands. I can click OK and then expand the output window, and here's my chart. It's a really basic chart, and you see that most of the people are at the low end, especially because this is
hundreds of thousands of dollars, so that's going to be a million dollars right there. But I want to show you an interesting thing about this. If we double-click on the chart, that opens up the edit window, and the Graphboard editor has some special options. For one thing, I can change the number of decimal places here: I just click on the decimals, come to Format, and change the minimum number of decimals to zero, and that's better. But a more interesting one is if I click on the dots themselves. They're drawn as points, and the modifier is to pile them. There are a few other modifiers that can be useful. One is to dodge them, and what that does is put them in the middle, expanding out either way. It might be a little harder to make comparisons from one level to another, but it's an interesting kind of chart. I can click on it again, and we can do what's called jitter with a normal distribution. That takes points with the same value and kind of randomly spreads them out up and down, and again you can see that we've got a whole lot there at the bottom. One other choice is jitter uniform, which makes them stay within certain boundaries, but it's hard to tell really how much things are spread out there at the bottom. So I actually prefer pile, or I think dodge is interesting in this case. And so that's one way of using Graphboard both to set up a chart and then to manually modify it by double-clicking on the chart. I'll close this because I'm done with that, and you see I have the modified version right there.

Now, we can get a lot more complicated. For instance, I can make a scatter plot of age and income with colors for point density. There are a lot of options, and you can explore them. This time I'm going to do it a little bit differently: I'm just going to select this command. Again, the way I got these was by setting them up in the menus and then simply hitting Paste, and it put this syntax into this syntax file so I could save it and run it later. And so I'm going to
show you how that works. I've got the command here that I created using the Graphboard Template Chooser, and I'll simply come up and select Run Selection. I maximize that window, and there you can see I actually have what's called a hex scatterplot, and it's showing a few different things in a really neat way. So you have a lot of options for the way you display things in the Graphboard Template Chooser, and while the code is complicated, the interaction with the menus is really simple. You can be creative, you can get different views on your data, and you can try to get more insight as you're doing your analysis.

The next step in our introduction to SPSS and basic graphics is bar charts, and we like bar charts for a very simple reason: they are simple, and simple is good. More specifically, bar charts are the most basic graphic for the most basic data, just frequencies for a simple category. It's also a very basic command in SPSS. Now, we actually have a few options for different kinds of bar charts. One, we can make a simple bar chart: a single variable, simply showing the category frequencies in that variable. Two, we can do a grouped bar chart, where we break it down by some other variable. And three, we can do multiple variables and show the bars simultaneously. But let's try this in SPSS; it's really easy to do. Just open up this SPSS syntax file and we'll give it a whirl.

Once you've got the file open, you'll need to open the demo data set, which we've used before. This is the command for Mac if you're running 22, and this is the command for Windows if you're running 22; just change the version number if you need to. Once you have the file open, we're going to make some bar graphs. I'm going to do it by coming up here to what are called the Legacy Dialogs. These are specialized, one-graph-only dialogs that come from earlier versions of SPSS, and truthfully I usually use these because I find them so quick and easy to deal with. What we're going to do is we're going to make a bar
chart for levels of education in our sample. So I'm going to hit Bar, we're going to do a simple bar chart, and we'll do groups of cases. All I need to do is select level of education, put it into the category axis, and hit OK. I make the output bigger, and there it is: absolute piece of cake. It's also very, very simple syntax. You see the syntax right here; really, it could be one line. And just as a point of comparison, here's the same chart produced with the Chart Builder, and you see we have this really complicated, overwhelming code. The legacy dialog produces it in an extremely simple way. So that's a simple bar chart, piece of cake.

Now let's do a clustered bar chart for groups of cases; we'll look at levels of education by gender. To do that, we come back up to Graphs, to Legacy Dialogs, and to Bar, and now we're going to cluster it and do level of education clustered by gender. So I hit Define. Level of education is sort of our outcome variable, so we put that under Category Axis, and then we define clusters by gender, so we put that right there. I'll hit OK and make it bigger. This time it uses nicer colors, and you have the five levels of education broken down, where women are in blue and men are in green. It's really easy to see the relationship between the two variables here, and in this particular data set it really looks like there's no substantial difference between the men and the women. I'll say I believe this is an artificial data set, so we wouldn't expect a lot of differences, but this is a nice way to compare them. By the way, come up and you'll see that the code for this is really simple: all it does is add BY gender, and so again it's a very short command.

I'm going to go back to the syntax, and we're going to do one more here, and that is for multiple variables. This is a situation in which it can be confusing if you have a lot of categories within each variable. What I'm doing here is getting the means of variables, or the number of ones if you have an indicator
variable, where it is a 0 for no and a 1 for yes. This is a really nice way of comparing the frequencies of each one of them across variables. I'll show you how that works. We'll go up to Graphs, we'll come back over to Bar, and we're going to do a simple one, but this time we're doing separate variables. I'll hit Define, and then I'm going to come down here. This data set, which again I believe is fictional, asked a lot of people about various things that they might do. We're going to look at wireless service, and we're going to come down to whether they own a fax machine, because this is old data and it's asking about old technology. Pagers: I've never had a pager. I simply select all those variables and put them in here, and as long as they're all on the same scale, it's going to do the mean of each one. On a 0/1 variable, the mean is the proportion of ones. I hit OK, and there we have it. It's a way of looking at the distribution of multiple variables simultaneously. It's a very information-dense display, and especially when you, the analyst, are exploring your data, this can be a really quick and easy way of getting a feel for your data, which can then direct your further analyses.

As we continue to look at basic graphics in SPSS, a really common one is histograms. This is a graphic for data that is quantitative, or scaled, or measured, or interval or ratio level. Those are really all referring to basically the same thing, and for any of them you're going to want to make a histogram to see what the variable is like. Now, I mentioned that SPSS prefers the term scale for these variables, and that's what shows up in the data definitions; I like to think of it as the scales of justice. But why are we making a histogram? The point is to see what you have, to see what the data is like, and there are a few things in particular that you're going to be looking for. Number one, you're going to be looking for the shape of the distribution: is it unimodal, bimodal, skewed left, skewed right? Are there gaps in the data
that suggest that maybe you have some important mechanism operating? Are there outliers that you would need to take into consideration before you do your analysis? Is your data symmetrical? There are a lot of different things that you could look for, and some of these are going to have a lot of influence on your analyses, so it's important to take a look at the data. A histogram will give you a great impression of a quantitative or scale variable. We'll try it in SPSS: simply open up this syntax file and we'll see how it works.

When you're in SPSS, most of this syntax is really just to open up the data set. It's the same one we've used in the others, this demo data set. Here's the code for a Mac (adjust the version number if you need to), and here's the code for Windows. Once you have the data set open, you can use the commands, and it's really, really simple. All you need to do is come up to Graphs, go to Legacy Dialogs, and come down here to the bottom, to Histogram. We're going to make a basic histogram of age, so I click that and come to age, which is our first variable. I simply click this to move it over and hit OK. I'll make the output window bigger, and there's our histogram. From this we can see that our distribution is unimodal, and we can see it's pretty close to normal; it's slightly skewed on the high end, but not very much. This is going to be a really good variable for most of our analyses, because it meets most of the assumptions of the kinds of procedures we might want to use.

Now, if I want to make things slightly more complicated (because you see that the command for this is extremely simple), we can make a small modification. I'll show you here: we can superimpose a normal distribution. All I have to do for that is come back to Graphs, to Legacy Dialogs, and to Histogram, and just check this box right here, Display normal curve. What that's going to do is create the same distribution, but it's going to put on top of it the line of a bell curve,
a normal distribution that has the same mean and standard deviation, and here you can see it. We're pretty close to normal, and this is a nice way of confirming that. And again the code for it is really simple: all it does is add the word "normal" in this sentence, and that gives us everything we need. So one of the reasons I really like the legacy dialogs in SPSS is because it's so concise, it's so simple, and it gets you what you need so you can get a grip on your data and move ahead.

As we continue SPSS: An Introduction and basic graphics, we should look at scatter plots, a very common method of looking at associations or, as I like to think, a way of assessing togetherness in data. In other words, you want to see what goes with what, or more specifically what variable goes with what other variable. So scatter plots are a great way of visualizing the association between two quantitative variables. When you make a scatter plot there are some things you should look for, and in case you're wondering what they are, they include, for instance, whether the association between the two variables is linear, because a lot of the procedures that are common assume that you can draw a straight line through the data. You want to check the spread of the data, especially whether the spread changes as you go from left to right on a scatter plot; that's called heterogeneity of variance, and it can cause problems with certain procedures. You want to look for outliers, either univariate, that's a score that's unusual on a single variable by itself, or, in this case what's even more significant, bivariate, where you have an unusual combination of scores. And then finally you want to try to get some idea of the correlation, or the strength of the association between the two variables. A scatter plot will allow you to do all of those. Now in SPSS there are three general kinds of scatter plots that you can do. Number one is a simple scatter; it's a bivariate X-Y chart, easy to do. Number two is a matrix scatter plot, where
you actually have several variables and they're shown simultaneously, and it's a good way of looking at complex associations between collections of variables. And number three, SPSS is able to do a 3D scatter plot, but I'll have some words to say about that a little bit later. But let's try this and see how scatter plots work in SPSS, at least very basically. So just open up the syntax file and we can see how it works. When you open up the syntax file we have the same situation, where you can load the data; we'll use demo.sav, and you can use this command if you're on a Mac using version 22 and this command on Windows version 22. But we're just going to make a couple of scatter plots, and it's a really basic, easy command. The first thing we're going to do is make a scatter plot of age and income. Let's come up to Graphs, to Legacy Dialogs, and down to Scatter. I'm going to use a simple scatter, that's just a basic bivariate X-Y chart. I'll hit Define, and all I need to do here is pick my variables for the x axis across the bottom and the y axis up the side. I'm going to pick age for the x axis and put it right there, and household income for the y axis, and the idea is maybe there's an association between household income and how old a person is. That's all I need to do except click OK, and when I do that I get this basic scatter plot. So I have age in years across the bottom, I have household income in thousands up the side, and you can see of course that most of the people are near the bottom; that's because most people make less than $200,000 a year, and this graph goes up to 1.2 million. We have a marker that's a large empty circle, it's in black, and you can change the markers, and there are things you can do to clean up the chart. But it's also easy to tell that the people who, for instance, make a lot of money are generally older, and so we can see in this data there is some kind of association between age and income. But let's try to get a more nuanced one by looking at several variables
simultaneously with a scatter plot matrix. Come back up to Graphs and Legacy Dialogs and down to Scatter. This time, however, I'm going to pick Matrix Scatter, click Define, and then all I need to do is pick the variables I want to include. I don't have to specify X or Y because they're all going to serve as both X and Y in different parts of the matrix. I'm going to pick a few here. I'm going to get household income, I'll move it over; I will get age and move that over; I'll get address, years at current address, move that over; I'll get reside, which is the number of people residing in the house, move that; and then finally I'll get level of education. There's nothing especially meaningful about these, they're just ones that I thought would be easy to look at. Now as a general recommendation, if you do have one variable that is an outcome variable, you might want to put that one in first; that puts it in the first column and the first row, and it makes it easier to find when you're looking at your analyses. But I've got my five variables in there and I just come and press OK. It takes a moment, and then I come up and this is the scatter plot matrix. So you have all five variables listed on the side, you have all five variables listed across the bottom, so each one functions as both an X and a Y. You have empty boxes down the diagonal because that would be each variable with itself, and the correlation is always 1. Now there are things you can do to clean this up: you can change the marker from a big black circle to something that's smaller and easier to see, you can put regression lines through, but it's easy to see that there are some really important patterns. So for instance, age in years and years at current address, right here: obviously there's a limit, you can't live someplace longer than you've been alive, and that's why we have nothing in the top left of that. But you do see some associations and some cutoffs that go through. Now this one's really dense; in a lot of situations it's going to be a
lot easier to see the patterns that are there, especially if you change the markers and put in regression lines, but this gives a good idea of what you can do with a scatter plot matrix. Now let's go back one more time to the Legacy Dialogs and to Scatter, because you saw that there were other options there. There's a dot plot, that's like a histogram, and there's an overlay scatter, which I don't want to deal with, and then there's a 3D scatter. You might look at that like, oh cool, it's interactive, it's 3D, it's a great thing. I'm actually not even going to do it, because every time I've done a 3D diagram I've found it's impossible to read it clearly, it's very hard to manipulate in SPSS, and it ends up being really a bad experience, and it's much easier to look at the association between variables using a scatter plot matrix. That's why I recommend that you avoid the 3D completely, even though it's available here, and use the bivariate and the scatter plot matrices as a way of looking at the associations between variables in your data.

Once you've done the basic graphics for your data and seen what you're dealing with, it's a good idea to move on to basic statistics, and in SPSS the most basic version of this is frequencies. I like to think of it as putting things into buckets and then simply counting what's in the buckets. So the idea is, when you have a limited number of categories in your data, then you should just count how often each category occurs; it's a first step to really some significant insight. But wait, I just want to mention that the frequencies command in SPSS can do so much more than that, and I'm going to show you how it works. For example, it can do charts: it can do bar charts and pie charts and histograms and normal distributions. And it can do a lot of statistics beyond frequencies: it can do quartiles, percentiles, mean, median, mode, standard deviation, variance, skewness, kurtosis, and so on. In fact, because of this, I like to think of frequencies as SPSS's
version of the competent man character in literature and movies who can do everything well, you know, somebody like Leonardo da Vinci or Iron Man, who seems to be able to do everything, or, you know, Marie Curie right here, because she won two Nobel Prizes, and what have the rest of us done? But anyhow, back to statistics. Let's take a look at frequencies and let's try it in SPSS. Just open up this syntax file and we'll see the things that it's able to do for you. As always we need to begin by opening the data set; we'll use demo.sav, and you can use this command on a Mac or this command on Windows to do that. Once you have the data set open it's a very simple thing to get the frequencies. Now I have the syntax saved here, but really it's more as a record of what I've done, because I use the drop-down menus to create these commands. So I'm going to come up to frequencies, and I'm going to get the frequencies for gender and job satisfaction. To do that I come to Analyze, to Descriptive Statistics, and then the first option there is Frequencies, and what I'm going to get is gender, which is right here, I'll just double-click to move it over, and we'll also get job satisfaction, I'll double-click and move that over. Now what's important is these are two different kinds of variables: gender is a categorical variable, nominal, and job satisfaction here is a scaled variable, and so normally you don't do the same kinds of things for these, but frequencies is very flexible. So I'm just going to hit OK and we'll see the default output for frequencies. The first thing that it shows us is how many valid observations there are, so how many of our 6,400 cases have data on these variables. The answer is all of them; there's no missing data here. And then it comes down and gives us frequency tables, where it lists every value or possible score on the variable and then says how often each one occurs. So for gender we have 3,179 female respondents, that's 49.7 percent, and the percent and
the valid percent would be different if we had missing data, but we don't, so we can ignore that, and then the cumulative simply adds up to a hundred. And then job satisfaction: this is a scaled variable which has one, two, three, four, five as answers, and here you can see how many people put each of the answers: 17% highly dissatisfied, 21.8 neutral, 19.1 highly satisfied. And that's a quick look at the frequencies that we're dealing with; it's also a nice way to check if your variables are coded well. What we can do is more than that: we can also turn off the tables and we can do bar charts using the frequencies command. So I'm going to keep those same two variables, gender and job satisfaction, but this time I'm just going to make bar charts. I'll go back to my recent commands, Frequencies, and what I'm going to do is I'm going to click this; it's going to give me a little error message because I haven't changed the other thing first. I'm going to come to Charts right here and tell it to make bar charts; obviously you can make pie charts and histograms as well. I'll click Continue and then click OK, and now the same general command, frequencies, is not producing tables but is producing charts. Here you can see that we are very closely matched in terms of the number of male and female respondents, and here you can see that job satisfaction sort of peaks at neutral and somewhat satisfied. And so that's a really nice thing: you don't even have to use the bar chart command, you can do it right here. You can also get more kinds of statistics in there. So for instance, for this one I'm going to keep the tables off, but I'm going to ask for a few extra things. In fact, let me just come back to this one: we're going to Analyze, Descriptives, and Frequencies, and this time I'm going to do age, reside, and jobsat. So I'm going to remove my one categorical variable here, I'll just reset that, I'll do age, reside, and jobsat, and I think that's this one right here, job satisfaction, and we'll move that over. So I have three variables, but
they're all scaled variables. What I'm going to do here is first I'm going to come to Statistics, and I have a really impressive range of things I can get. I can get the mean, I can get the median and the mode; if you want the mode, I think this is the only place to get it in SPSS. I can get quartile values; now it doesn't do the minimum and the maximum, you have to select those separately down here. But you can also get cut points. Now cut points, that's an interesting one: the quartiles are cut points, they split the data into four equal-sized groups with the same number of people in each. Sometimes you want something other than that; so for instance, I know that if you're doing propensity scores it's not uncommon to use five equal groups, quintiles. And also there are situations in which you want not the most extreme scores but near the most extreme, and so I'm going to put in, for instance, the 2.5 percentile and the 97.5 percentile, because those frame the middle 95% of the data. I can also get the standard deviation and the variance. Is there anything else I want? Right here, I want skewness and kurtosis. I'm going to hit Continue, then I'm going to come back to this one, I'm going to turn off the frequency tables, because otherwise, with a lot of different possible values here, I'd have a lot going on. I'll hit Charts, and this time I'm going to ask for histograms, and we'll put a normal curve on top of each histogram. Click Continue and click OK, and so here's what we get. It starts with the statistical output: here are the three variables I selected; it gives us the mean, the standard deviation, the variance, skewness and standard error of skewness, kurtosis, we have the minimum and maximum scores, and then the percentiles. Now it's a funny list here because I've got three things intermingled. I have the quartiles, that's something I asked for, so we have the 25th percentile, the 50th percentile, and the 75th percentile; those are the quartile values. I had the minimum and maximum up here, so those are the zero and 100%
quartiles. But I also asked for quintiles, and so that splits it at 20, 40, 60, and 80%, and then finally I manually entered the 2.5 percentile and the 97.5 percentile, and so they're all put there together, but it's really easy to see the changes in the distribution. Beneath that we have the histograms, and each variable has its own histogram along with a normal distribution with the same mean and standard deviation laid on top. Age is pretty close to normal. Years at current address, however, you can see is really skewed, because most people haven't lived there that long. And then finally job satisfaction is a little flatter than we would expect if it were normally distributed. The point of this is that I'm able to do a tremendous amount of statistical and graphical work using a single command, the frequencies function in SPSS, one of the most versatile commands you'll ever use.

In our previous movie we looked at the power of the frequencies command, but for basic statistics another very common choice is descriptives within SPSS. The neat thing about descriptives is that it allows you to achieve maximum density, that is, to get a lot of numbers on a lot of variables in just a little space; that's what descriptives is really good for. On the other hand there is a restriction: it works only with numerical variables. But that's a lot of the data that you might have, and if you have that, it can give you things like the mean, the sum, the standard deviation, the standard error, the variance, the minimum and maximum, the range, the skewness, and kurtosis. You might say, but guess what, in case you don't remember, frequencies does more. But that's okay; there are certain things that the descriptives command does well. Here's what it does well. First, it gives you very concise, compact tabular output, so it's really easy to see a bunch of information in a small space. Second, it's a really quick way to find obvious errors in coding in your data. Finally, you can get proportions for
indicator variables, that is, 0/1 variables, and I'll show you how that works. Also we have a bonus feature here in descriptives: descriptives is the home of SPSS's top-secret, hidden, one-step z-score procedure. I've seen people knock themselves out trying to get z-scores by getting standard deviations and means; you don't have to do any of that, you click one button and you're done. But let's try it in SPSS and I'll show you how it works. Just open up this syntax file and we'll see what you can do with descriptives. We'll begin as always by opening the data set; we'll be using demo.sav. Here's the path on a Macintosh running version 22 and the path on Windows, also running version 22. This is my first command, and it looks really long, but that's because I have a lot of variables in it. All we need to do is come up to Analyze, to Descriptive Statistics, and Descriptives. We click on that; now one of the things it does is it only shows you the variables that it can analyze, so gender, which was a string variable, I mean it had just text, that's not in there. But what I can do is I can just select all of them, do a Command-A or Control-A, and then move everything over, and then I'm just going to do the default analysis. I'll just hit OK, and here's our output. We have a whole bunch of variables, and it tells us first the number of observations is 6,400 almost all the way down; this question about Internet is missing some data, but that appears to be the only one. We have the minimum value and the maximum value. By the way, this is where I talk about quick and easy data checking: if you have a variable that's only supposed to go from 1 to 5 or 0 to 1, if you have a 17 you know something's wrong, and so simply checking the outer boundaries is a fast way of seeing if there are any really obvious errors. We also have the mean and the standard deviation, two of the things you generally need, the first two moments of a distribution. And so that's a lot of information, and it's in a very concise format.
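The boundary check described above (a 17 on a variable that should run from 1 to 5) can be sketched outside SPSS in a few lines of Python. This is only an illustration of the idea, not the syntax file used in the video; the function name `describe`, the sample ratings, and the 1-to-5 range are all hypothetical.

```python
import statistics

def describe(values, low, high):
    """Summarize a numeric variable and list any values outside [low, high]."""
    valid = [v for v in values if v is not None]  # drop missing values
    return {
        "n": len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": statistics.mean(valid),
        "sd": statistics.stdev(valid),  # sample SD (n - 1 denominator)
        "out_of_range": [v for v in valid if not low <= v <= high],
    }

# Hypothetical 1-to-5 ratings with one missing value and one coding error (17).
ratings = [1, 3, 5, 2, None, 4, 17, 3]
summary = describe(ratings, low=1, high=5)
print(summary["n"], summary["min"], summary["max"], summary["out_of_range"])
```

Looking only at the minimum and maximum is enough to reveal the miscoded value here, which is the same quick check the descriptives table makes possible.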
That's a wonderful thing. If we go back to the syntax, I do want to mention this one thing about indicator variables. I said it earlier; it's this: if you have indicator variables, that's a binary or dichotomous variable that has only two possible values, and if that variable is coded as zero and one, then you can in fact get the mean of it and it tells you something: it tells you the proportion of observations that have ones. And this works best if you use the standard programmer format of zero equals false or no and one equals true or yes. Strangely, in this particular data set that's true for most of the variables but not the last one or two in demo.sav, and I have no idea why they switched that, but it's something that you want to check in the coding before you go ahead and do it. And so if I go back to the output, you can see for instance that most of these, wireless service down through owns fax machine, those are all zero-ones where zero is no and one is yes. The mean right here tells us that 99% of the people own TVs, 96% own VCRs, because this is a long time ago, 25% had paging services, and I like this one, where's the internet on this list, 27% had the internet, because this was apparently generated in, like, you know, 1990, who knows when. Anyhow, those are meaningful data points: the mean tells you the proportion of ones or yeses. I'll go back to the syntax here, and then let's take a quick look at the z-scores. Now any reasonable person would think that a z-score is a transformation of the data and therefore it would be under the Transform menu, but, you know, it's not there. Instead it's hidden as an option in Descriptives. So let's go back to Descriptives and let's use age and income. So I'm going to reset this, pick age, and I want to pick household income, and I'm going to get both of these as z-scores, because a lot of procedures work a lot better if you have z-scores. All you have to do is this: click Save standardized values as variables. And if I hit OK, what it's done here is it
gives me the descriptives, because I actually still ran the descriptives command for those two variables, but more significantly let's take a look at the data set. When I come to the data set, if I scroll to the end, here are variables that were not there previously, Zage and Zincome, and they have lots of decimal places, because you need those with z-scores. Now I'm refreshed. Now under normal circumstances you would want to save this into the data; I'm not going to do that because this is one of SPSS's built-in default data sets. But I do want to show you that we can do one other thing here: let's go back and get descriptives for those z-scores. So I'm going to come to Analyze, Descriptives, I'm going to reset this, come down to see our two new variables, I'll do a little shift-click to get both of them, then pop them over here, then I'll hit OK. And as you would expect, a z-score has a mean of 0 and a standard deviation of 1, and we didn't have to do it manually, we didn't have to remember any values, we didn't have to round things off; it did it exactly for us. And so that is what the descriptives command does: it makes very concise tabular output, and it also allows you to save standardized scores, or z-scores, for use in certain procedures.

For a final look in SPSS at basic statistics we'll look at the explore command. I like to think of this as a way to get a lot closer, get a little macro view on your subject, get closer and see what's there in detail. Now the explore command is going to give you a bunch of statistics: it can give you the mean and the confidence interval for the mean and the trimmed mean, as well as the variance, the standard deviation, the interquartile range, the minimum and maximum, the range, skewness, kurtosis, a collection of M-estimators, which are special robust ways of measuring the center of a distribution, percentiles, which we've seen before, and lists of outliers. It can also give you a collection of plots. It's the one place in SPSS that you can get a stem-and-leaf plot. Now
traditionally those are things that are drawn by hand, so it's kind of cute to see a computer do them. You can also get box plots, and you can get histograms, and you can get a set of normality plots, such as a Q-Q plot or a detrended Q-Q plot. And the neat thing after that is you can break all of these analyses down by groups. So let's try it in SPSS and see how it works. Just open up this syntax file and we'll run through the various procedures in explore and see how it can add to your own analysis. As always we'll begin by opening the demo.sav data set; here's the command for a Mac, here's the command for Windows. Now again I'm saving this as syntax; that makes it repeatable, and it means you can download it and try running it on your own, but I created all of this by using the menu commands. Let's start by doing a default explore analysis for a couple of variables. I'll come up to Analyze, to Descriptives, and then we'll come here to Explore, and what we're going to do is age and income category. And again this is kind of interesting because these are different kinds of variables: age is a scale variable and income category in this case is an ordinal variable. I'm just going to leave all the defaults as they are and hit OK, and here's what we get from this. First we find out whether there were any missing cases; there weren't in this situation. And then we get a collection of descriptive statistics for these; we have first for age, then for income category. We have the mean with the standard error, the confidence intervals, the 5% trimmed mean, median, variance, standard deviation, minimum, maximum, range, interquartile range, skewness, and kurtosis, along with their standard errors. And so there's a lot of information there, and we scroll down and find the same kinds of information for income category in thousands. Now remember, some of this you wouldn't normally want to use, because income category in this case is not a scaled variable, and a lot of these things, like minimum, maximum, and trimmed mean, work
best with a scale variable, but SPSS is able to kind of run it on everything, so interpret with caution. Then we come down and look: we have a stem-and-leaf plot, where this is age, which in our sample is two-digit numbers, and so this means 1 8, 18, and each of these leaves, each of these numbers over here, is a leaf that represents 10 cases. Remember we have 6,400 cases, so we have about 640 numbers right here. And you can see for instance that the 30s appear really common, late 30s, and that we go up to somebody in their late 70s, and so that's an easy way to see what's going on. Simultaneously we get a box plot, and the nice thing about this is you can tell really quickly there are no outliers on age, not in this particular data set. The same thing with income category: the stem-and-leaf plot looks funny, but that's because there are only a few possible values, one or two or three or four, and it's drawing it so it looks a little weird. But we can come down and get the box plot as well and see there are no outliers, at least on this kind of variable. Again, not normally something you would do with a rank-order variable, but it's possible here. Now the neat thing is there are additional statistics. I'll do the same two variables, but I'm going to go check off a lot of options that I have right here. So let's go back to that dialog; I'll go to Explore, and what I'm going to do is I'm going to say just give me the statistics right now, and I'll come up here and I'll make some selections. One thing: although 95% confidence intervals are by far the most common, I have seen significant situations where people used 80% confidence intervals, so you can change it if you want. Then I can get all of the M-estimators, it's a whole collection; I can get a list of outliers and a list of percentile values. I hit Continue and I click OK, and now we have the same table we had before, that's the descriptives up there at top. Then we have the M-estimators, and these are four different robust measures of center. Again, all of
them are trying to give us something equivalent to the mean, and you see in this case Huber's M-estimator, Tukey's biweight, Hampel's M-estimator, and Andrews' wave; the numbers are all pretty similar. I mean, it goes from a low of 41.18 to a high of 41.52, but they're all really close. Each of these has specific parameters that go into them. You can't adjust them in the dialog box, but let me just return to the syntax for one second: you see here, these are the parameters for each of the M-estimators; you could change them here if you wanted to. I'll go back to the output: percentiles, 5, 10, 25, up to 95, and then it gives us the case numbers for the highest and lowest five cases on each variable. And so this is a really nice way of seeing a multidimensional picture of our data. Now in terms of pictures, there are even better ways to do this with more graphs. So let me go back to the syntax for a second, and you see that we can get some additional plots. I'm going to use age and income category again, but I'm going to change what it tells us. So first off I'm going to say give me just the plots, so we're not going to get any statistics. I'm coming to the Plots menu; well, we have a stem-and-leaf by default, let's get a histogram, let's also get normality plots, that's a way of assessing how closely your data match a normal distribution. I'll hit Continue and OK, and now I have a histogram for age and the stem-and-leaf plot. This one here is normal, but this one here is new: it's a normal Q-Q, or quantile-quantile, plot of age in years, and if it were normally distributed all of these circles would fall exactly on this line. You see it's really close, but it does deviate at each end. And the detrended one takes that line, sort of flattens it out, and it's much easier to see where the changes are. Now I know it looks really big in this case, but this variable is in fact pretty close to a normal distribution. Then we have our box plot, and then we do the same thing for income, starting
out with a histogram, our stem-and-leaf plot, and our normal Q-Q plot, again a little weird because there are only four possible values in this data set, but they all fall pretty well on the line, and there's our detrended plot, and then finally the box plot that we saw before. Now there's one more thing we can do with the explore command, and that is we can take some of these analyses and break them down by groups. So if we go back to the syntax, we'll see I'm going to do income and break it down by gender. Let's go back to the menu here, go to Explore, and I'm going to reset this, and we're going to take income and put that into our dependent or outcome variable list, or the thing that we're attempting to predict, and then we'll take gender, scroll down a little bit, there is gender, and put it into the factor list, or sometimes people call it independent variables, that's if it's an experimentally manipulated variable, or the predictor variable. I'm going to come up here and I'm actually going to skip the statistics and get plots only. I don't want the stem-and-leaf, but I will get a histogram, I'll get the normality plots, and now, because I'm breaking it down by groups, I can check the spread versus level with the Levene test. The idea here is that the data should be spread out approximately the same amount for each of the groups so we can compare them using some uniform statistics. I'm going to do what's called a power estimation here, click Continue and then OK. And now what we get again is a list of the number of cases that have complete data, and all of them do, there's no missing data. We have a test of normality, and what we see here is, based on both of these, that the data for neither group is normal. That's okay, because we knew that income was strongly positively skewed. Then the test of homogeneity of variance, whether the two groups have about the same variance or spread: you know, there is some difference, but it is not statistically significant, and so it appears to be the same for the men and the
women, which is good in this particular data set. And then we can come down and see the histograms, first for women, and you see it's got a really strong skewness there, and the same thing again for men, really strongly skewed. Then we get the normal Q-Q, or quantile-quantile, plots, and again, if it were normally distributed all of these points would fall right on this line. It's strongly skewed, and so we have this really big bend in the data. The same is true for men. And here are the detrended plots, where they should all be flat on that line; instead you get this swoosh mark. And so it just confirms that we're not dealing with normally distributed data. Then what you do have is this big collection of outliers in the box plots. I'm going to do one thing: I'm going to double-click on this, and then I'm going to come right up to here, and this will turn off the data labels so we can get rid of the ID numbers. And you can see that we have a lot of outliers in both the men and the women, and there are no really obvious differences between the two groups. And the spread-versus-level plot is something that you can use if you have multiple levels; it can help you select a kind of power transformation, a square root or reciprocal, a square, or something like that. But that's a more complicated topic and something for another day, and besides, it appears that we have relatively homogeneous variance in the two groups, and so we'd be good to go ahead and do our other analyses. And so those are some of the options in explore, and that's where we'll end our discussion of basic statistics, but you can see how they can be used to see how well your data meet the assumptions of the procedures that you use, and then really how well you can make inferences from your sample to other groups.

When you're working in SPSS and you're accessing data, one of the most important things you can do is to create labels and definitions for your data. I like to think of this as the statistical version of Alice in
Wonderland and the caterpillar asking her to explain herself. You need to explain yourself, or more specifically, when it comes to your data, you need to tell SPSS what your data mean. Now that is the data description, and I see two kinds of information that you tell SPSS about your data. The first one I'm going to call semiotics, which comes from the study of meaning. This is where you tell SPSS what the variable names are, the data types, the variable labels, the value labels, the missing values, the level of measurement, and the role that each variable plays. Contrasted with that, there are other elements that I'll call aesthetics, and that addresses variable width, decimal places, column width, and alignment. And these are all settings within the data window of SPSS. One of the most important, though, at least for human consumption, is going to be the variable and value labels, and so I'm going to take a little time and talk about those. With the variable names, that's the short name, the ones that you have there at the top of the column, there are some important rules. So the rules for variable names: number one, the names must be unique; no two variables can have the same name. That shouldn't be too surprising, that's an identifier. Rule number two, the names must start with a letter. I put an asterisk there because you can start with an at sign, a pound sign, or a dollar sign, but you don't want to, because those are generally reserved for special functions within SPSS. Rule number three, names can use letters, upper or lowercase, they can use numbers, and they can use period, underscore, at, pound, and dollar sign. On the other hand, don't end with a period, that can cause confusion with the command terminator, and don't end with an underscore, because that's used for automatic variable names when SPSS is doing computations. Rule number four, names cannot include spaces. And rule number five, names must be no longer than 64 bytes; in most text coding systems that's 64 characters, but if you're using the Unicode
system, that might be only 32 characters. And the last rule, rule number six, is that the names cannot be any of these words: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, or WITH, because those are all reserved function names within SPSS, so don't create that confusion. And so those are the short names that go at the top of the variable. On the other hand, with the label that you associate with that, you can give it a more descriptive name. Those are the variable labels, and there are a few rules for those. Rule number one, they must be no longer than 256 bytes. That actually means a label could be really long, although you don't usually want to do that, because some procedures will display as few as 40 bytes, 40 characters, and you really want to be able to read what it is. So you want to keep it short, but you can go longer if you need to. Rule number two, the labels must be enclosed in quotes, although I'll tell you they need to be straight quotes, the vertical ones, and not the curly quotes, or SPSS chokes on those. Rule number three, labels can include any character, including spaces, which is something that you can't have in the variable name, but you can put it here. So that allows you to put labels that sort of float on top of the variable names, and those can show up in the variable lists, and they can show up in the charts and the output that you create. Another really important one is value labels. You may have a variable called gender, and you may put in zeros and ones, but do you remember what those zeros and ones are? So I'm going to show you some ways of dealing with that. The most important thing is to put value labels on there. So here are the rules for value labels. Rule number one, they must be less than 121 bytes, so that actually is really long; you generally want to keep your labels pretty short. Rule number two, like the variable labels, the value labels must be enclosed in quotes, and they need to be the straight quotes and not curly quotes. Rule number three, labels can include any character, including spaces.
That's good. This is an interesting one: rule number four, the value labels do not need to be unique. That is, more than one value can have the same label. So you might have the numbers one through nine, and it could be that seven, eight, and nine all say the same thing, but underneath they have different codes. There are situations where you might want to do that. But mostly I want to show you how this works in SPSS, so just open up this syntax file. This one's gonna be a little different, because we're actually not gonna use a data file. I'll refer to one, but I mostly just want to show you the syntax. This syntax file shows how to write variable labels and value labels. Now, you don't necessarily have to put them all broken down in lines. I do it because it makes them a lot more readable; it's a lot easier to see what's going on. The first thing is the command VARIABLE LABELS. Because it's an SPSS command, it's written in all capitals. Then what you do is you write the short name of the variable, and then you have at least one space, and then you have straight quotes and then the long label. So here, for instance, I've got var01, that would be the first variable, and then this is its label written out. You don't need to have anything after it; you don't need any commas or question marks or semicolons or anything. You just go to the next one. Now, I put it on another line because that makes it easy to follow, and I run them all through here. I'm gonna make one important recommendation. If you have a dichotomous variable, a binary one that has only two possible values, and gender might fit into that category, let me recommend this: that you code it as zeros and ones. A lot of people use ones and twos, but that gets confusing. Code it as zeros and ones, and name the variable after whatever the one is. Now, when it comes to male and female, I generally give one to whichever group I think's gonna have the higher score on my main outcome variable, so it'll switch around. But if for some reason I think that
men are gonna have a higher score on an outcome variable, then I will call it male, and the label will be "R", for respondent, "is male". On the other hand, if I think women are gonna have a higher score, then I will call the variable female, and the label will be "R is female". I would obviously only use one of those two. Now, here are some other examples. I tend to give generic names such as var, or really just q for question, q01, q02, and I use the leading zeros so they sort properly in the dialog boxes. When you're done listing all of your variable names and the variable labels in quotes, just end with a period. It doesn't have to have a space before it; that's left over from earlier versions of SPSS, and it's a habit I have. You can run this at any time, and it will assign these labels to the variables, and then they'll show up in the data file, which is nice. Next are the value labels. What you have here first is the command, which is written in all caps, and then you give a list of variables to which the values apply. You can list them out separately: var1, var2, and here I've got var3 without a leading zero. And then, if they're all next to each other, if they are adjacent, you can actually specify ranges: var3 TO, in capitals, var10, so that would be 3, 4, 5, 6, 7, 8, 9, 10. Then you just go to the next line and you give the first value, that's a zero, and here I give zero equals "No" and one equals "Yes". When you're done giving the values, you need to put a slash so it knows you're done with the values for that variable. Then you can go on to the next variable. I said, for instance, if I gave one on a gender variable to men, I would call it male, and so zero, which would mean no, they're not male, would be "Female", and one, yes they are, or true, that would be "Male", and do a slash. On the other hand, if you coded it the other way, then you just call it female, and zero, which means no or false, means they're not female, they're male, and one means they are. Fine. Obviously use just one of these, and do the slash.
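As a minimal sketch of the syntax described so far, it might look something like the block below. The label text is made up for illustration, and it assumes variables named var01 through var10, q01, q02, and male already exist in the active data set, with var03 through var10 adjacent to one another; none of this is copied from the course's actual syntax file.

```
* Hypothetical example, not the course syntax file.
* Each line is a short variable name, a space, and a label in straight quotes.
VARIABLE LABELS
  var01 "First variable, with a longer descriptive label"
  q01   "Question 1"
  q02   "Question 2"
  male  "R is male" .

* One value-label set per variable list, with a slash between sets.
* The TO keyword covers adjacent variables, here var03 through var10.
VALUE LABELS
  var01 var02 var03 TO var10
    0 "No"
    1 "Yes" /
  male
    0 "Female"
    1 "Male" .
```

The final period ends each command, which is why the whole VALUE LABELS block behaves like a single sentence.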
And then I could have a rating variable, say, for instance; a lot of people call it a Likert scale, just a rating scale. I could do rate01 TO rate10, and I can specify every value, so this is a five-point scale from strongly disagree to strongly agree, finished with a slash. Or maybe I have a different kind of scale. Here at the end I have scale01 TO scale02. That's an 11-point scale, but I only mark the two ends, the zero and the ten, so zero is "Never" or "Not at all" and ten is "Always" or "Completely". And then, to let SPSS know that I'm done specifying value labels, I end with a period. So this is actually a single sentence, and it's a way of telling it how you want the numbers to appear, both in the data window and in any output that you get. Finally, I'll mention something about missing values, because it can also be easier to specify these in syntax. The command is MISSING VALUES, and you just give the names of the variables, and you can use TO in the same way, and then in parentheses you put the number that is assigned to missing values. 99 is common, so I've got that there. Then you can do a slash if you're going to use different codes after that. I could do male through female, and here I say 2 THRU HI, and really what that means is anything other than a zero or a one is missing, so if I accidentally type in a seven, it knows it's missing. And then here I specify several different values. I can put 7, comma, 8, comma, 9, so if any of those show up, those would be considered missing. It does do what you want. The nice thing is it will exclude them automatically from analyses, but it will include them in frequencies when you're getting that output. Finish with a period, and then you just run these like you do any other command, and it's going to do a lot to clarify your data and make it easier to follow your analyses and reconstitute your work in the future. When you're working in SPSS and you're trying to access data, you may get the idea of entering data. Well, let me tell you my
thoughts: you want to enter data in SPSS? I just see it as an exercise in frustration. It's a pain to do it manually. I think maybe if you're entering 10 or 12 numbers, you know, basically nothing, something that's often referred to as a toy data set, maybe you could do that. Now, it's also possible to copy and paste data, but I'm going to say sort of, because it doesn't work really well. I'll show you that. It's much, much easier to just import the data from a CSV file or text file, and I'll show you how to do that in the next section. But in terms of entering data, let me show you how it works in SPSS. We'll just open up a blank document and we'll try it. So here's a blank data window in SPSS. I can come right here and I can enter a number, and, you know, unfortunately, when I press Tab it actually goes down, which is an unusual behavior, and you see it gives it an automatic variable name, VAR00001. Well, if I want to move sideways, I actually need to use the right arrow key, so I'll go this way, two, three, and so on, and then I can hit Return and it goes down. I'll come back to here and I'll go four, five, six. I'll hit Tab and it comes back to the beginning, so it's not the most intuitive behavior. Plus, you see it gives it these generic names. That's because you can't enter the variable name directly in this window. Instead, what you have to do is go to Variable View; you can also get there by just double-clicking on the variable name. Here we go, and you can enter the variable name, and you can change other things you want to do. It works, but it's a pain. I'm gonna come back here to Data View. Now, I mentioned you can import data, sort of, so let me show you how this works. I'm actually going to go to a Google Sheet that has nothing in it at the moment, and here I'm going to enter a few values of a few different kinds: 56, 43, and I'll enter a letter, J, Return, and I'll go K. So there, I've got two-digit numbers and I have letters, which will be string variables in SPSS. I'm going to copy
those, and we'll see how well they paste over in SPSS. So I'm going to go back there, come over here to the side, and I will paste those in. You see that the values came in and showed up with decimal places, and I can get rid of that, but it's really weird with the string variable, with the letters. And so you can copy it. Notice also I can't copy in variable names; I still have to enter those in manually. You can deal with those when you import. But really, this is a demonstration that putting stuff manually into SPSS doesn't work well; it's not a good environment for that. Use a spreadsheet, use Google Sheets, use Numbers, use Excel, anything. Enter it there and then import it. I'll show you that in the next section, and you'll see that it's a much, much easier process. The last thing I want to say in SPSS about accessing data is about importing data, and, you know, compared to entering it manually, it just makes me feel like this, and I resort to cheesy clipart to show how happy I am, because, no doubt about it, importing is absolutely the best way to go if you want to get data into SPSS. Now, the nice thing is SPSS can open text files, it can open CSV, or comma-separated value, files, and even XLSX, that's Excel, files, as long as they're formatted right. Now, what do I mean by formatted right? There's a term from Hadley Wickham in the R developer community, tidy data, and it's referring to something very specific. It says that your file should have only one sheet, so that's one worksheet, even though Excel can take more than that; that each column should be exactly equal to one variable; and that each row should be equal to one case. And an important thing is no funny stuff in your Excel sheet, because Excel is very flexible. When I refer to funny stuff, I'm talking about things like macros and formulas and graphs and formatting and comments, or merged cells, or headers taking up their own rows, or duplicating row numbers. You don't want any of that. Basically, you want to treat it like a CSV file, and if you do that, then you
find you can import it very easily into SPSS. In fact, let me show you how this works. We're going to try this in SPSS, but I want you to do two things first. I want you to download the course files, and that will include a zipped folder by this name that ends with "datasets". That's going to have three files inside it; I'll show you those in just a second. And then you can also open up this syntax file that will work with them. But let's go see what's inside the folder and explain a little bit what's going to happen here. The folder that I've asked you to download contains three different files. Now, I have both the folder here and I have the three files saved separately next to it. Normally they would be inside it, but for the syntax to work properly, you want them sitting separately on the desktop. All three of them contain the same data. It says mbb, which stands for Mozart, Beethoven, and Bach, because this is Google Trends data about the popularity of searches for each of these three composers' names since 2004. This first one is in CSV, or comma-separated value, format. The second one is a plain text file, and it's tab separated. And the third one is an XLSX file, so it's an Excel sheet. You can see it's the same numbers, but it appears a little bit differently when I do the Quick Look here on my Macintosh. What we're going to do then is open up the syntax file, and we're going to see what we need to do to import each of these. Now, I've saved the syntax, but the fact is it's easier to do this stuff through the menus. I give some information here about using the file path: in each of these syntax commands, I have to specify the file location. Now, this is the format if you're on a Macintosh like I am; of course, you'll want to change Bart to be the name of your home directory. If you're on a Windows computer, you're going to need to change it to something a little more like this, or possibly, depending on the version of your operating system, using backslashes instead. Anyhow, I'm going to
show you how to import each of these, and I've got the duplicate information here in the script in case you want to run it that way, but it's actually really easy to do it from the menus. So here's what I'm going to do. I'm going to come up to my data window; I'll just click over to that. The data window's empty right now. I'm going to go to File, Open, and Data. You do that if you're opening an existing SPSS file or if you're importing something in a different format. Now, here I'm on the desktop. You can see my folder there, but you can't see the three data files I have next to it, because right now it's only going to display files that are in the .sav format, that's the SPSS proprietary data format. I'm going to click on that and come way down here, and we'll start with the text file, the .txt version. So we're going to hit that, and now you can see that it's there. I'll select that file and I'll click Open. So now I have the SPSS Text Import Wizard, and we can scroll through most of this pretty quickly. It asks if it matches a predefined format, something that I would have saved somewhere else. It doesn't. Are they delimited? Yes, they're delimited, by tabs in this case. Are the variable names included at the top of the file? You see how they show up here as the first row. Well, I click Yes, and now it excludes those, because it knows that those need to be in the header of the data file. Hit Continue. Each line represents a case. I want all of the cases; you could sample from it if you had a very large data set, and that would allow you to do exploratory analyses more quickly than you could otherwise. Which delimiters appear? Now, by default, a text file, the one that I have, uses tabs, and it knows that. It asks about text qualifiers. I don't have text qualifiers in here, so I just hit Continue; I don't have to change anything. Now, I have dates here at the beginning, and they are year-month. Now, SPSS can handle dates; however, it doesn't like the fact that I'm using year and month without the day associated with it. Consequently, I'm going to leave it
just as a string variable, as a text variable, and it still works properly in any analyses I want to do, so that's fine. I'm just going to hit Continue; I'm not changing anything here. It asks if I'd like to save the file format for future use; that's the thing I was referring to in the first dialog. And it asks if I want to paste the syntax. I could do that, but I've already got it pasted. I'm just going to hit Done, and there it is. It's opened it up and it's formatted properly. If we go to Variable View, you can see it's got a string variable, it's got three numeric variables, and it has the proper number of digits and the proper number of decimal places, which is none. And it recognizes them as nominal, which actually is not the case, so I actually need to come here and change that to a scale variable, because the data that you get from Google Trends is sort of zero-to-100 percentages in terms of the relative popularity of search terms. So I change that to scale, and otherwise I'm good to go. Let's do the same thing, but with a CSV file. To do that, I'm just going to get rid of this data file; I'll just open up a new one. There we go. I'll come back up to File and Open, to Data. This time I need to tell it I'm looking for a CSV, but if you remember, that's actually under Text. So I click here, except this time, instead of selecting the .txt file, I'll select the .csv file, and what you find is that the procedure is almost identical. There's only one super tiny change here. I hit Continue, I tell it the variable names are at the top, it is delimited, it needs to know each line's a case; I just hit Continue on all of this. Here's the one difference: when I did the text file, Tab was automatically selected. Now that I'm doing the CSV, which means comma-separated values, Comma is automatically selected. I hit Continue. It does the same thing with month; we're going to leave it as string. I hit Continue, and I can hit Done, and you see it looks exactly the same. I do have the same issue, though, that these three numbers, which go from 0
to 100, are coded as nominal. I need to change them manually to scale. Right, and now we'll do the third one, an Excel file. Now, in a lot of programs you get very stern warnings about importing Excel files, and there are good reasons for that, because Excel files are very flexible and people can put a lot of stuff in there: again, comments and changing column widths and merging cells, things that make it easy to use Excel just for displaying information, but if you're importing, you don't want to do that. Fortunately, I have it set up as tidy data already. Columns are the same as variables, rows are the same as cases, and there's nothing else in there. So what I can do in this case is come to File, Open, we'll go to Data again, and this time I come down to this one; it actually has Excel file as a format. There it is. I'll hit Open, and you'll see that the dialog is different in this case. It says Opening Excel Data Source instead of the Text Import Wizard. It says read the variable names from the first row; that's checked by default. It knows how many rows of data I have, and it's got this thing about maximum width. I don't need to worry about that. I just hit OK, and that was that. Here's the data from Excel. It's the same data. I still need to change these three measures manually. You could save this information in syntax if you're going to be doing it many times over, but that is sufficient for what I need. And so it turns out that importing information into SPSS is really easy, and it's massively more efficient and easier to do than entering it directly. You do it in a spreadsheet, especially if you do it on Google Sheets; if you're entering stuff manually, you can collaborate on it. Then you save it as a CSV file and you pop it in there, and then you can get straight to your analysis, and that is the point of all this work anyhow. And now, in SPSS: An Introduction, we get to the part that maybe you were waiting for, and that's analyzing data. I'll mention, however, that I'm going to give only a very small overview of
analyzing data, because we have an entire separate course here for data analysis and also data visualization in SPSS, and I recommend that you check those out. But as a taste of what's available, we'll talk about a procedure that's of interest to a lot of people in applied settings, and that's hierarchical clustering. Now, the idea here is that you're trying to find the clusters in your data. More specifically, what you're trying to see is whether similar cases cluster together in some way that you can use to make inferences about them. The trick, however, is that similarity depends on your criteria, and there are a few decisions that you have to make when you're doing a cluster analysis of any kind. So, for instance, you have to decide whether you're going to do a hierarchical cluster analysis, which goes from one group to as many groups as you have cases, or whether you're going to use a set k, or set number of clusters. You also have to decide on the measure of distance that you're going to use. Euclidean distance, which is sort of like measuring the as-the-crow-flies distance between cases, is very common, as is squared Euclidean distance, which is what SPSS uses. Then there's the question of whether you want to start with everything together and split it up, in a divisive procedure, or start with everything separate and put it together, in an agglomerative procedure. By default, some programs like R do divisive, but by default SPSS does agglomerative. You basically end up with the same general findings anyhow, so it's really not a huge difference. So we're going to do a cluster analysis, but we're going to try to keep it simple. We're going to use some of the most basic methods for doing this. We'll use Euclidean distance, or squared Euclidean distance in this case. We'll use a hierarchical clustering, where we don't have to choose the number of groups ahead of time. And we're going to use an agglomerative procedure, where it starts with every case separate and then gradually
puts them together. We'll try this in SPSS, but I need you to do something first. There is a folder that you can download from the course files that ends with "data" here, and in it there's one file. It's cars.sav, where .sav is the proprietary SPSS data format. In addition to that, there is the SPSS syntax file, and you'll want both of those for this demonstration. If you save the data file to your desktop, it looks like this. You can just double-click on it and it will open up in SPSS. You also have the option of using syntax to do that. It depends on your operating system: this is for a Macintosh right here, and this is for a Windows computer, though you may need to use backslashes instead, depending on your version of Windows. I'm just going to go back and double-click on this to open it up in SPSS, and there's my data set. What this data set is, is a slight variation on a data set called mtcars that's in the default datasets package in R. It contains road test data on a number of cars from 1974, from the magazine Motor Trend. What we're going to do is look at this information and see whether the cars cluster together in some important way. I'll go to the Data View here, and you can see we have Mazda RX4, Hornet Sportabout, Mercedes 450SE, Lincoln Continental, and so on, cars that were all available in the early '70s. And we have information about miles per gallon, we have the cylinders, we have the displacement in cubic inches, horsepower, weight in tons, the time in seconds for the standing quarter mile, whether it's an automatic or a manual transmission, the number of gears in the transmission, and the number of carburetors, or probably carburetor barrels. Here I'm going to turn on the labels; only one variable changes here. By the way, one of the things I did is I formatted this for SPSS by adding labels and changing some of the decimals, which makes it a little easier to work with in the program. But let's go to this syntax file right now. Once we have the data open, we
want to do a default hierarchical clustering. Now, this is the code to produce it right here, but I'm going to do it with the drop-down menus to show you that it's really not hard to do. All we need to do is come up to Analyze, and then we come down to Classify. Now, I have to say, off the top of my head I cannot remember if every version of SPSS has this particular menu. Most will; I hope yours does, so you can follow along with this. Hierarchical Cluster: I'm going to click on that. What I'm going to do here is take car name, which really just says what the cars are, and I'm going to use that to label cases, because that's going to mean something to me. And I'm going to take all of my other variables, I'll just do a little shift-click here, and put them over here. At this moment I'm gonna change nothing else. You'll see it's going to cluster cases; that's what we want. It's going to give us both some statistics and some plots, and that's fine. I'm gonna hit OK, and we're gonna get a result identical to my first syntax command. I see it's in. I'll make the output window bigger here. Here's what we have. First off, it tells us how many cases there were, and there were 32, and they all had complete data, which is nice. Then SPSS gives us something kind of unusual called an agglomeration schedule, and it really specifies at what point in the procedure two cases got put into the same cluster. I personally don't have much use for this, except I know that when there is a big jump in the coefficients, as there is here from 3 to 26, you know that there is a very distinct category change, as there is from 662 to 1125, and so on. Most of the time, though, I would just completely ignore this one. And this is called an icicle plot, and it shows sort of the same information about when various cases got dropped in with everything else. It's kind of pretty to look at; I find it kind of meaningless. And so, truthfully, the default output for SPSS's hierarchical clustering, to me, is not very helpful. In
fact, it's so unhelpful I'm just going to delete it all, and I'm gonna do this over again. I'll come back up to my recent menu items and I'm going to go to this analysis again. I'm going to make a couple of changes. I don't want the agglomeration schedule; that doesn't really help me. And for plots, I'm going to get rid of the icicle plot, and I'm going to get a dendrogram instead. Dendrogram: that means branches in Greek, so it's a graph of the branches, and this is usually the most important thing you can get out of a hierarchical cluster analysis. I'll hit OK, and now what we have is a chart here that lists all the cases, the cars, on the side, and it shows how they group together. So we see, for instance, that these first four cars, the Mazda RX4 and the wagon, and the Mercedes 280 and 280C, are very similar to one another; they all go here together. We see some others if we come down here. So, for instance, the Cadillac Fleetwood, the Lincoln Continental, and the Chrysler Imperial, which are all gargantuan American cars with big V8s, they all go there together. And then we see down here at the bottom that this one, the Maserati Bora, is all by itself for a very long time. This is where cases are individual, here on the left, and they gradually get put together, and you see how they come together in each of these branches. That's why it's called a dendrogram. And so this is a really nice way of seeing how similar your cases are. If you have more pixels displayed, you can see the entire graph at once; I've got a low resolution right here. And you can see maybe it makes sense to split this off into, say, four groups. It looks like we've got a distinct group right here, right there, right there, and right there. And so I can do something else with this. I'm going to come back to the menu here, and what I'm going to do is save group membership. Now, I've done a hierarchical analysis, so I didn't have to specify the number of groups, but now that I've looked at the chart, four seems like a good number, so I'm
gonna come here and say, hey, give me the group membership for each case, breaking it down into four clusters. I'll hit Continue, and then I'm going to ask for it to not give me any plots. I hit OK, and this time we're not going to get any output except to say that it did the work. Let's just look at that: here it says that it processed them. The place where we're going to see the difference is in the data file, so I'm going to move over to the data file. This button, by the way, will get me over to the data. And now you can see I have a new variable that got added here, for four clusters, and you can see that each of the cars is listed in one of these four clusters. What you can do then is take these cluster memberships and compare them on the other variables. Again, remember, the clustering here is only as valid as the data that we give it, so it's only comparing these cars on a small number of variables, and it's using that to decide what goes with what. It's here, for instance, that you see the Maserati Bora is in a category all by itself. And this is a neat way of looking at the similarity between items. You can do it with people if you're doing market research, you can do it with companies if you're doing some sort of segmentation, and it allows you to see what groups have important similarities for what your purposes are, and which groups you need to treat differently from one another. That's the goal of hierarchical cluster analysis, and you find it's a very easy thing to do in SPSS. Another important procedure in SPSS when you're analyzing data is something called a factor analysis. Now, I like to think of it as looking at your data and trying to find shadows. In this picture, what you have are shadows; those are the black figures that you see, and it takes a moment to figure out that you're looking down and there actually are people, but they're kind of sticking straight out. And so in this photo, what you're going from is sort of a three-dimensional origin, that's the person
itself, to a two-dimensional variation with the shadow. What's interesting about that is you maintain most of the useful data. You can tell that they're people, that they're walking; you can probably even tell some things about how tall they are, what they're wearing, and so on. What you've done is you've made things more efficient. Now, in the data world, that's called dimensionality reduction, where each variable is a dimension, and too many variables can actually be really problematic. You're trying to boil things down a little bit, and you can think about the saying "less is more", or less equals more. More specifically, less noise and fewer unhelpful variables in your data set equal more meaning, because that's what you're trying to do: you're trying to extract meaning. Now, when it comes to factor analysis and related techniques, I have one very important piece of advice, and that is to be practical. At all points you want to remember what your goal is. So what is the goal? Well, for the goal of factor analysis, I'll tell you what it's not. It's not an exercise in analytical purity. You're not there to show that you know how to go through all the steps in the approved format. Really, you're working with your data because you're trying to get some understanding. So the goal of a procedure like factor analysis is useful insight. Try to follow the rules, and do what you can to make sure you don't make any obvious mistakes, but remember, you're not bound by the mathematics; you're bound by what the data tell you about the people. Another way of looking at that is to use factor analysis, or really any other procedure, for its heuristic value. That is, it suggests possibilities to you as you analyze the data and you're trying to get insight into people. Now, that's sort of a philosophical discussion. Let me show you how this actually works in SPSS. You're going to need to download from the course files a folder that says "data" here at the end, and from it the cars.sav data set. This is the one that we used in
hierarchical clustering as well. And then you want to open up the SPSS syntax file that goes with this particular section. Now, the easiest way to open the data set is simply to double-click on it, and you'll be ready to go. I do have some syntax you can use if you've saved it to your desktop. I've got it open already, so let's take a quick look at the data set. We have a collection of cars listed down the side, and their attributes, like miles per gallon and so on, and gears in the transmission, and carburetors. That's great. Now, I will have to make a very important confession here: this is a very, very small data set for factor analysis. It only has nine variables other than the identifier, and it only has 32 cases. Really, you would want to have at least several hundred cases and, let's say, several dozen variables before you can do this really reliably. But this example works, and it actually is really easy to see how it's happening and how to interpret the results. The first thing we're going to do, if you look at the syntax, is a default factor analysis. It's actually a misnomer, because it's not a factor analysis; it's principal components analysis, but it's in the factor analysis command within SPSS. So let's come up here to Analyze and down to Dimension Reduction; remember, I said that's what this is called. And pick Factor; it's our only choice there. What we need to do is choose the variables that we're going to use to see what we can compress, what goes into what. We don't need the name of the car; that's just an identifier. We can take the rest of these, however, and we can put them under Variables. Now, we've got a lot of options here. I'm not going to do any of them; I'm just gonna hit OK for right now. I'll make the output window bigger, and here's what we get from the default analysis. We get a text output of the commands that were generated by the drop-down menus. We get something called communalities. Each variable brings with it one unit of standardized variance;
that's based on how spread out the scores are, and if you standardize them then you have a variance and a standard deviation of one for each. The extraction tells us how much of that variance is really able to get reconstituted through the process that we're doing. An important one right here is the Total Variance Explained, because what this has done is it has created components. Remember, I said this is actually a principal components analysis here, which has profoundly different philosophical underpinnings from factor analysis; the difference has to do with which came first, the factors or the observed variables. Truthfully, most people treat them as relatively interchangeable, and if you're using them for heuristic value it's not going to be a big difference. What we have here are two components. We have one with 5.472 units of variance, that's 61 percent of the original variance of the nine variables, and then another one with 2.341. I'm getting those numbers from right here, and you can see it held onto these two, which collectively add up to about 87 percent of the variance. Now, the component matrix shows the relationship between the original variables and the two components. These are like correlation coefficients. You can see that miles per gallon is strongly negatively associated with the first component and really not associated with the second, but number of carburetors has a pretty strong association with each. So that's a way to start to look at it, but it's going to be a lot easier if we make certain modifications to this. In fact, I'm going to just delete this output right here and we're going to start over. I'm going to make a few changes; let's go through these options. First we go to Descriptives, and I don't really feel like I need the initial solution, so I'm going to unselect that. I'll hit Continue. Extraction: this is the actual algorithm that SPSS uses to work through the relationships in the
multi-dimensional space. You'll see right here it's principal components; that's why I said this is really a principal components analysis. You've got a lot of options here. Now, in many situations maximum likelihood would be a very good answer. I'm going to choose principal axis factoring simply because it's the classical version of factor analysis. I don't need to see the unrotated factor solution, but I do want to see something called a scree plot; that is a graph that shows me maybe how many factors I should keep. I'm going to come down here and change the maximum iterations for convergence, which has to do with the math that's done; I'm going to change it to 50. Then I'm going to come to Rotation. What you get here is a multi-dimensional space, and sometimes it's a little easier if you rotate the axes; it can increase interpretability. Now, there are a lot of different methods. Varimax is a method that maintains orthogonal relationships, which keeps all of your axes perpendicular to each other. There are situations where that's really good, but truthfully, for exploratory purposes, which is what we're doing, I like to use what's called an oblique rotation, which allows your factors to be correlated with each other; they don't have to be totally perpendicular. I'm going to use direct oblimin. Promax is another really good choice, but it usually is for larger data sets and I've got a tiny one here. Now, here I can get a rotated solution. I don't think I really need that, but I do want to see the loading plot, and I'm going to increase the maximum number of iterations to 50. I'll hit Continue. We'll come down to Scores, and you can save the factor loadings as scores, and there might be situations where you want to do that, but because I'm using factor analysis for its heuristic value, as a way of suggesting what variables go with others, I'm actually not going to do that, so I'm going to hit Cancel. And then finally Options: this is where you get to talk about excluding cases. I have a complete data set so
I don't need to worry about that, but the coefficient display format: now, I'm going to sort it, and then I'm actually going to have it completely erase small coefficients. Now, I've done this one before, so I happen to know that a value of 0.6 works; under normal circumstances that's really high, but given my very small data set this seems like a reasonable choice, and it makes the solution very, very clear when we look at it. So I'm going to hit Continue, and then there I'm going to hit OK. I've got my output here, and the first part's pretty similar, except it doesn't start with unit variance for each of these. That's because I'm not doing principal components anymore, I'm doing principal axis factoring, and so the math behind it's a little bit different, but we don't need to dwell on that one. Total Variance Explained: you see that we still have two factors, and the first one accounts for a lot of the variance; the second one accounts for a fair amount also. These are very close to what we had with the principal components. The scree plot is a very simple line plot that suggests how many factors we might want to keep. Now, there are several different rules you can use for interpreting this. One is to keep anything that's above a value of 1, because 1 is what it would be if a variable explained simply one unit of variance, but that's what it brought with it; you want factors that can explain more than that, and you see we have two that do a lot more than one, with these others sort of straggling down. The other rule is to look for a bend in the line, and you do see a strong bend right here. So three is where the bend is, and we're justified in staying with two. There are other methods that get more involved, about checking for the slope of this line and finding things that are above that slope; you can do those another time. This is a quick demonstration. For a final look at SPSS and analyzing data, at least in this brief overview course, let's take a look at one of the most useful procedures around: regression. Now
you might think of regression as sort of the statistical version of The Three Musketeers, where it's all for one. I say that because "all for one" is actually all variables for predicting one outcome. Put another way, regression uses many different variables, many predictor variables, to predict scores on one outcome variable. This makes it really useful in a huge range of circumstances, especially because there's something for everyone with regression: there are many different versions of it and many adaptations of regression that make it truly flexible and powerful when analyzing data, and make it a go-to tool for almost any analytical purpose you might have. We'll try a simple version of this in SPSS. First, make sure you've downloaded the data folder from the course files. We'll use the cars.sav data set that we've used in our two previous examples, along with this syntax file. When you get to the syntax file, it begins as usual with the code for loading the data set from the desktop. Truthfully, it's easier to just double-click on the file cars.sav and have it open up directly in SPSS. That's what I've done here, and you can see it's the same data set, with 32 rows of data, a bunch of cars from 1974, and several variables. What we're going to try to predict in this one is miles per gallon, based on things like the number of cylinders, the displacement, horsepower, weight, quarter-mile time, transmission kind, and gears and carburetors. All right, so that should be pretty easy. What we're going to do is go to Analyze and come down to Regression, and we'll use this second option here, Linear; that's just basic linear regression. We need to put under Dependent the outcome variable, the thing we're trying to predict. It kind of bugs me here, because "independent" and "dependent" really should be reserved for manipulated experiments, but we still know what they mean. Our outcome variable, the thing that we're trying to predict, goes here in Dependent, so that's miles per gallon, and then we can take everything
else except car name, that's just a label. We'll take all the rest of these and we'll put them under Independent, the variables that we are using to predict the outcome. Now, I want to do the totally default, no-extra-steps version first, so I've put the variables in their respective places and I'll just hit OK. Now we get our output, and it tells us first the code that was used to produce this analysis, then that it used all of these variables simultaneously to predict a single outcome, which is listed down here, and that they were entered at once. The model summary tells us that we have a multiple correlation of these predictor variables with our outcome variable of 0.931, which is really, really high. If you square that to get the proportion of variance explained, it's 86.7 percent. Even the adjusted R squared, which is lower because we have a small sample, is still 82 percent. It's huge. We get a significance test right here. We are not surprised to see that the significance is .000; it's not zero all the way through, but it's highly significant. Then we get the individual regression coefficients. What we're looking for here are significance levels that are under .05, and interestingly only one of them in this collection is under .05, and that's weight in tons. None of the others are. They're close. That doesn't mean that none of the others matter; it simply means that when you take all of the variables together at the same time, when they are taken as a whole, really only one of them deviates significantly from zero to become a predictor, and that's weight. Now, there are a lot of other ways of doing regression, and SPSS gives you a lot of choices. I'm going to come up here, back to Analyze, down to Regression. Now, I will mention there's a really interesting one here called Automatic Linear Modeling. This is an SPSS function that came in a few versions ago. It does a lot of automatic data prep; it does a lot of combining and splitting up of
variables. On the other hand, it's really kind of difficult to explain how it all works and then to interpret it properly, and I'm going to save that for another course where I specifically talk about analyzing data. For now I'm going to go back to Linear, and we're going to make a few choices; we're going to take some of the options that SPSS makes available. Now, the first one I'm going to do, at the risk of doing something very controversial, is I'm actually going to go from simultaneous entry to stepwise regression. This is controversial because some people in the literature have called it positively diabolical in its risk of a type 1, or false positive, error, and there's good evidence for that. On the other hand, in modern machine learning stepwise procedures have been very fruitfully used, and so it's not totally unacceptable to try it, especially when we're doing sort of an exploratory project like this right now. You certainly wouldn't want to use it for rigorous model building, but it is a nice way to get some insight into the data pretty quickly. I'll come to Statistics and I'm going to add a few things. I'm going to get confidence intervals for the coefficients; those are nice to have. We have the overall model fit, and I'm going to get the R squared change, because a stepwise model goes through several different steps adding variables, and we want to see if each variable adds something that is statistically significant to the overall model. We could get a lot more information here, but I'll leave it there for now. Under Plots we can get a ton of different plots, but I'm actually just going to come down here and choose the standardized residual plots, a histogram and a normal probability plot. Now, there are other options as well. I could save about 15 different kinds of scores to the data set: I can save unstandardized predicted values, I can save studentized deleted residuals, and so on and so forth, things I
could do here, and there are situations in which I might want to do those, but for right now I'm going to skip them because I'm simply trying to build a model without necessarily saving all of the steps in between. Options really just covers the criteria used in the stepwise procedure. I'm going to leave it at the default right now, but you could change it if you wanted to. And then Style is a new thing that has to do with the formatting of the table. I'm going to leave that one alone for right now because we're going to have exactly what we need. Now, I've created this already and I've saved it in the syntax, so I'm just going to hit OK, and you'll see that we get a different kind of output. I'll zoom in on this. Now, what we have is some code that's a little bit longer; it asks the system to go through the variables one at a time and find the predictor variable that is most strongly associated with the outcome, put it in the model, get partial correlations, and go through step after step. What we find here is that although we had nine predictors originally, only two of them were statistically significant when put into the model: they were weight and number of cylinders. Again, what we're trying to predict is gas mileage, miles per gallon. If you come down here you can see that they were both statistically significant, where the adjusted R squared for just weight is 74.5 percent, and when you add on number of cylinders it goes up, not a huge amount, but it goes up almost eight percent. The analysis of variance table lets us know that both of these models, with just one variable and with two predictor variables, are statistically significant. Here are the individual coefficients, along with their confidence intervals over here on the right side. Now, because we've gone through a stepwise procedure, it's not surprising that all of these are statistically significant, because that was the criterion used for including them. Here we have a list of excluded variables along with
their collinearity statistics, and this has to do with how much each of these variables is correlated with the others. So, for instance, number of carburetors is highly collinear, or easily predicted by the other variables that we could have included in the model. Then we come down to the residuals, so I'm going to look specifically at the chart. In an ideal world your residuals are normally distributed, which means they're just as likely to be high as they are low, and they're symmetrical, and we see here that they're not horribly, pathologically far from normal, so this is probably a good model in this sense. And here is a normal P-P (probability-probability) plot of this same data. If it were perfectly normal, all the dots would be on the diagonal line. They're close. These are the 32 individual observations and how far off they are; they're close enough. So this lets us know that our model is predicting really well and it appears to be not biased in one direction or another. So this is one method of developing the model. Again, the stepwise procedure is best for exploratory analyses; it's not something you would use for confirming a finding, but as a quick way of sifting through a large collection of potential variables this is a nice way to do it. It lets us know, for instance, that in this particular data set miles per gallon is predicted primarily by weight, which completely makes sense about the car, and number of cylinders, which is associated with having a large and thirsty engine. So the general idea of multiple regression, again, is to use many variables to predict a single outcome. SPSS gives a lot of options for those. We've looked at the default, we looked at one variation on it, but there's a lot more that you can explore and that we will cover in another course on statistical analysis in SPSS. But for now I encourage you to take some time and look at some of these options and see the kind of insight that they can give you on your own data, and see what options you can
use to get useful insight into your own analyses. I want to thank you for joining me in SPSS: An Introduction, and we'll conclude by giving you some next steps, things that you can do next, because, you know, once you get through this it can be a little confusing; it can feel like things are going everywhere, and it may not be totally clear where you should go. Well, here at datalab.cc we've got a few opportunities for you. First, of course, is more SPSS. We have additional courses on data preparation, on data visualization, on statistical analysis, and other topics that you can use to expand what you've learned in this introductory course and work on your own data. Now, if you've liked what you've learned with SPSS, you may want to try branching out to some other languages. The statistical programming language R and the general-purpose programming language Python are very common, powerful tools in the data science community and analytics in general. They are a great way to expand both the things that you can do with your analyses and your employment opportunities, and so I strongly encourage you to take a look at the courses on R and Python at datalab. Next, we have specific courses on data visualization, one of the most important things you can do in getting to understand your data. SPSS can work well in those, as well as other programs. And then I'm going to mention one final thing here. SPSS is a wonderful program, but it still has a fair amount of bugs and it can also be very expensive. Fortunately, some really interesting work recently in the open-source community has developed a program called JASP (it's actually pronounced "jasp"), which is sort of an open-source version of SPSS. It runs very differently. I find it very easy to use, and it makes it reproducible, it makes it easy to share; it's got some tremendous advantages, and we have courses on JASP here at datalab. I suggest you check those out and see how well that program is able to fulfill some of your computing needs. That being said, there are
some things missing. What's missing, exactly? Well, SPSS doesn't have a really strong and active user and developer community the same way that languages like R and Python do. But if you're creative you can get around that. Academic conferences, meaning specifically topical academic conferences like biology or management or the social sciences, often have very dedicated SPSS users and teachers and may sponsor specific hands-on workshops for learning more about SPSS and how you can use it within your particular domain. But no matter what you do, I'm going to encourage you to simply get started, go exploring, and see what you can do with SPSS in your own data work. Thanks so much for joining me, and happy computing.
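As a rough check on the figures quoted in the factor analysis and regression demonstrations above, the arithmetic can be reproduced in a few lines of Python. This is only a sketch and not part of the course materials: the function name `variance_share` and the variable names are made up for illustration, and the only inputs are numbers read off the SPSS output in the video (eigenvalues of 5.472 and 2.341, nine variables, and a multiple correlation of 0.931).

```python
def variance_share(eigenvalue, n_variables):
    """Share of total standardized variance carried by one component.

    Assumes each standardized variable contributes one unit of variance,
    so the total variance equals the number of variables.
    """
    return eigenvalue / n_variables


# Figures quoted in the transcript; the names are hypothetical.
eigenvalues = [5.472, 2.341]  # the two retained components
n_variables = 9               # variables other than the car name
multiple_r = 0.931            # from the default regression's model summary

shares = [variance_share(ev, n_variables) for ev in eigenvalues]
cumulative = sum(shares)

# The "eigenvalue above 1" rule mentioned alongside the scree plot.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Squaring the multiple correlation gives the proportion of variance explained.
r_squared = multiple_r ** 2

for ev, share in zip(eigenvalues, shares):
    print(f"eigenvalue {ev:.3f} -> {share:.1%} of variance")
print(f"cumulative: {cumulative:.1%}")
print(f"R squared: {r_squared:.1%}")
```

The two shares come out near 61 and 26 percent, which together match the "about 87 percent" read from the Total Variance Explained table, and squaring 0.931 gives the 86.7 percent quoted from the model summary.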
Info
Channel: Academic Lesson
Views: 636,566
Rating: 4.9089026 out of 5
Keywords: spss data analysis, spss, spss tutorial for data analysis, spss for beginners, spss course, data analysis with spss, spss tutorials, spss full video, academic lesson
Id: Bku1p481z80
Length: 136min 47sec (8207 seconds)
Published: Mon Aug 05 2019