Boxplots & Outliers in SPSS – Identify and Deal with Outliers (4-8)

Captions
The best description I've ever heard for boxplots (a.k.a. box-whisker diagrams) is "elegant simplicity." The boxplot serves up a great deal of information about both the center and the spread of the data, allowing us to identify skewness and outliers in a form that is both easy to interpret and easy to compare to other distributions. It is the graphical equivalent of the five-number summary, which we will talk about when we cover variability. All of that in one simple graph made of just a few lines, but since you may be new to boxplots, let's start with the basics.
The boxplot, also known as a box-whisker diagram, is made up of two components: a box and two whiskers. This is the box, and these are the whiskers. The boxplot shows the center and the spread of the data. This helps us to identify skewness and outliers, and makes it easy to compare different variables. The boxplot is the graphical equivalent of a five-number summary.
The line in the middle of the box is the median. The ends of the box show the upper and lower quartiles. The length of the box shows the interquartile range: the middle 50% of the scores. The ends of the whiskers show us the minimum and the maximum scores. In SPSS, the whiskers are trimmed to identify outliers, which are indicated by a small open circle, and extreme outliers, which are identified by a star. The length of each whisker shows the top or bottom 25% of the scores. From end to end, the whiskers show the range. The boxplot is also an excellent way for us to identify outliers.
We are going to create a boxplot in SPSS using the data set clickers.sav. The boxplot will show the range of the data, trimmed in SPSS so that we can more easily identify the outliers outside of the whiskers. In SPSS, using the clickers.sav dataset, let's go to Graphs > Chart Builder. Let's press Reset.
Now in the gallery, we're going to choose Boxplot, and we'll select this first one, the simple boxplot, and drag that up into the canvas. We can right-click, or control-click, to choose variable names. We're gonna move round 1 into the y-axis, and we put gender on the x-axis. That's all we need to do, so let's click OK.
A window opens up, and what do we see? Well, we see these two boxplots and something unusual: two asterisks, or stars, with the numbers 133 and 157. Those are the outliers. And what do you think those numbers stand for? Those are the case numbers. SPSS gives us the case numbers for the outliers so that we can find them more easily. Now we could scroll through our dataset looking for those cases, but let me show you a trick that will make this even easier, especially when you have a large data set or a lot of cases to look for. We're going to go to Edit > Go to Case. Remember that the first case was 133, so we're gonna type 133 and click Go. Notice how case 133 has now popped to the top, and we see that for round 1 the score is 55. This is an outlier.
An outlier is an extreme score on a variable. Now some outliers are legitimate, but this one is problematic, because what we have here is a data entry error. Data entry errors are not legitimate outliers, and they should be corrected.
So what do we do with outliers? If it is a legitimate outlier, you may choose to leave it in. As we say, outliers are people too. So if this one person really is 6 foot 6, that is a legitimate height, and we just have to include that really tall person in our data set. However, if the outlier skews the data set because it exerts too much leverage on the mean, then you may need to switch to nonparametric alternatives for your analysis. Sometimes, as is the case here, we have data entry errors. We know that the range of possible values was 1 to 7, so that 55 was not a legitimate value. It was probably supposed to be a 5.
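The circles and stars on the boxplot follow a rule that is easy to try outside of SPSS. The Python below is only a rough sketch, assuming the usual boxplot convention that a score more than 1.5 box lengths (interquartile ranges) beyond the edge of the box is an outlier, and a score more than 3 box lengths beyond is an extreme outlier. The ratings and the function name `classify_cases` are made up for illustration; they are not from clickers.sav. The quartiles come from Python's `statistics.quantiles`, which may differ slightly from the values SPSS computes.

```python
from statistics import quantiles

def classify_cases(scores):
    """Return {case_number: 'outlier' or 'extreme'} using the 1.5 / 3 IQR boxplot rule."""
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1  # length of the box
    flagged = {}
    # Case numbers start at 1, like the row numbers in a data sheet.
    for case_number, score in enumerate(scores, start=1):
        # How far the score lies beyond the nearer edge of the box (0 if inside).
        distance = max(q1 - score, score - q3, 0)
        if distance > 3 * iqr:
            flagged[case_number] = "extreme"   # drawn as a star
        elif distance > 1.5 * iqr:
            flagged[case_number] = "outlier"   # drawn as a circle
    return flagged

# Hypothetical 1-to-7 ratings; case 4 holds a 55 typed instead of a 5.
round1 = [3, 4, 4, 55, 5, 5, 6, 6, 7, 5]
print(classify_cases(round1))
# → {4: 'extreme'}
```

Returning case numbers rather than the scores themselves mirrors what the boxplot labels give us: a place to look in the data set, not a verdict on whether the score is legitimate.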
Correct data entry errors if it can be determined what the accurate value should be. Now in other cases, you have legitimate outliers, but they are non-representative. So say that we're examining the incomes for professional sports teams where most of the players are making $75,000 to $100,000, but you have two superstar players who make three million dollars. You may choose to Winsorize the data. To "Winsorize" is to trim the outliers to match the highest or the lowest representative value. So if the next highest player makes $135,000, the two superstar salaries are Winsorized, or trimmed, to $135,000 for the analysis.
Now as a last resort you can remove the outlier from your data set. This is more commonly done when the outlier is a multivariate outlier. In other words, the case has extreme scores on several variables, not just one. Now regardless of what technique you decide upon, you should always include details about your data cleaning in your write-up, and be honest and transparent about how you cleaned your data.
So what do we do with this first outlier? We know that the range was 1 to 7, so the 55 is not a legitimate value, and it probably was supposed to be a 5. We can't tell for sure, but that's a fair guess, so we're gonna change our 55 to a 5.
Now let's talk briefly about univariate versus multivariate outliers. Univariate means one variable. Multivariate means multiple variables. These are the variables that are being considered together. If you use a Likert scale that runs from a minimum of 1 to a maximum of 7, you will not have outliers for a single item. The most extreme answer that any person could give is a 1 or a 7, and those are both within the range of the data. When you combine multiple items into a single subscale, such as when you ask five similar questions about a personality trait like neuroticism, then a person who answered 7 on every question could be a univariate outlier.
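The Winsorizing idea from the salary example can be sketched in a few lines of Python. This is a minimal illustration of the definition given here, where the analyst picks the highest and lowest representative values by judgment. The function name `winsorize` and the individual salaries are assumptions made up for the example. Some software Winsorizes by percentile instead, which is a different rule.

```python
def winsorize(values, lowest_representative, highest_representative):
    """Pull every value into the range [lowest_representative, highest_representative]."""
    return [
        min(max(value, lowest_representative), highest_representative)
        for value in values
    ]

# Hypothetical team salaries: most are ordinary, two are superstar contracts.
salaries = [75_000, 82_000, 90_000, 100_000, 135_000, 3_000_000, 3_000_000]

# The analyst judges 135,000 to be the highest representative salary.
# Nothing at the low end is extreme, so the lower limit is the minimum itself.
trimmed = winsorize(salaries, min(salaries), 135_000)
print(trimmed)
# → [75000, 82000, 90000, 100000, 135000, 135000, 135000]
```

Note that both superstars stay in the data set. Only their values change, which is what separates Winsorizing from removing the cases.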
We measure variables with multiple items; combining those items is what creates the variable. Outliers on that one variable are the univariate outliers. When we analyze multiple subscales, such as work satisfaction, intention to quit, and creativity scales, all three together, then we might have multivariate outliers. A multivariate outlier is an outlier on every subscale. Multivariate outliers occur when someone is answering survey questions facetiously or only using the extreme ends of the responses. We identify multivariate outliers using a Mahalanobis test, and this can be done in SPSS.
Think of multivariate outliers like this: if you have a friend who is funny, quirky, a little crazy in one way, well then you have an eccentric, idiosyncratic friend who is probably delightfully odd. Keep that friend. But if you have a friend who's crazy in a lot of ways, crazy in every way, then you need a new friend. We keep univariate outliers who are odd in only one way, but we remove multivariate outliers who are odd in all kinds of ways. They tend to mess up the analysis.
So let's look at case 157. It has a value of 66. Correct that 66 to a 6, then run the boxplot again in the Chart Builder by simply clicking OK. You do not need to make any changes in the dialog boxes because your original boxplot settings will be saved. Now these boxplots look much better. We still have some extreme values, but they're within the range of 1 to 7, so they look like legitimate outliers. Notice how they're indicated by circles, not by asterisks. We're going to leave them in the data set.
So while we're at it, let's rerun the bar chart for repeated measures that got us started looking for these outliers. Let's go to Graphs > Chart Builder. We're gonna click Reset. In the gallery we'll choose Bar, and we'll drag the simple bar chart into the canvas. Under Variables, click on round 1. While holding down the Shift key, click on round 3. We're going to drag all three of these into the y-axis.
Accept the Create Summary Group, and let's add some error bars. Click on Display error bars in the Element Properties window, click Apply, and click OK. Here we see what the bar chart looked like originally, with the outliers. The mean for round 1 was about 4.5, and the error bar was broader than for the other two rounds. Well, how does it look now that we have corrected the outliers? Yes, that looks better. Now the mean for round 1 is closer to 4, and the error bars for each of the rounds are similar. Correcting the outliers made our bar charts much more accurate and honest.
We're going to do more analyses with this data set. Because you have corrected errors in the data, be sure to save the data set now to preserve the corrections you made. Use this corrected data set for all further analyses that we do, including in your homework.
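To see why two mistyped scores can move a bar and stretch its error bar, here is a small Python sketch. The ratings are invented for illustration and are not the clickers.sav data. The standard error of the mean is used only as a simple stand-in for whatever the error bars in the chart represent, which may be confidence intervals. The helper name `mean_and_se` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_se(scores):
    """Return (mean, standard error of the mean) for a list of scores."""
    return mean(scores), stdev(scores) / sqrt(len(scores))

# Hypothetical 1-to-7 ratings with two entry errors: 55 and 66.
with_errors = [3, 4, 4, 55, 5, 5, 6, 66, 7, 5]

# Apply the same corrections made in the data set: 55 becomes 5, 66 becomes 6.
corrected = [5 if s == 55 else 6 if s == 66 else s for s in with_errors]

mean_before, se_before = mean_and_se(with_errors)
mean_after, se_after = mean_and_se(corrected)

print(f"with errors: mean = {mean_before:.2f}, SE = {se_before:.2f}")
print(f"corrected:   mean = {mean_after:.2f}, SE = {se_after:.2f}")
```

With only ten made-up scores the shift is far larger than in the video, where two errors are spread across many more cases, but the direction is the same: the errors pull the mean upward and widen the error bar.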
Info
Channel: Research By Design
Views: 61,681
Rating: 4.928287 out of 5
Keywords: Todd Daniel, statistics, flipped classroom, beginners, introduction, Research by Design, how to, SPSS, graphing, chart builder, clickers, Turning Technology, Kristin Paloncy, Kristin Tivener, boxplot, outlier, whisker plot
Id: 5P94PXHEBs8
Length: 12min 4sec (724 seconds)
Published: Thu Aug 18 2016