The best description I've ever heard for boxplots (a.k.a. box-whisker diagrams) is "elegant simplicity." The boxplot serves up a great deal of information about both the center and the spread of the data, allowing us to identify skewness and outliers in a form that is both easy to interpret and easy to compare across distributions. It is the graphical equivalent of the five-number summary, which we will talk about when we cover variability. All of that in one simple graph made of just a few lines. But since you may be new to boxplots, let's start with the basics.

The boxplot is made up of two components: a box and two whiskers. This is the box, and these are the whiskers. The line in the middle of the box is the median. The ends of the box show the upper and lower quartiles, so the length of the box shows the interquartile range, the middle 50% of the scores. The ends of the whiskers show the minimum and the maximum scores. In SPSS, the whiskers are trimmed to identify outliers, which are indicated by a small open circle, and extreme outliers, which are indicated by a star. Each whisker covers the top or bottom 25% of the scores, and from end to end the whiskers show the range.
The boxplot is also an excellent way for us to identify outliers. We are going to create a boxplot in SPSS using the dataset clickers.sav. The boxplot will show the range of the data with the whiskers trimmed, so that we can more easily identify the outliers beyond the whiskers. Let's go to Graphs > Chart Builder and press Reset. In the gallery, we're going to choose Boxplot, select the first one, Simple Boxplot, and drag it up into the canvas. We can right-click (or control-click) in the variable list to choose to display variable names. We're going to move round 1 onto the y-axis and put gender on the x-axis. That's all we need to do, so let's click OK.
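If you're following along outside SPSS, here is a rough pandas/matplotlib equivalent of the chart we just built. It assumes the data have been exported to a CSV file and that the columns are named round1 and gender; those names are assumptions, not something taken from the actual file.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of clickers.sav; the file and column names are assumptions.
df = pd.read_csv("clickers.csv")

# Boxplot of round 1 scores split by gender, roughly what the Chart Builder produces.
df.boxplot(column="round1", by="gender")
plt.suptitle("")                 # drop the automatic "Boxplot grouped by gender" title
plt.title("round1 by gender")
plt.ylabel("round1 score")
plt.show()
```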
The output window opens up, and what do we see? Well, we see these two boxplots and something unusual: two asterisks, or stars, labeled 133 and 157. Those are the outliers. And what do you think those numbers stand for? They are the case numbers. SPSS gives us the case numbers for the outliers so that we can find them more easily. Now, we could scroll through our dataset looking for those cases, but let me show you a trick that will make this even easier, especially when you have a large dataset or a lot of cases to look for. We're going to go to Edit > Go to Case. Remember that the first case was 133, so we type 133 and click Go. Notice how case 133 has now popped to the top, and we see that its score for round 1 is 55. This is an outlier. An outlier is an extreme score on a variable. Some outliers are legitimate, but this one is problematic, because what we have here is a data entry error. Data entry errors are not legitimate outliers, and they should be corrected.
So what do we do with outliers? If it is a legitimate outlier, you may choose to leave it in; as we say, outliers are people too. So if this one person really is 6 foot 6, that is a legitimate height, and we just have to include that really tall person in our dataset. However, if the outlier skews the dataset because it exerts too much leverage on the mean, then you may need to switch to a nonparametric alternative for your analysis.
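To see what "leverage on the mean" means in numbers, here is a tiny made-up illustration: one extreme score drags the mean a long way but barely moves the median.

```python
import numpy as np

scores = np.array([3, 4, 4, 5, 5, 6, 55])      # made-up scores with one extreme value

print(np.mean(scores))                          # about 11.7, dragged upward by the 55
print(np.median(scores))                        # 5.0, barely affected
print(np.mean(scores[scores <= 7]))             # 4.5, the mean once the 55 is excluded
```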
Sometimes, as is the case here, we have a data entry error. We know that the range of possible values was 1 to 7, so 55 was not a legitimate value; it was probably supposed to be a 5. Correct data entry errors whenever you can determine what the accurate value should be.

In other cases, you have legitimate outliers, but they are non-representative. Say we're examining the salaries on a professional sports team where most of the players are making $75,000-$100,000, but you have two superstar players who make three million dollars. You may choose to Winsorize the data. To "Winsorize" is to trim the outliers to match the highest (or lowest) representative value. So if the next-highest player makes $135,000, the two superstar salaries are Winsorized, or trimmed, to $135,000 for the analysis.
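Here is a small Python sketch of that idea, using the toy salary figures from the example above; it just clips the extreme values rather than running a full Winsorizing routine.

```python
import pandas as pd

# Toy salaries from the example; the two superstars are non-representative.
salaries = pd.Series([75_000, 82_000, 90_000, 100_000, 135_000,
                      3_000_000, 3_000_000])

# Winsorize by pulling the extreme values in to the highest representative salary.
winsorized = salaries.clip(upper=135_000)
print(winsorized.tolist())
```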
Now, as a last resort, you can remove the outlier from your dataset. This is more commonly done when the outlier is a multivariate outlier; in other words, the case has extreme scores on several variables, not just one. Regardless of which technique you decide on, you should always include details about your data cleaning in your write-up, and be honest and transparent about how you cleaned your data.
So what do we do with this first outlier? We know that the range was 1 to 7, so the 55 is not a legitimate value and probably was supposed to be a 5. We can't tell for sure, but that's a fair guess, so we're going to change our 55 to a 5.
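If you're working in Python instead of SPSS, a minimal pandas sketch of the same check and correction looks like this. The CSV file name and the round1 column name are assumptions about how the data were exported.

```python
import pandas as pd

# Hypothetical CSV export of clickers.sav; file and column names are assumptions.
df = pd.read_csv("clickers.csv")

# List any cases whose round 1 score falls outside the valid 1-7 range.
print(df[(df["round1"] < 1) | (df["round1"] > 7)])

# SPSS case numbers are 1-based, so case 133 is row 132 here if the row order matches.
print(df.iloc[132])

# Correct the known data entry error.
df.loc[df["round1"] == 55, "round1"] = 5
```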
Now let's talk briefly about univariate versus multivariate outliers. Univariate means one variable; multivariate means multiple variables that are being considered together. If you use a Likert scale that runs from a minimum of 1 to a maximum of 7, you will not have outliers on a single item: the most extreme answers anyone can give are a 1 or a 7, and both are within the range of the data. But when you combine multiple items into a single subscale, say five similar questions about a personality trait like neuroticism, then a person who answered 7 on every question could be a univariate outlier. We measure variables with multiple items, and combining those items is what creates the variable; outliers on that one variable are univariate outliers. When we analyze multiple subscales together, such as work satisfaction, intention to quit, and creativity, then we might have multivariate outliers. A multivariate outlier is a case that is extreme across the subscales considered together. Multivariate outliers occur, for example, when someone is answering survey questions facetiously or only using the extreme ends of the response scale. We identify multivariate outliers using a Mahalanobis distance test, and this can be done in SPSS.
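Outside SPSS, the same idea can be sketched in Python with simulated data (this is not the clickers dataset). The code computes each case's squared Mahalanobis distance from the centroid of three subscales and flags cases beyond a chi-square cutoff at p < .001, which is one common rule of thumb.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulate 200 ordinary respondents on three subscales (toy data), plus one
# facetious respondent who answers at the extreme end of everything.
ordinary = rng.normal(loc=[4.0, 2.5, 4.5], scale=0.6, size=(200, 3))
facetious = np.array([[7.0, 7.0, 7.0]])
X = np.vstack([ordinary, facetious])

# Squared Mahalanobis distance of each case from the centroid of all cases.
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Common rule of thumb: flag cases beyond the chi-square critical value at
# p < .001, with degrees of freedom equal to the number of variables.
cutoff = chi2.ppf(0.999, df=X.shape[1])
print("Flagged cases:", np.where(d2 > cutoff)[0])   # the last case should be flagged
```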
Think of multivariate outliers like this: if you have a friend who is funny, quirky, a little crazy in one way, then you have an eccentric, idiosyncratic friend who is probably delightfully odd. Keep that friend. But if you have a friend who's crazy in a lot of ways, crazy in every way, then you need a new friend. We keep univariate outliers, who are odd in only one way, but we remove multivariate outliers, who are odd in all kinds of ways; they tend to mess up the analysis.

So let's look at case 157. It has a value of 66. Correct that 66 to a 6, then run the boxplot again in the Chart Builder by simply clicking OK. You do not need to make any changes in the dialog boxes, because your original boxplot settings are saved. Now these boxplots look much better. We still have some extreme values, but they're within the range of 1 to 7, so they look like legitimate outliers. Notice how they're indicated by circles, not by asterisks.
We're going to leave them in the dataset. While we're at it, let's rerun the bar chart for repeated measures that got us started looking for these outliers. Go to Graphs > Chart Builder and click Reset. In the gallery, choose Bar and drag the Simple Bar chart into the canvas. Under Variables, click on round 1, then hold down the Shift key and click on round 3 so all three rounds are selected. Drag all three onto the y-axis and accept the Create Summary Group prompt. Now let's add some error bars: in the Element Properties window, check Display error bars, click Apply, and then click OK.
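Here is a rough matplotlib equivalent of that bar chart, again assuming a CSV export with columns named round1 through round3 (assumed names). It uses the standard error of the mean for the error bars; the error bars SPSS draws may be confidence intervals instead, depending on your settings.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of the corrected data; file and column names are assumptions.
df = pd.read_csv("clickers_corrected.csv")
rounds = ["round1", "round2", "round3"]

means = df[rounds].mean()
sems = df[rounds].sem()          # standard error of the mean, used here for the error bars

plt.bar(rounds, means, yerr=sems, capsize=5)
plt.ylabel("Mean score")
plt.title("Mean score per round")
plt.show()
```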
Here we see what the bar chart looked like originally, with the outliers: the mean for round 1 was about 4.5, and its error bar was broader than those for the other two rounds. How does it look now that we have removed the outliers? Yes, that looks better. The mean for round 1 is now closer to 4, and the error bars for the three rounds are similar. Removing the outliers made our bar chart much more accurate and honest.

We're going to do more analyses with this dataset. Because you have corrected errors in the data, be sure to save the dataset now to preserve the corrections you made. Use this corrected dataset for all further analyses we do, including in your homework.
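If you're following along in Python rather than SPSS, one way to preserve the corrections is to write out a corrected copy of the file (the file names here are placeholders).

```python
import pandas as pd

# Reload the raw export, apply both corrections, and save a corrected copy.
df = pd.read_csv("clickers.csv")
df.loc[df["round1"] == 55, "round1"] = 5
df.loc[df["round1"] == 66, "round1"] = 6
df.to_csv("clickers_corrected.csv", index=False)
```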