Good morning or good afternoon to everyone. I'm Michelle Dunn, in the Office of the Associate
Director for Data Science at NIH. It gives me great pleasure today to introduce
Dr. Brian Caffo, who is a Professor of Biostatistics at the Johns Hopkins School of Public Health,
who is my neighbor here in Maryland, down the street. He is world famous as the co-creator and co-director
of the Coursera Data Science Specialization. And if I remember correctly, his group was
even the one that suggested the idea of a specialization to Coursera. He is also the co-creator and co-director
of the SMART Group. SMART stands for Statistical Methods and Applications
for Research in Technology. And this group focuses on statistical methods
for biological signals. So Brian's research is in statistical computing
and generalized linear mixed models, with applications to neuroimaging, functional magnetic
resonance imaging, and image processing in general. He is well entrenched in the analysis of big
data, leading teams to work in prediction competitions. A couple of the notable ones that his team
has done well in are the ADHD 200 prediction competition, which they won, and the Heritage
Health prediction competition, which they got 12th place in. So congratulations on those. Dr. Caffo has been given numerous awards,
and I list a couple of them here. The first one is the PECASE award, which is
the Presidential Early Career Award for Scientists and Engineers. This is probably the most prestigious award
that can be given to an early career researcher. He's also won awards for teaching and for
mentoring. So with that, I will turn it over to Dr. Brian
Caffo. Brian, thank you very much for being with
us today. Thanks, Michelle. I hope everyone can hear me. Thank you for inviting me. And thanks, Crystal, for doing all the organization. So my topic today is exploratory data analysis. And I'm going to talk about this more like
an instructor of exploratory data analysis, because it's an area that I use a lot, but
it's certainly not an area that I do a lot of active research in. I do have one notable exception. My former PhD student, named Bruce Swihart,
who's a really brilliant guy, he works at the NIH now. He wrote a paper that was really about contributing
the idea of using heat maps to visualize longitudinal data. And it has probably the best title of any
paper I've ever written, for certain: Lasagna Plots, a Saucy Alternative to Spaghetti
Plots. So if you want to see a nice little paper
on visualization, check out Bruce's paper. But the fundamental idea-- I'll just cover
this really quickly-- is that with spaghetti plots, like in this-- hopefully everyone can
see my cursor. If you can't, maybe someone will let me know. Like, this big, black mass over here is
just a lot of overplotting from spaghetti plots. In cases like this, you might want to do something
like put it into a heat map where you could actually see trends. Like, here, we were investigating sleep disorder
breathing and some EEG signals. And you see some interesting missing data
patterns. You also see just generally that the sleep
disorder breathing subjects have more red than the non-sleep disorder breathing subjects. Anyway, it's a nice little paper; it's on PubMed.
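To make the contrast concrete, here is a minimal R sketch of a spaghetti plot versus a lasagna-style heat map; the simulated data below are hypothetical, not the actual sleep study data.

```r
# Spaghetti plot vs. lasagna (heat map) view of simulated longitudinal data.
set.seed(1)
n_subj <- 40; n_time <- 30
y <- matrix(rnorm(n_subj * n_time,
                  mean = rep(sin(seq(0, pi, length.out = n_time)), each = n_subj)),
            nrow = n_subj, ncol = n_time)

# Spaghetti plot: one line per subject; with many subjects this overplots badly.
matplot(t(y), type = "l", lty = 1, col = rgb(0, 0, 0, 0.3),
        xlab = "Time", ylab = "Response")

# Lasagna plot: subjects as rows, time as columns, value mapped to color.
# Sorting the rows (here by subject mean) often makes patterns easier to see.
ord <- order(rowMeans(y))
image(x = 1:n_time, y = 1:n_subj, z = t(y[ord, ]),
      col = heat.colors(64), xlab = "Time", ylab = "Subject (sorted by mean)")
```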
Mostly what I'm going to talk about today is principles of exploratory data analysis. If you want a lot of practical information
about exploratory data analysis, in other words, if you want to know, how do I do exploratory
data analysis in R specifically, or Python specifically, or SAS specifically, or something
like that. Certainly the best resource for R, I would
say, is Roger Peng's Exploratory Data Analysis with R book. It's just all about how do I do it, not why
do I do it, or what are some of the guiding principles of how you do it, which is what
I'll talk about today. He also has a Coursera course. I should make a small comment. It was actually Coursera that came up with
the idea of a specialization. But we jumped on it rather quickly, to our
credit. And then, of course, the Bible of exploratory
data analysis is John Tukey's book, if you can find a copy of that. But this book by Roger Peng, and his course,
are very good, practical how-to type books. And today, what I'm going to cover are things
that really apply no matter how you're doing exploratory data analysis, whether
you're doing it in Python, or R, or whatever. So let me just start by saying what exploratory
data analysis is. And if I'm going to define what exploratory
data analysis is, I should define what statistical activities are not exploratory data analysis. So the traditional dichotomy that people talk
about is to differentiate between Exploratory Data Analysis, EDA, and Confirmatory Data
Analysis, CDA. So exploratory data analysis focuses on discovery
and hypothesis generation, whereas confirmatory data analysis tends to focus on hypothesis
confirmation. Exploratory data analysis still controls error
rates and performs uncertainty quantification, but it tends to do that much more loosely. Confirmatory data analysis tends to focus
on formal inference and prediction techniques. Both of them can use the same methods. The difference is exploratory data analysis
tends to be more freeform and less structured, and I liken it to improvisational jazz. And then CDA is a little bit more prescriptive,
protocolized and planned, and so I have a picture of a symphony down there. So I think that differentiates the idea behind
the two techniques. I would say in general that data analysis
falls on a spectrum between the highly prescriptive and formal
confirmatory data analysis sort of ideal, and that's probably most realized in the area
of highly regulated clinical trials, which tend to be highly protocolized and regularized, especially in drug development. And then there are some areas, like high-throughput
technological measurement areas, like genomics tends to fall in this category, where things
are a little bit more exploratory in nature. Typically, though, most
analyses fall in some gray zone between the two. So I actually often think the EDA versus CDA
division is more useful conceptually than practically. And then I would say maybe an alternative
dichotomy, if you really have to have a dichotomy, is to think about whether or not you're using
strongly phrased scientific hypotheses to drive your research versus doing purely empirical
studies. I think that's a more useful division. But I actually put some papers down here that
discuss some of these issues. For example, this paper by Kell and Oliver
that says, here's the evidence, now what's the hypothesis? And it talks about the roles of inductive
versus hypothesis-driven science in genomics. And then this great blog post by Simply Statistics,
where it says the keyword in data science is not data, it's science, thereby obviously promoting this kind of
strong, hypothesis-driven research agenda. And then again, I haven't really defined EDA
or really gone into much about the specifics of EDA, but I want to raise now one of the biggest
points that's generally brought up about EDA versus CDA, which is the idea of doing a lot
of exploratory analysis and then presenting only the final step, as if that was the only
step that you did. And that particular process is called p-hacking
or data fishing, or fishing expeditions, or that sort of thing. And it generally falls under the rubric of
the more you use your data for hypothesis generation and exploration, the harder it
gets to really control the error rates on the same data for some sort of final analysis. And I would say there's been a lot of research
in the past couple of years on the extent and consequences of this problem. So probably most notable was Ioannidis' very
famous article that said, why most published research findings are false. Then I found this other great article on the
extent and consequences of p-hacking in science, which I thought was also very interesting. And then Jeff Leek and Leah Jager, who are
here in the department with me, both really great researchers. Really tried to put some numbers on this idea
of the rate of false positive findings in science. And they gave an estimate of the false discovery
rate in the top medical literature. So I would suggest you read all this literature,
and maybe even the broader literature in this area, if you're interested in this problem
of using exploratory data analysis too much, to the point where it biases your confirmatory results. But that is the big warning: a big
no-no is to do a lot of exploratory data analysis and then present the final step as if it was
confirmatory. So now let me actually get to the topic of
exploratory data analysis. So I'm going to give-- this is a shout out
to Jeff Leek, who teaches exploratory data analysis in his class. I borrowed a lot of slides from him. He had this great breakdown of what he thinks
of as the steps in an exploratory data analysis, and I think it's a really nice summary. And these are reading in the data. Figuring out the data. After you read it in, you need to figure out
what the columns are, that sort of thing. Pre-processing it. Looking at dimensions and making sure they
measure up, especially if you have more than one dataset that you're trying to merge together. You have to look at its values and make sure there are no weird values. You need to make tables and hunt for messed-up values. Figure out how missing data is coded (NA stands for missing data in R). How much missing data is there? Is it coded consistently? And then you want to do a lot of plots. Plots are the cornerstone of exploratory data
analysis. And then the final step says don't fool yourself. So today-- and I'll talk more about that--
but today, really, these are the things I'm going to focus on. And there's a lot of content in exploratory
data analysis, and a lot of content in these areas I'm not going to cover. But just in a one-hour talk, I certainly can't
get them all. So I'm going to spend a little bit of time
on pre-processing, a little bit of time on not fooling yourself, and a lot of time on
some basic guidance on creating plots. So let's talk about pre-processing. So we recently had a really wonderful talk
from Jenny Bryan, who is a very notable data scientist who works at UBC, but then also
just got hired by RStudio to be part of their educational and development team. And there aren't a lot of great principles in pre-processing. It's almost like an
artistic endeavor, where you do it differently for every problem. And I think one of the things, if I'm thinking
broadly about what her basic research agenda is, is to really try and synthesize this process
into some general principles. And one general principle, she said, which
I think is a very useful one, is to try and get everything into a rectangle. And she actually had this picture of a rectangle
with angel wings on it. And let me just expand on this a little bit. She really emphasized this idea that data
wrangling is work. So if you spend a day or two days getting
your data into a nice format, don't think of that as, well, I haven't even started working
yet. The reality is, you've done a lot of work,
and some of the most important work. It's like the most exciting part of building
a house is seeing it all come together and getting a lot of the nice finishing touches. But without laying the concrete for a firm
foundation, the rest of the house is irrelevant. And so
no one likes laying the foundation, but everyone thinks it's probably one of the most valuable
parts of building a house. And I put that into this quote that would
say, no one has ever said, I really regret getting my data into such a well-organized and thought-out format. So some basic things you want to remember
is to try and save your steps while you're doing it. Use a version control system, like Git, try
and engage in reproducible research as you're doing this. And then her point was also try and get your
data into a rectangle. Name your columns with a sensible naming convention. Use names that are amenable to software packages. So don't put spaces and quotation marks and
weird symbols and other things that aren't useful for coding, in your names. Try and use capitalization and spacing in
your column names like you're a programmer. No special features. If you're using a spreadsheet, don't put embedded
graphs in your basic data format. Just try and split that out. Split that process out. Even if you're using Excel to perform your
data analysis, try and create copies of the dataset where you record how you created the
copy, and create graphs in different sheets. Don't use numbers for missing values is one
that comes up for me a lot, where people code missing values as 888, and then it gets read
into a software package that doesn't recognize that and treats it as a number. It messes up the analysis. Usually you can detect that very quickly, but if you don't, it messes everything up.
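As a minimal, hypothetical sketch of that kind of cleanup in R (the file name and column names here are made up):

```r
library(dplyr)

dat <- read.csv("study_data.csv", stringsAsFactors = FALSE)

dat_clean <- dat %>%
  # use consistent, code-friendly column names (no spaces or odd symbols)
  rename(age_years = Age..years.) %>%
  # recode numeric missing-value codes like 888/999 to real NAs
  mutate(age_years = ifelse(age_years %in% c(888, 999), NA, age_years))

# quick checks: how much is missing, and do any weird values remain?
summary(dat_clean$age_years)
table(is.na(dat_clean$age_years))
```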
OK. And there are some tools-- again, I'm not going to spend a lot of this talk talking about tools. But in R, if you like to use R, there is a
grammar of data wrangling that's coming about. It's still under development, but tools like
dplyr and tidyr and purrr and stringr are really making it a lot easier, in R specifically. Of course, you can do data wrangling in any
of the major analysis programs. Now let me talk about not fooling yourself. And the first thing I'm going to do when I
talk about not fooling yourself is give a couple of parables that go along with this. So the first is the famous one of the elephant
and the blind men. This picture was off of Wikipedia. In the story, six blind men are investigating
an elephant, and one of them touches its side and says, it's a wall! And another one touches the trunk and says,
it's a snake! And another-- you get the picture. They all got a very incomplete picture of
the elephant. And this parable is used to illustrate several
points. But germane to our discussion: try not to be like the blind men investigating the elephant, or at least acknowledge that you are like them, in the sense that you only get to work with the data that you
get to see. Another thing that's very common is this kind
of bias that you get by finding something, finding a pattern, and then
pretending like it was what you were searching for to begin with. And this is most often described as shooting
an arrow and painting a bull's eye around where it lands. I actually found online a company called the
Bullseye Painting Company. So I think they don't actually paint bullseyes,
I think that bullseye is just their name. But at any rate, that's an important consideration
when you're doing exploratory data analysis is coming up with a chance finding and then
pretending like it was something you were searching for all along. Then there's this famous Mutt and Jeff cartoon, which I got from Quote Investigator, where a drunk is looking for his quarter
underneath a lamppost, and it says, I'm looking for my quarter that I dropped. And the policeman says, did you drop it here? And the drunk says, no I dropped it two blocks
down the street. And then he says, well, why are you looking
for it here? And the drunk says, because the light is better
here. And the idea is that you're looking at the
data that you have, with the biases that it has, and we want, as little as possible, to be like drunks looking for our change under a lamppost-- looking where there's light rather than looking
where we need to be looking. And then of course the one that comes up the
most, that most people are aware of, is this problem of multiplicity. And this is a great XKCD comic-- it's number 882. In this comic, someone says, jelly beans
cause acne. And another person says, scientists investigate. And the scientists look, and it says, we found
no link between jelly beans and acne. Then the person says,
I hear that it's only a certain color. And then they start investigating all the
colors, and then basically, all these panels are people investigating different colors. And then they find green does have an association,
and then there's a news article that says green jelly beans are linked to acne. Of course, what this is illustrating is that
if you keep looking for things in noise, you'll eventually find them.
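A tiny simulation, just to put a number on the comic's point (this is my own sketch, not from the talk):

```r
# Test 20 "jelly bean colors" with no real effect and see how often at least
# one comparison comes out "significant" at the 0.05 level.
set.seed(42)
n_sims <- 1000
any_significant <- replicate(n_sims, {
  pvals <- replicate(20, t.test(rnorm(30), rnorm(30))$p.value)
  any(pvals < 0.05)
})
mean(any_significant)  # roughly 1 - 0.95^20, i.e. about 64%
```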
So here are some common ways you can fool yourself in exploratory data analysis-- some of the most common ways this can come up-- and I'll relate them to the parables. So one thing is just issues with the data
that you have. Can it even answer the questions that you're
trying to ask? That's the parable of the elephant and the
drunk under the lamppost. Both discuss that point. Another thing is that even if you find a true
thing, those true things may not paint a complete picture. That's, of course, the story of the elephant. The confirmation bias, the idea that you pretend
like you discover things that you were just looking for ahead of time, and you just confirmed
them, but ignore the evidence that refutes it. That's like painting the bullseye. False findings are always a problem, just
where you get chance associations. That's like the bullseye and the multiplicity problem. And then of course the problem of multiplicity
is just repeatedly looking for things until you find something. So these are all the ways that you can fool
yourself, and I think these problems are more pronounced in exploratory data analysis, because
we're not even pretending that we have some sort of strict error rate control that we're
trying to go through. Rarely in exploratory data analysis do you
have a highly protocolized version of what you're doing. So now let me-- so that discusses ways in
which you can fool yourself. Now, the rest of the discussion, I'm going
to spend entirely talking about interocular content, or in other words, plots-- things that hit you right between the eyes. And the idea is that plots are really the
cornerstone of exploratory data analysis. There's something about visual information-- pictures really are often worth a thousand words-- that can speak to us, drive home points, and help us discover things in a way that a table or a number or a written summary somehow lacks. And I'll give a great example. Probably the most famous graph in graph design
and exploratory data analysis history is this plot by Minard about the French invasion of
Russia in 1812. And this is generally thought of as one of
the most information-rich graphs that you can find. It's also just beautiful in and of itself. And this is the original graph, and then you
can see it's the same thing. But here is a version I grabbed off of Wikipedia
that has the English translations on it. And so what this graph depicts is as you go
from left to right here, this is showing the French troops as they marched into Russia. Starting here, and then ending up in Moscow. Down here, you actually see the temperature. OK? And the path actually does quite closely mirror
the geographic path that the troops took. Right? So this really does look like the path they
were taking. The width of the graph represents the number
of troops at that particular time. And these little breakout pieces are groups that broke off, that tried to go a different direction. And then the black line coming back this way
is the retreating troops. So this group, for example, broke off, and
then they retreated, and this is this retreating group, and this is this retreating group. OK? And what was interesting historically, what
happened was the French had a very strong force. They started out with 422,000 troops. And the Russians had a very harsh strategy. They were going to engage the French troops
and then retreat and then engage them and then retreat. And then as they retreated, they used these
scorched earth protocols, where they would burn all the fields. And as they retreated, the French troops would
have no way to resupply. And so their goal, the way the Russians were
going to win this campaign was just by basically freezing and starving the French troops to
death with a war of attrition as the Russian troops just repeatedly fell back toward Moscow. And you can just see with the width here,
the way this graph displays the death toll, it's just striking. So it started out with 422,000, down to 100,000. You can almost feel the cold and horror as
these troops are trying to cross this river. When they get to 100,000, when they hit Moscow,
and they get to retreat, and then you see down here, back at the beginning, 10,000 troops. So a loss of over 400,000 troops in this march. At any rate, at the same time, you can see the temperature along the bottom axis here, helping show when drops in temperature caused large troop losses. Anyway, a really wonderful graph, and it's just a highlight of how graphs can be used beautifully. This is such a famous graph,
I thought I'd describe it. Here's another famous plot that really helps
us more easily drill down on why plots are inherently useful. So Anscombe was a very famous statistician,
and he created a data set where the mean of the x's, the mean of the y's, the correlation
between the x's and the y's, and hence the r squared value and the slope of the regression
line, the standard deviation of the x's, and the standard deviation of the y's-- these were all the same, matched across these four graphs. And so if you ran these in a regression model
or a correlation, or you did a basic summary, like mean of x, mean of y, standard deviation
x, standard deviation y, you would get the same answer for each of these four graphs. However, obviously, there's very different
stories being told by the four graphs. This one looks like kind of just a regular
noisy regression relationship. This one clearly looks like a noise-free parabolic
curve. This one looks like a noise-free line with
one outlier. This one has almost no variation in the x
variable except one point that's way outside of the data cloud. So in all of these cases, unless you had done
the graph, if you had just done the obvious summaries, you would miss this incredibly
different story from each of these four datasets. And he constructed it exactly to make it as striking as possible.
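Anscombe's quartet actually ships with base R, so it's easy to verify the point yourself:

```r
data(anscombe)

# Near-identical numerical summaries for all four x/y pairs...
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y),
    cor = cor(x, y), slope = unname(coef(lm(y ~ x))[2]))
})

# ...but the plots tell four very different stories.
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(y ~ x))
}
```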
Another great example is Len Stefanski, who's at NC State and likes to trick his students. He creates these settings where you have
a regression model. And if you look, it looks like the variables
just follow a nice regression model. There are some significant p-values. And if you fail to do a residual plot, you
could just run through the analysis, figure out what variables might be necessary, and
see nothing in particular. But if you plot your predicted values versus your residuals for the model he tells people to start with, you actually
get a picture of Bob Dylan in the residuals. And he's written some algorithms that show
how you can create whatever picture that you want in the residuals. So he can tell whether his students did the
plot or not, because they would obviously comment, hey, there was a picture of Bob Dylan
in my residuals.
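The check itself is a one-liner in most packages; a minimal R version, assuming a fitted model object called fit on hypothetical data:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical model and data frame
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # any structure (or hidden Bob Dylan) shows up here
```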
So I hope that illustrates some of the important reasons why we need to do plots. And then I'm basically going to spend the
rest of the lecture talking about ways in which we can improve our plots or some pitfalls
to avoid. OK? So as some general principles, one is by Tufte,
which is, I think, a really great contribution to the plotting literature, which is to maximize
the data-to-ink ratio. So if you take-- these slides
are from Karl Broman. I don't see it noted on here, but I'll add
that. Karl Broman is a faculty member at Wisconsin
who's a real expert in exploratory data analysis. So here you have this great plot, where you
have a response plotted by treatment versus control. It has an incredible data-to-ink ratio, because it displays all of the data with very little ink, and then it has these nice mean plus-or-minus standard deviation intervals for the two groups. An incredible amount of information-- it recreates
the whole dataset in one simple plot. In contrast, you could do this plot as a bar
chart. Two bar charts. And then look at the amount of ink you're
using, effectively to display two numbers; the mean for the treatment, and the mean for
the control. Just the loss of information going from left
to right in these two plots is simply incredible. But also the increase in the amount of just
toner from your cartridge that you need to get this second plot to display two numbers
that you could have just put in a paper or whatever anyway is almost ludicrous. So that's the principle that Tufte is trying
to elucidate, is try to create plots like this one on the left, where you display the
data as much as you can. And when in doubt, try to devote ink to data. I think probably a useful way to put it would be an information-to-ink ratio: you want every bit of ink you're using to
display important information. This gray background isn't adding anything. Most of these lines aren't adding anything
on the right plot. The purple-- this is not a very informative
plot. It's just basically displaying two numbers.
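A rough R illustration of the contrast, with made-up treatment and control data:

```r
set.seed(7)
resp <- c(rnorm(20, mean = 10), rnorm(20, mean = 12))
grp  <- factor(rep(c("Control", "Treatment"), each = 20))

# High data-to-ink: all the data, jittered, with group means overlaid.
stripchart(resp ~ grp, vertical = TRUE, method = "jitter", pch = 16,
           ylab = "Response")
points(1:2, tapply(resp, grp, mean), pch = "-", cex = 4, col = "red")

# Low data-to-ink: a bar chart that effectively displays just two numbers.
barplot(tapply(resp, grp, mean), ylab = "Mean response")
```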
Another general principle is: don't use 3-D. So here we've taken that same plot, and we've basically ruined it even further by adding
3-D. You have this optical illusion of this corner here that doesn't add anything either. And just to really highlight the principle
of why you shouldn't use 3-D, we can basically take this plot and remove
all information by the angle at which we look at it. And this is just to bring it to the point
of ludicrousness, where you're looking straight down at it. And of course now, we don't even display our
two numbers. This is just-- we've removed all information
content from our graph. But that's also showing that as you take this
3-D plot and rotate it a little bit, you're losing information. Now, I work in brain imaging, and very often
we use 3-D because the brain is a three-dimensional object, and we're using it very carefully
to try and display information. But I think you can say a general principle
is that for ordinary plots, don't use 3-D. Another thing is logging. So taking the natural log or a log base 2
or a log base 10 can be crucial if the scale-- if orders of magnitude are important. So here's an example that Karl came up with
in a genomic setting. You can see on the right hand side of this
plot, it's log base 10. And you see this interesting variation in
separation between the groups, with a lot of information. Here's the same data set unlogged, and you
see, basically, it looks like everything is 0. So if it is the case that orders of magnitude
are important, take for example an obvious setting, like astronomical distances. You obviously care more there about orders
of magnitude than anything else. If orders of magnitude are important, then
if you look at this plot in the way on the right, you've lost all the relevant information
by not taking logs. So at any rate, taking logs is often a very
important thing to do. Here's another example Karl came up with,
and this is a mean difference plot on the log scale. So it's the log of the ratio of two gene expression
levels, and then it's the average of the two gene expression levels. And you see all this interesting variation
on this plot. Also, think about what this does to the scatterplot. You might think, well, I could plot log expression level 1 against log expression level 2 in a scatterplot. This basically does the same thing but rotates
it 45 degrees to get rid of the unnecessary blank space. And you could see interesting patterns, such
as the variance of the difference decreasing as the average log expression increases. But you can also see the bulk of the data; it's spread out in a nice way. If, for example, you were to plot this unlogged,
as a scatterplot, first of all, you have all this unnecessary white space above and below
the collection of data. But as Karl points out, 99% of the data is
below this red line. OK? So you might look at this plot and come to
some conclusions, but you're really only looking at 1% of the data.
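A minimal sketch of both ideas with simulated expression-style data spanning several orders of magnitude (not Karl's actual data):

```r
set.seed(3)
expr1 <- rlnorm(500, meanlog = 5, sdlog = 1.5)
expr2 <- expr1 * rlnorm(500, meanlog = 0, sdlog = 0.3)

# Unlogged scatterplot: a few huge values dominate and most points pile up near zero.
plot(expr1, expr2)

# Mean-difference (MA-style) plot on the log2 scale: average versus difference.
a <- (log2(expr1) + log2(expr2)) / 2
m <- log2(expr2) - log2(expr1)
plot(a, m, xlab = "Average log2 expression", ylab = "log2 ratio")
abline(h = 0, col = "red")
```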
OK. So now the last thing I'd like to talk about-- I have maybe 15 minutes left-- is the psychology of plots; there's a science of plots. So let me just talk about, for exploratory
plots, what are some general characteristics? So usually you're making them quickly. You're not finalizing. So I'm not going to talk about design today. There could be an aspect of once you've got
a plot that conveys all the information that you want, how do you get it into a super nice
design for a reader for publication. OK, that's different than what goes on in
exploratory plots. The plots are usually made for you or your
team as you're going through the data. So the goal is for personal understanding. And a large number of plots are made. They're made quickly. Generally, it is worth spending some time
labeling your axes. So if you can not only have your axes labeled,
but also have the units of the measurement on the axis, that is generally better. But also the axis and the tick marks and things
like that are usually cleaned up, because, as we'll see, a really easy way to make plots difficult to understand is to not have spent any time on your axes. And then colors and size are primarily used
for information. Like I said, you're not spending-- in exploratory
data analysis, you're not spending a lot of time on final colors and things like that,
that are used to make it look nice for presentation. You're spending more time on information. And so especially for plotting, there is a
not terribly well developed, but at least a nice history and some great research in
what I would call a theory of EDA. And what I mean by theory of EDA, I don't
actually mean the mathematical theory that underlies some characteristics of plots. That exists, of course, and it's super well-developed. But what I'm talking about is the psychological
part of EDA. How we perceive plots. And it's unfortunately true that we're designed
to find patterns even when there aren't any. And our visual perception is then biased by
this humanness. And so again, the goal in EDA, just like we
discussed earlier, is not to fool yourself. And the real pioneer in this field is Bill
Cleveland, at least in the field of statistics. I think it's a much bigger research area now. But in the field of statistics, he was a real
pioneer of it. So I'm going to talk about some of his early
work. But just to get us in the mood, let me show
you some slides with optical illusions of the kind that could very realistically occur in an exploratory data analysis. So in this optical illusion, these two middle
points are of the same size. But of course, because of the size and distance
of the surrounding points, we perceive them to be different in size, typically thinking
this one is smaller. This is some unintended framing that's happening. So that's one example. This one drives me nuts. But you can load up this image into Photoshop
or whatever and check it yourself. But the A and B squares in this image are actually the same tone of gray, which I find amazing. The illusion from the surrounding colors
is achieved by this shadowing effect. But again, the idea is that even if you're
trying to compare something in a plot by tone, we can actually have instances where our perception
is very much off with respect to tone. And the optical illusions just home in on this and make it as bad as it can possibly be, to highlight this principle. So another important point, just to return
to this point about multiplicity and testing things until you find something that's true,
Hadley Wickham actually formalizes exactly this concept-- there's a link down there-- where you take, for example, a dataset. Let's say this middle one is the real data. He then permutes it and plots some examples of the permuted dataset, so
that you can visually perform a hypothesis test. So he's formalizing this idea of a hypothesis
test done on data. However, this point illustrates to us that
whenever we do a plot, whenever we make a decision and then do another plot and make
a decision and do another plot, these are informal hypothesis tests. These are informal models that we're fitting. And so all of the problems that exist that
we talk about with formal models get brought into this process through these informal hypothesis tests, just in a fuzzy way.
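A rough base-R version of the lineup idea described above (Wickham's actual implementation is more formal, e.g. in the nullabor package; this is just a sketch):

```r
set.seed(11)
x <- rnorm(50)
y <- 0.3 * x + rnorm(50)

real_slot <- sample(1:9, 1)
par(mfrow = c(3, 3))
for (i in 1:9) {
  if (i == real_slot) {
    plot(x, y)          # the real data
  } else {
    plot(x, sample(y))  # y permuted: any apparent pattern is just noise
  }
}
real_slot  # reveal which panel held the real data
```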
OK. Now, let me talk about Cleveland's work, because it's so fundamental. What he starts with in this work, this
paper, where I have the link down here-- this Journal of the American Statistical Association paper is really the foundation of the field and just a fundamental paper in this area. And one thing he brings up is this idea of
the DNA of perceptual tasks. These perceptual units. And he basically says what we're going to
do is try and break down graph characteristics into these perceptual units. So comparing lengths is an example of a
perception task. Comparing angles is an example of a perception
task. Comparing direction and area and volume, curvature,
shading, position on a common scale, position on a nonaligned scale-- these are all perception tasks. And the argument he made is, well, we can isolate perception tasks, test how people do on them, and that
will back inform what kinds of graphs we should be creating. So take this experiment he did, where he looked
at several different types of specific position and length type perception tasks. So in all of these cases, the people were
comparing-- trying to get an idea of the ratio of the lengths of the two dotted bars. And they have type 1, where the bars are right next to each other; another type where they're separated but on the same scale, in the lowest box; this one, where they're separated into histograms; this one, where the top box is separated; and this one, where they're stacked right on top of the others. And then he called these the position and length
experiments, because some of these are varying position, and some of these are varying lengths
comparisons. And when he looks at this, we can see that
there are certain kinds of tasks-- this is the log absolute difference between the actual true ratio and what the people he was testing thought it was-- so he saw that certain types were doing
better than other ones. And you can guess how it would work. On this type 5, the length one, that's this
one, where we're comparing two things that are not next to each other. Right? It's very hard to figure out the relationship
of these two things, because they're not directly comparable. But these two things, type 1, we're doing
quite well at when they're right next to each other and compared on a common
axis. But he could quantify that. And then he looked at several different perceptual
units. As an example, he also looked at comparing ratios of values using things like positions or angles. So here would be something like a pie chart, where we're comparing angles: if you're comparing the values of two things in a pie chart, you're really making that comparison by virtue of the angle of the slice, whereas with a bar chart, you'd be making it with the position of the bars. And what they found, one thing, is that
these angle comparisons were quite terrible. The log absolute difference for the angle
comparisons was quite terrible. Another thing, one of the worst examples,
was if you had to compare the ratio of two slopes. And they looked at different ways of displaying the two slopes. So you want to compare the slope from A to
B to the slope from B to C. You want to estimate that ratio. And of course, if you squash it in, it gets
harder to do. If you display it vertically rather than horizontally,
it gets harder to do. And we're very bad at doing this. But remember, we often make plots where we're
plotting the slopes of two groups. And intrinsically, we're asking the reader
of the plot to perform this calculation. And what this research is showing is that how much you squash the plot, how much you stretch it-- and even this task to begin with-- is actually
quite difficult for people. Another interesting-- and this is a separate
paper-- another interesting idea was that the scale of the plot matters quite a bit. So as an example, they showed people scatter
plots with the same correlation and found people reported dramatically different correlations,
depending on what they varied. So here they might show things that have roughly
similar correlations, but when they zoom out, people who visually estimate the correlation say that it's higher. OK? So again, our perception of correlation is
dependent on the arbitrary scale at which we choose to display the data.
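You can see the effect for yourself with a quick sketch: the same points, the same correlation, just different axis limits.

```r
set.seed(9)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = sqrt(0.75))
cor(x, y)  # about 0.5 in both plots below

par(mfrow = c(1, 2))
plot(x, y, xlim = c(-3, 3),   ylim = c(-3, 3),   main = "Tight scale")
plot(x, y, xlim = c(-10, 10), ylim = c(-10, 10), main = "Zoomed out")
```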
Here's the result of this perception task, right? So of course, if there's zero correlation,
people seem to be doing fine, and if there's a 100% correlation, people seem to be doing
fine. But in these plots, the size of the circle represents the variability at that point. You can see that at correlations around 0.5, 0.6, 0.7 or so, people do the worst. They're the furthest away from what the actual
correlation was. These two other lines-- I happened to reread this paper last night, just to say this-- come from their attempt to figure out which models best described perceived correlation. They think this one is a model of what people are actually visually doing, rather than this one, which is closer to the truth. So they came up with theories of what people might actually be perceiving: a good one that people appear not to be operating with, and a rougher one that doesn't exactly go through the data but appears to more closely correlate with what people are doing. Another great paper was by Jeff Leek, who
does some research in this area using our Coursera classes. And he actually had students try the experiment
of whether or not they could ascertain the significance of a correlation just by looking at a scatter plot. And one interesting thing they found was that
people could not do these kinds of tasks, but you could train them so that they were
able to do it. And they broke it down by different categories
of things that they could change. They broke down the accuracy by the different ways in which they changed how the plot was displayed: the axis scale, whether or not they put a lowess curve in there, smaller versus larger sample sizes, and how each affected accuracy. But an interesting component of this article was, A, that they were able to break down the various components that impacted whether or not people were able to surmise these p-values. But then they also saw this training effect. So some basic summaries. Whenever possible, use common scales. One of the perception results was that when you
mess up the scales, when you have two plots right next to each other, and one of them
is on one scale, and the other one's on another scale, that the comparison-- that translation,
mentally, is quite difficult for people. So when possible, use position comparisons,
things that are on the same scale, just measuring where they are relative to that scale, basically
asking where two things are when they fall on a ruler. Those are the things that people are
the best at. One of the things they were the worst at was
angle comparisons. So they're very hard for people to do. And a consequence of this is it basically
says that people are no good at interpreting pie charts.
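A two-line R comparison, with made-up shares, makes the point:

```r
shares <- c(A = 0.21, B = 0.18, C = 0.22, D = 0.20, E = 0.19)
pie(shares)                              # ranking the categories by eye is hard
dotchart(sort(shares), xlab = "Share")   # the ordering is immediate
```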
Things like we mentioned earlier, such as adding a third dimension, generally don't add much, but they also detract from people's ability to perceive things. And then again, this point that I've raised
up several times during this lecture is do not fool yourself about significance. And I think that I got the slide from Jeff
also. He makes an important point about saying in
either direction. So we've talked a lot about not fooling ourselves
in terms of not making false positives, but you don't want to then go too far the other
way. You can avoid all false positives by saying
everything's junk, nothing is ever significant. And you don't want to head in that direction
either. You want a good compass to guide you. I just want to give some acknowledgements. So I got a lot of these slides from Jeff Leek
and Karl Broman. And then Jenny Bryan, Genevera Allen, I got
some stuff from XKCD, Wikipedia, Len Stefanski, and RStudio. And then I think we-- I hit exactly 12:45,
so I saved some time for questions if there are any. Thank you, Brian. So if there are any questions, please type
them into the question box. Everyone's muted, so they can't actually ask
them, to keep all 200 people from asking at once. But while we're waiting for more to show up
in the question box, I have one to ask you, and that's whether you have an example of
where a lack of preprocessing or lack of looking at your data ahead of time has really led
you astray and really messed up your analysis later on. Well, I think, certainly anyone who teaches
a statistics class can give you an instance of when students have written reports, for
example, where, because they didn't look at the values, they got messed-up results. So most teachers will do something like spike
in some crazy values that will mess up everyone's results and try and teach people that lesson
early. But I think that lesson isn't-- it's very
hard to make that lesson sticky, because I've certainly had many times in my life where
I've gone through a full analysis, thought I'd found something super interesting, started
to write it up to share with my collaborators, and then found after-- maybe on the way to
talk to them, realized, oh, no. The direction of the effect is the exact opposite
of the direction that I thought it was going to be and that science would dictate. And then when I go back, I realize it's always
one of these sort of errors that there was some error in preprocessing, there were some
missing data that were coded as 888 or 999 that I didn't catch, or there was some errant
values, like someone who was 200 pounds being put in as 2,000 pounds, or something like
that. And that also reminds us not only to do plots,
but also to check-- there are diagnostics like DFFITS and DFBETAS and things like that that
really help us, in our models, diagnose these errant values. And it's always worthwhile to do those things.
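A minimal sketch of those diagnostics in R, using a simulated regression with one corrupted value (the 2,000 standing in for a mis-entered 200):

```r
set.seed(5)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
y[100] <- 2000                   # a 200 entered as 2,000

fit <- lm(y ~ x)

plot(fitted(fit), resid(fit))    # the errant point is hard to miss
head(sort(abs(dffits(fit)), decreasing = TRUE))         # DFFITS flags influential points
head(dfbetas(fit)[order(-abs(dfbetas(fit)[, "x"])), ])  # DFBETAS, per coefficient
summary(influence.measures(fit)) # base R's combined influence diagnostics
```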
So I mean, I can say pretty much any time I've ever failed to do the common steps of doing a lot of plots, checking my regression
diagnostics, every time I don't do that, I always wind up with something screwy. And then it just depends on how far I take
it before I have to go back and check it. Well, that's very good. That's a very good lesson for all of us to
learn-- to spend time doing that first. But what about-- do you have a sense of what
percentage of your time you spend doing the exploratory analysis versus a confirmatory
analysis? I think that very much depends on what branch of science you live in. I think I live in the discovery science world. So when I work on fMRI and MRI, much of my analysis is really on this discovery side of science. And in that setting, I think a lot more of
my analysis could be described as exploratory, as more toward the exploratory end. So then I would say, for many of those projects,
I would say, a huge chunk of it is exploratory. 80% or something like that. Now there are other settings I work in where
the science is more mature. Take Alzheimer's disease: because there are so many people working in Alzheimer's disease, we have a better scientific compass to lead
us. And in those settings, we tend to go into
the analysis with more directed hypotheses. And then I spend a lot less time-- we tend
to be able to come up with a nice frozen dataset that really is exactly what we're interested
in. More often, in other cases, it seems like
we start out with a dataset that we're interested in. We find some interesting things and then find
that we didn't process the data in the way that was necessary to answer these new
questions that arose, and then we have to reprocess. And I certainly would think that someone who
worked in clinical trials would spend a little bit less time on data pre-processing. But I don't know. That would be an interesting empirical study,
to actually get numbers about it. Because I'm sure the amount of time we perceive we spend on data pre-processing is different than what it actually is. Yeah, I bet. So we have a couple of questions in the question
box. The first one is about lasagna plots. Coming back to the very beginning. And whether lasagna plots violate your data
to ink ratio. They do have a lot of ink, that is true. But on the other hand, they are displaying the full data set. So both the numerator and the denominator are high
in that case. So I don't know if they have a great data
to ink ratio, but they do have at least-- in our estimation-- a tolerable one. Especially because they tend to take large
datasets, and they tend to-- they basically redisplay the whole dataset. The key with lasagna plots is not so much
the original plot, because that's interesting in and of itself. And if you're lucky, like in the sleep experiment,
where there's some obvious missing data pattern that, for whatever the reason, was common
across subjects, that's great. But probably the key to the lasagna plots
is some sort of sorting or organizing of your rows, where you can try to detect patterns. But anyway, yeah. The point is well taken that yeah, they do
use a lot of ink. OK. And then the next question is about any suggestions
that you might have for software to do EDA. Do you have any preferences on software, or
can you just give some advice about what are some things people might use? Yes. So I use R. And then-- but I've been using
R for a long time. So I use Base R, which is just R's default
plotting stuff. Since then, in R, there's been a revolution
with this package called ggplot. And ggplot, the ggplot stands for grammar
of graphics. And this was a theory that was worked on starting back-- I think it dates back to Cleveland and other work
at Bell Labs and then goes all the way forward to Hadley Wickham's work on it, and several
other people. And then I forget who the actual inventor
of the term and the concepts of the grammar of graphics is. At any rate, this led to a system that is then
operationalized in the R package, ggplot, which is, I think, one of the most popular
R packages. And if you get used to ggplot, people seem
to absolutely love it. I use it-- I find myself using it more and
more. But because I was so used to Base R, it's
been hard for me to switch over. And ggplot, as implemented in R, basically
has two steps. One is you have to work really hard to clean
up your data and get it into a nice format. And that's almost a feature of ggplot in that
it forces you to work really hard on your pre-processing before you start plotting. And then once you get it in a nice format,
then you're off to the races with the plots. And then the actual plotting syntax is quite
good for ggplot. So I would recommend that.
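As a small sketch of that two-step workflow (this tidy-then-plot reshaping of the Anscombe data follows a standard tidyr example):

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Step 1: wrangle the wide anscombe data into one tidy rectangle.
tidy_anscombe <- anscombe %>%
  mutate(obs = row_number()) %>%
  pivot_longer(-obs, names_to = c(".value", "set"), names_pattern = "(.)(.)")

# Step 2: the plotting itself is one short expression.
ggplot(tidy_anscombe, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ set)
```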
But I would say, every statistical data analysis program has completely tolerable graphics capabilities.
tool is as much the problem. Now, if you're talking about creating production-ready
graphics, then I think you can get into the specifics of various platforms. One thing that I didn't mention that is quite
useful is interactive graphics. That's where you create a plot, and you create
sliders or buttons or dialog boxes and things like that so that you-- so that the person
who's looking at the plot, whether it's you, or whether you're giving it to someone else,
can interact with it. And there are increasingly great tools for
that. Plotly is an example, and Shiny in R is another.
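For example, a minimal interactive scatterplot with plotly is a single call (hover, zoom, and pan come for free):

```r
library(plotly)
plot_ly(mtcars, x = ~wt, y = ~mpg, type = "scatter", mode = "markers")
```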
And then, of course, the gold standard for that is D3, which is a JavaScript library. But that requires a pretty heavy investment
of programming knowledge and time to master. But there are some layers, like Plotly, that
have been built on top of D3 and other libraries that make it kind of easy. So at any rate, I guess the answer to the
question, from my perspective, I use R, and I like R. And then if I needed interactive
graphics, I would use Shiny or Plotly. Thanks. And there was a comment in here, not along
the lines of software, but other resources. There was a comment here from Ethan McMann
about a report-- a CIA report that he's seen that is useful in this area. Can you, in addition to your comments on software,
could you give some advice about where someone might go to find out more information about
EDA? I mean, you mentioned some of the leaders
in this area, like Hadley Wickham, and Karl Broman, Jenny Bryan, Bill Cleveland. And you also mentioned Roger Peng's Coursera
course. But other than those, going to the research
of the first group, or Roger's course, do you have any other advice for where to find
more resources? So--
Other than a tolerable book that you could just pick up-- meaning anything that really stands out? So there are two aspects to EDA, and one is
super well-developed, and the other is a little more amorphous. So the first aspect to exploratory data analysis
is the plotting component to it. And there is, I think, a lot of great materials
on plotting. So on design, there is Tufte's books. On the perception, there's all this Cleveland
work. On the implementation, there's lots of books
on specific plotting implementation. There was a great book written by Nathan Yau,
who does FlowingData. Look at his work on graphs. I'm trying to think. Roger Peng's course has quite a bit of stuff
on graphing. Cleveland actually has a book on plotting
as well. Kind of a Tufte style book on plotting. There is a-- boy, I'm forgetting. I'm blanking on some names. But there's lots of books on plotting. And then there's the other aspect of exploratory
data analysis, which I didn't get too far into, which is you could use models, clustering,
a huge chunk of tools-- what I would just call modeling tools. But in general, statistical tools can also
be part of exploratory data analysis. Regression is a key part of exploratory data
analysis. Right? You fit a model, throw in a confounder,
check things. And I would say that's less well-developed
as a field. So there, you have to go to books on machine
learning or clustering to look at unsupervised methods. Any regression class will have lectures on
how to use regression, but they won't differentiate the use of regression in EDA versus the use
of regression in a final confirmatory-type analysis. The standard for that is, of course, Tukey's
book. For that style of analysis, it's wonderful in that regard. So that would be the starting point, and then
I'll think about that. Maybe on the slides, which are Google
Docs, I'll add some references if I can think of some more. OK. And I think you just answered the last question,
which was commenting on whether some other techniques, like dimension reduction and collective
variable search would be considered EDA or could be categorized as EDA. And that sounds like what you just answered. Yes. Absolutely. Absolutely, I would consider that EDA. Anything-- EDA, I guess one way to define
EDA is anything that's not confirmatory in data analysis. But I think when I look at my definition,
it's a little bit-- it's not very strict for what constitutes EDA. And I think I define it more along the lines
of if you're really doing hypothesis generation, and you're really doing a more free-flowing
style of analysis, where you're looking for trends, and you're poking and you're prodding,
and you're developing your hypotheses on the fly. Anytime you do that, no matter what techniques
you're using, I think you're doing EDA. OK. Great. And with that, since we are out of time, I
want to thank you again for this great lecture and for coming here to help us understand
this better. Thank you, Brian. No problem. Thank you. All right. Bye-bye. Bye.