Good morning or good afternoon to everyone. I'm Michelle Dunn, in the Office of the Associate
Director for Data Science at NIH. It gives me great pleasure today to introduce
Dr. Brian Caffo, who is a Professor of Biostatistics at the Johns Hopkins School of Public Health,
who is my neighbor here in Maryland, down the street. He is world famous as the co-creator and co-director
of the Coursera Data Science Specialization. And if I remember correctly, his group was
even the one that suggested the idea of a specialization to Coursera. He is also the co-creator and co-director
of the SMART Group. SMART stands for Statistical Methods and Applications
for Research in Technology. And this group focuses on statistical methods
for biological signals. So Brian's research is in statistical computing
and generalized linear mixed models, with applications to neuroimaging, functional magnetic
resonance imaging, and image processing in general. He is well entrenched in the analysis of big
data, leading teams to work in prediction competitions. A couple of the notable ones that his team
has done well in are the ADHD 200 prediction competition, which they won, and the Heritage
Health prediction competition, which they got 12th place in. So congratulations on those. Dr. Caffo has been given numerous awards,
and I list a couple of them here. The first one is the PECASE award, which is
the Presidential Early Career Award for Scientists and Engineers. This is probably the most prestigious award
that can be given to an early career researcher. He's also won awards for teaching and for
mentoring. So with that, I will turn it over to Dr. Brian
Caffo. Brian, thank you very much for being with
us today. Thanks, Michelle. I hope everyone can hear me. Thank you for inviting me. And thanks, Crystal, for doing all the organization. So my topic today is exploratory data analysis. And I'm going to talk about this more like
an instructor of exploratory data analysis, because it's an area that I use a lot, but
it's certainly not an area that I do a lot of active research in. I do have one notable exception. My former PhD student, named Bruce Swihart,
who's a really brilliant guy, he works at the NIH now. He wrote a paper that was really about contributing
the idea of using heat maps to visualize longitudinal data. And it has probably the best title of any
paper I've ever written, for certain: Lasagna Plots, a Saucy Alternative to Spaghetti
Plots. So if you want to see a nice little paper
on visualization, check out Bruce's paper. But the fundamental idea-- I'll just cover
this really quickly-- is that with spaghetti plots, like in this-- hopefully everyone can
see my cursor. If you can't, maybe someone will let me know. Like, this big, black mass over here is
just a lot of overplotting from spaghetti plots. In cases like this, you might want to do something
like put it into a heat map where you could actually see trends. Like, here, we were investigating sleep disorder
breathing and some EEG signals. And you see some interesting missing data
patterns. You also see just generally that the sleep
disorder breathing subjects have more red than the non-sleep disorder breathing subjects. Anyway, it's a nice little paper; it's on PubMed.
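To make the contrast concrete, here is a minimal R sketch of a spaghetti plot versus a lasagna-style heat map; the simulated data below are hypothetical, not the actual sleep study data.

```r
# Spaghetti plot vs. lasagna (heat map) view of simulated longitudinal data.
set.seed(1)
n_subj <- 40; n_time <- 30
y <- matrix(rnorm(n_subj * n_time,
                  mean = rep(sin(seq(0, pi, length.out = n_time)), each = n_subj)),
            nrow = n_subj, ncol = n_time)

# Spaghetti plot: one line per subject; with many subjects this overplots badly.
matplot(t(y), type = "l", lty = 1, col = rgb(0, 0, 0, 0.3),
        xlab = "Time", ylab = "Response")

# Lasagna plot: subjects as rows, time as columns, value mapped to color.
# Sorting the rows (here by subject mean) often makes patterns easier to see.
ord <- order(rowMeans(y))
image(x = 1:n_time, y = 1:n_subj, z = t(y[ord, ]),
      col = heat.colors(64), xlab = "Time", ylab = "Subject (sorted by mean)")
```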
Mostly what I'm going to talk about today is principles of exploratory data analysis. If you want a lot of practical information
about exploratory data analysis, in other words, if you want to know, how do I do exploratory
data analysis in R specifically, or Python specifically, or SAS specifically, or something
like that. Certainly the best resource for R, I would
say, is Roger Peng's Exploratory Data Analysis with R book. It's just all about how do I do it, not why
do I do it, or what are some of the guiding principles of how you do it, which is what
I'll talk about today. He also has a Coursera course. I should make a small comment. It was actually Coursera that came up with
the idea of a specialization. But we jumped on it rather quickly, to our
credit. And then, of course, the Bible of exploratory
data analysis is John Tukey's book, if you can find a copy of that. But this book by Roger Peng, and his course,
are very good, practical how-to type books. And today, what I'm going to cover are things
that really apply no matter how you're doing exploratory data analysis, whether
you're doing it in Python, or R, or whatever. So let me just start by saying what exploratory
data analysis is. And if I'm going to define what exploratory
data analysis is, I should define what statistical activities are not exploratory data analysis. So the traditional dichotomy that people talk
about is to differentiate between Exploratory Data Analysis, EDA, and Confirmatory Data
Analysis, CDA. So exploratory data analysis focuses on discovery
and hypothesis generation, whereas confirmatory data analysis tends to focus on hypothesis
confirmation. Exploratory data analysis still controls error
rates and performs uncertainty quantification, but it tends to do that much more loosely. Confirmatory data analysis tends to focus
on formal inference and prediction techniques. Both of them can use the same methods. The difference is exploratory data analysis
tends to be more freeform and less structured, and I liken it to improvisational jazz. And then CDA is a little bit more prescriptive,
protocolized and planned, and so I have a picture of a symphony down there. So I think that differentiates the idea behind
the two techniques. I would say in general that data analysis
falls on a spectrum between the highly prescriptive and formal
confirmatory data analysis sort of ideal, and that's probably most realized in the area
of highly regulated clinical trials, which tend to be highly protocolized and regularized, especially in drug development. And then there are some areas, like high-throughput
technological measurement areas, like genomics tends to fall in this category, where things
are a little bit more exploratory in nature. Typically, though, most
analyses fall in some gray zone between the two. So I actually often think the EDA versus CDA
division is more useful conceptually than practically. And then I would say maybe an alternative
dichotomy, if you really have to have a dichotomy, is to think about whether or not you're using
strongly phrased scientific hypotheses to drive your research versus doing purely empirical
studies. I think that's a more useful division. But I actually put some papers down here that
discuss some of these issues. For example, this paper by Kell and Oliver
that says, here's the evidence, now what's the hypothesis? And it talks about the roles of inductive
versus hypothesis-driven science in genomics. And then this great blog post by Simply Statistics,
where it says the keyword in data science is not data, it's science, thereby obviously promoting this kind of
strong, hypothesis-driven research agenda. And then again, I haven't really defined EDA
or really gone into much about the specifics of EDA, but I want to raise now one of the biggest
points that's generally brought up about EDA versus CDA, which is the idea of doing a lot
of exploratory analysis and then presenting only the final step, as if that was the only
step that you did. And that particular process is called p-hacking
or data fishing, or fishing expeditions, or that sort of thing. And it generally falls under the rubric of
the more you use your data for hypothesis generation and exploration, the harder it
gets to really control the error rates on the same data for some sort of final analysis. And I would say there's been a lot of research
in the past couple of years on the extent and consequences of this problem. So probably most notable was Ioannidis' very
famous article that said, why most published research findings are false. Then I found this other great article on the
extent and consequences of p-hacking in science, which I thought was also very interesting. And then Jeff Leek and Leah Jager, who are
here in the department with me, both really great researchers. Really tried to put some numbers on this idea
of the rate of false positive findings in science. And they gave an estimate of the false discovery
rate in the top medical literature. So I would suggest you read all this literature,
and maybe even the broader literature in this area, if you're interested in this problem
of using exploratory data analysis too much, to the point where it biases your confirmatory results. But that is the big warning: a big
no-no is to do a lot of exploratory data analysis and then present the final step as if it was
confirmatory. So now let me actually get to the topic of
exploratory data analysis. So I'm going to give-- this is a shout out
to Jeff Leek, who teaches exploratory data analysis in his class. I borrowed a lot of slides from him. He had this great breakdown of what he thinks
of as the steps in an exploratory data analysis, and I think it's a really nice summary. And these are reading in the data. Figuring out the data. After you read it in, you need to figure out
what the columns are, that sort of thing. Pre-processing it. Looking at dimensions and making sure they
measure up, especially if you have more than one dataset that you're trying to merge together. You have to look at its values and make sure there are no weird values. You need to make tables and hunt for messed-up values. Figure out how missing data is coded (NA stands for missing data in R). How much missing data is there? Is it coded consistently? And then you want to do a lot of plots. Plots are the cornerstone of exploratory data
analysis. And then the final step says don't fool yourself. So today-- and I'll talk more about that--
but today, really, these are the things I'm going to focus on. And there's a lot of content in exploratory
data analysis, and a lot of content in these areas I'm not going to cover. But just in a one-hour talk, I certainly can't
get them all. So I'm going to spend a little bit of time
on pre-processing, a little bit of time on not fooling yourself, and a lot of time on
some basic guidance on creating plots. So let's talk about pre-processing. So we recently had a really wonderful talk
from Jenny Bryan, who is a very notable data scientist who works at UBC, but then also
just got hired by RStudio to be part of their educational and development team. And there aren't a lot of great principles in pre-processing. It's almost like an
artistic endeavor, where you do it differently for every problem. And I think one of the things, if I'm thinking
broadly about what her basic research agenda is, is to really try and synthesize this process
into some general principles. And one general principle, she said, which
I think is a very useful one, is to try and get everything into a rectangle. And she actually had this picture of a rectangle
with angel wings on it. And let me just expand on this a little bit. She really emphasized this idea that data
wrangling is work. So if you spend a day or two days getting
your data into a nice format, don't think of that as, well, I haven't even started working
yet. The reality is, you've done a lot of work,
and some of the most important work. It's like the most exciting part of building
a house is seeing it all come together and getting a lot of the nice finishing touches. But without laying the concrete for a firm
foundation, the rest of the house is irrelevant. And so
no one likes laying the foundation, but everyone thinks it's probably one of the most valuable
parts of building a house. And I put that into this quote that would
say, no one has ever said, I really regret getting my data into such a well-organized and thought-out format. So some basic things you want to remember
is to try and save your steps while you're doing it. Use a version control system, like Git, try
and engage in reproducible research as you're doing this. And then her point was also try and get your
data into a rectangle. Name your columns with a sensible naming convention. Use names that are amenable to software packages. So don't put spaces and quotation marks and
weird symbols and other things that aren't useful for coding, in your names. Try and use capitalization and spacing in
your column names like you're a programmer. No special features. If you're using a spreadsheet, don't put embedded
graphs in your basic data format. Just try and split that out. Split that process out. Even if you're using Excel to perform your
data analysis, try and create copies of the dataset where you record how you created the
copy, and create graphs in different sheets. Don't use numbers for missing values is one
that comes up for me a lot, where people code missing values as 888, and then it gets read
into a software package that doesn't recognize that and treats it as a number. It messes up the analysis. Usually you can detect that very quickly, but if you don't, it messes everything up.
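As a minimal, hypothetical sketch of that kind of cleanup in R (the file name and column names here are made up):

```r
library(dplyr)

dat <- read.csv("study_data.csv", stringsAsFactors = FALSE)

dat_clean <- dat %>%
  # use consistent, code-friendly column names (no spaces or odd symbols)
  rename(age_years = Age..years.) %>%
  # recode numeric missing-value codes like 888/999 to real NAs
  mutate(age_years = ifelse(age_years %in% c(888, 999), NA, age_years))

# quick checks: how much is missing, and do any weird values remain?
summary(dat_clean$age_years)
table(is.na(dat_clean$age_years))
```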
OK. And there are some tools-- again, I'm not going to spend a lot of this talk talking about tools. But in R, if you like to use R, there is a
grammar of data wrangling that's coming about. It's still under development, but tools like
dplyr and tidyr and purrr and stringr are really making it a lot easier, in R specifically. Of course, you can do data wrangling in any
of the major analysis programs. Now let me talk about not fooling yourself. And the first thing I'm going to do when I
talk about not fooling yourself is give a couple of parables that go along with this. So the first is the famous one of the elephant
and the blind men. This picture was off of Wikipedia. In the story, six blind men are investigating
an elephant, and one of them touches its side and says, it's a wall! And another one touches the trunk and says,
it's a snake! And another-- you get the picture. They all got a very incomplete picture of
the elephant. And this parable is used to illustrate several
points. But germane to our discussion: try not to be like the blind men investigating the elephant, or at least acknowledge that you are like them, in the sense that you only get to work with the data that you
get to see. Another thing that's very common is this kind
of bias that you get by finding something, finding a pattern, and then
pretending like it was what you were searching for to begin with. And this is most often described as shooting
an arrow and painting a bull's eye around where it lands. I actually found online a company called the
Bullseye Painting Company. So I think they don't actually paint bullseyes,
I think that bullseye is just their name. But at any rate, that's an important consideration
when you're doing exploratory data analysis is coming up with a chance finding and then
pretending like it was something you were searching for all along. Then there's this famous Mutt and Jeff cartoon, which I got from Quote Investigator, where a drunk is looking for his quarter
underneath a lamppost, and it says, I'm looking for my quarter that I dropped. And the policeman says, did you drop it here? And the drunk says, no I dropped it two blocks
down the street. And then he says, well, why are you looking
for it here? And the drunk says, because the light is better
here. And the idea is that you're looking at the
data that you have, with the biases that it has, and we want, as little as possible, to be like drunks looking for our change under a lamppost-- looking where there's light rather than looking
where we need to be looking. And then of course the one that comes up the
most, that most people are aware of, is this problem of multiplicity. And this is a great XKCD comic-- it's number 882. In this comic, someone says, jelly beans
cause acne. And another person says, scientists investigate. And the scientists look, and it says, we found
no link between jelly beans and acne. Then the person says,
I hear that it's only a certain color. And then they start investigating all the
colors, and then basically, all these panels are people investigating different colors. And then they find green does have an association,
and then there's a news article that says green jelly beans are linked to acne. Of course, what this is illustrating is that
if you keep looking for things in noise, you'll eventually find them.
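A tiny simulation, just to put a number on the comic's point (this is my own sketch, not from the talk):

```r
# Test 20 "jelly bean colors" with no real effect and see how often at least
# one comparison comes out "significant" at the 0.05 level.
set.seed(42)
n_sims <- 1000
any_significant <- replicate(n_sims, {
  pvals <- replicate(20, t.test(rnorm(30), rnorm(30))$p.value)
  any(pvals < 0.05)
})
mean(any_significant)  # roughly 1 - 0.95^20, i.e. about 64%
```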
So here are some common ways you can fool yourself in exploratory data analysis-- some of the most common ways this can come up-- and I'll relate them to the parables. So one thing is just issues with the data
that you have. Can it even answer the questions that you're
trying to ask? That's the parable of the elephant and the
drunk under the lamppost. Both discuss that point. Another thing is that even if you find a true
thing, those true things may not paint a complete picture. That's, of course, the story of the elephant. The confirmation bias, the idea that you pretend
like you discover things that you were just looking for ahead of time, and you just confirmed
them, but ignore the evidence that refutes it. That's like painting the bullseye. False findings are always a problem, just
where you get chance associations. That's like the bullseye and the multiplicity problem. And then of course the problem of multiplicity
is just repeatedly looking for things until you find something. So these are all the ways that you can fool
yourself, and I think these problems are more pronounced in exploratory data analysis, because
we're not even pretending that we have some sort of strict error rate control that we're
trying to go through. Rarely in exploratory data analysis do you
have a highly protocolized version of what you're doing. So now let me-- so that discusses ways in
which you can fool yourself. Now, the rest of the discussion, I'm going
to spend entirely talking about interocular content, or in other words, plots-- things that hit you right between the eyes. And the idea is that plots are really the
cornerstone of exploratory data analysis. There's something about visual information-- pictures really are often worth a thousand words-- that can speak to us, drive home points, and help us discover things in a way that a table or a number or a written summary somehow lacks. And I'll give a great example. Probably the most famous graph in graph design
and exploratory data analysis history is this plot by Minard about the French invasion of
Russia in 1812. And this is generally thought of as one of
the most information-rich graphs that you can find. It's also just beautiful in and of itself. And this is the original graph, and then you
can see it's the same thing. But here is a version I grabbed off of Wikipedia
that has the English translations on it. And so what this graph depicts is as you go
from left to right here, this is showing the French troops as they marched into Russia. Starting here, and then ending up in Moscow. Down here, you actually see the temperature. OK? And the path actually does quite closely mirror
the geographic path that the troops took. Right? So this really does look like the path they
were taking. The width of the graph represents the number
of troops at that particular time. And these little breakout pieces are groups that broke off, that tried to go a different direction. And then the black line coming back this way
is the retreating troops. So this group, for example, broke off, and
then they retreated, and this is this retreating group, and this is this retreating group. OK? And what was interesting historically, what
happened was the French had a very strong force. They started out with 422,000 troops. And the Russians had a very harsh strategy. They were going to engage the French troops
and then retreat and then engage them and then retreat. And then as they retreated, they used these
scorched earth protocols, where they would burn all the fields. And as they retreated, the French troops would
have no way to resupply. And so their goal, the way the Russians were
going to win this campaign was just by basically freezing and starving the French troops to
death with a war of attrition as the Russian troops just repeatedly fell back toward Moscow. And you can just see with the width here,
the way this graph displays the death toll, it's just striking. So it started out with 422,000, down to 100,000. You can almost feel the cold and horror as
these troops are trying to cross this river. When they get to 100,000, when they hit Moscow,
and they get to retreat, and then you see down here, back at the beginning, 10,000 troops. So a loss of over 400,000 troops in this march. At any rate, at the same time, you can see the temperature along the bottom axis here, helping show when drops in temperature caused large troop losses. Anyway, a really wonderful graph, and it's just a highlight of how graphs can be used beautifully. This is such a famous graph,
I thought I'd describe it. Here's another famous plot that really helps
us more easily drill down on why plots are inherently useful. So Anscombe was a very famous statistician,
and he created a data set where the mean of the x's, the mean of the y's, the correlation
between the x's and the y's, and hence the r squared value and the slope of the regression
line, the standard deviation of the x's, and the standard deviation of the y's-- these were all the same, matched across these four graphs. And so if you ran these in a regression model
or a correlation, or you did a basic summary, like mean of x, mean of y, standard deviation
x, standard deviation y, you would get the same answer for each of these four graphs. However, obviously, there's very different
stories being told by the four graphs. This one looks like kind of just a regular
noisy regression relationship. This one clearly looks like a noise-free parabolic
curve. This one looks like a noise-free line with
one outlier. This one has almost no variation in the x
variable except one point that's way outside of the data cloud. So in all of these cases, unless you had done
the graph, if you had just done the obvious summaries, you would miss this incredibly
different story from each of these four datasets. And he constructed it exactly to make it as striking as possible.
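Anscombe's quartet actually ships with base R, so it's easy to verify the point yourself:

```r
data(anscombe)

# Near-identical numerical summaries for all four x/y pairs...
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y),
    cor = cor(x, y), slope = unname(coef(lm(y ~ x))[2]))
})

# ...but the plots tell four very different stories.
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(y ~ x))
}
```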
Another great example is Len Stefanski, who's at NC State and likes to trick his students. He creates these settings where you have
a regression model. And if you look, it looks like the variables
just follow a nice regression model. There are some significant p-values. And if you fail to do a residual plot, you
could just run through the analysis, figure out what variables might be necessary, and
see nothing in particular. But if you plot your predicted values versus your residuals for the model he tells people to start with, you actually
get a picture of Bob Dylan in the residuals. And he's written some algorithms that show
how you can create whatever picture that you want in the residuals. So he can tell whether his students did the
plot or not, because they would obviously comment, hey, there was a picture of Bob Dylan
in my residuals.
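The check itself is a one-liner in most packages; a minimal R version, assuming a fitted model object called fit on hypothetical data:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical model and data frame
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # any structure (or hidden Bob Dylan) shows up here
```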
So I hope that illustrates some of the important reasons why we need to do plots. And then I'm basically going to spend the
rest of the lecture talking about ways in which we can improve our plots or some pitfalls
to avoid. OK? So as some general principles, one is by Tufte,
which is, I think, a really great contribution to the plotting literature, which is to maximize
the data-to-ink ratio. So if you take-- these slides
are from Karl Broman. I don't see it noted on here, but I'll add
that. Karl Broman is a faculty member at Wisconsin
who's a real expert in exploratory data analysis. So here you have this great plot, where you
have a response plotted by treatment versus control. It has an incredible data-to-ink ratio, because it displays all of the data with very little ink, and then it has these nice mean plus-or-minus standard deviation intervals for the two groups. An incredible amount of information-- it recreates
the whole dataset in one simple plot. In contrast, you could do this plot as a bar
chart. Two bar charts. And then look at the amount of ink you're
using, effectively to display two numbers; the mean for the treatment, and the mean for
the control. Just the loss of information going from left
to right in these two plots is simply incredible. But also the increase in the amount of just
toner from your cartridge that you need to get this second plot to display two numbers
that you could have just put in a paper or whatever anyway is almost ludicrous. So that's the principle that Tufte is trying
to elucidate, is try to create plots like this one on the left, where you display the
data as much as you can. And when in doubt, try to devote ink to data. I think probably a useful way to put it would be an information-to-ink ratio: you want every bit of ink you're using to
display important information. This gray background isn't adding anything. Most of these lines aren't adding anything
on the right plot. The purple-- this is not a very informative
plot. It's just basically displaying two numbers.
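A rough R illustration of the contrast, with made-up treatment and control data:

```r
set.seed(7)
resp <- c(rnorm(20, mean = 10), rnorm(20, mean = 12))
grp  <- factor(rep(c("Control", "Treatment"), each = 20))

# High data-to-ink: all the data, jittered, with group means overlaid.
stripchart(resp ~ grp, vertical = TRUE, method = "jitter", pch = 16,
           ylab = "Response")
points(1:2, tapply(resp, grp, mean), pch = "-", cex = 4, col = "red")

# Low data-to-ink: a bar chart that effectively displays just two numbers.
barplot(tapply(resp, grp, mean), ylab = "Mean response")
```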
Another general principle is: don't use 3-D. So here we've taken that same plot, and we've basically ruined it even further by adding
3-D. You have this optical illusion of this corner here that doesn't add anything either. And just to really highlight the principle
of why you shouldn't use 3-D, we can basically take this plot and remove
all information by the angle at which we look at it. And this is just to bring it to the point
of ludicrousness, where you're looking straight down at it. And of course now, we don't even display our
two numbers. This is just-- we've removed all information
content from our graph. But that's also showing that as you take this
3-D plot and rotate it a little bit, you're losing information. Now, I work in brain imaging, and very often
we use 3-D because the brain is a three-dimensional object, and we're using it very carefully
to try and display information. But I think you can say a general principle
is that for ordinary plots, don't use 3-D. Another thing is logging. So taking the natural log or a log base 2
or a log base 10 can be crucial if the scale-- if orders of magnitude are important. So here's an example that Karl came up with
in a genomic setting. You can see on the right hand side of this
plot, it's log base 10. And you see this interesting variation in
separation between the groups, with a lot of information. Here's the same data set unlogged, and you
see, basically, it looks like everything is 0. So if it is the case that orders of magnitude
are important, take for example an obvious setting, like astronomical distances. You obviously care more there about orders
of magnitude than anything else. If orders of magnitude are important, then
if you look at this plot in the way on the right, you've lost all the relevant information
by not taking logs. So at any rate, taking logs is often a very
important thing to do. Here's another example Karl came up with,
and this is a mean difference plot on the log scale. So it's the log of the ratio of two gene expression
levels, and then it's the average of the two gene expression levels. And you see all this interesting variation
on this plot. Also, think about what this does to the scatterplot. You might think, well, I could plot log expression level 1 against log expression level 2 in a scatterplot. This basically does the same thing but rotates
it 45 degrees to get rid of the unnecessary blank space. And you could see interesting patterns, such
as the variance of the difference decreasing as the average log expression increases. But you can also see the bulk of the data; it's spread out in a nice way. If, for example, you were to plot this unlogged,
as a scatterplot, first of all, you have all this unnecessary white space above and below
the collection of data. But as Karl points out, 99% of the data is
below this red line. OK? So you might look at this plot and come to
some conclusions, but you're really only looking at 1% of the data.
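A minimal sketch of both ideas with simulated expression-style data spanning several orders of magnitude (not Karl's actual data):

```r
set.seed(3)
expr1 <- rlnorm(500, meanlog = 5, sdlog = 1.5)
expr2 <- expr1 * rlnorm(500, meanlog = 0, sdlog = 0.3)

# Unlogged scatterplot: a few huge values dominate and most points pile up near zero.
plot(expr1, expr2)

# Mean-difference (MA-style) plot on the log2 scale: average versus difference.
a <- (log2(expr1) + log2(expr2)) / 2
m <- log2(expr2) - log2(expr1)
plot(a, m, xlab = "Average log2 expression", ylab = "log2 ratio")
abline(h = 0, col = "red")
```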
OK. So now the last thing I'd like to talk about-- I have maybe 15 minutes left-- is the psychology of plots; there's a science of plots. So let me just talk about, for exploratory
plots, what are some general characteristics? So usually you're making them quickly. You're not finalizing. So I'm not going to talk about design today. There could be an aspect of once you've got
a plot that conveys all the information that you want, how do you get it into a super nice
design for a reader for publication. OK, that's different than what goes on in
exploratory plots. The plots are usually made for you or your
team as you're going through the data. So the goal is for personal understanding. And a large number of plots are made. They're made quickly. Generally, it is worth spending some time
labeling your axes. So if you can not only have your axes labeled,
but also have the units of the measurement on the axis, that is generally better. But also the axis and the tick marks and things
like that are usually cleaned up, because, as we'll see, a really easy way to make plots difficult to understand is to not have spent any time on your axes. And then colors and size are primarily used
for information. Like I said, you're not spending-- in exploratory
data analysis, you're not spending a lot of time on final colors and things like that,
that are used to make it look nice for presentation. You're spending more time on information. And so especially for plotting, there is a
not terribly well developed, but at least a nice history and some great research in
what I would call a theory of EDA. And what I mean by theory of EDA, I don't
actually mean the mathematical theory that underlies some characteristics of plots. That exists, of course, and it's super well-developed. But what I'm talking about is the psychological
part of EDA. How we perceive plots. And it's unfortunately true that we're designed
to find patterns even when there aren't any. And our visual perception is then biased by
this humanness. And so again, the goal in EDA, just like we
discussed earlier, is not to fool yourself. And the real pioneer in this field is Bill
Cleveland, at least in the field of statistics. I think it's a much bigger research area now. But in the field of statistics, he was a real
pioneer of it. So I'm going to talk about some of his early
work. But just to get us in the mood, let me show
you some slides with optical illusions of the kind that could very realistically occur in an exploratory data analysis. So in this optical illusion, these two middle
points are of the same size. But of course, because of the size and distance
of the surrounding points, we perceive them to be different in size, typically thinking
this one is smaller. This is some unintended framing that's happening. So that's one example. This one drives me nuts. But you can load up this image into Photoshop
or whatever and check it yourself. But the A and B squares in this image are actually the same tone of gray, which I find amazing. The illusion from the surrounding colors
is achieved by this shadowing effect. But again, the idea is that even if you're
trying to compare something in a plot by tone, we can actually have instances where our perception
is very much off with respect to tone. And the optical illusions just home in on this and make it as bad as it can possibly be, to highlight this principle. So another important point, just to return
to this point about multiplicity and testing things until you find something that's true,
Hadley Wickham actually formalizes exactly this concept-- there's a link down there-- where you take, for example, a dataset. Let's say this middle one is the real data. He then permutes it and plots some examples of the permuted dataset, so
that you can visually perform a hypothesis test. So he's formalizing this idea of a hypothesis
test done on data. However, this point illustrates to us that
whenever we do a plot, whenever we make a decision and then do another plot and make
a decision and do another plot, these are informal hypothesis tests. These are informal models that we're fitting. And so all of the problems that exist that
we talk about with formal models get brought into this process through these informal hypothesis tests, just in a fuzzy way.
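A rough base-R version of the lineup idea described above (Wickham's actual implementation is more formal, e.g. in the nullabor package; this is just a sketch):

```r
set.seed(11)
x <- rnorm(50)
y <- 0.3 * x + rnorm(50)

real_slot <- sample(1:9, 1)
par(mfrow = c(3, 3))
for (i in 1:9) {
  if (i == real_slot) {
    plot(x, y)          # the real data
  } else {
    plot(x, sample(y))  # y permuted: any apparent pattern is just noise
  }
}
real_slot  # reveal which panel held the real data
```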
OK. Now, let me talk about Cleveland's work, because it's so fundamental. What he starts with in this work, this
paper, where I have the link down here-- this Journal of the American Statistical Association paper is really the foundation of the field and just a fundamental paper in this area. And one thing he brings up is this idea of
the DNA of perceptual tasks. These perceptual units. And he basically says what we're going to
do is try and break down graph characteristics into these perceptual units. So comparing lengths is an example of a
perception task. Comparing angles is an example of a perception
task. Comparing direction and area and volume, curvature,
shading, position on a common scale, position on a nonaligned scale-- these are all perception tasks. And the argument he made is, well, we can isolate perception tasks, test how people do on them, and that
will back inform what kinds of graphs we should be creating. So take this experiment he did, where he looked
at several different types of specific position and length type perception tasks. So in all of these cases, the people were
comparing-- trying to get an idea of the ratio of the lengths of the two dotted bars. And they have type 1, where the bars are right next to each other; another type where they're separated but on the same scale, in the lowest box; this one, where they're separated into histograms; this one, where the top box is separated; and this one, where they're stacked right on top of the others. And then he called these the position and length
experiments, because some of these are varying position, and some of these are varying lengths
comparisons. And when he looks at this, we can see that
there are certain kinds of tasks-- this is the log absolute difference between the actual true ratio and what the people he was testing thought it was-- so he saw that certain types were doing
better than other ones. And you can guess how it would work. On this type 5, the length one, that's this
one, where we're comparing two things that are not next to each other. Right? It's very hard to figure out the relationship
of these two things, because they're not directly comparable. But these two things, type 1, we're doing
quite well at when they're right next to each other and compared on a common
axis. But he could quantify that. And then he looked at several different perceptual
units. As an example, he also looked at comparing ratios of values using things like positions or angles. So here would be something like a pie chart, where we're comparing angles: if you're comparing the values of two things in a pie chart, you're really making that comparison by virtue of the angle of the slice, whereas with a bar chart, you'd be making it with the position of the bars. And what they found, one thing, is that
these angle comparisons were quite terrible. The log absolute difference for the angle
comparisons was quite terrible. Another thing, one of the worst examples,
was if you had to compare the ratio of two slopes. And they looked at different ways of displaying the two slopes. So you want to compare the slope from A to
B to the slope from B to C. You want to estimate that ratio. And of course, if you squash it in, it gets
harder to do. If you display it vertically rather than horizontally,
it gets harder to do. And we're very bad at doing this. But remember, we often make plots where we're
plotting the slopes of two groups. And intrinsically, we're asking the reader
of the plot to perform this calculation. And what this research is showing is that how much you squash the plot, how much you stretch it-- and even this task to begin with-- is actually
quite difficult for people. Another interesting-- and this is a separate
paper-- another interesting idea was that the scale of the plot matters quite a bit. So as an example, they showed people scatter
plots with the same correlation and found people reported dramatically different correlations,
depending on what they varied. So here they might show things that have roughly
similar correlations, but when they zoom out, people who visually estimate the correlation say that it's higher. OK? So again, our perception of correlation is
dependent on the arbitrary scale at which we choose to display the data.
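You can see the effect for yourself with a quick sketch: the same points, the same correlation, just different axis limits.

```r
set.seed(9)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = sqrt(0.75))
cor(x, y)  # about 0.5 in both plots below

par(mfrow = c(1, 2))
plot(x, y, xlim = c(-3, 3),   ylim = c(-3, 3),   main = "Tight scale")
plot(x, y, xlim = c(-10, 10), ylim = c(-10, 10), main = "Zoomed out")
```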
Here's the result of this perception task, right? So of course, if there's zero correlation,
people seem to be doing fine, and if there's a 100% correlation, people seem to be doing
fine. But in these plots, the size of the circle represents the variability at that point. You can see that at correlations around 0.5, 0.6, 0.7 or so, people do the worst. They're the furthest away from what the actual
correlation was. These two other lines-- I happened to reread this paper last night, just to say this-- come from their attempt to figure out which models best described perceived correlation. They think this one is a model of what people are actually visually doing, rather than this one, which is closer to the truth. So they came up with theories of what people might actually be perceiving: a good one that people appear not to be operating with, and a rougher one that doesn't exactly go through the data but appears to more closely correlate with what people are doing. Another great paper was by Jeff Leek, who
does some research in this area using our Coursera classes. And he actually had students try the experiment
of whether or not they could ascertain the significance of a correlation just by looking at a scatter plot. And one interesting thing they found was that
people could not do these kinds of tasks, but you could train them so that they were
able to do it. And they broke it down by different categories
of things that they could change. They broke down the accuracy by the different ways in which they changed how the plot was displayed: the axis scale, whether or not they put a lowess curve in there, smaller versus larger sample sizes, and how each affected accuracy. But an interesting component of this article was, A, that they were able to break down the various components that impacted whether or not people were able to surmise these p-values. But then they also saw this training effect. So some basic summaries. Whenever possible, use common scales. One of the perception results was that when you
mess up the scales, when you have two plots right next to each other, and one of them
is on one scale, and the other one's on another scale, that the comparison-- that translation,
mentally, is quite difficult for people. So when possible, use position comparisons,
things that are on the same scale, just measuring where they are relative to that scale, basically
asking where two things are when they fall on a ruler. Those are the things that people are
the best at. One of the things they were the worst at was
angle comparisons. So they're very hard for people to do. And a consequence of this is it basically
says that people are no good at interpreting pie charts.
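A two-line R comparison, with made-up shares, makes the point:

```r
shares <- c(A = 0.21, B = 0.18, C = 0.22, D = 0.20, E = 0.19)
pie(shares)                              # ranking the categories by eye is hard
dotchart(sort(shares), xlab = "Share")   # the ordering is immediate
```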
Things like we mentioned earlier, such as adding a third dimension, generally don't add much, but they also detract from people's ability to perceive things. And then again, this point that I've raised
up several times during this lecture is do not fool yourself about significance. And I think that I got the slide from Jeff
also. He makes an important point about saying in
either direction. So we've talked a lot about not fooling ourselves
in terms of not making false positives, but you don't want to then go too far the other
way. You can avoid all false positives by saying
everything's junk, nothing is ever significant. And you don't want to head in that direction
either. You want a good compass to guide you. I just want to give some acknowledgements. So I got a lot of these slides from Jeff Leek
and Karl Broman. And then Jenny Bryan, Genevera Allen, I got
some stuff from XKCD, Wikipedia, Len Stefanski, and RStudio. And then I think we-- I hit exactly 12:45,
so I saved some time for questions if there are any. Thank you, Brian. So if there are any questions, please type
them into the question box. Everyone's muted, so they can't actually ask
them, to keep all 200 people from asking at once. But while we're waiting for more to show up
in the question box, I have one to ask you, and that's whether you have an example of
where a lack of preprocessing or lack of looking at your data ahead of time has really led
you astray and really messed up your analysis later on. Well, I think, certainly anyone who teaches
a statistics class can give you an instance of when students have written reports, for
example, where, because they didn't look at the values, they got messed-up results. So most teachers will do something like spike
in some crazy values that will mess up everyone's results and try and teach people that lesson
early. But I think that lesson isn't-- it's very
hard to make that lesson sticky, because I've certainly had many times in my life where
I've gone through a full analysis, thought I'd found something super interesting, started
to write it up to share with my collaborators, and then found after-- maybe on the way to
talk to them, realized, oh, no. The direction of the effect is the exact opposite
of the direction that I thought it was going to be and that science would dictate. And then when I go back, I realize it's always
one of these sort of errors that there was some error in preprocessing, there were some
missing data that were coded as 888 or 999 that I didn't catch, or there was some errant
values, like someone who was 200 pounds being put in as 2,000 pounds, or something like
that. And that also reminds us not only to do plots,
but also to check-- there are diagnostics like DFFITS and DFBETAS and things like that that
really help us, in our models, diagnose these errant values. And it's always worthwhile to do those things.
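A minimal sketch of those diagnostics in R, using a simulated regression with one corrupted value (the 2,000 standing in for a mis-entered 200):

```r
set.seed(5)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
y[100] <- 2000                   # a 200 entered as 2,000

fit <- lm(y ~ x)

plot(fitted(fit), resid(fit))    # the errant point is hard to miss
head(sort(abs(dffits(fit)), decreasing = TRUE))         # DFFITS flags influential points
head(dfbetas(fit)[order(-abs(dfbetas(fit)[, "x"])), ])  # DFBETAS, per coefficient
summary(influence.measures(fit)) # base R's combined influence diagnostics
```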
So I mean, I can say pretty much any time I've ever failed to do the common steps of doing a lot of plots, checking my regression
diagnostics, every time I don't do that, I always wind up with something screwy. And then it just depends on how far I take
it before I have to go back and check it. Well, that's very good. That's a very good lesson for all of us to
learn-- to spend time doing that first. But what about-- do you have a sense of what
percentage of your time you spend doing the exploratory analysis versus a confirmatory
analysis? I think that very much depends on what branch of science you live in. I think I live in the discovery science world. So when I work on fMRI and MRI, much of my analysis is really on this discovery side of science. And in that setting, I think a lot more of
my analysis could be described as exploratory, as more toward the exploratory end. So then I would say, for many of those projects,
I would say, a huge chunk of it is exploratory. 80% or something like that. Now there are other settings I work in where
the science is more mature. Take Alzheimer's disease: because there are so many people working in Alzheimer's disease, we have a better scientific compass to lead
us. And in those settings, we tend to go into
the analysis with more directed hypotheses. And then I spend a lot less time-- we tend
to be able to come up with a nice frozen dataset that really is exactly what we're interested
in. More often, in other cases, it seems like
we start out with a dataset that we're interested in. We find some interesting things and then find
that we didn't process the data in the way that was necessary to answer these new
questions that arose, and then we have to reprocess. And I certainly would think that someone who
worked in clinical trials would spend a little bit less time on data pre-processing. But I don't know. That would be an interesting empirical study,
to actually get numbers about it. Because I'm sure the amount of time we perceive we spend on data pre-processing is different than what it actually is. Yeah, I bet. So we have a couple of questions in the question
box. The first one is about lasagna plots. Coming back to the very beginning. And whether lasagna plots violate your data
to ink ratio. They do have a lot of ink, that is true. But on the other hand, they are displaying the full data set. So both the numerator and the denominator are high
in that case. So I don't know if they have a great data
to ink ratio, but they do have at least-- in our estimation-- a tolerable one. Especially because they tend to take large
datasets, and they tend to-- they basically redisplay the whole dataset. The key with lasagna plots is not so much
the original plot, because that's interesting in and of itself. And if you're lucky, like in the sleep experiment,
where there's some obvious missing data pattern that, for whatever the reason, was common
across subjects, that's great. But probably the key to the lasagna plots
is some sort of sorting or organizing of your rows, where you can try to detect patterns. But anyway, yeah. The point is well taken that yeah, they do
use a lot of ink. OK. And then the next question is about any suggestions
that you might have for software to do EDA. Do you have any preferences on software, or
can you just give some advice about what are some things people might use? Yes. So I use R. And then-- but I've been using
R for a long time. So I use Base R, which is just R's default
plotting stuff. Since then, in R, there's been a revolution
with this package called ggplot. And ggplot, the ggplot stands for grammar
of graphics. And this was a theory that was worked on starting back-- I think it dates back to Cleveland and other work
at Bell Labs and then goes all the way forward to Hadley Wickham's work on it, and several
other people. And then I forget who the actual inventor
of the term and the concepts of the grammar of graphics is. At any rate, this led to a system that is then
operationalized in the R package, ggplot, which is, I think, one of the most popular
R packages. And if you get used to ggplot, people seem
to absolutely love it. I use it-- I find myself using it more and
more. But because I was so used to Base R, it's
been hard for me to switch over. And ggplot, as implemented in R, basically
has two steps. One is you have to work really hard to clean
up your data and get it into a nice format. And that's almost a feature of ggplot in that
it forces you to work really hard on your pre-processing before you start plotting. And then once you get it in a nice format,
then you're off to the races with the plots. And then the actual plotting syntax is quite
good for ggplot. So I would recommend that.
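As a small sketch of that two-step workflow (this tidy-then-plot reshaping of the Anscombe data follows a standard tidyr example):

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Step 1: wrangle the wide anscombe data into one tidy rectangle.
tidy_anscombe <- anscombe %>%
  mutate(obs = row_number()) %>%
  pivot_longer(-obs, names_to = c(".value", "set"), names_pattern = "(.)(.)")

# Step 2: the plotting itself is one short expression.
ggplot(tidy_anscombe, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ set)
```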
But I would say, every statistical data analysis program has completely tolerable graphics capabilities.
tool is as much the problem. Now, if you're talking about creating production-ready
graphics, then I think you can get into the specifics of various platforms. One thing that I didn't mention that is quite
useful is interactive graphics. That's where you create a plot, and you create
sliders or buttons or dialog boxes and things like that so that you-- so that the person
who's looking at the plot, whether it's you, or whether you're giving it to someone else,
can interact with it. And there are increasingly great tools for
that. Plotly is an example, and Shiny in R is another.
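For example, a minimal interactive scatterplot with plotly is a single call (hover, zoom, and pan come for free):

```r
library(plotly)
plot_ly(mtcars, x = ~wt, y = ~mpg, type = "scatter", mode = "markers")
```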
And then, of course, the gold standard for that is D3, which is a JavaScript library. But that requires a pretty heavy investment
of programming knowledge and time to master. But there are some layers, like Plotly, that
have been built on top of D3 and other libraries that make it kind of easy. So at any rate, I guess the answer to the
question, from my perspective, I use R, and I like R. And then if I needed interactive
graphics, I would use Shiny or Plotly. Thanks. And there was a comment in here, not along
the lines of software, but other resources. There was a comment here from Ethan McMann
about a report-- a CIA report that he's seen that is useful in this area. Can you, in addition to your comments on software,
could you give some advice about where someone might go to find out more information about
EDA? I mean, you mentioned some of the leaders
in this area, like Hadley Wickham, and Karl Broman, Jenny Bryan, Bill Cleveland. And you also mentioned Roger Peng's Coursera
course. But other than those, going to the research
of the first group, or Roger's course, do you have any other advice for where to find
more resources? So--
Other than a tolerable book that you could just pick up-- meaning anything that really stands out? So there are two aspects to EDA, and one is
super well-developed, and the other is a little more amorphous. So the first aspect to exploratory data analysis
is the plotting component to it. And there is, I think, a lot of great materials
on plotting. So on design, there is Tufte's books. On the perception, there's all this Cleveland
work. On the implementation, there's lots of books
on specific plotting implementation. There was a great book written by Nathan Yau,
who does FlowingData. Look at his work on graphs. I'm trying to think. Roger Peng's course has quite a bit of stuff
on graphing. Cleveland actually has a book on plotting
as well. Kind of a Tufte style book on plotting. There is a-- boy, I'm forgetting. I'm blanking on some names. But there's lots of books on plotting. And then there's the other aspect of exploratory
data analysis, which I didn't get too far into, which is you could use models, clustering,
a huge chunk of tools-- what I would just call modeling tools. But in general, statistical tools can also
be part of exploratory data analysis. Regression is a key part of exploratory data
analysis. Right? You fit a model, throw in a confounder,
check things. And I would say that's less well-developed
as a field. So there, you have to go to books on machine
learning or clustering to look at unsupervised methods. Any regression class will have lectures on
how to use regression, but they won't differentiate the use of regression in EDA versus the use
of regression in a final confirmatory-type analysis. The standard for that is, of course, Tukey's
book. For that style of analysis, it's wonderful in that regard. So that would be the starting point, and then
I'll think about that. Maybe on the slides, which are Google
Docs, I'll add some references if I can think of some more. OK. And I think you just answered the last question,
which was commenting on whether some other techniques, like dimension reduction and collective
variable search would be considered EDA or could be categorized as EDA. And that sounds like what you just answered. Yes. Absolutely. Absolutely, I would consider that EDA. Anything-- EDA, I guess one way to define
EDA is anything that's not confirmatory in data analysis. But I think when I look at my definition,
it's a little bit-- it's not very strict for what constitutes EDA. And I think I define it more along the lines
of if you're really doing hypothesis generation, and you're really doing a more free-flowing
style of analysis, where you're looking for trends, and you're poking and you're prodding,
and you're developing your hypotheses on the fly. Anytime you do that, no matter what techniques
you're using, I think you're doing EDA. OK. Great. And with that, since we are out of time, I
want to thank you again for this great lecture and for coming here to help us understand
this better. Thank you, Brian. No problem. Thank you. All right. Bye-bye. Bye.