Miriah Meyer | Effective Visualizations | RStudio (2020)

Captions
[Music] Without further ado, here's Miriah Meyer talking about designing effective visualizations.

Thank you so much. Thanks so much for being here. I'm super excited; this conference is very energizing and really awesome. I feel like I'm one of the cool kids, so let's go with that. My name is Miriah, and I am a professor at the University of Utah, which is in beautiful Salt Lake City. At the U I get to work with an amazing group of students in the Visualization Design Lab, which I co-run with my colleague Alex Lex. In our lab we focus on deeply collaborative, design-oriented visualization research. What that means is that we spend a lot of time working very closely with domain experts in fields ranging from biology through cybersecurity and even poetry, and as researchers we use these projects as test beds to experiment and to ask research questions. The kinds of questions we ask fall into three main classes. The first are questions about how we can use these collaborative projects to come up with new and innovative visualization designs that help us better understand the increasingly complex data that we have. The second is about the visualization design process itself: how can we as visualization designers and practitioners do what we do more effectively and more efficiently? And the third thing we think a lot about, and one that I'm increasingly excited about, is moving beyond thinking of visualization as the end product and instead thinking of it as a way to probe into how people think about their data and their relationship to technology in general. For me this is really about using visualization as a tool in the human side of data science. So today what I wanted to share with you are a couple of examples from projects of how we used ideas from the visualization research community in
our very pragmatic, design-oriented process that we also follow as practitioners, and hopefully along the way give you a few ideas for how you too can look to visualization research as a way to shift how you approach visualization design.

All right, so to begin, let's talk about the idea of designing new and innovative visualization techniques. Here I want to tell you about a project I worked on with a colleague of mine, Bang Wong, who is in Boston; he's a really brilliant visual designer. Bang and I were working with a group of biologists who study yeast, or more specifically, who were interested in the process of metabolism in yeast. This group was trying to understand how different species evolved a similar process for metabolism, because this has implications for our understanding of how diseases like cancer work. The main kind of data the group was working with is called gene expression, which is really just a measurement of how much a gene is turned on or off in a cell, and the group was collecting gene expression for many different species of yeast under different experimental conditions. Their main method for visualizing this data was a heat map, and this is in fact one of theirs. I can see Will in the front tilting his head, and I'm sorry, I will never do that again. But a heat map was the predominant way these biologists were looking at their gene expression data. Bang and I began to engage these scientists in conversations around what worked well for them with this visualization and what was really problematic. What they said is that they wanted to be able to make very fine-scale, nuanced decisions based on changes in gene expression, and they found that in these heat maps it was nearly impossible to see that kind of detail in the data. So Bang and I
stepped back and said, okay, we need to redesign a heat map, but how in the world do we even start approaching this problem of coming up with something new? We decided to turn to a fundamental principle in visualization design, which is that spatial encoding is the most effective encoding channel we have. By spatial encoding I mean positioning marks on a common scale, as in a scatter plot, or using the length of some sort of mark, as we do in a bar chart. This is a foundational principle; you can ask many visualization designers about it. But as a researcher, how do we actually know this is a good principle to go by? Part of it we know through controlled studies of different encoding channels, the first of which was done by the statisticians Cleveland and McGill back in the 80s. They ran a controlled lab experiment in which participants performed a series of tasks using different encodings like color, position, angle, and so on, and they found that the spatial encoding techniques significantly outperformed the other types of visual encodings. This study was replicated a number of years ago by Jeff Heer and Mike Bostock, only this time on Mechanical Turk, so they were reaching hundreds and hundreds of people, and the results they got were very similar to what Cleveland and McGill had found. It's through studies like this that we're able to say with some confidence that certain kinds of visual encoding channels are more effective for certain kinds of tasks, and based on this we can rank-order our basic encoding channels. For example, what we're looking at on the left-hand side are the encoding channels we have for quantitative data, and on the right-hand side are the basic encoding channels for categorical
attributes, and you'll notice that both of them have spatial encoding at the top as the highest-ranked channel. So as visualization researchers we do studies on a variety of things like basic visual encoding channels, but we also combine them with things we've learned from design practice and years and years of anecdotal evidence to get to ideas like this.

All right, back to the yeast. Bang and I decided to take this principle and think about what it means for the data we had. The data was actually time series, so we looked at line charts. We went off into Illustrator, mocked up every single variation we could possibly think of for how to show a line chart, and printed them out on a big poster. What we noticed was that the ones over on the right, these little filled frame charts, really made the shape of the line charts pop out. So we incorporated that idea into a new technique we called a curve map, which we implemented as part of a bigger multi-view system that we deployed to our collaborators. Using this new curve map technique, our collaborators said they were able to much more quickly see nuanced things they already knew about the data, but they also saw things they had never known were there, which led to follow-up experiments and the discovery of some new scientific concepts they weren't aware of before. So I want to stress this idea: we can take perceptual principles and other research results and use them as a springboard for brainstorming new ways to encode our own data.

Okay, the second point: the visualization design process. This is really about saying, okay, we have to create visualizations, but there's a whole lot of work to do before that, which is figuring out what it is I want to visualize in my data. And this is one of the
things that I personally have spent a lot of time on and care about a lot. I'm going to start with a little flavor of what it's like to be a visualization researcher like myself. We have two people, a vis person and a biologist. We might start by saying, oh, what is it you want to visualize? And the biologist might say something like, well, from patterns of conservation we want to visualize the mechanisms that influence gene regulation, and to me it sounds like this. It turns out the Ginger effect isn't just for debugging; it also occurs all the time in visualization design. As a side note, beware of Jenny Bryan scooping your punch lines, but I am just going to plow ahead anyway. Thanks, Jenny. But it's true that that same experience of only really catching a few things at a time is absolutely the case when we jump into these new projects. What I'm really interested in is how we get from a very semantically rich question or task like the one we have here to the knowledge that lets us design a visualization like this to support it. Fundamental to this process is identifying something I like to call proxies, which are the partial and imperfect data representations of the semantically rich thing the analyst actually cares about. The high-level goal of a data analyst is rarely captured directly in the data, and if it were, you probably wouldn't need visualization anyway. Instead, we have to spend a lot of time thinking about what things in my data I can infer something from that will actually let me answer the question I have. Let me give you an example of what I mean. Let's say I want to identify good film directors, because, I don't know, Sundance is happening right now in Salt Lake and I'm a journalist who needs to identify some good film directors. Okay, great, so I go off and scrape some data from IMDb about movies. So now I have
this data set that includes lots of information about movies, such as how much money they've made, their ratings, and so on. Wonderful. But going back to my question: my question is really about film directors, and I have information about movies. So what do I do? This is where I like to think about how to select a good proxy in my data for my question. My colleague Danyel Fisher and I came up with a very simple mechanism for doing this: we break a task down into three things: the action, which is the thing you want to do; the object, which is the items you want to take that action on; and finally the measure, which is the value you care about for those objects. So let's apply this to our task. The action is identify. What do I want to identify? Film directors. Now, this is where I know I have a data set about movies, not film directors, but I can say: if I can learn something about movies, I can infer something about directors. So I'm going to choose movies as my proxy for directors. And what do I want to know? I want to know if they're good. I don't have an attribute labeled "good" in my spreadsheet, and this is where, I think, in data science we are constantly making decisions that are inherently subjective and specific to whatever view we have of the problem we're tackling. In this case, good could mean movies that make a lot of money, or movies that reach a broad demographic, but for now I'm going to keep it simple and say good equals movies with high IMDb ratings. By identifying my proxies, I can translate my task into: identify movies with high IMDb ratings, which is something I can actually design a visualization, or some other sort of analysis, to tackle. But it turns out that doing this translation and finding these proxies is a really challenging process and takes a lot of time, and in our
group, what we do is spend a lot of time really immersing ourselves with the people we're working with in order to better understand the kinds of challenges they face and the things they want to do. Some colleagues at the University of Calgary just this past year formalized this notion of immersion, calling it design through immersion, and described some of the benefits we get when we do transdisciplinary research and bring a big-tent-style approach to how we problem-solve and, in this case, do data analysis. Okay, so that's a little bit about the visualization design process and how we can do it in practice.

The last thing I want to talk about is the notion of visualizations as probes. I'm going to do this by talking about another project I worked on that was headed up by a former student of mine, Nina McCurdy, where we were working with a group of public health experts at USAID who were studying the spread of Zika in Latin America and all its effects, notably microcephaly in babies. During this project Nina spent six months at USAID working closely with the health experts there, and it had all the makings of a straightforward visualization project: the collaborators had data, they had tried to visualize their data and failed, and they were super excited to work with us. So Nina left and we dug in. Over the course of the study, Nina did a lot of rapid prototyping and ultimately designed a tool that used best practices for incorporating all sorts of tabular data with geospatial information. She evaluated this tool with various stakeholders throughout that time, and the response was overwhelmingly positive: this was a great tool for the data they had. And yet, when we tried to engage our immediate collaborators in incorporating the tool into their workflow, we noticed a lot
of hesitation, and we were really concerned about this. As we probed deeper into this hesitation, we came to understand that even though the tool was a good representation of their data, the data was not a good representation of what they knew to be true about the spread of Zika on the ground. This first came up in a discussion of a choropleth like this one, showing Zika cases at a national level. As you'll see here, Brazil is in dark red, indicating a relatively high percentage of cases, whereas Colombia is in a lighter orange, indicating a relatively lower percentage. One of our collaborators, when she saw this, noted: well, Brazil reports all cases, whereas Colombia only reports cases after a thorough investigation. The implication of this comment was that the different ways these countries were reporting Zika were leading to an inaccurate and possibly misleading picture of what was actually happening. We pivoted to focus on this problem and found that the data was littered with discrepancies like this. In the case of the Zika data, they came from the way the data was collected, processed, and reported: it was this distributed, heterogeneous data-generation pipeline, playing out differently in every country, that led to all these discrepancies. Now, all was not lost. Even though these problems with the data were not included in the data set, the experts working with it had deep and intimate knowledge about them. These discrepancies, it turns out, were shaped by each country's political, cultural, economic, geographic, and demographic context, and so we'd get discrepancies like: the union in region X goes on strike often and doesn't report Zika data, or country Y recently overhauled its surveillance system, leading to a sudden increase in detected cases. In thinking about this, we formalized the notion into what we call implicit
error, which is measurement error that is inherent to a given data set, assumed to be present and prevalent, but not explicitly defined or accounted for. We were able to characterize aspects of implicit error, and we designed an annotation mechanism to help experts externalize it. Nina implemented this annotation mechanism back into our tool, which we redeployed so that these health experts could start annotating their data in order to share information with their colleagues, as well as to start building up a database of contextual information around that initial data set. Although this work was grounded in Zika health data, we suspect that this type of implicit error is prevalent in many, many other domains.

Okay, so this was an example where we were able to use visualization not necessarily as the end result, but as a conversation starter about challenges the group was having with their data, and then ultimately as a mechanism to help analysts externalize things they knew that weren't captured in the data set. So these are three things that give you the flavor of what we think a lot about as visualization design researchers, and that I think have a lot of impact on how we do visualization in practice. I just want to leave you with a quick recommended reading list. The first one, which maybe some of you know, is Designing Data Visualizations by Noah Iliinsky; it's a great book that incorporates a lot of the foundational principles we in the research community have for designing visualizations. The next one, Visualization Analysis and Design, is the grad textbook by Tamara Munzner; if you want to geek out about how we talk about visualizations, as well as more complex visualization types, this is a great book. The next one, Making Data Visual, is a personal little plug: this is a book my colleague Danyel Fisher and I recently published that really looks at the
process of how we figure out what we're designing for. And the last one is the sleeper recommendation, a hand-me-down recommendation from Martin Wattenberg: a book that will absolutely change the way you think about visualizations, and that actually nicely complements a lot of what Will was talking about earlier today. And with that, I just want to thank you all for your attention.

Thanks so much, Miriah. Some questions. Well, since you just mentioned him: was Martin Wattenberg right when he said pie charts were underrated? I absolutely agree, I totally agree that they're underrated, but this is a deep philosophical debate within the vis community, for sure. Great, thanks for tackling that. Can you describe the small multiples chart you used to replace the gene expression heat map? What is shown on each axis? Did the final chart have more annotation? More annotation, interesting. Yeah, so what we were showing: the rows were different species, the columns were different genes, and they had taken experimental measurements at different points during the life cycle of the species, so that's what the time curves were. But really, the curve map gives you a table layout, so you can define attributes along the rows and columns to help facet your data, and inside we were showing time curves. Great, thanks. How do you bridge the knowledge gap with someone who might not understand, or has never seen, a visualization like you've shown? Oh my gosh, okay, I don't know how to answer that quickly; that's a great question. I think this is a really important point, actually: when we are designing a very new and different type of visualization, we don't just show up one day and say, voila, look, you're going to love it. Instead it's a process of building trust with our colleagues so that they trust us with their hard-earned data. Often the first thing we implement is
similar to what they already have, and then we slowly change things over time. We're actually designing with them, not for them, and that collaborative process, I think, really helps people embrace new and different ways of looking at data. Great. How do you determine the effectiveness of various visualization methods? Any qualitative metrics? Oh wow, great question. Some of the basic encoding channels I talked about are pretty straightforward to test quantitatively in a controlled experiment, since it's largely based on our perceptual system. The kinds of visualizations we create in my research group, though, are complex and part of a broader workflow, and we cannot test them quantitatively, so we use a lot of qualitative methods, such as case studies, in order to understand the efficacy of the tools we create. So yes, we definitely rely heavily on qualitative methods. What is the funding model for an academic vis lab? Do you have project-specific grants? Do you function as a core group for other labs? Is your work well acknowledged? Sorry, gosh, there are a lot of questions. Yes, largely we apply for grants for certain projects, and that money funds our graduate students and a little bit of our time. Our lab is not a service lab, because we focus on research; we do have service groups, for example on our campus, that do that kind of work. And then, sorry, what was the last one? They took them away from me. Okay, did I read the third one? Oh, is it whether the work is well acknowledged? Yeah, no. Especially when creating the kinds of visualizations that we often put out and that people use as part of a larger analysis pipeline, visualizations, I think, are woefully under-cited, and this has often led to a crisis within our research community about the value of what we do, because a lot of the time it's not really acknowledged. I think
visualizations are often used for creativity and brainstorming, to get us thinking about something else we want to do, not necessarily for the final answer, and I think that part is something people don't necessarily recognize as an important thing to cite. Well, thank you so much. Another round of applause.
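The encoding-channel ranking Meyer describes, with spatial channels at the top for both quantitative and categorical attributes, can be captured as a simple lookup. The orderings below are an illustrative summary of the Cleveland–McGill and Heer–Bostock results, not a reproduction of her slide:

```python
# Illustrative effectiveness rankings, most effective first.
# Spatial encodings (position, length) top both lists, as noted in the talk.
QUANTITATIVE_CHANNELS = [
    "position on a common scale",  # e.g. a scatter plot
    "length",                      # e.g. a bar chart
    "angle",
    "area",
    "color luminance",
]
CATEGORICAL_CHANNELS = [
    "spatial region",
    "color hue",
    "shape",
]

def best_channel(attribute_type):
    """Return the highest-ranked channel for a quantitative or categorical attribute."""
    ranking = QUANTITATIVE_CHANNELS if attribute_type == "quantitative" else CATEGORICAL_CHANNELS
    return ranking[0]

print(best_channel("quantitative"))  # position on a common scale
print(best_channel("categorical"))   # spatial region
```

As in the talk, the point is not the exact ordering of the lower channels but that spatial encoding leads both rankings, which is what motivated redesigning the heat map (a color encoding) as a curve map (a spatial one).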
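The action–object–measure decomposition Meyer describes for translating a question into a concrete data task can be sketched in a few lines of Python. The record fields, ratings, and the threshold here are hypothetical values for illustration, not data from the talk:

```python
# Break the task "identify good film directors" into action, object, and measure,
# choosing proxies for the parts that are not directly in the data.
task = {
    "action": "identify",
    "object": "film directors",      # not in the data...
    "proxy_object": "movies",        # ...so movies stand in for directors
    "measure": "good",               # not in the data either...
    "proxy_measure": "imdb_rating",  # ...so a high rating stands in for "good"
}

# A toy movie table (hypothetical values).
movies = [
    {"title": "A", "director": "Jones", "imdb_rating": 8.4},
    {"title": "B", "director": "Smith", "imdb_rating": 6.1},
    {"title": "C", "director": "Jones", "imdb_rating": 7.9},
]

def identify_good_directors(rows, threshold=7.5):
    """Proxy task: directors of movies with high IMDb ratings."""
    return sorted({r["director"] for r in rows if r[task["proxy_measure"]] >= threshold})

print(identify_good_directors(movies))  # ['Jones']
```

The threshold is exactly the kind of inherently subjective decision Meyer points out: "good" could just as well have been proxied by box-office revenue or demographic reach.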
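The annotation mechanism for externalizing implicit error can be imagined as records that attach an expert's contextual note to a slice of the data. The field names and example notes below are an assumption for illustration, not the actual schema of the tool Nina McCurdy built:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """An expert note externalizing implicit error in a slice of a data set.

    Implicit error: measurement error inherent to the data set, assumed to be
    present and prevalent, but not explicitly defined or accounted for.
    """
    region: str       # country or sub-national unit the note applies to
    time_span: str    # reporting period affected
    note: str         # the contextual knowledge itself
    author: str = "unknown"

# Hypothetical annotations echoing the reporting discrepancies from the talk.
annotations = [
    Annotation("Colombia", "2016", "Only cases confirmed after thorough investigation are reported."),
    Annotation("Brazil", "2016", "All suspected cases are reported."),
]

def notes_for(region, anns):
    """Collect every contextual note recorded for a region."""
    return [a.note for a in anns if a.region == region]

print(notes_for("Colombia", annotations))
```

A store of records like this is one way the experts' knowledge about the data-generation pipeline could be shared with colleagues and accumulated into a database of context around the original data set.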
Info
Channel: RStudio
Views: 3,129
Rating: 5 out of 5
Keywords: rstudio::conf(2020), Miriah Meyer, rstudio, data science, machine learning, python, stats, tidyverse, data visualization, data viz, ggplot, technology, coding, connect, server pro, shiny, rmarkdown, package manager, CRAN, interoperability, serious data science, dplyr, forcats, ggplot2, tibble, readr, stringr, tidyr, purrr, github, data wrangling, tidy data, odbc, rayshader, plumber, blogdown, gt, lazy evaluation, tidymodels, statistics, debugging, programming education, rstats, open source, OSS, reticulate
Id: BEnLLQaUyzQ
Length: 22min 54sec (1374 seconds)
Published: Tue Nov 17 2020