Jake VanderPlas - How to Think about Data Visualization - PyCon 2019

Captions
Good afternoon, everybody, and welcome to the final presentation in this room for PyCon 2019, on the subject of data visualization. Please welcome Jake VanderPlas.

Well, thanks very much. Thanks for sticking around to the very end. Today I want to talk about thinking about data visualization, and this is sort of in opposition to the question where people usually start: what visualization tool should I use? I don't know if any of you were in Portland in 2017, but I made the mistake of deciding to do a talk about exactly that: what visualization tool should I choose? My plan was to compare and contrast four or five good visualization tools in Python, but as I started digging, it turned out there are a lot of ways to visualize data in Python. I saw someone putting their camera up, so I'll let you take a picture of this.

So I don't think this is a good place to start, because there's too much confusion and there are too many different things. Really what it boils down to is that the Python community has such a wide diversity of use cases and applications that you need a lot of different tools to accomplish the different types of visualization that people are doing. So I don't want to start with the question of what visualization tool I should use. I want to start with this: how should I think about data visualization? What is data visualization? Then we can build up from there, build up the concepts, and start thinking about how we can make effective visualizations no matter which of those million plotting packages you decide to use.

So what is visualization? If you take a data set like this, what visualization does is take this representation of the data set, kind of a tabular form, and put it in a form that makes it more intuitive, so that at first glance you can find the relationships in the data. Does anyone recognize
this data set? There's a few hands. This is a data set where, if you take common summary statistics like the mean, standard deviation, correlation, and things like that, all four of these data sets are the same. But if you visualize them, you start to see that there are some very different properties in the data. So visualization is important because it helps us see what's in our data; it helps us get an intuitive feel for what's in these tables of numbers.

What's going on when we visualize data is that we're essentially encoding the values in the data set into certain visual representations. In this case, the way we've encoded the data set is that we've put the x value in the x position on each plot and the y value in the y position, we've drawn a little circle, and then we've faceted the data into four different facets, or four different panels, to get a feel for what each individual data set is. So this is an encoding: a way we've transformed the data from numbers into visual properties.

So we could ask what other encodings we might use. We used the x value, the y value, and the facet here. We could do something different: instead of the facet, we could encode the different data sets by color, and then we get the advantage of having everything on the same panel, where it's maybe easier to compare. But this is sort of a muddled plot, so maybe color is not the best encoding for the data set. What if we use shape? You see a lot of these kinds of encodings, where people draw different classes of data with different shapes. It can be effective and sometimes not effective; I would argue that in this case it's not very effective. It's hard to pull out what's going on there. You could do something like size; that might help here. These are all different visual
features you can use to encode the same information. Maybe we can use shape, size, and color all together. This is starting to get a little bit crazy, but it does maybe help you distinguish what's going on in these data sets. For argument's sake, though, I would say that this is not a very effective visualization. There's something about it that's just not quite as appealing and not quite as intuitive compared to the four-panel plot. We'll get into that later, when we start thinking about what drives that intuition in visualizations.

Here's one where we just throw everything at it: we encode the data set by facet, by shape, by size, and by color. It's a little bit crazy, but it helps us see what's going on. So this is good to think about: we can encode all these different properties in different ways.

But there's one property in this data set that we're missing so far. Does anyone see what that might be? What does this visualization of the data set completely ignore? I hear some mumblings. It's the row number: there's no information here that tells you the order of the points as they appear in the table. We could add that information. We could say that the index, or the row number, of the data is encoded in the color in this four-panel plot. That's sort of hard to see; maybe if you looked with a magnifying glass and categorized things, you could see which point is which. So maybe for the order we can do something like adding a line, and we see that in this data set the order doesn't really mean anything. But that information was there, and it was completely lost in the previous panels that we did. So we could start thinking about better ways to represent
the order in the data. Here are all four values in the data set encoded in a completely different way: we have the index, the order of the points, on the y-axis, we have the data set on the x-axis, and then we use color and size to show the relationships between the values. This doesn't really work; you can see that color and size don't give you that intuitive view of the relationship between the data points.

But we could play this game and do all sorts of things. We could split the data up into the four data sets and use the size of the points, or we could use the color of the points. We could decide that instead of using a circle mark we might use a rectangle patch and create a kind of heat map of the data. We could use a bar chart in each of those panels that shows the scale of the data split up by these various factors. Essentially, this is a data set with four different dimensions of information, and we can choose all these different ways to encode the information in that data. Some of them are more intuitive than others, some of them let us see the relationships better than others, and some of them are just really, really bad.

This is one that I really like; I played around for a while to find it. In this one the row of the data, the order, isn't encoded, but the grouping of the index is, and we use a slope between two different points to show the relationship between the points in each table. It's maybe not as intuitive as a normal, familiar scatter plot, but this gives you a good idea of what some of the weird points in the data set are.

Anyway, the point is that all these different representations I've run through are encodings of the exact same data. They contain all the exact same information, but there's something in here that we're starting to get an intuition for that makes some of these effective and some of these ineffective. So as
we're thinking about how to visualize data, it would be nice to put our finger on what makes a visualization effective. What is it about these faceted bar charts that ends up not being very useful?

Just as a quick summary of what I've been talking about: over the course of looking at all these visualizations, we've set up something where data properties are encoded in some sort of visual representation. We have the data represented by some mark, which might be a line, a point, a bar, or a patch on the heat map, and we have scales that map these encodings onto the values underneath. The scales might be the numbers on the x-axis, the labels in the legend, or the color bar.

What this suggests is that we can start talking about visualization in terms of a grammar. People have been thinking about this for decades, and one of the more famous books that talks about visualization as a grammar is this book by Wilkinson, The Grammar of Graphics. It basically lays out that when you want to create a chart, the grammar you have is this: you start with the data; you have some sort of transformation of the data; you have the marks that we talked about, whether it's points or a line or a bar, or it might be multiple points of different shapes; you have the encoding, which could be x position, y position, color, shape, or size; and then you have the scale, which is the way that we indicate to the reader what the encoding means. Here, for example, we have labels on the x-axis and y-axis, and we have a legend that maps the shape and the color to values of interest.

So the question, when we're looking at two different representations of a data set like this, is: what visual encoding is going to be most effective, what mark is going to be most effective, and what scale is going to be most effective for my data? These are the kinds of
things that people in the visualization community have been researching for years. An example is Jacques Bertin, who put out this Semiology of Graphics, which basically tried to lay out the theory of which encodings and which marks are going to be most effective for a given data set. It's in French, so we'll shift to French now for the rest of the talk. No, I have the translations here.

He basically laid out, from the top to the bottom, what the most effective encodings are in terms of human perception. At the top we have things like 2D position, and we can see that 2D position is useful for knowing the order of the data and also the quantity of the data: we can eyeball it and figure out roughly what the values are, as well as the order of the values. Size is similar: size gives us a sense of order and a sense of quantity. Color value can also give us a sense of order and a sense of quantity, as we go from lightest to darkest.

As we start going down the scale, though, we get to encodings that are less useful for quantity or for order but can be useful for categories. For example, the color hue, the red, blue, green, yellow, doesn't give you any sense of order, so you wouldn't want to apply this to a quantitative axis. You couldn't look and say red is less than green and green is greater than blue. But it does give you a good sense of category. It's similar for shape: it gives you a sense of category but not order.

As you start thinking about this, you can lay out this chart. Think of three types of data: nominal data, which is categorical stuff that's not ordered at all; ordinal data, which is discrete ordered categories; and quantitative data, which are continuous quantities. They're encoded better or worse by each of these possible encodings. Position, for example, is
very good for nominal, ordinal, and quantitative data: it gives us a sense of data identity, a sense of scale, and a sense of order. Whereas if you go down to things like color value, you can get nominal and ordinal quantities, so the order of the data, but not necessarily as much the actual quantity that the data is representing.

As you start to think about this, there are a few practical takeaways for when you're developing a visualization. One is that not all encodings are created equal. We identified this as a bad visualization earlier, and I think the reason is that it's not using the most optimal encoding for a property in the data that's very important. Here we're distinguishing between four different data sets, so we probably want to use the most intuitive and most easily perceptible encoding to do that. Color, shape, and size are not the most immediately perceptible encodings for people looking at a chart, whereas position is. If we split these out positionally, we can immediately look at the chart and see that these are four different categories without having to think deeply about the meanings of the symbols. So one takeaway is that we should prefer position encodings whenever possible, because they're something that we can understand at a glance.

As an example of this in the real world, here's some data from the Census, where we have the distribution of age versus population for every decade going back to 1850. This is not a very intuitive chart. All the information is there, but it's really difficult to see what's going on; it's not easy to see the difference between the year 2000 and the year 1990 in this plot. So what can we do? Instead of encoding this important information in color, we can encode this important information in
position: we facet the data. This is something that's known in the visualization community as small multiples: making lots of different views of the data that change slightly from panel to panel. Making use of these kinds of small multiples can be a way to really quickly create effective visualizations, because you're encoding these important properties in an encoding that's easy to perceive, the position. I love this visualization, by the way, because you can sort of see, in 1950, here's the baby boomer bump, and you can follow the baby boomers as they get older. There's the bump, and then here are the baby boomers' kids in 2000, starting to populate it. It sort of marches along to the right and then tapers off at later years.

The second takeaway that I want to tell you is that the best color scale is going to depend on the data type. We thought about using color encodings, and really the effectiveness of a color encoding is going to change depending on what you're looking at. This is an example of a not very good encoding. It's using the color hue, something that's very well suited to nominal, categorical data, to try to encode quantitative information. The reason it's not very good is that it can often call your attention to the wrong aspects of the data set. In this rainbow color map the yellow really stands out to you, just because of the way that your eyes work, and yellow is not at the extreme: it's not a high unemployment rate or a low unemployment rate, it's somewhere in the middle. So by emphasizing the yellow at first glance, we're not really conveying the right information in this plot.

A better color scale to use for this sort of quantitative data is a perceptually uniform one, like color value. So effectively we're looking at the
transparency of the color as well as a uniform change from yellow to blue, and this shows you a little bit more of what's going on with unemployment in each county. We see that there are patches of high unemployment, and across the Midwest it's relatively low unemployment. It gives you a better intuitive sense of what's happening in the data.

Another type of color map that you should keep in mind applies if you're plotting something that has a symmetric distribution around a midpoint. For example, this is the unemployment with the average subtracted, where average unemployment across the U.S. is in white, higher unemployment is in blue, and lower unemployment is in red. This kind of diverging color map lets you see two extreme quantities at once in a very intuitive way. So using color as effectively as possible can help you create really nice visualizations.

The last thing I want to say, as a general principle, is that it can be really useful to use a visualization API that has these kinds of grammatical approaches built in, because then you need to spend less time thinking about it and making sure you're making good choices, and instead you can let the package that you're using make the right choices for you. The way I like to think about it is that we want a visualization tool where the way we think about visualization, in terms of these encodings and this grammar, is mapped onto the way we code the visualization, and that's mapped onto the way the visualization is presented on the screen. One of the things I've found over the years is that when you have a good set of APIs that maps onto how you should think about things, it helps you think about things better. It's this positive feedback loop that makes you more aware of exactly how you should be approaching your problems.

So there are a number of interesting
grammar-based plotting packages out there. Probably the best known is ggplot2 in the R world, an API that's built directly on this Grammar of Graphics book that I mentioned earlier. In the R world you'll find that people love this, swear by it, and wouldn't use Python because ggplot2 is not available. It's an incredibly powerful tool. In the Python world, there's an interesting package called plotnine, whose goal is essentially to take the ggplot2 API and bring it to Python, and it uses matplotlib as a backend. Matplotlib is this visualization tool that's been used over the last 20 years and is very mature. There are other approaches: Vega-Lite is a grammar implemented in JavaScript and JSON that lets you specify visualizations in a grammar-based approach, and Altair is a package that I've been working on in the Python world that gives you a Python API for this Vega-Lite grammar. Because I'm the one speaking, I'm going to focus on Altair.

So, a real quick look at Altair. I know I told you I was not going to recommend a tool, but you should really use Altair; it's a good package. The idea with Altair is that if you take data that's in a tidy format, with rows being observations and columns being categories of those observations, then you can create an Altair plot by directly specifying this sort of grammar that we were talking about before. Just real quick: here we're specifying that we want the mark to be a point, the x encoding to go to the petal length, the y to the sepal width, and the color to the species, and it pops out right there. The grammar maps directly onto the plot.

Another nice thing about Altair is that, because the visualization is implemented in JavaScript, you can get interactive visualization very trivially by adding an interactive tag. And just to emphasize, this whole grammar of visualization is built right
into the API. We specify the data in the chart, we specify the marks, we specify what encodings we want there to be, and we end up with a very flexible way to create charts. If we want to add a column, for example, we just say that we want the column to be species, and that facets it. If we want a tooltip encoding, where you hover over a point and it tells you the values at that point, you can specify that exactly.

The power of this sort of grammatical approach is that you're not trying to remember whether you're going to make a tick plot or a bar plot. The choice of the mark and the choice of the encoding specify exactly what it is. You're not creating categories of plots; you're creating a grammatical specification of what you want to be on the screen. Here's a tick plot where we show all the petal widths for each species. If we want to change this to a bar plot showing the mean, all we have to do is change the mark from tick to bar and change the encoding of the x from the petal width to the mean of the petal width. So it's not about remembering what the API for a bar plot is versus the API for a tick plot; it's about using the same unified grammar to specify those things.

The other thing that's built into Altair, which is really nice, is this idea of using the right color maps for the right data. If you specify a quantitative value for your color, it'll choose a quantitative color map. If we want, we can be explicit about what sort of data we have: this little :Q, for quantitative, says that it's a quantitative value. If we change, for example, to an ordered value, we get the same color map, but we automatically get a legend with the categories identified, rather than a continuous color bar. And if we change to a categorical type with no order, like the origin, it automatically shifts the color map to
something that's appropriate for categorical data. It's nice because you don't have to think about color maps anymore: as long as your data type is specified correctly, the color map will be correct.

The other thing about Altair is that on top of this grammar of visualization there's a grammar of interaction, which was added about a year ago, before last year's PyCon. What the grammar of interaction does is allow you to specify types of selections that you can add to the plot. Here I've added an interval selection to the plot. It doesn't do anything yet; you can click and drag it around, but what we need to do is attach this interval selection to some of the encodings. So instead of saying the color is the origin, we can say the color is conditioned on the selection: if a point is inside, it'll be colored by the origin, and if it's outside, it'll be gray. All of a sudden, with a few lines of code, we can specify, in a grammar-of-visualization sense, a really complex interaction. Has anyone ever tried to make something like this in D3? Yeah, it's a lot more than twelve lines of code.

And we can start tying things together. If I take this chart and store it in a variable called chart, and then I tie it together with another chart using this "or" bar, which puts charts next to each other, then we have a cross-linked dynamic brush that highlights points in each panel, and it knows how to propagate those selections from panel to panel. We can even do things like create a histogram where the x value is the count and the y value is the origin, so we get the number of cars for each country, then attach that to the data set and add a transform that filters by the selection, and we get a dynamic histogram of what's inside our selection. So this kind of grammatical approach to specifying visualizations can lead to a hugely
powerful way of building up intuitive visualizations of data, and so I'd encourage you to check out Altair. The website is there, and there are a bunch of examples on it. I just did a big release of the package a week ago, the 3.0 version, which adds a number of new features. So anyway, sorry for telling you about a package even though I promised you I wouldn't, but it's the thing that I've been working on a lot lately, and it's been a lot of fun.

So the summary, the takeaway: if you're doing visualization, what you're trying to do is encode your data into visual properties, and not all of those properties are equal. Position, for example, is probably the best and most intuitive way to encode data, so use position as much as possible before you start going into shape and color hue and things like that. Think about the colors, the marks, the encodings, and the scales you use. One nice thing about this grammatical approach is that it categorizes things for you, so it's easy to think about the possibilities that are out there rather than being tied to the API of some imperative visualization library that keeps you from exploring. The other thing is to explore small multiples: do lots of little visualizations of your data set, because that can give you intuition into what's going on.

And I'd encourage you to choose a grammar-based approach to data exploration. If you're into R, try ggplot2; it's an incredible tool. If you're in Python, try plotnine or try Altair. If you like JavaScript, using Vega and Vega-Lite directly is a good option, particularly since there are these Observable notebooks in the JavaScript world that allow you to do that very quickly. So explore those tools and see how effective you can be at the visualizations you're doing. Thanks very much.

Thank you again, Jake. I think we may have time for one question. If anyone has a question, there are microphones in the aisle.

One question for you. A great
presentation. I guess, if there was one feature you really wish you had that isn't in existence in those packages, what would it be?

Yeah, so the one feature that I really wish was there: at this point, in this whole world, there's a trade-off between interaction and the size of data that can be handled. The tools that create static plots are good at handling large data sets; the tools that create interactive plots are not good at handling large data sets, particularly in this declarative, grammar-based world. So I want a tool that has both of those together, that can do millions of points with a grammar-based specification in an interactive manner. If you look at my talk from two years ago, I talked about some tools in the Python world that tick some of those boxes, but I don't think the killer app is out there just yet.

Thank you again, Jake. [Applause]
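
A note on the four data sets described at the start of the talk: the description matches Anscombe's quartet, though the captions never name it, so treat that identification as an assumption. The minimal sketch below uses only the Python standard library, with the quartet's published values typed in by hand, to show the usual summary statistics agreeing to two decimal places even though the four scatter plots look nothing alike. The helper names `pearson` and `summarize` are made up for this sketch.

```python
from statistics import mean, stdev

# Anscombe's quartet (assumed to be the data set shown on the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summarize(xs, ys):
    """Summary statistics rounded to two decimals."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "sd_x": round(stdev(xs), 2),
        "sd_y": round(stdev(ys), 2),
        "corr": round(pearson(xs, ys), 2),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

All four rows print the same rounded summary, which is why the talk argues that a table of statistics is no substitute for looking at the points.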
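
The tick-to-bar example from the Altair walkthrough can also be sketched without any plotting library, by writing the chart as a plain Vega-Lite style JSON specification, which is the kind of document Altair generates. This is a hedged sketch, not the code from the slides: the helper `make_spec`, the column names `petalWidth` and `species`, and the three sample rows are all assumptions made for illustration.

```python
import json

# Three made-up rows in tidy form: one observation per row (illustrative values).
rows = [
    {"species": "setosa", "petalWidth": 0.2},
    {"species": "versicolor", "petalWidth": 1.3},
    {"species": "virginica", "petalWidth": 2.0},
]


def make_spec(mark, x, y):
    """Hypothetical helper: assemble data + mark + encoding into one spec."""
    return {
        "data": {"values": rows},
        "mark": mark,
        "encoding": {"x": x, "y": y},
    }


species = {"field": "species", "type": "nominal"}

# Tick plot: one tick per observation of petal width.
tick_spec = make_spec("tick", {"field": "petalWidth", "type": "quantitative"}, species)

# Bar plot of the mean: same grammar, different mark, aggregated x encoding.
bar_spec = make_spec(
    "bar",
    {"aggregate": "mean", "field": "petalWidth", "type": "quantitative"},
    species,
)

# Top-level keys on which the two specifications differ.
changed = [key for key in tick_spec if tick_spec[key] != bar_spec[key]]
print(changed)
print(json.dumps(bar_spec["encoding"]["x"]))
```

Only the mark and the x encoding differ between the two specifications; the data and the y encoding are untouched, which is the point the talk makes about using one unified grammar instead of separate plot-type APIs.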
Info
Channel: PyCon 2019
Views: 24,711
Rating: 4.9411764 out of 5
Keywords: Jake VanderPlas, pycon, python, coding, tutorial
Id: vTingdk_pVM
Length: 29min 55sec (1795 seconds)
Published: Sun May 05 2019