Jake VanderPlas The Python Visualization Landscape PyCon 2017

Video Statistics and Information

Captions
...and he's going to talk about the Python visualization landscape. Give a hand to him. Thank you very much.

All right, thanks very much. Thanks for coming, and for many of you, for listening to me a second time in a couple of days. I'm going to talk about Python's visualization landscape. Back in the fall I wrote this abstract and said I'm going to give an overview of the landscape of data vis tools in Python, and then a month or so ago I thought I should figure out what that is. So, as I usually do, I tweeted and said: hey, these are the tools that I'm thinking of talking about, are there any others I should consider? And Twitter came back with this whole thing, which totally changed what I was going to do for my talk. But I want to take this time to make sense of this deluge of visualization tools that are out there in the Python world. There are tons out there, as you'll see, and I think each one is specialized for its own unique application or has its own unique strengths. What I'm hoping each of you will take away from here is the ability to look out there and say: given the visualization task that I want to do, I know what package I should use in Python.

So let's get started. It all starts with matplotlib. Matplotlib has been around for over a decade now, almost two decades, and it's kind of the core tool. Lots of things have been built around matplotlib: there's Basemap and Cartopy for geographical visualization; pandas and Seaborn have some tie-ins to matplotlib; we have things like ggpy, which gives a ggplot interface on top of matplotlib; NetworkX gives you network visualizations; Yellowbrick and scikit-plot are some things I learned about that do visualization for machine learning. This is the matplotlib cluster of tools that I think of, and I'll go into some of these a little bit later. On top of that you have JavaScript, and in the
last few years a lot of these Python libraries have started to depend on JavaScript, and to use JavaScript to get some great interactive visualization. Probably the two biggest of those are Plotly and Bokeh, and I'll talk about each of those later. But there are more: there's Toyplot and bqplot, which you may not have heard of, but they're really fun libraries to try out. And you have things that are tied to the Jupyter notebook, like ipyvolume, ipyleaflet, and pythreejs; they let you take advantage of different aspects of JavaScript to do interactive visualization in the notebook, which is pretty cool. There are other things too: Cufflinks is built on top of Plotly. That's the JavaScript cluster of visualization tools.

And of course there's more. You want to link JavaScript and matplotlib, so there's d3.js, and I wrote a package called mpld3 that links matplotlib and D3. It's not super well supported anymore, but it's kind of a fun thing if you want to turn your matplotlib plots into D3. And there's even more built on d3.js: there are these specification languages, Vega and Vega-Lite, and there are some Python libraries, Altair, Vincent, and d3py, that give you a Python interface to all these tools.

Are you starting to get overwhelmed yet? You can link all these together. There are tools like Datashader, which is kind of a Bokeh tool that also works with matplotlib; I'll show an example of that later. There's this thing called vaex, which is very similar to Datashader, that can render to all three of these platforms. You have this tool HoloViews that links Datashader and Bokeh and matplotlib. And then there's this whole OpenGL cluster with Glumpy and VisPy, and there are graph visualizations, and there are all these ones that I don't even know how to categorize, if you look at big data. So we just have this huge landscape of vis tools, right? And how do you make sense of this?
Well, I'm hoping that this talk will help you make sense of this a little bit, just by color-coding these clusters, so that when you go out and Google each of these later and check out the examples, you'll kind of know how they work together. But for the rest of the talk I want to dive into a few of these and show you some quick examples of what they look like, so you'll have an idea of what you can do with them.

So of course we feel like this. How did we get here? It all goes back to matplotlib; that's what I started with. And matplotlib, for all the flak that it gets in the last few years, really is a pretty incredible tool. Its strengths: it was designed to be basically just like MATLAB, and this was key to all the scientists and engineers who were transitioning to Python 10 or 15 years ago, at the end of the 90s and into the 2000s. It has a huge number of rendering backends, and this is underappreciated: if you make a plot in matplotlib you can render it on almost any visualization backend. You can export PNG, PDF, EPS, SVG, all these different outputs, and it's not trivial to make all of those different outputs look the same for the code that you write. It really is powerful: it can reproduce just about any plot, though it takes a little bit of effort to make most of these plots, even the simplest ones. And it's well tested: it has an almost 13- or 14-year history of git commits, and it's really been battle-tested and bomb-proofed over the course of all that time. So matplotlib, I wouldn't discount it. It's a really powerful tool, and you can do all these different things with it.

But it does have its weaknesses, and if you've tried to do statistical visualization with matplotlib you've run into this. Let's say you have some data like this data frame of the iris data set, which you've probably run into if you've done any machine learning tutorials. It's a relatively simple
data set, and let's say you want to scatter petal length versus sepal length and color by species. You can say that in a sentence fragment. How many lines of matplotlib does it take? Any guesses? This is kind of the best way to do that: to scatter one variable by another variable and color it by a third, you have to write all this boilerplate code. And the thing is that while matplotlib is powerful, it is not very expressive in a lot of cases. So this is one of the weaknesses: the API can be pretty verbose sometimes. The stylistic defaults are poor. It was based on MATLAB circa 2001, so if you want plots that look like MATLAB circa 2001... But I should say that in matplotlib 2.0, recently released, the stylistic defaults have been updated, so it's a lot better recently. It doesn't really support web or interactive graphics, which is what a lot of people want these days, and it can be slow for large data sets.

So everyone's goal, the reason in my mind that we have this huge network of competing libraries, is that everyone wants to improve on these weaknesses of matplotlib, hopefully without sacrificing those strengths. And one way you improve on matplotlib without sacrificing the strengths is you use matplotlib. So this is the matplotlib cluster here. What these tools have in common is that they keep matplotlib at the core, so you have all those output backends, all that versatility, all that power, but you put a new API on top. You address that weakness and you say: I can use matplotlib, but I can make it easier to generate those plots. The two that I want to highlight here are pandas and Seaborn; these have been big in the PyData ecosystem recently.

Pandas, as you probably know, is a library that's meant to make data frames that store labeled columns of data, and it actually has some built-in plotting functions. If you take any data frame, like this iris one, and do .plot and
then dot something else, there are all these different ways of plotting the data that are built in. So here, in just one line, we can scatter plot two columns of that data frame. You can even do more complex things; there are more sophisticated statistical visualizations in there. This is one that I discovered recently, that I'd never heard of before, but I think it's pretty cool. It's a way of taking all the columns of the data frame, turning them into a Fourier series, and plotting them as lines, so that each individual line is a row of the data frame, an object, and in some ways those curves encode the values in all the columns. You can see just by looking at this that there are three very distinct types of objects in there, and you get a sense of the relationship. So these Andrews curves are kind of fun; I'm looking forward to using them in my own work.

The other one I wanted to mention is Seaborn. This is a library that was explicitly designed to make statistical visualization, and more complex statistical visualization, easy in matplotlib. It wraps matplotlib, it gives it a nice set of style defaults and color palettes, and you can do things in a few lines. It's kind of a higher-level language, so you need to memorize more things, and you don't have as many little composable chunks, but if you know what function you're looking for, you can do things in a very small number of lines of code. For example, you can call the pairplot function and get this pairwise comparison of all the columns in the entire data frame. So Seaborn is really nice if you want to do statistical data exploration in Python using matplotlib.

Okay, so then there's this JavaScript cluster. The reason everyone loves JavaScript is that it's the lingua franca of the web these days. You can do incredible things in JavaScript because it brings that interactivity to your browser, and everybody has
a browser, right? You don't have to worry anymore about these cross-platform rendering backends; you just render to the browser, and the browser developers have taken care of all the hard parts. So the common idea here is that you build an API in Python that generates some sort of serialization of the plot, which can then be passed over to the browser, and inside the browser you have a corresponding JavaScript library that reads that serialization and renders the plot. That's what every one of these tools does, in some manner or another.

I want to focus real quick here on Plotly and Bokeh, which I think are the most developed of this cluster of tools. They're both really nice; they give you this interactive feel that's really missing from matplotlib. So this is plotting with Bokeh, the same data: I'm just taking the columns and doing a circle plot and then showing it, and you have all this interactivity. You can click, and you can zoom, and you can pan around, and if you go a little deeper you can start doing things like adding controls, and you can add tooltips to the points, and things like that. Bokeh is a really incredible language that lets you do these sorts of visualizations. If you look at the gallery of Bokeh, you go online to bokeh.pydata.org and you can click on each one of these, and since it's browser-based, each one of these examples is interactive in the browser, and you can start clicking and dragging and get a feel for how it works.

Bokeh is out of Continuum.io, the people who brought you Anaconda and Numba and some of these other great tools. I'd really suggest taking a look at it. Its advantages: you have this interactivity, and you have several different layers and different APIs for generating things. The disadvantage is that you don't have the same array of outputs that matplotlib has. At the moment, unless I'm mistaken, you still can't do PDF or EPS outputs, so if you're a scientist
who is writing a paper for a journal that requires PDF or vector-based graphics, you're out of luck. It's also a slightly newer tool: it doesn't have as much of a user base, and it's not as battle-tested as matplotlib, but it's really getting there. It's an awesome program.

Plotly is quite similar. The story with Plotly is that it's actually a startup out of Montreal, and they have this interesting open source / closed source model where a whole lot of the Plotly tool is open source, BSD license, and you can use it for whatever you want, but there are a few features that they use to make their money, and they charge you if you want a little bit more. That tends to be things like automatically hosting plots on a website with some sort of server backend, that sort of stuff. But I know a lot of people who are using the free version of Plotly for some very nice visualizations, even scientific visualization. They can do all sorts of different things. They have some things that Bokeh doesn't, like 3D plotting and animations built in. I might be wrong there; I think you can do animations in Bokeh, but it's not quite as easy as in Plotly. It's a very nice visualization framework, and the cool thing here is that it's not only a Python library, it's also an R library, it's a Julia library. They have these different ways to target the JavaScript backend from different languages. So like I said, the advantages are similar to Bokeh: it has all this web-view interactivity, it has multi-language support, it has 3D plotting. Some features require a paid plan, and depending on your software philosophy that may or may not be a turn-off to you; I know people who go either way on that. But I think it's a great library, and I would suggest checking it out if you're interested in these interactive visualizations.

So the next thing that comes up: matplotlib is not very good for these visualizations of larger
data, and there's a bunch of these libraries that address that deficiency of matplotlib. They do things like relying on OpenGL: there's VisPy and Glumpy right there (Glumpy or "Glum-Pie", I don't know how to pronounce it). There are things like Datashader and vaex; these are interesting ones that use really efficient code so that, rather than delivering data points to the GPU or to the computer to render, they pre-aggregate all the data and deliver basically bitmaps to the computer to render. When you have a billion points, there's no point in sending a billion points to your visualization screen, because there aren't a billion pixels to work with. So you can pre-aggregate those and do kind of heat maps, and that's the strategy these use. And then these other tools down here are really nice; I wish I had time to go into them. A few come out of the astronomy community, for visualizing large three-dimensional data sets, so if you're interested in that, check out the gray zone down in the left corner.

But I want to take a look really quick at Datashader, because I think this is a nice project. It's still in pretty active development, but there are some impressive demos. And I have to apologize: my plan was to do a live demo of Datashader because I think it's so awesome, but then all of a sudden I had a kid and a keynote and it didn't really happen, so I have screenshots. What Datashader allows you to do is these fast server-side things: it's a fast server-side engine that does dynamic data aggregations. So you can take things like the census data, where you have 200 or 300 million points, and you can visualize those in real time on a map. And the live demo you could do here: as you zoom in and out, in real time it's actually calculating the bounds that you're looking at, figuring out what subset of the data matches that, and then re-aggregating it and sending it to the screen.
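The pre-aggregation strategy described above (work out the visible bounds, keep the points inside them, and count points per pixel) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not how Datashader itself is implemented: the function name `aggregate_to_grid`, the tiny 4x3 "screen", and the random points are all made up for the example.

```python
import random


def aggregate_to_grid(points, x_range, y_range, width, height):
    """Hypothetical helper: count the points falling into each pixel of a viewport.

    Points outside the viewport are skipped, so zooming in means
    re-aggregating only the visible subset of the data.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # not visible at the current zoom level
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Made-up data: 100,000 random points in the unit square.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(100_000)]

# Full view, then a "zoomed in" view of one region; each is just a 4x3 grid of counts.
full_view = aggregate_to_grid(points, (0.0, 1.0), (0.0, 1.0), width=4, height=3)
zoomed = aggregate_to_grid(points, (0.25, 0.5), (0.25, 0.5), width=4, height=3)
```

However many points go in, what reaches the screen is only `width * height` numbers, which is the point made in the talk about not sending a billion points to a display that does not have a billion pixels.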
So if you want to work with hundreds of millions or billions of points, Datashader is awesome. You can smoothly zoom in on Lake Michigan and Chicago and get a more detailed view of the points inside there, and then you zoom in even farther and you see neighborhood-level data. So I would really suggest playing with this if you want to visualize really large data sets. You can go and download their demo notebooks; the installation instructions are pretty easy, and it's a fun package.

Another class of algorithms here that I think is really interesting is the ones that start to tie these all together, in particular HoloViews up there and Altair down in the bottom right. These are new kinds of declarative language specifications that target different backends within the system. They might target Bokeh or matplotlib or D3, and they let you create plots in a very expressive, very powerful way.

The first one is HoloViews, and this is a really interesting project; it's worth watching. I first heard about it a couple of years ago when they gave a demo at the SciPy conference. The initial philosophy was that each data set has an intrinsic best way to visualize it. If you have a data set consisting of columns of certain numbers, there is an intrinsic way that it should be visualized, and rather than us as people thinking about that and twiddling the x-axis and the y-axis and the labels and the ticks and the colors, the computer should just know how to visualize this data. So what HoloViews started as is a way to wrap data sets in an object that, when you eval it, or when you do a representation of that object in the notebook, gives you the visualization. Rather than saying this is a data frame at such-and-such address, it actually gives you a picture of the data. And they've built in all
sorts of really interesting interactivity on top of that, and you can do things like map data. From what I've heard (I was talking to some of the Bokeh developers), it sounds like HoloViews is going to be kind of wrapped into Bokeh and become their declarative layer for visualization within Bokeh. But HoloViews, as you can see from the links I put here, can also target matplotlib, and it can target Datashader; it works seamlessly with all of those tools. So no matter what backend you need, whether you need an interactive backend, a big-data backend, or a backend that can output every plot file or figure file imaginable, you can use the same system. I think that's a really powerful way of going about it.

The last project that I want to talk about is kind of my pet project that I've been working on for a little while. It's a library called Altair, and the idea here is: what if, instead of passing around pixels, we actually pass around the data itself, with the metadata that describes what kind of plot we want? This is something that's been really exciting. The underlying library underneath Altair is called Vega, and Vega-Lite, and this is starting to be adopted by things like Wikipedia, which are saying: we don't want to just save a bitmap, we want to save the data and the specification that tells the webpage how to visualize that data. I think this is a really powerful idea, because if it can be adopted widely, we'll be able to use that whole ecosystem of tools and, say, have Bokeh output a Vega-Lite specification that could then be read in by matplotlib and passed on to something else. And I've been talking with academic journals in the astronomy community about the possibility of having scientists submit their figures in the form of these specifications, so that the journal
could generate a PDF to be printed, but could also generate an equivalent interactive plot to be on their webpage. So I really think this is the future, and this is kind of my soapbox. What we're pushing here is this idea of declarative visualization, and this is a project that's in collaboration with my colleagues at the eScience Institute, at the Jupyter project, and also at the Interactive Data Lab at UW, which is the people behind tools like D3, which you might know about.

So what's the difference between declarative and imperative? For imperative visualization, think matplotlib: you can say in one sentence what you want the plot to be, and then you write 50 lines of code to make it happen. Declarative visualization is trying to make the code as close to that one-sentence description as possible. You say: I want x to be this variable, y to be that variable, and color to be this variable, and show me the result. With imperative, you're specifying how something should be done, all the little manual plotting steps, and the specification and the execution are intertwined. With declarative visualization, you specify what should be done, and the details should be determined automatically by the system. This is quite similar, if you're used to database languages, to the difference between writing a Python script to sift through data and writing a SQL query. The SQL query is a declarative specification of what you want the system to do, and the system can find the most efficient path to doing that. So the key here is that this lets you think about data and relationships rather than these incidental details. Brian Granger, who's one of the IPython developers, actually used this library, Altair, to teach an intro data science class this last semester, and he's really excited about moving forward with that. So I think
what we're trying to do is free students to start thinking about relationships and data rather than thinking about syntax and libraries.

So where does this come from? You've probably seen the D3 language. I'm going to click over here and go to the live version. If you go to the New York Times and see any of these really interesting interactive demos, where you hover and you see different things, basically anything at the New York Times that looks like this is written in D3, and that's because the New York Times graphics editor is Mike Bostock, who wrote D3. So he uses it a lot and he makes all his people use it a lot. And I just killed my full screen; how do I get it back? Yeah.

So D3 is super powerful and you can do these amazing interactive graphics. But if you've ever tried to use D3, you figure out that it's so ridiculously low level that unless you're Mike Bostock you can't do anything with it. Here's the example: this is literally the example from the D3 example page of how to do a bar chart. Like, okay, I'm going to do a histogram to see... This makes you wish for matplotlib, which is crazy.

So after Bostock went to the New York Times, his advisor Jeff Heer, who was down at Stanford and helped develop D3, moved to the University of Washington, and he thought about this and said: we need a better way for actual scientists and statisticians to visualize their data. So they wrote this specification language called Vega, and Vega improves on this a little bit. It's no longer this imperative list of commands for creating axes and things like that; it's a declarative specification that says this is my data, this is what I want linked to the x-axis and the y-axis, and things like that. But, powerful as it is, you're still not going to sit down and write this JSON structure to see a bar chart of some data you're exploring. So once they got Vega working, this powerful, powerful thing
that's underlying a lot, they said: we need it to be simpler. So they made Vega-Lite, and Vega-Lite is almost getting to the point where you could just sit down and type this in a text editor and make it happen. You're basically saying: these are my data, I want a bar mark, and I want the x to be a and the y to be b, and it spits out that output.

So what the Altair library does, fundamentally, is it's a Python API that creates these outputs. It creates these JSON specifications, because I like writing Python; I don't like writing JSON by itself. So this is what Altair looks like: you have the data in a data frame and you say, I want a chart with that data, I want a bar mark, and I want x to be a and y to be b. All of a sudden you're literally just telling the computer what you want to be shown, and the computer figures out how to display it. The output of this code is basically exactly this little JSON object, and now you can start passing that around to other libraries and other places. So you've separated the specification of the plot from the execution of the plot, and I'm hopeful for this as a model moving forward for interoperability between all these Python libraries, and also between libraries in other languages, like R and Julia.

Here's another, more complicated example. Going back to our original plot that we did in matplotlib, we want to say that x is the petal length, y is the sepal width, and color is the species, and you basically just write that out. You can start adding things like the opacity of the circles, and if you convert this to a dictionary, which is basically Python's JSON representation, you get a dictionary that describes the plot. That's everything you need to know about the plot in order to recreate it. So this is really fun, and you can do some incredibly powerful things with Altair. You know,
these are some of the more advanced examples that we have for doing different types of data visualization. I especially like this one. Does anyone recognize this yellow and blue plot in the left middle? That's the plot of measles incidence over the course of time, and there's this cutoff right there, and that's when the measles vaccine was introduced. In the historical data, each row is a state and each box is a year, showing the number of people who had measles, and you can see the effect of this vaccine. It just works really well. You can check that out: go to the Altair site, I think it's altair-viz.github.io, and see that.

And I should say this is under pretty active development. One thing that has just happened is that Altair 2.0, or rather Vega-Lite 2.0, has come out, and this is incredibly exciting. The grammar of visualization is not a new thing; other people have done that before. But what they just added is a grammar of interaction, so you can build up these interactions from basic building blocks. We don't have that in Altair yet, but my project for June, as soon as I'm finished with parental leave, is to finish this and get Altair out there so that you can start doing interactive declarative plots. Anyway, this is how you can try it: conda install or pip install, and you can get a tutorial if you go to the website. And that's the visualization landscape. I have my contact info, so I'll leave that up there and take a couple of questions. Thank you.

We have two minutes for questions. If someone would like to go up to the mics, we can do that.

I can also talk to folks afterwards. Oh, we've got a Bokeh developer coming up.

Oh, my last commit to Bokeh, I committed an example directly to master and I broke master with my example, so [inaudible]. I have a statement in the form of a question, but first I want to thank you very much for this comprehensive,
not easily put together overview, and thank you so much for doing the work here. But my question was: did you know that Bokeh actually also has an R interface? So it is also...

Actually, yes, I did know that; I should have mentioned that.

But thank you, Jake, for a great presentation.

Mm-hmm. That was the form of a question.

Thank you very much for putting this together. Regarding Altair, I'm very interested, but I invested a few years of my life a little while ago in learning ggplot2. I'm just wondering, when you were looking at your grammar, have you looked at things like ggplot2, which were very successful using the grammar of graphics of others? I'm hoping your API is influenced by them, so that I won't have done all that work for nothing in the past.

Yeah. So it's quite similar, and our API in Altair is really influenced not as much by any of that work as by the Vega-Lite specification itself. 95% of our API is automatically generated by just reading the Vega-Lite schema and creating a Python object hierarchy, with a few little bells and whistles on top of that. One of the things I'm most proud of (I didn't put this slide in) is that we have two-way, or actually three-way, conversions. I can take Altair code and generate Vega-Lite specs, and then go from Vega-Lite specs back to the Altair code, so I can do this round-trip thing. And literally our unit test suite that tests every conceivable thing is like twelve lines of code that just does this round trip on all the Vega-Lite examples. I was totally geeking out about that; I was smiling for like three days.

I happen to have used Vega-Lite before. Since Altair provides the Python API, does that mean that I still need the web browser to see the graphics?

Yeah, you need the web browser to see it, because it's a JavaScript library that finally renders the graphics. That's not a fundamental limitation;
there's work on creating a Vega-Lite renderer in matplotlib, for example. But right now we have it tied to the Jupyter project and JupyterLab, and it's kind of seamless: it just creates a Vega-Lite object and Jupyter knows how to render it in the browser. Okay, maybe one more question.

Yeah, so my question is, how do you solve the I/O problem? Because it looks like the billion-point data set that you were showing was in memory. If it is in memory, how do you get it into memory from a database?

So this is a question about Datashader. I'm going to defer that to people like Peter, who asked the first question, because they can answer a little bit more about Datashader and some of those things going on. I'll take one last question; I can take it offline if we need to wrap this up. Is it a quick one?

Moderate.

Yeah.

Right, so as far as interactive graphs, you showed a lot of options here, but mostly it's centered around zooming, and zooming within geographical constraints. What about slicing data, say for instance the census by income or race, which would require some user boxes? Are any of these tools set up for that, or do you have to pass it a different JSON?

They are, yeah. If you look at the Bokeh project in particular, they have this way of creating dashboards that are really powerful. They can be either client-side or server-side dashboards, and if you look at the Bokeh examples, there are examples of using sliders in conjunction with visualizations and things like that. So I'd check out Bokeh for that.

Thanks.

And we're hoping to get there with Altair too. So thanks very much, everyone. [Applause]
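The declarative idea from the talk (a Python API whose only output is a small JSON specification saying "these are my data, I want a bar mark, x is a, y is b") can be sketched with the standard library alone. This is an illustration under assumptions, not Altair's actual API: `bar_chart_spec` is a hypothetical helper, the three records are made up, and the dictionary layout follows the general shape of a Vega-Lite bar chart specification.

```python
import json


def bar_chart_spec(records, x, y):
    """Hypothetical helper: describe a bar chart as a Vega-Lite-style dict.

    Nothing is drawn here. The function only states what should be shown;
    a separate renderer would decide how to show it.
    """
    return {
        "data": {"values": records},
        "mark": "bar",
        "encoding": {
            "x": {"field": x, "type": "ordinal"},
            "y": {"field": y, "type": "quantitative"},
        },
    }


# Made-up data with columns "a" and "b", mirroring the example in the talk.
records = [{"a": "A", "b": 28}, {"a": "B", "b": 55}, {"a": "C", "b": 43}]

spec = bar_chart_spec(records, x="a", y="b")

# The specification is plain data, so it serializes to JSON and back unchanged.
spec_json = json.dumps(spec, indent=2)
restored = json.loads(spec_json)
```

Because the specification is ordinary data, it can be handed to any renderer that understands it, and a round trip through JSON loses nothing, which is the property behind the round-trip tests mentioned in the Q&A.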
Info
Channel: PyCon 2017
Views: 47,909
Rating: 4.9610705 out of 5
Id: FytuB8nFHPQ
Length: 33min 30sec (2010 seconds)
Published: Sat May 20 2017