PLOTCON 2016: Peter Wang, Interactive Viz of a Billion Points with Bokeh Datashader

Captions
Hi everyone, thank you for coming today, and thank you to the sponsors, Domino, Plotly, RStudio, and the others putting on this conference. Today I'm going to be talking about interactive visual statistics on massive data sets. My CMO always slaps my wrist and tells me I should say "big data," but for you all I will say "massive data sets." How many people here have tried to look at really large data? Great, but there's a lot who have not. How many people feel like they have successfully looked at a few million points and really gotten something out of the vis that they didn't already know? Not as many hands. All right, so today I'm going to be talking about that particular topic, and a tool that we've developed at Continuum to do that.

I'm Peter Wang. I am the co-founder of Continuum, and currently I serve as CTO. At Continuum we created things like Anaconda; I personally started the PyData conferences; and we've created a lot of tools like Dask and Bokeh and these other things that people have been talking about today. But Datashader is one that doesn't get as much coverage, and I think it is a very exciting technical capability.

So first, let me talk about what the problem is. The problem is not that visualizing big data is hard; it's that visualizing big data accurately and well amplifies all the problems you already had with visualizing small data. The standard plotting tools that we use, even on small data sets, already suffer from a number of visualization problems (visual hallucinations, some people call them), and we simply have some deftness, some tricks of the trade, that we use to get around those kinds of things. But all of those issues are amplified, and become much harder or impossible to resolve, with really large data sets. So I'm going to go through what I mean specifically by those things so we're all on the same page, and then I'll show Datashader and how we use it to solve some of these problems.

The most obvious problem is one of overdrawing. How many of you use Matplotlib? How many of you know the order in which Matplotlib will render a set of data points given to you? Want to bet a hundred bucks on it? All right, not him, but anybody else? And you can replace Matplotlib with Excel, with gnuplot, with ggplot, with whatever. I ran into this as an undergrad: wait a second, if I just change the order of the columns, I get different-looking plots. Everyone runs into this problem, even with small data. This is the problem: depending on how I draw it, the most recently drawn color seems more predominant. Maybe, if I'm really clever, I can set an alpha value, if my tool allows me to do that. The problem with setting an alpha value is that it's still not good enough: you still have collisions, you still have saturation; it just takes a little bit longer to get there, so you've just kicked the can down the road. Even for a single category, if you have ten or twenty thousand points, you can set the alpha to the lowest possible level and you will still end up with this problem. And if you have more points, you run into the same problems. You might say, well, then I'll be very clever: I'll make the points smaller and I'll reduce the alpha. But this is a very difficult parameter to set, because now your points are tiny and faint, which is great for seeing where there is a whole bunch of them together, but not for looking at outliers. Anybody here interested in outliers?
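To make the overdraw problem concrete, here is a minimal sketch (not from the talk, just synthetic data) showing that draw order, not the data, decides what you see, and that lowering alpha only postpones saturation:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two overlapping synthetic clusters.
rng = np.random.default_rng(0)
blue = rng.normal(0.0, 1.0, size=(100_000, 2))
red = rng.normal(0.7, 1.0, size=(100_000, 2))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Whichever category is drawn last dominates the overlap region.
axes[0].scatter(blue[:, 0], blue[:, 1], s=3, c='steelblue')
axes[0].scatter(red[:, 0], red[:, 1], s=3, c='tomato')
axes[1].scatter(red[:, 0], red[:, 1], s=3, c='tomato')
axes[1].scatter(blue[:, 0], blue[:, 1], s=3, c='steelblue')

# Alpha just kicks the can down the road: with enough points the dense
# core still saturates, and the faint outliers become nearly invisible.
axes[2].scatter(blue[:, 0], blue[:, 1], s=1, c='steelblue', alpha=0.01)

plt.show()
```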
So, fundamentally, the techniques we use to get resolution in the core hurt us at the edges, and the things we use to enhance the edges hurt us at the core. It's a classical dynamic range problem, and most of these techniques suffer from not offering us good approaches to dynamic range. When you get the points down to be small enough, you're actually at multiple points per pixel, and at that point you're doing something like binning. And we have developed binning as vis people: we use hex bins, square bins, all sorts of things; we use circles and make these diagrams that look really, really cool. But they actually dramatically reduce the information density that we're getting out of our vis.

More importantly, I would say the challenge is that when you plot big data, you don't know when the vis is lying to you. You don't know when it has omitted the one outlier that was the really important one. You don't know when it's just sitting there reiterating, reinforcing the statistical model of the data set that you walked in with, as opposed to helping you validate or challenge it. That's one of the biggest problems with vis. If you use Python or R or any of these kinds of programmatic tools, when you do a fundamentally wrong thing, oftentimes the tool will complain and throw an exception. But there's not a single vis tool that I know of (there may be a few that offer advisement) that will error out on you when it's lying to you. It will just do it. And we laugh, but we make life-and-death decisions based on these things; we make policy decisions for countries on these kinds of things. As we start looking at more data, as we start trying to let the data suggest better questions to us, we can't do that anymore. The data and the vis tools are more than happy to reinforce what we think we already knew.

So the Datashader approach started as me trying to figure out how to build a better server for web graphics, because I took it as a given that I was not going to dump several gigabytes or terabytes of data into the browser. No matter how good JavaScript gets, that's not that useful. If I want to do rendering on the server side, but I want to present meaningful, interactive data explorations on the client side, in the browser, what is the right language for that? What is the right protocol? It's not JSON. If you think JSON is the answer, you don't understand the question, because the question is: how do I specify what I want visualized without imparting so much statistical prior on it that I lose the texture of the data itself?

What emerged from that line of thinking is a technique called data shading and a tool called Datashader. What it is, is a flexible and configurable pipeline for the automatic plotting of large data sets. It has pluggable stages, and it really strives to prevent these common problems: overplotting, saturation, things like that. By providing a connection to Bokeh, we're able to provide a fully interactive experience in web browsers, even for very large data sets, and that mitigates binning issues because we can go into the data, zoom in, zoom out, things like that.
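At its smallest, that pipeline looks something like this; a sketch assuming a pandas DataFrame with `x` and `y` columns, not the talk's exact code:

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Synthetic stand-in for a massive data set.
n = 10_000_000
df = pd.DataFrame({'x': np.random.standard_normal(n),
                   'y': np.random.standard_normal(n)})

canvas = ds.Canvas(plot_width=800, plot_height=600)  # the raster grid
agg = canvas.points(df, 'x', 'y', agg=ds.count())    # count per pixel bin
img = tf.shade(agg, cmap=['white', 'darkblue'])      # transfer function
```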
The other key, and I think the most important, aspect of this is that most of the tools you use for visualization let you configure things like color and line style and fill pattern and alpha. What Datashader wants to do is make this pipeline first-class, and let you write your statistics and put those at the very root of the pipeline.

To make this more concrete, think about how you start off with most vis: you start off with a table of data, and then you do what we call a projection process. You pick something to sit in the x-direction, something to sit in the y-direction; you pick your retinal-variable mappings to your data set; and at this step, oftentimes, you filter down the data. Once you've projected into what I'll call the scene, there's some kind of sampling or rasterization process, and that produces what we call aggregates. Imagine we put in a raster grid, which could be the pixel grid of the 800x600 image you want, and in each one of those pixels there could be thousands of points. We create these aggregations, these bins, and inside each of those are all of your points, and each bin can yield some number of statistics: it could be raw counts, it could be the mean, the median, whatever it is. Once we yield those statistics for every bin, we apply what we call the transfer function: the actual creation of visual primitives and their visual aesthetic properties based on those values. This framing is a little bit off in the infovis academic space, but for those of you who have done graphics for a while, you can probably see how most of these steps are implicit in most of the tools you use, and hopefully you can also see how, if we deconstruct the vis pipeline this way for large data sets, we may be able to solve some of these problems of overplotting and oversaturation.

So let me go ahead and do the demos, and I'll show you how this is done. This is a data set of, how many points here, 10 million points. These are flight paths, so you have longitude, latitude, the origin and the country of origin (which is categorical), whether or not the plane is on the ground, whether it's ascending or descending, and the velocity. We load these in, and we can then dump a plot of this thing using Datashader. I don't want to get too deep into the code, because that would really drag us down, but we're creating a canvas, we're creating line elements for these flight paths, and then we're just going to call transfer_function.shade, that's part of the Datashader pipeline, and we'll use the Inferno colormap. That gives us a static image. Datashader is a library that can create static images, so it's not tied to Bokeh, although it works really nicely with it.

So here we have a static image: it took 1.2 seconds to render 10 million points. And the thing about this is, oh, this looks really cool and pretty, and it looks kind of weird, like somebody scribbled all over Europe. But the key thing here is that Datashader does not lie to you. It has mapped the brightest value to the greatest density of points, and the lightest value to the lowest density. If we actually want to plot this interactively, we load in Bokeh's interactive image object and we add a tile source, and that tile source gives us this nice overlay on top of a map of Europe, so you can actually see what this looks like.
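A sketch of what that flight-path rendering looks like in code; `flights` and the column names are my guesses at the data set's schema, not necessarily the talk's exact ones:

```python
import datashader as ds
import datashader.transfer_functions as tf
from matplotlib import cm

canvas = ds.Canvas(plot_width=850, plot_height=500)

# Connect the successive points of the paths into line segments and
# count how many segments cross every pixel. (Presumably the individual
# flights are delimited, e.g. with NaN rows, so separate paths aren't
# joined together.)
agg = canvas.line(flights, 'longitude', 'latitude', agg=ds.count())
img = tf.shade(agg, cmap=cm.inferno)  # brightest pixel = densest pixel

# Categorical variant: a separate count per country of origin in each
# pixel ('origin' must be a pandas categorical column for count_cat).
by_origin = canvas.line(flights, 'longitude', 'latitude',
                        agg=ds.count_cat('origin'))
img_cat = tf.shade(by_origin)  # blends a per-category color key
```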
And you can see, this is, I guess, Paris? Maybe that's London. No, that's London, right. So then what we actually do here is turn this around to render all the categoricals: we take the count by category and map it with HSV, and this shows you all of these flights colored by origin. Notice that it takes a second here to round-trip the actual data into the browser; with the improvements coming in the next version of Bokeh, that will be much, much faster, because a lot of the time is actually in the data encoding and decoding. But as we zoom here over London, notice that the data refines as we zoom in. That's the Datashader engine; you can actually see my cores running there. It's refining, and it's showing you more as you zoom in. You can see the holding patterns of planes on approach to the London area. And notice that as I zoom out, it doesn't overplot, it doesn't turn into a giant mess. Well, it's kind of a giant mess, but that's just the nature of the traffic. At every level of resolution, the Datashader system is taking my viewport as an input, doing a full scan of the data table, and doing a perceptually accurate mapping of this data set.

I can go down here and say, actually, I don't care about the country of origin; I just want to know whether the plane was going up or down. That gives us red versus blue, and when we zoom in here on London (I think this is actually quite pretty), you can see that all of those circles are holding patterns. Indeed, right there: the red is the planes coming down. The blue, they don't really hold on the way up; they just kind of go up. So it's a really nice little thing. And what's cool about the system is: this is ten million points loaded into a notebook with a few lines of Python code, and I'm interactively looking at it down to the individual trace level. And this is a live demo, knock on plexiglass, and it works great. So I think that's it for that one.

Another data set that people here are probably familiar with, and maybe tired of, is the New York City taxicab data set. We have a subset of it, 12 million points: just passenger count, pickup time, drop-off time, and the pickup and drop-off locations. And it's very important that you remember the columns here: every single row has both a pickup and a drop-off location. So when I go and render this, if I do a subsampling of 10,000 points, I end up with this kind of map, overplotted on top of the map of New York. It's sparse; it's not that dense at just 10,000 points. At the full scale of the data set, however, you can see it's completely overplotted. At this point I might go and start fiddling with alpha and pixel size, things like that. But if I use Datashader, I can very easily set up, in three lines, a pipeline that maps from white to dark blue the least to most dense collection of points. Here I'm aggregating the passenger counts with a linear color transfer function, and it takes half a second to run and create a static image. And there's the image. What you notice with this image is: wait, hold on a second, what's going on here? Around, I guess, Penn Station, and at LaGuardia, there's some density, but everywhere else is relatively white. The reason is that if you look at the histogram of values and counts, there's a massive skew.
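Roughly the three-line pipeline in question; a sketch assuming `taxi` is the 12-million-row DataFrame with hypothetical web-Mercator drop-off coordinate columns:

```python
import datashader as ds
import datashader.transfer_functions as tf

canvas = ds.Canvas(plot_width=800, plot_height=600)
agg = canvas.points(taxi, 'dropoff_x', 'dropoff_y',
                    agg=ds.count('passenger_count'))  # per-pixel counts
img = tf.shade(agg, cmap=['white', 'darkblue'], how='linear')
```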
The distribution is long-tailed: a few places are very popular, and most places are actually not very popular at all. That's why so many pixels are white. So what we can do is use a logarithmic transfer function instead of a linear one, and now you get something that looks a little more familiar. But why do we think logarithms are the right way to map this? Why is that the right transfer function? Some of the smarter people here in the room might say, oh, because of gamma and the human visual response system and all that stuff, and you're mostly right there. But the really great thing to do is to use an approach called histogram equalization, where we look at the total amount of data at any given level and pre-allocate an appropriate amount of visual weight to it. With that, we get a slightly different, and very lovely, map of this beautiful city.

Then, if we want to make it look really pretty and make it interactive, it's a few more lines of code, but now you have about twelve million points, interactively, in the browser. Let me hide the toolbar; you can see this is actually quite lovely. As we zoom in here, notice how the Datashader system, once we've zoomed in, remapped the brightest points; it refined the density of the data. And what you can see here is that Midtown (I think you can see it on the projector) is noisier. You see how it's grainier, noisier, than the rest of New York? Does anyone know why that is? There you go: GPS. Tall buildings and structures will obscure GPS signals, and that's why we get fuzziness. Now, what if you didn't know that about this data set? Can you think of what statistical function you would write to pull out that intuition? It might be very difficult. The reason I bring that up is that when you look at this, you can't not see it. When you vis all the data, it jumps out at you: why is Midtown all smudged, did somebody smudge my monitor? No, that's actually the data. There are drop-offs in New Jersey, which I can believe, but there are drop-offs in the middle of the Hudson, which I don't believe. Those are the things that show up.

And what's really nice, of course, with Datashader (and this is a little code example showing how) is that we can overlay this on top of a map of New York. So let's zoom in here and see where this extremely dense spot is. Obviously Grand Central has a lot of traffic right here. One other thing to point out before I go any further: if I zoom in on some of these really bright spots, I'll do it really quickly, you can see how many points are within every single original pixel at the zoomed-out level. Datashader has access to the full data set all the time. And as I zoom in even more, what you'll actually see is this kind of quantization grid of points. That's because the data set itself is a little bit truncated, and you can see that quantization. I don't know if you can see it there, but it's a relatively regular grid. These are the things that pop out at you when you can view all of the data.
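Switching transfer functions is just the `how` argument to `tf.shade`; `eq_hist` ranks the bin counts and spreads them evenly across the colormap. For the interactive part, Datashader at the time shipped a Bokeh helper that re-aggregated on every pan/zoom; a sketch reusing `taxi` and `agg` from the previous snippet, with hypothetical viewport bounds:

```python
import datashader as ds
import datashader.transfer_functions as tf
from datashader.bokeh_ext import InteractiveImage
import bokeh.plotting as bp

# Same aggregate as before; only the transfer function changes.
img_log = tf.shade(agg, cmap=['white', 'darkblue'], how='log')
img_eq = tf.shade(agg, cmap=['white', 'darkblue'], how='eq_hist')

# Interactive version: Bokeh calls back into datashader with the new
# viewport on every pan/zoom, and the full table is re-scanned.
fig = bp.figure(x_range=(-8250000, -8200000),   # rough NYC web-Mercator
                y_range=(4940000, 4990000))

def image_callback(x_range, y_range, w, h):
    cvs = ds.Canvas(plot_width=w, plot_height=h,
                    x_range=x_range, y_range=y_range)
    return tf.shade(cvs.points(taxi, 'dropoff_x', 'dropoff_y', ds.count()),
                    cmap=['white', 'darkblue'], how='eq_hist')

InteractiveImage(fig, image_callback)
```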
So, one more cool thing I can do. I talked earlier about the fact that Datashader is not just a way to throw all the data at the screen and hope it doesn't lie to you; it's actually a live pipeline that lets you write statistics on the data. So that's what I'm going to do here: I'm going to shade the places where there are more drop-offs than pickups in blue, I'm going to shade where there are more pickups than drop-offs in red, and I'm going to put them together (a code sketch of this kind of pipeline follows below). Give it a second to build the pipeline. What it's actually doing here is using our Numba compiler: it's compiling that Python expression into low-level machine code, then running that machine code on the full data for every single pixel, and compositing it for us. And now we know something about this data that New Yorkers probably know quite well: cabs are going to pick you up along the avenues, and you're more likely to be dropped off on the cross streets.

What's really interesting, though, is that there's some structure here. If you look over here at Javits, you can actually see the pattern of traffic: you can see the cabs swing in to drop off, and then they pick up. At least, that's what I'm guessing, because I don't see cabs dropping off and then backing up to pick people up. So you can actually infer traffic flow. Additionally, if you don't know much about Queens, when you look at this picture (let me zoom out on this one) you might not be able to tell which streets in Queens are the main streets and which are less important. But if you bootstrap your knowledge through Manhattan and then look at Queens, you can tell: here we go, we know where the popular streets are, because those are the places where the cabs pick up. Again, if this were not New York but some foreign city, you could bootstrap this kind of knowledge the same way.

Another really interesting thing: if you look here at LaGuardia, you see this pattern where the cabs are dropping off right in the middle of a bunch of pickups, which seems odd; elsewhere they're side by side, but here the drop-offs sit in the middle of the pickups. Anyone want to guess what's going on here? Right: there's actually a two-tier terminal structure there, so the pickups are happening below and the drop-offs are happening above. And this is all knowledge we're bootstrapping just by rendering all the data, a couple of lines of code, leveraging a whole bunch of libraries.
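Here's the kind of pickups-versus-drop-offs pipeline that implies; a sketch, since the talk doesn't show the exact code, using xarray arithmetic on two aggregates over a shared canvas, with hypothetical column names and rough NYC bounds:

```python
import datashader as ds
import datashader.transfer_functions as tf

# A fixed canvas extent so the two aggregates share pixel coordinates.
cvs = ds.Canvas(plot_width=800, plot_height=600,
                x_range=(-8250000, -8200000), y_range=(4940000, 4990000))

picks = cvs.points(taxi, 'pickup_x', 'pickup_y', ds.count())
drops = cvs.points(taxi, 'dropoff_x', 'dropoff_y', ds.count())

# Keep each aggregate only where it wins, shade separately, then stack.
more_drops = tf.shade(drops.where(drops > picks), cmap=['lightblue', 'blue'])
more_picks = tf.shade(picks.where(picks > drops), cmap=['lightpink', 'red'])
img = tf.stack(more_drops, more_picks)
```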
So that's New York. We can also color by time of day, the hour of the day at which people get dropped off; that's a high-level view of something else we can do with Datashader.

We've recently added the ability to load raster images, render them, and do data shading on rasters. This is Landsat satellite imagery, and any of you who know something about satellite imagery know there are different bands: they image the Earth at different wavelengths. So we load all of this in using rasterio, and we're going to just render the blue band. Here we have an area around Mobile, Alabama. If we zoom in (oh, it did a lot there, sorry) you can see the actual raster data; we're now down at the pixel level of the raster. As we zoom out, we get the full raster data set back. You can also come down here and implement your own data normalization routine, as well as a true-color combination. So what we're going to do now is assemble the red, green, and blue bands using our own code, and we get the true-color image of this area, which is very lovely. Because it's using Numba, because it's all Python code, you can customize this however you want.

You can do some really interesting things with raster data. Here, for instance, we're going to look at raster data that's not intensity but height, elevation. If you do a basic elevation plot, this is the elevation around Austin, going from dark red through white to dark blue: red, the darker area, is the lower area; blue is the higher area. What we can do then is write our own little function for computing slope, a very simple function, and we get this, again all computed dynamically, on the fly, from the raw raster elevation data. And you can customize it however you want: if you want to look at aspect rather than just slope, you get that. All of this is just a few lines of Python code for each of these things; the Datashader pipeline is dynamically computing the aspect from the functions I wrote, off of this raster elevation data. That's actually Lake Travis there. And if you combine slope and aspect together (how many people here do GIS analysis?) you get a very powerful way of building a much better intuitive feel for the texture of the terrain: you can do this thing called hillshading, which looks at all of these together, and you get this really lovely topographic map. And again, I didn't have to rely on some big vendor or some GIS expert to go and craft something; it's all just math code, right there in the Python, for me.
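The slope, aspect, and hillshade functions might look something like this; a sketch of the standard formulas over a NumPy elevation grid, not the talk's exact code:

```python
import numpy as np

def slope(elevation, cellsize=1.0):
    """Slope in degrees from an elevation grid."""
    dy, dx = np.gradient(elevation, cellsize)
    return np.degrees(np.arctan(np.hypot(dx, dy)))

def aspect(elevation):
    """Downslope direction in degrees, clockwise from north."""
    dy, dx = np.gradient(elevation)
    return np.degrees(np.arctan2(-dx, dy)) % 360.0

def hillshade(elevation, azimuth=315.0, altitude=45.0):
    """Illumination in [0, 1] from a light source at the given angles."""
    dy, dx = np.gradient(elevation)
    slope_r = np.pi / 2.0 - np.arctan(np.hypot(dx, dy))
    aspect_r = np.arctan2(-dx, dy)
    az = np.radians(360.0 - azimuth)
    alt = np.radians(altitude)
    shaded = (np.sin(alt) * np.sin(slope_r)
              + np.cos(alt) * np.cos(slope_r) * np.cos(az - aspect_r))
    return (shaded + 1.0) / 2.0
```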
And you can go one better. One of the ways you can look at vegetation in satellite imagery is the ratio between the near-infrared and red channels. If you implement this (it's called NDVI, the normalized difference vegetation index, computed as (NIR - Red) / (NIR + Red)), you can look at, this is an area around New Mexico: in green are the areas that are vegetation, and purple is non-vegetation. I did that by writing, I actually have the code here, a few lines of Python, with a Numba JIT decorator on top of it. This is the kind of thing we can do with Datashader now.

The last thing I'll show is census data and what we can do with that. This census data set is about 300 million points. All of this, by the way, is running locally on my laptop; I only have it tethered to my phone to load the map tiles, things like that. (I apparently have CrashPlan running over my phone in the background; one second here, pause all backups.) This is all running on a standard MacBook Pro. So, 300 million points of census data, and once we've loaded it up, I won't walk through all the code, but this gives you a view of the population of the United States, colored by race and data-shaded: 300 million points, interactively, in your Jupyter notebook. We can come here and look at New York and see if it matches your intuition about the demographics of your fair city. How many people here are local to New York, actually, out of curiosity? Actually, a lot of people are not local to New York. Great. So it's cranking away. In addition to Numba, this is also using Dask, so it's parallelizing the work; that also means it can use Dask distributed and run remotely on a cluster. (Not sure why this is taking a while to transfer the data over to the client side.) Again, a lot of this delay is the round-tripping of the raster data from the server to the browser, and that will get much better in the next version of Bokeh.

As I zoom in here, you can see what's really nice about this. Again, the Datashader concept is: render all the data, just render all the data, and try to do it perceptually accurately. When you do that, you get a view into the texture and the gradations in your data set. So let me give you a color key: blue is white/Caucasian, green is African American, red is Asian, and the yellowish color is Hispanic. One of my favorite parts of this map to highlight is up here on the Upper East Side, where you have a hard line that the white population doesn't go past, and you have Spanish Harlem as differentiated from Harlem. You have Rikers Island here; you've got Queens and Flushing over here. You see all this stuff in the data. Just like with the avenues versus the cross streets, it's just there in the data.

We actually have a better census plot where you can start asking questions like: show me all the places in the United States where there are more black people than white people. And you can change the equation; you can say 20% more black people than white people, and so forth (a sketch of this kind of query follows below). What's also nice is that this is Datashader tied not just into Bokeh but into another visualization tool we built at Continuum called HoloViews; it's actually being scripted through the HoloViews interface. And all of this is live: these notebooks are published, they're out there on Anaconda Cloud, and you can just play with them.

What we can do then is load in the congressional districts of the United States and render those, and that gives us this really interesting plot, which doesn't actually show the races. So why is that? Let me try this again; I'm going to run this. This is actually really, really interesting, and somewhat relevant to the recent election, I suppose. It took three seconds to render; it takes some time to round-trip here too. Okay, here we go. Oh, this was tied to the other plot. Okay. So what we're seeing here is the congressional districts overlaid on top of the races. Do you notice this cut right here? Interesting. But it gets better, because that's just New York. Let's go and look at Baltimore. People are laughing; this is Baltimore. And what's really interesting about this (give it a second to render) is that this is legit gerrymandering. There's no Voronoi here, there's no distance minimization; there is no function that this shape optimizes for, except for particular voting outcomes.
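The census aggregation and the race-comparison query described above might look roughly like this in Datashader; the `census` DataFrame, column name, category codes, and color key are my assumptions, not necessarily the published notebook's:

```python
import datashader as ds
import datashader.transfer_functions as tf

cvs = ds.Canvas(plot_width=900, plot_height=525)

# One count per race category in every pixel ('race' must be a pandas
# categorical column for count_cat to work).
agg = cvs.points(census, 'easting', 'northing', ds.count_cat('race'))

color_key = {'w': 'aqua', 'b': 'lime', 'a': 'red', 'h': 'fuchsia'}
img = tf.shade(agg, color_key=color_key, how='eq_hist')

# "Show me everywhere with more black residents than white residents":
# select along the category dimension and mask one count with the other.
sel = agg.sel(race='b').where(agg.sel(race='b') > agg.sel(race='w'))
img_query = tf.shade(sel, cmap=['lightgreen', 'darkgreen'], how='eq_hist')
```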
Right? All of us can look at that and see it, and as soon as the compute comes back, we can see how hard it cuts across the racial lines. One second here... there it is. What's really amazing about this map is, again, remember: blue is the white population, green is the black population. There are some Asians over here, but they don't... no, well, of course Asians matter, all Asians matter. But check this out: if I click anywhere that's green, look at which congressional district it's in. If I click on a blue area, guess what it is. Right? It's just amazing. You can basically click on a color, and if you start zooming in on this little boundary right here, you can see how hard it cleaves, right there. It's absolutely amazing.

So this kind of interactive visualization is possible with just a few lines of Python code, completely based on open-source tools, running completely locally, here in your notebook. And you can even go one step further and build interactive tools like this one, where you have dropdowns (this is the nice thing about HoloViews): dropdowns that select for different races and show them across the United States, and you can zoom in on places.

With that, I think I'm out of time. I'll be over at the Continuum table; if any of you have questions about this, or want to talk about using it in your projects and things like that, I'd love to hear from you. Thank you very much.
Info
Channel: Plotly
Views: 9,173
Keywords: plotly, plot.ly, graphing, data, analytics, visualizing
Id: fB3cUrwxMVY
Length: 29min 16sec (1756 seconds)
Published: Thu Dec 01 2016