Interactive Data Visualization with Altair

Captions
Hi, and welcome to my workshop, Interactive Data Visualization with Altair. This is for Hack the North 2021, and my name is Nicholas Vadivelu. In the bottom right corner here you can see the link to the notebook we'll be using later. Don't worry if you don't get it now, it'll come back.

To start, let me introduce myself. I'm Nick. I've done internships in data science, ML research engineering, ML research, and ML engineering, so kind of all things data science and ML. You can follow me on Twitter at this handle, send me an email, or reach out to me on Slack; I will be on the Hack the North Slack all weekend. I'm here on behalf of the UWaterloo Data Science Club, so please check out all of our stuff here. We have a YouTube channel with tons of workshops, and we host tons of workshops every term, so we'd be happy to have you.

For an overview of what we're going to cover today: we'll start out by looking at Jupyter and Colab, which is the environment we're going to be working in. We'll look at a Pokemon data set, which is what we'll be visualizing. We'll start out with some static plots, just some normal plots that you can visualize with Altair, then we'll move into dynamic, interactive plots with Altair, and finally we'll take a look at how we can actually deploy these plots into a web page, also using Altair.

If you follow this link here, you can check out the finished product of this workshop, if this will load. Cool. So this is the finished notebook that we're working towards, which has all the code that I'll be talking about today as well as some brief explanations, though this video will contain more explanations. Before we get started, we can take a look at what we're working towards. Let me zoom in a little bit here. This is the quote-unquote final product that we're working towards, which is this interactive graph. You don't really have to worry about what's on the x-axis and y-axis, but as a brief summary,
this x-axis contains the base stat total of the Pokemon, this y-axis contains the number of moves they can learn, and on this bottom plot here we have the average base stat total by the tier the Pokemon is in, so just various facets and categories of this data. As you can see, we can create this interactive plot where we can highlight the points that are selected, and we have this sliding window which controls which points are included in the averages displayed below. This is a pretty simple example, but you can imagine creating really complex dashboards where elements on different plots interact with each other and show interesting views of your data. So we'll be working towards this, and hopefully it will serve as inspiration for you to make your own really cool interactive plots. Just as a sneak preview, this is only a few lines of code. This framework makes it extremely simple to create such complex plots.

But we won't be working off this notebook; we will be working from scratch, so we will visit a website. If you go to colab.research.google.com, this is the site I'm on, and you can create a new notebook here. Before I get started, this is what a Jupyter notebook is. Let's just name this "altair workshop recording". This is a Jupyter notebook: a Python environment that allows you to run code, have markdown, have visualizations and plots, etc., all in one environment. What's really cool is that all the variables and everything persist as you run different cells. So for example, up here I'm printing hello and setting x to five, once this initializes. Cool. And this x is still persistent for the next cell. This is really useful: you can set up code and easily debug and look at things step by step, as you'll see throughout this workshop. Jupyter Notebook is the offline version of this, which you can just run locally on your computer. Colab is a service run by Google which is essentially a Jupyter notebook in the cloud, and this has some really cool
properties. For example, you can actually get a GPU or a TPU back end for this, so if you're doing any kind of machine learning, you need a GPU, and you don't have one locally, this is a great place to go to get some accelerators for free, which is pretty awesome.

There are some tips I want to show you about Colab before we get going. One of them is that if you type in a question mark and then a function name, for example, you'll get the docstring, exactly what arguments it takes, and various helpful information. And if you type two question marks, this will give you the actual... oh no, print is a built-in function, so maybe that's not the best one. Let's import a different library, and then this will actually give you the code that is in this library. So nothing is a black box: you can look into it and understand how the code works before you continue on with your code. There are tons of other tips that you can find in the main notebook; these are just a few that will help us get started with this workshop.

Cool, let's start out by installing Altair. Let me make a new section, "Getting started". If you're not familiar with Python, the conventional way to install packages is something called pip, which is something you can use on the command line. Colab is actually super convenient: it allows you to use pip right within these cells. If you prepend your command with an exclamation point, this will actually run something in the terminal. So we can just install altair here, and we'll also install one more package called vega_datasets, which will give us some cool data sets to work with later. As you can see, Colab actually comes with these libraries included, which is why it's not installing anything; it's saying all the requirements are satisfied. But I thought it'd be useful for you to know how to do this. Cool, and now we can get started by importing the library. It's conventional to import this library with the alias alt, and we're also going to import one more
library called pandas. For those of you who are not familiar, pandas is a data frame and data manipulation package which allows you to work with tabular data, and by tabular data I mean rows and columns, kind of like an Excel spreadsheet. This workshop will not focus on how to use pandas, and our use of it will be quite minimal, but if you are interested in learning more about pandas, the syllabus as well as the full notebook have links to useful resources, including a workshop I ran last year at Hack the North about how you can get started with pandas.

The one thing that we will do is get started with this data set here. I have this URL here; you can get it from, again, the link I posted earlier in the workshop, and actually let me put that link right here so anyone can access it at any time. Cool, that's the link for this. Let's check out this link. This is the data set we'll be working with, and as you can see it's just a comma-separated values page. Let's load it in so we can take a look at what it looks like. The way you load data frames in pandas is using the read_csv function, so our data frame is going to equal pandas.read_csv and then our CSV URL, and we can actually pull it straight from the web. I'm going to pass in this other argument so it reads correctly; you don't really have to worry about what that does for this workshop.

Cool, here is our data set. As you can see, each row contains a different Pokemon and each column contains a different attribute of that Pokemon. Let's just look at one row so it's a little bit easier to digest; sample takes a random row from your data frame. Cool. We can look at things like what type the Pokemon is, what abilities it has, and its various statistics, so attack, defense, etc., and the base stat total, which is the one we'll be looking at in this workshop. We have things like what moves it can learn, and just various properties that you may be
interested in. You can look into this a little bit later. I am going to paste a magical piece of code into the notebook. I promise I will minimize this pasting, but this will just make it a little bit easier for us to visualize what's going on. On a high level, what I'm doing here is consolidating some of the different categories so that our visualization is a little bit cleaner.

Cool, and now we can get started with Altair. Let me make a little header, "Visualization". Let's make our first simple plot. I'm going to paste in more magical code that we'll build up step by step. Here's what we're going to be working with in the beginning of the workshop: we're going to try to generate this static plot before we move on to the dynamic stuff. As you can see, on the x-axis we have the base stat total, which I mentioned is an indicator of a Pokemon's power, and on the y-axis we have the number of moves that Pokemon can learn. We also have different colors for the tier they're in, so you can easily see what category each of them belongs to. Let me remove this interactive tag here so it's just a static plot. Cool. And as you can see, as we hover over things we get a tooltip which tells us the name of the Pokemon, which is pretty cool. So this is what we're going to build up towards.

Cool. The first primitive in Altair is something called a chart. I can access this using alt.Chart and then pass in our data frame here, and we get an error. So why do we get an error? What is this plot object? Altair is a Python library that interacts with a JavaScript front end, and essentially how this works is that it produces a specification for a plot, which is interpreted by the JavaScript in order to be displayed on your screen. This specification is called Vega-Lite, and Altair needs to both produce it and verify the schema. We can see here that this schema is not correct because
it's missing some elements, and in particular we're missing something called a mark. So let's look at what that is: mark_point, and ta-dah, we've made our first Altair plot. Now we have a chart here, and essentially we're saying that for every data point in this data set we want to use a point to represent it. Just to remind you, each row in the data set is a Pokemon, so each Pokemon gets its own point. There are several things we can do: for example, we can adjust the size of this point, or we can change this from a point to something else, like a tick. But overall this is kind of boring, because all these points, all 500 of them in the data set, are overlaid in one place, so we need some more information to tell Altair how to actually display this data on our screens.

Cool, so the next thing we can do: with mark_point we're telling Altair that we want to represent each Pokemon with a point, and then we need to tell Altair how to encode this on our screen. How do we want to display this information? The way we can do this is using encode, and we can tell it, okay, I want to show each point along the x-axis, where the x-axis represents the base stat total of that Pokemon. Cool, so now we have this x-axis here, and these points are laid out so that each Pokemon is in a position with respect to what its base stat total is.

But we still don't have the plot that we're interested in. We wanted a scatter plot, not just this 1D timeline type of thing. So now we also want to specify how to encode our y-axis, and we can do y equals num_moves, like we did in our original plot. There we go: now we have a scatter plot where the x-axis encodes the base stat total and the y-axis encodes the number of moves. This is pretty cool, and it makes it pretty simple, right? It's an intuitive grammar where you specify what kind of mark you want for each row in the data set and you specify how you want to lay these
points out on the screen. And like before, we can do things like set the color based on the tier, and once again Altair does a good job of choosing a good default color map and then splitting the sets of points based on the tier they're in. So this is our first pretty basic and simple plot.

In the coming few plots we're going to see that just using this very basic syntax we can create a variety of different plots, including histograms and, well, the histogram's the next one, so it's the only one that's coming to mind right now. But we'll see that, using this simple syntax, creating a histogram is quite similar to creating a scatter plot. If you come from any other plotting library, that might sound really strange to you, because histogram, scatter plot, and bar plot are all different function calls. In Altair this is not the case. Altair gives us a syntax where we just specify how each point should be represented and how the data should be encoded, and this leads to different plots. Don't worry if that didn't make any sense to you; we will see how that looks.

Cool, so let's remove this color just for a little bit of simplicity and make another plot here. Right now we have the number of moves encoded on the y-axis, but let's say that instead of the number of moves we wanted to encode the count, that is, how many Pokemon have each base stat total. Just to remind you, we're working towards a histogram here, which is why we're interested in counts. Now this is pretty interesting: the x-axis still has the base stat total, but the y-axis counts the number of occurrences of Pokemon at each base stat total. There's a ton of Pokemon with a base stat total of 600, and we can see that here, but it's kind of difficult to see what's going on in other places in the plot, obviously because normally with a histogram you would have binned buckets. So how does that work?
Well, in Altair that is pretty simple. Just pasting this code here, we can actually bin along the x-axis. Basically, whenever you want to specify more options for some argument, instead of passing in a string you pass in a specific object, so here this would be alt.X, for example. Let me start with this here, so you can see that this is exactly the same as what we had before, but now we can pass an additional argument, so for example we can pass in bin equals True. Cool, so now we've binned our x-axis and we have a more reasonable picture of what the distribution of our base stat totals looks like. We can see that there are very few Pokemon with a base stat total of 100, sorry, between 100 and 200, there's a ton between 400 and 500, there are also many between 500 and 600, and it kind of decreases after that. But I did skip over something a little bit. I told you how to bin the x-axis, but this y-axis, this count, seems kind of strange. If we remove the x-axis here altogether, we can see that we get the total number of elements in this data set, which is 550.
Since we didn't split our data along the x-axis in any way, the y-axis just contains the count of all the data. Now, if we add color equals tier here, for example, we get the count of Pokemon in each tier, but instead of faceting on the x-axis we get it all on the y-axis, in colors. And obviously we can split it up by tier on the x-axis as well, and as you can see it creates a different plot, where on the x-axis we have the tier and on the y-axis we have the count of the number of Pokemon in each tier. That's just kind of an aside, but this count function is not magical. Well, it is a little bit magical because it aggregates the data, but essentially it just counts things in each category that you give it.

So let's go back to the plot we were working with. We have a plot that we've created here, and obviously this is not such a conventional histogram: we have these points that represent the counts, but normally we would use a bar. So let's just use mark_bar, and boom, now we have a histogram with almost the very same syntax we used for a scatter plot. Just to refresh, how did the scatter plot look? Well, we used mark_point, we didn't bin, so we just set bin equal to False because binning doesn't make sense for our scatter plot, and instead of the y-axis being the count, the y-axis is num_moves. So with this very consistent syntax we're able to make a variety of different plots just by specifying how these points should be encoded on our screen.

Cool, so we have this histogram. Now let's keep moving forward and get some more information using this nice expressive API. One thing we may be interested in is looking at the distributions of these base stat totals based on the tier that the Pokemon is in. So let's do what we were doing before, where we assign a color to the tier. Oh, let me get a comma here. Nice. So this is okay: we were able
to split each of these bars up based on how many Pokemon were in each of these tiers, but it's not particularly nice to look at yet. So one thing we can do instead, let's copy this over here: instead of encoding y as the count and the color as the tier, let's try flipping that around. Now we have an interesting plot where we have the tiers on the y-axis and the base stat total once again on the x-axis in binned fashion, but now the color represents how many Pokemon are in each of these categories. In particular, we can see that for a base stat total between 400 and 500 in this PU tier, there are about 150 Pokemon, which is pretty cool. Again, this is not a perfect plot, but it's an evolution of what we had before.

We may think, well, we want to see denser base stat total bins, and we can actually do that via alt.Bin. In general, whenever you want to pass more options to a specific argument, you have alt dot that argument, and then you can specify various options. This is really convenient for the autocompletion in your IDE because you can see all the different options here, or we can use the technique we used before, where we type a question mark and then alt.Bin, and we can see here all the arguments that it takes. Cool, so we can actually set maxbins equal to 30.
We have a slightly denser, slightly squarer plot with a little bit more information. Once again, I do want to emphasize that we have exactly the same syntax here, but just by changing the encodings of the data we're producing completely different plots. But again, this is not a perfect plot. Let's go back to the y containing the count and the color containing the tier. Once again, Altair provides a very convenient API for faceting these plots out. If I specify column equals tier here, now we have a bunch of plots where each plot represents one of the tiers that we were interested in, and it shows each distribution in a different color as well, with a nice little legend on the side. So again, a very simple syntax. You can also use something like row, and it would facet along the rows. Just to emphasize, these options just tell Altair how to encode the data, and it produces a schema. There are no special function calls for different kinds of plots.

Now, just chugging along, we've made a variety of different kinds of plots here. Let's say we want to summarize the information, so we are interested in the average base stat total for each of these categories. Here we're seeing the distribution of the base stat totals for each of these categories; now we're more interested in the average. We can continue off this plot here, and now we don't want to bin; instead we're interested in a different aggregation, called average. How do we encode that? Well, we're interested in the average base stat total, we'll keep y on the tier here, and let's use the color to encode the tier as well. So this plot is similar to what you've seen before, where we tell it exactly how to encode each of these variables on each of these axes. The one new thing here is the average, which is kind of weird. This is like another magical function, just like count. So let's actually take a
closer look at how this works and how this is processed. Count and average are both examples of data aggregations. Let me copy an image and paste it here. Oh, are you going to be able to see this? Oh no. Yes, nice. Let's take a look at this in light mode so we can see it a little more clearly. Cool, so here's what is happening to the data. We specified that we wanted the average base stat total based on the tiers, so we want an average within each of these tiers, and internally in Altair this is what happens. Pretend that instead of tier we had this x variable here, and instead of base stat total we had a y here. Essentially, we treat this x as a key and we split up our data based on the x value. So for example, we group all the A's, we group all the B's, and we group all the C's down here. Then we apply some aggregate transformation to y. Here, for example, it is an average; it could also be a count, it could be a sum, there are many examples. We apply this to all the data points, so for example the average of two and four here is 3.0. Then we recombine this data into one data frame again, where we have one entry per key and the value here. So just like here, instead of having 500 Pokemon, we have these six entries, where we have the average for each of these things.

If you're just looking at raw pandas code, here's what that would look like. We have our data frame here; let me just pull up a sample so you can take a look at how that looks. Cool, we have a data frame where each row is a Pokemon. We can group them by the tier, and the tier's right here, as you can see. We group by the tier, the number we want is the base stat total, and then we take a mean within each of these groups. The information in this table is exactly the same as the information in this bar plot. Don't worry if you didn't catch these pandas commands; that's not the focus of this workshop. It's just to show you what the
equivalent is in a different kind of syntax. All this information is in the links in the syllabus, but that's what's happening under the hood when you use these magical aggregation functions. I'm going to switch back to dark mode because I cannot stand light mode. Let's see here, dark, nice. Cool, cool, cool. You can read more about these specific data aggregations in the Altair documentation, also linked in the syllabus for this workshop, but average and count, I would say, are the two most common ones. Let me zoom in a couple of notches so we can see what's going on.

Cool, so what have we covered so far? We've taken a look at how we can turn a simple plot into a variety of different plots. We went from a scatter plot, to this histogram, to this little heat map representing the base stat totals, to this faceted bar plot, histogram, sorry, showing these different amounts of data, and finally to this bar plot showing the averages within each of these categories. I think that's enough to cover the basics of plotting in Altair, and obviously you can look into more of the examples on Altair's website, but we've used the basic building blocks: defining what kind of mark we're using, and defining how we're going to encode the data in these marks.

Before we move on, there is one more thing I want to discuss, and I will copy and paste (I promised I'd minimize this, but I have to), and that is the idea of data types. So far in our workshop we haven't really worried about this because Altair was pretty smart about determining things. For example, here Altair was pretty good at determining that count is a continuous value, so it should use a gradient for this scale, whereas the tier is more of a discrete value, so it should use really distinct colors to indicate how they look. Altair is able to do this automatically, but we can actually look at the categories and also specify this
manually ourselves. There are five different categories according to Altair: quantitative, ordinal, nominal, temporal, and geojson. Quantitative is what you're used to: continuous-valued quantities. Ordinal is a discrete, ordered quantity. So for example the tiers, which are ordered... actually, we're not treating them as ordered, so maybe ignore that. An ordered example could be if there are five stages, and you're at stage one, two, three, four, or five; that would be an ordered discrete category. Another option is nominal, which covers unordered categories. So for example, if you have blue, green, and red, those are unordered. Maybe colors are a bad example, but, well, blue, green, red is a pretty good example of unordered categories. Temporal is what you'd expect, time or date, and geojson represents geographical locations, so you can create heat maps, for example, on actual maps.

I gave a really brief description, so you can visit this website here. This is the Altair documentation, the first time I'm pulling it up, and the different kinds of encodings are all listed here. There's also a ton more information on the site; it's a really great resource to learn more about Altair. It's all linked in the syllabus, and anything I talk about in this workshop is definitely going to be in here.

So why do I bring this up? We can actually specify these things explicitly. Let's make another chart of df. We want to make a scatter plot, and we want to encode x equals base stat total, y equals num_moves, and then we can say color equals tier. Cool. And we can actually specify what the data type should be here. So for example, we know that's quantitative, we know that's quantitative, and we know tier is ordinal, for example. Oh, sorry, not ordinal, nominal. Technically tiers are ordered, but we're going to pretend they're not
ordered for the purpose of this workshop. So we have the same color mapping here, but if we said, for example, that tiers are ordinal, then as you can see Altair chooses a different color map, because it makes more sense to represent ordered values using a gradient as opposed to distinct colors.

The reason I mention this is that pandas data frames give us the advantage that Altair can infer some of these types from the data frame. But we actually pulled this from a CSV, and we can plot data straight from a CSV as well. Oh no, I don't want to leave. So, without these type specifiers, we can just use csv_url, which points straight at this URL, and if we try to plot it here, what happens is that each of these features doesn't have a specific encoding type, so Altair is not able to infer what kind of color map to use, for example, or what kind of points to use. So we actually have to specify these explicitly in order for Altair to figure out, okay, I can treat these as a certain type. So instead of ordinal, nominal, and we can see it chooses the appropriate color map based on what you specify. And as you can see, in the actual data set there's a bunch of tiers. This is actually what I cleaned up earlier so we can have a little bit of a cleaner plot to look at.

So that concludes the static visualization part of this workshop, where we basically constructed a bunch of different plots from scratch using Altair, and it's really just a few lines of code to create these really nice plots. In the actual workshop we would take a break, but let's just move on to interaction.

Let's start with our first point of interaction here using a really basic plot. We can make alt.Chart(df) again. Once again we want to go back to our standard scatter plot: on the x we can have the base stat total, on the y we can have num_moves, and then we can color by the tier. Cool, we have the
plot, and to add our first interactive element we can literally just add interactive here, and boom, we have a super basic way to zoom into plots, zoom out of plots, and scroll around. It's really great. But this is pretty simple, and we saw some really complex interaction before, so let's see what we can build out of this.

The first thing we can do is build up an interval selection. That would be selecting an area on this plot and having some behavior happen based on that selection. The way you do that is to make an interval object, which is an alt.selection_interval. Cool. So far this plot looks the same, but what we can do is add a property to this chart where the selection is equal to this interval. Nice, we can create these little rectangles (sorry for the artifacts because of the lag), and we can select various areas of this chart. But this is not particularly interesting yet; we haven't had any actions happen because of this interaction.

So let's actually do that. Instead of having the color just be the tier, we can have the color be gray when a point is outside of the selection. How would you do that? We do this via something called alt.condition, where our predicate is this interval, and for the points where this is true we use the color of the tier, and for the points where this is false we use light gray. I will explain those elements after we see this plot here. Cool, and so now we can select areas of this plot, and everything else is gray while what's inside the box is colored.

So what's going on here? This interval object basically tells Altair and JavaScript which points are inside the interval versus which points are outside the interval. For the points that are inside the interval, we want to color them via the tier; for the points outside of the interval, we want to color them light gray. And as you can see, it's light gray outside. The reason we have to use alt.value instead of just typing in this string is
because if we just type in a string, Altair will assume that it is a column in our data set, which light gray is not. We can also do things to constrain what's going on. For example, if we don't want this freeform rectangle and we just want to slide across one axis, we can do something like encodings equals y, and now our sliding window can only slide on the y-axis. We can't change its size on the x-axis; it covers the entire x-axis.

I mentioned that this condition acts on the data points, not on the plot itself. The condition is independent of the plot; it just tells us which data points are inside and which data points are outside. This means that we can actually tie together multiple plots and have behavior interacting between these plots. Let me show you how that's done.

Once again, same code: we have this plot here where we can select, and let's also make a bar plot. Let me comment this out so I can show you what that bar plot looks like. This is going to be the same bar plot we were using before, where we have an alt.Chart of this data frame and we use mark_bar, so we want the data to be represented by bars. On the x-axis we encode the average base stat total, on the y-axis we encode the tier, and once again, just to make the colors more consistent, we can also encode the color on the tier. Cool.

So we have these two plots here, and we can assign this one to a variable, and then we can, for example, display it in a different cell: if I paste this variable, it'll display here. But let's bring back this plot and assign it to a variable too, and we can actually display them side by side. The syntax for that is scatter | bar, so this would be displaying them side by side. Oh, that doesn't look so good. So we can maybe display them top to bottom, so the scatter's on top and the bar's on the bottom, and this ampersand is what denotes that. If you're not so comfortable with using this operator to do so, you can also
use alt.vconcat, and this will vertically concatenate the two plots as well. Cool, so we still have this interactive component in our top plot, but it's not doing anything to the bottom plot, which is what we wanted. This is where the magic of Altair really comes in, in my opinion.

I told you that this interval object is kind of a predicate which tells us whether a point is inside this interval or not, so we can actually filter the points that are displayed on a plot based on this predicate. A transform_filter will filter points, so either include them in or exclude them from the set that's being visualized, and the predicate is this interval, so it's asking: is the point within this interval or not? Now, once again, we can select on the y-axis here, but on the bottom plot we can see that the points included in this average are changed by the slider. And we don't have to just visualize the average here. On the x-axis we can do count instead of the average base stat total, so now we're counting how many of these points are inside the selection interval, and this evolves as you change the slider.

I don't know about you, but this is really amazing to me: just a couple of lines of code and we have such a complex interaction, and it comes from a very intuitive building up of ideas. Just to reiterate these ideas: we built the scatter plot up from these primitives of what kind of mark and what kind of encoding we want. We added a property, which is a selection, so we're able to select intervals inside of this plot. We have a condition here which says that points inside the interval get this color and points outside are light gray. And we were able to tie this to a whole other plot by saying: for this plot, filter out points that are false for this interval, or, I guess, in other words,
include points that are true for this predicate, and the predicate is: is it inside of this selection? That's pretty cool.

There are other kinds of selections you can use. Let's continue to build off this code. Actually, let's build from scratch; it might be easier to test out. There's a different kind of selection called a multi selection, which allows you to select multiple points instead of selecting an interval. I won't specify the fields yet. Once again we can create our scatter, which is a chart with a bunch of points, and we want to encode the base stat total on x (I could probably have copied and pasted this, but oh well), and on y we want to encode the number of moves. Then again we want our bar chart; let me copy and paste it to save you from listening to my typing. So we have this bar chart here, and once again we can vertically concatenate these two and show them together. Cool, we have these two plots, one above the other, and we have no interaction yet. Oh, I also want the color to be the tier as well.

Now let's take a look at how this selection works. Once again, let's alter the color based on what's going on in the selection. We can have the condition where, if a point is currently selected, we use the tier, and if it's not selected, we want it to be light gray. I have my closing brace here, nice. And let's set selection equals multi. There are other properties you can add, for example a title here, but anyway, selection is one of them. So now you can select individual points, and by using shift I can select multiple points. This is maybe not so fun, so instead let's add the selection onto our bar plot. No, nope, this is not what I want. Cool, and so now we can select. Oh, this is kind of strange. Yeah, so we're trying to control the color in this
plot based on what we select on this plot, but we probably want to have the same color show up down here so we can see what's going on. I'm also going to specify that the fields should be tier, because otherwise it's not clear what we're selecting down here. Are we selecting the average base stat total? Are we selecting some point on this bar graph? Are we selecting the tier? It's not clear, so we need to be specific that the fields we're selecting are actually the tier here on this y-axis. Cool. Just based on the points that we have, we can see that the color above is controlled by the plot down below, which I think is pretty cool. And again, by pressing shift, you can see multiple colors here.

Once again, to tie this all together, we can have the two-way interaction we had in the initial plot we were looking at. We also have an interval here, where the interval is equal to an alt.selection_interval and the encodings equals y, and we can have a property here, once again making the selection equal to the interval. Here we can have another transform_filter. Again, these functions all compose in arbitrary order; the only thing, I believe, is that you have to make sure the mark and the encoding come first, and anything else can come afterward in any order. So here we can have a transform_filter on the interval. No, we have a little bug. Where is it? "Has no attribute property": it's properties, plural. Cool.

So now we can control what's on the bottom with what's on the top, and on the top we can control what's shown up there. Maybe we want to show only what's in between these two lines. Just like Boolean logic, we can say that we want a point to be in the multi selection and in the interval selection, and so now we can select these points, and only the points inside the interval and selected below show up here. As you can see, by default nothing is selected on top. You can actually change that to empty equals
all, and so by default all of them would be selected, and I think that should actually fix our problem. No, it didn't. I will not debug this now, but again, you can specify defaults via this alt.selection_interval here. Cool. I only briefly went over these various selection features, but if you look through the workshop, there are tons of different things you can do with these selections. This is the interaction page of the Altair documentation, which I would encourage you to work through in your own time.

Cool, so we've gone through the first two portions of this workshop: creating the static plots and making interactive visualizations from those static plots. It's overall pretty simple. We use some very basic building blocks to create some very complex interactions and some very nice plots, in my opinion. But these plots are not helpful just sitting in your own Jupyter notebook where no one can look at them, so let's take a look at how we can actually deploy them.

Deployment. Lovely. For ease of demonstration, I'm going to use a smaller data frame than the current 500 samples we have, and I will explain why later. I'm taking a sample of two different Pokemon from this data set, and as you can see, we chose two random Pokemon, Vaporeon and Solrock. Just to confirm, the length of this data frame is only two, because we only have two things in there. Let's take a look at a simple version of the plot we were using above. Let me just copy this and paste it down here. I'm going to remove the multi selection and just keep this normal interval selection, for the purposes of demonstration. Once again we have this scatter here, and it does what we expect. We can even add the interactive property. I don't know if these two will actually work together; they should. Let's see. Oh yeah, I can zoom in and I can select. I know the
selection is kind of weird with dragging, but oh well, I just wanted to try that.

I mentioned before that this Altair plot produces a specification called Vega-Lite, and JavaScript interprets this specification to actually produce the plot, so let's take a look at how that looks. Here is the actual JSON that represents the schema for this plot. There's a lot to scroll through. Why am I scrolling? Am I scrolling up or down? I can't tell. Oh yeah, here's why: that's why I wanted the small data frame, but I didn't use it. Small df. Cool, here is a much easier to digest version. It tells us, okay, here's the schema, which is the Vega-Lite schema I was talking about. It gives us the height and width, it gives us the name of the data, and it actually encodes the entire data set into JSON, so it has every single point here, which in this case is only two points. Then we have all the information we gave it before: the encoding, the color (we have a condition here), our x encoding, the type of mark, et cetera, et cetera.

This schema and this encoding is what makes it so easy to deploy this visualization. This Vega-Lite specification, which is the JSON I showed you, is becoming a standard on the web, where even on websites like Wikipedia you can upload this Vega-Lite specification and have interactive plots rendered right there in JavaScript, so it's really great. To give you a high-level overview of how that works (I won't go into too much detail): the specification is this JSON representation called Vega-Lite, which is then converted to Vega, which is a more complex specification, which is then converted to D3.js, which is the JavaScript framework, and that's what renders it onto the screen. This fact is what actually makes it super easy to deploy onto a web page. So let's actually do plot.to_html. Oh no, it's scatter. There we go.
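To make the deployment idea concrete, here is a minimal standard-library sketch of the general pattern the talk describes: a chart is just a JSON specification, and a web page only needs to hand that JSON to a JavaScript renderer. This is a hand-written approximation, not the output of Altair's to_html. The spec is written by hand rather than generated by Altair, the rows and field names (bst, moves, tier) are made up for illustration, and the page is assumed to load the public Vega, Vega-Lite, and Vega-Embed bundles from the jsDelivr CDN.

```python
import json

# Hypothetical two-row data set standing in for the small sample in the demo;
# the names and numbers are invented for illustration only.
rows = [
    {"name": "pokemon_a", "bst": 500, "moves": 80, "tier": "A"},
    {"name": "pokemon_b", "bst": 450, "moves": 70, "tier": "B"},
]

# A hand-written Vega-Lite spec for a colored scatter plot. Note that the
# whole data set is embedded inline, which is why large data sets bloat the page.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "point",
    "encoding": {
        "x": {"field": "bst", "type": "quantitative"},
        "y": {"field": "moves", "type": "quantitative"},
        "color": {"field": "tier", "type": "nominal"},
    },
}

# Minimal page: load the renderer libraries, then pass the spec to vegaEmbed.
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head>
<body>
  <div id="vis"></div>
  <script>vegaEmbed("#vis", {spec_json});</script>
</body>
</html>
"""


def spec_to_html(chart_spec):
    """Return a standalone HTML page that renders the given Vega-Lite spec."""
    return HTML_TEMPLATE.format(spec_json=json.dumps(chart_spec, indent=2))


html = spec_to_html(spec)
```

Writing the html string to a file such as index.html and opening it in a browser should show the chart without any Python running, which is the property the demo relies on.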
That's the plot I wanted, and so now we have some HTML code which we can embed on a web page. Can I copy this directly? Please let me copy this. Let's hope that worked. Let's open a little text editor here, paste that all in, and save this as index.html. Let me open my file browser here and, ta-da. Well, we only have two data points, but we have the exact plot that we were working with. Let me zoom in a little. Again, I only included two data points for the speed of this demo, but we have this exact plot working independently in this HTML file, which I think is pretty cool.

A couple of caveats before we get going. I only used two data points, and this data set is relatively small; it's only 550 points. Since Altair encodes the entire data set into this JSON specification, it doesn't work so well with very large data sets. By very large I mean probably over 5,000 points. In the notebook I give you a link explaining how you can get around these limitations, but in general Altair is very good for these smaller visualizations rather than huge data sets.

That concludes the content we were interested in working on in this workshop. To summarize what we did (I will not scroll; this will take too long to scroll): we started off by making simple static plots, including scatter plots, histograms, and bar plots. We added interactive elements to them, where we were able to get two plots to interact via various mouse clicks, etc. And finally we figured out how to deploy them by using this to_html method, exporting as an HTML web page, and viewing it on your own.

In terms of next steps: we installed this vega_datasets package but didn't actually use it. What we can do here is load up various data sets; for example, I have a cars data set. These are great data sets to explore and visualize with Altair; they are
actually designed directly for this specification, so I think these are some great data sets to check out and visualize. You can find a full list of data sets at this web page here; this Vega datasets page has a ton of them. Again, all these things are in the workshop notebook, so I won't give you too much time to look at them right now. There's also a ton of examples in the Altair example gallery, and I would encourage you to check these out, make your own plots, and tweak them. I think this should give you some really nice inspiration on how to proceed with your projects. There's also this Altair notebooks page, which has a bunch of notebooks and tutorials on how to use Altair and an in-depth look at how each of these components works. There are tons of things we haven't talked about in this very brief workshop.

This workshop was heavily inspired by this talk by Jake VanderPlas, who's the creator of this library. He did a three-hour-long workshop going through this process, doing the things that I did but in much more depth, so if you're really interested in this, I would check out that workshop.

We can also briefly talk about some alternatives. I mentioned that Altair is not the best at working with really large data sets. If your project involves really large data sets, I would check out things like seaborn and Matplotlib, which are battle-tested libraries that have stood the test of time and are great and robust for making high-quality visualizations. There's also Plotly, which can also do interactive plots and may handle data set sizes slightly bigger than what Altair is able to handle. And with that, thank you for attending.
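The central idea of the interactive part of the workshop, that a selection is a predicate over data points and is independent of any one plot, can be modelled in a few lines of plain Python. The sketch below is not Altair code and does not call any Altair API. The helper names (make_interval, condition, mean_by) are hypothetical, and the rows are made up; it only mirrors the roles that the interval selection, alt.condition, and transform_filter play in the talk.

```python
from collections import defaultdict

# Made-up rows; the field names echo the workshop's columns but the values
# are invented for illustration.
points = [
    {"bst": 300, "moves": 40, "tier": "low"},
    {"bst": 450, "moves": 60, "tier": "mid"},
    {"bst": 500, "moves": 75, "tier": "mid"},
    {"bst": 600, "moves": 90, "tier": "high"},
]


def make_interval(field, low, high):
    """Return a predicate: is this row's `field` inside [low, high]?"""
    def predicate(row):
        return low <= row[field] <= high
    return predicate


def condition(predicate, if_true, if_false):
    """Choose a value per row, loosely mirroring the role of alt.condition."""
    return lambda row: if_true(row) if predicate(row) else if_false(row)


def mean_by(rows, key, value):
    """Group rows by `key` and average `value` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


# A brush restricted to the y-axis field, like an interval with encodings on y.
brush = make_interval("moves", 50, 80)

# Scatter plot: color by tier inside the brush, light gray outside it.
color = condition(brush, lambda r: r["tier"], lambda r: "lightgray")
scatter_colors = [color(p) for p in points]
# scatter_colors == ['lightgray', 'mid', 'mid', 'lightgray']

# Bar plot: the same predicate filters the rows before the average is taken.
bar_data = mean_by([p for p in points if brush(p)], "tier", "bst")
# bar_data == {'mid': 475.0}
```

Because the same brush predicate feeds both the coloring and the filtering, changing its bounds updates both views at once, which is the linked behavior the two concatenated charts show in the demo.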
Info
Channel: Hack the North
Views: 1,889
Rating: 4.8730159 out of 5
Id: x-iU2UwgVf0
Length: 46min 37sec (2797 seconds)
Published: Mon Feb 08 2021