Stephen Elston - Data Visualization and Exploration with Python

Captions
So hopefully everybody's got the notebook in front of them and either has Seaborn installed or was able to install it. Obviously I'm going to talk about visualization today using Python, but a lot of what I'm going to focus on is just general data exploration, what we call exploratory data analysis. A little more about the data set: it's in that git repo. It's an old one, as you can probably tell when you look at the prices of cars being a few thousand dollars on average or something like that. It's from 1985, but that isn't that important. What we're really interested in here is how we take a reasonably complex data set, one that has a lot of different columns of different types, and visually understand it, and also the tools you can use to then present observations you might have from visual analysis to your colleagues and bosses and customers, whatever.

So this whole idea of visualization and exploratory data analysis is, in the history of analytics and statistics and whatnot, relatively new. John Tukey, who I was privileged to know as a graduate student at Princeton, published this book in 1977, and this was sort of the culmination of research he had done for almost 20 years before that. Before that, statisticians and other types of analysts (there were no data scientists, obviously, in 1977; that's a fairly new term) would put a graph or two into a report or into a paper, but it was mostly about tables and it was very theoretical. Tukey stepped back and said, wait, there's a lot you can learn just by visually examining your data, really trying to understand the relationships in your data, using a combination of maybe simple summary statistics and graphical methods. And that was a big revelation, believe it or not. I mean, nowadays I think it seems so obvious, but maybe it wasn't in 1977. Another influence on me at least, but also on the whole field, was this work by
Bill Cleveland, who was then director of the stats and math group at AT&T Bell Labs, someone else I had the privilege to work with when I was at a company called Statistical Sciences, where we commercialized Bell Labs S, which eventually became open-source R. And so Bill very systematically went about... he basically tested a lot of different ideas about graphics, and how you can subdivide and present data, and how people perceive it. And he tested that a lot of times not just on an average person-on-the-street sample but on fellow scientists at Bell Labs. It was amazing how many times they didn't understand the plots that other people were showing them, even with those very sophisticated audiences. I don't do a lot with this here, but I just want you guys to be aware of, obviously, Edward Tufte, a Yale professor, still, for a long time. This was his original, sort of seminal book, in the early 90s also. He looked at a different aspect of this than where Cleveland and Tukey came from: he was more interested in the clarity of the presentation. If you look at that book you'll see there are some interesting rules about, you know, what's the ratio of actual information to the amount of ink you're... well, in those days ink, because it was all on mechanical plotters in those days, but screen pixels nowadays, I guess.

So our goals are just what I've outlined here: to explore complex data using visualization, and to look at different chart types you can use, because when you're exploring complex data there's not just one chart type. You don't just keep doing the same thing over. Try a lot of different things, expect to fail a lot, and keep trying different things. And then we hit this problem immediately, which is a lot of what Cleveland especially was working on: no matter how you represent computer-generated graphics, it's on a 2D surface. Regardless of whether it's printed on paper or, nowadays, we project it onto a
computer screen, it's basically flat, right? It's 2D. Maybe you could do something fancy with VR headsets or something and get a third dimension, but complex data has many more dimensions than that. So there are a couple of ways around that. We'll use what we call plot aesthetics, which I'll get on to, and also a method which has been reinvented by many people, and I'll talk about that: conditioning or faceting, where you basically do different group-bys of the data. It's like a group-by operation.

So there are some resources here I put in the notebook. In two hours we're just going to scratch the surface, if even that, just on how to do stuff. There's a lot of useful information on matplotlib, which, as I'll show you, is kind of the base package for almost all Python graphics. There are a few exceptions, like Bokeh, which we don't have time to go into, that are their own plotting base packages. But if you go to this website that I gave you, matplotlib.org/resources/index, there's just a lot of stuff, from tutorials and videos to books and whatnot. And this is pretty highly curated; this isn't just some random list, this is pretty high-quality material. Some of you may have been with us in the last session in this room for the pandas tutorial, and there's pandas.pydata.org/pandas-docs... anyway, you can see it there. There's this tutorial on visualization which goes through a lot of the basics: if you have your data in a pandas data frame, there's just a lot of stuff you can do with very few lines of code. I'd say that's the main advantage of this; you can sometimes get a long way without writing a lot of code.

And then a relatively new package that we're going to take advantage of here is Seaborn, which is the one I wanted to make sure everybody had installed. The people who are doing Seaborn... I think this is still sort of an in-flight project, in that there are missing pieces and things that, when I work with
it, I feel like, gee, there really should be something here. So I hope they continue to do new releases of this. They seem to be active; there seem to be some active contributions going on. It takes a lot of more sophisticated plotting ideas, a lot of them from the R world. A lot of the stuff that was done earlier on tended to be in R, and so R plotting was generally a little bit ahead of the Python world. So they've tried with this package, Seaborn, but also with another package called ggplot, which tries to imitate the ggplot2 package that's revolutionized graphics in R. But I can only do so much in two hours, so we're going to just focus on Seaborn, which is enough. So we're going to do three different kinds of plotting, but don't get discouraged, because the base plotting is always matplotlib for all of these, and for the ggplot package as well.

By the way, if you have questions, please interrupt and raise your hand. Also, I think on the live feed we have somebody monitoring Twitter, so if you're online watching and you'd like to come up with a question, please do. And we've got a microphone we can pass around for the questions.

So the first thing is we've got to just load this data. How many of you are at least somewhat comfortable with pandas, I hope, or have been in the pandas tutorial just now? Okay, so it's a majority by far. So I'm going to load this data into a pandas data frame. I don't want to spend a lot of time on pandas, because we just had a pandas tutorial that some of you may have been in, but the basics are: I import pandas as pd, so it's pd.read_csv with the file name, and I've assigned the file name. Then there are certain columns that have missing values, and they happen to be in these columns, so that's a list of my column names. And in this data set they don't use an NA or a missing or a null or something like that; they use a
text question mark as the missing value. That's just how it was coded. So first I convert all of those to a NumPy NaN, so basically a missing value, and then I drop them. And I convert those columns to numeric as well, because, with the missing values being coded as a text or string question mark, they show up as string columns, whereas they're really numeric. So we'll run that whole thing, and with any luck it's going to work. You guys can run it on your local environments too.

All right. So whenever you're exploring data you should get some idea of what's in the data, right? Don't just start randomly doing stuff. So let's just look at the head of that data frame with the head method. So .head is the head method. I could say n equals something here, like n equals 10 or something, but, oops... for obvious reasons I'm not going to do that. No idea why that happened. So let me just run that. What we see is just the first five rows. We've got some columns we're not going to use. I think this data set originally was put together having to do with insurance losses on different types of automobiles, but we're going to focus on the price, which is a little easier to understand. So you have the make, you know, who made it; what type of fuel it uses; aspiration, which means how the air gets into the engine, like turbo or standard; number of doors, body style, drive wheels (you'll see how this works as we go), et cetera; and then some features about the engine; and finally the price of the car. Let me just fix that. Okay.

And if we just use the describe method, that'll give us some summary statistics on those columns. Let's just look at a couple of these here, like length. Okay, so we can see there are 195 cars where we have a length value. The mean length, I think these are in inches, is 174. The standard deviation isn't actually that much, considering.
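The loading steps described above (read the CSV, recode "?" as NaN, drop incomplete rows, convert to numeric, then inspect with head and describe) might look roughly like the sketch below. The inline rows, the column names, and the `cols_with_missing` list are made up for illustration; the real file and column list are in the speaker's notebook.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows standing in for the real CSV file; "?" marks a missing value.
csv_text = """make,fuel-type,length,price
toyota,gas,158.7,5348
honda,gas,150.0,?
bmw,gas,176.8,16430
mazda,diesel,177.8,10795
audi,gas,?,17450
"""

# Hypothetical list of the columns that contain "?" placeholders.
cols_with_missing = ["length", "price"]

auto_prices = pd.read_csv(io.StringIO(csv_text))

# Recode "?" as NaN, drop rows with any missing value, then make the
# affected columns numeric (they were read in as strings because of "?").
auto_prices[cols_with_missing] = auto_prices[cols_with_missing].replace("?", np.nan)
auto_prices = auto_prices.dropna(axis=0)
for col in cols_with_missing:
    auto_prices[col] = pd.to_numeric(auto_prices[col])

print(auto_prices.head())      # first rows of the cleaned frame
print(auto_prices.describe())  # summary statistics for the numeric columns
```

An alternative, if the placeholder is known up front, is to pass `na_values="?"` to `pd.read_csv`, which does the recoding while parsing.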
The range is actually from a minimum of 141 to a maximum of 208. So this gives us an idea that cars are not super short and they're not extremely long; they're in a fairly narrow range. And you can look at these quantiles, the 25 percent and the 50 percent, which is the same as the median, which is pretty close to the mean. So that also tells us that the distribution of lengths of cars is probably fairly symmetric, right? Whereas if we scroll over here to price, we'll see the mean price is $13,000 (like I said, this is kind of an old data set), and the standard deviation is quite wide, actually, 8,000. So we've got a big range of price. In fact prices go from $5,000 to $45,000, and the median is $10,000 whereas the mean is $13,000. So in terms of exploring this data, we know that price is highly skewed. There are other things you can look at here, but those are just some things we'll keep in mind as we think about what charts, what visualizations, are useful for digging into this data set.

So now I'd like to just go through some basic plot types and introduce you to how we make those plots, which is one of our objectives here, so that you come away with a working knowledge of Python plotting, how you use it to explore data, and also how you can use it to create presentation graphics. A really basic plot that everybody works with, of course, is the scatter plot, so let's just do the first one in matplotlib. The recipe I'm showing you here is really simple. Obviously we have to import matplotlib.pyplot, because we're doing it from Python, right? And we're going to use the plot method, and we can just say what values of x and y we want. And I want red dots, okay, one for each data point. So the type has to be 'r' for red and an 'o', a small o, for a circle or a dot. And notice there's one thing:
since we're working in a Jupyter notebook, and it runs an IPython interpreter, if you don't include this magic command, %matplotlib inline, with the percent sign in front, it will not plot your graph inline. I mean, you would normally think that's what should happen, right? You run a plot command, you'd want to see the plot, whether you're running in an IPython shell or in a Jupyter notebook that runs IPython. But that's not the default; you have to say so. So, just to show you in code: I've imported matplotlib.pyplot as plt, so I've got plt.plot, and then I've got x, y, and the type. So it's pretty simple, right? So go ahead and run that.

And we got a plot. We have price on the vertical axis, presumably, and city miles per gallon, so some measure of fuel efficiency of those cars, on the horizontal axis. Can people see... there are a couple of obvious deficiencies in just taking this kind of default matplotlib plot. What are a couple of them that you guys see? Anybody? Yeah: no labels on the axes. That's a big annoyance. Anything else you guys see? Yeah, there's over-plotting. That's a good one; we'll get to working on that. But there's another one. Anybody else see it? Yes, the x-axis is chopping... actually, the y-axis is just scaled to the max and min, so the dots are chopped down here on the bottom. And that's just because we took the defaults; we didn't try to do anything fancy.

By the way, this business of not labeling axes is a huge... I think it was mentioned in my introduction that I teach data science classes for both UW and Harvard, and it's a huge pet peeve of mine, because you get something like this and you don't know whether it's average shoe size versus cows born or what. What's this plot? You have no idea, right? It could be anything. So keep that in mind. So let's do a little better. I have this in a pandas data frame, as you saw, right?
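A minimal sketch of the matplotlib recipe just described, using made-up numbers in place of the city MPG and price columns. Unlike the bare default plot, it also labels the axes, which addresses the first complaint from the audience. The `Agg` backend line is only there so the sketch runs outside a notebook; in Jupyter you would run `%matplotlib inline` instead.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in a notebook use `%matplotlib inline`
import matplotlib.pyplot as plt

# Made-up values standing in for two columns of the auto data.
city_mpg = [21, 24, 19, 31, 38, 17, 27]
price = [13495, 8921, 17450, 7295, 5151, 30760, 9279]

# x values, y values, then the format string: "r" = red, "o" = circle marker.
lines = plt.plot(city_mpg, price, "ro")

# Label the axes so the reader knows what is being plotted.
plt.xlabel("City MPG")
plt.ylabel("Price")
```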
That's how we loaded it. And so I can take that data frame, auto_prices, and apply the pandas plot method, and I'll say the kind is a scatter plot, so kind equals 'scatter' in quotes, and then I just say what my x and y are, just like I did before. Go ahead and run that, and you see you get a somewhat new and improved plot, right? It's got axis labels, and these circles aren't chopped off. You still can't tell very well how many points are over-plotting here; we'll get to that. But at least for one line of code we've got sort of more mileage, better output, using the pandas plot method. That's because the pandas plot method is built as an abstraction layer over matplotlib, and you get more.

So I want you guys to go ahead and try this yourself. I'm going to give you a couple of minutes to do this. So do auto price versus curb weight. Somewhere up here, if you look at the column names, there's one called curb weight, which is, as you can imagine, sort of the weight of the car when it's empty. Okay, so just give that a try. Oh, and to execute the code, if you don't know, in a Jupyter notebook cell just hit Shift-Enter with the cursor anywhere in that cell and it'll execute the code. Actually, how many of you have never used a Jupyter notebook before? Only like two. Okay, so hopefully it's not too overwhelming.

Oh, there's a question way in the back. Could you run the microphone back there? I can't hear you. It's on? Hello. Yes. "Your y-axis starts with zero; for me it's starting at 5000." Hmm, that could just be different versions of pandas plotting. As you can see... you can force x limits and y limits as attributes, and we'll look a little bit at attributes. Right now I'm not controlling them, so it's probably just a different version. Okay, was there another question way in the back too? Nope? All right. Okay, well, hopefully you guys got something that looks about like
this. It may not be, as we just discussed, exactly this. But okay, so let's get a little more sophisticated here. This is a really simple recipe, right? We're just using one line of code: we can specify an x, a y, and a kind, and you can get some mileage out of that, so to speak. But let's use some of the matplotlib capabilities. So I'm going to define a figure, which I'm going to call fig, and it's plt.figure. If you look at the docs for this, there are all sorts of attributes about gridlines and all sorts of stuff. Right now I'm just going to say figsize equals six by six, so, at least the way I've got my notebook set up, it will be six by six inches. Now that I've defined that figure, I can set up... I'm just going to do one axis on it. So gca is a method; it defines an axis. And so now I've got auto_prices.plot, so this is all the same as you just saw before, except now I say what the axis is, so I say ax equals ax here, right? So now I've said, okay, don't just use any default axis; use the axis I'm telling you to use. And now that I've defined what axis it is, I can set all sorts of attributes, including, as we just talked about, x limits and y limits. But I'm just going to say ax.set_title, so I give it a title; set_xlabel, so I give it an x-axis label; and I give it a y-axis label. You see it doesn't look a whole lot different from the plot I had before, but I do have my title here, and I do have sort of more human-readable axis labels. So is it clear to everybody how I did that? Because it's simple, but it's also important for what we're going to do.

Sure. Oh, so the question is: why are we setting all those attributes after the plot? I think the answer is it doesn't really matter. I suppose you could set them before the plot, because you're just setting up different attributes of how the plot's going to be
displayed, and you're saying to the pandas method, use this axis. So, I don't know, let's try it. Let's see what happens if we turn it around. I think it won't matter. Yeah? Okay, you tried it and it doesn't matter. No, it doesn't matter, because it's just going to run through all that, and you've only set up one figure, so anything that applies to that axis, it'll just keep. It's like a compound operation: it's going to apply them, I suppose, in the order they're listed in the code, but the end result isn't going to look any different.

Okay, so here's another. "Hi, can you explain the difference between figure and axis? Like, can you switch the order of those in the code?" The order of, I'm sorry, the order of which? "fig equals plt.figure and ax equals fig.gca. What is the difference?" Well, no, I have to define a figure first, which is just a plot area. The figure that I'm defining when I set that figure size is just a plot area. You'll see later we'll have multiple axes on a figure. So yes, I have to create a figure before I can say that it gets one or more axes. "What if you create another figure after that?" Well, then you're starting another figure. If I run another figure method, everything I'm plotting from there will be on a new figure. Was there another question behind there? No? Okay. Hopefully you'll see that, because we'll get to making multiple axes on a plot, I hope, if I have time.

Okay, so a little exercise. Take the plot that you did before, which, recall, was the plot of price versus curb weight, and try to decorate it with a title and some axis labels, and also control the figure size. You don't have to do six by six; you can do your own if you want to see four by four or eight by eight, whatever. It doesn't even have to be square, to tell you the truth. Did everybody kind of get that, then? And you can also cheat and see what I did.
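One way the exercise could be approached is sketched below: create a figure, get an axis from it, hand that axis to the pandas plot method, then set the title and labels on the axis. The data values, the column names `curb_weight` and `price`, and the label text are assumptions for illustration, not the notebook's actual code.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; in a notebook use `%matplotlib inline`
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; the column names are assumed for this sketch.
auto_prices = pd.DataFrame({
    "curb_weight": [2548, 2823, 2337, 1989, 3086, 2410],
    "price": [13495, 16500, 13950, 6295, 23875, 8845],
})

fig = plt.figure(figsize=(8, 6))  # plot area: width and height in nominal inches
ax = fig.gca()                    # get (or create) the current axes on this figure

# Tell pandas to draw on the axes we just set up rather than a default one.
auto_prices.plot(kind="scatter", x="curb_weight", y="price", ax=ax)

# Decorate the axes with a title and human-readable labels.
ax.set_title("Auto price vs. curb weight")
ax.set_xlabel("Curb weight")
ax.set_ylabel("Price")
```

Further attributes such as `ax.set_xlim` and `ax.set_ylim` can be set on the same axis object in the same way.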
I changed mine up a little bit. I said, oh, I'm going to make mine wider, so I made it eight by six. So I've got a slightly different aspect ratio than the square we had before, for no good reason at this point; just doing it to do it. Okay, everybody with us? Oh, here's a question. The style, meaning the shapes and the colors: those are aesthetics, and we'll get to aesthetics. So the question was how you set the types of points, but we'll talk about aesthetics once we've gone through the basic plot types. So another plot type is a line plot.

Oh, the units. At least the way I have my notebook set up, it's inches, but it's inches based on some scale; it never quite comes out in real inches on my screen. So it's inches in somebody's scale, but it's a little hard to know. Usually if I'm trying to do something more presentation quality, I'll wind up just messing around a little bit till I find a figure size that looks good. Well, I mean scale in a sense like here: I did 8 by 6, so obviously this axis is physically longer than the vertical axis, and pandas laid out the numbers correctly. I'm just taking the default tick marks and labels, but there's lots of stuff you can do if you want, like more granular tick marks. You can also do things like rotate the text, if you have a lot of tick marks that need labels, so it doesn't all get jumbled. If you keep getting into these deeper things, you can do just about anything you want, given enough code and pain, is the answer.

So I'm just going to create a new data... so why do you use a line plot? A scatter plot is good when you have a lot of data where the points are not ordered in any way, right? There's no order to these cars in terms of price and weight, or price and miles per gallon, or anything like that. They're just whatever they are, right? It's just whatever
order. Actually, I think they're ordered by manufacturer alphabetically, but that has nothing to do with the actual data values, right? But there are cases, particularly time series plots and some others, where order matters. So I'm going to create just a simple data frame here, and all I'm doing is using the DataFrame method on a simple dictionary that has an x and a y column. x is just a list of a hundred numbers and y is the square of those, so it's just going to define a parabola. The only thing that's different here now is, notice, I don't even have to say what the kind is with my plot method on this data frame, because line plots turn out to be the default for a two-axis plot in pandas. And there you have it. I mean, it's not very beautiful, but I just wanted to show you guys the method for doing line plots. We're not going to do much with that today, but it is an important plot type for certain types of data, so do be aware of it.

So bar plots are a little tricky if you're used to other plotting packages, because there is a bar plot method, and you can see that in this second set of code, counts.plot.bar, but it needs to be based on counts. You can't just give it a column from your original data frame; that bar plot method literally just plots counts. So what we've got to do is create counts, and we're going to look at the make (make is like the manufacturer), and we're going to use the value_counts method. Okay, everybody with me on that? So we wind up with this series. A single-column pandas data structure is called a series, because it's not a data frame; there aren't multiple columns. The index of this series is just the manufacturer, and notice that it nicely orders them. This is just the way pandas works: it nicely orders them from the manufacturer with the most frequent
cars to the manufacturer with the least frequent. And so now that we have counts, I'm going to apply the counts.plot.bar method to that counts series, not to my original data frame. Okay, a little tricky. And notice I'm using yet a different figure size here, but all the rest of this should be familiar by now. Well, let me make my figure size smaller so you guys can actually see it. Okay, so now we've got a nice ordered bar plot. And in terms of perception, your perception, and that of your colleagues, bosses, customers, or whoever you're going to present to, it's always very nice with a bar plot to order it. This is descending order, or it could be ascending order, but if you order it, people don't have to look and go, okay, I see Toyota is the most common, but is it Nissan or Honda? Is Honda more common than Mitsubishi? You can hardly tell, right? Actually, I think those three all have exactly the same number of cars. But if they were just in some random order, people would really have to study your plot to understand it. You yourself would. That's one of those things you want to avoid when you're doing good data visualizations. So the lesson is: order your counts, either ascending or descending, and it may matter which, depending on what you're trying to do, right?

So, a histogram. Any questions up to that point? Are we okay? Okay. So with a bar plot, notice this is discrete or categorical data. A car has a manufacturer, and in this particular data set it turns out to be one of the names from this list, but those are discrete values. A car can't be 1.2 Toyota or something like that, right? But also notice, to somebody else's question, pandas plotting does do this nice thing when you get too many axis labels. Notice how it knew to change... it kind of figured out that, oh, I'm going to get a lot
of... if I put those labels horizontally, which is the normal way to do it, there's going to be a lot of overprinting. So it turned them vertically so that you can see them better. But you can force that too. So for a histogram we have a continuous variable, like engine size. Engine size in these data is in cubic inches, and there's nothing categorical about it: a car could have a 144 cubic inch engine, or 143 or 141 or whatever, or 144.6. It's a numerical, continuous variable, not a categorical variable. So therefore we use histograms. With histograms you simply, around some bins, create counts. Nicely, unlike bar plots, where pandas forces you to compute the counts yourself, the histogram does it all automatically. Now one trick is, if you just did auto_prices.plot.hist, it would try to make a histogram with all the numeric variables, plotted one on top of another in some funny way, which probably isn't what you want. I mean, maybe sometimes it is. So you need to subset. I'm subsetting that data frame to just the column I'm interested in right now, engine size, and that's why that little extra bit there with the square-bracket subset operator on the column is being used. All the rest of this is the same: we set up a figure, we define an axis, and we use that axis for the titles and so on. And once again, sorry, I should have gone through and made these all a reasonable size so you guys can see.

So here's a histogram of engine size. When you look at this, there's a feature of this histogram that really stands out. What is that, anybody, about the nature of these autos? Because you've got really tiny engines down here and some really ginormous engines up here. But what do you see about that distribution? It's very skewed, right? Auto manufacturers, at least in this sample, tended to have cars on
the small end of engine size. There was a question: could you do this with, like, curb weight and engine size? Well, if you want to plot two factors... yeah, you could play around with that. I don't know if I can do this in real time, but I think you have to insert a list here. Let's see... I think that'll work. "You need one more bracket on the right." Oh, right. Oh yeah, but you see, this is the problem: you get this clunky thing, right? "I was wondering if it would group them together, almost like..." No, it tries to plot them in the same set of axes. So if you want to look at engine size versus curb weight, for example, it's probably a better option to use the scatter plot. This is a case like what I was talking about right at the beginning: you need to think about what you are trying to view. When you reach into your toolbox of plot types, which plot type is actually going to display what you're hoping for? You could do things like have two axes side by side, or one above another; you could show the two histograms, if that somehow meant something to you in your analysis. But if I was going to compare those, I would want to look at a scatter plot. That would be my first thought.

All right, so let's talk about box plots. How many people know what I'm even talking about? Okay, close to half the room. Good. So the idea of a box plot... I think Tukey published this originally in like 1960 or something, so it's been around a long time. And I think this may actually go to the question we just had, of comparing distributions of different variables. So what if you have something like engine size that we were just
looking at, and then you have some categorical variable like fuel type? In this case I think these cars either have gasoline or diesel engines, so there are only two fuel types. So what we can do is select those two columns, engine size and fuel type, and we use the boxplot method, and we say by equals fuel type. That's like a group-by: we're going to group cars by fuel type, and we're going to display this box plot, which I'll explain in a minute. And once again I got it ridiculously large. It's probably that the projection is lower resolution. Oops, I meant to do it the other way. Okay.

So what you can see is, on the left-hand plot here we've got the diesel engines, on the right-hand plot we've got the gas engines, and then we have engine size. So how do you compare these things? The red lines are the median values, so you can see the median diesel engine is a bit bigger than the median gasoline engine. Then one quartile of the data above the median and one quartile of the data below the median (by quartile I mean one quarter of the values), that's what defines the box. You can see the boxes almost overlap here. Then we go out either till we run out of values or up to one and a half quartiles, and that's called the whisker. And then if we have outliers beyond that, we put these plus signs.

So we can learn a little bit from a plot like this. It's not totally definitive, but it looks like maybe the bulk of diesel cars have slightly bigger engines; you can see that box is a little higher than this box. It's probably not that significant given the overlap. But you can also see the gasoline cars have a much wider range of engines, from barely 50 cubic inches way up to some pretty ginormous engine that's like 300 and, I don't know, whatever that would be, 25 cubic inches or something, much bigger than any diesel car. You see those plus signs for the outliers. And so that's a way to compare, and you
could have multiple; we'll look at some examples where we have more categories and you can quickly compare.

So now we're going to talk about Seaborn, and Seaborn works a little bit differently from pandas plotting. By the way, is everybody with me up to this point? OK, good. All right, so assuming everybody has Seaborn successfully installed, which generally doesn't seem to be a problem (and I'm not quite sure why all Anaconda distributions don't include it)... was there a question? OK. So the simplest recipe that I've laid out here for making a Seaborn plot is: import the package, which you always have to do; set a style for the plot grid, which is kind of like setting up the axes for the other plots we've seen, although it's not quite the same, and you'll see why in a while; and then you define the plot type, and you have to say which columns you want that to apply to. So in this case we import seaborn as sns. Then I have this method on sns called set_style, and I'm going to pick a style called whitegrid. That's just the most generic style: a white background with gridded axes. Pretty simple, pretty generic. Then I'm going to use a method called kdeplot, for kernel density estimation plot, and I'm just going to do that on engine size, just on one column. Now again, if you had columns for engine size for diesel and gas or something, you could do different things here, but let's just do this real simple thing. Oh, and notice that sometimes you get this weird deprecation warning.

So this looks a lot like, remember, the histogram we had of engine size. Imagine that we took what's called a kernel estimator, which is like a Gaussian-shaped weighted curve, and we ran that over the data and found a smooth fit over that histogram. Instead of binning, we tried to get a smooth fit over the density. And there's one other thing they do: it's a density, it's a
probability density, so the integral over this whole curve has to add up to 1, which is one of the axioms of probability, right? But in this case I don't think we're learning a whole lot more. We're seeing that, as we discovered before, manufacturers have a tendency to build cars with smaller engines; it's only a few outlier cars that have these really large engines. OK, now notice also what went away here, because I did this in a real simple way. What's missing? Yeah, my axis labels and my title and all that, because I didn't do anything about it. So you can probably guess that's what we're going to do next. Again we can define a figure, define a set of axes, and when we call this kdeplot method we say ax equals ax. And then, just as we did with the pandas plots (and you can do this with matplotlib plots, so that's why it's important to understand that all this stuff is coming up from matplotlib), this whole business of set_title, set_xlabel, etc. is all just matplotlib stuff that can be superimposed on any of these other plots. And so now I've decorated this plot with proper axis labels and a title.

OK, so before we get off kernel density estimation plots, I'll just point out that there's another thing you can do, which I think is actually sometimes really cool. Someone pointed out that we had a lot of overplotting when we looked at engine size versus price, and this is partly a way to get around that. Oh, yes, what was the question? That's a deprecation warning. With some versions you get that. I actually ran it down once, after reading pages and pages of discussion, and they said, oh, it's a bug. So that's why I made a note earlier on to just ignore that if you get it. Is anybody else besides me getting it? No? A few people. OK, so it depends what version
of Anaconda you've got, I guess, or which version of Seaborn you've got. I'm not sure; I think it's actually coming from matplotlib, not even from Seaborn. But feel free to ignore it. That's why it's a warning and not an error message, and it also turns out to be an incorrect warning, because it's not actually a problem.

OK, so 2D plots. It's the same kdeplot, but now I just give it a list, so I have a square bracket, then a list with another square bracket. I've got two columns, engine size and price. Oh, and someone asked about colors: the cmap is like a color palette I'm going to pick from, and you'll see why that is. If you look in the Seaborn documentation or the tutorials, somewhere down here there's a whole menu... yeah, color palettes. There's a lot of stuff about color palettes. They do a really nice job of color palettes with Seaborn; there's a lot of flexibility, a lot of cool stuff you can do. They've also thought about the fact that a big percentage of people are red-green color blind, so they try to steer away from red-green palettes and stuff like that. Whoever did this really knew what they were doing.

Yes? "In the last tutorial, when they were doing the plotting, it was a method of the data frame itself. What's the difference between doing it this way versus the other way?" Well, OK, we're not using pandas; this is now Seaborn, so it's a completely different thing. But you could also... let's go way back to the pandas plotting. Let's see, what was our last... like this one. I think I could do counts.plot and then, let's see, well, we can try it. I think it's kind equals, and then I think bar has to be in quotes, and then I need a comma. All right, oh, and get rid of that. I think that will come up with the same plot. Yeah, see, the answer is it doesn't matter. It's a style thing. You can do it either way; when
Quentin did his pandas tutorial he was using a style like this, counts.plot, and then he would say kind equals, whereas I'm doing counts.plot.bar, the dot method, like this. I don't know, do you like arguments to your methods or a chain of methods in your code? As far as I can tell they're interchangeable; I've never noticed a difference. But it's a good observation of the detail.

OK, so let's get back to... yeah, here we are. So I was explaining: I've got this list of two columns here, so I'm subsetting my auto prices data frame to just those two columns, and I've set this color palette. So let's let that go, and there you go. I'm sorry; I tested this on my standalone display, which obviously has more resolution than the projector. All right, so now you can see it. This is like a contour map, like a topo map or something, and you can see what that color palette is doing: the lowest rungs are such a dark blue they almost look black, and then the top rung is this really light blue up here.

When we talked about overplotting, there were a lot of data points. Let's go back to the scatter plot and you'll see what I mean... way back, sorry. See, in some of these areas there are just so many dots you can't really tell how many are on top of each other, and we call that overplotting. And we only have, as we saw, 195 cars in this sample, but imagine you were doing this on customers for an e-commerce site where you might have millions: overplotting becomes an overwhelming issue, and so you need other methods to deal with it. One of them is this kernel density estimation plot. Another, which I don't think we go into, is called a hexbin plot. Hexbin plots... well, if you look at my GitHub you'll see
actually, I think I just have a link to it, some work Ryan and I did where we were making scatter plots of every home sale in the United States for the last 20 years or something like that. It was a fairly big data set, not "big data" big, but big, and we used these hexbin plots and it gave a similar effect. Whereas if we had done scatter plots, we just would have had blobs; there would have been so many dots one on top of the other, it just would have been an uninterpretable blob. So when you're making plots, if you wind up with blobs, think about other methods like this 2D kernel density estimation or hexbin plots.

So is it clear to everybody why this is giving us what it's giving us? And what do you conclude from looking at something like this? Oops. So the question... a legend, you mean like this plot? Oh yeah, right: because we're doing engine size here, it gives a legend. But in this case I've given it two columns that we're contouring over, so it's price on the vertical, engine size on the horizontal. You don't really need a legend.

Oh, a 3D plot? Yeah, you can do that in matplotlib. I don't think there's any 3D plotting in Seaborn; in fact I'm pretty sure there isn't. So the answer is you can do it, with enough work, with matplotlib primitives. In fact, I think in the resources I gave you, if you click through and look at some of those tutorials, there are actually some examples of that. But it's a lot of code, so if you really need to do it, do it. I'll also make the comment that I'm not keen on 3D plots. This is something Cleveland looked at, and some other people; I think Tufte even talks about this. If you have a 3D surface, you have to have an eye point of perspective, and how peaky and rough people perceive your surface to be depends on how it's oriented, so it can be very tricky. In fact, you'll notice I
don't... I mean, there's a number of types of plots I don't use. I don't use 3D plots generally. I don't use pie charts, because (I think it was Tukey who said it) they're an excellent tool for obfuscating almost any point you're trying to make, even though if you look at infographics and news sites and stuff they're very popular. I have some examples for another class I teach, which is longer, and we look at some having to do with elections in various countries where there are a lot of political parties. It looks like some sort of psychedelic pattern; you cannot tell which party has what percentage of the vote in the parliament or something like that. It's just useless.

So anyway, when you look at this plot, it's actually telling us something pretty interesting, assuming you're that interested in the price of cars. But what is it? Yeah... well, I wouldn't say they're the same, so I don't think that's quite true, because the units are quite different for one thing. But they do tend to be elliptical. There is a high covariance, right? Elongated ellipses on these contours sort of indicate that. But there was something even more obvious than that. Yeah: so a hundred, well, I guess it's cubic inches, but yeah, there were lots of cheap cars with roughly hundred-cubic-inch engines on the market in 1985, because there's a clear peak, and there's only one peak in this.

I think there's a question way in the back. OK, yeah. So the question is, if we did this with city MPG, do you get two? It actually gets a little more complicated. So first off there are obviously some outlier cars that... oh, well, this is city miles per gallon; I didn't change the label, apologies. Anyway, there are some pretty low fuel efficiency, expensive cars. There seems to be a cluster of those. Actually, I
can spill the beans: those are luxury cars, so they're not built for fuel efficiency, up in that corner. And then you see there are two modes to this, where manufacturers either want these really cheap, high fuel efficiency cars or, I imagine, sort of midsize cars or something like that. This is a good point, though. I mentioned this when we started: you always want to look at lots of views of your data. There are only 20 columns or so that we're working with in this data set, but you can see that even with just 20 columns there's a fair amount of complexity here. If you were trying to really understand this, maybe to build a machine learning model to predict the price of cars based on certain attributes, maybe for your competitors if you work for a car company, you would really have to work quite a bit to create a deep understanding of what these data are telling you.

So another interesting plot is the violin plot. It kind of combines the best of box plots and density plots. Like the box plot, we have an x, which is a categorical variable (and it should always be a categorical variable), fuel type, and then we have a y, which is a continuous variable. We'll just use engine size, but we could use price; you could use lots of different columns. And like I say, in practice, if we were really doing this exploration as a real project, we'd probably be doing hundreds or thousands of plots till we figured it out. But let's just do this one. OK, and so you can imagine that kernel density estimation plot we looked at for engine size before, where we just looked at all cars. It's got the same thing on the left and on the right; it's just symmetric, so you can look
at either side, and it shows you there is a distinct peak in the gasoline cars, whereas for the diesel cars engine size is more uniformly distributed. And you can actually see they've done a little box-plot kind of thing in here: the white dot is the median, and you can see the upper inner quartile, the lower inner quartile, etc. And you can also see, as we saw before with the box plots, there are a lot of outliers in gasoline cars; there are apparently a lot of gasoline cars with really big engines. OK, everybody OK up till now? OK.

So we're going to switch gears now. We're about halfway through our time, or a little more, and we're going to start looking at ways to extend beyond two dimensions. Everything we've done has been one or two dimensions: one-dimensional plots like histograms and the 1D kernel density estimation, 2D plots like the scatter plots and the contour plots. So let's start looking at aesthetics. We're going to look at color, transparency, size, marker shape, and some plot-specific aesthetics.

Color is a very tricky aesthetic. It seems very simple, like, oh, I can just color, but there are a lot of problems. I already mentioned the issue that a lot of people, especially men for some reason, are red-green color blind, so when you think of what palette you're going to use for colors, consider those kinds of things. Seaborn actually does a pretty good job: the standard palettes try to avoid ones that are hard for colorblind people. There's a method in Seaborn called... and also, don't pick too many colors. I just saw a plot the other day, something we were working on, where someone had done like 20 colors, and you couldn't tell the difference between all these shades of red and all these shades of green, and we were going back and forth and back and forth
between the legend and the dots, and everybody saw something different. So don't go crazy. Actually, that's true of many of these aesthetics: you can show a certain number of other dimensions, but you can't show huge numbers of values.

So this lmplot, it just means linear model plot, but I'm actually going to set fit_reg to False. Otherwise you get regression lines on the plot, which can be really useful, but in this case I don't want to confound what we're really looking at, which is that we're going to set the hue as fuel type, and I'm going to use this Set2 palette. OK, so right away you can see what happened here. Fuel type is gas or diesel, and you can see the red dots are the diesel and the kind of greenish dots... OK, I guess they did pick red-green in this case, or at least I did. I guess they're more like blue-green. But anyway, you can see something interesting now about this price versus miles per gallon. What does that tell you about gasoline versus diesel cars? Anybody? Yeah: you get more mileage for the amount of money you spend on your car if you buy a diesel car, which everybody probably knows, but here you can see that the diesel cars are to the right of the gasoline cars at almost every price level. And remember, before we saw this clump of very high cost, low efficiency cars. Those are these luxury cars. If you don't believe me, you can make a plot and just select Mercedes and Porsche and Jaguar and a few makes, and you'll see that's true.

So it gets a little complicated. We're not going to have the time to go through this in detail, but I wanted you guys to see that you can get a lot more control, and you can use the pandas plotting methods. I just want to give you the general recipe.
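A minimal sketch of that pandas recipe for coloring a scatter plot by category, assuming a data frame shaped like the auto prices data. The column names, the handful of made-up rows, the `scatter_by_category` helper, and the extra `electric` level (included only to exercise the empty-subset check) are illustrative and not taken from the notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the auto prices data; column names are assumed.
auto_prices = pd.DataFrame({
    "city-mpg": [21, 24, 30, 19, 37, 33],
    "price": [13495, 16500, 7900, 28000, 7100, 9500],
    "fuel-type": ["gas", "gas", "gas", "gas", "diesel", "diesel"],
})

def scatter_by_category(df, x, y, by, colors):
    """Scatter y against x on one set of axes, one color per level of `by`."""
    fig, ax = plt.subplots(figsize=(6, 6))
    for level, color in colors.items():
        subset = df[df[by] == level]
        # shape[0] is the number of rows; skip levels with no data
        if subset.shape[0] > 0:
            subset.plot(kind="scatter", x=x, y=y, ax=ax, color=color, label=level)
    ax.set_title(f"{y} vs. {x} by {by}")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=by)
    return ax

ax = scatter_by_category(
    auto_prices, "city-mpg", "price", "fuel-type",
    {"gas": "darkblue", "diesel": "red", "electric": "green"},
)
```

Each extra category costs one more entry in the `colors` dictionary. Compared with Seaborn's `hue` argument this is more code, in exchange for direct control over how each subset is drawn.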
Again, I define a figure (we make it smaller, so it doesn't spill out again) and I set some axes, and I basically subset my data frame for gas. So I've created two data frames here, for gas and diesel, and if I had four categories I'd have to do four data frames. I also make sure that there's something in that data frame. Shape is the number of rows and the number of columns in the data frame, so shape zero is the number of rows. If the number of rows is greater than zero, I'll go ahead and plot it, because I don't want this function to error out if I happen to have an empty one. But that's kind of a technical detail. So I'm going to plot both of these. You can also do this as a list; there are probably more elegant ways to do this, but I thought this was more straightforward. And I've got kind equals scatter (to someone's question from before: in this case I'm not doing the plot method as a dot, I'm doing it as a kind), and I'm going to give those the color dark blue and those red. Oh, and then I'm creating some legends here. So I just do that, and I have essentially the same plot you just saw using Seaborn. A lot more code, but I just wanted to show you guys, for the future, that you can get a lot more control this way at the expense of a lot more code.

So let's look at transparency. We looked at one method for overplotting, which was the 2D density plots. Now we get to something where you have to use the recipe I just showed you, because I can set something called alpha. And by the way, alpha is the same in Seaborn, it's the same in matplotlib, it's the same in pandas. Alpha is a transparency value: alpha of zero is completely transparent, you wouldn't see the points at all; alpha of one is what we've been working with, perfectly opaque, you can't see through it at all. So we're going to try 0.3, and you see these single points up here
look pretty fuzzy now, and you can start to get a better idea, down here where there's overplotting, of where there are really a lot of points or maybe just one or two on top of each other. OK, so transparency is pretty useful.

So here's a little code exercise for you; yeah, we have time for that. Go ahead and copy-paste and do the same thing for engine size and curb weight. This one is price and city miles per gallon, but go ahead and create a similar plot for engine size and curb weight. You can try pandas, or... I think I'm going to try it with Seaborn and see if I can get it to work. Oh, for those of you new to notebooks, if you need a new cell, just go to Insert Cell Below and you get a cell for your code. What did I say, engine size? That's the problem. OK, so you see, Seaborn doesn't know about alpha. There's a way to... huh, it didn't like it. So that's the argument? Oh yeah, OK, I'm remembering this now. I'm trying to remember: is it an argument to lmplot, or do you have to do another call on sns with a different method? It seems like something you ought to be able to do, but... [laughter] OK, just guess. All right, so I kind of did it the hard way, although it sounds like there's a better way to do it with Seaborn. But you get the basic idea. It's just a different view of the data: engine size and curb weight, not too surprisingly, are pretty highly correlated, and generally, if you flip the axes around, you'd see that cars tend to have somewhat smaller engine size per weight if they're diesels.

OK, so the last thing we've got here is marker size, and marker size is again one of these tricky things where you don't want to get too carried away, but you can use it on a continuous variable. This is the same code we've been
working with, and what I've done is set s, which is for the size attribute, and I did 0.5 times the engine size column of auto prices just to get a scaling. Well, let me make this plot and then you'll see what's interesting about it. You see the size of the dots now varies with the engine size. We've got gas and diesel cars, we've got price and city miles per gallon. I advertised that we were going to look at multiple dimensions using these aesthetics, so how many dimensions are we actually plotting here now from our multi-dimensional data set? I think somebody's saying four. Yeah, that's right: we've got type of engine fuel, we've got engine size, we've got price, we've got city miles per gallon. But can you guys see that there's a bit of a problem? These dots aren't that different in size, like for these luxury cars with the really big engines. Clearly you can see that for some of these economy cars they're smaller, but it's not that distinct, right?

So one trick is to go with engine size squared. I've got auto prices, and I just multiply the two columns together here; you could create a new column in the data frame too. Let me rescale that for you. To my eye, and probably to yours, doesn't that look a lot different? You can see these cars with these really big engines over here and a few down here, and really small engines in some of these other areas. So keep that in mind. The difference is that engine size is now proportional to the area of the shape (and you can change shapes too, we'll talk about that in a sec) as opposed to the linear dimension of the shape. This is work people like Cleveland and others have done: the human eye is more sensitive to the area of the shape you're looking at than the linear
dimension. And so finally we are going to use marker shape. I know it's getting kind of confusing: we've got size, we've got color, we've got alpha, and now we're going to do marker equals. OK, so I've got this list of o's and pluses, and a list of colors, and you can see turbo cars are dots and standard cars are crosses. So if it's a turbo, it's a circle; if it's a standard car, it's a cross. Once again, I didn't think about how different the resolution would be on the screen. There you go. So how many dimensions are we now projecting onto this two-dimensional surface? Anybody? Five, yeah, five is right, because we've got whether it's standard or turbo, whether it's gas or diesel, the engine size, the price, and the fuel efficiency. So five dimensions on a two-dimensional surface. We're using three different plot aesthetics here, which, at least the way I did it, is quite a bit of code, but that's not so important. I'm going to skip a little bit of this. Yeah, it could be; I should have used triangles or something. His question is that the plus shapes maybe don't look as weighty as the equivalent dots, to some people's eyes. So you could use triangles or squares or something like that. That might have been a better choice, something where you see an area.

So let me just go here. Plot-specific aesthetics are lots of things; here I'm just going to do this real quick with histograms. What I've done is change the number of bins. We're back to pandas plotting here. In one case I just took the default, and in one case I said bins equals 40. I think the default is ten, and you can see they're pretty chunky bins, but the histogram looks kind of smooth and easy to interpret. What's the problem with 40? We see more detail, but does that really tell you more? What do we think the
problem is there? It's just looking at noise. It doesn't mean that much that it's jumping from one little bin to the next by so much. So obviously I've picked too many bins. When you're creating things like histograms, or density plots where you can set what's called the kernel width, think about it and try different things to find something that suits your data. Now if we had ten thousand values instead of 195, maybe even smaller bin widths would start to make sense for a histogram, but we only have 195 data points, so we can't push it too much.

So, a cool thing you can do. We already looked at color with Seaborn in some of those other plots, but here I'm doing hue equals aspiration, and I'm doing this for the violin plot. It's the same violin plot we looked at before with fuel type. So again we just set our grid, we apply the violinplot method: fuel type, price, aspiration. So the new thing is this aspiration, and split equals True, and you'll see what that does in a minute. Remember, before, the two sides of the violin plot were just mirror images of each other; they weren't different at all. But now we can see whether they're standard or turbo on either side, and it shows that in a nice different color. So how many dimensions have we got on this plot? Three, yeah: we've got whether it's gas or diesel, whether it's standard or turbo, and the price. And you can do the same thing for engine size, etc., or with box plots. Let's do something a little different. So body style: there are a lot more body styles. Yeah? Sorry, I can't... in this plot? Yes, there's a median, and, oh, the median is given for the whole group. It's not giving you two medians; you have to look at the density plots on the left and the right to get the differences. Yes, so you could create other subsets
or other variables too. We don't need to go through this, but because there are five body styles, you can see you get quite different box plots for gas and diesel depending on whether it's, I think one of these is a convertible, etc. So I just wanted to show you that.

OK, so we've got half an hour, so I think we've got enough time to do a little exercise here. We just looked at that violin plot, so try it with drive wheels. Just make the violin plot of price, so it's price by aspiration, but make the hue drive wheels, and you'll see it'll be different; there are three types of drive wheels. It's pretty quick to do that, just do some cut and paste. I think that's right. What's the variable... drive wheels. Oh, I'm sorry, yeah, it's not the hue. I should have kept the hue the same. This should be drive wheels for the x, and this one then should be fuel type. OK, right, that's what I wanted to show. And the point is, as the person pointed out, the error message said you can only use a two-level categorical variable to do this color split. So obviously I could use fuel type, gas and diesel, or I could use aspiration. But there's this interesting thing with four-wheel drive vehicles. What do you see? Yeah, there are no diesel four-wheel drive vehicles in these data. And you can see that the front-wheel drive vehicles tend to be the cheap cars, regardless of whether they're gas or diesel; they're all clumped down there at the cheap end. Rear-wheel drive cars have this big range, and four-wheel drive cars for some reason are down there.

So I promised we were going to look at multiple axes, so let's spend the rest of our time on that. Everything we've done has just had one x axis, one y axis, up
till now, so now we're going to look at multi-axis plots, and there are a couple of ways to skin that cat. This one is called a jointplot in Seaborn, and we're going to do engine size by price. A joint plot is just a scatter plot, as you see, and then it gives you the histograms of what we would call the marginal densities along the edges, so that's why it's called a joint plot. You see we've added these extra axes, which are the counts or the densities on the edges. So that's just a simple example of a plot that tells us something new but that you can't create with just one pair of axes, right?

So another, more complicated plot is a pairwise scatter plot, and Seaborn has this thing called pairplot. You've probably seen these before; they've been around for a long time, at least 30 years. I'm going to define the numeric columns to be these. You can do categorical columns, but I'm just going to stick with numeric columns. We're going to use fuel type as hue, and we're going to use that same palette. On the diagonal I'm going to put kernel density plots, and we're going to do KDE plots, those 2D kernel density estimation plots, above the diagonal. This might take a little while to compute, and here's a warning message, which I'm choosing to ignore for now.

OK, so there's a lot going on here, as you can see. The way you read this is you've got length, curb weight, engine size, horsepower, city miles per gallon, price, and then you've got exactly the same variable names on this axis. So you've got price by length down in this corner here, you've got price by curb weight. But if you want to look at curb weight by price... let's see, curb weight... the kernel density plot of curb weight by price will be this one here. You can see there are sort of two peaks in that. You can see, for gas and diesel cars,
because that's what we used as the hue, you can see the histogram, or the kernel density plots, for each of those. So now how many dimensions do we have on our plot? Count them up. Nine, because we've got eight numeric columns plus fuel type as a categorical, so we've got nine.

All right, so someone asked this question about multiple plot axes. We kind of snuck into this once before. Here I've got my figure and my axes (that's typical Python stuff, with the comma, because this function is going to return two things, a figure and an axes array) equals plt.subplots, two by two, so that means I'm doing a two-by-two array of subplots. And this is way too big a figure size; one second, I'm going to do eight by eight. Then I've got this little function so I can say what my x column and my y column are, and we're just going to do histograms. You can see I subset with some kind of gnarly notation here, and then hist, so I'm actually using matplotlib's histogram method here. I've gone all the way down to base matplotlib for now. I could have created data frames here and used pandas or something like that, or Seaborn, but at least in my view it was easier to just create two lists called x cols and y cols and enumerate over them. But the important thing is... well, I've got all sorts of overplotting here, so I'd have to work on this more to get these separated a little bit. But I think you get the idea: the key thing I've done here is I've defined a figure and a set of axes that's a 2 by 2 array of axes. You guys all see that? That's the important thing to notice here. And you can see I'm having problems: I've got some overprinting, and I can't really read my axes, and my axis labels
are a little funny. If I wanted to compare these histograms, I'd have to go back and write code to make sure that the x-axes were exactly the same on each of these.
But fortunately there's an easier way to do this, which is called facet plotting. If you run into this, it can be called conditioned plotting, which I think is what Cleveland originally called it. It can be called the method of small multiples, which is what Tufte originally called it. Nowadays most people call it facet plotting, and it's sometimes called grouped plotting. But think about it as a group-by operation where you divide your data and then plot it on multiple axes.
Okay, and the basic idea is, remember we just created a standard grid before with Seaborn, but now we're going to use this FacetGrid method, and I'm going to say columns equals drive wheels, so I'm going to do a group-by on drive wheels. For whatever I plot here I use this map method, and I'm going to use a regplot of engine size and price, and I'm not going to show the regression line. So is that clear, what's going on there? We're creating a grid which only has columns, so it's one row by however many columns of drive wheels, which will be three, and then I'm plotting engine size and price.
Okay, so there you have it, and you notice some interesting things. First off, the FacetGrid method makes the axes all exactly the same length, so I can compare one plot to the other to the other. That's really nice. It also figures out that I don't really need another set of axis labels here and here; I can just use those. So it just saves me a lot of time, and it does the group-by for me. I had that gnarly stuff in the code earlier where I was effectively doing a grouping or a subsetting. And you can do, like, let's try one with drive wheels by body style. So now we're doing columns and rows. Okay, so
our grid now is going to have two dimensions; the last one was a one-dimensional grid. In this next set of code it's the same engine size by price, but we're going to create subplots by drive wheels and body style. All right, and you can see there are three types of drive wheels and there are actually five body styles, so it's a lot of different plots. There are a couple of things you can see. Like, there are no four-wheel-drive hardtops; you see there are no dots there, right? And here there's only one, and you see how I'm reading these: body style equals hardtop, drive wheels equals front-wheel drive, there's only one. So most hardtops, I'm not even sure what a hardtop is, but whatever they are, they're mostly rear-wheel-drive cars. And you start to see some price relationships for these different subsets. This is a very limited data set, as I keep saying, it's only 195 points, but if you had thousands of points and you wanted to explore high-dimensional projections of them, this facet-gridding method gives you a really powerful way to do it.
And you can add aesthetics, so let's do hue of fuel type. Basically I'm just making the same plot you just saw, with fuel type. I think the diesel cars are the red dots. There are no convertibles with diesel engines, so you don't see any red dots there. In fact, there are no four-wheel-drive convertibles, so there are no dots there at all. But here you start to see sedans with front-wheel drive, and they're kind of tightly clustered compared to, say, sedans with rear-wheel drive, which have a wide range of prices and engine sizes. So it's not the best example of using this kind of scatter plot.
So, one last exercise, if you guys want to hang on for a bit, which we have time for by the schedule. Go ahead and do this, but change one of the axes to city miles per gallon instead of engine size, because
remember we started with that. You can pretty much cut and paste and make a few changes here, but we'll keep all the other faceting variables, and the color, the hue, is fuel type, etc. So it's really a pretty minimal change.
So, everybody get it? It's just a very simple change here: we just had to change this to city miles per gallon. And you can see, yes, you do see some interesting stuff. Again, sedans with rear-wheel drive have a big range of price and a big range of fuel efficiency, but there are no highly fuel-efficient sedans there; they're kind of in the mid-range. Front-wheel-drive sedans have better mileage. And you keep going down here, let's see, where are the hatchbacks? Oh yeah, hatchbacks. So you can see all the really, really fuel-efficient cars are hatchbacks. So even though we don't have an ideal data set for this kind of gridding method, I hope you guys can see what the value is in terms of dividing data into small chunks and being able to compare them on exactly the same axes, side by side by side. There are also ways, if you want to free one axis and not the other, you can set that. There are all kinds of specialized things you can do with this, and you have to think about what to do as you're visualizing a data set.
So, in summary, I hope I've given you a bag of tools and also pointed out some places you can go for tutorials and to learn more, so you know how to do this with Python. We also looked at the process of how you explore data, why you explore data, and what you can learn from exploring data when you have complex relationships and you really want to understand them before you waste a lot of time doing some complicated analysis where you don't really understand what the simple relationships might be between your variables. So are there any questions? Oh, got a couple
back there. Yep, so the question is, can you make these graphs interactive? Not with the packages we've been looking at. You can do some animation; matplotlib has an animation system that's actually kind of cool, so you could step through a bunch of categorical variables or something. We don't have time to go into that. If you want truly interactive Python graphics, I suggest you look at the Bokeh package, which is a whole, we could do another two hours just on that. It's its own world. The other nice thing about it is that it's directly usable in websites, so it actually has a couple of really cool things about it, but like I said, it's a whole other thing. Oh yeah, there's lots of stuff on Bokeh. I think there might even be a Bokeh book or two. Yeah, right, there's a talk on Friday, yes, for sure, but not a tutorial at this conference, although it would be nice if there were. Yeah, that's a good suggestion too: ipywidgets, if you want to do simple interactive stuff. Those are pretty nice too, and they're specifically for Jupyter, so if you're doing it in Jupyter then that might be what you need.
There was a question there, yeah.
Hi, so I'm relatively new to this visualization with Python, and let's say I have an idea for a plot I want to make. How do I decide where to look to learn how to do that? matplotlib, Seaborn, pandas, how do I decide?
Yeah, so the question is, given all these different packages, and we just breezed through the surface of three of them, how do you decide? I don't think there's a best answer to your question. My general rule of thumb is this: I look at Seaborn first, usually, because it is more abstract. They abstract a lot more things; as you saw, we didn't write nearly as much code to get fairly complex plots out of it. Second might be pandas plotting, especially if your data is in a pandas
DataFrame. And then I only use matplotlib if I have to really get into some detail. I hope those are some helpful guidelines, but there isn't an absolute best answer. It really depends on what you're trying to do, unfortunately. Any other questions? Thanks, Stephen. Yeah, thank you. [Applause]
Info
Channel: PyData
Views: 27,038
Keywords: jupyter
Id: KvZ2KSxlWBY
Length: 111min 12sec (6672 seconds)
Published: Mon Jul 24 2017