Seaborn ecdfplot | What is an ECDF plot? And how to code an ECDF plot in Python seaborn

Video Statistics and Information

Captions
Hi everyone, welcome or welcome back. Today we're talking about a brand new distribution plot in seaborn: it's called the ECDF plot. [Music]

So for starters, what is the ECDF plot? Well, it's a distribution plot, and like the displot and the histplot, the ECDF plot was released in seaborn version 0.11. Its name looks like a rearrangement of the alphabet, but it actually stands for empirical cumulative distribution function, so you'll be able to use it to look at the distribution of your data.

To make an ECDF plot, let's say that we have 10 observational values like these. Our first step is to sort these values so that we have the smallest value first and the largest value last. Now we're going to plot two axes: on the bottom, or x-axis, we're going to put these values, and on the y-axis we'll put the proportion of values that we've seen so far.

Okay, so how do we plot these data? First, since 1 is the minimum value of these data, our ECDF plot starts from there. This observation of 1 represents a certain proportion of our data; in fact, in this case it represents 10% of our data, so we'll draw a vertical line up to 0.1 on the y-axis. So far we've looked at 1, this very first value. Now we're going to imagine taking that bar and sliding down our values list until we get to the next value, which is at 1.5. We didn't encounter any other values until we get to 1.5, so on our plot we'll just extend our line over to the right until we get to 1.5. Now, since we do have a value at 1.5, once again this represents 10% of our data, so we'll draw a vertical line straight up. Since we've sorted these data, so far we've encountered 20% of these data, so on the y-axis we're currently at 0.2.

And we're going to continue to do this, but the next time that we slide the bar we actually come across two values at the number 2. So our plot again goes to the right, over to 2, but now, because we have two observations at 2, we'll actually extend this vertical line upward to 0.4 because we've seen forty
percent of the data so far. And that's how the ECDF plot works: we'll just continue extending our line until we've reached all of the data. Finally, when we get to our maximum value of 5, that will represent 100% of the data, so our proportion is now at 1 on the y-axis.

So what are the advantages of the ECDF plot? Well, unlike a histogram or KDE plot, there's no binning or smoothing here; we can actually see how each observation is affecting that cumulative distribution. Also, because we're just plotting lines here, it's often a lot easier to compare different distributions across categories. On the negative side of things, we are no longer able to see the central tendency or spread of these distributions as easily; for example, the mean or the variance are a little bit harder to detect here. It's also a little bit harder to detect when we have a bimodal distribution. But given all of that, let's go ahead and take a look at the ECDF plot and the seaborn Python code.

To get started, I'm going to load in the seaborn library as well as the pyplot module from the matplotlib library. And by the way, all of the code I'm about to show you is available on my GitHub page. Next, I'm going to load in a data set from the seaborn library, and this data set is called tips. It's about the tip amount that various different people left their server, and we also have other characteristics here, like the day and the time that the tip was left. If I take a look at the shape of this, I have 244 observations and seven different columns. I'm going to go ahead and set my styling to be darkgrid, and now we're ready to create our ECDF plot.

To do that, I'll call up the seaborn library and ecdfplot. Then I just need to pass in here what will be my x value: this data set has a column called tip with the actual tip amount that each of these patrons left, and I'm also referencing my data that's coming from the tips data frame. Okay, once we hit enter, we see our ECDF plot. You can see along the x-axis we have the various
different tip amounts, and along the y-axis we have the proportion of the data we've seen so far.

Okay, so a couple things to notice here. You'll see that the chart begins at a value of 1 and it ends at a value of 10. That's coming from the minimum and maximum values for these tip amounts: if we reference that tips data frame and the tip column and I take the minimum, I'm going to see a value of one dollar. Likewise, if I take the tips data frame, the tip column, and check the max, that will be 10. So your ECDF plot is going to range from the min to the max of whatever data you're plotting.

You'll also see that there are certain areas where we basically have a vertical line, like this one at 2. That's coming from the fact that we have a lot of values that are exactly two dollars. If I take a look at my tips data frame, the tip column, and I do value counts (I'm also just going to take a look at the top part of that, so I'll do head as well), the most common values that people left were two dollars, three dollars, and four dollars. So if we go back up and look at the plot, we'll see that there are a lot of values happening at two dollars; in fact, two dollars represents over 13 percent of our entire data set, so you're seeing that proportion jump quite a bit when we get to those values. That's something else that you can keep a lookout for in the ECDF plot: these large vertical jumps just mean that there are lots of observations at that particular value.

The other cool thing about this plot is that you can use it to see what percentage of data are below or above a certain value. For this I'm going to use the pyplot module, and I'm going to add in a vertical line. Let's say that we added in a vertical line at a value of four dollars, and I'm also going to make that line black. From the ECDF plot, this is telling me that by the time I get to four dollars, I've seen eighty percent of the data so far. So eighty percent of the tips
left in this data set were four dollars or less. Likewise, I can see that the remaining twenty percent of tips are greater than four dollars. So that's how you can interpret the various x values along with the proportions that are listed on the y: you're seeing a cumulative proportion of the values that you've seen so far.

So far we've had the tip amount on the x-axis, but this plot also allows you to put that on the y-axis instead; you'll just switch your argument to y. Now you'll see that we have the proportion along the x-axis but the tip amount along the y-axis, so this is really just a reorientation of our data. Now if I put a vertical line on this plot, let's say I put it at 0.5, and again I'll change that color to black, this is showing me where the 50th percentile is in these data. In fact, if I take the tips data frame, check out the tip column, and look at the median, you'll see that that's $2.90. So when my proportion is 50 percent, or the median of this data set, that value is two dollars and ninety cents, as shown by the y-axis.

You have several additional options when working with the ECDF plot. For example, you can use hue to split out different categories, so let's take a look at those options in the seaborn code.

Like many other seaborn plots, you can use hue to show off categorical variables within your ECDF plot. To do that, you'll just reference the hue argument and pass in whatever the column name is that you'd like to split up on. For us, let's split up based on time. There are two different times in this data set, either lunch or dinner, so here we're seeing an ECDF plot for both of those separately. In the blue line we see that there are lots of people who leave two-dollar tips during the lunch hour, and in the orange line we see that there are a lot of three-dollar dinner tips. This is really nice because you can compare the distributions of these two different categories within your data set. Overall, dinner tips tend to be a little bit
larger than lunch tips. But using hue within the ECDF plot really becomes beneficial when you have several different categories, so now let's split up the hue based on the day. Here we have four different categories for Thursday, Friday, Saturday, and Sunday, all of the days that are contained within our data set, and we're able to now compare the distribution across those days. Even though they're kind of right on top of each other, we're still able to see nice deviations among these data. For example, this long green tail is telling me right away that Saturday is the day when my server gets the largest tips.

Now, one other thing to note here: we are looking at proportion, so if I happen to have one category that's a whole lot smaller than the others, I'll still be scaling all of those values so that each of these different categories ranges from 0 to 1 in terms of proportion. But in general, when you have various different groups that you want to compare and the distributions are kind of all stacked on top of each other, the ECDF plot is great for showing this relationship, because we're just getting lines instead of a bunch of boxes and bins like we would with a histogram.

Now, so far I've been showing you the proportion on the y-axis, so we're looking at what proportion of our data we have seen so far. If you'd like to look at actual raw numbers, you can switch the stat argument over to count. This will show you how many observations you've seen so far. And if you'd like to look at raw counts, especially if you also have a hue argument, you could use both of these, so that now we can see that there are just fewer observations on Fridays, in orange, as compared to Saturdays, in green. So the count is giving you how many observations we've seen so far; for example, at four dollars we've already seen 200 observations.

But what if I wanted to know what the rank of a four-dollar tip is? For example, if I want to say that the number one best tip is ten dollars and then count backwards, how
could I do that with this plot? Well, it turns out that there's another argument called complementary, so let's try that. This is basically just giving us a complete opposite view of what we had before: we're starting way over here with the maximum value, and then we're counting backwards to our minimum value. But I think the cool thing about using both count and complementary is that we're effectively ranking these tips. For example, the number one best tip is ten dollars; by the time we get down to two dollars, that's ranking about 170th down to 200th.

Now, the final option I want to show you is called weights. So far, what we've been doing is taking the tip amounts and putting those along the x-axis, and our y-axis is just showing us what percentage of those tips we have seen so far. For example, by the time we get to four-dollar tips, we've seen eighty percent of our data so far. But what we can do with weights is actually tally up how much of our money we have made so far. So let's go ahead and try to use this weights option, and we're going to set this to the tip column. You could theoretically set this to whatever column you want, but I'm actually going to weight these based on the actual dollar amount in the tip.

So now what this is telling me is how much money I have made so far. For example, by the time I get to four dollars, I've actually only made about 70 percent of the money that I'm going to make. And let's go ahead and put in a horizontal line here: once I get to the 50th percentile of how much money I'm going to make, that actually occurs at right about $3.50. Sometimes you'll hear politicians say something like, "Well, 50% of the money I've made came from contributions that were less than $3.50." That's what we're doing here by using weights: we're actually saying how much money we have made by each of these different tip amounts, and you'll see that these larger amounts are now weighted more heavily, since they contribute more to the actual amount of money that we're making.
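
The count, complementary, and weights options described above can be sketched by hand in plain Python. This is a minimal illustration and not seaborn's own code: the helper `ecdf_steps` and the six tip values are made up for the example, and the complementary convention used here (the weight strictly above each value) is an assumption about how the reversed view is defined.

```python
from itertools import groupby


def ecdf_steps(values, weights=None, stat="proportion", complementary=False):
    """Return one (x, y) pair per distinct value of a hand-rolled ECDF.

    Hypothetical helper that mimics the ideas behind the stat,
    complementary, and weights options; it is not part of seaborn.
    """
    if weights is None:
        weights = [1.0] * len(values)  # unweighted: each observation counts once
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)

    steps = []
    running = 0.0
    for x, group in groupby(pairs, key=lambda pair: pair[0]):
        running += sum(w for _, w in group)  # tied values produce one larger jump
        # Complementary view: the weight strictly above x instead of at or below x.
        y = total - running if complementary else running
        if stat == "proportion":
            y /= total
        steps.append((x, y))
    return steps


tips = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0]  # made-up values, not the tips data set

counts = ecdf_steps(tips, stat="count")                      # observations seen so far
ranks = ecdf_steps(tips, stat="count", complementary=True)   # observations still above
money = ecdf_steps(tips, weights=tips)                       # share of total dollars
```

With these made-up values, five of the six tips are four dollars or less, yet `money` shows that those tips account for only 60% of the dollars, which is the same kind of gap the weighted plot in the video reveals.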
Like always, there are tons of ways that you can style the ECDF plot, so here are some examples to get you started.

One of the first things that you might want to update in terms of styling is the colors displayed on your ECDF plot. For this example, I have an ECDF plot and I've broken out my data based on the day of the week, but let's say that I'm not happy with this default color palette. If I'd like to switch the colors that are displayed here, that's an argument called palette, and there are over a hundred different palettes that you can choose from within seaborn. I'm going to use one that's called summer, so I'll just pass in the string summer, and that will switch over to these nice green shades.

Besides the seaborn keyword arguments, you also have access to various arguments in pyplot, so any other arguments that you're passing in here will go through to pyplot's plot function. For example, you can change the line width to, let's say, three, but whatever arguments you know for styling lines within pyplot will also work here.

And there's even more styling you can do by capturing the return object from this ECDF plot. What I mean by that is, let's go ahead and let p equal the return object from the ECDF plot. If I go ahead and check the type of p, we'll see that that is a matplotlib AxesSubplot, so that means that we can continue working with and styling this p object. For example, if I'd like to adjust the legend here, I can reference p.legend, then I can pass in a list with all the different values I'd like to use within my legend. Let's say instead of having Thur, Fri, Sat, and Sun, I'd actually like to print out the actual day names: Thursday, Friday, Saturday, and Sunday. Passing in that list, you'll see that now those legend names have been updated. I can also do things like put a title onto this legend, let's say that's going to be Day of Week, and we'll see that appear. Or I could potentially do things like move the location of my legend to wherever I like it on my plot. So just
keep that in mind: this is going to return a matplotlib object, and you can continue using that to style your ECDF plot.

So I hope you enjoyed learning all about the seaborn ECDF plot. It's a different type of distribution plot that really shines when you want to show distributions of various different categories. If you want to check out some of seaborn's other distribution plots, go ahead and take a look at my past videos about those. Thanks so much, and I'll see you next time.

[Outtakes] The sun went down. The sun went down. Come back... and seaborn version 0.11. I always forget what's next. Its name looks like...
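
As a recap of how the ECDF in these captions is built and read, here is a minimal plain-Python sketch. The two helper functions are hypothetical names introduced for illustration. Of the ten observations, only 1, 1.5, the two values at 2, and the maximum of 5 come from the captions; the rest are invented to fill out the list.

```python
from bisect import bisect_right
from math import ceil


def proportion_at_or_below(sorted_values, x):
    """Height of the ECDF at x: the share of observations that are <= x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def value_at_proportion(sorted_values, q):
    """Smallest observed value whose ECDF height reaches q (for 0 < q <= 1)."""
    index = ceil(q * len(sorted_values)) - 1
    return sorted_values[max(index, 0)]


# Step 1 from the video: sort the observations, smallest first.
# The values 2.5, 3, 3, 3.5, and 4 are invented placeholders.
observations = sorted([1, 1.5, 2, 2, 2.5, 3, 3, 3.5, 4, 5])

height_at_one = proportion_at_or_below(observations, 1)   # first step of the plot
height_at_two = proportion_at_or_below(observations, 2)   # after the tie at 2
halfway_value = value_at_proportion(observations, 0.5)    # where the curve reaches 0.5
```

Reading a vertical line on the plot corresponds to `proportion_at_or_below`, and reading a line at a chosen proportion corresponds to `value_at_proportion`. The second helper reads the step function directly, so its result can differ slightly from an interpolated median such as the one a data-frame library reports.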
Info
Channel: Kimberly Fessel
Views: 9,899
Keywords: seaborn ecdfplot, ecdfplot, what is ecdf, what is an ecdf plot, empirical cumulative distribution function, seaborn ecdfplot hue, seaborn ecdfplot weights, seaborn ecdfplot count, how to make ecdf, ecdf plot code, python ecdf plot, python ecdf, seaborn distribution plots, seaborn multiple distribution plots, ecdf plot color, ecdf plot, what is the empirical cumulative distribution function, seaborn ecdfplot code, python seaborn tutorial
Id: Twh0w3gcrDI
Length: 15min 40sec (940 seconds)
Published: Mon Jun 21 2021