Practical Statistics for Data Scientists - Chapter 1 - Exploratory Data Analysis

Captions
Hey guys, my name is Shashank Kalanithi. I'm a data analyst, and on this channel I teach you the skills necessary to become a data analyst; you can also follow me on my journey to become a data scientist. Today we're going over the book Practical Statistics for Data Scientists, chapter one. In this series I'm going to cover the first three chapters of the book, because the information in the subsequent chapters is covered in my Hands-On Machine Learning series, which is also on my channel; you can see that linked in the video. If you're at all interested in following along with the book, feel free to buy it using the Amazon affiliate link I have below. You should get the same price as normal, but I'll get a small commission in exchange for you clicking the link. Before we get started: I'll be showing you some notes along with some code, and if you're interested in supporting the channel and would like access to the notes, feel free to follow me on my Patreon. It's a great way to support the work and show that what I'm doing is helpful for you. Without further ado, let's get started. Let me bring up my code editor and my notes. I've already written up the notes for this section; we're going to go through them, and then I'm going to write some code live to actually execute on them. Currently my notes just have the code from the book copied over, but we'll be replacing it with our own code, live. All right, so chapter one is about exploratory data analysis. Exploratory data analysis is a surprisingly new discipline, proposed in 1962 by John Tukey, who later wrote a book literally titled Exploratory Data Analysis, and it's basically the art and science of using simple statistics and
plots and graphs to better understand the dataset you're working with. It's a fundamental, and potentially the most important, step of any analytical process; it's where you make sure you fully understand the data before you start trying to model it, which is why it's so important for data scientists. The authors of the book (not Tukey) start off by talking about the elements of structured data. Structured data is basically data that can be broken down into a tabular form. There are other forms of data, which are unstructured, which we'll get to in a minute. Within structured data you can have either numerical or categorical data; numerical data can be continuous or discrete, and categorical data can be binary or ordinal. Let's back up a couple of steps. First of all, what is structured data? Like I mentioned, it's data in a tabular format; think of an Excel spreadsheet, something like that. Then within that, what is numerical? Numerical data is data expressed as numbers, for example your age or the year you were born. Numerical data can be either continuous or discrete. The year you were born is a discrete variable, because you can be born in 1996 or 1952, but you can't be born in 1952 and a half, at least the way we normally write years. Continuous variables, by contrast, are variables that can be divided infinitesimally. For example, your weight: you can weigh 52 and a half pounds or 32 and a half kilos. You probably don't, if you're watching this video, but you
could, and that's why it's called a continuous variable: you can infinitesimally divide the scale you're working on. Let's add that information to the notes: continuous is numerical data that can be infinitely divided, i.e. weight. You might be wondering why I didn't put the examples over here; I've gone over material like this in several other tutorials, so I forgot that this is a separate section of notes that people might be reading without having seen those. Discrete is numerical data that cannot be divided, i.e. year of birth. Then you have categorical data, and within categorical data there's more than what I have over here, but binary is one form, which basically means the category has only two options. So what is categorical data? Something like names: it's not a number, and it can be anything within a certain group of values. For example, it's very unlikely that your first name or last name is literally the only version of it in the entire world, which is why it's a categorical variable. Or, say we were building a spreadsheet of camera brands (I'm looking at my camera right now): the brand of my camera is Sony, so Sony is a categorical value, while the aperture information would be numerical, actual numbers you can use to quantify the quality of the lens. Ordinal is categorical data where the order matters. I'm trying to think of a good example of that; if you cared about alphabetizing people's names, in that situation the data would be ordinal. (A more standard example would be something like t-shirt sizes: small, medium, large.) So we'll say: data where the order matters. Those are the two major types of structured data. You could break it down into more data types; if you watch my SQL tutorial, for example, you'll see that
I break data down in other ways, but within the schema we're thinking about right now, in relation to statistics, data is either numerical or categorical. All right, rectangular data: again, think of an Excel spreadsheet, where the column headers each represent a variable and the rows each represent a record. Let's actually illustrate what that means. Say I had a dataset of US states: I could have the state name, the biggest city, and the population. What's the biggest city in California? I'm guessing it's going to be LA. And Rhode Island's is Providence, is Providence in Rhode Island? So this is an example of rectangular data: each of these columns, state, biggest city, population, is a variable, and each row represents a record; the first one is for the state of Texas, the second for California, the third for Rhode Island. Let's go ahead and put this inside our notes. I decided to try doing some of the notes live this time, that way I can add more information; I got some good feedback on the notes I had for my Hands-On Machine Learning course, which, by the way, you get access to as well if you join the Patreon. I'm trying to build up the Patreon to have more and more on it, and right now I definitely think it's a great deal. And then let us caption this: example of rectangular data. Now, if Providence really isn't the biggest city in Rhode Island, I'm going to be hearing about this from tons of people. All right, so this is an example of rectangular data, and remember: just because data is inside a table doesn't mean it's rectangular,
because you can sometimes have data laid out like this, with two axes, where the variables are not all along one edge. This is a very common format people put their data in. The way you would turn this into rectangular data is you'd have something like name as a separate column, then income, then bonus. People like to have their data the other way, especially in Excel, because it's easy to make graphs using data in that format; but when we work with Python and other programming languages, you almost always want every column to be a separate variable, with no blank spaces like that. I hope that makes it clear. You'll be hearing about something called a DataFrame. A DataFrame isn't really a native data structure, at least not in base Python, but it's the data structure used to hold rectangular data, and we'll be working with DataFrames in a minute. Also, before you start this course: we're going to be doing a lot of work in Python. The book also does everything in R, but I personally use Python for everything; I might do a redux of this course at some point where I go and do it in R as well. If you don't know any Python at all, I highly recommend checking out my free Python course, which I'll link over here; in that course you'll learn everything you need in order to do everything in this course and more. So if you know zero Python it's a great place to start, and if you know just a little Python, or you know Python as an engineer, it is
a great place to start too, because I'm teaching Python specifically for data analysts. All right, so we've gone over DataFrames; now, indexes. One important thing to remember is that DataFrames have something called an index, which is similar to a row number, and it's basically a way to speed up calculations on your DataFrame; it helps the computer search through it faster. When we create our first DataFrame you'll see what that is. Another thing to remember is that statisticians and data scientists use different words for the same concepts. For example, statisticians will say predictor variables predict a response or dependent variable, while data scientists will say features predict a target. That's not too important in this chapter, because we're not doing any modeling, but just be aware that data scientists and statisticians have different terms for the exact same concepts. All right, quickly, non-rectangular data structures: this isn't a big deal for this course, and I don't believe we'll be going over it much, because non-rectangular data structures usually need a lot of processing in order to derive insights from them, and you have to process each type in its own specific way. One type of non-rectangular data is spatial data, in which you can get object representations and field representations. I'm not going to go over this because it's not too important; if you're at all interested, feel free to look it up yourself. It's just a note inside the book, not something covered extensively. All right, let's get into the actual statistics. We're going to go over a few things: the first concept we're
going to cover is called estimates of location; then estimates of variability; then exploring data distributions; and then exploring binary and categorical data, and exploring two or more variables. The sections toward the end are super fun; the sections toward the beginning are kind of basic, but it's important to go over them and know what they are. You'll see I have some blank code cells that we're going to be filling in. So let's go to our code editor; I hope everything is big enough for y'all to see. What we have over here is the code that I'll be putting on my Notion and on my Patreon as well: Practical Statistics for Data Scientists, chapter one. And I got some data from Kaggle; people always ask me where I get my data, and a lot of the time I get it from Kaggle. This is a dataset by a user named Arjun Prasad, and it's basically a bunch of data on the Tokyo Olympics, so that's what we'll be covering. All right, I pulled all the data and put it in here, did I? I did, there it is. Unfortunately it's in Excel files and not CSVs, so hopefully I have the right modules installed for that; make sure you have pandas installed. Let's go ahead and import pandas. Oh, this is a markdown cell, as you can see up here, meaning I can't actually execute code here; I need to switch it to Python. There we go: import pandas as pd. For anyone that doesn't know, pandas actually stands for "panel data", and it's the standard Python library for interacting with structured data. All right, we've imported pandas successfully; let's go ahead and bring in some data. I think the medal count will give us a lot of good numerical data, so
estimates of location are normally computed on numerical data, so this will give us some good numerical data to use. So let's read it in: read_excel, and, I think in Windows it's a double backslash, Medals.xlsx. All right, let's read this. Cool, as you can see we just read in that Excel file and got exactly what we were looking for. If you remember, earlier I was mentioning indexes, or indices; the book calls them indexes, but I prefer the word indices (someone told me they're two different words, but I thought indices was just the other form of the word). You can see this column over here with no label is actually the index column; that's how pandas displays the index of the DataFrame. It looks like we have a few pieces of information here: the team, the rank, gold, silver, bronze, total, and rank by total. And yes, this is for the Tokyo Olympics. So let's try to calculate the mean number of total medals won. I'll add some markdown, call this "Estimates of location", then "Mean", and we'll take the medal count DataFrame, take the Total column, and call .mean(). Is that it? Yep: it looks like 11.6 is the mean of the Total column. The mean basically means you add up all of the values and then divide by the number of values; I forgot to mention that. Here's something interesting: weighted means are the same as means, except you multiply every value x_i by some weight w_i before adding them up, and then divide by the sum of the weights rather than the number of instances. The reason you do this is that oftentimes you don't want a straight mean; you want the mean weighted by some other factor. So for
example, take experience and salaries. You might want to compare salaries between different people, but it might not be fair to compare someone with more years of experience to someone with less, which is why you might want to take weighted means and compare those to each other. So we're going to import numpy, which stands for "numerical Python"; it's the most used library for more advanced mathematical and array operations in Python. So: import numpy as np. That's just the standard convention; it's something I wouldn't change, I wouldn't call numpy something else, because people have to read your code and that alias is what's used most often. Let's copy this over and switch out our information. We're going to do np.average on the Gold column, but weighted by the total number of medals each country has, I think the column is just Total. And you'll see what it's doing: it's weighting the average gold count by the total number of medals. Actually, let me do it the other way around, so we can compare it to the medal count from before, and you'll see it's 46.
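As a rough sketch of the mean and weighted mean just described (with made-up medal counts, since I'm not reproducing the actual Medals.xlsx file here):

```python
import numpy as np
import pandas as pd

# Made-up medal counts for five countries; the video uses the real
# Medals.xlsx file from a Tokyo Olympics dataset on Kaggle.
medals = pd.DataFrame({
    "Gold":  [39, 38, 27, 22, 20],
    "Total": [113, 88, 58, 65, 71],
})

# Plain mean: add up all the values, divide by the number of values.
plain = medals["Total"].mean()
print(plain)  # 79.0

# Weighted mean: multiply each value by its weight, sum, then divide
# by the sum of the weights (here: totals weighted by gold counts).
weighted = np.average(medals["Total"], weights=medals["Gold"])
```

Swapping which column is the value and which is the weight, as in the video, is the same call with the arguments reversed.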
So depending on how we decided to do this, we could give different numbers of points to the different medals, gold, silver, bronze, and then compare countries like that. One funny thing I do want to mention: in the US we almost always win the greatest total number of medals, but on golds, China is always very, very close. So in the US you'll often see reporting on the Olympics ordered by total medal count, but I've heard from friends internationally that in other countries they'll actually rank everything by the number of golds first. We presumably don't do that because it would mean we were in second place for a chunk of the Olympics, though if you count by total medals we're usually dominant the entire time. That's just a funny piece of trivia for anyone out there who might not have been aware of it. The next thing we can do is something called the trimmed mean. One thing you should know about the mean is that it is very susceptible to outliers, basically meaning that if a value is significantly higher or significantly lower than most other values, it will drastically change the value of your mean. That's why we might want to use the trimmed mean, which removes the top and bottom x percent (or x values) from your data and then takes the mean of what's left; that way your mean is more robust. We'll be going over the definition of robust, actually, we can just go over it
right now: a robust metric is a metric that is not sensitive to outliers. The mean is very sensitive to outliers, so by taking the trimmed mean we make it less sensitive. Let's try that. We did the mean; now let's do the trimmed mean. We have to import SciPy: from scipy.stats import trim_mean. I may not have it on this computer, but let's try it out. Oh, before I do that, let me put a note for the data over here (I don't know why the Notion autocorrect doesn't work on PC), and copy this code over so the notes are a little more complete. All right, so we wanted the trimmed mean: trim_mean, we'll take the medal count, and let's use Gold rather than Total, because I have a feeling that's a more skewed distribution, and trim out the top and bottom ten percent. Now you see the average gold medal count is 1.96, and does that make sense? Let's take a quick look at the trim_mean documentation to make sure I have this code right; if you're ever unsure about anything, the documentation on all of these core Python libraries is excellent, highly recommended. Yes, this is exactly what we're looking for: it cuts off the 10% on the top and the bottom, which for 93 rows of data means cutting off roughly the 9 highest and 9 lowest values. So what happens is you're going to be removing all the zeros, most likely, and then these high medal counts over here. Let's look at what this actually looks like. I wish I could see more of it; I can turn it into a list and see it there. One, two, three, four, five, six, seven, eight, nine, ten. Okay, that
actually makes sense: you'll see that the number of gold medals drops precipitously. There are about six countries with a large number of golds and everyone else has very few, and that's probably why our trimmed mean is so low; by chopping off the top ten percent you've drastically reduced the number of gold medals in play, because the bulk of them are taken up by the US, China, Japan, the UK, and, I thought this was Taiwan for a while and was impressed, but no, it's the Russian Olympic Committee; the Russians couldn't participate as Russia this year because of the doping scandal last time. That makes a lot more sense; I know Russia does well at the Olympics. All right, now that we've got our trimmed mean, let's see what's next: the median, and then the weighted median, which is the same basic concept. First, the median. The median is the middle value of all of your values, and it's a lot more robust than the mean: if you think about it, you sort all your values and take the center one, meaning you've effectively set aside the 49-odd percent on the right side and on the left side, so extreme values can't pull it around. Medians are great when you know you have heavily skewed values. Income levels are a great example of where a median will give you a better representation of what the average person makes than the mean will, because, at least in the US, incomes are very heavily skewed: you have a few people who make just gobs of money
and then a lot of people who make very little, so if you take the mean income, it comes out higher than what the typical American actually makes; the median captures that better. That's why you often want medians instead of means. There are probably other good examples out there: even at your company, the median income is probably a better representation of a typical income than the mean, because you'll have a few people with extremely high incomes. So let's try it out. Let me make sure I replace this code over here, okay, cool, and let's find our median: medal count, Total, .median(), just built into pandas. Excellent: the median medal count is 4, and if you look at the data, that's probably a better picture of how many medals a typical country won than the mean is. Oh, and let me change the earlier cells to Total as well, that way we're comparing the same column across the different estimates. The next thing we have is the weighted median, which is basically the same concept: you pick the value such that the sum of the weights on its left side and on its right side come out equal. You can actually download the wquantiles package to calculate this, but I prefer to reduce the number of packages in my environment, and I don't know what I'd use wquantiles for outside of this; SciPy, numpy, pandas I'm fine installing because I'll use them for a ton of stuff. So I got this code over here which will calculate the weighted median using just pandas, and I'm going to be using this personally. It's a couple
more lines of code, but honestly I would just store it somewhere and reuse it consistently. So this is the weighted median; I want to bring in the medal count, Total weighted by Gold, and there we go, there's our weighted median. All right, the final thing we want to go over is the percentile. The p-th percentile is basically the value such that p percent of the data lies below it. You'll see this very commonly in standardized testing, for example: if you're in the 99th percentile, it means you've scored higher than 99% of applicants. (It's just confusing that when people talk about "the one percent" they're talking about the 99th percentile, while the rest of us are "the 99%".) We can easily calculate it using numpy, and so if I want to get Q3, the third quartile, which is the 75th percentile, it's as simple as one numpy call, whoops, and I should probably declare the variable, and the number is 11.
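The median, percentile, and weighted median just discussed can be sketched like this; the numbers are synthetic, and `weighted_median` is a small hand-rolled helper (one common way to compute it, in the spirit of the pandas-only snippet from the video, not the book's wquantiles call):

```python
import numpy as np
import pandas as pd

totals = pd.Series([0, 1, 2, 4, 8, 17, 58, 113])

# Median: the middle value after sorting (pandas averages the two
# middle values when the count is even).
print(totals.median())  # 6.0

# Percentile: the value below which p percent of the data lies.
q3 = np.percentile(totals, 75)

# Weighted median: the value at which the cumulative weight first
# reaches half of the total weight.
def weighted_median(values, weights):
    df = pd.DataFrame({"v": values, "w": weights}).sort_values("v")
    cum = df["w"].cumsum()
    return df.loc[cum >= df["w"].sum() / 2, "v"].iloc[0]
```

For example, `weighted_median([1, 2, 3, 4], [1, 1, 1, 10])` returns 4, because the last value carries most of the weight.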
Let me go ahead and put this code in the notes as well, and take this and put it over here. Sweet. And then: an outlier is a value that is significantly different from most of the data. They don't go over outliers much in this section of the book; there are actual numerical ways to identify outliers, and I think we'll be looking at some through our box plots, but they aren't covered in this section. So those are estimates of location, basically, where is your data. One thing you might notice is that these are meant to describe a single column of data, though I guess you can take a second column for the weighted versions. One interesting thing worth noting: you can take your dataset, medal count, and call .describe(), and pandas will automatically output a lot of values for you: the count, the mean, the standard deviation, the minimum, the 25th, 50th, and 75th percentiles, and the maximum, for all of the numerical columns in your dataset, just from DataFrame.describe(). That's just a small note; I don't believe they go over it inside the book. All right: estimates of variability, otherwise known as dispersion metrics. Now that we have estimated where the center of a distribution is, for example, for a single column of data we now know where its center is, estimates of variability tell us how far spread out the data is. I'll go over how this all fits together a little later in the chapter; there's a small section on that. Variability is the heart of statistics, and it's where a lot of the information in a dataset can be gleaned. As a small side
note: when you're performing a data science project of some kind, you're modeling some data, and the information is typically contained in the variability. If there's no variability in a column, you can't use that variable to predict anything. Really think about that for a second: to predict something, you need some difference in your variables, so you can say this value correlates with that outcome; with no variability, it's impossible to correlate the data with anything. I have a video on PCA, principal component analysis, on my channel, I think as part of the Hands-On Machine Learning series, and PCA uses variability to reduce the number of columns you have to work with, by finding the column or combination of columns with the greatest amount of variation, and therefore the greatest amount of information, inside it. So that's just a small aside on the importance of variance in data modeling. Let's add a markdown cell over here and make it a second-level header, not a first. The first concept is deviations, which are the differences between the observed values and the estimate of location; they're also called errors or residuals. For example, say I perform a linear regression over here: the residuals, or errors, or deviations, are these arrows, basically how far the line is from the individual values. Variance is the sum of the squared deviations from the mean, divided by n minus 1, where n is the number of instances. We'll use the standard statistics package to calculate this. Let's see what the variance is in the medal counts, it's probably quite high, or better yet, in Gold, and because this is a standard Python package we are
just going to take the column and surround it with the function call, and that's our variance. As a bit of a side note, I've been asked quite a bit how I have the computer setup that I do, my Visual Studio Code setup, and I have a video on that, which I'll link above, on how to set up your computer the same way I set up mine. I personally like using Visual Studio Code for my development because it lets me easily switch between Python files and IPython notebooks, otherwise known as Jupyter notebooks, without too much hassle; it's updated a lot, it's really nice, and I like that it's dark-themed as well. That's just a small plug for another video on my channel. So that's our variance; let's go get our standard deviation. The standard deviation is basically the square root of the variance: from statistics import stdev, and take the stdev of the same data. Sanity check: if I take the square root of 49 I get about 7, right, and we got 7.02, yep, that sounds about correct. Awesome, so we can see the standard deviation is the square root of the variance. Next, the mean absolute deviation, which is the mean of the absolute values of the deviations from the mean; it's related to the L1 norm, or Manhattan norm. You'll see that the deviations go up and down, and the great thing about the mean absolute deviation is that it's a lot stricter about how far off you are, because it doesn't care whether you're off high or off low; deviations can't balance each other out, it cares how far away you are in absolute terms. We can calculate it using the numpy package, and
There we go, there's our mean absolute deviation; let's put that in the notes. The median absolute deviation from the median is the same basic idea: it's the median of the absolute values of the deviations from the median. Take a second to digest that. Let me check whether I actually wrote this correctly; I like how convenient Kindle books are, but the Kindle software has always been annoying to me, really janky and slow. Yes, "median absolute deviation from the median" is correct, so my code should use the median rather than the mean; let's bring in the median, and there's our value. The next value we're going to go over is the range, which is simply the maximum value minus the minimum value. Order statistics (remember ordinal data) are metrics based on data values sorted from smallest to biggest. A percentile, remember, is the value that is greater than p percent of the values in the data set. And the interquartile range is simply the difference between the 75th and the 25th percentiles, which tells you how wide the middle 50 percent of your data is. Let's bring in the medal counts, compute the IQR, and there we go: those are our measures of variability. So we've gone over two sections so far: estimates of location and estimates of variability.
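The median absolute deviation, range, and IQR from this section can all be sketched with numpy; again the numbers are a hypothetical stand-in for the gold-medal column, not the real data set.

```python
import numpy as np

gold = np.array([39, 38, 27, 22, 20, 10, 10, 7, 3, 1, 0, 0])  # hypothetical sample

# Median absolute deviation from the median: robust against outliers
median_abs_dev = np.median(np.abs(gold - np.median(gold)))

# Range: max minus min
value_range = gold.max() - gold.min()

# Interquartile range: spread of the middle 50% of the data
q75, q25 = np.percentile(gold, [75, 25])
iqr = q75 - q25

print(median_abs_dev, value_range, iqr)
```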
Next we're going to go over a short section comparing these measures of variability. The mean absolute deviation is one method to measure variability: it's simply the sum of the absolute values of all deviations (values minus the mean) divided by the number of instances. Other metrics include the variance and the standard deviation, as we saw earlier. One reason people prefer the standard deviation to the variance (I should have gone over this earlier) is that the standard deviation operates on the same scale as the rest of the data, because taking the square root brings it back down to that scale, so when you look at the standard deviation it's much more in line with the data itself. Look at this example: the variance of our gold medals is about 49, but no country even won 49 medals, so by itself that number doesn't tell me much in comparison to the rest of the data. When I take the standard deviation and get seven, though, that's more or less saying the typical distance between individual gold-medal counts and the mean is seven, which makes a lot more sense, especially when you look at the distribution of gold medals and see a few countries clustered toward the top with the rest clustered toward the bottom. That's why people tend to prefer the standard deviation. It's important to remember that all three of these metrics are not robust against outliers, and the variance and standard deviation are particularly susceptible because they square their deviations. The median absolute deviation, on the other hand, is robust against outliers; it's
calculated by taking the median of the absolute values of all values minus the median. So the median absolute deviation is what you might want to use if you know there are a lot of outliers in your data. This is where data science becomes more of an art and less of a science: there aren't necessarily hard rules telling you which one you should use; it's based on experience, plus some rules of thumb you may or may not decide to follow. Next we have estimates based on percentiles, which is mostly an overview of the range and the 75th and 25th percentiles. One very important thing to remember: for very large data sets it's very computationally intensive to calculate these statistics, because they require sorting the data. Anyone coming from a computer science background will know that sorting can be a very expensive task depending on how much data you have, which is why machine learning algorithms may actually estimate these statistics rather than compute them exactly; that's just a little side note. Next, let's actually explore the data distribution. This is where we get into the more hardcore EDA section of the tutorial; what we had earlier was exploratory data analysis too, but those were the basic statistics. We're at about 43 minutes on this video; let's see if we can get it done within the hour. All right: percentiles and box plots. Percentiles are a great way to summarize the tails of a distribution, such as the top one percent, and box plots are a great way to see the center of a distribution.
Let's go ahead and add a new markdown cell called "Exploring the data distribution" and do a box plot. I wonder if I can just pass in a pd.Series like this; no, that isn't working, so let's look up the code: searching "python create box plot" turns up DataFrame.boxplot; there literally is a box plot method we can use. The reason I'm not going to cut this out of the video is that, from what I understand, people like watching me walk through the process of actually finding stuff, and it's one thing I really want to communicate in my videos. I wouldn't call myself an expert, but I do know what I'm doing, and even when you know what you're doing you're constantly going to be looking things up, because you can't remember everything all at once. I hope this communicates that even people who are experts look stuff up regularly, so if you find yourself looking stuff up all the time, don't worry too much; hopefully at some point you memorize some of it, but you're only going to memorize the things you use regularly. Okay, so the Series object has no boxplot attribute; I have to call boxplot on the DataFrame and specify the column, and there we go: there's our box plot for our data, and it flags the outliers.
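The working call from this exchange can be sketched like so; the `medal_count` frame here is a small hypothetical stand-in for the Olympics table loaded earlier in the video.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display

import pandas as pd

# Hypothetical stand-in for the medal-count DataFrame
medal_count = pd.DataFrame({"Gold": [39, 38, 27, 22, 20, 10, 10, 7, 3, 1, 0, 0]})

# boxplot is a DataFrame method, so pass the column name rather than a Series
ax = medal_count.boxplot(column="Gold")
```

The key gotcha from the video: `Series` has no `boxplot` attribute, so you go through the DataFrame and name the column.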
Any country that wins about 10 medals or more is an outlier, and you can see the outliers heavily skew the top of this distribution rather than the bottom. That's what a box plot is so useful for: without really having to do anything, with one line of code, I can already tell you that gold medals are heavily skewed toward the highest-earning countries. If I switch the column to Total, you can see the box is a little bigger and a little better distributed; in a normally distributed set of data the box would be much more even, but gold-medal counts are obviously not at all evenly distributed. Wendover Productions, I think, has a great video on why certain countries win so many golds; more or less it comes down to the money and organizational expertise they put into their sports programs. For example, the US wins as many golds as it does partly because it's a very rich country, but also because the NCAA, the US college athletic association, does a great job of finding the best talent in the country. A country like China does a great job at the Olympics because it has state-run programs that intensively train people to do really well at certain sports, so you'll see China performs especially well in individual sports like swimming, though it doesn't perform at the highest level in most team sports; not to say its results aren't impressive. All right, we have our box plot. Next is the frequency table and histogram, and what the frequency table does is
divide the values of a variable into equally spaced segments and then count the number of values in each segment. Let me show you: pd.cut creates a series where each value is assigned to a bin; we pass in our medal count's Gold column, and the 10 specifies the number of sections we want. (For anyone wondering, a Series is basically a single column of a DataFrame; sorry, I should have discussed that earlier.) We had 93 values, indexed 0 to 92 because of zero-based indexing, and for each one it tells me which cut it belongs to. My understanding is that a bracket means inclusive, so that end includes 39, while a parenthesis means exclusive, meaning that end does not include 35.1. But this isn't quite a frequency table yet, honestly; it just labels each row with its bin. So how would I turn it into one? Groupby on the raw frame doesn't quite make sense; what I would do is create a new object called frequency_table as a copy of the medal count, add the bin labels as a column called "freq", then group by that column and count the Rank column (since Rank is a unique number per row), with a reset_index at the end; this parenthesis was a typo at first. As you can see, here's our frequency table; this is how you'd actually build it, so let me copy it into the notes.
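A more direct route than the copy-and-groupby dance above is to chain `value_counts` onto the `pd.cut` result; a sketch with a hypothetical stand-in for the gold column:

```python
import pandas as pd

gold = pd.Series([39, 38, 27, 22, 20, 10, 10, 7, 3, 1, 0, 0])  # hypothetical sample

# pd.cut assigns each value to one of 5 equal-width bins;
# value_counts then tallies how many values fall in each bin
bins = pd.cut(gold, 5)
freq_table = bins.value_counts().sort_index()
print(freq_table)
```

`sort_index()` puts the bins back in value order, since `value_counts` sorts by count by default.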
frequency table: the earlier output just told us which section each record belonged to, but this actually gives us the counts per bin. Next we can do a histogram, which I don't have the code for up here, so we'll look that up real fast; histograms should be easy to make. I know numpy has one; np.histogram with our gold medal counts and 10 bins gives me the data for a histogram, but I want a visual histogram. There are also histograms in pure Python, which I'm not going to use, although that could be useful if you're ever in a situation where you can't install libraries on your computer. So we bring in matplotlib, which is the standard plotting library in Python; they have a very nice histogram example, and we're just going to do something simple: plt.hist, where x is the variable we want to work with. There we go, there's your histogram, and you can see it's a heavily, heavily skewed data set. The bins argument defaults to "auto", but I could change it to a number like 5 and reduce the number of bins. I'll leave it as auto, because I want the computer to do everything for me. Sweet, here's our histogram. The next thing I want to go over is something called statistical moments.
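Both the numeric and the visual versions from this passage can be sketched together; the data is a hypothetical stand-in, not the real medal table.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend

import matplotlib.pyplot as plt

gold = [39, 38, 27, 22, 20, 10, 10, 7, 3, 1, 0, 0]  # hypothetical sample

# plt.hist both draws the histogram and returns the underlying bin data:
# counts per bin, the bin edges, and the drawn bar patches
counts, bin_edges, patches = plt.hist(x=gold, bins=5)
```

Passing `bins="auto"` instead of `5` lets matplotlib pick the bin count, which is what the video settles on.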
There's a link in the notes below; you can see the URL on screen. Basically, statistical moments are a set of statistical parameters that describe distributions, and you have your first, second, third, and fourth moments. The first moment is the mean, basically the location. The second moment is variability: how closely values are spread around the mean. The third moment is skewness, the direction of the tail of the data; for example, you can tell ours is what you'd call a right-skewed distribution because its tail extends out to the right. In practice you often just visualize the data and judge the skew by eye rather than leaning on a single number; this is where it's more of an art than a science. And kurtosis, the fourth moment, is the propensity of the data to have extreme values; I would guess the kurtosis of this data is quite high, because it has quite a few extreme values. Interesting: there's also something called a density plot, which is basically a histogram with a line on it representing the density, the proportion of values in each region; I'll show you what that is in a second. First let's add the histogram to the notes, and the frequency table too, because that's something that's not commonly known about. Then for the density plot, let's copy the book's code and replace the state murder rates with our medal counts; they're using a different data set in the book, but I want to use my own thing, since I think practice is a great way to learn.
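Although the video judges skew visually, pandas does ship sample estimates of the third and fourth moments; a quick sketch with a hypothetical stand-in for the gold counts:

```python
import pandas as pd

gold = pd.Series([39, 38, 27, 22, 20, 10, 10, 7, 3, 1, 0, 0])  # hypothetical sample

skewness = gold.skew()       # third moment: positive for a right-skewed tail
kurtosis = gold.kurtosis()   # fourth moment: excess kurtosis (a normal curve is ~0)

print(skewness, kurtosis)
```

For right-skewed data like medal counts, `skew()` comes out positive, matching the eyeball judgment in the video.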
way to actually learn this stuff. Wow, that actually worked out quite well: this line represents the proportion of values in each section. Let me remove the label to simplify it a bit, then bring it into the notes and copy the code. What I'm going to do is actually keep this next to me whenever I perform my next EDA, because this code is awesome. In practice I'd probably use a library like Plotly, only because Plotly lets you interact with the data, but for now this is more than enough. Next we're going to explore binary and categorical data. This is where stuff starts getting really interesting, because I'm sure you all knew about histograms, means, medians, modes, and so on, but categorical data, which is a lot of the data out there, is particularly interesting to work with. We're at 58 minutes; I tried to get this done in an hour, but I guess that's not happening. So you have a couple of summary values here. The mode is the value that appears most often; each of these countries only shows up once, but I can take the mode of the numerical columns, and zero is obviously the most common value. The expected value is the sum of each value multiplied by its probability of occurrence; it's basically what you expect to get. For example, if you have a table full of coin flips, the expected value should be 50 percent heads and 50 percent tails; it should be biased toward neither. Let's see if there's an easy way to calculate that.
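Both summaries can be computed directly in pandas; the series below is a hypothetical stand-in for the gold column, and note that for data like this the expected value works out to the plain mean.

```python
import pandas as pd

gold = pd.Series([39, 38, 27, 22, 20, 10, 10, 7, 3, 1, 0, 0])  # hypothetical sample

# Mode: the most frequent value(s); pandas returns all ties as a Series
modes = gold.mode()

# Expected value: sum of each distinct value times its probability of occurrence
probs = gold.value_counts(normalize=True)
expected_value = (probs.index.to_numpy() * probs.to_numpy()).sum()

print(list(modes), expected_value)
```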
Hmm, the examples I'm finding actually write the formula out by hand; there's usually a numpy or pandas function for this kind of thing, and I don't want to reinvent it. Maybe pandas has one; what's coming up is the Poisson distribution, which looks like a separate topic, so I might look that up a little later and add it to the notes. The next thing we'll go over is bar charts, which plot each category against its frequency or proportion; I'm sure you've all seen bar plots before. So let's see if we can plot each country against the number of gold medals it has. The documentation says I can just specify x and y, so on our medal count I'll set x to the Team/NOC column (not "country"; I had to check the variable name) and y to Gold; I'm not going to transpose anything like they do in the example, I'll just use it the way it is, with the x label "gold count" and the y label "country". As you can see, this chart is completely unreadable because we have so many countries in there, so what I'll do is limit the data to just the first 10 rows, all columns, and as you can see, by using iloc I was able to do that; I'll also remove this 4-by-4 figsize. Now we have a much more readable chart: US, China, Japan, Great Britain, ROC, Australia, Netherlands, France, Germany, Italy. So that's our bar chart, relatively easy to create, and there are half a dozen libraries in Python that will help you plot this stuff, but being able to do it straight out of pandas is super convenient; pandas is an amazing library. Next up: a pie chart, interesting.
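The bar chart steps above can be sketched like so; the five-row frame is a hypothetical slice standing in for the `iloc`-limited medal table from the video.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend

import pandas as pd

# Hypothetical top-5 slice of the medal table
medal_count = pd.DataFrame({
    "Team/NOC": ["USA", "China", "Japan", "Great Britain", "ROC"],
    "Gold": [39, 38, 27, 22, 20],
})

# barh draws horizontal bars: one per row, labeled by the x column
ax = medal_count.plot.barh(x="Team/NOC", y="Gold")
ax.set_xlabel("Gold count")
ax.set_ylabel("Country")
```

Limiting the frame first (e.g. `medal_count.iloc[:10]`) is what keeps the chart readable, as the video finds.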
Okay, so I don't recommend drawing pie charts at all, but pandas does have one; you just specify a y. What a pie chart does is show the proportion each category has as a percentage of the total pie. The reason I don't recommend them, unless you have a very small number of values, is that it's almost impossible for someone to tell the difference between, say, 24 and 30 on a pie chart, even though that might actually be a significant difference, which is why pie charts are generally not a recommended form of graphing your data. In this example it might at least highlight the US and China as countries with a particularly large number of medals, but you wouldn't really be able to tell who has more unless the pie chart is ordered, and a lot of pie charts are ordered. So let's try the code they mention: ax equals dataframe.plot.pie, where the DataFrame is our medal count, and y has to be a numerical column like Gold, not Team/NOC. As you can see, in this situation you can't really tell what is what; you just know most countries got zero medals. With this many values a pie chart is not a particularly useful way of looking at the data. This is where a bar chart is more useful: at least you can see the values separated out, and it's much easier for humans to compare differences in length than differences in area. So how else could we do this? I want a pie chart of individual countries and their proportion of gold medals, and I don't know if I can do that with pandas. The book's example uses y equals mass, but how does it know the labels? Ah, okay,
I see: it uses the index to figure out how to label the slices; that makes sense. So I could create a DataFrame the way they did, except I already have data; you know what, let's just use matplotlib, which will be a lot easier than reconfiguring our DataFrame. Matplotlib's pie example has labels and sizes, so I'll copy it, change the call to plt.pie, drop the explode argument (I don't want anything to explode), set labels to the medal count's Team/NOC column and sizes to its Gold column. Let's limit it to a handful of countries again, because with all of them it's impossible to tell what's going on; we'll call this pie_data and bring in the first 15 countries. That's easier to read, but still quite difficult, and that's the problem with pie charts and why they're generally not recommended; if I turned this into a bar chart it would be significantly easier to read. Notice that if we removed the labels you wouldn't actually be able to tell the US has more medals than China, because the wedges are so close together, just one medal apart, whereas on a bar chart you could; that being said, maybe that's just a case for having labels. Let me cut this down to seven countries; seven is something you can actually read, but again, not very well. I generally don't recommend pie charts. Let's save the figure (I need the white background), download it, and copy the code into the notes. It is a nice-looking chart, though, but we do not need it.
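The matplotlib route the video lands on can be sketched like this; the labels and sizes are a hypothetical slice standing in for the first few rows of the medal table.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend

import matplotlib.pyplot as plt

# Hypothetical top-5 slice: country labels and their gold counts
labels = ["USA", "China", "Japan", "Great Britain", "ROC"]
sizes = [39, 38, 27, 22, 20]

fig, ax = plt.subplots()
# One wedge per country, sized by its share of the total golds
wedges, texts = ax.pie(sizes, labels=labels)
```

With no `autopct` argument, `ax.pie` returns just the wedges and their text labels; the video's point stands that beyond a handful of wedges the chart stops being readable.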
Ah, and the shadow option was on too; whoops, I did not mean to do that. The next thing we're going to go over is correlation; let's see if we can get this done in the next couple of minutes. Correlation generally refers to Pearson's correlation coefficient, which is sensitive to outliers, so be careful if you have them; it basically measures how one variable changes in relation to another. A correlation matrix is a way of showing the correlations between all pairs of variables in a data set. Let me show you how that works: we call medal_count.corr(), and as you can see we get a quick correlation matrix showing how the different variables relate to one another. For example, there's a negative correlation between rank and the number of gold medals won, which makes sense, because a higher rank number, like 65th, is worse than 1st. And it looks like there's a really high correlation between the number of golds and the number of silvers a country gets, which makes sense, and the same with bronze, and especially with totals. Correlations are a great way to get an idea of how different variables relate to each other in a data set, so let's put this matrix in the notes. Then let's see if we can make a quick scatter plot too; I'm guessing it's something like medal_count.plot.scatter with x equal to Gold and y equal to Silver, and sure enough we can plot all the golds against all the silvers very easily that way. Generally speaking there's a positive correlation between the number of golds and the number of silvers a country gets, which makes sense.
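Both calls from this passage can be sketched together; the frame is a hypothetical slice of the medal table, chosen so that rank runs opposite to gold just as in the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend

import pandas as pd

# Hypothetical slice of the medal table
medal_count = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Gold": [39, 38, 27, 22, 20],
    "Silver": [41, 32, 14, 21, 28],
})

# Pairwise Pearson correlations between every pair of numeric columns
corr_matrix = medal_count.corr()

# A quick scatter of golds against silvers
ax = medal_count.plot.scatter(x="Gold", y="Silver")
```

The diagonal of `corr()` is always 1 (each column against itself), and rank vs. gold comes out negative because a higher rank number means fewer golds.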
There's a section here on covariance estimates that I'll be adding a little later; I just found it super interesting, but it's a bit beyond the scope of this video. Finally we have a section on comparing two or more variables to one another. Means and variances are part of what we call univariate analysis, which is basically analysis where you're looking at one column at a time; oftentimes we'll want to explore two or more variables simultaneously, and that's where things like scatter plots and correlations become quite useful. Actually, what I should probably do is move the correlation section here, because it makes more sense for that material to live under the section where we explore two or more variables; there we go. Here we have a few tools: contingency tables, which basically tally the counts between two or more variables; hexagonal binning, which plots two numeric variables by grouping records into hexagons, which is super interesting; contour plots, which plot the density of two numeric variables, a bit like a topographic map; and violin plots, which are basically box plots with density estimates inside them. I'm not going to write code for all of these, because some of them only become useful with quite a bit of data; I'll show you which ones I'm writing code for. If you want to compare numeric data to numeric data, hexagonal binning is very useful, and this is actually a case where it might make sense, so let's do the same thing here with x as Gold and y as Silver.
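The hexbin call can be sketched like so; since hexagonal binning only pays off with many points, the data here is synthetic, a hypothetical stand-in rather than the real medal counts.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend

import numpy as np
import pandas as pd

# Synthetic stand-in: hexbin shines when there are lots of points
rng = np.random.default_rng(0)
gold = rng.poisson(3, 500)
df = pd.DataFrame({"Gold": gold, "Silver": gold + rng.poisson(2, 500)})

# Each hexagon's shade reflects how many (Gold, Silver) pairs fall inside it
ax = df.plot.hexbin(x="Gold", y="Silver", gridsize=10)
```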
You can see it's a little hard to tell here, but each hexagon represents a set of values, and this makes a lot more sense when you're looking at a ton of points; even so, you can see the highest density of counts is down in the corner, so let's put that in the notes. A contour plot adds lines on top of your scatter plot showing where the highest-density areas are, and again, these make sense when you have tons of data; obviously we have a small amount here, so we can already see what's going on, but with a ton of data a contour plot probably makes sense. We have to import seaborn for this, which is another great Python library for more advanced visualizations, and as you can see I don't have it on this computer, so let me conda activate my minimal_ds environment, conda install it, and wait. All right, it looks like seaborn finished downloading, so let's try this one more time: replace the book's columns with medal count Gold and medal count Silver, and there we go. That's what a contour plot is: you have the scatter plot, and it's showing you where the greatest density of points is. It's just another plot you can use, especially when you have too many values to actually show on a scatter plot. Heat maps are another type of plot you can use; seaborn is usually how I create mine, and a heat map is a great way to show, for example, that correlation matrix we had, with some color so it's easier to parse.
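If you don't have seaborn handy (the video has to conda-install it at this point), the contour-over-scatter idea can also be sketched with numpy's 2-D histogram and plain matplotlib; the data here is synthetic, a hypothetical stand-in for two correlated numeric columns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend

import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for two correlated numeric columns
rng = np.random.default_rng(1)
gold = rng.normal(10, 3, 1000)
silver = gold + rng.normal(0, 2, 1000)

# Bin the points on a 2-D grid, then draw contour lines over the scatter
counts, xedges, yedges = np.histogram2d(gold, silver, bins=15)
xcenters = (xedges[:-1] + xedges[1:]) / 2
ycenters = (yedges[:-1] + yedges[1:]) / 2

fig, ax = plt.subplots()
ax.scatter(gold, silver, s=5, alpha=0.3)
# histogram2d's rows index the x-bins, so transpose before contouring
cs = ax.contour(xcenters, ycenters, counts.T)
```

Seaborn's `kdeplot` does a smoother version of the same thing; this is just the dependency-free approximation.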
So I think I can do it like this: sns.heatmap, passing in medal_count.corr() as the data, and there we go: you get a heat map of how the different variables relate to each other. For example, rank is strongly related to gold, silver, and bronze. Of course, this is probably not the best data to demonstrate with, because the columns are inherently highly correlated with one another; gold against gold is of course a perfect correlation, and gold against silver is not as strong as gold against gold, but it's clearly there, which makes sense. A heat map is just another way to display this data, and you can look at the documentation to see other things you can do with it; let me add that to the notes too. All right, next we have something called a contingency table, which basically takes the count of the number of instances of each combination of two variables within a data set, so it's a good way to compare two variables to one another. You can use percentages and other values if you want, but we'll go with counts for now. Building one from our medal count, you can see we've created a table with the teams listed down the side and the number of golds they got: Argentina got zero, as did Armenia, but Austria got one. What I might actually do is also apply a fillna(0), so we get zeros where we need them. So this is a contingency table, comparing each team with the number of golds it got.
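If seaborn isn't available, the same correlation heat map can be drawn with matplotlib's `imshow`; a sketch with a hypothetical slice of the medal table standing in for the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical slice of the medal table
medal_count = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Gold": [39, 38, 27, 22, 20],
    "Silver": [41, 32, 14, 21, 28],
})
corr = medal_count.corr()

fig, ax = plt.subplots()
# One colored cell per pairwise correlation, pinned to the [-1, 1] scale
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
```

Seaborn's `heatmap` adds the annotations and nicer defaults on top of essentially this.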
Team by team it's a little difficult to show, though, so what I actually want to do is reduce the table to just two variables; that's a lot cleaner. What if I want to see how silvers and golds go together? I change the axes to Silver and Gold, and now we can see how they pair up: zero golds and zero silvers, there are 11 countries like that; four golds and three silvers, zero countries. This is what a contingency table gives us, and we want the fillna(0) because NaN doesn't make any sense in this situation: if the value is null, it's supposed to be a zero. We already showed the box plot, so let's go ahead and see how a violin plot works; I'll look that up in seaborn. I should probably read up on seaborn properly, because it has a lot of plots that are just very useful and that aren't in matplotlib for some reason. Within Python you have a bunch of different libraries that do plotting in different ways: there's Bokeh, there's seaborn, there's one that starts with an A — Altair, that's the one — there's matplotlib of course, and there's Plotly. All of these do different things, and I'm not 100 percent sure what every difference between them is; I just know matplotlib is kind of the standard one, and I keep some fluency with it in case of
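The two-variable table the video builds by hand is exactly what `pd.crosstab` produces, and it fills empty combinations with zeros automatically, so no `fillna(0)` step is needed; the frame below is a hypothetical stand-in for the medal table.

```python
import pandas as pd

# Hypothetical medal table: gold and silver counts per country
medal_count = pd.DataFrame({
    "Gold":   [0, 0, 0, 1, 1, 2, 3],
    "Silver": [0, 0, 1, 0, 2, 1, 3],
})

# Count how many countries fall in each (Silver, Gold) cell;
# combinations with no countries get 0, not NaN
contingency = pd.crosstab(medal_count["Silver"], medal_count["Gold"])
print(contingency)
```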
for interviews and such; people might ask you standard questions about how Matplotlib works, and it's just important to know. But Seaborn adds a lot of features that Matplotlib is missing, and it's actually built on top of Matplotlib. Plotly is a great tool in that it creates interactive charts, and that's what I personally like to use the most; Plotly is amazing, and especially recently there seems to be a lot more development in it. So you'll see there are quite a few tools you can use. Okay, violin plot, x and y, so let's go ahead and try this. Let's make a violin plot; what did we do for our box plot? Check this out: we should be able to do sns.violinplot with x set to the gold column of the medal count, and there we go. And here's the beautiful thing about a violin plot, not unlike a box plot; let me pull up a box plot to show how that works. medal_count.plot.box, I don't think it'll work like that; I think I had to do it like this. How do we do the box plot again? Yeah, column equals, okay. Oh, it's just .boxplot on the DataFrame, but .plot.box on the column; interesting, that is confusing. You can see the big difference here is that the violin plot looks a lot better, but more importantly, it shows us how many values are in each section. The box plot just shows us the range of the values, but the violin plot gives us a bit more information by showing that there are a lot of values here and some over there. That distribution is what makes violin plots so useful and such a great plot to use. And then I have a couple of other plots I added in here, like categorical heat maps; I don't think the book covered that. Can I easily make this? No, this looks a little bit different, but it's similar. So what this would do is it
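The violin-versus-box comparison above can be sketched like this. The data is a made-up stand-in for the video's medal-count column; the confusing pandas detail noted above is that the box plot is reachable both as `DataFrame.boxplot(column=...)` and as `.plot.box()`.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the Olympic medal-count table in the video.
medal_count = pd.DataFrame({"gold": [0, 0, 0, 0, 1, 1, 2, 3, 4, 7, 10]})

# Seaborn draws the violin plot directly from the column; the width of
# the "violin" shows where values are concentrated.
ax_violin = sns.violinplot(x=medal_count["gold"])

# The pandas box plot shows the range and quartiles, but not the
# density. Two equivalent spellings exist, as noted above.
plt.figure()
ax_box = medal_count.boxplot(column="gold")
```

The violin plot adds a density estimate on top of what the box plot shows, which is exactly the extra information described above.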
basically gives us the proportion of each variable in the total data set compared to another variable, like these different farmers here versus these different vegetables here. And then there are also facet and trellis plots, which are basically the same scatter plots and bar charts and so on, except they have a shared axis to compare different variables to one another. You'll see the 10 over here corresponds to this entire chart, the 8 to this chart, then 6, 4, 2, 0, and so on. This entire arrangement is what you call facet and trellis charts, and they're easy to make using Plotly, which is the library I personally like to use. In order to keep the video a decent length, because I think we're at an hour and 11 minutes right now, I'm going to cut it off there. Again, if you're interested in the notes, they are all available over here; I'm just going to do a quick overview of them. As with my hands-on machine learning course, the notes are available for your convenience on my Patreon, but I don't want to keep the knowledge away from anyone, so I'm doing a quick run-through; that way, if you're at all interested, you can screenshot whatever interests you. Let me know if you have any questions in the comments section below. I generally don't help troubleshoot all that much, but if there's an actual mistake I made, or something incorrect about what I've done, then I usually respond to those types of comments. There we go, almost done. And that is the overview of exploratory data analysis, chapter one of Practical Statistics for Data Scientists. Thank you guys so much for joining me. If you like what you saw here, please make sure to like, comment, and subscribe; it's a great way to show the YouTube
algorithm that you care about my content and that it's worth making more of. The metrics are very important to my actually continuing on with this channel, and we've been doing a great job recently and gaining some awesome subscribers. So thank you guys so much, and I hope you have a great day.
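The facet/trellis charts described earlier are small multiples that share an axis so panels can be compared side by side. The video recommends Plotly for these; as a rough sketch of the same faceting idea using Seaborn (which is also discussed above), with made-up long-form data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import seaborn as sns

# Made-up long-form data: one row per (group, x) observation.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1, 2, 3, 1, 2, 3],
    "y": [2, 4, 6, 1, 3, 5],
})

# col="group" draws one scatter panel per group; the panels share the
# y-axis, which is the defining feature of a facet/trellis layout.
grid = sns.relplot(data=df, x="x", y="y", col="group")
```

`relplot` returns a `FacetGrid`, so you get one axes object per panel arranged in a grid.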
Info
Channel: Shashank Kalanithi
Views: 22,109
Rating: 4.9911699 out of 5
Id: wwsizzg6UjU
Length: 87min 33sec (5253 seconds)
Published: Mon Sep 27 2021