Exploratory Data Analysis with Pandas Python 2023

Video Statistics and Information

Reddit Comments

Nice work

2 points · u/Equal_Astronaut_5696 · Mar 23 2022

Very well done!

2 points · u/neb2357 · Mar 23 2022
Captions
So you found a new dataset that you're excited to explore, or maybe you just want to get familiar with using Python for data analysis. In this video I'll show you some of the essential skills you'll need to perform data analysis with Python and pandas. Hi, my name is Rob, and I'm a data scientist and three-time Kaggle grandmaster. I've spent a lot of my time exploring new datasets on Kaggle, and in this video I'm going to walk step by step through the process I usually take when exploring a brand-new dataset. If you're completely new to Python and pandas, I suggest you first watch my introduction to pandas tutorial. I'll be working entirely in a Kaggle notebook, so if you don't have Python set up on your computer, don't worry: just click the link to the Kaggle notebook, fork or copy it, and step through the code along with me or after the video is over. With that, let's get started.

Okay, here we are in a Kaggle notebook. A Kaggle notebook is very similar to a Jupyter notebook: an environment where we can run code and write descriptions around it in plain text. In general I always import these packages: pandas, imported as pd; numpy, imported as np; and a couple of visualization libraries, matplotlib (import matplotlib.pylab as plt) and a package called seaborn (imported as sns), which is really helpful for doing exploratory plots with our data. A few other things before we get too far: we're going to use a style sheet for our matplotlib and seaborn plots, which we apply with plt.style.use('ggplot'); I think it looks pretty nice. Another setting expands the number of columns shown when we display a data frame in our notebook. The way we do that is pd.set_option, setting the max columns to, let's say, 200, just to make sure we can see them all. I'm going to comment that part out for now and show you why it's necessary later.

Let's go ahead and run this. That cell has run, and the next thing we need to do is import our data. With Kaggle we can go to this tab on the side; I've already added our dataset. We'll be looking at a dataset I created with information about over a thousand roller coasters: things like each coaster's speed, the material it was made from, and some other cool details. If you're starting a notebook and want a different dataset, you can always click Add Data and browse the datasets that already exist, or upload your own data; and if you're working in a notebook on your own machine, you can just reference whatever CSV file you have. I'm going to import our data, call it df (for data frame), and use read_csv to read it in; I know it's in the input folder and it's called coaster_db. Now we have a data frame we can work with.

The next step is some basic data understanding. The first thing I like to do is df.shape, which tells us the shape of the data we just loaded: in this dataset we have 1087 rows and 56 columns. We can also run the head command, which shows us the first five rows; we can change the number of rows we want to see, say, the first 20.
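The setup and first-look steps described so far can be sketched in one cell. Since the Kaggle coaster file isn't bundled here, the read_csv call is shown as a comment and a tiny made-up stand-in frame is used instead; all column names and values below are invented for illustration.

```python
import pandas as pd
import matplotlib.pylab as plt

plt.style.use('ggplot')                    # nicer default plot style
pd.set_option('display.max_columns', 200)  # show every column in head()

# In the video the data comes from the Kaggle input folder, e.g.:
# df = pd.read_csv('coaster_db.csv')
# Invented stand-in frame so this sketch runs anywhere:
df = pd.DataFrame({
    'coaster_name': ['Cyclone', 'Fury 325', 'El Toro'],
    'speed_mph': [60.0, 95.0, 70.0],
    'height_ft': [85.0, 325.0, 181.0],
})

print(df.shape)    # (number of rows, number of columns)
print(df.head())   # first rows of the frame
```

With the real file, `df.shape` would report 1087 rows and 56 columns rather than this toy frame's three and three.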
Now notice that when we run this head command (let's put it back at five), it shows the first set of columns, then at a certain point just dots, and then picks up again with the last few columns. This is because pandas by default does not show every single column in the data frame, but for exploration I find it easier to show them all. So going back up to our pd.set_option, we'll set the max columns to 200, which is plenty to see every column in this dataset, and rerun the head command to see all of them. Another thing we want to do is list out all the columns; since this data frame has a lot of them, we can just do df.columns. Eventually, if we want to subset our dataset, we can remove some of these columns, and I'll show you how to do that later.

Next, for each of these columns, we want to find out what dtype pandas has decided it is. If you remember from the earlier tutorial, every column in a pandas data frame is a Series, and every pandas Series has a type. If we type df.dtypes we can see the type of each column: we have a lot of object columns, which are just string-type columns, and some float values. Down at the bottom are the cleaned numeric versions of features, like the height of the roller coaster and the speed in miles per hour, which I added when I created this dataset. One last thing to show here is the describe function. If you type df.describe(), it shows statistics about the numeric data in our dataset: for the height of the roller coasters we have a count of 171 values, a mean of 101 feet, and so on. All of this gives us a good understanding of the data before we dive into it any further.

Okay, now it's time to move on to the second step, which is data preparation. We have a general understanding of the columns in our data and how many rows we have, but we want to do some cleaning before we actually get into analysis, and it's important to first drop any columns or rows we don't want to keep, or else we might waste time cleaning up columns we won't end up using. Let's run df.head() again to remember what sort of columns we have. One thing to note about this dataset is that some columns contain string versions of values, like speed and length, and there are similar columns I created, over to the right, with the numeric versions: the text like 'mph' stripped out and everything converted to the same unit, miles per hour. Same thing with G-force and inversions. So we want to subset this dataset to just the columns we want to keep, and there are two ways I like to do this.

If we run df.columns again, we can copy the full list of columns and subset by putting that list inside a second set of brackets. That gives us pretty much the same dataset, but now we can remove, column by column, whatever we decide not to keep. I already know I want to keep the roller coaster name and the manufacturer, so I'm just commenting out the lines in the list that we don't want to keep, which makes it easy to track what we've dropped: height restriction, inversions, cost, trains, park section. All of these may be interesting to look at later, but we're not going to look at them now. We do want to keep some of the information at the end, though: year introduced, latitude, longitude, the type of material used, opening date, and the clean speed values. For speed we'll keep just miles per hour, and for height just the value in feet. That looks like a good subset of the data; actually, let's add in location and status too. Now if I run this cell, we have a smaller dataset: the columns have been reduced to just the ones we want to keep, and we have information we'll be able to work with.

There's a second way to subset a dataset's columns, and that's the drop command. If we only want to drop one column, we can write drop and provide one column or a list of columns; let's remove opening date as an example. We also have to provide axis=1 so it knows to drop a column, not a row. Now if you look, the opening date column has disappeared. I'll keep this as an example of dropping single columns, but it's not the approach we're using here; we'll use our subsetting with a list of columns and reassign our data frame to the new, subsetted data frame. One thing that can be important: we want to add the copy() command at the end of the subsetted data frame. This makes sure pandas knows it's a brand-new data frame and not just a reference to the old one, which will matter when we manipulate the data frame later on. Let's run the cell, and if I run df.shape now, we see we still have 1087 rows but we only
have 13 columns. If we run dtypes on this, we see coaster name, location, and so on, and it looks pretty good. One thing I'm noticing is the opening date column: it's showing as an object column, but we don't want that because it's a date. We can force it to be a datetime column by running pd.to_datetime, and now the dtype is a datetime64 column. This is a way of ensuring our dtypes are correct: we can overwrite a column with the result of to_datetime. A similar option: year introduced we already know is an int column, but if it were stored as strings we could run pd.to_numeric on it, and pandas would automatically try to make it a numeric column. Not necessary here, but good to know.

Now we're going to learn how to rename our columns; there are some column names here I'm not too happy with, and there's a pretty easy way to rename them. We can run df.columns again to see what they are. We have a mix of lowercase and uppercase names, and I also think it's important not to have spaces in column names, which luckily we don't have here. So I'm going to rename some of these. I'll run df.rename, ask it to rename columns by writing columns=, and provide a dictionary. Don't worry, this is pretty easy: the dictionary is just curly braces containing each old name and its new name. That lets us rename the coaster name column, and we'll do the same with a few of the others to put them in uppercase-first-letter format. There we go: I've renamed all of these columns into names I'm happier with. They all start with an uppercase letter, and I've removed the _clean suffix, because these are now the only columns we have for inversions and G-force. I'm going to go ahead and overwrite my data frame with these newly renamed columns, and if I run df.head() now, we see beautiful column names; everything looks pretty good.

Okay, the next step is to identify where missing values exist in the data frame and how often they occur. The command we run to identify missing (null) values is isna. If we run df.isna(), it tells us, for every single row and column, whether there's a null value, but that's a little overwhelming to look at; instead we'd like to see a sum of the number of null values per column. Here we can see that for this dataset, status is null for 213 of our rows, latitude and longitude are missing for some, and G-force is missing for some. Now we have a general understanding of where we might have missing values.

Similarly, we want to see whether any of our data is duplicated: there might be issues where two rows are completely identical, and we would not want that in our dataset. We check by running duplicated() on the data frame. By default, duplicated flags the second and any additional occurrences of a duplicated row and ignores the first. What's nice is that it gives us a list of True/False values, so we can simply do a loc on the duplicated values to see which rows are duplicates; none of them are in this dataset, which is nice. We can also run duplicated on a subset of columns, so let's see whether there are any duplicated coaster names by passing subset with the coaster name column, and it looks like there are some. If we run df.loc on this list of duplicated rows, it shows us just the rows from the second time they occur onward. We can see that there are actually 97 rows with a duplicated coaster name; you might want to think about reasons why that could be.

To get an idea of why we have duplicated rows, let's look at one of these coaster names and inspect the multiple rows, using the query command: df.query, searching the data frame for where the coaster name equals (in quotes) the name itself. This Crystal Beach Cyclone indeed has two rows. Most of the values look identical, but if we look very closely: year introduced. That's not identical; it has multiple years where it was introduced. This might be an error in our dataset, or the coaster may have been put online, taken offline, and put back online, and we only want to look at the first time it was introduced. So, having checked an example duplicate, what we want to do is remove any rows that are duplicated across a certain set of columns. Let's use the same df.duplicated, run on a subset of columns: if the coaster name, location, and opening date are all duplicated, flag those rows. We can put a sum on this to see the count: there are 97 duplicated rows. Then we take the inverse and select just the rows that are not duplicates, i.e. only the first occurrence of each. The way we write the inverse is with a tilde (~) in front; now the Trues have swapped with the Falses, and we can loc just the rows that are not duplicates over this subset of columns and save that off as our data frame. One other thing I like to do before saving this off: now that we've subsetted, we're actually dropping rows in our data frame.
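The cleaning steps described above (column subsetting, dtype fixes, renaming, missing-value counts, and de-duplication) can be condensed into one small self-contained sketch. The miniature table below is invented, and its column names only loosely mirror the real coaster file.

```python
import pandas as pd

# Invented miniature version of a raw coaster table
df = pd.DataFrame({
    'coaster_name': ['Cyclone', 'Cyclone', 'El Toro'],
    'Location': ['Crystal Beach', 'Crystal Beach', 'Six Flags'],
    'opening_date_clean': ['1926-01-01', '1926-01-01', '2006-06-11'],
    'speed_mph': [60.0, 60.0, 70.0],
    'Status': ['Removed', 'Removed', None],
})

# 1. Keep only the columns we care about; .copy() makes a new frame,
#    not a view of the old one
df = df[['coaster_name', 'Location', 'opening_date_clean',
         'speed_mph', 'Status']].copy()

# 2. Fix dtypes: the date column was read in as object (strings)
df['opening_date_clean'] = pd.to_datetime(df['opening_date_clean'])

# 3. Rename columns with an {old_name: new_name} dict
df = df.rename(columns={'coaster_name': 'Coaster_Name',
                        'opening_date_clean': 'Opening_Date',
                        'speed_mph': 'Speed_mph'})

# 4. Count missing values per column
print(df.isna().sum())

# 5. De-duplicate: keep only rows NOT flagged by duplicated() over the
#    identifying columns (~ inverts the boolean mask), then reset the index
df = df.loc[~df.duplicated(subset=['Coaster_Name', 'Location',
                                   'Opening_Date'])] \
       .reset_index(drop=True).copy()
print(df.shape)
```

On this toy frame the duplicated Cyclone row is dropped, leaving two rows with a clean 0..1 index.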
Well, before we were dropping columns; now we're dropping rows, and that means our index no longer necessarily increases by one from row to row. We can fix that by running reset_index. When we run it, though, it adds an index column, so to keep it from adding that column we pass drop=True when we call reset_index. There we go: our index now runs from zero to the final row number, we have our subset of columns, and it's looking good. I'm going to copy this again and make it our data frame. If we run shape now, we have 990 unique roller coasters with 13 feature columns to explore.

All right, now is where the real fun begins. We're done cleaning up our dataset; we have a good subset of the data and a good understanding of where missing values occur, and now we're going to look at each feature itself. This is very important to do, to understand the distribution of each feature and spot potential outliers in the dataset; it's also known as univariate analysis. Let's run it on the year introduced column, which is a good numeric feature to look at. One very common thing we can run on a single column, or Series (remember, a single column in a data frame is just a Series), is value_counts. This is very powerful and I use it all the time: it counts how many times each unique value occurs and automatically orders the result from most to least frequent. We can see that in 1999, 46 roller coasters in this dataset were introduced, and the next highest are 2000 and 1998. This gives us an idea of which years were the most and least common for roller coasters to be introduced.

Let's say we want to take this value_counts result and make it into a plot so we can see the top years. Plotting every single year would be a bit much, so let's run head on it to take the 10 most common years, and then call plot. (These backslashes just let me break the code up into separate lines to make it a little cleaner to read; it's the same as writing it all on one long line.) So we make a plot with the kind set to bar; that shows us each year and its count, but we have no axis labels, and that's not good. I thought about a horizontal bar plot, but let's do a normal bar plot. We can also add a title, 'Top 10 Years Coasters Introduced', and we can save the plot as a matplotlib Axes by assigning it to ax. With the Axes we can add some additional information: set the x label to 'Year Introduced' and the y label to 'Count'. There we go: the top 10 years coasters were introduced.

Another thing we commonly want when doing analysis on a single column is an idea of its distribution. Take, for example, the speed in miles per hour. A lot of the early roller coasters don't have a speed value, so those values will just be missing, but for later years it's pretty common to have it. We'll visualize this with the plot command, with the kind set to histogram. A histogram shows us, in different bins, the count of values, so for a continuous value like speed it's great, and sometimes I find it helpful to try different bin counts to get a better sense of the distribution. I'm not sure how many bins it defaults to, but with 20 bins it's a little clearer to see the distribution of the speed. Let's make sure we always add a title: 'Coaster Speed (mph)'. We can again save the plot off as an Axes if we want to add extra features, and set the x label to 'Speed (mph)'. Now we have a plot of the distribution of speed in miles per hour, and we notice some things: the most common speeds sit somewhere between 40 and 50 mph, and there are also some speeds way out at the end that might be interesting to look at later. We could run this on all of our features, and if you're doing exploratory data analysis I'd encourage you to look at this for every feature. A very similar option, instead of a histogram, is a density plot: it's a little less cluttered and easier to interpret when you're looking at multiple distributions, because they're all normalized. I'll take the same code and make one change, setting the kind to kde, for a kernel density estimate plot. For the roller coaster speed, the KDE shows humps at around 35 mph and 50 mph, very similar to the histogram.

All right, getting even more fun: now we're going to look at feature relationships. We've looked at each feature individually, its distribution and other characteristics of a given column in our dataset.
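Here's a minimal sketch of the univariate steps above (value_counts, a labeled bar chart, a histogram, and a KDE), run on an invented sample rather than the real coaster data. The KDE relies on scipy being available, as it does in the video's Kaggle environment.

```python
import pandas as pd
import matplotlib.pylab as plt

# Invented sample: the year each coaster was introduced and its speed
df = pd.DataFrame({
    'Year_Introduced': [1999, 1999, 2000, 1998, 2000, 1999, 2021],
    'Speed_mph': [45.0, 50.0, 38.0, None, 72.0, 41.0, 120.0],
})

# Count coasters per year, most common first
top_years = df['Year_Introduced'].value_counts()
print(top_years.head(3))

# Bar chart of the top years, with labels set on the returned Axes
ax = top_years.head(10).plot(kind='bar',
                             title='Top 10 Years Coasters Introduced')
ax.set_xlabel('Year Introduced')
ax.set_ylabel('Count')

# Distribution of speed: histogram, then a KDE on a fresh figure
plt.figure()
ax = df['Speed_mph'].plot(kind='hist', bins=20, title='Coaster Speed (mph)')
ax.set_xlabel('Speed (mph)')
plt.figure()
df['Speed_mph'].plot(kind='kde', title='Coaster Speed (mph)')
plt.show()
```

Note that both hist and kde silently drop the missing speed value, which matches how the early coasters with no recorded speed are handled in the video.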
But what we really want to see is how the different features relate to each other in our dataset. There's a lot we can do to compare features, and one option is to compare two features side by side with a scatter plot. Let's take the data frame and make a plot with the kind set to scatter, the x value our speed in miles per hour, and the y value the height in feet. Plotting this, we get a scatter plot with a dot representing each row of our dataset: speed on the x-axis and height on the y-axis. We'll also add a title, 'Coaster Speed vs. Height'. One other thing to keep in mind: this plot command in pandas creates a matplotlib object, and to make the notebook output a little cleaner we add plt.show() at the end, so the notebook doesn't display the object itself; if we remove it, you can see the subplot's printed-out representation. plt.show() is just a cleaner way of showing it.

This scatter plot looks nice, but we made it using pandas' basic built-in functionality; using another package like seaborn, we can do somewhat more advanced analysis and plots with this data. Remember, earlier we imported seaborn as sns, and we're going to use seaborn's scatterplot function to make a plot very similar to the one above. It asks for values similar to what we provided before. Here's a nice trick: if you click inside the scatterplot call and hold Shift+Tab, you can see the function's docstring, which shows that it takes an x value, a y value, and the data, which will be our data frame. So let's copy some of this from before: x is our speed, y is our height, and data is our data frame. There we go, a very similar plot to the one above. But there's some cool stuff we can do with seaborn that we can't do with matplotlib out of the box: we can have the year introduced, or another feature, drive the color, by passing it as the hue in our scatterplot call. Look here: now the dots are colored by the year the roller coaster was introduced. Just a different type of scatter plot, colored by another variable.

So far we've compared two features against each other, but what if we want to compare more than two? Seaborn has a nicely built-in function called a pair plot, where we can compare multiple features against each other. Again, holding Shift+Tab inside the function shows what needs to be provided: instead of a single x and y, we can provide multiple variables, and it shows a comparison between each pair, defaulting to a scatter plot, while along the diagonal it shows the distribution of each individual feature. Let's pass in our data frame as the data and pick the features we want to compare. This might take a little while to run, and while it's running I'll add plt.show(), as before, so it displays correctly. And what do we find? I'll have to zoom out a little for you all to see, but now we have the distribution of each feature and the relationships between every pair of features as scatter plots, in matrix form: for each of the features we provided, how do they interact with each other? Pair plots are awesome, and we can take this up a level by adding a hue: let's use the type of material as the color of the dots. It's a very similar plot, but now the dot colors represent the material type. On the right side you can see the legend: red is wood and the purplish color is steel, and you can see that in the really early years there isn't much steel; it starts being introduced around the 1950s and then becomes ever more popular. There's a lot of interesting stuff we can see just from this pair plot; I'll zoom out again so you can see it all at once. So now we know what a pair plot is and how powerful it can be.

Another thing you might want to do when comparing features is look at correlation, and luckily in pandas that's very easy. On the subset of features we know are numeric, let's drop any null values and run the corr function. What does this show us? The correlation between the different features: between speed and height the correlation is 0.73, some pairs have a negative correlation, and something like year introduced might not have a strong correlation with anything. Another way I like to look at this is with seaborn's heatmap. If we run sns.heatmap and pass in this correlation data frame (let's call it df_corr; we can still print it out to see what it looks like), we get a heat map showing how correlated the values are to each other: just another way of seeing it. I also like to add the annotations so we can see the raw correlation values. Remember, every feature is in perfect correlation with itself, which is why we see ones along the diagonal, but otherwise this lets us spot interesting correlations and relationships in the data.

Okay, we're at the final step of the exploratory data analysis process, and it can be one of the toughest parts: asking a question that we want to answer with our dataset. Now that we have a good feel for the data, I think we can ask one, and I'm interested in a column we haven't really looked at much yet: location. We know each location can have multiple roller coasters, and the question I'll write out is: what are the locations with the fastest roller coasters, with a minimum of 10 coasters at that location? In other words, if we wanted to go to the park with the fastest roller coasters, which would it be? We'll use some things we've learned before. First, on location, let's do a value_counts, and we notice there's an 'Other' location. 'Other' is not truly a location, so we'll ignore it for this analysis and query for rows where location does not equal 'Other'; that filters out those rows. Then we group by location. Now each location is grouped, and we can look at the speed (mph) column. Per location, we want to find a few things, like the average speed and the number of coasters at that location, and we can do this in one step using the agg function: we aggregate within each location and get the mean value and the count value.

So what do we have here? All these different locations, with the average speed and the count. Now we'll run another filter: query where the count is greater than or equal to 10, so a minimum of 10 roller coasters, and see which locations have the fastest values. To make this a little cleaner, we'll sort the values by the mean speed. This is great: for each location with at least 10 coasters, we now have the average speed of its roller coasters. Let's make this a plot; it'd be a lot better as a plot. We'll plot it as a horizontal bar plot, plotting only the mean value, with the title 'Average Coaster Speed by Location'. We can save the Axes as we've done before and set the x label to 'Average Coaster Speed (mph)'. And there we go: Busch Gardens has the highest average coaster speed, followed by Cedar Point, and so on. We've answered our question of which parks have the highest average speed with a minimum of 10 coasters, all in these few lines of code. When you ask a question like this, it will take you a while to work out which steps get you to the result, but by asking the question you'll be forced to search for how to use pandas to give you that kind of solution.

Thanks for sticking around this long; I hope you enjoyed the tutorial. By now you should have a basic understanding of how to use pandas and Python to do simple data exploration. If you enjoyed it, please give me a like and subscribe, and follow me on Twitch, where I stream live coding from time to time. See you around!
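The feature-relationship tools covered in the video (a pandas scatter, a seaborn scatterplot with hue, a pairplot, and an annotated correlation heatmap) can be sketched compactly; the five-coaster table below is made up for illustration, so the correlations it produces are not the 0.73 seen on the real data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pylab as plt

# Invented numeric features for a handful of coasters
df = pd.DataFrame({
    'Speed_mph': [40.0, 55.0, 70.0, 95.0, 120.0],
    'Height_ft': [80.0, 110.0, 180.0, 325.0, 456.0],
    'Year_Introduced': [1995, 1999, 2006, 2015, 2005],
})

# pandas built-in scatter of two features
df.plot(kind='scatter', x='Speed_mph', y='Height_ft',
        title='Coaster Speed vs. Height')

# seaborn scatter with a third variable driving the color
plt.figure()
sns.scatterplot(x='Speed_mph', y='Height_ft',
                hue='Year_Introduced', data=df)

# pairwise relationships for several features at once
sns.pairplot(df, vars=['Speed_mph', 'Height_ft', 'Year_Introduced'])

# correlation matrix, then an annotated heatmap of it
df_corr = df.dropna().corr()
print(df_corr)
plt.figure()
sns.heatmap(df_corr, annot=True)
plt.show()
```

Ones appear down the heatmap's diagonal because every feature correlates perfectly with itself, exactly as noted in the transcript.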
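The final question-answering step can be written as one groupby/agg chain. The park and speed data below are invented, and a minimum of 2 coasters is used instead of the video's 10 so the tiny sample still has rows.

```python
import pandas as pd
import matplotlib.pylab as plt

# Invented park/speed data mirroring the question's shape
df = pd.DataFrame({
    'Location': ['Cedar Point', 'Cedar Point', 'Busch Gardens',
                 'Busch Gardens', 'Other', 'Small Park'],
    'Speed_mph': [70.0, 92.0, 73.0, 95.0, 50.0, 40.0],
})

summary = (df.query('Location != "Other"')    # drop the catch-all bucket
             .groupby('Location')['Speed_mph']
             .agg(['mean', 'count'])          # average speed + coaster count
             .query('count >= 2')             # minimum number of coasters
             .sort_values('mean'))            # slowest to fastest
print(summary)

# Horizontal bar chart of the answer
ax = summary['mean'].plot(kind='barh',
                          title='Average Coaster Speed by Location')
ax.set_xlabel('Average Coaster Speed (mph)')
plt.show()
```

On this sample, Small Park is filtered out by the count threshold and Busch Gardens ends up with the highest mean, matching the shape of the video's result.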
Info
Channel: Rob Mulla
Views: 113,318
Keywords: Exploratory Data Analysis with Pandas Python, python, tutorial, exploritory data analysis, data analytics, data science, machine learning, ai, numpy, data, Matplotlib, kaggle, exploratory data analysis, exploratory data analysis in python, exploratory data analysis python, pandas for beginners, exploratory data anaylsis in python, exploratory data analysis project, data analysis for beginners, analytics in python, data anaylsis python, jupyter notebook, kaggle for beginners, rob mulla
Id: xi0vhXFPegw
Length: 40min 21sec (2421 seconds)
Published: Fri Dec 31 2021