Pandas for Data Science in 20 Minutes | Python Crash Course

Video Statistics and Information

Captions
So today we're going to be taking a look at pandas, and no, not those kinds of pandas: pandas for data science. Pandas is a ridiculously powerful library used all around the world for data science. It helps you work with, process and filter tabular data so you can get your machine learning models out there a whole heap faster, and it's probably one of the most useful libraries to have under your belt as a data scientist. Let's take a look in detail at what we're going to go through. In this video we're going to be covering CRUD for pandas. You're probably thinking, what on earth is CRUD? Well, it stands for create, read, update and delete. We're basically going to cover everything you need to get up and started really quickly as a data scientist using pandas. We're also going to treat this like a bit of a crash course, so we'll read in some tabular data and go through all the steps you need in order to work with pandas. In terms of how we're going to do this, we're going to be working with Python, specifically the pandas library, and we're going to be coding inside of a Jupyter notebook, because that's what data scientists all around the world are using. So are you ready to get into it? Let's do it.

All right, let's dive straight in. We're going to break the video up into four key sections: creating data frames; reading data frames, and specifically how to work with all the data within them; updating them, so dropping rows and working with columns; and last but not least, how to output all of our resulting data.

The first thing we want to do is import pandas, which we can do pretty easily using the standard Python import functionality: import pandas as pd. Now, if we open up our pandas library and hit tab, you can see that we've got a whole bunch of stuff to work with. We're not going to go through all of it; we're going to focus on the 20 percent that's most important as a data scientist, and probably the most important function is read_csv. This allows us to create a data frame from a CSV and work with it in pandas. In this case we're going to use read_csv to open up our Telco churn CSV. If we open this up, you can see that it's just a regular old CSV with a whole bunch of data, and it looks like we've got about 3,334 rows, or 3,333 when you exclude the header row. Let's go ahead and read this in and create a data frame.

Alrighty, we've now created a data frame. To do that we used pandas, the read_csv function, and then we passed through the name of our CSV. You can't actually see the data frame yet because we haven't gone and visualized it, so we're going to skip over to our read section for a second to take a look at what this data frame actually looks like. To visualize a data frame, the easiest function you're probably going to want to use is the head method. This allows you to view the first five rows of data within your data frame, and you can see the data frame that we just read in. If we want to view more rows from the top, we can just pass through the number that we want to see: by passing 10 to head, we'll see our first 10 rows of data. Also, just to note, when working with pandas the first row is represented by index 0.
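A minimal sketch of the steps described so far, assuming the file is called telco_churn.csv and sits in the same folder as the notebook (the exact file name isn't spelled out in the transcript):

```python
import pandas as pd

# Read the Telco churn CSV into a DataFrame
# (file name assumed; adjust the path to match your copy of the data)
df = pd.read_csv('telco_churn.csv')

df.head()      # first five rows
df.head(10)    # first ten rows; remember the first row sits at index 0
```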
Alrighty, so we've now read in data from a CSV and visualized the top 10 and top five rows. Now what we want to do is take a look at how we might create a data frame from a dictionary. A dictionary is just a different Python data type, but it lets us work with a bunch of different methods. Also, when you're working with pandas you've got a whole bunch of other read functions as well: if you wanted to read in from your clipboard, from Excel, from HTML or SQL and a whole bunch of other sources, you can do that too. Here we're going to create a data frame from a dictionary. We can create a temporary dictionary first and then create a data frame from it; to do that we just use pandas, access the DataFrame class and call from_dict. You can see we've now created a data frame from a dictionary, so again we can use the head method to visualize it, and there you go: that's our data frame created from a dictionary.

Let's quickly recap. In our create section we took a look at how to create a data frame from a CSV, and we also created one from a dictionary, but as I said there's a whole bunch of other read functions in pandas that let you read in data from different sources.

Now, within our read section we're going to take a look at how to view our top five and bottom five rows (we've already seen how to view the top five using the head method), how to view our columns, how to create summary statistics, filtering, and then working with indexes using the iloc and loc functions. Let's go ahead and finish up section 2.1: we want to view our bottom five rows, and that's pretty easy, all we need to do is use the tail method. Alrighty, you can see that we've now used the tail method, and this gives us our bottom five rows. Again, we can pass through a different number if we want a different number of rows, so say we wanted our bottom 15, this gives us our bottom 15 rows.

Now on to section 2.2: viewing columns and data types. Our data frame has a bunch of different columns, and you can see that we've got state, account length and so on, and there are few enough that we can view them on screen, but when you're working with really big data frames you might want to be able to see all the columns you've got available. To do that, all you need to do is use the columns attribute, and you can now see all the columns within our data frame: state, voicemail, total day charge and so on. We can also take a look at the different data types within our data frame using the dtypes attribute. We've got a number of different data types: objects, integers and floats, and we've also got a churn value stored as an object. Sometimes you'll also see booleans and other data types, but we've pretty much covered the majority of them here. This just tells us all the different types of data we've got within our data frame.
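A rough sketch of building a DataFrame from a dictionary and inspecting rows, columns and dtypes, continuing with the df read in above; the dictionary contents and the df_from_dict name are purely illustrative:

```python
import pandas as pd

# Create a DataFrame from a plain Python dictionary
# (keys become column names, lists become the column values)
data = {'State': ['OH', 'NJ', 'KS'], 'Account Length': [128, 107, 137]}
df_from_dict = pd.DataFrame.from_dict(data)
df_from_dict.head()

# Inspect the bottom rows, the column names and the data types
df.tail()      # last five rows
df.tail(15)    # last fifteen rows
df.columns     # all column names
df.dtypes      # data type of each column
```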
Now, whenever you're working on a data science project, probably the most important step of the CRISP-DM life cycle is understanding your data, and a great way to get a handle on your data is to calculate summary statistics. We can do this with pandas using the describe method. Using describe we're able to calculate our count, mean, standard deviation, min, max, as well as our different quartiles, so this gives us a good overview of what our data actually looks like. But if you pay close attention, when we call the describe method it's only applied to the columns whose data types are integers or floats. What happens if we want summary statistics on objects, for example? We can do this simply by passing a different parameter to our describe method: if we call df.describe, we pass through include equals object. This gives us summary statistics on our non-float and non-integer values, so for state, international plan, voicemail plan and churn we now get our counts, the number of unique values, as well as the top values and their frequency. That pretty much covers summary statistics.

Now let's take a look at how we can filter on columns. We took a look before at all the different columns we've got, but how do we actually filter on the specific columns we want to visualize? First up, let's look at how to grab a single column. We've got our state column here, so how do we go about grabbing it? That's pretty easy: all we need to do is access that state value, so you can use it a bit like a key. This can get tricky when a column name has a space in it, because if you take a look at international plan, for example, we can't type df dot international plan; it's going to throw an error. To get columns that have a space in their name we need to do this a little differently: we just pass the name through in square brackets and treat it like a dictionary key. It's doing the same thing, it's just a different way to grab a column.

That lets us grab a single column, but what happens if we want to grab multiple columns? It doesn't look like we can grab two columns using either of those methods. To grab two columns, we do something similar to what we used for our international plan, except this time we pass through a list of the columns that we want, and you can see that by passing a list inside our data frame's square brackets we're now able to grab both our state and our international plan columns. I should also note that I'm going to share this entire Jupyter notebook, with all the code completed, in a GitHub repository; it'll be in the description below. I'd also love to know what you're using pandas for, so definitely leave a comment below.

Alrighty, back to our columns. We've now gone and shown our two different columns. Whenever we're performing an exploratory data analysis step using pandas, we might like to find the unique values within each of these columns. We can call the unique method to do that, and just like that, by calling unique on our state column we're able to get all of its unique values. Likewise, say we wanted to call it on our churn column: we can just change that state value to churn, and we're now able to see all the unique values within our churn column.
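A sketch of the describe and column-selection calls covered above; the column names (State, International plan, Churn) are assumptions based on the dataset shown in the video:

```python
# Summary statistics for numeric columns: count, mean, std, min, quartiles, max
df.describe()

# Summary statistics for object (string) columns: count, unique, top, freq
df.describe(include='object')

# Grab a single column
df['State']
df['International plan']   # bracket notation handles names containing spaces

# Grab multiple columns by passing a list of names
df[['State', 'International plan']]

# Unique values within a column
df['State'].unique()
df['Churn'].unique()
```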
So that covers filtering on columns: we've taken a look at how to grab a single column, how to grab a column with a space in its name, how to grab multiple columns, and how to grab unique values. Now, what happens if we want to filter on rows? Say we wanted to grab rows that have international plan set to no: we can use condition-based filtering to grab those values. By passing through our data frame, then selecting the column that we want to filter on as well as the condition, we're able to grab only the values that we want. This is particularly useful when you want to filter your data frame or grab a subset of values.

You can also filter on two different columns, or have multiple conditions; you just need to tweak this code a little bit. Let's copy this and say we wanted to look at customers who churned, so customers who left our business, and who weren't on the international plan, so international plan is going to stay as no. All we need to do is pass our different conditions inside parentheses: we put our first condition inside a set of parentheses, then pass through an ampersand to represent "and" (we're basically combining the conditions together), and then we pass through our second condition. So now we specify a different column: in this case we're filtering on churn, and we want customers that didn't leave, so false, and you can see that we're now filtering on both our international plan column and our churn column. If we wanted to see customers that weren't on the international plan and did churn, all we need to do is change our second condition to true, and now you can see that we've been able to filter on rows. So what we've done is we've taken a look at our top five rows, we've filtered on rows where international plan equals no, and then we've also done a dual-condition filter to pass additional conditions to that filter. You can stack more of these together if you want to build bigger filters.

Now we're going to take a look at how we can index with iloc. Indexing allows you to filter through your data frame, but instead of using specific values or specific conditions, we're going to be using integers. Say we wanted to grab the 15th row within our data frame: we can do this pretty easily using the iloc method, and there you go, we've now grabbed our 15th row. What we've done there is we've taken our data frame, used the iloc method and passed through 14. Now you're probably thinking, why 14? Well, remember at the start I said that your data frame begins at row 0, so if we want our 15th row we just need to subtract 1, and this is going to give us the row we're after.
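A sketch of the condition-based filters and the iloc lookup described above, assuming the international plan column stores the string 'No' and the churn column stores boolean-like True/False values, as the video suggests:

```python
# Rows where the customer is not on the international plan
df[df['International plan'] == 'No']

# Multiple conditions: wrap each in parentheses and join them with & (logical AND)
df[(df['International plan'] == 'No') & (df['Churn'] == False)]
df[(df['International plan'] == 'No') & (df['Churn'] == True)]

# Integer-position indexing: the 15th row lives at position 14 (counting starts at 0)
df.iloc[14]
```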
Now, the iloc function is really powerful because it allows you to hone in and get the specific data that you want. Say, for example, you only wanted the state value: we can extend our iloc call and also pass through our first column. When we pass two arguments, the first is the row and the second is the column, so by passing column zero we get our first column, and you can see we're now getting our state. Likewise, we can change this and pass through negative one, and it's going to go the other way, so now we're getting our last column, which is churn. We can also use iloc to do slicing: say we wanted a subset of our data frame, rows 22 to 33 for example, we can take a slice of our data frame using iloc. We've used the iloc method the same way, except this time we've passed through the segment of the data frame that we want, so we start at row 22 and grab everything up to row 33, excluding it, and you can see that we've now grabbed that slice. So that covers how to work with iloc; it's really an integer-based filtering function.

We can also work with the loc method. The loc method works a little differently from iloc in that it uses labels instead of integers. To use it, you first need an index set on your data frame. By default, when we read in our data frame the index is just created as a set of integers; that's why we've got these numbers here. What we can do is set an index on a specific column: say we want to set it on state, we can do that. Alrighty, you can see we've now set an index on state. To do that we took a copy of our data frame, then set our index and passed through the column that we want the index on, and we also passed the inplace parameter, which applies the change to that data frame without creating another copy. Then we visualized the first rows, and you can see that state is indeed our index. Now we can use the loc method to filter through our data frame and grab the specific rows that we want: say we only wanted rows for Ohio, we can use the loc method to do that. There you go, we've just used loc to filter through our data frame and grab only the rows for Ohio.

And that about wraps up our read section. We've now gone through how to show our top five and bottom five rows, show our columns and data types, create summary statistics, filter on columns, filter on rows, and we've also done some indexing with iloc and loc. Now it's time to get to updating. The first thing we're going to take a look at is how to drop rows. Whenever you're pre-processing your data frame, one of the steps you're probably going to want to do is drop rows that have nulls. We can count the missing values using the isnull method: by chaining together isnull and sum, we're able to see the number of missing values in each column. It looks like we've got 10 rows missing total day minutes, 10 missing total day calls, eight missing churn, and a whole bunch of others. With pandas, these missing values are represented as NAs.
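A sketch of the iloc and loc indexing plus the missing-value count described above; the two-letter code 'OH' for Ohio is an assumption about how states are stored in this dataset:

```python
# Row and column positions with iloc
df.iloc[14, 0]     # 15th row, first column (State)
df.iloc[14, -1]    # 15th row, last column (Churn)

# Slice rows 22 up to (but excluding) 33
df.iloc[22:33]

# Label-based lookups with loc need a meaningful index first
df2 = df.copy()
df2.set_index('State', inplace=True)   # make State the index without creating another copy
df2.head()
df2.loc['OH']                          # all rows whose State index is Ohio

# Count missing values per column
df.isnull().sum()
```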
What we can do to drop those rows is use the dropna function. By using dropna and passing through inplace equals true, we're able to drop all the rows that have missing values, so if we go and visualize this again, you can see that we now have zero missing values within our data frame. So what we did there is we visualized the number of missing values and then dropped all of those rows.

We can also drop columns. Say we didn't want a column, for example say we wanted to drop area code, we can drop it using the drop method. Simply by using the df.drop method we're able to drop a column: in this case we've dropped the area code column, and we've also passed through axis equals one, because we're dropping a column, not a row.

What about creating columns? Say we wanted to create a column that's the sum of total night minutes and total international minutes: we can really quickly add a new column and add those together, and there you go, that's how you create a new calculated column. What we did is we grabbed our total night minutes plus our total international minutes, added those together and put the result in a new column called new column. If we go and take a look now, you can see that our new column is there on the far right-hand side.

Say we weren't happy with the values in this column: we can update the entire column. Say we wanted everything in there to read a hundred, we can grab that column and set it equal to 100. Now if we go and visualize it, everything within that column reads as a hundred. What happens if we just want to update a single value, for example our first value here? This is where we can use iloc to go and replace those values. Let's go and grab our first value and set it to 10. Now if we go and visualize that column, you can see it's reading as 10. What we did is we first grabbed the value using iloc, so we grabbed our first row, which is 0, and then we grabbed our last column, which we can access by passing through negative 1, and then we set that value to 10.

The last update technique that I want to walk you through is how to use the apply function. The apply function is super powerful because you can iterate through every value within a specific column, or within your data frame, and make updates using conditions. Say we wanted to take our churn column and, rather than having it represented as trues and falses, we wanted it represented as ones and zeros in a new column called churn binary. Because there are different conditions there, we need a little bit of logic to determine how to update each value, and this is where the apply method comes in. There you go: we've used the apply method on our churn column, and then we've used a lambda to iterate through each one of those values, together with a conditional expression. It's very pythonic: we've basically said if x equals true, set it to 1, else set it to 0.
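A sketch of the update steps described above; the column names ('Area code', 'Total night minutes', 'Total intl minutes', 'Churn') are assumptions based on the dataset shown in the video:

```python
# Drop rows containing missing values, modifying the DataFrame in place
df.dropna(inplace=True)
df.isnull().sum()                     # should now report zeros everywhere

# Drop a column (axis=1 targets columns rather than rows)
df = df.drop('Area code', axis=1)

# Create a calculated column
df['New Column'] = df['Total night minutes'] + df['Total intl minutes']

# Overwrite every value in that column
df['New Column'] = 100

# Update a single cell: first row, last column
df.iloc[0, -1] = 10

# Conditional update with apply: map True/False churn values to 1/0
df['Churn Binary'] = df['Churn'].apply(lambda x: 1 if x == True else 0)
```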
So if we go and take a look at our columns now, you can see that we have a churn binary column. Let's double-check that we've done that for our true values as well, and you can see that if churn equals true, churn binary equals 1. Apply is really powerful because you can even pass functions in here and make updates that way. Just a key point to note: x represents the value that you're iterating over at a point in time; that's why we've said if x equals true.

All right, and that about wraps up our updating section. We dropped some rows, specifically our NAs, we dropped some columns, we created a new calculated column (specifically we added our night minutes and our international minutes), we took a look at how to update an entire column and how to update a single value, and last but not least we took a look at how to use the apply method to perform condition-based updates.

The last thing we're going to do is take a look at our delete and output section. Say we wanted to output our data frame to a CSV, to JSON or to HTML, this is where we do it (see the sketch at the end of this transcript). The first thing we'll look at is how to output our data frame to a CSV. This is pretty easy: all we need to do is type df.to_csv and then name it, so we'll just call it output.csv for now. This is going to output into the same folder that our Jupyter notebook is in, so you can see, if we scroll over here, we've now got an output.csv dataset, and the columns that we made updates to are all represented within it. Likewise, we can also output to JSON; if you're working with different web components, for example, working with JSON is obviously super useful. That's pretty easy too: all we need to do is call to_json and it's going to create all of the JSON for our data frame. Doing it for HTML is a similar method, and now we've got an HTML representation of our data frame. So again, we've just called to_html to get the HTML and to_json to get the JSON. And if we wanted to delete our data frame, all we need to do is type del df; this is a Python statement that gets rid of a specific object.

And that about wraps up our last section, outputting and deletion: we output to CSV, to JSON and to HTML, and last but not least we deleted our data frame. That also wraps up this Python crash course. We went through how to create data frames, how to read them, how to make updates to them and change data, and last but not least we took a look at how to output our data and delete it. Thanks so much for tuning in, guys. Hopefully you found this video useful; if you did, be sure to hit subscribe and tick that bell so you get notified of any future videos that I release, and if you've got any questions at all, I mean anything, be sure to drop a mention in the comments below and I'll get right back to you. Thanks again for tuning in. Peace.
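A minimal sketch of the output and cleanup calls from the final section:

```python
# Write the DataFrame out in different formats
df.to_csv('output.csv')   # writes output.csv next to the notebook
df.to_json()              # returns the JSON as a string (pass a path to write a file instead)
df.to_html()              # returns an HTML table as a string (pass a path to write a file instead)

# Remove the DataFrame object from memory
del df
```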
Info
Channel: Nicholas Renotte
Views: 13,262
Rating: 4.9809222 out of 5
Keywords: pandas, python
Id: tRKeLrwfUgU
Length: 23min 6sec (1386 seconds)
Published: Fri Aug 21 2020