A Gentle Introduction to Pandas Data Analysis (on Kaggle)

Video Statistics and Information

Captions
Hey YouTube, my name is Rob. I'm a data scientist who specializes in machine learning, data visualization, and all things Python. I also spend a lot of my free time on the website Kaggle, where I'm a three-time grandmaster. Lately I've been streaming live coding from time to time on the website Twitch, so if you enjoy this tutorial, please be sure to give me a follow there. In this video I'll be going over using the Python package pandas. Pandas is by far the most popular data wrangling package out there for Python. Pandas has had a lot of updates in recent years, so I'm going to show you the latest tips and tricks for how to use it. Everything I'm going to be showing you in this video I'll be doing in a notebook hosted on Kaggle, a free website for data scientists, so you can click the link in the video description, create an account, and in a few minutes you can follow along with me and actually step through the code line by line. I hope you're as excited as I am. With that, let's get started.

Okay, so here we are in an interactive session in a Kaggle notebook. Like I said, you can create a Kaggle account very quickly; it's free for anyone to use. The nice part about using a Kaggle notebook is you don't have to worry about getting set up with your Python environment locally, and there are also a lot of data sets that are already publicly available that you can just quickly add and start working with and manipulating. It's a great way to learn the tool without getting bogged down with all the setup. So I've created this notebook here and made just an introduction title, but the first thing we're always going to do when we're working with pandas, of course, is import pandas. In Python, the way you import is with the import statement, and we're going to import pandas as pd. This is always the way that it's done: importing pandas as pd. Along with this, it's pretty common to import numpy as np. NumPy is what pandas is built on top of, and it's pretty common to import them both together, so
we're going to execute this code cell by hitting Shift+Enter, and our notebook is going to load up, and now we have pandas loaded.

So before we get too far into using pandas, we need a fundamental understanding of the two main data structures that are used, and those are a pandas Series and a pandas DataFrame. What we're going to do first is create a pandas Series, and we're going to do that by first creating some fake data that we can feed into it. We'll call this my data, and we'll make a list with the values boat, car, bike, truck. So this is just a Python list, and we'll create a pandas Series from this list by calling pd.Series and feeding in our data, and we'll call this my series one. All right, now if we print this series, we'll see a few things. The first is that we have all of our values: boat, car, bike, truck. Then on the left side we have what's called the index, so every series has an index that relates to each value in your pandas Series. You also see it prints out the dtype here at the bottom, and that's the type of the objects in this series. When we feed in strings, words like boat, car, bike, truck, this is going to be an object type. Let's create another example that's not that type, so let's do my data equals 1, 55, 99, and 43. Now our data is in a different format where we have integers, and we'll create my series 2 using pd.Series with my data, and then we'll print my series 2.
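The two Series examples described above can be sketched roughly as follows. The variable names are a guess at how the spoken names ("my data", "my series one", "my series 2") were typed in the notebook, so treat them as assumptions.

```python
import pandas as pd

# A plain Python list of strings, as described in the video.
my_data = ["boat", "car", "bike", "truck"]
my_series1 = pd.Series(my_data)

# A second list, this time holding integers.
my_data = [1, 55, 99, 43]
my_series2 = pd.Series(my_data)

# Printing shows the values, the index on the left, and the dtype at the bottom.
# The string Series reports "object" under classic pandas defaults; newer
# pandas versions may report a dedicated string dtype instead.
print(my_series1)
print(my_series2)
```

Both Series get a default index of 0 to 3 because no index was supplied.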
So the only difference we see here is that our data still appears, 1, 55, 99, 43, and we have our index, but the dtype is now an integer; it's int64 because we fed in ints. We're going to show some examples later with different types, but this is just to show you that a series is essentially this list of data with an index, and it has a type.

All right, so moving on, we're going to talk about the next data structure, which is a DataFrame. A DataFrame is where you're going to spend most of your time working with pandas, because it is essentially a grouping of multiple series all linked by the same index. So let's give an example of that. We're going to make some data called my data frame data, and let's use the same examples from up here: a boat, which is 1; a car, 55; a bike, 99; and our truck will be 43. Creating a data frame is very similar to creating a series, but we're going to call pd.DataFrame and feed in my data frame data. Let's call this my data frame, or df for short. Now if we look at my data frame, we're going to notice a few things. There's still an index on the left side, but now we have two different columns for the values that we're feeding in from our initial data, and these are actually two different series within the data frame, linked by the index here on the left side. You can think about data frames as spreadsheets, where you have multiple different columns and they're all linked by sharing the same rows. The rows are the index on the left, and the columns are along the top. Now, the data frame that we have here has dummy column names, 0 and 1, because we didn't feed in any column names. So let's go ahead and change this a little bit by actually giving the data frame the column names when we create it. We'll pass columns here when we create our data frame, and we'll make this equal to thing and count. So now we're
adding the column headers thing and count when we create the data frame with this data, and you'll see thing and count are there as the column names. There's a lot we can do with this, it's very powerful, but this is a very basic intro to the two main structures, series and data frames. The main things to remember are that all of them have their own index, and each series has a dtype that it is given based on what values are in it. So just to show you that this data frame actually consists of two different series, we're going to take my data frame and look at the thing column. The way we do that is by using these brackets and passing the name of the column, and now we see that what we have here is actually a series. If I call type here, just so we can inspect and see what type this is, this is actually a series from within our data frame, and it has the dtype of object. Similarly, if we do the same with my data frame count, we have these count values with dtype int64.
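As a minimal sketch of the DataFrame example just described: the exact layout of the input data is not visible in the captions, so a list of [thing, count] pairs is assumed here, and the variable names are likewise assumptions.

```python
import pandas as pd

# Assumed layout: one [thing, count] pair per row.
my_df_data = [["boat", 1], ["car", 55], ["bike", 99], ["truck", 43]]

# Without column names, pandas falls back to the dummy labels 0 and 1.
df_default = pd.DataFrame(my_df_data)

# Passing columns= names the columns when the DataFrame is created.
df = pd.DataFrame(my_df_data, columns=["thing", "count"])

# Bracket selection pulls a single column out as a Series.
thing = df["thing"]
print(type(thing))
print(df["count"].dtype)
```

Bracket access is used for the count column because `df.count` would refer to the DataFrame's built-in `count` method rather than the column.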
We can also do some cool things with a data frame to inspect what's in it. This is a very small example data frame, but what if we want to see all the dtypes of the different columns in the data frame? By typing .dtypes after my data frame, we now see that we have two columns, and the types of these are object for thing and integer for count. Very simple and easy, but it's important to know these foundations before you dive into using a pandas DataFrame or Series.

All right, the next thing we're going to look at is reading in data. We're going to take a big jump here from the basics of what pandas DataFrames and Series are and actually start working with some fun data, because that's what you want to do here, right? The nice thing about pandas is that it has a lot of different ways you can read in data, where they've figured out all the complexities for you. The main file type used to read in data is CSV, but you can also read in Excel files and a handful of other formats. The way we read in using pandas is we'll do the same pd, then dot read, and if I type an underscore here, I can show you that there are a number of different ways we can read data in. We can read in from the clipboard, from a CSV, or from an Excel file. It does get a little tricky when you're reading in Excel files, because you might have multiple tabs and things not formatted correctly. There are other file types here that I won't go into, but the main one that we're going to be using today is read_csv. In Excel you can always save your file as a CSV and read it in that way later. So, read_csv, and what are we going to load in as data? The nice thing about Kaggle is that we have a bunch of data sets at our disposal that we can just read in, and over here on the right tab you can see I've loaded in a data set that I've created that has the MrBeast YouTube video statistics for all of his videos: the view counts and all that
sort of stuff. So we're going to go ahead and read in that data and do some examples with it. The way we're going to read it in is by referencing the location where the data file sits, and I know it's here in our input directory, MrBeast YouTube statistics, and this is the name of the file. It's a CSV, as you can see, and we're going to call this our data frame. Okay, so give it a second to read in, and there we go, we have a data frame read in.

Now, one of the common things we do is inspect the data. We want to see what's in this data frame right after we've loaded in what was in the CSV. One of the first things I like to do is run head on this. What head does is, by default, show you the first five rows in the data frame. So if I go ahead and run this cell with Shift+Enter, I can now see the first five rows. You can see the index here on the left, zero to four, and at the bottom it says how many rows and columns we have. This is only five rows because we ran head on it, so it's showing just the first five rows for us. Then we can scroll through and see what we have here. Okay, so we have some dates for when each video was published, how long it is in duration, the view count, the like count; a lot of good stuff here that we can start manipulating and understanding. So that's head, and there's also tail: if you want to look at the bottom of this data frame, you can do df.tail and see the last rows in the data frame and what they look like, similar to head.

Another thing you often want to do when you're loading in a data frame for the first time is find out what dtype each series in this data frame has. You can write df.dtypes and see that each of these columns has a different type. The dtype of the title column is object, or just a string, but the duration is an integer, so it's a number.
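The reading and inspection steps above can be tried without the Kaggle data set. In this sketch the CSV content, column names, and values are all made up for illustration, and an in-memory string stands in for the file path, which the captions do not spell out.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; every value here is invented.
csv_text = """id,title,duration_seconds,view_count
a1,Video one,120,1500000
a2,Video two,95,800000
a3,Video three,300,2500000
a4,Not a video,45,
a5,Video five,600,12000000
a6,Video six,210,950000
"""

# read_csv accepts a file path in the same position as this buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first five rows by default
print(df.tail(2))  # last two rows
print(df.dtypes)   # one dtype per column
```

Note that the made-up view_count column comes back as a float because one of its values is missing, which matches the float counts mentioned later in the video.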
And it's very nice to just see the columns here and see what types they are. Finally, something I like to do when I read in a data frame is use a built-in method called describe, so df.describe. By default this only summarizes the numeric columns in your data frame, but what it does is show you some basic statistics about each of those columns. So we can see in the duration seconds column that there is a count of 247 values, what the average value is, what the standard deviation is, the minimum value, all this good stuff that we would want to know about the data frame.

Okay, so the next thing we're going to look at in our example data frame is the columns and the index, or rows, within the data frame. There are a few ways to reference these. Let's just make a header here, columns and rows, and take our example data frame. If we remember, we can run head on this and see what some of our columns are. Let's say we want to look at this view count column. We can pull out just the pandas Series for view count by using a bracket and putting in the name view count for the column name. Now if we run this cell, we can see we have the series, which is this column view count. Now let's say we want to look at a specific row within the data frame. Ideally our index values would all be unique, and we can look at a specific row by using the loc method. So on our data frame we can do .loc, and why don't we look at the "first to rob a bank wins a hundred thousand dollars" video, which we see is index number four. We can do loc four, and now what do we see? We see all of the columns for this row shown to us, and we've selected just that row of data. So this is how we look at a column and a row of data within a pandas DataFrame. Now, one thing to keep in mind here is that when we apply loc to index number four in
this data frame, we already knew that we wanted to look at index number four and what it was called. But in pandas you can actually have the index be anything; it doesn't have to be just the numbers from zero up to however many rows you have. The way you can do that in a data frame is with the set_index method. We know that in this data frame our id column is the unique identifier of the video within this data set of YouTube video statistics. So what we're going to do is use the set_index method to set the index to be id, and then we'll overwrite our data frame with this version that has the id column as the index. So if we look now at our data frame head, our id is the index column, and what we're going to do is use loc again, but this time we're going to search for what we know is the id of this video, and it'll show us the same information. This is very useful for things like time series data, where your index is the timestamp of that row and you can easily jump to that timestamp and see what data you have for it.

Okay, the next thing we're going to talk about is subsetting our data. Let's say we have a data frame with too many columns, or we want to filter it down to just a subset of rows in that data frame; that's what we're going to look at here. This is a great example, because the data frame that we were looking at has a lot of columns that we aren't necessarily interested in. One way we can look at all the columns that we have is by doing df.columns, and it'll just list out all the columns that we have in this data frame. Let's say we want to keep only a subset of these. The way that we can do that is by putting a bracket after our data frame and then providing a list of the columns that we want to keep. So in here we want to keep the title, the description, publish
time (not this kind stats column), duration, view count, like count, and comment count. We're going to go ahead and overwrite our old data frame with just the data frame containing these columns, and I'll show you here after I run this: now we have a smaller data frame with only the columns that we're looking to keep. One thing that we can do is see the shape of the data frame by doing df.shape, so we can see that this now has seven columns and 247 rows.

So that was subsetting columns, and we'll call this part subsetting rows. If we want to filter down to just one row or a certain number of rows of this data frame, we can do this by applying a filter to our data frame. The common way to do this is something like this: let's say we have our view count column and we want to only look at rows where the view count is over a certain number. So let's only look at rows where our view count is over one thousand; or let's make it ten thousand; let's make it a million, these are MrBeast's videos. So now we have a new pandas Series with a boolean representation, true or false, of whether this condition holds for each row. We can use this in combination with loc to subset the rows to only the ones with over a million views. So now there are only 203 rows, as opposed to the 247 that this data frame had before we filtered. So this is subsetting using loc; I'll call this df subset one. But there's actually a newer way that we can do this, subsetting using query. Query is a functionality that was added into pandas not too long ago, and it makes things like this very easy, because we can just use the .query method on our data frame and write in exactly what we want to query. So now we want to query view count greater than 1 million, and we will see all the rows where the view count is greater than 1 million.
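The indexing and subsetting steps above can be sketched on a small stand-in table. The column names (including the extra etag column that gets dropped) and all values are hypothetical, chosen only to mirror the structure described in the video.

```python
import pandas as pd

# Hypothetical miniature of the video statistics table.
df = pd.DataFrame(
    {
        "id": ["a1", "a2", "a3", "a4"],
        "title": ["Video one", "Video two", "Video three", "Video four"],
        "etag": ["x", "y", "z", "w"],
        "view_count": [1_500_000, 800_000, 2_500_000, 12_000_000],
    }
)

# Use the unique id column as the index, then look a row up by its id.
df = df.set_index("id")
row = df.loc["a3"]

# Keep only the columns of interest by passing a list inside the brackets.
df = df[["title", "view_count"]]

# Boolean mask plus .loc keeps rows with more than one million views.
mask = df["view_count"] > 1_000_000
df_subset1 = df.loc[mask]

# The same filter written with .query.
df_subset2 = df.query("view_count > 1000000")

print(df.shape)
print(df_subset2)
```

Both approaches return the same rows here; `query` reads a little more cleanly, while the mask form also works when column names are not valid Python identifiers.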
And you can see this is a little bit cleaner to write than this whole statement up here. It doesn't work in every situation, but I find myself using this way of subsetting a lot more; it's a lot more common for quick queries on the data set.

Another thing we like to do when we're looking at a data frame is understand what values might be missing in it. If I look at this data frame that we've been working with, I'm noticing that there are some rows here that actually are not video information, and because of that they have null values for the view count, the like count, and the comment count. I want to drop those rows; I want to remove them from the data set. There's a nice way that we can do that, by looking for any time the value in one of these columns is null. As an example, we're going to take the view count column and run the isna method on it, and what this will do is tell us true if the value in this column is null, and those are the rows that we want to drop. If we did a df.loc on this, we would actually keep only the null values, which is the opposite of what we want to do. We can add this little tilde before the df view count isna to do the opposite of that. So if I run this in a different cell, this will give us true for the rows that we want to keep and false for the ones with nulls that we want to drop. So let's run this; we've removed the rows with null values, and we will replace our data frame.

All right, so we've been able to subset our data into just the information that we care about. We removed columns and only kept the ones that we were interested in, and we also dropped the rows that have null values, because we know those are not videos. The next thing I want to talk about is casting the dtype of a column. Basically all that means is you want to make a series, or one of the columns, into a
different type than it already is. If we remember, we can run dtypes on the data frame and it'll tell us what dtype each of the columns is. Our title, description, and publish time are objects, our duration seconds is an integer, and the view, like, and comment counts are floats. Now, we actually want this view count column to be an integer. As a rule, any numeric column that has a null value in it has to be a float, but because we've removed the rows that have null values in them, we can now cast it as an int. The way we do that is by typing .astype after the column that we've selected, and then we cast it as an int. Very easy. Now this is an int type column, but we haven't replaced what's already in our data frame, so we're going to select the view count column and overwrite what's already in that column with the astype int result. Let's go ahead and do the same thing to the like count. So now if we look at our data frame dtypes (we're not going to do it to the comment count, because that actually still contains null values), we've done it to our view count and our like count columns, and they are now integers.

There's another column here that I'm interested in, and you'll run into this a lot, and that's the publish time. This is currently being represented as an object, or a string, for each of these, but if we look closely at the publish time we can see that it is in a datetime format. Pandas comes with some pretty awesome built-in functionality where you can easily cast these types of columns into a datetime column, which is what we're interested in seeing. The way you do that is you call the pandas to_datetime method, and now we see that the dtype of this column is datetime64, in nanoseconds, with the UTC time zone. But the main thing to understand is that this is now not a string representation; it's actually a date
time representation of the published date of each video. Now, another thing that I want to show you, similar to casting these like counts as integers: let's say we have the like count as type string, so we have numeric values that have an object dtype and we want to make them into a numeric dtype. Instead of casting it as a specific dtype, we can let pandas do the hard work by using the to_numeric method. So we apply pd.to_numeric to this string version, and now it has automatically been made into an int64. The pd.to_numeric way of doing this will handle the question of whether it should be an int or a float, and it can be very handy because of that.

All right, so we've cast our dtypes, we have the data frame in the format that we like, and now what we want to do is see if we can create a new column. Creating a new column is pretty easy if you're just manipulating existing columns. For instance, let's look at the head of this data frame. We see that we have a view count and a like count, and it might be interesting to see what the ratio of like count to view count is for each video. The way that we would do that is we can apply normal math, like dividing one column by the other; just as we would with two numbers, we can apply it to two columns. So we'll take the like count column and divide it by the view count. The result is a pandas Series with the ratio of likes to views for each video, and then we can create a new column with this, just by assigning it to a new column name; like to view ratio, let's call it. Now if we look at our data frame head, what we have here is a new column called like to view ratio that is calculated from these two other columns.

Next we can talk about how to add a new row to the data set. We're just going to do this example, adding a new row, and the simplest way to do this is to concatenate two
existing data frames. Concatenation just stacks the two data frames on top of each other. We're going to take an example here where we use the tail method, if you remember it, to take the last row in our existing data frame, and we're going to call this df to append; this is going to be the data frame we want to append to our existing data frame. We're going to use the pandas concat function, like this: pd.concat. It takes a list of the data frames that we want to combine, so we'll pass the existing data frame and df to append, and we'll run this through pd.concat. And what do we have here? If we look at the bottom, now we have a new row added to the end of this data frame with the same values. This is just an example to show you how you can append onto an existing data frame. We're not going to save it back; let's just call it df concat as an example.

All right, as I said, this is a very basic introduction to pandas, but I think it would be interesting if we just make some plots out of this data. What I'm going to do is maybe a little bit more advanced, but it's going to show how you can take the information in your data frame and look at it visually using only pandas, which is great. So let's do some plot examples. I'm just going to show you a handful of plots that I use on a day-to-day basis. The main one is a histogram of the data. We know view count is one of our columns, which is the number of views per video in our data set, and we can just call the plot method on this series, and inside plot we can say kind="hist". This will make a histogram of our data: it shows all the different view counts that we have and how frequent they are. We can also set the number of bins we want to see in our histogram. Let's say we want to see a hundred; that might be too many, so let's say 50 bins. This shows us that, okay, for view counts there's a
lot near zero, but then there are a few way up here in the many millions of views. We can also add a title to this by saying title is distribution of view count. Let's clean this up a little bit, and let's say we don't like the sizing of this: we can add a figsize, figure size, to make it a little bit bigger.

Another plot that is often used would be a scatter plot, where you want to see the relationship between two variables in your data frame. We can do that pretty easily by calling the plot method, now not on the series but on the data frame itself. We can say kind is scatter, and then we'll provide it the x column, which will be our view count, and y is like count, and let's look at this. So now we have a scatter plot of the view count versus like count. You can see it's somewhat linear. Let's put the title here, view versus like count, and it's interesting. Okay, so this might cause us to ask the question: why do we see these outliers here? Why is there one video that seems to have many more likes than the others? Remember, this data set is all of MrBeast's YouTube video stats. So I'm going to go into my data frame, and remember we learned how to query the data frame; we're going to use the query method that we learned earlier. So df.query where like count is greater than, I don't know how many zeros I should add here, what is that, 10 million, more than 10 million likes. And we can see here that the like count of this one video is very high: 190, no wait, 19 million likes. And what's the name of this video? It's "make this video the most liked video on YouTube". Oh well, that might explain why we have such a large value for like count on that one data point. So, kind of interesting: we're already able to manipulate the data, see what's going on, and make some really interesting observations on this data set.

All right, the last thing we will want to do is save our output. So once we're done with our
analysis and we want to save our new data frame (maybe we've manipulated columns, added new columns, run our analysis), we can save it in a similar way to how we read it, but using the to_csv method instead of read_csv. We're actually applying this on the data frame, df.to_csv, and this will be our processed data CSV. Now, one thing to know when you save a CSV file is that it'll save this index column. Many times you don't want to save the index, and the way to not save the index would just be index equals False. But we're going to go ahead and include the index in what we save here, and that's it. Now if we look in our output directory, we would see a file there called dataprocess.csv that you could then use at a later point in time.

So I hope you enjoyed this tutorial. This is a very basic introduction to using pandas, specifically for data science. I'll make this notebook public so that anyone who's interested can take a look, fork it, make their own version of it, and start working with pandas. If you made it this far in the video, I just want to say thank you so much for watching. I hope you learned a lot about pandas for Python. It's an extremely important tool, so what you've learned here will be valuable in your data science journey. I encourage you to subscribe to watch future videos, like this video if it was helpful, and also follow me on Twitch, where I do some live coding streams from time to time. That can be a lot of fun. Hope to see you around. Bye!
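As a recap of the cleaning, casting, new-column, concatenation, and saving steps from the second half of the tutorial, here is a minimal sketch on made-up data. All column names, variable names, and values are assumptions for illustration, not the contents of the actual notebook.

```python
import pandas as pd

# Hypothetical data: the middle row is not a real video and has null counts.
df = pd.DataFrame(
    {
        "title": ["Video one", "Not a video", "Video three"],
        "publish_time": [
            "2021-01-01T12:00:00Z",
            "2021-02-01T12:00:00Z",
            "2021-03-01T12:00:00Z",
        ],
        "view_count": [1000.0, None, 4000.0],
        "like_count": [100.0, None, 200.0],
    }
)

# The tilde inverts the isna mask, keeping rows where view_count is present.
df = df.loc[~df["view_count"].isna()].copy()

# With the nulls gone, the float counts can be cast to integers.
df["view_count"] = df["view_count"].astype(int)
df["like_count"] = df["like_count"].astype(int)

# Parse the timestamp strings into a timezone-aware datetime column.
df["publish_time"] = pd.to_datetime(df["publish_time"])

# to_numeric infers a numeric dtype from string values.
as_numbers = pd.to_numeric(pd.Series(["10", "20", "30"]))

# New column computed from two existing columns.
df["like_to_view_ratio"] = df["like_count"] / df["view_count"]

# Stack the last row back onto the frame as a stand-in for adding a row.
df_to_append = df.tail(1)
df_concat = pd.concat([df, df_to_append])

# With no path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Here `index=False` leaves the index out of the output; the video keeps the index because it holds the video id.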
Info
Channel: Rob Mulla
Views: 82,600
Keywords: pandas python tutorial for data science, 2022, kaggle, jupyter notebook tutorial, data science for beginners, pandas tutorial, pandas tutorial 2022, pandas for beginners, kaggle for beginners, pandas code, how to use pandas library in python, data wrangling with python, data science tutorial for beginners, kaggle dataset tutorial, pandas getting startled, pandas for data science, getting started with pandas, pandas dataframe tutorial, pandas data, rob mulla
Id: _Eb0utIRdkw
Length: 38min 45sec (2325 seconds)
Published: Mon Dec 27 2021