Methods for Cleaning Data in Python

Captions
What's up everybody, welcome to this tutorial on how to clean data. These are a couple of methods that I personally use in my data science job to clean data. In a perfect world, data would always come to you clean and structured, but what you're going to find, one, on the job market and, two, in the real world, is that most of the data you encounter is very unstructured and messy. Learning how to clean data is probably the most overlooked skill among people learning to code, data science, machine learning, whatever. Clean data will, one, help you be much more organized and give you better-structured data sets, and, two, make it a lot easier when you're creating databases and models. The famous saying is: bad data in, bad results out. It's like eating bad food; if it goes into your body, it makes you feel sick, and the same goes for models, machine learning, and algorithms. If you throw bad data in, you're going to get bad results out.

What we're going to do is head over to Kaggle (I'm honestly not even sure how you pronounce it), where there's a Data Science Job Market in the US data set. It's a pretty simple data set, around seven thousand rows I believe, somewhere around there. We're going to look at this data and ask: what's wrong with it, how can we clean it, and what can we do to better understand it and get it ready for database entry or for whatever algorithm we're going to create?

So let's head over to our Jupyter notebook. The first thing we're going to do is import our data. We'll say import pandas as pd, call our data frame df, and write df = pd.read_csv(...) pointing at the CSV file, and that imports the data.

My first step in cleaning data, and the thing you should probably do too, is a little bit of exploratory data analysis. This is a little different from regular exploratory data analysis; here we're trying to locate the bad data. So the steps for cleaning data start with that exploratory analysis to find the bad data, and then: one, controlling for missing values, how to get rid of them and how to deal with them; two, dealing with outliers, which is really helpful for numerical data; three, repetitive data, in other words duplicate rows, and how to control for duplicates in your data sets; and four, inconsistent data. Inconsistent data is, for example, a name: sometimes names are capitalized, sometimes they include a middle name, sometimes they leave out the last name, sometimes they use initials.
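To make that import step concrete, here is a minimal sketch; the file name cleaning_all_data.csv is an assumption based on the path mentioned in the video, so point it at wherever your copy of the Kaggle data set lives:

    import pandas as pd

    # Load the data science job market data set into a DataFrame.
    # The file name is an assumption; adjust it to your local path.
    df = pd.read_csv("cleaning_all_data.csv")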
So what I'm going to do is walk you through this data set and show you how to start with exploratory data analysis and end up with a clean data set. Let's jump into it. The first thing we'll do is import seaborn as sns; seaborn is a visualization package in Python that we can use to create a simple chart when we need one.

We'll start by calling df.describe(). What df.describe() does is give us some base stats on all of our numerical columns. Reviews is our only numerical column here; as you can see the count is 5,326 non-null values, the mean number of reviews is about 3,179, and it also gives you the standard deviation, the min and max, and the quartile ranges: the 25th, 50th, and 75th percentiles. Another thing I like to do is df.head(), to get a quick look at the first five rows. Right now we can see the columns we have are position, company, description, reviews, and location, and there are already a couple of things I can spot just by doing this. Number one, in the position column, sometimes there's just the actual job title, and other times there's some sort of description mixed into the position. For company we can't really see too much right here, but I'd almost guarantee there are inconsistencies in there. In description there are random words like "department" appearing. In reviews we can already see that there are missing values. And in location, it looks like some rows have the zip code while others do not.

We'll type df.tail() to see if anything stands out in the last five rows. It looks like mostly the same issues: the positions seem a little different, some locations include zip codes, and reviews doesn't look too bad there. That's a quick overview of that step. I also like to look at the data types of the columns: df.dtypes shows every column and its data type. Object usually just refers to a string, and reviews is a float64.
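A quick sketch of that exploratory pass, assuming df is the DataFrame loaded above:

    import seaborn as sns

    # Summary statistics for the numerical columns (here, only 'reviews')
    print(df.describe())

    # Peek at the first and last five rows to spot obvious problems
    print(df.head())
    print(df.tail())

    # Data type of each column ('object' usually means string)
    print(df.dtypes)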
So that's our exploratory data analysis section right there. We just want to get familiar with the data; this is an important step, because getting familiar with your raw data before it's been processed makes it a lot easier to understand what it is now and what you actually want it to look like at the end.

First, let's talk about missing data and what we do with it. The first thing we can do is drop the row from the data frame. The second is to drop an entire column; removing an entire column is a little risky, and you need to be 100% sure that the column isn't of any value. The third thing we can do is replace the missing data with an arbitrary number or string.

The first step, though, is to find out where our missing data is; this is really part of the exploratory data analysis, and we'll use seaborn for it. We say cols = df.columns, which just creates a list of all our columns, and then sns.heatmap(df[cols].isnull()). What .isnull() does is show us everywhere there are actually null values, and when we run it we can see that we only have null values in reviews. Like I mentioned, we could drop those rows, or we could drop the reviews column entirely; we're not going to do that, because reviews is probably important to whatever goal we're trying to solve. Or we can fill reviews with a number of some kind.

So what we'll do is create a separate data frame: dropped_rows = df.dropna(). What this does is drop every single row that has a null value. We can check that those rows were actually dropped: if we say print(dropped_rows['reviews'].isnull().sum()), we now see that zero rows are null.

We can also tell dropna() to only look in a certain column. The plain df.dropna() looks in every single column and drops any row with a null value anywhere. On this data set it doesn't matter as much, but say we had a bunch of nulls in position, company, and description; it would end up dropping all of those rows, and you could end up with a very small data set if you don't tell it which columns to look in. So we say dropped_rows = df.dropna(subset=['reviews']), and if you wanted to you could add location, or whatever other column names you want, to that list. Run that and it gives the same result here, but this is how you tell it to only look in the reviews and location columns.

The next thing we can do is drop the entire reviews column if we wanted to. We say df_dropped = df.drop(['reviews'], axis=1); you pass in a list of column names and you tell it which axis to look on, rows or column names, and axis=1 tells it to look in the column names.
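Here is roughly what those missing-data options look like in code; this is a sketch under the same assumptions about df as above:

    import seaborn as sns

    # Visualize where the nulls are (only 'reviews' has any in this data set)
    cols = df.columns
    sns.heatmap(df[cols].isnull())

    # Option 1: drop every row that has a null in any column
    dropped_rows = df.dropna()
    print(dropped_rows['reviews'].isnull().sum())   # now 0

    # Option 1b: only drop rows where specific columns are null
    dropped_rows = df.dropna(subset=['reviews'])

    # Option 2: drop the whole 'reviews' column (axis=1 means column labels)
    df_dropped = df.drop(['reviews'], axis=1)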
Now if we look at df_dropped.columns, you can see we only have the four remaining columns; reviews is no longer there.

Another really common practice, and what we should actually do in this situation, is to fill the null values with an arbitrary number, string, or letter. Since these are review counts, and a NaN here means the company has zero reviews, what we're going to do is fill all the null values with the number zero. We do this by saying df.reviews = df.reviews.fillna(0): df.reviews is the column, and inside the parentheses of fillna() you put whatever you want to fill with, which for us is zero. We run that, and if you then run df.reviews.head(), you can see that the NaNs we had before have been turned into zeros.

The next thing we want to look at is taking care of outliers. Outliers usually only pertain to numerical values and numerical columns. What we want to do is use seaborn to plot a box plot for the reviews column: sns.boxplot(x=df.reviews). If we run that we can see, wow, okay, we have this outlier way out here, which is probably going to affect our data a lot; it could skew the results. There are a couple of different things you can do here. You can just pick a number, say remove everything above forty thousand or sixty thousand; we'll use sixty thousand. You could also use z-scores, if you're familiar with eliminating outliers based on standard deviations. What we're going to do is simply remove any rows with sixty thousand reviews or more, so we say df = df[df['reviews'] < 60000].
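A sketch of the fill-and-filter step; the 60,000 cutoff is just the number picked in the video, not a principled threshold:

    import seaborn as sns

    # Fill missing review counts with 0 (a NaN here means "no reviews")
    df['reviews'] = df['reviews'].fillna(0)
    print(df['reviews'].head())

    # Box plot to spot extreme values in the reviews column
    sns.boxplot(x=df['reviews'])

    # Keep only the rows with fewer than 60,000 reviews
    df = df[df['reviews'] < 60000]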
It looks like zero isn't going to be an outlier; there's a lot of data down there. I'm really playing it safe here when removing outliers, because you don't want to remove too much at once; this is just a quick and easy approach if you only want to keep rows below a certain number of reviews. Now let's look at the statistics of the reviews column again with df.describe(). As you can see, the count went up, because describe doesn't count NaNs, and since we filled them with zero we now have 6,959 rows. The max is now only 38,844, where before it was around 148,000, so we've definitely eliminated that extreme value and helped standardize the data set.

The next thing to be wary of is repetitive data, that is, duplicate rows in the data set. A lot of times in the real world, like I mentioned, you're not going to have clean data sets; you might have duplicate entries, which can affect results and make one value look more common than it really is. A lot of the time what you'll do is just drop the duplicates. This says: if any rows are exactly the same, drop the duplicate. It won't drop rows that merely share a value; for example, if two companies each had five reviews, it wouldn't drop them, because more likely than not they'll have different company names or different job descriptions. But if two rows were exactly the same across every column, then yes, it would drop one of them. The easiest way to do this is df = df.drop_duplicates(), and then we can run df.describe() again to see if anything changed. As you can see, we actually lost 10 rows, so we had 10 duplicates in the data set. The 50th percentile went up a little, the max and min stayed the same, and the standard deviation is up a little. So that's how you drop repetitive data, any sort of duplicate data.
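The duplicate step in code form, again as a sketch:

    # Drop rows that are identical across every column
    df = df.drop_duplicates()

    # Compare the summary stats to see how many rows were removed
    print(df.describe())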
The fourth and final step I usually take in cleaning data is taking care of inconsistent data. As you can see, we have a lot of inconsistent data in the location column; for example, some rows include zip codes that we really don't need, especially since we can't get zip codes for the other rows unless we went and entered them manually. This happens a lot: there can be capitalization errors, or any sort of error that was introduced either through manual entry or because the way we captured the data, the syntax of the code, differed slightly from one script to another, so things got tagged differently. This usually happens with strings. So I'm going to show you how to take care of the inconsistent data in the location column, which here means eliminating the zip code and making the values more uniform.

I usually take four steps when cleaning data like this: first, lowercase all the data; second, if we need to split the strings, split them up and keep only the parts that are relevant; third, replace or remove any unnecessary characters such as colons, apostrophes, and commas; and fourth, strip all the whitespace off the data so everything has the same spacing, which makes it a lot easier to aggregate and keeps it consistent across all the records.

So what we'll do is say df.location = df.location.str.lower(). We use .str to say this is a string operation on the column, and then .lower() with parentheses. It gives us a warning that we don't really need to worry about, and if we run df.location.head() again we can see that everything is lowercase now.

The next thing I would normally do, and this is a personal preference, is split the string. I'd probably write some code that splits on the last space and throws away any pieces that contain numbers; unfortunately that's a little more complicated for this data. So what we're going to do instead is replace any numbers in the strings with nothing, essentially. The easiest way is df.location = df.location.str.replace(...), using the string operation again, and for the pattern we're going to use something called regex, regular expressions. It's used a lot in natural language processing, it's built into Python, and it's pretty nifty. The pattern is \d+: \d means any digit, zero through nine, and + means one or more, so it will match every run of digits and use that as what gets replaced. Then we put a comma and tell it what to replace the matches with; if we pass in an empty string, it will replace all the numbers with nothing. It's a little confusing, but regex is definitely something that's really cool, and str.replace is something I use at work literally every single day when cleaning up unnecessary parts of data. Run that, look at df.location again, and boom, no more numbers.

The last thing is to strip the data: df.location = df.location.str.strip(). This gets rid of all the leading and trailing whitespace. We run that, and now we have uniform data that looks a lot cleaner than before.

So that's how I would go about cleaning data. It's definitely something worth learning for anyone who wants to become a data scientist, data analyst, or data engineer. It's something I overlooked when I started coding, and I really wish I had picked up this kind of stuff before I started my job and before I started diving deep into coding. It's a great thing to learn as a beginner, because these techniques and skills will help you in your career and all throughout your coding path. There's more we could do with cleaning, but I won't go into it here, just to keep this from being a super long video.
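Putting that last section together, a sketch of the location clean-up; note that recent versions of pandas want an explicit regex=True on str.replace() when you pass a regular expression:

    # Lowercase everything so capitalization is consistent
    df['location'] = df['location'].str.lower()

    # Remove runs of digits (the zip codes); \d+ matches one or more digits
    df['location'] = df['location'].str.replace(r'\d+', '', regex=True)

    # Strip leading and trailing whitespace so values aggregate cleanly
    df['location'] = df['location'].str.strip()

    print(df['location'].head())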
Info
Channel: McKay Johns
Views: 639
Rating: 4.9069767 out of 5
Keywords: coding, python, data science, programming, code, data analytics, sports analytics, tutorials, data science beginners, data science clean, clean data, how to clean data in python, python clean data, pandas
Id: _6a1AZ8R7cI
Length: 21min 42sec (1302 seconds)
Published: Thu Jul 22 2021