How to Clean Data Like a Pro: Pandas for Data Scientists and Analysts

Captions
Using data to create insights is incredibly beneficial, from forecasting sales for a business to predicting the weather. The key challenge that data analysts, data scientists, and data engineers face, however, is cleaning the data those insights are built from. Hey folks, I'm Trent. I've been a professional data engineer for five years now, and in this video I want to take you through how I clean data using pandas. I'll walk through several techniques I've used, and at the end of the video I want to discuss a key principle of data cleaning that helped me deliver a lot of value to my clients and saved me hours of work. Alright, let's get started.

I'll be focusing on common problems I've seen over my career: data duplication, missing data, poor data formatting, and poor data standardization. I've jumped over to my editor, and the first thing I've done is import the packages I'll be using: pandas, matplotlib, and NumPy. I'm also setting a few display options to make things a little easier to read in this Jupyter notebook. Let's run that cell and import the needed modules.

To kick things off, let's look at the data itself. It's stored in a folder called data, in a file called "data to be clean.csv"; I'll share the data on my GitHub profile if you want to follow along. This is a set of transactional data representing clients sending money to beneficiaries. We've captured the client name, the beneficiary name, the amount of money being sent, the transaction datetime, and a client tier, plus a transaction ID to identify each transaction.

Just skimming through these transactions, we can spot a few problems. First, there are some strange characters hanging around in the client column: Frodo Baggins has angle brackets around the name, there are stray dots, Harry Potter has a star between the first and last name, and Walter White has dashes around the name. Some names are fully capitalized while others are not. There are missing values in the client column, and a couple more in the amount and client tier columns. On top of that, a few transactions look like duplicates: the transaction from Tony Stark to Bruce Banner is identical to another one in every field except the transaction ID, same amount, same timestamp, and the same thing happens between Sherlock Holmes and John Watson. So we can assume there is duplication in the data as well.

To get an overview of how much data is missing, I'm going to use the df.info() method to see how many rows we have and how many non-null values each column holds.
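Here is a minimal sketch of that setup and overview step. The file path, display option values, and snake_cased column names used throughout these sketches are assumptions for illustration; they may differ from the actual notebook and CSV.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Display options so wide frames are easier to read in the notebook (values assumed)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

# Load the transactional data (path and filename assumed)
df = pd.read_csv("data/data_to_be_cleaned.csv")

# High-level overview: total row count and non-null count per column
df.info()
```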
We have 40 entries in total, but client only has 38 non-null values, client tier has 37, and amount USD has 38 as well. That gives us a high-level view of how many values are missing.

When faced with the task of data cleaning, we generally either drop the erroneous or unclean data, or impute and infer what the data should actually be. In either case you are going to introduce a certain amount of error, so it's important to keep track of what you're doing and how much error you're potentially adding to the data set by doing that kind of inference or dropping.

With that out of the way, let's start actually cleaning this data set. The first thing I want to deal with is dropping data. The records I want to get rid of are the ones with a missing client. For example, in transaction 15 we don't know who sent money to Luke Skywalker. It's really hard to infer who that sender was; if the data set were bigger we could look at all of Luke Skywalker's incoming transactions and guess the most common sender, but with only 40 records, assuming the sender of a transaction would introduce a lot of error, so I think it's more logical to drop these records. There's one other record with a null client: the one sending money to Tyrion Lannister.

To drop them, I first create a new data frame called cleaned_df as a copy of the original, so that I keep the original data intact and have a separate frame for the cleaned data set. On cleaned_df I call dropna and pass a subset: the columns on which I want to drop null values, in this case just the client column. I also set inplace=True, which means the drop operates on this data frame directly in a single line rather than returning a new frame that I'd have to reassign.
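A sketch of that step, continuing from the frame loaded above (column name assumed):

```python
# Keep the original data untouched and do all cleaning on a copy
cleaned_df = df.copy()

# Drop rows where we don't know who the client is;
# inplace=True mutates cleaned_df instead of returning a new frame
cleaned_df.dropna(subset=["client"], inplace=True)
```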
Printing the result, we can see the null values have been dropped: transaction 15 is no longer there, and further down transaction 27 is gone too, so we've successfully removed the rows with a missing client.

The next problem is the potential duplicate records. Transactions 3 and 4, for example, look like duplicates, so I'm going to assume they are and keep only one of them. To do that I call drop_duplicates on cleaned_df, again with inplace=True, and take a look at the result. That didn't work: the duplicate records are still there. The reason is that the transaction ID differs between the two rows, so pandas doesn't treat them as duplicates. What I need to do is pass a subset again, this time listing every column except the transaction ID: client, beneficiary, amount, transaction datetime, and client tier. Running it with that subset gives the expected result: the duplicate record is gone, and the pair between Sherlock Holmes and John Watson has been reduced to a single record as well, which is exactly what we wanted.

One interesting thing about drop_duplicates that I'm not demonstrating here is the keep keyword. It controls what happens to the duplicates: keep="first" (the default) keeps the first instance of each duplicate, keep="last" keeps the last instance, and keep=False gets rid of all duplicated records outright. I'm happy with the default of "first" in this case, but it has been pretty useful in the past.
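A sketch of the dedupe, with the same assumed column names:

```python
# Two rows count as the same transaction if every field matches except the ID,
# so compare on everything but transaction_id
cleaned_df.drop_duplicates(
    subset=["client", "beneficiary", "amount_usd",
            "transaction_datetime", "client_tier"],
    keep="first",   # default; "last" or False are the other options
    inplace=True,
)
```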
Now that the duplicates are gone, the next problem is the poor data formatting in the client and beneficiary columns. First we want to get rid of all the odd characters appearing in the client column, and to do that I need to establish which odd characters are actually present. This data set is only 40 records long, so we could just eyeball it, but in the real world you'll be dealing with much bigger data sets; with 100,000 records, going through each one by hand isn't practical. So let's take a more general approach using regex. If you're unfamiliar with it, regex is a pattern-matching syntax for string data; the theory can get quite deep, but I'm only going to use some basic patterns to match our strings here.

Taking the client column of cleaned_df, I use the .str accessor to get the string methods and call replace, replacing all of the characters I consider valid with an empty string, effectively subtracting them out so I can see what's left. The pattern covers lowercase a through z, uppercase A through Z, the digits 0 through 9, and the space character, and I set regex=True so the pattern is treated as a regular expression. The result is a column with all the names removed, leaving just the strange characters. To get a unique list of them, I add .unique() on the end. The first entry is the empty string, which I'm not worried about; the ones of interest are the angle brackets, the dot, the star, and the dashes. That gives me a concrete list of characters to search for and remove.

Using that information, I take the client column again and call the strip method from the .str accessor. strip only cleans characters at the left-hand and right-hand ends of a string, so it will handle characters like the brackets and dashes. Another issue I've seen a lot is extra spaces on either side of text data, and strip really helps there too. It won't touch the middle of the text, though, so we can't expect it to fix Harry Potter, where the star sits between the first and last name. strip takes the set of characters to remove as a single string, which is a little odd but workable: I pass the angle brackets, the full stop, the star, the dash, and the space character. Running that, the resulting column looks pretty good: Frodo Baggins no longer has the angle brackets, Jon Snow no longer has the dot, and Walter White no longer has the dashes. To finish off, I assign this cleaned result back as the client column of cleaned_df.
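A sketch of the two steps just described, discovery with regex and then the strip (column name assumed):

```python
# Remove every character we consider valid (letters, digits, spaces)
# and look at what is left over in the client names
leftover = cleaned_df["client"].str.replace(r"[a-zA-Z0-9 ]", "", regex=True)
print(leftover.unique())   # e.g. '', '<>', '.', '*', '--'

# Strip those characters (plus stray spaces) from the ends of each name;
# strip() never touches the middle of the string
cleaned_df["client"] = cleaned_df["client"].str.strip("<>.*- ")
```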
Printing that out, the client column now holds the cleaned values. The one issue that remains is Harry*Potter. To deal with it I use very similar syntax, but this time with the string replace method, replacing the star not with an empty string but with a space: when the star appears in the middle of the name, I don't want "HarryPotter" as one word, I want "Harry Potter". Running that gives the result we wanted, so the data formatting issues are dealt with.

The next thing is data standardization. We have Darth Vader in all caps while no other client name follows that standard, and similarly Gus Fring is in all caps in the beneficiary column while the other names are not. We want to standardize these columns, and I'm going to arbitrarily pick the standard of making all the names uppercase. To achieve that I take the client column of cleaned_df, use the .str accessor again, and call upper; the result shows every name consistently in uppercase, so I assign that back, and then do the same for the beneficiary column. Looking at cleaned_df again, client and beneficiary are now all uppercase. You could do something similar with the lower method if you prefer lowercase; that's just personal preference. The main point is that all the names now follow the same standard, which makes analysis downstream easier.
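As a sketch, continuing from the frame above:

```python
# A star in the middle of a name becomes a space ("Harry*Potter" -> "Harry Potter")
cleaned_df["client"] = cleaned_df["client"].str.replace("*", " ", regex=False)

# Standardize casing so every name follows the same convention
cleaned_df["client"] = cleaned_df["client"].str.upper()
cleaned_df["beneficiary"] = cleaned_df["beneficiary"].str.upper()
```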
With the client and beneficiary data cleaned up and the outputs looking good so far, let's move on to the numeric data, the amount USD column. It has a couple of missing values, and that looks like the only issue. In this case I'm going to impute, or infer, what the missing values should be. A common strategy for numeric data is to replace missing values with a statistic like the mean or median. These are simplistic methods; we could get fancy and use a machine learning algorithm, but I'm keeping things simple here, and generally speaking the mean or median is the way to go.

To make the difference between the mean and median clearer, I'm going to visualize the distribution of the amount USD data using matplotlib. Running through the plotting code quickly: I create a variable ax, for axis, from the amount column of cleaned_df with .plot, setting kind to "hist" for a histogram that splits the data into 10 bins, the color to sky blue, and the edge color to black so it's easier to see. I set the y and x labels so the graph is a little easier to read, add a legend, and show the plot.

Running that, we can see the distribution of the amounts. Something to note is that the distribution is quite skewed: most values sit in a range of roughly 50 to 120, with a long tail of larger amounts above that.

To make an informed decision about which statistic to use for the missing values, I want to plot the mean and median on this graph. First I calculate them: mean_value is the mean of the amount column, simply the sum of the amounts divided by the number of records, and median_value is the median, the 50th percentile or middle-most value across our records. With those two variables defined, I add a couple of axvline calls, one vertical line at the mean and one at the median, dashed, with a line width of 2 so they're clear to see, each labelled with its actual value so it shows up in the legend. Re-running the plot, the mean comes out to about 146.4 and the median to about 122.75.
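A sketch of the distribution plot with the two statistics marked; the axis labels and the median line's color are assumptions, since only the mean line's styling is described above.

```python
mean_value = cleaned_df["amount_usd"].mean()
median_value = cleaned_df["amount_usd"].median()

# Histogram of the transaction amounts
ax = cleaned_df["amount_usd"].plot(kind="hist", bins=10,
                                   color="skyblue", edgecolor="black")
ax.set_xlabel("Amount (USD)")
ax.set_ylabel("Frequency")

# Vertical lines marking the mean and median of the distribution
ax.axvline(mean_value, color="darkblue", linestyle="--", linewidth=2,
           label=f"Mean: {mean_value:.2f}")
ax.axvline(median_value, color="green", linestyle="--", linewidth=2,
           label=f"Median: {median_value:.2f}")
ax.legend()
plt.show()
```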
Plotted on the graph, the mean sits more toward the center of the axis, but it's further away from the majority of the data, while the median sits in amongst more of the data points, closer to the bulk of the values. The thing is, when we impute data we want to minimize the amount of error we introduce, so we need some way of measuring that error.

To illustrate the point, I'm going to look at the difference between the values we do have and each of these statistics, to get an idea of how much error we'd introduce by replacing the missing values with them. Using np.abs, I take the absolute difference between the amount column and the mean value, which gives a list of per-row differences between the actual amounts and the mean. I then need a way of aggregating those, so I wrap the whole thing in np.mean: the average of the differences between the values and their mean, which is a little confusing to say, but it's basically the average error, or average delta, between the actual values and the mean. That comes out to about 59, meaning that on average the mean deviates from the actual values by about $59. Doing the same thing with the median value instead, the average deviation is about $55, so the median introduces less error than the mean.

This is typically true of skewed data like ours. For a perfectly Gaussian distribution the mean and median are the same value, but for skewed data the median is usually the way to go, and it's nice to have a technique that can approximate the amount of error you're introducing by doing this kind of inference.

So I'm going to use the median to replace the null values. I call fillna on cleaned_df and pass a dictionary whose key is the column I'm referring to, amount USD, and whose value is what I want to fill with, the median. Again I set inplace=True and print out the data frame to check the result: the median value, 122.75, has been put in place of the nulls, so the numeric data is dealt with as well.
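A sketch of the error comparison and the median fill described above:

```python
# Average absolute deviation between the existing values and each statistic:
# a rough measure of how much error each imputation choice would introduce
mean_error = np.mean(np.abs(cleaned_df["amount_usd"] - mean_value))
median_error = np.mean(np.abs(cleaned_df["amount_usd"] - median_value))
print(mean_error, median_error)   # roughly 59 vs 55 on this data

# The median introduces less error for this skewed column, so fill with it
cleaned_df.fillna({"amount_usd": median_value}, inplace=True)
```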
The last thing we need to do to clean up this data set is infer the values for the client tier, which also has a couple of missing values, using a somewhat similar technique. The catch is that client tier is categorical data: we can't take its average, and there's no median, but we can find a statistic called the mode. The modal value is simply the most frequently occurring value among the categories. If a tier like gold occurs most often, that's the modal value, and that's what we'd use to replace the missing values.

To illustrate this, let's visualize the distribution of the categorical data. I create a variable called tier_counts, which is the client tier column of cleaned_df with value_counts applied. Printing tier_counts first, it's just the different tiers, gold, silver, platinum, and bronze, and their respective counts: one column for the tier and one for the frequency at which it occurs.

Running through the plotting code: I take tier_counts.plot, this time as a bar chart, in red with a black edge color, set labels on the x and y axes as well as a title, rotate the labels on the x-axis so they're a bit easier to read, and show the plot. Running that, the most frequently occurring tier is indeed gold, followed by silver, platinum, and bronze.

The thing is, gold isn't dominant by much. Looking at the counts again, gold occurs 13 times, but the tiers that are not gold, silver, platinum, and bronze, account for 9 + 6 + 5 = 20 records out of 33. That means if I replace all the missing values with gold, the most frequent tier, I'd expect to be wrong roughly 20 times out of 33, about 60% of the time, which is quite a lot of error. I'm still going to use the modal value, gold, to replace my nulls, but it's important to note that we're introducing a significant amount of error on the records we're inferring.

So the first step is to find the mode: client_tier_mode is the mode of the client tier column. Printing it out, you'll notice the mode is actually a pandas Series of values rather than the single string we want, so I index the Series and select the zeroth item to get just the string "gold". Funnily enough, it looks like there's a stray space on that value, which means I actually need to clean this column too, so I'm going to strip the client tier values first.
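A sketch of the tier distribution plot, the bonus strip, and the mode lookup just described; the exact labels, title, and rotation angle are assumptions.

```python
# Frequency of each client tier
tier_counts = cleaned_df["client_tier"].value_counts()

# Bar chart of the tier distribution
ax = tier_counts.plot(kind="bar", color="red", edgecolor="black")
ax.set_xlabel("Client tier")
ax.set_ylabel("Count")
ax.set_title("Client tier distribution")
plt.xticks(rotation=45)
plt.show()

# Bonus cleaning step: the tier strings carry stray spaces, so strip them first
cleaned_df["client_tier"] = cleaned_df["client_tier"].str.strip()

# mode() returns a Series (there can be ties), so take the first element
client_tier_mode = cleaned_df["client_tier"].mode()[0]
```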
Running that and then re-running the mode, the extra annoying spaces are gone; a bonus cleaning step we picked up along the way. Now, with the modal value in hand, I replace the missing values in the client tier column using a very similar approach to the amount column: cleaned_df.fillna with a dictionary whose key is client tier, the column we're interested in, and whose value is the client tier mode, again with inplace=True. Printing the data frame, the null values have been replaced with the modal value.

So we've gone through several methods to clean a data set, including dealing with bad data formatting, bad data standardization, and inferring missing values. Now I want to discuss one of the key principles of data cleaning that has really helped me in my career: remember the context of the data-generating process itself. That context should always inform how you clean data.

Let me make this a little more practical with an example. I've created some dummy data very similar to a data set I was working on for a client of mine. It has a date and a currency conversion rate: I was converting forex transactions to a base currency for the client, so I needed conversion rates, much like the euro to US dollar rate here, for every transaction I was seeing in order to convert it to a base currency like US dollars. What I found was that the data set would periodically be missing conversion rates. My first instinct was to try to predict the missing rates using some advanced machine learning model; of course, predicting the price of currencies is a business in and of itself, and if I had pursued that route maybe I'd be a lot wealthier. But it turned out the missing values occurred on weekends: the markets close, so no rates come through on Saturday and Sunday. The accepted practice for these FX rates is simply to feed the Friday rate forward over the weekend, which is far simpler than trying to predict the EUR/USD price for a particular day. So the simple method I needed was to take the conversion rate column of my forex data frame and apply ffill, forward fill. Printing the result, the Friday rate is carried over for the Saturday and the Sunday.
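A sketch of that forward fill on a small hypothetical frame; the dates, rates, and column names below are made up for illustration.

```python
# Hypothetical FX data: weekend rates are missing because the markets are closed
forex_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-07-04", "2024-07-05", "2024-07-06", "2024-07-07"]),
    "eur_usd_rate": [1.081, 1.083, None, None],   # Thu, Fri, Sat, Sun
})

# Accepted practice here: carry Friday's rate forward over the weekend
forex_df["eur_usd_rate"] = forex_df["eur_usd_rate"].ffill()
print(forex_df)
```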
This saved me a ton of hours of work, so always remember: use context to inform how you clean your data. It will save you a lot of time and energy. To make this practical, if you're working for a client, talk to the subject-matter experts on the data if they're available to you, and do some research on how the data is generated. Something I've always found helpful is to draw a process diagram of the process that creates the data, for example how a transaction flows for a data set like ours: where does the client data start, which system does it come from, and which system does it end up in? If we're integrating multiple systems, this is especially useful because it lets us pinpoint which system is causing the specific issues we're seeing.

So we've discussed the importance of data cleanliness, gone through some powerful techniques to improve data quality and clean up our data for analysis, and touched on the key principle of data cleaning that I really like: understand the context of the data before you go ahead and clean it. What are some data cleaning techniques that you've used? I'd love to hear about them in the comments below. I'm working on a video covering some more advanced data cleaning techniques and I'll link to it once it's ready, but in the meantime, now that you know how to clean data in pandas, how do you make your pandas application super efficient? You can check out my video on improving pandas efficiency up here. If you found the techniques in this video helpful, please consider giving it a like and subscribing to the channel for more on data analysis, data engineering, and data science. Thanks so much for watching, and I'll see you soon.
Info
Channel: TrentDoesMath
Views: 282
Keywords: data analyst, data scientist, data cleaning, jupyter notebook, data analysis, data science, python, pandas, python tutorial
Id: ycQbwBqB8wY
Length: 36min 12sec (2172 seconds)
Published: Sun Jul 07 2024