Data Science Project from Scratch - Part 3 (Data Cleaning)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
  • Original Title: Data Science Project from Scratch - Part 3 (Data Cleaning)
  • Author: Ken Jee
  • Description: This is part 3 of the Data Science Project from Scratch Series. In this video I go through how to clean up your data to make it usable for exploratory data analysis ...
  • Youtube URL: https://www.youtube.com/watch?v=fhi4dOhmW-g
👍 1 · u/aivideos · May 06 2020 · replies
Captions
Hello everyone, Ken here, back with part three of the data science project from scratch series. Today we're going to clean the data we collected in part two. If you recall, we went on Glassdoor and found a pre-existing scraper to extract data from the website. I ran the code overnight and got roughly a thousand different job postings, and now I'm going to go through and clean them up so we can use them in our exploratory data analysis as well as in our model building. Data cleaning is extremely important: a lot of our models will not run if the data is not of the correct type or in the correct format, so this is a really important step that often gets overlooked. If you do it well, you can avoid a lot of challenges down the road.

Before we actually start coding, let's do a tiny bit of housekeeping: make sure our GitHub repo is up to date and start writing a README for it, so we're properly crediting our sources and giving anyone who comes in and looks at the code a little bit of context. Over on my screen you can see the code we wrote last time. This is the web scraper; I gave the author credit at the top because I did not fork the repo. And here is the little data collection script that we ran, which imports the scraper as a module. All we did there was pull a thousand postings, which took around an hour to an hour and a half, and save them into the glassdoor_jobs.csv file.

Now I want to upload this code to GitHub, so we open Git Bash and change to the correct folder: cd into Documents and then into the project folder. As you can see we're on the master branch. Usually you don't want to push directly to master, but since we're so early in the project I think it's okay; for the actual data cleaning phase we'll make another branch so we don't have to worry about messing up our source code. So we run git add . to stage everything, including the data, then git commit -m with the message that goes with the commit, something like "uploading data and data collection script", and finally git push. I apologize for the sirens, it's part of a day in the life of living in Chicago. Over in our GitHub repo, it might take a second to refresh, but there we go: we have glassdoor_jobs.csv and everything seems up to date.

I like to write the README directly in GitHub because it has some formatting support that's pretty good. It's written in Markdown, so it's easy to get a feel for it if you just search for "GitHub README markdown". Usually you don't have to get too crazy unless you want to embed pictures or do something like that. What we want to do here is add a "Resources" section: the two hashtags turn it into a header, and underneath it we put the link to the GitHub repo for the scraper we used, right at the top, so people
know that we're not plagiarizing anything. Then we can preview what the README looks like, and it displays nicely, so we commit the changes; you can see they were made 12 seconds ago. Now we need to make sure we have the most up-to-date version on our computer, so back in Git Bash we run git pull to bring those updates in. Next we want a new branch just for data cleaning, so we run git checkout -b data_cleaning and do everything in there. The reason we make different branches is so that we don't have conflicting code with master: we can mess up and do whatever we want in this branch, and the master branch will be perfectly fine, so we can just revert back to it if we want to start over. And if multiple people are working on this code base, branches let everyone work without worrying about conflicts, which we can reconcile when we merge. So the plan is to do the cleaning of the code here, push the branch, and merge it at the end. I'll also put the scraper tutorial link in the README, which I think makes good sense.

Okay, let's actually get into the code. We'll make a new file just for the data cleaning and read in the data we collected. In the file explorer we can see our glassdoor_jobs.csv, so we import pandas as pd and set our DataFrame equal to pd.read_csv('glassdoor_jobs.csv'). Let's read that in and then look at the data a little, because you usually want to explore and get a feeling for what your data is like. The job titles look pretty consistent. We're definitely going to have to clean up the salary estimate: remove the "(Glassdoor est.)" text, remove the K, remove the dollar sign, remove the dashes, and probably average the two numbers, so that's going to be a fairly heavy lift. Next we'll have to parse the rating number out of the company name. I mentioned that I want to split the state out of the location text, for example the state out of something like "Albuquerque, New Mexico". We could also change the Founded variable into how long the company has been around, which is more meaningful to us. Aside from that, maybe we also want to pull some things out of the job description: if a job requires Python, for instance, it would be cool to see how the salary estimates associated with that differ.

Let's write a couple of things down before I forget. I recommend making a list like this, because when you know what you're going to do next it's a lot easier to actually get started: salary parsing, company name text only, state field, age of company, and parsing of the job description.
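As a rough sketch of that read-in and first-look step in Python, assuming the file is named glassdoor_jobs.csv as in part two, with whatever columns the scraper produced:

    import pandas as pd

    # Load the postings scraped from Glassdoor in part two.
    df = pd.read_csv('glassdoor_jobs.csv')

    # Get a first feel for the data before deciding what to clean.
    print(df.shape)
    print(df.columns)
    print(df.head())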
Now that we have a feeling for the data, I also see that there are some rows with a negative one for the salary estimate. I don't think those will be particularly useful to us, so let's remove them and see how many rows we have left. I'll keep the variable explorer open for reference since we don't have too many variables yet. To remove them, we set df equal to df where the salary estimate is not equal to '-1'. Because the salary estimate is an object (text) field right now, the minus one has to go in quotes; if it were numeric, we would not have to quote it. Right now we have 956 rows, and after the filter we have 742, so if we open the DataFrame back up we shouldn't see any more negative ones in that column. The first thing on our checklist is done.

Now let's go through and remove the text field where it says "(Glassdoor est.)". There are a couple of different ways to go about this. You could use regex, regular expressions, to do some pattern matching and remove or keep certain parts of the text, but I think there's an easier way, and that's lambda functions, where we just split on certain characters or replace certain things with blanks. A lambda function is just a normal function, like one you would write with def, compressed into a single line of code. For all of these values there's a dollar sign and a K for thousands, and I'm going to replace those with blanks; I'll also split on the left parenthesis and just keep the left side, which should get rid of the Glassdoor text in all of these. So we set salary equal to df['Salary Estimate'].apply() with a lambda that takes x and does x.split('('): splitting gives you a list of, in this case, two pieces, because there's only one parenthesis, and we take the first element, which is the salary information. I messed up some syntax on the first try, but once it runs we can look at salary and see that we got rid of all of the Glassdoor estimates. It looks like there are also "Per Hour" and "Employer Provided Salary" values; we'll take those out a little later, but everything else looks pretty uniform.

Next let's remove the K's and the dollar signs into a variable called minus_Kd. We take salary.apply() with a similar lambda: x.replace() swaps the K for a blank, and then we chain another replace to swap the dollar sign for a blank as well. Now minus_Kd is all numbers except for the employer-provided salary and per-hour entries. Since we see "Employer Provided Salary" and "Per Hour" in there, it probably makes sense, if we ever want to calculate an annual rate from the hourly rate, to note when those cases occur. So let's make columns or fields in our DataFrame for those.
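A minimal sketch of those salary text-cleanup steps, assuming the scraper named the column 'Salary Estimate'; the variable names salary and minus_Kd mirror the ones described above:

    # Drop rows where no salary was scraped (stored as the string '-1').
    df = df[df['Salary Estimate'] != '-1']

    # Keep only the part before '(' to strip the '(Glassdoor est.)' text.
    salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])

    # Remove the 'K' thousands marker and the dollar signs.
    minus_Kd = salary.apply(lambda x: x.replace('K', '').replace('$', ''))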
Maybe the positions where the employer provides the salary actually pay more, or the hourly roles pay less than we would expect. So our next step is to make a column for per-hour and a column for employer-provided salary. To do that we go back to our bigger DataFrame and create df['hourly'] (after a whole bunch of misspelling on my part) from df['Salary Estimate'] using a ternary operator, which is just an if-else statement in a single line of code: we get a one if the posting is hourly. Let's find an hourly posting real quick to see exactly how the text looks. "Per Hour", there we go, so the condition is 1 if 'per hour' is in x.lower() else 0, which just checks whether that kind of text is in the line. It looks like I messed up the syntax as usual; I needed the else 0. Now if we scroll to the end we have an hourly column, there are some ones in it, and if we check a few of those rows we can see "Per Hour" in the estimate, so that worked out.

Let's do the same thing for employer-provided. I already forgot exactly what that text looks like, so let's open the data back up; I think some of the hourly ones were also employer-provided, so let's narrow it down that way. There it is: "Employer Provided Salary:". We build the column the same way, and checking the top few rows with ones, they're all employer-provided, so that's perfect.

Now let's remove those phrases from the minus_Kd values; we can actually do this even before our salary calculation. We make a new variable from minus_Kd (I accidentally capitalized it at first) by applying a lambda that does x.lower(), since lower just makes everything lowercase so we don't have to worry about capitalization issues, and then replaces 'per hour' with a blank. Looking at the result, we shouldn't have any "per hour" anymore; we still have the employer-provided salary text, so let's chain another replace for that too. Now there are no more employer-provided or per-hour strings in the numbers, and we have separate columns flagging them.

Next let's make a min salary and a max salary. If we look at the values, the two numbers are separated by a dash, so for the min we apply a lambda that does x.split('-') and takes the first element, and for the max we split on the dash and take the second element. Looking at the DataFrame, we have the min; we should also make it numeric, and I think they're all integers, so casting to int should be fine (in Spyder you can see it's an int, otherwise you can check the dtype). Then we copy that line for max_salary and take the second element instead of the first. So now we have the min and the max.
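Here is a sketch of the flag columns and the min/max split just described; min_hr is an assumed name for the intermediate series, and the int() casts assume every remaining value is a plain 'low-high' range:

    # Flag postings quoted per hour or as an employer-provided salary.
    df['hourly'] = df['Salary Estimate'].apply(
        lambda x: 1 if 'per hour' in x.lower() else 0)
    df['employer_provided'] = df['Salary Estimate'].apply(
        lambda x: 1 if 'employer provided salary:' in x.lower() else 0)

    # Strip those phrases out of the numbers-only series built earlier.
    min_hr = minus_Kd.apply(lambda x: x.lower()
                            .replace('per hour', '')
                            .replace('employer provided salary:', ''))

    # Split the 'low-high' range on the dash and cast each side to an int.
    df['min_salary'] = min_hr.apply(lambda x: int(x.split('-')[0]))
    df['max_salary'] = min_hr.apply(lambda x: int(x.split('-')[1]))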
Then let's quickly make an average salary: df['avg_salary'] equal to min salary plus max salary, wrapped in parentheses and divided by two. You don't always have to use lambda functions; if you're applying an operation across the whole column, this approach also works. As you can see, we now have the average for every posting, and this is really important because it's our dependent variable, the thing we're using for everything. So we have knocked out the salary parsing, which I think was probably the hardest item on the list.

Now let's look at the company name. I thought about splitting it where a number starts, but it looks like at least one company has a number in its name, so is there anything else irregular? Some of them might not have ratings, so let's check for any without a rating: if they don't have a rating, they also don't have the number appended to the name. Everything that does have a rating has it tacked on as the last three characters, so the simplest way to clean this up is to remove three characters from the end whenever there is a rating. So we create df['company_txt'], and since we're going to use both the rating and the company name, we can't just apply over a single column; we apply across the DataFrame. There is probably something more efficient than using lambda functions for everything, but in the essence of time, and to show you the approach I'm most comfortable with, that's what we'll do. The lambda returns the company name unchanged if the rating is less than zero (the ratings are numeric, and the non-existent ones are stored as negative one), and otherwise returns the company name up to the last three characters; we can do a little string slicing like that. On the first run we get an error about 'Rating' occurring at an index, because we didn't tell the DataFrame we're applying this over rows, so we add axis=1 and it works. Plugging through it here, guys: as you can see, we now have all of the company names by themselves. I don't know if we'll really use the company names that much for the actual algorithm building, but maybe for some of our exploratory analysis.

Now let's do the state field, meaning which state the job is in. We take the Location column and split on the comma: df['job_state'] is the location with a lambda applied that does x.split(',') and takes, not the first entry, the second entry. Let's check that it worked: scrolling to the end, we see all states. If we want to see how many postings are in each state, we can do df['job_state'].value_counts(). As an aside, the dot attribute versus the bracket notation is the same thing, but if there's a space in the column name, like Company Name, you can't use the dot method. The counts show California, Massachusetts, New York, and Virginia at the top; maybe Amazon has something to do with Virginia. It's also a nice way to check whether we have any outliers or erroneous values in there.
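A sketch of the average salary, company name, and state steps, assuming the scraper's columns are named 'Company Name', 'Rating', and 'Location':

    # The average of the range is the dependent variable we'll model later.
    df['avg_salary'] = (df['min_salary'] + df['max_salary']) / 2

    # The rating is tacked onto the end of the company name, so chop the
    # last three characters off whenever a rating exists (missing ones are -1).
    df['company_txt'] = df.apply(
        lambda row: row['Company Name'] if row['Rating'] < 0
        else row['Company Name'][:-3],
        axis=1)  # axis=1 so the lambda sees whole rows, not columns

    # Take the state abbreviation after the comma in 'City, ST'.
    df['job_state'] = df['Location'].apply(lambda x: x.split(',')[1])
    print(df['job_state'].value_counts())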
It looks like we took care of that pretty well. Now, before we do the company age, let's also check whether the job position is at the company's headquarters; I think that would be a pretty useful data point. I already forgot what the fields are called, but you can list all the columns of the DataFrame, which is nice. So we make a column for being in the same state: applying a lambda across rows, if the headquarters and the location are the same we give it a one, and if not, a zero, so 1 if the location equals the headquarters else 0. Again I have to add axis=1, and again, sorry for the sirens, rough times out there. As you can see, we now have a flag for whether the job is in the same location as the headquarters, which is a nice quick adjustment.

Now let's do the age of the company, and then parse out the job description. To get the age, we just subtract the year founded from the current year, and if the founding date doesn't exist we keep it as negative one. So we apply a lambda to the Founded field: x if x is less than one, else 2020 minus x. Let's check it: one company comes out as 239 years old, Takeda Pharmaceuticals, which looks like it was founded in 1781. I was actually kind of curious, so let's check that real quick. 1781, wow, it has stood the test of time.

Okay, the last thing, and something I think is pretty relevant: let's parse the job descriptions. Looking at the first one in the series, it has pretty much the full job description in there, which is good. So let's parse out Python and maybe some of the other data science tools. We'll look for things like Python and Excel; plain R will be hard to match because it's just a letter, so maybe R Studio instead. We'll just do a couple that I think are relevant, Python, R Studio, Spark, and AWS, and see whether any positions mention them. For Python, we apply a lambda over the job description: 1 if 'python' is in x.lower() else 0, and then check value_counts(). It looks like about half the jobs mention Python, which is pretty cool and will be a fun field to look at. For R, let's see if either variation, 'r studio' or 'r-studio', shows up. Always validate: it turns out only two postings have it, which is probably going to be too small to be relevant, but we'll decide that a little later. Spark is the big-data one, Apache Spark, and a lot more postings mention Spark than R Studio, which is interesting. AWS comes out around the same as Spark. Then let's check Excel for our last one; there might be some crossover with phrases like "excel at XYZ" rather than the software, but let's see.
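And a sketch of the remaining feature columns, assuming the columns are named 'Headquarters', 'Founded', and 'Job Description'; the keyword checks are plain substring matches, so they inherit the caveats mentioned above (plain R isn't matched, and 'excel' can catch phrases like "excel at"):

    # 1 if the job is located in the same place as the headquarters.
    df['same_state'] = df.apply(
        lambda row: 1 if row['Location'] == row['Headquarters'] else 0,
        axis=1)

    # Company age: keep -1 for missing founding years, else years since founding.
    df['age'] = df['Founded'].apply(lambda x: x if x < 1 else 2020 - x)

    # Simple keyword flags pulled out of the job description text.
    desc = df['Job Description'].str.lower()
    df['python_yn'] = desc.apply(lambda x: 1 if 'python' in x else 0)
    df['R_yn'] = desc.apply(lambda x: 1 if 'r studio' in x or 'r-studio' in x else 0)
    df['spark'] = desc.apply(lambda x: 1 if 'spark' in x else 0)
    df['aws'] = desc.apply(lambda x: 1 if 'aws' in x else 0)
    df['excel'] = desc.apply(lambda x: 1 if 'excel' in x else 0)

    # See how many postings mention each tool.
    print(df['python_yn'].value_counts())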
Excel comes out around the same, almost exactly the same as AWS, actually, which is interesting. I think those will all be relevant fields for our analysis.

Now that we have this data, we can either choose which fields to include or just save the whole thing, and I'm going to save the whole thing as the cleaned data. Actually, I would drop that first column, because we don't want the unnamed index column; I totally do this all the time. So we grab the name of that first column and drop it, and the DataFrame should be one column lighter. There we go. Now we make a CSV of this with to_csv(), call it salary_data_cleaned.csv, and pass index=False so we don't get another unnamed column (see the sketch at the end of this transcript). Then let's read it back in to make sure we did that right. Cool, that looks like we have it all set up.

All that's left to do is push this to GitHub and merge it, and then we can start on the exploratory data analysis. I'm just going to remove that last check and then push: git add, git commit with a message, and git push, which tells us the full push command to use for the new branch, and now we're up to date with a new branch on GitHub. On GitHub we can see the data_cleaning branch, so now we create a pull request, which takes the code we were working on and, after we've confirmed all the changes we want, makes it part of our master branch. So: create pull request, then merge, then confirm merge, and our master should be updated as well. If we go look, we can see the extra file we just made, well, two files, and they should both be located there; salary_data_cleaned.csv is now in master. Generally you would delete the data_cleaning branch at this point, but for now we'll just leave it.

So, as you saw, we cleaned up this data and now we can actually use it in our analysis. We took the salary and made it numerical, which was probably the most important thing we had to do, and we also extracted and did a little bit of feature engineering on some of the job description components as well as the state. In the next part, part four, we're going to do some exploratory data analysis, which is where it starts to get really fun: we'll see how the salary changes by state, by job title, and by some of these other things, and that will lead us into how we actually build our model to predict the salary. So please stay tuned for that video, thank you so much for watching, and good luck on your data science journey.
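For reference, a minimal sketch of the export step described in the transcript; 'Unnamed: 0' is the usual name of the leftover index column from the earlier CSV write, and the output filename matches the one used above:

    # Drop the leftover index column that came from the original CSV export.
    df_out = df.drop(['Unnamed: 0'], axis=1)

    # Write the cleaned data without adding a new index column this time.
    df_out.to_csv('salary_data_cleaned.csv', index=False)

    # Read it back in to make sure everything saved correctly.
    check = pd.read_csv('salary_data_cleaned.csv')
    print(check.head())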
Info
Channel: Ken Jee
Views: 44,667
Rating: 4.9729042 out of 5
Keywords: Data Science, Ken Jee, Machine Learning, data scientist, data science journey, data science project, data science project from scratch, machine learning project, kaggle project, data science project python, machine learning project python, data scientist salary, data science for beginners, data science project for beginners, data science project tutorial, data science project walkthrough, github, data cleaning python, data cleaning, data scientist salary in usa
Id: fhi4dOhmW-g
Length: 43min 52sec (2632 seconds)
Published: Wed Apr 08 2020