Data Science Project from Scratch - Part 4 (Exploratory Data Analysis)

Video Statistics and Information

  • Original Title: Data Science Project from Scratch - Part 4 (Exploratory Data Analysis)
  • Author: Ken Jee
  • Description: This is part 4 of the Data Science Project from Scratch Series. In this video I perform an Exploratory Data Analysis (EDA) on the data that we collected from ...
  • Youtube URL: https://www.youtube.com/watch?v=QWgg4w1SpJ8
Captions
Hello everyone, Ken here, back with part four of the Data Science Project from Scratch series. In this video I'm going to be doing some exploratory data analysis, otherwise known as EDA, on the data that we collected from glassdoor.com in part two and cleaned in part three. I'm really excited about this EDA portion of the project because it's the first place where we can actually get some cool insights from the data that we collected. If you enjoy this content, please hit that like button for the YouTube algorithm, and remember to subscribe and turn on notifications to be alerted when I post the next segment of this project.

Okay, so let's get started with this EDA. As you recall, we want to go and pull down our git repo, so we're going to change to the folder and then run git pull, and it looks like we're up to date. I'll probably blur out my IP address there just in case. Next we want to run git checkout -b, which creates a new branch for this project, so this is going to be the data EDA branch. Great. So now we're actually using this branch, and everything that we push back up will go specifically to this branch, so we don't have to worry about messing up master.

Our last two segments we did in Spyder, and I think that when you're writing code and running it, that way makes the most sense. But for the actual EDA, where we're using a lot of visuals and you want to tell a story, it might make more sense to use a Jupyter notebook, so why don't I open up one of those and we'll get started. You can just open the prompt and launch it that way, and Jupyter runs in the browser, which is really nice. We're going to go to our Documents, find this folder here, and then create a new Python 3 notebook. Again, as you can see, this runs in the browser.

As usual, we want to import pandas as pd. We're going to be doing a lot of visuals for this, so let's import matplotlib.pyplot as plt, and then let's also import seaborn as sns. Seaborn is also a
plotting library built on top of matplotlib. Let's also read in our data, so our data frame is going to be equal to pd.read_csv, and I can't remember what the file is called, so let's just go to the file explorer: salary data cleaned. I just hit rename and copy the name; that's the fastest way to do it for me. Perfect.

Unless you're using something like JupyterLab, you usually don't get the pop-out, Excel-style view of the data, so we're going to have to use a couple of different commands to actually view it. With df.head we can see what our data looks like, and if we want to get all the columns we can just do df.columns. Perfect.

One thing that I noticed when I was editing the last video is that I left out a couple of things that I wanted to include, and there were a couple more things that I needed to engineer to make sure they work in our EDA and our model, so why don't we go and clean those up. Part of exploratory data analysis is also feature engineering, which again falls somewhere in between data cleaning and EDA, so we'll do a little bit of that to start off. I've written some functions beforehand to save us a little time on this, because I obviously made that whole video last time about what that process is like. Let me pull up that file, if I can find it. Okay, so here is some of the code.

As you recall, our titles had a lot of different characters and a lot of different formatting. Some were data scientist positions, some were machine learning engineers or analysts, but we don't need all of the specifics; we just want the broader category that they fall into. So I made some tags that will make them fall into only four or five different categories. There's also seniority attached to a lot of these positions, and we want to be able to parse that out. So we run that, and now we can actually start
formatting this ourselves. All right, so let's quickly make these work. For our data frame we're going to create a job simplified column, and this is equal to the data frame's job title column with apply. Instead of using a lambda function, we can actually pass in the function that I wrote right here, so we just call it title simplifier. Then let's look at what those look like with value counts. If you remember value counts from our last video, I really love that. So we see that there are a lot of data scientist positions, and there are a lot of NAs. I saw a lot of research scientist positions and things like that, so that might be where all of those are coming from. But even if it's not a true data science position, we can still use it as a data point.

I also want to parse out seniority from that same column. So we're just going to copy that, call this seniority, and then apply the seniority function here, and check it the same way. You can see we have a lot of not specified, a lot of senior, and a lot of junior, so that's a good way to evaluate that.

I also saw that we had Los Angeles in the data frame as a state, and we probably shouldn't have that, so let's look at it. We're going to do df job state value counts, and as you can see we have Los Angeles down there. Let's fix this really quickly. We're going to set df job state equal to df job state apply, with a lambda x, and we're going to return x.strip (strip just removes any extra spaces) if x.strip().lower() is not equal to los angeles, else we return CA. Okay, let's see if this one works. We're going to run this again... oh, I have a space in the name, so this should work now. Okay, there we go. I accidentally created a new column in our data frame, job state with a space in it, so we're going to do a df drop of that column, in
place. All right, so what that did is drop that fake column that we made. Inplace equals true means we did it in the data frame, so it's saved in the data frame, and axis equals one is to let it know that it's a column.

Next we want to find the job description length, which should be pretty quick as well. We're going to set the data frame's description length column equal to the job description column with apply, lambda x, and we return the length of x. That's pretty straightforward. Then, just to show you what this looks like, let's display that as well. As you can see, the length is in words, I think... it might be characters, I can't remember. But regardless, it might be interesting to know whether companies with longer descriptions are posting higher or lower salaries. You might want to explain away a low salary by having a really in-depth description. I don't know, I just think that could be pretty interesting.

We also have competitors listed in this column here. I'm a pretty bad speller, so I'll just copy the name, and let's see what that looks like. It looks like they're separated by commas, and it's negative one if they don't have any competitors listed. We're going to set that equal to the competitors column with apply, lambda x, and we're going to return the length of x.split, splitting on a comma. If you recall, split turns the string into a list with one item for each piece between the split criteria, so if there are four commas there will be five items in the list. We return that if x is not equal to negative one, else zero. Let's see how that turned out. Okay, so I accidentally overwrote the same column of the data frame. We do not want to do that; we want the
number of competitors. So let's just re-read our data and redo those steps. Okay, that looks good. Then we're going to change this to num comp rather than overwriting the original column. So now we should be able to see num comp, and we have our numbers, and if we look at the original competitors column we should still see that. Okay, perfect.

Next we're going to make our hourly wage into an annual wage. I've mentioned that we wanted to do that, and I forgot to do it in the last video. Basically, you can take an hourly wage and multiply it by two thousand, and that roughly equates to the annual wage. In this case we don't need the hundreds of thousands, the actual number; we're working with the actual number divided by a thousand, so we can just multiply the hourly salaries by two and we should get approximately what the annual wages would be in our data set. That might be a little bit confusing; if it is, just leave it in the comment section and I'll try to explain it a little bit better there.

So df min salary is equal to df.apply with a lambda x. Again, we're not just applying it to the min salary column, because we actually want to use two columns for this: whether they're hourly or not, and also the min salary. So we're going to do x.min_salary times two if x.hourly equals one, else x.min_salary. And I can guess this won't work... yes, we got some errors, because I have to add axis equals one. I always forget this; you'll see me forget it about four times in this series of videos. Let's check out what that looks like. Printing that probably wouldn't help much, so let's just look at the data frame so we know we did something. The min salary should now be greater than, or really close to, the max salary for a lot of these hourly people, so we're just going to pull those three columns. I forgot a
comma. And then maybe let's filter these to hourly equal to one. Okay, here we go. As you can see, our min salary is higher than our max salary in a lot of places here, so that is what we wanted to see. Now let's go in and fix the max salary. We do basically the exact same thing, and instead of min salary we just use max here and max there. Let's see if this looks better. There we go... hmm, I think I did something wrong. Oh, it's because I multiplied min salary again. So let's go back and just run this again. I know it's a pain in the butt, but this is part of the whole process. Okay, so now this should work. Perfect, what we see is exactly what we want to see.

The final thing, I think, is that we want to remove the newline character from the company text. Let me see if I can show what that looks like. If we look at our company text column, there's this newline character, and this is something that's really easy to remove. We just do df company text equals df company text apply, lambda x, with x.replace to take out the newline. Okay, there, and then we can see that it should be gone. All right, great stuff.

Okay, so let's actually start diving into some of the data. The first thing I almost always do is df.describe, and this gives you, for all of the continuous variables in your data frame, the count (it should be 742 for everything), the mean of each feature, the standard deviation, the min, and the quartiles. This is good just to get a feeling for what the data looks like. There's a pretty big standard deviation for founded, just based on the way we structured that, but for the feature we created from it, which was age, it's a lot smaller because it's a different metric.

The next thing I like to do is create histograms for most of the relevant features, so let's
look at our columns again. You guys will notice that when I go through this, especially in the Jupyter notebook, it's very iterative. I keep checking the columns, I keep making small changes. Again, I don't think this is best practice for programming, but this is generally how a lot of data science goes. When you actually present this to someone you clean it up, but Jupyter notebooks are the best way to see this and see how you walk through the steps. You can also add in not just comments but text cells, so you can tell your story as you go along, and if you go on Kaggle you'll see that's how a lot of people do it.

So let's look at our ratings, for example. The nice thing about pandas is that there are some visualization capabilities built in, which I really like and use for very basic stuff. We just do df rating hist. Oh, it's capital-R Rating, sorry guys. There we go, and we can see that it's pretty close to a normal distribution. Let's also look at df average salary hist. You can go through a loop and find all of the numerics and do this, but since technically ones and zeros are still numerics, it's kind of a pain in the butt to do that here.

Let's also do a histogram of age. As you can see here, age is almost never normally distributed; it follows more of an exponential distribution. So if we're going to use this in our model, we might have to normalize it, especially if we're using a regression. We also built this job description length, so let's go through and look at the distribution there. So df desc len... okay, great, this also follows pretty close to a normal distribution.

You can also do box plots. We should be able to follow the same format if we wanted to see that; box plots are also extremely useful. So let's do average
salary... no attribute boxplot. Well, let's just do this: df.boxplot, using the same type of call they did here, with column equals age, average salary, description length, and what was the other one we did? Rating. And I think we can put those in brackets. There we go. Okay, so you can see the box plots are obviously not on the same scale, so I probably should have normalized first or done them individually. It really makes no sense to keep description length in here, so let's just see what it looks like without that. Cool, so we got a little bit more interesting information. We can see that there are some high outliers on average salary, and likewise there are some really old companies. Let's just look at rating. Oops, what's that? Okay, I don't think you need that. There we go. You can see that there are some really low ones; obviously we've kept that negative one in there for companies that don't have a rating, so those will come out on the low end.

The next thing we generally want to do is look at the correlations between our continuous variables. I'm going to do the continuous stuff first, then I'll go through the categorical stuff, and then we'll make some pivot tables. So let's look at a couple of different correlations. We're going to take df with just the columns that we had up here and call corr on it. Oh, sorry, that is not how corr works. How do I make a correlation in Seaborn? I've used this one before... oh, I need the parentheses, that's what I was missing. Okay, cool. So you can see the actual correlations between each of these. Rating has a very small negative correlation with description length, and older companies generally have, it looks like, longer description lengths. But usually we want to make this slightly more visually appealing, so let's plot this using Seaborn. Again, this is how I usually plot things: I find some of
the places online that have an example, and then I go through and build on it. So we're going to use this heatmap, and our corr here is going to be this expression; we could put it into a variable, but I don't think that's necessary. We don't need a mask, and I don't think we need cmap. Let's just see how that looks. Okay, so that's pretty ugly. Let's actually use some of the formatting that they use here. I believe this controls the coloring, so let's include this above, and then we're going to add in an argument, cmap equals cmap, and then the rest of it. All right, so that looks a little bit prettier. The colors are inverted from what we'd probably want, but it looks like the strongest correlations are between age and description length and between salary and description length. Those are positively correlated, so that's something we could take into consideration. It looks like rating isn't really highly correlated with anything, but it's slightly negatively correlated with description length right here.

So we have a correlation plot that we can actually use. When we're model building, we do want to pay attention to what's correlated with what, and we want to avoid multicollinearity in our models. This is something that we really want to pay attention to if we're using regression.

Now let's go into the categorical variables. Let's look at the columns and find the ones that are categorical but aren't too big, because we want to graph them. Company name we probably don't want to use. Location probably should fall into that, so let's do location, headquarters, size. I think founded is the year, right? So we don't want to use that. Type of ownership, industry, sector; revenue is in groups. I don't think we want competitors. Oh, actually, we can add competition in up here if we want, so let's put in... what was it... num comp. Let's see what that looks like. So it looks like older companies have more competition, which
is interesting and somewhat intuitive. Also, the description length is generally longer for companies that have more competition, so that's interesting. Do they have to explain how they're differentiated a little bit more? We can go into the whys later, but it's fun to think and brainstorm about what could make some of these things happen.

So: revenue, maybe company name (we'll just do it for kicks), job state, same state. We don't want age... well, we do want that. What else might we want? We don't want any of the salary metrics. We want all of these, and then that should be good. All right, so what we did here is take all of the categorical data so we can loop through and graph a bunch of this stuff. Let's just use simple bar charts; that's probably the best approach.

We're going to do for i in df cat columns, and we're just going to make a simple bar chart for each one with sns dot... I wish there was some code completion. Okay, so this is what it looks like: sns.barplot, with x equal to... on the x-axis we want our index, so that would be df cat i value counts index. Actually, let's make a new variable so we don't have to make this code super ugly. All right, so x is going to be cat num dot index, and then y is going to be equal to cat num, which is the series itself. And then I'm going to call plt.show. Okay, so we get a lot of really ugly graphs, but let's try to clean these up a little bit. There are a lot of graphs; I just wanted to make them pretty quickly. But let's first say what each thing is. We're going to print graph for percent s, which is some string formatting in Python, and then we also want the total, total equals percent d, so we're going to pass i and then the len of cat num. All right, let's try that again. Okay, graph for location: there's a total of 200 different
locations and 298 different headquarters. Let's also try to rotate the labels. So, Seaborn x label rotation... all right, it looks like I've actually done this before. Here we go. For this, we're just going to put that in there, we're going to call this chart, and we want them at 90 degrees, I think. Let's try that one more time and see if they look a little better. That one doesn't, but a lot of these really do.

So let's go through and actually analyze some of these things. Let's skip the locations and headquarters; I would imagine that San Francisco is probably the most popular there. For size, most companies are in this 1,000 to 5,000 range, which isn't the biggest; there's 5,000 to 10,000 and 10,000 plus, and there are some very small ones. We have type of ownership, so private companies are the most common among those hiring for these positions, which is a little bit surprising to me. You have some nonprofits, and we'll go through and compare salaries across all of these things as well. Industry is pretty dense, but sector is interesting. You have IT, then biotech and pharmaceuticals; I knew data science was pretty big in that field, but I didn't realize it was this big. Business services, and I don't really know what that means. Insurance, healthcare, and then we have some in agriculture and some that are unmarked. So that's some pretty cool stuff from a very simple loop that's giving us a lot of charts.

Again, with revenue we can see that some of the biggest companies by revenue are hiring data scientists, and there are also some that have less than a million in revenue. I guess it's probably not surprising that the biggest companies by revenue are hiring the most data scientists, because I think data science is a huge driver of growth. This one is positions by company, and we can break out some of these in a little bit more detail. So let's copy this code and maybe just
look at some of those really long ones more carefully. We'll do location, headquarters, and company text. Oh, I forgot to look at job state; that's probably a pretty interesting one. There we go: California, Massachusetts, New York, Virginia, Illinois. What was another long one? I think that should be good. But instead of taking all of the values like we were doing before, where there are so many that you can't really see anything, we're going to just take maybe the top 20. Okay, nice. So location: New York, California... oh, sorry, I did this backwards; the slice should go after the value counts. There we go. So for location, there are a lot of job postings from New York, then San Francisco, Cambridge, and Chicago. For headquarters, a lot of them are New York and San Francisco, no surprise there. Chicago is a big city, but that's a little bit surprising. And considering we saw a lot of Massachusetts postings, I'm surprised Boston is lower; I guess there are multiple Massachusetts locations on here, so that could be an interesting thing. For companies, it looks like MassMutual is hiring a lot, and Takeda Pharmaceuticals. There's no Amazon, none of the super big companies on here, which is fairly interesting as well.

Okay, now let's start doing some pivot tables and looking at salary by some different categories. Let's do df.columns again just to see what we're working with. Let's do our first pivot table with index equal to the job title, the simplified one that we made, and then we're also going to make the values equal to average salary. So we get some interesting stuff. We see analysts making less, data engineers making slightly more, and data scientists doing pretty well. The one thing that strikes me is that managers are getting paid less than data scientists or data engineers, and machine learning
engineers, it looks like, are coming in a little higher than data scientists, so I think that's a pretty cool observation as well. Let's add one more layer. We can use multiple variables in the index, which is the equivalent of a group by, so we can also add in seniority and see how these positions are getting paid at junior and senior levels. Senior data analysts are making about as much as, actually more than, a normal data scientist. Senior machine learning engineers are making quite a bit of money. Anyone with senior in the title is getting a pay bonus here. As you can see, though, analysts are making less money, but you can move into an engineering role or a data scientist role from an analyst position if you focus on new projects and learn some of the material.

Now let's also look at a couple of different things. We could look by age of the company, but let's do it by location; that's a fun one. Let's just copy this. We're not going to use the actual location, we're going to use job state. Then we want to sort these by average salary, and we want to set ascending equal to false so we get this in descending order. So we see California, Illinois, DC, Massachusetts, New Jersey. New York is really far down there on the data science salary ranking, which is surprising to me, especially because the cost of living in New York is so high. Maybe they're hiring more analysts, or maybe they're doing something different there. So let's look into that. Let's do job state by that job simplified field, and let's sort by values... actually, we'll just order them by job state. Okay.

So now we want to see the full list. How do I display the full data frame output? Display, set option... do I have to
do anything, do I have to import anything? We'll see if this works; let's just experiment a little bit. Again, this isn't my bread and butter per se. The default value... okay, so max rows is sixty, and I believe this option is what sets it. Let's try that again, and then let's run this again. There we go, nice. Let's put this up here. And now we can go down to New York... clearly past New York, I'm not great at the alphabet.

Okay, so we want to actually do a count here, aggfunc equals count, so this is going to give us the totals. Again, let's look at New York: they have forty data scientists, four engineers, and fourteen analysts. Let's look at, for example, California: more analysts and engineers, I think proportionally more. And then let's look at Chicago: Chicago is hiring a lot of directors, so that could be a factor. There's a lot more diversity here, and actually only a few analysts. So that could be what's bringing that average New York salary down.

Let's do one more exploration here and filter this data down to only data scientists, to see how different that is for New York compared with some other states. We're going to filter df to where job simplified equals data scientist, and then we're just going to keep job state as the index and remove this. Then we also want to remove this count; by default it uses the average. And then let's sort values again by average salary. Okay, cool. So it looks like DC is actually leading in the average salary category, and New York is still behind, which I think is a pretty interesting insight. Maybe you're not getting as good a bang for your buck in New York as you are in some other states.

Okay, let's also look at some other things. Let's make a list here; I do this a lot. We want salary by rating, industry, sector, revenue, number of competitors, hourly, employer provided, Python, R, Spark, AWS, Excel, description length, and then type of ownership.
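The state-level pivot tables described above can be sketched roughly as follows. This is a minimal illustration, not the notebook from the video: the column names (job_state, job_simp, avg_salary) and the toy rows are assumptions made so the example runs on its own.

```python
import pandas as pd

# Toy stand-in for the cleaned Glassdoor data; the values are made up
# and the column names (job_state, job_simp, avg_salary) are assumed.
df = pd.DataFrame({
    "job_state": ["NY", "NY", "NY", "CA", "CA", "DC"],
    "job_simp": ["data scientist", "analyst", "data scientist",
                 "data scientist", "data engineer", "data scientist"],
    "avg_salary": [110.0, 70.0, 120.0, 140.0, 130.0, 150.0],
})

# Show every row of long pivot tables instead of a truncated view.
pd.set_option("display.max_rows", None)

# Mean salary per state, highest first (pivot_table averages by default).
by_state = (
    pd.pivot_table(df, index="job_state", values="avg_salary")
    .sort_values("avg_salary", ascending=False)
)

# Number of postings per state and simplified job title.
counts = pd.pivot_table(
    df, index=["job_state", "job_simp"], values="avg_salary", aggfunc="count"
)

# Data scientists only: mean salary per state, highest first.
ds_only = (
    pd.pivot_table(
        df[df["job_simp"] == "data scientist"],
        index="job_state",
        values="avg_salary",
    )
    .sort_values("avg_salary", ascending=False)
)

print(by_state)
print(counts)
print(ds_only)
```

Comparing the count table with the filtered table is what separates a state that pays data scientists less from a state that simply posts more lower-paid roles.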
So I'll go through and make all of these pivot tables, and we'll see if we find anything interesting. Of course, like before, we should probably loop through, so let's do that. Rating, industry, sector... I probably should have named these better to start with; I should have made Spark and AWS lowercase as well, but I didn't. We actually don't want description length in here. And then type of ownership. Nice. Let's see what I spelled wrong... key error... so type of ownership is not in the index. Type of ownership, capitalized. Great.

So now we have this data frame. We're going to do for i in df pivots columns, and we're just going to print i so it'll say what column it is, and then we're going to print the pivot table. That pivot table takes df pivots as the data, with index equal to i and values equal to average salary. Let's see how that works. Key error... huh, it didn't like that. Okay, let's do some debugging here. We'll just comment this out and make sure the loop works. We know that works, so the problem is in the pivot table. Oh, I'm not using pivot table, that's what the problem is. Classic. There's a difference between pivot and pivot table, and I obviously made that error there. This will fix all of our problems: pd.pivot_table with df pivots, index equals i, values equals average salary. And then we cross our fingers and... oh, it doesn't like our values of average salary. Oh, because average salary isn't in here. Come on, Ken. There we go.

So rating... yeah, rating is continuous, I don't know why I actually put that in here. Industry... all right, we want to sort by values here. Great. Okay, so the companies that had no ratings... NoneType object has no attribute sort... oh, because it's inside the print statement. So as you can see, this is a very iterative process. The reason I did it this way is so you can
see the mistakes I make and see that it's natural for everyone to have some debugging to do.

So, for rating: a rating of negative one obviously means they're unrated, but companies that scored a perfect 5 are offering more money, and so are those at 2.5, which is really low. So this looks like it could be pretty random, but we're seeing more twos at the bottom, except for this 4.8, and the sample size for these is probably pretty small. For industry, it looks like retail pays a lot. We should also probably include the counts for these, just because some are so high, but motion pictures pays a lot too. Let's see what's on the bottom end. We'd expect nonprofit; gambling is very low, which is interesting; sporting goods. Let's see sector: construction is the lowest, nonprofit is pretty low, which is what we'd expect, and real estate pays reasonably well. Again, media is at the top here, which I think is interesting.

In terms of company revenue, it looks like the top ones are the lower-revenue companies and also the highest, but the bottom is all these mid-tier revenue companies, which I think is interesting. So if you don't have a lot of money or you have a lot of money, you're willing to pay. Obviously these are anecdotal observations, but I still think that's relevant. Companies with a little competition, it looks like they pay better, though I wouldn't put too much weight on this. Hourly workers are obviously paid a lot worse. When the salary is employer provided, it's higher. Looking at Python, when it's in the description the salary is higher. When R is in the description it's lower, but the sample here is small; remember, there are only two actual R samples there. Spark is high, AWS is high, and Excel is lower; most data scientists don't use Excel, so that's probably more catered to analysts. And then if we're looking at ownership, if you
recall, higher up, the companies with the most counts, the most job postings, were private, but it looks like public companies are in general paying the most. Universities are paying a reasonable amount, which I think is interesting. Government is, you know, government. Nonprofit, of course, is also on the lower end. It looks like we ran into some more errors, but we got exactly what we needed regardless, so it doesn't matter too much that we ran into some errors down there. Okay, so you can go down a pretty aggressive rabbit hole with all these pivot tables. You can also add columns to look at multiple things at the same time. Let's just do a quick example of that, but in general I think I've gotten most of what we need from the exploratory analysis. Let's just do a pivot table and look at company revenue and whether they're looking for Python jobs. df_pivots, index is equal to, let's do revenue, and then we're going to set our columns equal to python_yn, and then our values equal to average salary. As you can see here, this is [inaudible], so let's do count. Cool. So we can see, by each range, who's really looking for more Python people in general, which I think is pretty cool. So as you can see, we can look at the different companies and compare who's hiring either the most data scientists or the most data scientists who are interested in Python, and if we wanted to we could make a ratio. From this, the 1 to 2 billion companies look like they're really focusing on Python in comparison to the over 10 billion ones, where maybe you're looking at less focus on Python and more focus on other skills. Okay, so the last thing I thought it would be fun to do is to maybe make a word cloud with the job descriptions, to see what words are most popularly used when talking about candidates. So I'm actually going to go in and pull up
some of my old code. All right, so let's actually pull up this code, and you can see it here, and we're just going to copy in some of the code that we used to be able to get this going. So I don't have to use the Twitter scraper, I don't have to use datetime, and I already have pandas. I probably need this word cloud, and then I probably also need those. So let's get word cloud here, I'll import that, and then we also might need these NLTK toolkits as well. Let's just import all those things and see if I have them. Great, I do. It's really common to use your old code and reference it; I do this quite frequently. I think we just want the words, and we don't want to get rid of too many duplicates in this one, so let's take that. That looks pretty good. We want to get rid of stop words, so we're going to include that as well. I haven't done this one in a while. We don't have any unwanted words. We want to actually get text here, and obviously we're going to be versioning this for this actual use case. And then we don't have a mask. We have this word cloud that we want to generate, and then we want to show it off. Here we go. All right, so let's see if we can work through this and actually make it work for us here. I don't think there are any duplicate jobs to drop, so we just use the data frame's job description and that should work. Now it should filter everything out. We take the text and we join everything to make it into one really fat word chunk. We don't have a mask. We do have to remove stop words, so let's look in here and see how I did that. Okay, so it looks like I didn't have to define that; that's already defined. Those things all look good. We don't need to recolor it, so we should just be able to do that, and then let's see how it looks. Let's see if we did that really fast. So text; we don't have unwanted words, so words filtered, just do that. And again, crossing your fingers didn't work last time, but I'm a bit more optimistic
this time. The word clouds can take a little bit of time to generate. I just think that this is a nice way to understand the data. I use word clouds a lot; they're not particularly useful other than that, but again, this is a pretty interesting use case here. So it looks like we're missing something: max words, generate text, imshow. Let's see. Oh, we have to pass our actual word cloud in here; that seems to make good sense, so let's give that a go. This is probably going to be the last thing we do in the exploratory analysis. I think that this is so deep and there's so much information here that I'll probably do a separate video where I talk about my findings from this. I think that just the descriptive analysis, the EDA, could even be its own project here, but we wanted to make sure that we actually are going to build this model, so let's focus on doing that in part 5. Okay, cool, so we have this neat word cloud. You see data a lot, you see team, solution, product, client, business, as well as machine learning, research, financial, data analysis, and again a lot of really cool stuff here. Before we log off, let's remember to push this to our GitHub repo. We're going to name this data cleaning, and then we're going to go in here, we're going to git add the Jupyter notebook, git push. So we have to set it, then git push. Cool, so this is being pushed to our GitHub. We will be able to see it here when we reload it, and we can go into data eda, which exists now. We want to create a new pull request, so EDA analysis, we're going to create the pull request, and then we are going to... oh, am I in a different account? Oh, I am, yeah, I'm in my work account, so I can't merge this. I'll go into my PlayingNumbers account to actually merge that pull request. Let's go to ds_salary_proj, pull requests, and then here we go, and go in and merge this information, so that is back up to date again. In part five we're
going to go through the actual model building process. We'll try a couple of different models and see what performs the best for predicting these outcomes. As usual, thank you so much for watching, and until next time, good luck on your data science journey.
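As a recap of the pivot-table steps in this video, here is a minimal sketch of the loop and the two-column pivot. It is not the notebook from the video: the rows are made up for illustration, and the column names (`Rating`, `Sector`, `Revenue`, `python_yn`, `avg_salary`) are assumptions standing in for whatever the cleaned Glassdoor data frame actually uses. It shows the three fixes from the debugging above: use `pivot_table` rather than `pivot`, keep the salary column in the frame you pass in, and sort the table before printing it rather than sorting the result of `print`.

```python
import pandas as pd

# Toy rows invented for illustration; this is NOT the Glassdoor data.
df = pd.DataFrame({
    "Rating": [3.5, 4.0, 3.5, -1.0, 4.0, 3.5],
    "Sector": ["Media", "Retail", "Media", "Non-Profit", "Retail", "Non-Profit"],
    "Revenue": ["$1 to $2 billion", "$10+ billion", "$1 to $2 billion",
                "Unknown", "$10+ billion", "$1 to $2 billion"],
    "python_yn": [1, 0, 1, 0, 1, 1],
    "avg_salary": [120.0, 90.0, 100.0, 60.0, 110.0, 70.0],
})

# The value column (avg_salary) has to be part of the frame being pivoted,
# otherwise pivot_table raises a KeyError.
df_pivots = df[["Rating", "Sector", "Revenue", "python_yn", "avg_salary"]]

tables = {}
for col in ["Rating", "Sector", "Revenue"]:
    # pivot_table aggregates (mean by default); DataFrame.pivot only reshapes
    # and fails when index values repeat.
    table = pd.pivot_table(df_pivots, index=col, values="avg_salary")
    # Sort the table itself; print() returns None, so it cannot be chained.
    tables[col] = table.sort_values("avg_salary", ascending=False)
    print(col)
    print(tables[col])

# Two dimensions at once: postings per revenue band, split by the Python flag.
counts = pd.pivot_table(df_pivots, index="Revenue", columns="python_yn",
                        values="avg_salary", aggfunc="count")
print(counts)
```

Cells with no postings come back as NaN in `counts`, which is worth keeping in mind before turning the two columns into a ratio.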
Info
Channel: Ken Jee
Views: 45,442
Keywords: Data Science, Ken Jee, Machine Learning, data scientist, data science journey, data science project, data science project from scratch, machine learning project, kaggle project, data science project python, machine learning project python, data scientist salary, data science for beginners, data science project for beginners, data science project tutorial, data science project walkthrough, github, data scientist salary in usa, exploratory data analysis, eda, feature engineering
Id: QWgg4w1SpJ8
Length: 68min 39sec (4119 seconds)
Published: Fri Apr 10 2020