How to Web Scrape Indeed with Python - Extract Job Information to CSV

Video Statistics and Information

Captions
In this video I'm going to show you how you can scrape data from indeed.com. Hi everyone, welcome, my name is John, and let's get going. The first thing we always need to do is check out the website, so we're going to go to Google, go to indeed.com, and do a search for a job spec and a place so we have some data to work with. The first thing I'm actually going to look at is the pagination. I can see the URL at the top, and I'm going to see how it changes when I move to the next page, just to give you an idea of how everything works. By clicking on page 2, I can see it adds an extra part to the URL: it says start=10, so I'm assuming that's the number of records. If I change that back to 0, we get the first page again, and if we change it to 20, we actually end up on page 3, so that should be absolutely fine: 10 is page 2. Great. Now that I know what the URL is, I'm going to copy it with the pagination on the end and quickly save it in my code as a variable, url. To check how we could scrape this site, I always do View Source first, unless it looks like it's rendered with JavaScript, which this one doesn't. Then I type in some text that I think was on the page and check whether it's in the HTML somewhere. It looks like it is: we've got this data here, which is quite useful, but not everything. Keep going down and we can see that all the information is available in the HTML, so we can use Python with requests and BeautifulSoup to scrape it. Now that I know that, I'm going to go back to my code and start writing. We're going to import requests, we're going to need that (let me make the font a bit bigger, there we go), and then from bs4 import BeautifulSoup, which is what we're going to use to parse the data.
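The pagination observation above can be sketched as a small URL builder. The search terms (q and l parameters) here are example values, not the exact ones used in the video:

```python
def build_url(start):
    # Indeed's start parameter is a result offset, not a page number:
    # start=0 is page 1, start=10 is page 2, start=20 is page 3, etc.
    # Query values below are placeholder search terms.
    return f"https://www.indeed.com/jobs?q=python+developer&l=london&start={start}"

print(build_url(0))   # first page
print(build_url(20))  # third page
```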
I always work on a three-function approach: extract, transform, and load, which is a common computing pattern. The first one is extraction, which gets the data off the website. So we'll do def extract(page). What this means is that when we run this function, the page we choose gets put into the URL inside the function. I'm going to indent the URL into the function, and at the end, where it said start=0, I'm going to put curly brackets and write the word page, because that matches the function parameter; putting an f at the start of the string turns it into an f-string. So when we run this function with a page number, whatever we pass in appears in the URL: 0, 10, 20, 30, and so on. Now that that's done, I'm also going to set a custom user agent. I'll quickly Google "my user agent", copy the string, go back to my code, and type in headers, equal to a Python dictionary with the key "User-Agent", and paste the string in as the value. Now that I've got the URL and the headers set, I can do my request: r = requests.get(url, headers=headers). The next thing we want to do is test that this works, so I'm just going to return r.status_code, and underneath I'll call extract(0), because 0 was the first page. Running that... ah, we're returning the value but not printing it, so let's print it. There we go, we got 200, so I know this is working. Let's get rid of that, and let's get rid of our return statement as well, because we're going to add one more thing in here.
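The extract step described so far might look like this. The user-agent string is one typical browser string (the video copies whatever Google's "my user agent" returns), and the search URL is an assumed example:

```python
import requests
from bs4 import BeautifulSoup

# A typical desktop browser user-agent; substitute your own, as in the video.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/85.0.4183.102 Safari/537.36"
}

def extract(page):
    # The f-string drops the result offset into start=; query terms are examples.
    url = f"https://www.indeed.com/jobs?q=python+developer&l=london&start={page}"
    r = requests.get(url, headers=headers)
    # While testing, you can `return r.status_code` here and check for 200.
    return BeautifulSoup(r.content, "html.parser")
```

The function takes the offset (0, 10, 20, ...) and hands back the parsed soup for the transform step.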
What we're actually going to add in here now is our soup variable: soup = BeautifulSoup(r.content, "html.parser"). You can use whichever parser you want; I'm just using this one at the moment. Then we return the soup from the function. So when we run this with the page number, it gets the page and returns the whole soup. We'd then store that in a variable so we can give it to our next function, the transform. I'll put that at the bottom for the moment: let's just call it c = extract(0), again 0 for the first page, and leave that down there. Now we need to go back to the website and have a quick look at where in the HTML the data we're after lives. To do that, I like to use the inspect tools; it's easier. What we're looking for is each one of these blocks having a class or a div, and there it is right away: we can see on the right-hand side the div class of jobsearch-SerpJobCard. Each one of those is a new job; if we scroll down a bit, you can see they get highlighted. So we know all the information is in here, and if we open it up we can see we've got the title and so on. I'm going to copy this and actually just use the first part of it to match. Back in our code, I'm going to start our second function, transform: def transform(soup). We need to give it a parameter, and I'm calling it soup because we're going to return the soup from extract and pass it into this one, so we might as well call it soup as well.
Now what we want is: divs = soup.find_all("div", class_="jobsearch-SerpJobCard"), remembering the underscore on class_. What that's going to do is go through the page and find all the instances where the div has that class. For now let's return len(divs), and after our extract down here we'll do print(transform(c)), which will just show us how many there are. We've called the parameter soup inside the function, but we're actually passing in c right there. If we run that, we should hopefully get a number back. We do: we got 15, so we've picked up 15 divs matching that class on the first page. Great. I'll get rid of the print statement, and we're not going to return the length of the divs, because that's not useful to us. What we want to do is loop through each div and get out the respective information. So back to the website again: if we hover over our div class, within the first one the h2 with a class of title holds the job title, but within that there's an a tag that has the actual text. So I'm going to go for the a tag in this case. Back in our code, we need a for loop to go through every one: for item in divs (you can call that variable whatever you like; I'm just used to using item). Then title = item.find("a"); hopefully that will do, because this is the first a tag within the div. When we use find, we get the first instance; find_all returns a list, and we don't want that in this case, so we're just going to use find and get the first a tag.
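The find_all and title steps can be exercised against a stripped-down stand-in for Indeed's markup. The class name jobsearch-SerpJobCard is the one visible in the video's 2020 recording; Indeed's markup changes often, so treat it as an assumption:

```python
from bs4 import BeautifulSoup

# Minimal mock of two job cards, mimicking the structure seen in the video.
html = """
<div class="jobsearch-SerpJobCard">
  <h2 class="title"><a>  Python Developer  </a></h2>
</div>
<div class="jobsearch-SerpJobCard">
  <h2 class="title"><a>  Data Engineer  </a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="jobsearch-SerpJobCard")
print(len(divs))  # 2

for item in divs:
    # find() returns only the first matching tag; find_all() would return a list.
    title = item.find("a").text.strip()
    print(title)
```

Running this prints the card count followed by the two whitespace-stripped titles.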
The first a tag we come across is the one we want. Back in our code, I'm going to print the title, putting .text on the end, and then put a return at the end of our function, and we'll check what comes out. We don't actually need to print the transform call, since the print is inside the function, so just transform(c). Run that, and you can see we've got the job titles back. It looks like there's a lot of whitespace, but we can get rid of that quickly and easily with .strip(), which removes the whitespace around the text we're scraping. There we go: now we have a nice stripped-out list of the job titles on that page matching our search string. Cool. Now that we know that works, I'm going to get rid of those prints, because we don't want them, and try to get more information, because we need more than just the title. Looking through, we can see there's a summary and there should be a company. Hover over it: the company is in a span tag with a class of company. So let's get that: company = item.find("span", class_="company"), and again .text, and because the last one needed .strip(), I'll do that as well, just in case it has whitespace around it too. Then print(company). Oops, I didn't put the equals in here, so we weren't actually searching for anything; fix that, run it, and we get the company information. Interesting: it looks like Facebook are hiring, but they're not on this page, so I thought we might be picking up information from somewhere else... oh, there they are. Cool, so that works. Now we want to try to get some more information out, so back to the website again. What else is useful?
I noticed some of the listings have a salary, so let's see where that's hidden: it's in a span with a class of salaryText. But not every one has it; if we try to look for it in this card, you can see there's no salaryText in there. So we're going to use a try/except. We'll do try: salary = item.find("span", class_="salaryText"), again with .text and .strip() to get rid of all the stuff we don't want. But we know that's not in every single one, so we need an except:, and because we're going to create a dictionary with all the information at the end, in the except block I'm going to set salary to an empty value, so the key is still there, just blank. Okay, that's good; let's go back and find something else. One thing I liked was the summary, because it's text that's in every single one, and it sums up what they're expecting from you. It's a div with a class of summary. I'm going to ignore the list items inside, because some only have one or two, and just take all the text of the summary div into one variable: summary = item.find("div", class_="summary"). Each of these happened to be a class, so we're able to use class_ with the underscore; if it was, say, an id, you could instead pass in a Python dictionary, like {"id": "summary"}, and that works for both classes and ids. We'll use class_ here, plus .text and .strip(). So we now have the title, the company, the salary if it's there, and the summary. What else could we go for? I think that's probably about it.
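The try/except for the optional salary can be shown against two mock cards, one with a salary span and one without. Class names (company, salaryText, summary) are the ones visible in the video's 2020 markup and may have changed since:

```python
from bs4 import BeautifulSoup

# Mock cards: only the first carries a salary, mirroring the live page.
html = """
<div class="jobsearch-SerpJobCard">
  <span class="company">Acme Ltd</span>
  <span class="salaryText">£40,000 a year</span>
  <div class="summary">Build scrapers.</div>
</div>
<div class="jobsearch-SerpJobCard">
  <span class="company">Widgets Inc</span>
  <div class="summary">Maintain pipelines.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.find_all("div", class_="jobsearch-SerpJobCard"):
    company = item.find("span", class_="company").text.strip()
    try:
        salary = item.find("span", class_="salaryText").text.strip()
    except AttributeError:
        # find() returned None (no salary span), so .text raised; keep it blank.
        salary = ""
    # class_= works for classes; for other attributes pass a dict,
    # e.g. item.find("div", {"id": "summary"}).
    summary = item.find("div", class_="summary").text.strip()
    rows.append((company, salary, summary))

print(rows)
```

The second row comes back with an empty salary field rather than crashing the loop.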
Okay, I'm going to leave it at that, and now we're going to create our dictionary. Once we've looped through all of those, we want to store the information in a dictionary so we can then append it to a list and manipulate that list nice and easily. I'll call it job: a dictionary where "title" maps to the title we created, then "company", then the salary if it's there, then the summary, which is really more like the description. Now that we've done that, outside of our function I'm going to create a list that we append all of these to; we'll call it joblist. Because we've got that outside the function, underneath the dictionary we can do joblist.append(job). Even though the function is defined above joblist, we're actually calling it further down, so we create the blank list first and then the function appends to it when it runs. All we need to do now is return, and make sure you put your return on the correct line, at the loop's level rather than inside it; otherwise you'll do one job and then return out of your function, and you won't get many results. Right, hopefully, if I've made no mistakes, we can now run our code: we've got our empty list, we're extracting the first page, and we're transforming it. Let's just run len() on joblist first; we might get some errors if I've made mistakes, but this should just print how many results we got, which I think was 15.
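The dictionary-and-list pattern, including the return-placement caveat, can be sketched without any HTML. The plain dicts standing in for parsed cards are hypothetical stand-in data:

```python
joblist = []  # created at the top level, before the function is called

def transform(cards):
    # `cards` stands in for the parsed job-card tags (mock dicts here).
    for item in cards:
        job = {
            "title": item["title"],
            "company": item["company"],
            "salary": item.get("salary", ""),  # blank when the card has none
            "summary": item["summary"],
        }
        joblist.append(job)
    return  # at loop level: indented one step further, it would exit after
            # the first card and you'd only ever collect one job

transform([
    {"title": "Python Developer", "company": "Acme", "summary": "Scrape"},
    {"title": "Data Engineer", "company": "Widgets", "summary": "Pipelines"},
])
print(len(joblist))  # 2
```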
Good, we've got 15 again, so now let's print the actual data. There we go: we've now got a list of dictionaries with the title, the company, a salary, and a summary holding the text information. What I'm going to do now is tidy up the summary text a bit by replacing the newlines with nothing: after the .strip() on the summary, I'll add .replace("\n", ""), where backslash-n is a newline. Now our data should look a bit nicer. Okay, that's better; we've removed all of those. Great, so we've managed to scrape the first page, and we know our functions work. We extracted with requests and BeautifulSoup, returned the soup out of the extract function, picked it up with our transform function, found all the divs that were the job cards, then looped through them and got the title, company, salary, and summary where available. If there's other information that's only present in some listings, don't forget you can use try/except like I've done here. Then we created a job dictionary and appended it to our list. Our list is down here, so I'm going to collapse these functions, because we know they work, and import pandas as pd, because we're going to use a DataFrame for this. I'll get rid of our print statement, because we don't need it, and leave our joblist there. Now I'm going to write a for loop to go through some pages. Normally a range steps through whole integers, so between 0 and 10 you'd get 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, but we can actually set the step it moves by, and our pages were 0, 10, 20, and so on. So I'm going to go from 0 up to, let's just do 40 for now,
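The summary cleanup chains .strip() and .replace() exactly as described; the text below is mock data. Note that replacing the newline with an empty string can fuse adjacent words, so swapping in a single space is a reasonable tweak:

```python
raw_summary = "  Build scrapers\nMaintain pipelines  "  # mock scraped text
cleaned = raw_summary.strip().replace("\n", "")
print(cleaned)  # "Build scrapersMaintain pipelines" -- words run together

# Variant: replace with a space to keep the words separated.
spaced = raw_summary.strip().replace("\n", " ")
print(spaced)   # "Build scrapers Maintain pipelines"
```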
and in tens. So it's 0 to 40 in steps of 10, which gives us 0, 10, 20, 30: four pages. Let's indent our calls into the loop, print the length of the job list again, and let it run. This goes through those first pages, and we've actually returned 60 results, 60 jobs' worth of information, which is pretty cool. That would be the simplest way to do it if you wanted multiple pages. If you weren't sure how many pages there were, you could instead do a while loop, go until you reach the end, see what happens, and add in some error handling for that. Now I'm going to add this information to a DataFrame: df = pd.DataFrame(joblist) creates a pandas DataFrame, and I'll print(df.head()) so we can see the first rows. Then df.to_csv("jobs.csv") creates a CSV file with this information. Inside our loop, I like to put a print statement so I can see what's happening, so I'll do print(f"Getting page, {i}") as an f-string with i in there; even though it's in tens, we'll still know roughly where we are. Run that again: getting page 0, 10, 20, 30, and we can see we've returned a nice DataFrame with that information, and if I go into the explorer, we've got a jobs.csv file with the title, company, salary, and summary. This is pretty cool, and you could do a lot more with it. If I was doing this for myself, perhaps I'd scrape the jobs once a day and apply some machine learning, or look at the words in the summaries; maybe you're looking for something really specific and you could try to find
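The stepped range and the pandas export can be shown together. The mock rows stand in for the scraped joblist; index=False is a small addition that drops pandas' row-index column from the CSV:

```python
import pandas as pd

# The step argument walks Indeed's result offsets: four pages of 15 results.
offsets = list(range(0, 40, 10))
print(offsets)  # [0, 10, 20, 30]

# Mock rows in place of the scraped data, just to show the DataFrame/CSV step.
joblist = [
    {"title": "Python Developer", "company": "Acme",
     "salary": "", "summary": "Scrape things"},
    {"title": "Data Engineer", "company": "Widgets",
     "salary": "£40,000 a year", "summary": "Pipelines"},
]
df = pd.DataFrame(joblist)
print(df.head())
df.to_csv("jobs.csv", index=False)  # writes title,company,salary,summary columns
```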
You could find that information and notify yourself, or maybe just see what jobs are available. So that's it, guys. Hopefully you've enjoyed this one: nice and easy web scraping with requests and BeautifulSoup. Again, make sure you use some nice functions so you can keep track of everything and your code stays nice and neat. Thanks a lot for watching; consider subscribing for more web scraping content, there's loads on my channel already and more to come. Main videos Sundays, extra videos on Wednesdays if I can, and some live streams coming up too, so keep your eyes peeled for that. Thank you, goodbye.
Info
Channel: John Watson Rooney
Views: 18,257
Rating: 4.9616613 out of 5
Keywords: web scrape indeed, web scrape jobs, scrape job posts, scrape job listings, python web scraping, web scraping with python
Id: PPcgtx0sI2E
Length: 20min 54sec (1254 seconds)
Published: Sun Sep 20 2020