Web Scraping to CSV file using BeautifulSoup | Beginner Tutorial

Video Statistics and Information

Captions
Hello everyone, and welcome to today's tutorial. Today we are going to scrape a website for some information, such as a Python or developer position, the name of the company, and the salary for that position. So this is what we're going to make with Beautiful Soup. I'm going to scrape indeed.nl, so for the Netherlands, and I'm going to look for developer jobs in Amsterdam. And this is the information that I got: position, full stack developer; the name of the company, Jump Works; and salary, 65 to 85,000 euros per year. Well, that's in Dutch, but yeah. The next one: position, front end developer, the name of the company, and the salary. Next one: position, developer Python Azure financial, the company name, and the salary per month. And when it finds a position without a salary, it says: position, Django developer; company name, Making Media; and salary, not mentioned. Oh, this is a nice salary, huh? Okay, so this is what we are going to make today: scrape indeed.nl for developer positions in the Netherlands.

Okay, so I'm on this web page on indeed.nl, and this is my query: Python developer, where: in Amsterdam. I'm going to scrape this web page for information about the position, like algorithm developer, the name of the company, such as Optics, and then, if there is a salary mentioned somewhere, I also want to grab that information. You can see we don't have any salary mentioned for these positions, and here we have some, and here we also have some. So this is what we are going to scrape.

Now, the first step in scraping a website, before getting to coding, is to know the HTML structure of the web page. This web page is basically HTML with some styling in CSS and some JavaScript. In order to see all the HTML code, you just right-click somewhere and you can view the source code, which is here. So this is the HTML document type declaration, the language, let's see, the head, this is a JavaScript part here, then you go down below until the CSS files here and all
that stuff, and then you see this is all JavaScript. Let's go until we get to, yeah, here the HTML actually starts. So this is the HTML, like for tables and for divs and that sort of stuff. This is the page that we are going to grab, and then we are going to look for positions, company names, and salaries.

Now, another way to look for this information: this is basically a card with some information in it, right? Like the title, the name of the company, and for some there is also a salary. And this kind of structure repeats. So we need to go to one of these structures, right-click, and choose Inspect, and now here we can get access to this part. So what is the name of this wrapper, this container? If I just hover over them, you see, I got it: now it's highlighted when I go over it. So this is the place where I should look for this information. Now, this is a div, and this is the class of the div. It has several classes: row, result, clickcard, and this one and this. I think this is the class which is specific to this kind of job card.

Now let's go back to our code editor and see how we can grab this. Before doing anything, we need to install Beautiful Soup first. So you simply type pip install beautifulsoup4 (there should be a space after install) and press Enter. Because I've already installed it, it says requirement already satisfied. After you've installed Beautiful Soup, you also need to pip install requests, because we are going to send requests to that URL. Oh, okay, I just had a typo here in install. So install requests, and I've already installed that as well. So up there: import requests, and also from bs4 import BeautifulSoup. Oh my god, I can't spell things. What happened to requests? I just deleted that. Okay, so again: import requests, good, and then from bs4 import BeautifulSoup. Okay, now that we have these two
we need to grab the URL of the page we are going to scrape. So this is the URL of the page; let's grab it, go back, and put it in a variable. Let's call it url, double quotes, and paste it in. So we have the URL we're going to scrape inside this variable.

The next thing we do is to grab the page content and put it inside a soup object. So we need another variable, and we need to send a request to this URL and get a response: page equals requests.get, and you saw in the autocompletion that it takes a url, so here I put the url that I just defined. Now I have requested this page. Let's see what the status code is, to see whether we have made a successful request and got a response or not. If we print out the page, we should get some number, either 200, 300, 400, something. Let's print this out and check. So we get a response 200, and that's good news: 200 means that it has been a successful response. In order to check these numbers, we can go back, and here, right under Network, we have all these status codes. See, 200 OK. So this means this is a successful request and response. Okay, so now we've been successful at establishing this connection.

So we have the page; now we need to make a soup object out of the class. We instantiate this object, soup (you can call it whatever you like), and then we say BeautifulSoup. BeautifulSoup takes two arguments. One is the contents of the page, so we say page.content. The content attribute actually comes with the response from requests, and it means that we need the contents of the page that we just requested, that is, whatever is inside that page: the JavaScript, the CSS, the HTML, whatever. So we get that content, and the other parameter is html.parser, because we are asking Beautiful Soup to parse this
page using the HTML parser. Okay, now we have this object, soup, with all this information: it has been parsed according to HTML tags and attributes and stuff, and all the content is there. So if I print out the soup, you would see all the contents of that page that you just saw here. All this is what we will get.

But we are not interested in all that stuff. What we're interested in is to go to our page, like this one, grab this card, and inside the card grab the title. What is the title? Let's go for the title. You see, algorithm developer. So we are interested in, ah, these bikers. Yeah, okay. Annoying bikers. I just forgot what I was trying to say. Yeah, so it's inside this title: inside this div we are looking for an a tag, and inside that a tag we actually have algorithm developer, which is the position, the job title. You see, we have it here. Now we want to target this one. So first we need to target, where was that, yeah, this class first, which is all of it, and then we say: inside this, look for an a tag. Let's just do it.

Okay, so let's save these results in a results variable and say that from soup we want to find every div. Remember, this thing here, all of it, is a div with this class. So we want to grab all of it, and then we say: inside this, look for something else, look for the position, look for the company and stuff. And this is the class that I want to copy. Okay, so we say soup.find_all, so find all the divs which have the class. And remember, there should be an underscore after class, because class itself is a Python keyword, so we need to make a distinction here. So class_ equals, and what was the name of the class? jobsearch-SerpJobCard. So inside all the divs on this page, in soup, look for the divs which have this class. Okay, now that we
have this, now that we have all of these cards, so we have this one and this one, all of them have the same class name and all of them are divs. So we have all these; now we are going to loop through them one by one. These are all results: this is one result, this is another result. So we say: for result in results, find the position, find the company, find the salary. This is how we do that. So for result in results: for every result that there is with this kind of div and class, we want to find a title, the position; we want to find the company, the name of the company; and we also want to find the salary.

Now, how can we find the title inside this div? Let's see if we can find it. First of all, I want to print out results so that we have it down here; it's easier. So print results, and let's comment this out. Okay, now we have all those cards here inside a list. There is one div, for example: it starts here, then it goes all the way and finishes somewhere, let's say, I don't know, maybe here, and then the second card starts, and then the third card starts. So we're going through each one by one. In the first one we're looking for the title. So we have this title here, full stack developer, nice. Full stack developer is inside an a tag here, and this a tag has a class of jobtitle. Cool, so we can directly target this: we can say that for a tags with the class jobtitle, we want whatever is inside. So that's for the title.

Oops, let's comment this. Okay, so title is: look inside result, so this is one result, and inside the result, find. Find, not find_all, because find_all will find every instance of this tag, but we want only the first one, because there is only one position per card. So result.find, and then find what? An a tag. And what is the class? The class is jobtitle. Is it spelled like that? Oops, yeah, jobtitle. Okay, so now we have access to the title. Let's go and find
the same for the company. What is, oh, company is here. Cool, easy. So it's a span tag with the class company, and that's the name of it. So span, company: it's not an a tag anymore, it's a span tag, and the class is company. Let's copy and paste again: result.find. Now salary. Salary is here, awesome. So what is it? It's inside a span tag again, with the salaryText class. Okay, so it's a span tag and salaryText. So now we have access to these: title, company, and salary.

The next thing we should do is print them out. We could also print them out right here, but I think it's neater this way. So print, and let's put an f-string: let's say Position, and the position is going to be this variable, title, right? But it's not the title itself, because remember, title is this a tag, where was it, yeah, so this is the title. We don't want all of it; we just want the text inside it. So we say title.text, that is, we want the text inside this tag, not all of the HTML tag. And then let's print something else: let's print the company, so this is Company name, and then we have company, again .text, the text inside. And the last one is Salary, with salary.text.

But remember, not all these positions have mentioned the salary; some of them have not mentioned anything about a salary. So if we ask to print it, there will be an error. That's why we need to say if there is a salary: if salary, then print this; if not, that is else, let's print something like Salary: not mentioned. Okay, let's just print this out and see what we get in return. Wow, look at that information, we got all this text. Why do we, okay, let's just go again, run the code, and see. Okay, now you're talking. It's here, actually. So here we have it: position, full stack developer; company, Jump Works; salary, 65,000 a year. Next position and salary, next position and salary. Very cool, but they are so close to each
other, so maybe we need some space or some new lines. I think I've printed something else as well. Oh yeah, I don't need to print this out here anyway. Okay, so after salary maybe we should have a new line, or actually maybe two. So what we need to do is add a comma, and then end should be a line break, actually two new lines. So let's have two new lines instead of one, whatever the case is: whether the salary is there or not, I need to have two breaks after salary anyway. And then this one. Okay, let's print this out and then check. There is something wrong here, let's check. I have no idea what's wrong, so let's see. Oh, okay, actually nothing was wrong. So let's go back and check: position, Python developer; company, Kabisa; salary, not mentioned. Position, team lead back-end developer, company, salary. Okay, you see, we have all this cool information, and we have just scraped indeed.nl. Well, not all of it, but yeah.

So these are the basics of web scraping. I hope you did not expect me to go through all the documentation; we only touched on some basic concepts in web scraping using Beautiful Soup. Thank you so much for listening and watching.

Okay, so I almost forgot that I also wanted to export this information to a CSV file, by the way. So let's start from where we left off. This was what we got, and now we are going to make a CSV file. CSV stands for comma-separated values, so we're going to make such a file, which is actually more readable. In order to do that, we need from csv import writer, because we're going to write to a CSV file. And then, right above the for loop, we need to, what do we need to do? Yeah, open: with open. I'm going to open a file which doesn't exist yet. Let's just call it jobs.csv. Oh, go away, please. jobs.csv. And then we need to specify a mode for it, so
writing mode. There's also r mode, which is read-only, so we cannot write anything to the file, but we need to use w, which is writing mode. Then we need to specify newline, which is going to be just an empty string. And the last one is encoding, which is utf-8. Sometimes this one is not necessary, but I've tried a couple of times, and for me it should be there, otherwise I would have an empty CSV file. So we're going to open this file as, let's give it a name here as well, just f.

And here we need to use the writer. Let's just call it the writer. You can name it x or whatever, but this is going to be a writer, so I just call it the writer, and the writer is going to be responsible for writing. So it's going to be writer, and the file is going to be f, the one that we just mentioned. Now let's also specify some headers, that is, what is going to be the head of the columns in our CSV or Excel sheet or that sort of file. Our first column should have this Title header, the second one should be Company, and the last one should be Salary.

Okay, now we need to write these into a row, the first row of our CSV file. So we say the writer, which is supposed to write stuff to this f, which is jobs.csv, .writerow, that is one row, and we pass it header. So now the writer, which is responsible for writing to f, which is jobs.csv, is going to write a row which is Title, Company, Salary, the first row in our table. Okay, so far so good.

Now we need to do the same with these values. We need to take our for loop inside this with block. And here we don't need these prints: we are not going to print anything in the terminal, so we should get rid of them. Now our title is not text anymore; it's just the HTML tag, all of it. We need to turn it into text here. Remember, here
we had converted them into text before, but now that we don't have those prints, let's just turn them into text right here. So let's just say .text, and the same, .text, and the same, .text. Okay, so now we have the text that is inside these tags.

Now the only issue is that whenever I get access to this title, there will also be a line break at the end of our line here, and we don't want that line break. So let's replace that line break, which is backslash n, with just an empty string, so we don't have that. Let's use it again here, and here.

Okay, so now we have these, but we still have this issue with salary. Remember that in some of those positions the salary was not mentioned, so if we try to write it there, we will have some errors. What we can do is create a variable called salary and give it an empty string, and change this one to salary1. And what I'm going to do is say if salary1, that is, if the salary exists, without .text here, because we're going to add that when we assign it to salary. So if salary1 exists, then salary, which was the empty one, should equal salary1 with the text of it, and the replace, the whole thing. And if it doesn't exist, that is else, we should say salary equals not mentioned, something like that. Okay, good.

Now we have everything in place, so what we can do is put all these variables inside a list. Let's just say job_info, and the list would have title, company, and salary. Okay, cool. Now we need to do just what we did here: we should write it. So the writer should now write a row, and the row should be job_info. And then this will go on and on until it gets to the bottom of the page. So I
think we are done. Yeah, let's just run this file and see what happens, whether we get any errors here. No errors here, and we also got our jobs.csv. Cool. Okay, so let's go to our jobs.csv; hopefully it will not be empty. Oh nice, look at this beautiful thing. Okay, so we have title: developer C++, Python web developer; company: Isens, Davinci, Makin; and salary. Yeah, this weird sign is actually the euro sign. Okay, so maybe we can work out how to change it later. But salary: this salary here is per year, this one is per month, well, that's in Dutch, also per month, and not mentioned for these. Awesome. So now you know how to scrape a web page, well, not a very complex one, and turn that information into a CSV file like this. Thank you for watching and listening.
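
The steps walked through in the captions above can be sketched as a short script. This is a minimal sketch under stated assumptions, not the presenter's exact code: the function names parse_jobs and write_csv and the SAMPLE_HTML snippet are hypothetical, the class names (jobsearch-SerpJobCard, jobtitle, company, salaryText) are reconstructed from the spoken captions, and Indeed's markup may well have changed since the video was published. The sample HTML stands in for page.content so the example runs without a network request, and the CSV is written to an in-memory buffer rather than to jobs.csv.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the page fetched in the video with
#   page = requests.get(url); html = page.content
# The tag and class names below are assumptions taken from the captions.
SAMPLE_HTML = """
<div class="jobsearch-SerpJobCard row result">
  <a class="jobtitle">
Full Stack Developer</a>
  <span class="company">
Jump Works</span>
  <span class="salaryText">
65.000 - 85.000 per jaar</span>
</div>
<div class="jobsearch-SerpJobCard row result">
  <a class="jobtitle">Django Developer</a>
  <span class="company">Making Media</span>
</div>
"""


def clean(tag):
    """Return the tag's text without line breaks or surrounding spaces."""
    return tag.text.replace("\n", "").strip()


def parse_jobs(html):
    """Return one [title, company, salary] row per job card in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # class_ (with underscore) because class is a Python keyword.
    for result in soup.find_all("div", class_="jobsearch-SerpJobCard"):
        title = clean(result.find("a", class_="jobtitle"))
        company = clean(result.find("span", class_="company"))
        # find() returns None when a card has no salary element.
        salary_tag = result.find("span", class_="salaryText")
        salary = clean(salary_tag) if salary_tag else "not mentioned"
        rows.append([title, company, salary])
    return rows


def write_csv(rows, f):
    """Write a header row followed by the job rows to the open file f."""
    the_writer = csv.writer(f)
    the_writer.writerow(["Title", "Company", "Salary"])
    the_writer.writerows(rows)


rows = parse_jobs(SAMPLE_HTML)

# In-memory buffer so the sketch leaves no file behind; the video uses
#   open("jobs.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO(newline="")
write_csv(rows, buffer)
print(buffer.getvalue())
```

To reproduce the video's workflow, the buffer would be swapped for the with open(...) call shown in the comment, and SAMPLE_HTML for the content of a real response. Checking for a missing salary tag before reading its text is what avoids the error the presenter mentions for cards without a salary.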
Info
Channel: Pythonology
Views: 33,516
Keywords: Beautifulsoup, webscraping, beginner webscraping, beautifulsoup tutorial, webscraping tutorial
Id: Ql8Na3astdQ
Length: 31min 4sec (1864 seconds)
Published: Wed Jun 02 2021