Build your own Yellow Pages web scraper with Python

Captions
In this video I'm going to show you how you can extract company and business information from Yell.com, the Yellow Pages site. Hi everyone, welcome, my name is John, and let's get going.

The first thing I always like to do is have a look at the website and go to View Page Source; that will tell us whether or not we can scrape it with BeautifulSoup. One of the easiest ways to check, I find, is to copy some specific piece of information from the site, then go to the source code and search for it. If you can find it sitting in the HTML like this, you'll be able to use BeautifulSoup to get it out.

The next thing I'm going to do is copy the URL, and we're going to head over to VS Code. We're going to need two libraries for this: requests, so let's import requests, and also BeautifulSoup 4, so from bs4 import BeautifulSoup. If you don't have either of these installed you can install them with pip. Once you've got those installed we're going to set our URL and paste in what we copied. I'm also going to use a custom user agent for this, so I'm going to set some headers with "User-Agent" as the key and a proper user agent string as the value, just so that's done nicely. If you don't know what that is, it's basically just a text string that gets sent along with your request and helps the server identify you. If you're not sure what to use, just type "my user agent" into Chrome and it will tell you; copy that out and paste it in.

The first thing we want to do is make our request to the server, so we do r = requests.get() with our url and headers=headers, which sends the headers we specified along with the request. Then we do soup = BeautifulSoup() on r.content with the html.parser. There are other parsers you can use and it doesn't really matter; sometimes I switch, this is just the one I'm using at the moment. Now if we print, say, soup.title, hopefully we get some information back from the page, and we can see that we've got the title back here, so that means everything is working.
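Here is a minimal sketch of that setup. The search URL and the User-Agent string below are placeholders rather than values from the video; paste in the Yell.com search URL you copied and the user agent string Chrome shows you.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder values: replace with the search URL you copied from Yell.com
# and the user agent string Chrome reports when you search "my user agent".
url = "https://www.yell.com/ucs/UcsSearchAction.do?keywords=coffee&location=glasgow"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

# Request the page, sending our custom headers, and parse the HTML
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# Quick sanity check that the request and the parser are working
print(soup.title)
```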
So now we can go back to our page and start using the inspect element tool to find where the information we're after actually lives. Looking at one listing to start with, we can see we have the name, an address, a rating and the number of ratings. There are two buttons as well, one for the website and one to call. The website one does have the link to the business's website, so we can get that. The call one is just a button that pops the number up like this, but when we look through the inspect element tool the number is actually still there; it's loaded, it's just hidden behind a click, so we can get that information as well.

Let's start with the name. We'll open the inspector, make it a bit bigger so we can see, click on the little select-element tool, and hover over the whole listing. We can see we've got all of these article-style blocks, and every time I hover over one it highlights the next one in blue, so we're selecting the right thing. The problem is that although they all start with the same col-sm class, which we could use, it's a really, really long class name. If we go in one level further we hit a div with a row businessCapsule class, and that is identical for each one of those listings, so I'm going to use that to start with.

I'm going to copy that div class, go back to VS Code, and say articles = soup.find_all() with the div tag and class_ equal to what we just copied. All I'm doing is using find_all to find every instance of a div whose class matches this in our soup. If we print the length of articles, because find_all returns a list, we should get a number back: 25. So it looks like we're getting 25 articles per page.

The next thing we need to do is find where the information sits inside each of those. Now that we know we're inside this div, we can click on the title and see that underneath the h2 the name of the business is in a span with a businessCapsule name class. We'll copy that and paste it into VS Code. Actually, what we'll do first is start our for loop, because we want to loop through every single article, every single business, on this page: for item in articles. Then name = item.find() with the span tag and that class, and we want the .text. All we're doing here is this: because find_all returns a list, we loop through that list, and for every item in it we find the span with this class and take its text. If I print that out now we should get all the names appearing in the terminal, which we do, and there are 25 of them. So that's working.
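Continuing from the soup object above, here is a sketch of that step. The class names are the ones read off the inspector in the video and are assumptions here; verify them against the live page before relying on them, since they may have changed.

```python
# Class names as shown in the inspector at the time of the video; check them
# against the current page, since Yell may have changed the markup.
articles = soup.find_all("div", class_="row businessCapsule--mainRow")
print(len(articles))  # 25 listings per results page in the video

for item in articles:
    # The business name sits in a span inside the listing's h2
    name = item.find("span", class_="businessCapsule--name").text
    print(name)
```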
That's great, so now we just need to go and get the rest of the information. Any information on this page you can get; previously I've done things like the link, the name, the address, all that sort of stuff. I'm just going to focus on a few main ones for now so you grasp the concept, and then if you actually wanted to do this yourself you could go ahead and grab all of the other information as well.

What I'm going to do is get the address. If we hover over it we can see it in bits, and looking down here it sits in spans with itemprop="streetAddress" and so on. That brings up three different parts, but they're all inside one span with itemprop="address", so I'm going to target that one specifically, strip the whitespace and then remove the extra line breaks; you'll see what I mean in a minute. Back in our code I say address = item.find() with the span tag again. Now we need to do something a little different here: for classes we can use the class_ keyword with an underscore, but for itemprop we'll use the dictionary approach, so we open curly brackets and pass itemprop as the key and address as the value. If I print address just like that, we get one, two, three bits of information back inside the span tag. We can ask for the .text of that, which takes away the span tags, but you can see it gives us a load of whitespace and puts everything on separate lines. The way around that is to call .strip() to remove the whitespace and then .replace() to replace the newlines, which in Python is a backslash n, with nothing at all. If we print this again we get all the addresses on one line, nicely separated by the commas that were in the HTML anyway.

Now that we've got the address, let's go and get the website. If we come back to the page and hover over the website button, we can see it's an a tag for a link, and it gives us the href, which is the business website. Some of them don't have a website, but there's something in there anyway, and we can just filter those out when we don't need that information. So this is the a tag here: it has a class of button yellow business and so on. Let's copy that and head over to our code. Again it's item.find(), because we're looking for one of these, and it was an a tag, and again we use the same class_ approach we did on the other one and paste that in. At the end of this line, if we pass in "href" in square brackets, that's going to pull the actual href attribute out of the tag; if we wanted any of the other attributes we could put those in instead, but none of the others are any use to us, we just want the href. If I print website, save and run, we should get the website back for every single one. We did, but then we hit "'NoneType' object is not subscriptable". What that means is that whichever listing comes after this one doesn't have a website. To get around that we can simply use a try and except: we try to grab the href, and if there's nothing there, in the except we just set website equal to an empty string so it gives us a blank entry. Let's run that, and we can see we've got them all. It looks like we've got some duplicates, though; I wonder if that's because we actually have multiple... oh yes, multiple branches of Greggs and Pret, lots of them. So it's bringing up repeated data, but that's because there genuinely are multiple branches. So now we've got that done.
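Here is a sketch of the address and website extraction inside the same loop, with the try/except fallback for listings that have no website link. The a-tag class name is abbreviated in the video, so the one below is an assumption; copy the full class from your own inspector.

```python
for item in articles:
    # Address: the parts all live inside one span with itemprop="address";
    # strip the surrounding whitespace and drop the newlines the HTML leaves
    address = item.find("span", {"itemprop": "address"}).text.strip().replace("\n", "")

    # Website: not every listing has one. A missing link makes the indexing
    # fail with "'NoneType' object is not subscriptable" (a TypeError),
    # so fall back to an empty string.
    try:
        # Assumed class name, read off the inspector in the video
        website = item.find("a", class_="businessCapsule--ctaItem")["href"]
    except (TypeError, KeyError):
        website = ""

    print(address, website)
```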
Let's go ahead and get the telephone number. We looked at this earlier: it only shows up when you click the call button, but if we click it and then hover over the number in the inspector, it's there and it's available. To double-check, if we copy this class, go back to the page source and search for it, it is indeed there in a span; it's just hidden from view until you click the button. That's quite lazy. What they should probably do to stop this is load it with an actual event rather than just hiding it, but that's not a problem; it works for us just fine.

So now we know it's a span with a business telephone number class, we can copy that, go back to our code, get rid of our print, and do tel = item.find() with the span tag and class_ equal to that class, and we'll take the .text of that as well. Let's try now to print the name, address, website and telephone number for each of the businesses on this page. Great, it did work, we've managed to get some of it, but we've hit a "'NoneType' object has no attribute" error again, which basically means one of the listings has no data there either. So we have to do the same thing again for the telephone number: try to find it and return it, otherwise set tel to an empty string. You could use a pass here, but because we're about to put this into a dictionary we want something in that value.

Now that we've got that, we can build our dictionary: business = { } with a little more space to work, and we put in all this information, name equal to our name variable, then address, website and tel for the telephone. I think that was everything: name, address, website, telephone, yep. Then I'm going to print the dictionary we've just created for each one, and when we run it we should have a nice Python dictionary with all the information for each listing on the page. Having a quick look, we can see the name, the address, the website (this one is just filler data, but that's okay) and a telephone number. I'm noticing just now that there's some extra whitespace on the telephone numbers, so I'm going to put .strip() on the end of that as well.

So that's the main crux of it, getting the information out. If you wanted to get the ratings and so on, it's the same idea: do item.find(), check the HTML tag, and if some listings have ratings and some don't, which they do, make sure you use a try and except and put placeholder data in afterwards; otherwise, when you try to add it to your dictionary it will fail and you won't get anywhere.
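A sketch of the telephone lookup with the same try/except fallback. As before, the span's class name is an assumption read off the inspector in the video, so check it against the live page.

```python
for item in articles:
    # Telephone: the number is in the HTML but hidden until the call button
    # is clicked; some listings have none, so default to an empty string
    try:
        tel = item.find("span", class_="business--telephoneNumber").text.strip()
    except AttributeError:
        tel = ""
    print(tel)
```

The name, address, website and tel then go into one dictionary per listing; the full version appears in the transform function sketched after the next step.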
So that's how we did it for one page. If we want to do multiple pages, we need to see how the website deals with them. If we scroll down to the pagination and click on the next page, we can see the URL changes: it adds a page number parameter on the end. So I'm just going to copy the end of that URL and add it to the end of ours; it's quite hard to see because it's at the end of the line, but all we've done is add that onto the end of the page URL. That will still work, but we want to tidy things up now, because pagination is much easier when everything is in a nice function. So I'm going to split my code into three main parts, and then it will be nice and easy to manage, test, and run over all the pages.

The first one I'm going to define is a function called extract, and it needs a URL, so I'll indent the request part and the soup part into it. Instead of assigning the soup to a variable at the end, we just want to return it. The next part is transform, and it needs a variable, so let's call it articles, because that's what we called it before; again we indent all of the parsing into it, and instead of printing each business we append it to a list with main_list.append(business). I'm going to create that blank list at the top of the file so we can add to it. That's us building our main list with each business in it.

Now that we know this works, we can collapse some of these down so we have a bit more space, and call our functions. We do extract, and remember we need to give it a URL, so I'm going to copy that out and delete the old url line, since we're putting it in at the top and don't need it there anymore. Because the page number changes at the end of the URL each time, we want to turn this into an f-string: I get rid of the hard-coded number, put in curly brackets with an x inside, and put an f in front of the string. What that means is that every time we loop, whatever value x has gets substituted into the end of the URL, and we're going to use x as the page number. Then we call transform, and we need a variable for it, so we might as well call it articles again: everything that comes out of extract gets saved into the articles variable and passed into transform.

Now if we print main_list, and just as a quick test set x equal to 1 (you wouldn't keep this, I'm just testing), we should get our whole list back. That's not quite how I named main_list, is it; I did it with an underscore. With that fixed you can see we've now got a list of dictionaries with all of the information we were just looking at. This is still the first page; if I change x to 2 (the first result before was the Honeybee Bakery), we should get a different one at the top, and we do: another Caffè Nero. There seem to be lots of those in Glasgow, or anywhere probably. At the bottom of our transform function we need to make sure we add a return, which I've done here.

Now I'm going to write a for loop to deal with the pagination: for x in range(), which means every time it loops, x becomes the next number in the range. I'm going to go from page 1 up to and including 8, so one to nine, because I know this website has eight pages of results for this search. This is the easy way to do it; if you were doing this properly you'd want to see what happens when you go past the last page and add some real error handling, but for this example I'm just going to use range because I know it will work. So for x in range(1, 9), because there are eight pages and range goes up to but not including nine, and then we extract and transform each page.
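Here is roughly what that refactor looks like, assuming the class names from the earlier sketches; the base URL and the pageNum parameter name are also assumptions, so use whatever your own browser shows after clicking to page two.

```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
main_list = []  # every business dictionary from every page ends up here


def extract(url):
    """Request one results page and return the parsed soup."""
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, "html.parser")


def transform(soup):
    """Find every listing on the page and append its details to main_list."""
    articles = soup.find_all("div", class_="row businessCapsule--mainRow")
    for item in articles:
        name = item.find("span", class_="businessCapsule--name").text
        address = item.find("span", {"itemprop": "address"}).text.strip().replace("\n", "")
        try:
            website = item.find("a", class_="businessCapsule--ctaItem")["href"]
        except (TypeError, KeyError):
            website = ""
        try:
            tel = item.find("span", class_="business--telephoneNumber").text.strip()
        except AttributeError:
            tel = ""
        main_list.append({"name": name, "address": address, "website": website, "telephone": tel})
    return  # note: the return sits at function level, not inside the for loop


# Assumed URL pattern, with the page number substituted in via an f-string
for x in range(1, 9):  # eight pages of results for this search
    soup = extract(f"https://www.yell.com/ucs/UcsSearchAction.do?keywords=coffee&location=glasgow&pageNum={x}")
    transform(soup)

print(len(main_list))
```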
Just to do a quick test, I'm going to change the range to one to two for now, so only one page, and print the length of the list to see how many results we get back. We got one. Why did we only get one? I've just fixed a small error I had in my function: the return was indented one level too far, which meant it went through the first result and then returned, so we need to make sure the return is at the right indentation. Now that's fixed, if we run the loop again we get back 25, because we only did one page (one up to but not including two). If we change it to three we should get 50. Excellent, so we know that's working, and I'm going to put it back up to nine because we know there are eight pages.

Next I'm going to create a new function to export this to CSV. To do that I pretty much always use pandas, so I'll import pandas as pd and save that. Then we collapse things down and create a new function called load; it doesn't need anything passed in. We do df, for DataFrame, equals pd.DataFrame (with the capitals) and give it our main_list; that just loads our main list into a DataFrame and makes it nice and tidy. Then we do df.to_csv and call the file coffee_shops_glasgow.csv (that's a really long name), with index equal to False. When you create a DataFrame it gets its own zero-based index, which we don't need because we just want to start at the first thing we've got, so index=False removes the standard DataFrame index when we export.

Now that we know that works, after and outside our loop (we want to loop through every single page and store it all in main_list first) we call our load function, which puts everything into a DataFrame and exports it to CSV, and then I'll add a print statement, just "Saved to CSV", so we know something happened. I'm also going to make a couple of adjustments inside my loop: I always like to have a print statement so I know where it's at, so I'll print "Getting page" with another f-string so we see the value of x, and then I'll do time.sleep with a five-second sleep on each page, because if you hit the server too fast it will block your IP. To use time.sleep we just need to import time at the top. Okay, I'm just going to run this now and we'll see the result at the end.
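A sketch of that last piece, assuming the extract and transform functions and main_list from the previous sketch are already defined above it; the output file name is the one mentioned in the video.

```python
import time
import pandas as pd


def load():
    """Dump everything collected in main_list to a CSV file."""
    df = pd.DataFrame(main_list)
    # index=False drops pandas' own zero-based row index from the output
    df.to_csv("coffee_shops_glasgow.csv", index=False)


for x in range(1, 9):
    print(f"Getting page {x}")
    soup = extract(f"https://www.yell.com/ucs/UcsSearchAction.do?keywords=coffee&location=glasgow&pageNum={x}")
    transform(soup)
    time.sleep(5)  # be polite: hitting the server too fast risks an IP block

load()
print("Saved to CSV")
```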
Okay, that's all finished, and if I open up the CSV file we can see we've got all that information exported: there are 192 entries, each with a name, an address, a website if there was one, and a telephone number if there was one. This could easily be adapted to pull any of the other fields in the business information, or to do more pages, and you could adapt it to take user input for the search, because if we look at the search URL there are keywords in there, so we could manipulate those and the location and build up quite a nice data set.

So that's it, guys; hopefully you found this useful: more web scraping with BeautifulSoup and requests, with some cool bits in here for getting out useful information. If you've got any questions, let me know, and don't forget to subscribe; there's more web scraping content on my channel already and more to come. I'm also starting a weekly live stream on YouTube where I take questions and try to do some live coding and scraping, so hopefully you'll see the notifications when they pop up. Do join us. Thank you very much, see you in the next one. Bye.
Info
Channel: John Watson Rooney
Views: 3,978
Rating: 4.9736843 out of 5
Keywords: web scraping with python, python web scraper, yellow pages web scraper, yellow pages extractor, yellow pages download excel, yellow pages data scraping, yellow pages data, yellow pages data extractor
Id: PhJFg1THF9E
Length: 25min 9sec (1509 seconds)
Published: Sun Sep 13 2020