Web Scraping with Python: Ecommerce Product Pages. In Depth including troubleshooting

Video Statistics and Information

Captions
Hi everyone, welcome. John here. In today's video we're going to be doing a bit more web scraping: we're going to scrape an e-commerce site, an online shop, going into each individual product page and getting information from there.

This is the website we're going to be having a look at, an online shop that sells whisky. I've gone to a subsection here, Japanese whisky. If we look, there are quite a lot of products here, and if you wanted to find out what the user ratings were, they're not on the front page; they're actually inside each of the products. So what we're going to do is get a list of all of the links for every single product, all five pages of them, all 87 of them, and then go into each product individually and try to get out the rating. As you can see here, some of them don't have a rating. So we're going to go into each product, try to get the rating, and maybe get the about text as well.

There are a few things we need to do first. Go to our text editor: we're going to import requests, we're going to need that, and then from bs4 import BeautifulSoup, as we would normally. I'm going to set the base URL of the main page first, because we're going to need that when we construct our URLs for each of the individual products.

Then I'm going to get a user agent that I can send with every single request. By default, when you use requests to send off to a page, the user agent identifies itself as python-requests, and I think that can get blocked more often, so we're going to override that. I've picked this one from this website; it's a common one, and it says that we're on Windows and using Chrome, so that will do. We'll put that in as headers equals, then add a little dictionary: the key is User-Agent, and we'll set that to this long string that
we've got here. That should do us nicely.

Now we need to investigate the page a bit more so we can work out where the links are and how we're going to get them out. I always do that with Inspect Element. When we go to Inspect Element, whatever is highlighted in blue shows us what we're selecting. Here everything is highlighted, so I need to go down a level; let me make this a bit bigger so we can see. We need to go down another level, products grid, down one. Okay, so this one is highlighting them all, and now we can see we've got divs with class equal to item, and as I move down each of these it highlights each individual product, so now we know that we're in the right place.

If we go into the first one, we can see we've got the a tag for the product, and here's the second part of the URL that we're going to need, the href. If we open that up a bit more we get some more information: we could get the name, there's the name, and we could get the price, but it doesn't have the rating, which is what we're actually after, or the description. So to get around that, we're going to write our script to go through each one of these and create a URL for us.

To do that we need to go back to our editor and do r equals requests.get, and we need to put in the URL that we're getting; I'm going to put this straight in for now, just like that. Then we want soup equals BeautifulSoup of r.content, to get everything from the page, and I'm going to use the lxml parser on this, so let's put this together. Then we're going to do product list equals soup.find_all, and what have we got here? It's a div with a class of item, so I do div and class, with the underscore, equals item. It's class underscore because this is obviously not a Python class, and class on its own is a reserved word. So let's print the product list and just check that we're on the right track.
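A minimal sketch of the setup described so far. The user agent string is only an illustrative Chrome-on-Windows value, not necessarily the one picked in the video, and the helper names `fetch_html` and `find_product_cards` are made up for this example. It also assumes the standard library `html.parser` backend instead of lxml, so nothing beyond requests and bs4 needs installing.

```python
import requests
from bs4 import BeautifulSoup

# Illustrative browser-like user agent (an assumption, not the video's exact string).
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"
    )
}


def fetch_html(url):
    """Download a page, sending the custom User-Agent instead of the default one."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.content


def find_product_cards(html):
    """Return every <div class="item"> element, one per product on the listing page."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ has a trailing underscore because "class" is a reserved word in Python.
    return soup.find_all("div", class_="item")
```

Calling `find_product_cards(fetch_html(listing_url))` on a listing page should then give one element per product, which is the list printed at this point in the video.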
Yes, there we go. We can see that we're getting the HTML for the items on this page. Inside each of these is the a tag, which has got the link to the product that we're after. So I'm going to quickly create a blank list, and I'm going to call it product links. Now I'm going to get rid of the print, and let's do for item in product list: we want item.find_all, put in a because we're looking for the a tag, and to get out the href we need to pass in href equals True. We're going to put this into another little loop, so I'm going to do for link in item.find_all, there we go, and then we're going to print out the link, with href in square brackets like this, which basically accesses that attribute of the tag for us. So hopefully that works. Yep, there we go, we've got each one of the hrefs.

What this is doing is looking inside every div with class item, which is each individual product, that's this line here, and then going in there again to look for the a tag, and looping through to get out the href. So what we can do now is concatenate our base URL with this second part of the URL to get a complete URL that we can then use and save to our list. So we do product links dot append, and put in base URL plus link href, like we just had.

Now if we come out of that and print, let's just do the length of the product links list. It's 20, so there are actually 20 per page. I experimented with this earlier, and although you can make it 200 per page, that seems to be some kind of JavaScript filter or something, so it doesn't actually translate to requests by changing the URL. So we need to make sure we loop through all five pages. To do that I'm going to put this whole thing into another loop: for x in range, and we know there are five pages, so I'm going to go from 1 to 6, and then we're going to
indent this whole thing. We'll leave this down here. Now, I think this URL is going to change when we go to the next page, so if we click next page, you can see that the URL has changed and it's put in this little pg equals 2. I'll copy and paste that over here so we can see it a bit better. So you can see that there is a page identifier in the URL.

What I'm going to do is use f-strings. F-strings let us put our number into the middle of the string at the point that we choose. So instead of the 2, I'm going to put the two little curly brackets like that, and then put an f in front, and that turns it into an f-string. Inside the curly brackets I'm going to pass in x, which is going to be our number. What that means is that every time it goes through the loop it's going to put x into this string, which will be 1, 2, 3, 4 or 5, because the range goes up to but doesn't include 6. So I think that will get us all of the links; we should get 87.

Let's see. It's only got seven. That's because we've got the product links list inside our loop, so it's getting overwritten every time. So let's move that out, and now we've gone through every page and got 87 results. We could check those by printing out the actual links themselves, and we can see that we have all of these links to all of these products.

What we're going to do with this list is loop through each one of these links, extract the product information from each one, and then save it into another list or dictionary. So we'll get rid of that print, we don't need that for now, and let's go ahead and see how the information is displayed on the product page. The first thing we're going to want is definitely the name, and we can see that it's under an h1 with a class.
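The link collection and paging steps above could be sketched roughly like this. `BASE_URL` and the listing path are placeholders (only the `pg=` parameter comes from the video), and `fetch` is a hypothetical callable that takes a URL and returns the page HTML, so the loop can be tried without hitting a real site. The parser is the standard library `html.parser` rather than lxml, which is an assumption made to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # placeholder, not the real shop


def extract_links(html):
    """Build full product URLs from the <a href> inside each <div class="item">."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("div", class_="item"):
        for link in item.find_all("a", href=True):
            # The href is only the second part of the URL, so prepend the base.
            links.append(BASE_URL + link["href"])
    return links


def collect_product_links(fetch, pages=5):
    """Walk every listing page and gather all product links into one list."""
    # The list is created once, outside the loop. Creating it inside the loop
    # resets it on every page, which is why only the last page's links survived.
    product_links = []
    for x in range(1, pages + 1):  # range stops before its upper bound
        # Hypothetical listing path; the f-string drops the page number into pg=.
        url = f"{BASE_URL}/c/japanese-whisky?pg={x}"
        product_links.extend(extract_links(fetch(url)))
    return product_links
```

With real pages, `fetch` could be something like `lambda url: requests.get(url, headers=headers).content`; checking `len(collect_product_links(fetch))` against the count shown on the site is a quick sanity check.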
Okay, so that's nice and easy. The next thing we want is the reviews, which come under the star rating, and one review here; that information might be quite useful. And then the price, yep, see that there. There's also this little in-stock thing here, a stock flag, so we might be able to get that information out as well; we'll see if we want that.

To work out how we're going to get each individual piece of information out, I'm going to use one link to start with. So we're going to take this and use it as our test link, so we don't loop through everything whilst we're working on it. We want r equals requests.get of test link, and let's put in our headers now, so headers equals headers. That's going to send off our fake user agent, which will hopefully stop us getting blocked so easily. Then we want soup equals BeautifulSoup again, r.content, with lxml as the HTML parser.

Now let's see if we can get the name out. Let's print out soup.find, and where was the name? Let's go back to that. We want the h1 and its class, so soup.find, h1, class underscore equals, and this class name right here. Let's see what we get. A syntax error; that's because I've missed a bracket. There we go. Okay, great, we've got it out. So we can put in dot text now, and we should just get the text of that. Cool. It looks like we've got some white space, so I'm going to put in a strip as well. There we go. This is a really useful thing, because quite often the HTML has white space around the data that you're actually after, so just add strip with the pair of brackets on the end, because it's a method. So there we go, that's the name. Let's put that in there as name, and get rid of that one. What else, what
else do we want? We wanted the number of reviews, so if we look here we've got a span with a review class. Actually, let's try and get the rating out first. Let's say rating equals soup.find; if we look here it's a span, so let's put span in there, and the class is equal to this, so let's copy that and put it in there. Does that work? Let me make this a little bit bigger and this a little bit smaller, so we've got a bit more to work with on this side. That seems to work, so let's print the rating out. Okay, great, the information is there, so hopefully we can add dot text and it'll get that out. Yeah, that worked, and then again dot strip. So now we've got the name and rating.

What else do we want? We want to get the price. If we go back to Inspect Element and hover over the price, we can see it's under a p with a product action class, so that's nice and easy. I've copied that, so price equals soup.find_all, p for the p tag, and the class we just copied, and again we'll do dot text, and because the other ones had strip we'll also do strip. Let's try printing that as well. Okay, we've got an error: we used find_all, which we didn't need; we wanted find. There we go. So now we've got the name, the star rating and the price.

The other thing that I did want to get was the number of reviews, because that could be quite important to us. It's just another span, so we copy this line, paste it here, call it reviews, and the class is the review overview count one instead, like that. I didn't print it. Right, nearly there. Oh great, so we've got it now. That's a bit more information: if it had three stars with 100 reviews we might not be interested in this one, but because it's only three stars with one review we might want to click on it and read more about it.

Okay, so now we want to make sure we get all this information and save
it. So I'm going to say whisky equals a new dictionary: name is name, rating is rating, reviews is reviews, and price is price. Okay, great. Let's just print that out to make sure we know where we are; we should get a nice little dictionary. There we go: name, rating, number of reviews. I've got some stray characters in there, and we'll leave that for now because we're just reading this ourselves for the moment, but we would probably want to tidy that up. And a price.

So how do we take this and turn it into something that can go through every single link? We need to change a few things. First we're going to remove this test link; I'm just going to comment it out for now, in case we want it again. Then we need our product links list, the one that's got all of the links in it. So we want for link in product links, which will loop through every single one, and we want to indent this, and change this to link to match. That should be fine.

Okay, let's have a go and see what we get. So we're failing to get the rating on the second one; let's have a look at that and see why. Yeah, that's exactly what I thought. You can see here that when it changes to a four-star rating, the class name actually changes to 40. So we need to find a way around that. What I might do instead is just get all of the text from the div around it, so that will help us where the ratings change. So we're going to change reviews here: we're going to comment that one out, change this one to div, put this class in here like that, review overview, get the text and strip it, and then comment out this as well. Let's see what happens now.

Okay, so now we've still managed to fail, but we've got to the second
or rather third one this time, and I think this could be the first one where we don't have ratings. Right, see, we don't have any reviews for this one, so we need to put in some kind of error handling.

The easiest way to do that: I'm going to move the price up to here, like this. I'm going to get rid of this line now because we're not going to use it, and get rid of this one. Then in here we're going to do try, so it's going to try to do that, and if it can't, we'll get an exception, and then we'll set rating equal to no rating. So what this does is go through and try to find this element; if it finds it, it carries on, and if it can't, we hit the except and set our rating to no rating. Okay, let's save that. We can see it's coming up, and we can just see there that we're getting the no rating. I'm going to stop that now because I know that it's working.

So instead of printing, I'm going to add this all to a DataFrame so we get all the information. That try and except is a really nice, easy, simple way to do some kind of basic error handling. We want to save the information we've got out somewhere we can view it easily, and for that I pretty much always use pandas; it's nice, easy and convenient. You could use other things as well, like the default Python CSV writer or something like that, but I just find pandas super useful for this. So we're going to import pandas as pd and save it into a DataFrame. At the bottom here, where we appended the item to our list, we need df, which is what we'll call our DataFrame, equals pd.DataFrame, remember the capitals, and then pass in the list like this. Then we're going to print df.head, which by default gives us the top five; I'm just going to show 15.
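The product page parsing with its basic error handling might look something like the sketch below. The class names `product-name`, `product-price` and `review-overview` are stand-ins, since the exact names are not legible in the captions, and `parse_product` is a hypothetical helper. It also catches `AttributeError` specifically (what you get when `find` returns `None` and `.text` is accessed on it), which is an assumption about the failure seen in the video rather than the exact except clause used there.

```python
from bs4 import BeautifulSoup


def parse_product(html):
    """Pull name, price and rating text out of one product page."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns a single tag; find_all() returns a list with no .text.
    name = soup.find("h1", class_="product-name").text.strip()
    price = soup.find("p", class_="product-price").text.strip()

    # Products without reviews have no rating block, so find() gives None
    # and None.text raises AttributeError.
    try:
        rating = soup.find("div", class_="review-overview").text.strip()
    except AttributeError:
        rating = "no rating"

    return {"name": name, "rating": rating, "price": price}
```

Appending each returned dictionary to a list gives exactly the kind of list of dictionaries that can be passed to `pd.DataFrame`, as described in the video.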
Under here I'm going to add in a print statement, because we've got no print statement, so we wouldn't be sure if anything was actually working; it's nice to see some output. So I'm going to print saving, and then the whisky name, so we can see it printing that out. Okay, so hopefully that all works; let's run that.

There we go. You can see it saving: it's going through each individual product page, getting out the information that we've asked for, and it's going to print it all nicely in our DataFrame for us. You could use this DataFrame and export it to Excel or CSV or something like that if you wanted to.

Great, so now we've got our list here, and this is our DataFrame. We have some extra line breaks in here which didn't come out with our strips, so we'd probably want to work out why and get rid of those as well, just so it's a bit neater. But we can see we have the name, the price, the star rating and the number of reviews. You could do this for any site like this, any web shop or something like that, and you could get out more information. If we get rid of Inspect Element, you can see all this other information here. Not on this one; let me go back to this product. Okay, so you've got the description, all sorts of information you could get out here. You could get whether it was in stock, that would be quite a useful one, and anything else like that.

So that's it, guys: we've successfully scraped information from product pages on a web store. I'm sure you could do this for other stores where you could go in and get specific information. We've chosen a category, we've looped through pages, and we've pulled out specific parts of the information from within each individual product page. We've done it in quite a simple way, and I'm sure you could tidy this up nicely, but let me know in the comments what you thought. Let me know if you've used this
before, or which bits were new to you, and don't forget to like and subscribe. Thank you, bye.
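On the leftover line breaks mentioned near the end: `strip()` only trims whitespace at the two ends of a string, so breaks in the middle of the rating text survive. One possible tidy-up, sketched here with the standard library `csv` module that the video mentions as an alternative to pandas, is to collapse all internal whitespace before writing. The helper names `tidy` and `to_csv` and the column names are assumptions for illustration.

```python
import csv
import io


def tidy(text):
    """Collapse any run of whitespace, including line breaks, into one space."""
    return " ".join(text.split())


def to_csv(rows, fieldnames=("name", "rating", "price")):
    """Write a list of product dictionaries to CSV text, tidying each value."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames))
    writer.writeheader()
    for row in rows:
        writer.writerow({key: tidy(row[key]) for key in fieldnames})
    return buffer.getvalue()
```

To save to disk instead, the same writer can be pointed at a file opened with `open("whisky.csv", "w", newline="")`; the file name here is just an example.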
Info
Channel: John Watson Rooney
Views: 44,266
Rating: 4.9741936 out of 5
Keywords: python, learn python, web scraping, web scraping with python, web scraping with python multiple pages, python requests, python beautifulsoup, web scraping products, web scrape online shops, web scraping with python and beautifulsoup, web scraping python beautifulsoup, beautifulsoup
Id: nCuPv3tf2Hg
Length: 21min 52sec (1312 seconds)
Published: Thu May 21 2020