Weekly Web Scraping with Python: Product Pages, Pagination, Save to CSV

Captions
Hi everyone, and welcome, John here. Today's video is the first in a series of weekly web scraping, where we'll take a website, scrape the data, and look at how I did it: what packages I used and how to get that information out. Hopefully you'll enjoy this and maybe learn something from it.

The first one we're going to do is hiking backpacks from rei.com, and we can see there are 524 products over multiple pages, 18 to be precise. The first thing I always like to do is check out the website itself and see how the pagination works, so I'll come down here and click page 2, and we can see the URL has added page=2 at the end. If I go back to page 1, we should return to the first page where we were. That's a good start, because now we know how we can loop through all the pages. My plan is to get all the main product links off every page, then loop through and get the product information from within each product page. We could take the information off the listing page, but there's not an awful lot there, whereas if I actually go into a product there's more information we can see.

Before I move on, I want to check this page and see how I'm going to get the information out, because how I do it depends on which Python libraries or packages I'm going to use. I'll inspect element first, which is generally what I do, have a look at the inspector on the left-hand side, and start hovering over. We can see the price is in a span tag, so this all looks fairly accessible. What else I like to do is have a look at the source code: I tend to copy a specific, fairly unique bit of information, usually the price, go to view page source, make it bigger, and then search the source code for it.

The first result here is quite interesting, because we're inside a script tag that appears to have all the information in it, so that's good to know. If we go down another result, we've ended up all the way over here, and if I highlight this whole line and move it to the middle, we can see we're inside a script tag of type application/ld+json, and it has all the information within it: for each product there's a name, a URL, a description, the price, and everything like that, all on one line. So I'm actually going to use chompjs on this line of information, and that's where we're going to get the data from. That's going to make our coding so much easier, because this can already be turned into usable dictionaries from the JSON.

Right, I've talked enough, let's get coding. The first thing we want to do is go back to the main page and copy the URL, and I need it with the page parameter on the end, so I'm going to grab the page= part too. I'm going to use requests-html, so I'll do from requests_html import HTMLSession, and we're also going to need chompjs, which I just described. I'll set our url variable with that pasted in and change it to page=1. Now we need to work out how to get the individual product URLs from each page, so I'll say s = HTMLSession() so we have a session object to work with, and then r = s.get(url) for the response to the URL we're looking at. Let's go back to the first page, with page=1, and inspect the element to see where the URL for each product is.
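The pagination pattern found above, a page query parameter appended to the listing URL, can be sketched with an f-string. The exact listing path below is an assumption for illustration; only the ?page=N pattern comes from the walkthrough.

```python
# Build the paginated listing URL with an f-string.
# The path "/c/hiking-backpacks" is a hypothetical stand-in;
# only the "?page=N" query pattern is taken from the video.
BASE = "https://www.rei.com/c/hiking-backpacks"

def page_url(page: int) -> str:
    """Return the listing URL for a given page number."""
    return f"{BASE}?page={page}"

print(page_url(1))   # https://www.rei.com/c/hiking-backpacks?page=1
print(page_url(18))  # https://www.rei.com/c/hiking-backpacks?page=18
```

Keeping the page number as the only variable means the same function covers all 18 pages later, when we loop.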
I like to hover over the whole thing, and you can kind of see it goes in blocks. Here, under this a tag, we have an href we can use, and that gives us the product page, but because it's not a full URL we're going to need to add the https part ourselves. It's interesting that all of these classes have a sort of random-looking string instead of saying product-data or something, like a lot of websites do, so we're going to have to find a way around that. What I like to do is collapse as much of this as possible and head back up the tree. If I close these down, here's the big long run of list items inside the unordered list that hold the information for each of the products (you can see them being highlighted), and above them we have a div with an id of search-results. That's really good, because as you can see on the left-hand side of the screen, that narrows it right down. So I'll copy that and come back here, and say results = r.html.find(); it was an id, so we need the # prefix, and we put '#search-results' in there. Now if we print results, we should get our element back. Yes, here it is: a div with the id of search-results, so we know we're in the right place.

Now we want to go a bit further down the tree. Underneath our search results we have this ul, then the li, and then, if I close these, this a tag directly underneath, which has the product link in it. That means we can chain the CSS selectors together and go for the ul, the unordered list, then the li, then the a. All we're doing is writing a basic CSS selector: within our search results there's an unordered list, then a list item, then an a tag. Let's run that, and there we go, we've got a load of elements back. These are all the links on the page within what we set up as our search results, so they should all be product links. The downside is that there are actually two a tags within each list item (here, you can see both of them), so we're getting the same link twice. That's not the end of the world, though: instead of trying to avoid scraping it, I'm just going to remove the duplicates when I build my list, which is a nice easy way around it.

So what do we need to do now? We need to get the href attribute from each element, which means looping through each one. Let's do that: for link in results, we want link.attrs['href'], and let's print that to check it's the right thing. Yes, there we go, we've got all of these links, and you can clearly see there are two of each. We also need to add the base URL to each one; you can either hard-code it or, as I tend to do, add another variable, because you never know where you might need it: base_url. So if we print base_url plus the link (I can't remember if we need an extra forward slash or not, but we'll find out), okay, we've got too many, so I'll remove that. That's all of our links constructed, which is good, but now we need to work on a way of returning just the unique ones.

An easy way to do that is the list() function with dict.fromkeys(). Let's append these to a list first, product_links.append() each time, and then say links = list(dict.fromkeys(product_links)). If we print the links list, we should have only the unique links, and there you go, you can see we've removed the duplicates. That's one way of doing it, and this is a good CSS selector with some good practices in here, but it's a lot of lines of code for one thing, essentially. So I want to tidy this up, and to do that we're going to use a list comprehension. We can remove a lot of this, because we don't need to create that extra list; we can do it all in one line. Take results, and in the comprehension take what we want out of it, which is base_url plus the href, for link in results. Then I'll turn this into a function: define fetch, give it a url, and return the comprehension out of the function; that's what gave us our product links. So if I run this now, fetch with the main URL, hopefully we'll get all that information back.
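The two chores just described, prefixing the base URL onto each relative href and dropping the duplicate links, can be sketched like this. The hrefs below are invented stand-ins, and urllib.parse.urljoin is a standard-library way to join the pieces that also settles the extra-slash question.

```python
from urllib.parse import urljoin

# Hypothetical relative hrefs as scraped from the listing page;
# each product appears twice because the card holds two <a> tags.
base_url = "https://www.rei.com"
hrefs = [
    "/product/168448/example-pack-65", "/product/168448/example-pack-65",
    "/product/129927/example-pack-48", "/product/129927/example-pack-48",
]

# urljoin handles the slash between base and path for us, and
# dict.fromkeys removes duplicates while preserving order.
links = list(dict.fromkeys(urljoin(base_url, h) for h in hrefs))
print(links)
# ['https://www.rei.com/product/168448/example-pack-65',
#  'https://www.rei.com/product/129927/example-pack-48']
```

A set() would also deduplicate, but dict.fromkeys keeps the links in page order, which is nicer for checking the output against the site.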
Except when I run it I've done something wrong: I've missed the colon on my function, and I'm missing the end of the brackets here; you can see I was missing the closing bracket of my list comprehension. Now it should get the links back, which it does; we just didn't print them, which is why we can't see them, so let's do that one more time and we should be fine. There we go: we now have a list of all the links for the products on that page. That's working, but we know there are multiple pages, so we want to change it up. Instead of just the initial URL, I'm going to pass in x, loop through the page numbers, and add each number to the page parameter at the end. So I'll copy this URL, put x in place of the page number, and use an f-string to automatically substitute whatever value of x we give our fetch function into the page number. If I change this to page 2 and run it, we get a different set of products to the ones we saw the first time around, and we can see that they are indeed different. So that's our product links sorted out, nice and neatly, all done with our CSS selector to get right down to them and our list comprehension to avoid multiple lists, loads of for loops, and loads and loads of lines of code.

The next thing we want to do is get the information out of each of the product pages. I mentioned earlier that we're going to use chompjs for that, because the information is all in that script tag, so we need to load up the page, find the script tag, and give its contents to chompjs. That's nice and straightforward, so I'm going to write another function, parse_product, and give it a url, because we're going to call it for each product URL we go through. Then we can do r = s.get(url), and now we want to find that script tag, which we can do nice and easily with CSS selectors as well. I'm going to call this details, because it's all the product details: r.html.find(), looking for a script tag with a type attribute, which we can write as 'script[type="application/ld+json"]', and I'm going to add first=True, because otherwise this returns a list, and we know from looking at our page that the one we want is the first one. There is another one further down; if you wanted that one, you would index it from the list instead. Then we can do data = chompjs.parse_js_object() (there you go, the editor suggests it, so I don't need to remember it every single time) and give it details.text, because we want the text from that element, and out of this function we return the data. Now if we call parse_product with a product URL (let's copy one and print parse_product on it to test that it works), there we go: we've got all of this information for this product, and it's the same for every product we're going to scrape. So we can really easily parse all the information out, just by looking and checking that it was there in the first place.
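A minimal sketch of that parsing step: the video feeds the script tag's text to chompjs.parse_js_object, which also copes with JavaScript object literals that are not strict JSON. Since a JSON-LD block is normally valid JSON, the standard-library json module can stand in for a dependency-free illustration here; the product record below is invented.

```python
import json

# Invented stand-in for details.text, i.e. the contents of the
# <script type="application/ld+json"> tag on a product page.
script_text = '''
{
  "@type": "Product",
  "name": "Example Pack 65",
  "offers": {"price": "199.00", "priceCurrency": "USD"}
}
'''

# The video does chompjs.parse_js_object(details.text); for
# well-formed JSON-LD, json.loads yields the same dictionary.
data = json.loads(script_text)
print(data["name"], data["offers"]["price"])  # Example Pack 65 199.00
```

chompjs earns its keep when the embedded object uses unquoted keys, trailing commas, or other JavaScript-isms that json.loads rejects.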
We've got all this information, and from it we can easily create a pandas DataFrame, which we're going to do at the end, flatten it all out with json_normalize, and get a nice CSV file with all of this data in. That's really cool and really useful: our whole HTML parsing is one function with three lines of code. So definitely always check the HTML, like I showed you at the beginning, to make sure you can get the data that way.

What do we want to do now? We've got our function that gives us all of the product links on each page, which we're going to loop through, pages 1 to 18 I think it was, and we've got the function that gets the information for every product URL we give it. So the two main parts of our code are written, and we want to do the last bits. I'm going to create a main function: define main, and this is the one we're going to run every time the code is executed with if __name__ == "__main__", but we'll come to that in just a minute. We'll say urls = and get all of the product URLs again with a comprehension: fetch(x), our function, where x is the page number, for x in range(1, 3). I'm doing 1 to 3 to start with just so we can test it, so we don't make too many requests to the server while we're not actually saving the data. This is a nice simple for-x-in-range loop that we've flattened into a comprehension, and it's going to give us the URLs for all the products on pages 1 and 2; just one and two, because range stops before 3. Then we want to save all of those into a list; however, because fetch already returns a list each time, we'd end up with a list of lists of product links, which makes it more complicated to loop through them easily and get the data back.

So I'm going to use Python's itertools to change my list of lists into a flat list we can easily loop through; it sounds more complicated than it is. We import itertools, which is in the Python standard library, and then say products = list(itertools.chain.from_iterable(urls)). All this does is take the list of lists we've created, go through all of it, and turn it into one big list of all the product links, which is exactly what we want, because we can then give each of those links to our parse_product function to return the data. The next thing this function needs to do is return that data, and again a list comprehension is fine for this: return parse_product(url) for url in products. There we go, got there in the end, after a missing bracket. All we're doing here is running our parse_product function for every URL in our product list. These are possibly not the best variable names, but you can see where they're coming from: we take every product URL in the flat list we built with itertools, loop through every single one, run our function on it, and return the data for all the products as a list we can work with nice and easily. Now we need a way to run all these functions in the right order, and to manage the session and so on, and the easiest way would just be to call the functions yourself in the order you want.
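The flattening described above can be sketched with stand-in functions in place of the real scraping, so the shape of the pipeline is visible without any network calls: fetch produces a list per page, itertools.chain.from_iterable flattens the list of lists, and a comprehension parses each link. The URLs and counts below are invented.

```python
import itertools

# Stand-ins for the real scraping functions, so the pipeline
# can run offline. In the video these hit the site instead.
def fetch(page):
    """Would return the product links found on one listing page."""
    return [f"https://example.com/product/{page}-{i}" for i in range(3)]

def parse_product(url):
    """Would return the JSON-LD data for one product page."""
    return {"url": url}

def main():
    # One list of links per page -> a list of lists...
    urls = [fetch(x) for x in range(1, 3)]
    # ...flattened into a single list of links with itertools.
    products = list(itertools.chain.from_iterable(urls))
    return [parse_product(url) for url in products]

print(len(main()))  # 6: three stand-in products on each of two "pages"
```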
But I'm going to use if __name__ == "__main__". If you're not familiar with this, it means the block only runs when the file is executed directly: if this file were imported by another file, the block wouldn't run, but if we run this file itself, it will. That's possibly not the best explanation, but just follow along and you'll see how it works. So we write if __name__ == "__main__": (double underscores, double equals) and whatever we put under it will run when we execute the file. I'm going to move this code in here as well; I don't think we need this line anymore, but we do need the base URL (in fact we can put the base URL inside the fetch function, that will be fine), and we're going to need the session down here, because we want to use the same session object throughout. Then we need to import pandas, so I'll do import pandas as pd. This is just going to let us export the data really easily: we can use the json_normalize method, because we have dictionaries of JSON data that it can flatten out, and it's going to handle all of it for us. So I'll say df = pd.json_normalize() (let me move this up so you can see) and we're going to run that on our main function, which is the one that returns the big long list of all the product data. Whatever product data comes out of main, we create a DataFrame with it using json_normalize, flattening it all out into columns we can see in a CSV file.
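A minimal sketch of that export step under a __main__ guard, with an invented nested record standing in for the real scraped data: json_normalize turns nested keys into dotted column names, and index=False keeps the pandas index out of the file.

```python
import pandas as pd

# Invented nested product records, standing in for the
# dictionaries that parse_product returns from the JSON-LD.
products = [
    {"name": "Example Pack 65",
     "offers": {"price": "199.00", "priceCurrency": "USD"}},
    {"name": "Example Pack 48",
     "offers": {"price": "149.00", "priceCurrency": "USD"}},
]

if __name__ == "__main__":
    # Nested keys become dotted columns, e.g. "offers.price";
    # index=False stops pandas writing its row index to the CSV.
    df = pd.json_normalize(products)
    df.to_csv("rei-week-one.csv", index=False)
    print(df.columns.tolist())
    # ['name', 'offers.price', 'offers.priceCurrency']
```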
Then I'm just going to do df.to_csv(); I'll call this rei-week-one.csv, and I always set index=False, otherwise you get the pandas index down the side, which is no use to us. Finally I'll just print "finished", as we've not actually printed anything out and wouldn't otherwise see it working; if you want to see things happening you could always add some extra print statements, but I don't think we need to, and as I said I'm only running this on two pages. So this is where I run it and find out what I've mistyped and where my errors are. None, apparently: finished. Okay, let's open rei-week-one.csv, and there we go, a nice long 60 products, which is 30 per page, with all the information here, so that's worked great. This viewer is actually an extension called Rainbow CSV, which I find really useful if you don't have a CSV viewer in your VS Code; I like it because it colours the columns so you can read them nice and easily. So that's it, that's worked really well, and we're only up to line 29, so that's not a huge file. We've got some good use of CSS selectors in here; if you're not really up to speed with CSS selectors, I've got a video on that on my channel, and it's definitely worth playing around with them. We're using requests-html, which is at the moment my favourite web scraping library, just because it combines everything together; it's not necessarily lighter or faster than using Beautiful Soup and Requests separately, but I find it a bit more feature-rich and it works really well. And we've got chompjs, where we're taking the data from that script tag. That was a good bit of working out and researching the page to see where the data was; otherwise we would have had to go and loop through loads of different bits of the page to parse the HTML out, we would have had loads more lines of code, and we probably wouldn't have got all the information we wanted, because maybe we just couldn't find it or couldn't
see it. Whereas doing it this way, we have a main function that controls everything and runs the fetch and parse functions for us; we're using if __name__ == "__main__" with our session there, so we use the same session throughout, which speeds up the loading of the pages and keeps everything nice and tidy; and then pandas with json_normalize just flattens everything out and gives us nice columns. Now, this won't always work, depending on what your data looks like, but it's definitely worth trying first if you have data like this, because there's a good chance it will. So that's it, guys; hopefully you've enjoyed this and learned something from it. I've done an awful lot of talking and not an awful lot of coding, but I'm going to be uploading this code to my GitHub so you can see it, have a look, and have a play around if you want. Hopefully you've learned something, or at least just enjoyed this video, so thank you very much for watching. I've got loads of web scraping content on my channel already and more to come, so if you're interested in that sort of content, and Python in general, I'd recommend subscribing so you don't miss it. Hit like on this video, drop me a comment, and I will see you in the next one. Thank you very much, and goodbye.
Info
Channel: John Watson Rooney
Views: 5,121
Rating: 5 out of 5
Keywords: web scraping, python web scraping, learn web scraping, learn python, code tutorial, learn to code, python projects, requests-html, web scraping pages, pagination, chompjs, python itertools, product scraper, ecommerce scraper, weekly web scraping, extracting data, python, python tutorial, save to CSV, pandas to csv, web scraping with python
Id: aA4o98Xb8JU
Length: 24min 45sec (1485 seconds)
Published: Sun Mar 07 2021