3 Ways To Scrape Infinite Scroll Sites with Playwright

Captions
There's no avoiding infinite scroll; it seems to be all over the place, and if you want to scrape data from a site like this you're going to need a few different techniques in your back pocket. In this video I'm going to show you how we can use Playwright to scroll down to the bottom of a page each time to load up more data, and then we're going to combine that with my favorite way of getting data from a website: the backend API. Every time we scroll down, the page fires off an Ajax request, and we're going to intercept it with Playwright and print it all out. This is a good technique to have in mind when you're thinking about web scraping. It may not be one you use as commonly, but I like it because it's really easy to set up and get going, and it's really good for exploratory work, figuring out what's going on on a website.

So let's get started. We're using Playwright and its sync API, and this is my basic function. All it's going to do is load up a browser (I've chosen not to run it headless, so we can see it), change the viewport size (otherwise it's really small on my screen, which is a bit annoying), go to the website, sleep briefly to make things easier to follow, click OK on the cookie banner, and then wait for something to happen. So let's run this now. We get our browser popping up, there's our page, the cookie pop-up appears and gets accepted, and once the page has hit network idle, which is a pretty useful load state to wait for, it's done, and I'll close it.

This video is sponsored by ScrapingBee, a real-time web scraping API that solves a web scraper's two biggest issues: JavaScript rendering and proxy management. ScrapingBee will handle your headless browsers and manage rotating proxies for you, giving you a simple-to-use API to integrate into your own system or project. The hardest part of scraping is always getting the data in the first place, so why waste time, effort, and resources doing
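The basic function described above might look something like this minimal sketch. The URL and the cookie-button selector are placeholders, since the video never names the site, and the import of Playwright is kept inside the function so the module loads even where Playwright isn't installed:

```python
import time

URL = "https://example.com/products"  # placeholder: the video doesn't name the site


def run(url: str = URL) -> None:
    # Imported here, not at module level, so this file can be imported
    # without Playwright present.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # headed, so we can watch it
        page = browser.new_page(viewport={"width": 1920, "height": 1080})
        page.goto(url)
        time.sleep(2)                      # crude pause, just to follow along
        page.click("text=OK")              # cookie banner (hypothetical selector)
        page.wait_for_load_state("networkidle")
        browser.close()
```

Calling run() would launch the headed browser and walk through those steps.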
it yourself? ScrapingBee uses the latest Chrome browsers for quick and easy rendering and page loading, taking away the heavy resource usage of running your own stack of Chrome instances. You can also send custom JavaScript snippets to evaluate on the page, for example scrolling down like we do in this video. On top of that, a large proxy pool that auto-rotates and is geolocated means very quick and efficient data extraction and web scraping for you or your development team. There's a Python package on pip too, to make your life even easier if you're a Python developer like me. If any of this sounds appealing, go ahead and click the link in the description below and check out ScrapingBee for yourself.

So, we want to scroll down now. As I mentioned, there are three different ways I'm going to show you. They're all pretty simple, and it's up to you which one you use, but they do have a limitation: we have to execute the key press, the mouse wheel, or the JavaScript each time to keep going down, and that's going to involve some kind of loop. So let's put it in here. We'll add a quick comment, "scroll down", then a for loop, for x in range(1, 5) to start with, and then our first method, which is pressing the End key. If you're on the website and you hit End, it jumps to the bottom of the page, fairly self-explanatory, so we're going to mimic that with page.keyboard.press and the End key. I'll also print out some text, "scrolling key press" plus the loop number, and put in a little sleep so we can really see what's going on. This is going to do exactly what you'd expect: it loads up, clicks the cookie banner, which keeps
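That End-key loop can be sketched as a small helper that takes the Playwright page object; the print and the sleep mirror what's described above:

```python
import time


def scroll_with_end_key(page, times: int = 5, pause: float = 1.0) -> None:
    """Press End repeatedly so the browser jumps to the bottom of the
    page each time, triggering the site's infinite-scroll loader."""
    for i in range(1, times):              # range(1, 5) as in the video
        page.keyboard.press("End")         # same as hitting End on the keyboard
        print(f"scrolling key press {i}")
        time.sleep(pause)                  # crude pause so we can watch it happen
```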
popping up, which is a pain, and then it just hits that End key each time. I don't have any extra waiting in here, but you may want to add some if you're really working on it. There we go, you saw it scroll down, pretty self-explanatory. So that's one way of doing it.

In one of the more recent versions of Playwright you can also do a mouse wheel scroll, which is another good option. Let's change the keyboard line to page.mouse.wheel, which needs some deltas to scroll by. We'll pass something like 0 and 15,000 or 10,000; the second number is the vertical delta, which is what takes us down the page. That should do exactly the same thing, so we'll run it, and we scroll all the way to the bottom of the page five times like we did before. Works pretty well, as you can see.

The last method is to execute some JavaScript. We can evaluate JavaScript code on the page to do exactly the same thing and jump to the bottom. Let's remove the mouse wheel line and paste this in, so I don't have to type it out again; you can see all we're doing is using JavaScript to scroll down. Run it again and you'll see exactly the same behavior.

What we need to think about while it's doing this is what's actually happening each time we scroll. Let's look at the network tab in the browser. I'll come back to the other window, inspect, go to Network, and make this a bit bigger, or zoom in. Every time we scroll down we get these requests here, and these are the ones that carry all of the product information. So every time we scroll down, ignoring all the other requests, which are mostly annoying
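The other two variants can be sketched the same way. Note that mouse.wheel takes the horizontal delta first and the vertical delta second:

```python
import time


def scroll_with_mouse_wheel(page, times: int = 5, delta_y: int = 15000,
                            pause: float = 1.0) -> None:
    # mouse.wheel(delta_x, delta_y): the second argument is the vertical delta.
    for _ in range(times):
        page.mouse.wheel(0, delta_y)
        time.sleep(pause)


def scroll_with_js(page, times: int = 5, pause: float = 1.0) -> None:
    # Evaluate JavaScript in the page to jump straight to the bottom.
    for _ in range(times):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
```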
analytics, we get a new set of products, which is exactly what we would expect. There's another one at the bottom, and another; you can see them coming through.

Now, what I generally say to people when they're looking at scraping a site like this is to go ahead and grab this request and try to mimic it from outside Playwright. That has its advantages: it can be much quicker, and you can often tailor the request to exactly what you want. But in some cases, like this one, you can just let the browser do it. If you don't need to worry about overhead or overall scraping speed, and you just want to monitor this set of products, you can absolutely make Playwright work for you.

So what I'm going to do is amend our code just a little, so that every time we hit one of these requests it spits out all of the JSON data, which we could then save or parse, whatever you want to do with it. Back in our code, I'm going to keep the JavaScript scroll for now, and we're going to use Playwright's network events to grab that information. These network events trigger on the response, so just underneath our viewport size we'll do page.on. This gives us access to the actual events; you can see the editor listing them all, and if we look down the list, there's "response". That's the one we're going to use. The "response" event fires every time there's a request and a response while the page is loading, and we can pass a handler function into it. We'll start with a lambda taking the response, which gives us
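Registering that handler looks roughly like this; the lambda form is the one used first in the video, before it's swapped for a named function:

```python
def attach_response_printer(page) -> None:
    # Fire for every response the page receives and print its URL.
    page.on("response", lambda response: print(response.url))
```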
access to the response data. Let's look at all of the responses first by just printing response.url. Save, run the code again, and we see a lot of stuff coming up here: this is all of the network transfer happening while the page loads. There's an awful lot going on, and it's exactly what you'd see in that network tab, although in the other browser I had it filtered.

So what can we do with this? We're going to need to filter it. There are a few ways to do that, because you have access to the response URL, the response body, and the response headers, but one of the easiest is to try to parse each response body as JSON and see if it succeeds. Let's write a new function to pass in instead of just printing. I'm going to call it check_json, and it takes the response. In this function we'll use an if statement, because we want to see if the word "products" is in the response URL: if you go back to the response we were looking at before, with all the product data, the request URL has "products" in it, and that's the only URL we're interested in. So if "products" is in response.url, we print out the URL and the JSON data from the body: url = response.url and body = response.json(). For the moment I'm going to assume that if "products" is in the URL, the body will be JSON and we can access it. If we get errors, we'll add a try/except and handle what will probably be a decode error; we'll see where we get to. So now we have our handler function
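The check_json handler described above, as a sketch; it assumes, for now, that any response whose URL contains "products" has a JSON body:

```python
def check_json(response) -> None:
    # Only the infinite-scroll requests have "products" in their URL.
    if "products" in response.url:
        url = response.url
        body = response.json()   # assume these responses are JSON for now
        print(url)
        print(body)
```

It would then be wired in with page.on("response", check_json).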
for check_json. Instead of printing here, let's remove that and call check_json with the response instead. Now when I run this again, we should get some more useful information. Let's give it a go.

OK, so we did indeed get some product information, but my scroll stopped working because I removed it by accident, so I just need to put that back in. I wondered why we weren't scrolling; that will be my Vim skills letting me down, but do persevere with Vim, trust me. I'm going to remove the print for scrolling, we don't need to see that anymore, and let's try it again. We should get a load of product information coming through: every time we scroll down the page on the right, we get the JSON data in our program on the left. Let's see. There we go, you can see we're getting more and more information each time.

This last one has failed, and I think that's probably to do with the waiting, or with the browser closing before everything else was done, so we'd want to handle that in some way. But you can see what I'm getting at with this method: every time we scroll down, the browser loads a new chunk of product data into the page using JavaScript, and we can intercept it by pulling that information directly from the browser and having it here, in all its JSON glory, in our terminal.

From here, what I'd expect to do is iron out a few of the kinks. The time.sleep calls aren't great; you want to look at the Playwright wait commands instead, waiting on an element, or wait_for_load_state, which is a good one and which I'm using here. Then you can choose how to handle that JSON data just as if it had come from anywhere else. A super cool method, very useful in certain situations, and a good one to know. If you've enjoyed this and you want to know a bit more about how to handle that JSON and how to work with this method
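One hedged way to iron out those kinks is to wrap the JSON parse in a try/except, so a non-JSON body, or a response that arrives while the browser is closing, doesn't crash the handler; this variant returns the payload instead of printing it:

```python
def check_json_safe(response):
    """Return the parsed product payload, or None if this isn't a
    product response or its body can't be read as JSON."""
    if "products" not in response.url:
        return None
    try:
        return response.json()
    except Exception:   # e.g. a JSON decode error, or the body is already gone
        return None
```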
in a more targeted way, you'll want to watch this video right here, where I go into it in a lot more detail.
Info
Channel: John Watson Rooney
Views: 16,367
Keywords: john watson rooney, web scraping, python web scraping, infinite scroll, data extraction
Id: VDf7nfjLwRU
Length: 12min 18sec (738 seconds)
Published: Sat Apr 22 2023