How I Scrape Data with Multiple Selenium Instances

Captions
In this video we're going to look at a few different ways that you can connect to your Chrome browser using Selenium. We'll start off by launching it from within our code, like I'm going to do here, and then we'll move on to looking at Selenium Grid and how we can use that concurrently to run lots of different web pages in one go. Now, I'm just experimenting with Grid, so I'm not fully utilizing it, but I think you're going to like what I'm going to show you, and it's going to be pretty cool.

So this is the basic starter code here. As you can see, I'm creating my Firefox driver, and we're just going to load the page, get the HTML source, and close the driver. Now, I'm doing this for a reason, and I'll show you that later. And then we just parse the information out. So if I was to run this file, we'll see the Firefox instance open on that side, and it will then load the page, close down, and we should get the information back. It's going to do that for every URL in that list, which we do not need to sit and watch.

What I'm going to do now is go ahead and have a look at the Selenium WebDriver documentation, and the Remote WebDriver section here. What this is basically saying is that we can connect remotely to that running instance of Chrome or Firefox or whatever it is that we've got. This is particularly interesting because we can now think about having our browser separate from where we're running our code, and that's the main thing I think to take away from this. It's very simple to make the change: as you can see, the code here is just changing the driver to webdriver.Remote and adding the command executor, and we're going to do that in just a second.

From doing that we can then look at Grid. Grid is going to manage all of these instances of the headless browsers for you. I've started with the simple version, and what I'm going to do is start up a
new terminal here, and I'm going to run this command. This command is basically using Java to run the Selenium standalone Grid server, which kind of does everything in one go. So I'm going to hit enter and it's going to go ahead and hopefully start this up on port 4444, and as you can see, somewhere around here it says 12 times. If I come to my browser and go to localhost:4444, we can see that we have this Selenium Grid UI. It'll tell us the running sessions that we've got, and it'll also tell us, hey, you've got 12 available instances that you can concurrently spool up.

So I'm going to use this now. We're going to change our code. I'm just going to move this one over to where the browser is, and we'll go back to our main.py file. We'll remove this and add in our options, like we saw in the documentation, so we'll have webdriver.FirefoxOptions. Then from here we can say that our driver is going to be equal to webdriver.Remote, and we need to have the command executor, which is the localhost URL that we've just created (we just ran our Selenium Grid instance on 4444), and then options is going to be equal to the Firefox options that we just created. I said options a lot of times, but we're done now.

From here I'm going to save, come out of this, and run this file again. We should now see exactly the same thing that we just had, except now, you see, we have sessions: one, and it will tell us (and this is because that one's just closed) that it's actually running it through our Selenium Grid. You can just see it there. I've got everything on the screen at the moment, but you get the idea.

Now, what's important about this is a couple of different things. One, we're only using one of the 12 browser instances that we could be using concurrently. And also, if we can connect remotely, surely we can use something better than
just running this through the Java file here. We can use Docker. So I'm going to close this, get this stopped and out of the way, and get rid of that browser instance. I'm going to close down my Java version; we don't need you anymore.

What I'm going to do is come out of this file and show you my Docker Compose file. This is easily available on the internet; I found this on GitHub. This basically means it's going to pull the Docker image that we need and run it with Selenium Hub. This is slightly different: it's not the same as the Selenium standalone, which is what we just looked at. This is the hub and node version. As you can see, we're going to have a max sessions of 10, which is what I stipulated. I could probably change that to 12, because that's what it said I could do, but I'm just going to leave it at 10 for now.

Using Docker here, I can now do docker compose up, and this is going to start this up like so. You'll see everything's getting going, and these are actually Chrome instances in this case. You can see that this has all worked, and if I come back over to my Grid and go to Overview, we now have 10 Chrome instances available. I'm just going to stop this for a second and run it with the -d flag, just so it runs in the background. There we go. So this is now going to run in the background, and we should still have this available once it gets going. There we go.

So let's go back into our YouTube folder and open up our main.py file. I did use Chrome in this instance, because that's what I chose for my Docker container, so I'm just going to change this to ChromeOptions instead. Okay, so let's now go ahead and do py main.py, and we should start to see some data coming back. This is going to run completely headless. You can see it's working: if I come to here, we'll see we have sessions again, one, and you can see them come and go. Now, this is why I put
into my code the driver.quit (sorry, not driver.close; it's important that it's .quit). If you were doing this without Selenium Grid, you could absolutely use the same driver and just go page, page, page, page. But what you want to make sure you do is actually quit that driver out, otherwise it's going to stick in that session and stop everything going, because that instance of the browser is just going to run forever and sit there until the end of time, essentially until something crashes. So that's why I have driver.quit in here, just so I know that it's closing. Now, we could of course make it so that the browser instance we spin up goes through multiple pages and does all of that stuff that we normally do. I just have it doing one thing at the moment, so that's worth bearing in mind.

Now, I mentioned concurrency at the top of the video. We can do concurrency within Python; there are a few options, and I'm going to use concurrent.futures. There is a thing with concurrent.futures where sometimes it will have issues with memory. However, I'm still experimenting with this, and I found it the easiest way to show you how we can run multiple browser instances using Grid and Python, so we are indeed going to do that.

We need to add a few extra bits into our code. We need to do import concurrent.futures, like so. Using concurrent.futures and the ThreadPoolExecutor, we can then run this function concurrently, so we have multiple instances of Chrome going. To do that, underneath my parse HTML function here, we're going to put in a piece of code. We're going to use a context manager, with, and we're going to say concurrent.futures.ThreadPoolExecutor, and give it max_workers of 10, because we have 10 browser instances, so I figured that makes the most sense, as executor. I want to say that the results that we want to get back from running all of this
concurrently, all in parallel, should come back as a list, and I want to say this is going to be a list of executor.map, which basically allows us to run a function against a list. Our list is this urls and our function is get HTML, so we pass get HTML and then our URLs, like so. I'm going to save this.

We then just need to change this. Let's get rid of this; we need to loop through our results. Effectively, we're going to load up all of our browsers in one go, and then we're going to get all the results back, and they're going to be in this results list. So we'll do for res in results, and of course the result from this get HTML function [inaudible] is actually an HTML page, so it's a load of HTML text. We can then go ahead and print out our parse HTML of res, like so.

Okay, so let's give this a go. Let's run it: py main.py. Come over to our Selenium Grid and you can see that we have one session going, and it goes up to eight. These are all of the browsers that we are running at the moment. Oh, and they've all finished, which means our information has come back. So basically we loaded up all of those browsers in one go through Selenium Grid, got all the information back, and then queried it by parsing the HTML.

Now, there's a lot of cool stuff that we could do with this. If you think about what Docker is good for, which is deploying stuff and having things running in their own container, we could put this onto a DigitalOcean droplet, for example, and run our scripts that need Selenium in the cloud much more easily than before. We can also run all of this stuff concurrently, so you can greatly speed up any code that requires this kind of action, in multiple ways. For example, we could go to 10 or so different websites at
once and start scraping that way, or we could start with a list of URLs, like I've done here. Or you could even go to one page, pull back the URLs, and then action against those, if you wanted to spider your way out, for example.

So that's it. Let me know what you think about Selenium Grid. Let me know if you've used it more than I have, which you probably have, and let me know what the best way to work with it is, because I'm only just starting to work with it and I'm finding it pretty interesting. There's lots more coming up soon on my channel too. There's a Discord, which is going to be linked in the description, so if you want to join that, go ahead and click in and come and have a chat. There's a Patreon as well, so that's down there. And if you want to watch more of my videos, definitely do that, and start with this one: more web scraping stuff that I think you will enjoy.
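
The code itself is only shown on screen and is not part of the captions, so what follows is a rough sketch of the approach described above rather than the author's actual main.py. It assumes Selenium 4 and a grid listening on http://localhost:4444. The names GRID_URL and scrape_all are made up for illustration, and get_html is a guess at the spelling of the spoken "get HTML" function. Each call opens its own remote Chrome session and always calls quit(), and a ThreadPoolExecutor maps that function over the URL list with 10 workers to match the 10 grid sessions.

```python
import concurrent.futures

# Assumption: the Selenium Grid from the video is reachable here (port 4444).
GRID_URL = "http://localhost:4444"


def get_html(url, grid_url=GRID_URL):
    """Open one remote Chrome session, return the page source, then quit."""
    # Imported inside the function so the rest of this sketch can be used
    # (for example with a stand-in fetch function) without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    driver = webdriver.Remote(command_executor=grid_url, options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # quit() rather than close(): it ends the session, so the grid slot
        # is released even if loading the page raised an error.
        driver.quit()


def scrape_all(urls, fetch=get_html, max_workers=10):
    """Run fetch over every URL concurrently; results keep the input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map applies fetch to each URL on a worker thread.
        return list(executor.map(fetch, urls))
```

With the grid running, scrape_all(urls) would return one HTML string per URL, which could then be handed to a parsing function in a for loop as in the video. Passing a different fetch function makes it possible to try the thread-pool part without any browser at all.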
Info
Channel: John Watson Rooney
Views: 11,028
Id: GRu117xiusQ
Length: 12min 5sec (725 seconds)
Published: Sun Oct 08 2023