This is the ONLY way I'll use Selenium now

Captions
Scraping with Selenium is slow, but as I've been using it more and more recently, I made a quick change that can speed it up several times over, and it's very simple to implement. It requires a little more initial setup and changing a few lines of code, but it's the only way I'll use Selenium now. I'll show you exactly what you need to change, how best to manage it, and the benefits of using Selenium this way over the traditional method.

First, I can imagine you sitting there thinking: why would I want to use Selenium for scraping in the first place? Aren't there multiple better options? And yes, that's often the case. I always look to see if I can reverse engineer an API first, or whether there's a load of data hidden in a script tag that we can use. But sometimes you really do just need a real browser there, to pass JavaScript checks, or to click buttons and move around the site. It's a good skill to have, and by the end of this video I think you're going to want to give this method a go.

Here's the code we're starting with. If I come down to the function that runs everything, you'll see that we have our handler and the webdriver; I'm using Firefox in this case. From there we create our item list and use the run function to get the information from each URL and do everything we want with it. If I run this, you'll see Firefox start up on the right-hand side, go to the URL, and flick through the pages as I asked it to. It's closing the browser between each product, and I'll tell you why I did that in just a minute. You can see it's all loading up now. This is inherently relatively slow, because we have to start up that browser each time.

What we can do instead is look at something called Grid. We want a browser that's already available, one we can hand work to, and the beauty of Grid is that it will queue our requests.
It can also handle multiple sessions. So let's go to our browser, open up Selenium Grid, and take a quick look. I think Grid has been updated recently, and it's really easy to use; we've got a few different options for how to run it. I'm going to show you the option that, while maybe not the easiest, only requires Java: you download one file and run it. Then I'll show you how to use it with Docker, which is my preferred method, how to connect to it, and how to change your code, as I'm going to do with mine here, to use Grid. Finally, we'll introduce some concurrency at the end so we can run multiple web browsers in parallel and speed the whole thing up tenfold. That's the plan.

As you can see here, it says you can start Grid using Java, which is what we're going to do. I'm going to close this, open up a new terminal, and run the file I already have here, which starts the Selenium server. Once it's started up, we get a URL we can open; if I move the terminal out of the way, you can see we now have Selenium Grid running on my local machine. There are no sessions going at the moment, but here's the best part: we have a max concurrency of 12, which matches the CPU on my machine.

So how do we change our code to connect to this and run against it, instead of starting the browser locally as the code runs? We come down to our webdriver, which I'm going to remove. I'm going to use Chrome now, so we say our options is going to be webdriver.ChromeOptions, and then we connect to our remote driver, which is what our grid is, using webdriver.Remote.
webdriver.Remote needs a couple of arguments. We give it the command_executor, which is going to be our grid URL, copied over from the terminal without the /ui part, and the options, which are just the Chrome options we created above. I've got a couple too many brackets here, so let's tidy that up; I'm going to format this with black, and that moves everything across neatly. And that's it, that's all we need to change to start using Grid. It's not difficult, and it runs nice and easily.

So let's run this code again using Grid and see what happens. I'm going to move the code over to this screen and put Grid on the other one, save, and run the grid version. Coming back over here, you'll see that it's loaded up: we had one session going, using Chrome, and it worked exactly the same way. But this was no quicker, really, because we were only using one browser. So how do we make it use all of the sessions available to us? That's with concurrent.futures. I'm going to do from concurrent.futures import ThreadPoolExecutor. This gives us a context manager we can hand our run function and our items list, which is essentially what this loop is already doing: here are all of our items, and we run each one. We give it that, say go for it, and let concurrent.futures handle everything for us; because we're connecting to our grid, we'll get multiple sessions. So let's change this loop over to use it.
We'll do with ThreadPoolExecutor() as executor. You can pass max_workers here, but by default it chooses a maximum for you, so we'll leave it as is. Then we call executor.map, which says: take your callable and your iterable. Our callable is our run function, and our iterable is our items list.

Now let's come back out of this and run it again, and we see multiple browsers coming up, because we're running them together. For the eagle-eyed among you: we only had two browsers come up despite twelve available sessions, and that's simply because my products list is only two items long. So I'm going to grab some more products; now we have four, so we should have four sessions starting up straight away, and as you can see they're all working at the same time, pulling the reviews up to the maximum number of pages I set. We've gone from working through one page at a time, opening the browser, closing it, and opening a new one, to opening and closing browsers across multiple sessions using Grid.

Now, you may not want to run Grid like this, so I'm going to stop the Java version, because there is obviously a better way: we want to run this in Docker. I think this is the preferred option over the Java version. What I've got here is my docker-compose.yml file.
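The concurrent version described above boils down to a few lines. In this sketch, run is a stand-in that just returns a string so the example runs anywhere; in the real code it would open a remote session against the grid and scrape the page:

```python
from concurrent.futures import ThreadPoolExecutor

def run(url: str) -> str:
    # Stand-in for the scraping function; the real one would create a
    # webdriver.Remote session on the grid and scrape `url`.
    return f"scraped {url}"

# Hypothetical items list; four entries means four grid sessions at once.
products = [f"https://example.com/product/{i}" for i in range(1, 5)]

# map(callable, iterable): each item is handed to a worker thread,
# so each product gets its own browser session on the grid.
with ThreadPoolExecutor() as executor:  # max_workers left at the default
    results = list(executor.map(run, products))
```

executor.map preserves input order in its results, and the context manager waits for all workers to finish before the with block exits.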
This file is very straightforward: you can just copy it, and I'll leave a link so you can pull it down. As long as you've got Docker installed on your machine, it will do everything for you. We're saying we want the Chrome service, and here we're saying the maximum number of sessions we want is 12. If you don't set this number up to whatever your CPU can handle, you'll end up with no extra sessions to run anything in parallel with. So here we run docker compose up, which finds the compose file, and you can see I'm having to redownload the Selenium image; that's what will happen for you the first time you run it: it installs all the images you need and then runs the Selenium Grid hub, the same as what we were looking at with the Java version, giving you that URL you can work with. I'm going to let this install and come back when it's done.

I can see that Grid has started up on port 4444, and here we are back on our Grid UI. This is exactly the same as before, except we have just the 12 Chrome instances. All you need to do is use this URL in your remote webdriver code and it will work exactly the same.

The benefit of using Docker, obviously, is that we can run this on a server, and you can run it across multiple machines. Once you get a bit more advanced, which is what I'm looking at at the moment, you can spin sessions and nodes up and down as you need and connect them to your grid, so you can expand and contract your Selenium Grid as much as you need to.

So that's it for this video. If you've enjoyed it, I recommend you go ahead and give this a go: take whatever code you've got that uses Selenium, fire up Grid, install Docker, and use the docker-compose file
that I'll link to, let it do its thing, and see if this will help you out. Make sure you use concurrency, though, otherwise it's not going to be worth it. Join the Discord, like this video, subscribe to my channel, all that good stuff; loads more cool web scraping content is coming. If you can't wait, this video right here is going to be really interesting to you if you want to learn more about the web scraping that I do.
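For reference, a compose file along the lines described in the video might look roughly like this; the image tag, shm_size, and SE_NODE_MAX_SESSIONS value are assumptions modelled on the official selenium/standalone-chrome image, not the exact linked file:

```yaml
# Hypothetical docker-compose.yml for a standalone Chrome grid.
services:
  chrome:
    image: selenium/standalone-chrome:latest
    shm_size: 2gb                # Chrome needs extra shared memory
    ports:
      - "4444:4444"              # the grid URL used by webdriver.Remote
    environment:
      - SE_NODE_MAX_SESSIONS=12            # match your CPU's capacity
      - SE_NODE_OVERRIDE_MAX_SESSIONS=true # allow more than one session per node
```

Run docker compose up, then point command_executor at http://localhost:4444 as before.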
Info
Channel: John Watson Rooney
Views: 7,133
Id: n_EGHAF3SDQ
Length: 9min 26sec (566 seconds)
Published: Wed Nov 29 2023