Selenium Headless Scraping For Servers & Docker

Video Statistics and Information

Captions
What is going on guys, welcome back. In this video today we're going to learn how to do proper web scraping and web automation using Selenium on a server, where we might have no screen as well as other limitations. So let us get right into it.

All right, so we're going to learn how to do proper web scraping using Selenium in a server environment. The difference between your desktop system and a server is that the server environment has certain limitations that your desktop system may not have, and this is even more the case if you use a containerized environment: if you take Docker and dockerize your applications before you deploy them, that introduces an even more special environment with more limitations and more things to consider. The obvious limitation for most servers is that they don't have a monitor. When you build a Selenium application, by default it opens up an instance of Chrome, shows you the process, interacts with the web page, and shows you exactly what's happening. If you take that application and just run it on a server, it will most likely crash, because the server will tell you that it didn't find a screen to display anything on. That leads to problems, because you didn't consider the limitation that the server doesn't have a monitor. That's one example.

So in this video what we're going to do is build a very simple Selenium web automation application: I'm going to go to my own page and scrape a couple of headings, and then we're going to see how this does not work immediately in a Docker container, and what we need to do for it to work in a Docker container. Obviously for this we need to install a couple of packages: first of all Selenium, then beautifulsoup4, which is for the scraping part, and then we're also going to need, what is it called, the webdriver manager package.
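A quick sketch of those installs, assuming a standard pip setup (the package names are the ones mentioned in the video):

```
pip install selenium beautifulsoup4 webdriver-manager lxml
```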
The package is called webdriver-manager, and we're also going to need lxml for the BeautifulSoup parsing. The main thing here is Selenium, which is the web automation part; BeautifulSoup is just for the scraping; webdriver-manager just downloads the chromedriver that we're going to use; and lxml is just for parsing the HTML.

Then we open up a Python file and start with the imports. First of all we import time, so that we can delay certain parts of the automation. Then we say from bs4 import BeautifulSoup, then from selenium import webdriver, from selenium.webdriver.chrome.service import Service, and then from webdriver_manager.chrome import ChromeDriverManager. Those are the imports. The first thing we do is create a driver, so we say driver equals webdriver dot
Chrome, with a capital C. Then we say that the service we're going to use is created on demand: inside Service we pass ChromeDriverManager().install(), so it installs a chromedriver if it doesn't find one. That's the idea; we don't have to specify a concrete path to a chromedriver binary, we can just have it installed like this. It's just a convenient way to do it.

Now, again very simple, we specify a URL. As I said, I'm going to use my own page, neuralnine.com; I'm going to go to my books page and just scrape the headings of the books. So I say driver.get(url), and then soup = BeautifulSoup based on whatever the page source is: I go to the URL, get the page source, and parse it with lxml, which is why we installed that. The headings I'm looking for are soup.find_all of h2 tags, with the limitation that the class has to be elementor-heading-title (you can see my page is based on WordPress). Then for heading in headings, print heading.get_text(), just so we see that this works in general. Then we sleep for ten seconds, just so we can look at it a little bit, and then we quit.

So again, a very simple use case, nothing too fancy. I just want to show you that this works here on my desktop system. I can run this, and without me doing anything it opens up a Chrome instance: you can see it opens a window, and it needs a monitor to do that. It goes to my page and scrapes all the relevant information that I specified. It stays open for a bit, and at some point it closes. So that's a simple web automation script. I cannot take this now, deploy it on a server as it is, and expect it to work.
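A minimal sketch of the script described above. The books URL is my guess from the channel name, and the Elementor class name is as dictated in the video; treat both as assumptions. The parsing step is pulled into a helper so it can be exercised without a browser.

```python
import time

from bs4 import BeautifulSoup


def extract_headings(html, parser="html.parser"):
    """Return the text of every Elementor <h2> heading in the given HTML."""
    soup = BeautifulSoup(html, parser)
    return [h.get_text(strip=True)
            for h in soup.find_all("h2", class_="elementor-heading-title")]


def scrape_books():
    """Open the books page in Chrome and print its headings."""
    # Selenium imports are local so extract_headings works without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # webdriver-manager downloads a matching chromedriver if none is present.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get("https://www.neuralnine.com/books/")  # URL guessed from the channel name
    for heading in extract_headings(driver.page_source, parser="lxml"):
        print(heading)
    time.sleep(10)  # keep the window open for a moment, as in the video
    driver.quit()
```

Calling scrape_books() on a desktop opens a visible Chrome window; the container-friendly options come later.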
I cannot just dockerize this and expect it to work as it is, and I'm going to show you that this is the case. So the first thing we do is create a Dockerfile, which is going to be the basis for dockerizing this. Of course you need to have Docker installed and working on your system; you don't need to understand everything about Docker, but you should be able to run it on your system. I'm not going to cover that part here.

In the Dockerfile we say FROM python:3.10, since that's the version I'm using. We set the working directory to /app, and we copy everything from this directory into /app. Then we run pip install; for this we need a requirements.txt file, where we put selenium with version 4.9.1, beautifulsoup4 (we need two equals signs) with version 4.12.2, webdriver-manager version 4, and lxml, whatever version, that shouldn't be too important. Those are the requirements. So in the Dockerfile we say pip install, we can also specify --trusted-host pypi.python.org if we want, then -r requirements.txt, which should be part of the /app directory since we copied everything in. Then we run some Linux commands: apt-get update, then apt-get install -y wget unzip; those are just two tools that we're going to need, and we use them right away: wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb.
Then we continue: apt-get install -y the file we just downloaded, google-chrome-stable_current_amd64.deb, then we remove that .deb again, and finally apt-get clean. So what this does is basically: it updates the system packages, installs the two tools we need for unzipping and for fetching something from a URL, downloads Chrome from the official Google site, installs it, removes the installer, and cleans up with apt-get. Nothing too fancy. The command we need for the container is python main.py, since that's what the file is named. That's what we need to create a Docker container.

To actually build it, we open up the command line and navigate to our working directory; in my case it's the current directory, located at Documents/Programming/NeuralNine/Python/current. Now we run a Docker command, and again, you need to have Docker installed and running: the daemon needs to be running in the background, or if you have Docker Desktop, Docker Desktop just needs to be running. Docker needs to be active for you to use it. Then we just say docker build -t and a name for the image, for example scraping-selenium, and of course we provide . as the current directory. You can see it builds the container. The idea of using a Docker container is that you have a containerized environment: you don't depend on any system packages or system resources, because the Docker container is its own thing, its own full environment with all the packages, all the tools, all the Python libraries, everything it needs. It doesn't rely on my Linux system here, and it doesn't rely on your Windows system.
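Reconstructed from the description above, the two files might look like this. The pinned versions are read from the audio, so treat them as assumptions (in particular the beautifulsoup4 pin, which is hard to make out):

```
# requirements.txt
selenium==4.9.1
beautifulsoup4==4.12.2
webdriver-manager==4.0.0
lxml
```

```dockerfile
FROM python:3.10

WORKDIR /app
COPY . /app

RUN pip install --trusted-host pypi.python.org -r requirements.txt

# wget and unzip fetch and unpack things; install Chrome from Google's
# official .deb, then remove the installer and clean the apt cache.
RUN apt-get update && \
    apt-get install -y wget unzip && \
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb && \
    apt-get install -y ./google-chrome-stable_current_amd64.deb && \
    rm google-chrome-stable_current_amd64.deb && \
    apt-get clean

CMD ["python", "main.py"]
```

With both files next to main.py, the image is built with docker build -t scraping-selenium . and run with docker run scraping-selenium.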
You can just run this on a server that has Docker, and it brings its own environment, which makes it predictable to some degree: it always uses the same packages and the same operating system, and it doesn't depend on any differences between operating systems. That's the idea. The problem is, as I said, that our application will not run in this Docker container right away. Of course you can configure the container to have a display and to work differently, but if you want to take this container and deploy it on a web server without a screen, you're not going to have a screen.

So our Docker container is now built, it's there, and what we can do now is say docker run scraping-selenium. This runs the full container with the command python main.py, and you can see we get an exception. The exception is: the process started from chrome location is no longer running, so ChromeDriver is assuming that Chrome has crashed. Now, the error doesn't necessarily tell us exactly what to do, but there are a bunch of things we can do to get rid of this problem. So let's close the terminal and go into our code to add something called options. We say from selenium.webdriver.chrome.options import Options, and before actually creating the Chrome driver we define chrome_options = Options() and add a bunch of arguments with different functionality. The first one we add is the argument --no-sandbox. Now, this can be a security concern: you don't want to do this unless it's necessary, and you don't want to do this unless you know what you're doing, but essentially --no-sandbox disables the
sandbox of Chrome, and that can be necessary for containerized environments. You can try without it, but sometimes it doesn't work and you need --no-sandbox to be able to run on a server; oftentimes there is no way around it. The second thing we add is the --headless option. Headless basically just means it runs without a screen: it's not going to open up a graphical user interface, it will just do what it does without an actual window being displayed. This is useful for automated testing, and of course you need it if you don't have a screen. And finally we say chrome_options.add_argument with --disable-dev-shm-usage. The idea here is that by default Chrome uses /dev/shm, the shared memory, and the problem is that in server and containerized environments this can be too small. If you use this option, Chrome uses a temp directory instead, writing to disk instead of shared memory, which is of course slower, but sometimes, again, necessary.

So those are the three options. Using them, and of course passing them to the Chrome driver by saying options=chrome_options, is what makes this compatible with our Docker container. Now we go back into the terminal and navigate to our working directory again. First of all we need to rebuild, because the source code has changed, so docker build -t scraping-selenium . builds the container again, and afterwards we can run it. All right, it's now finished, so we can go ahead and say docker run scraping-selenium and see whether it works or not. It takes some time, and then we should hopefully see the headings and no error message.
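Putting the three flags together, a minimal sketch; the flag strings are standard Chrome switches, and build_chrome_options is a hypothetical helper name, not from the video:

```python
# The three switches that make Chrome workable inside a container:
#   --no-sandbox             turn off Chrome's sandbox (a security trade-off)
#   --headless               run without opening any window
#   --disable-dev-shm-usage  write to /tmp instead of the often-tiny /dev/shm
CONTAINER_FLAGS = ["--no-sandbox", "--headless", "--disable-dev-shm-usage"]


def build_chrome_options():
    """Collect the container-friendly flags into a Selenium Options object."""
    # Imported locally so the flag list itself is usable without Selenium.
    from selenium.webdriver.chrome.options import Options
    chrome_options = Options()
    for flag in CONTAINER_FLAGS:
        chrome_options.add_argument(flag)
    return chrome_options
```

The driver is then created with webdriver.Chrome(service=..., options=build_chrome_options()), matching the options=chrome_options step described above.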
There you go: it took some time, but we get the actual output. So the scraping process worked, with --headless, with --disable-dev-shm-usage, and with --no-sandbox, because now it is compatible with a Docker environment. We can still run this locally, by the way; I can still run it here, but we're not going to see any window pop up. It does everything using the Chrome driver without the graphical user interface, and it gives us the result in the end. It probably took so long because of the sleep, so that could be the reason the Docker container seemed slow. But you can see we get the output; the headless option is the reason we don't see anything, and it worked in a Docker container. When we deploy this on a server, it's also going to work on that server, even if it doesn't have a screen and has all these limitations.

So that's it for today's video. I hope you enjoyed it and learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below. And of course, don't forget to subscribe to this channel and hit the notification bell to not miss a single future video, for free. Other than that, thank you very much for watching, see you in the next video, and bye!
Info
Channel: NeuralNine
Views: 25,687
Keywords: python, selenium, web scraping, tutorial, headless webscraping, docker, docker container, dockerization, containerization
Id: xrYDlx8evR0
Length: 16min 21sec (981 seconds)
Published: Fri Oct 27 2023