Scrapy and Selenium - Scraping Dynamic Sites Faster!

Video Statistics and Information

Captions
In this video we are going to talk about how Selenium can work with Scrapy, and why you would need that. The example is this site: the product listing is dynamic, so you cannot scrape this page directly with Scrapy. But once you get all the product links from the listing, the product detail pages are not dynamic, and Scrapy can process them very efficiently. One more way to solve this kind of problem is to use Splash, but the problem with Splash is that it needs Docker, and not everybody has a machine capable of running Docker, or you may simply have issues running it.

The first thing we are going to do is install Selenium, so just run pip install selenium. The second step is to download the driver. I'm using Chrome, so I'm going to use ChromeDriver. Verify the version number of the browser you're running: I'm on version 92, so click on version 92, download the zip file, and unzip it.

Now, if you were working only with Selenium, what kind of code would you write? Something like this: from selenium.webdriver import Chrome, specify the path where you extracted ChromeDriver, and create an instance of Chrome, passing that driver path to the executable_path parameter. Once you have created this driver instance, you can call the get method and supply the URL. This will launch the Chrome browser and load the page. Once it is loaded, you can use an XPath or CSS selector; I've chosen XPath, but any of the Selenium methods will work here. Remember that Selenium returns the elements themselves, so once you have extracted all the link elements, you have to run another loop to get the actual links contained in the href attribute. This loop simply runs over all the link elements and extracts each href attribute value.
Then we append each value to the list we created, and finally quit the driver. Let me show you the output: we run python3 with this file path, it opens a Chrome browser window (let me put it on the right side), and once the page is loaded, all the links are printed. As you can see, all the links have been printed, and now we are ready to move to the next step: integrating this code into a Scrapy project.

Why am I working with a full project? Just to show you how nicely this fits into complex scenarios. The command to create a Scrapy project is scrapy startproject, and let's call it sel_scrapy. This creates a directory with a number of files inside. Let's cd into it, and now we are ready to generate the spider: scrapy genspider laptops x. This one is for laptops, so let's call it laptops; for the start URL parameter I'm just putting x as a placeholder. I'm going to open this entire folder in VS Code. The file open right now is the Selenium file we created, so let's open the spider. This is our spider, and of course we are going to get rid of allowed_domains and start_urls; we don't need them. Again, think about what exactly we want to achieve: we want to run the Selenium part once, and then process every link we get from it as a regular Scrapy request. Keep this in mind. Let me copy everything, come back to the laptops.py file, make use of start_requests, paste everything, correct the formatting, and let
me remove all the unnecessary things. We can get rid of the print; we don't need it. And instead of links.append(href), we can very simply yield scrapy.Request directly, with url set to href. That's it. See how few changes we have made? Very little. I'm going to copy this import and bring it here as well, and I'm going to add one more import: Chrome Options. Let me create an instance of Options and set the headless property to True. This options object is also something I'm going to pass to the driver, as options=options. This makes sure Chrome is not visible while the spider is running. Otherwise we have made no changes. So now, in the parse method, we will actually be on the product page, and from there we can extract all the information like a normal, regular Scrapy spider.

Let me show you how, and let's do something more fun while we're at it. First, keeping the driver path hard-coded here is not a great idea, so we can move it to the settings file. Let me cut it, open settings.py, and create our own custom setting; let me call it CHROME_DRIVER_PATH (you can call it anything). How do we read this custom setting in the spider? For that we use from scrapy.utils.project import get_project_settings. get_project_settings gives you access to all the settings written in settings.py: you create a settings variable, and from it you can get this particular value, CHROME_DRIVER_PATH, which has to be a string. Next, we can go to the items.py file and quickly create our item. I'm going to remove the default field; this is going to be the product name, which is a scrapy.Field, and let's get the price as well.
This is again going to be a scrapy.Field. Now we need to import the item, so what is the full path? from sel_scrapy.items import, and what is the name of the class? Let's call it LaptopItem. Import LaptopItem, and in the parse method we create a LaptopItem instance. Now we need to get the actual names and prices, so let's go to the inspect tools. The name is an h1, and there is only one h1, so that should be straightforward. Let's look at the price as well, and find a suitable class: the price is inside this span, it looks orange, and we have one class here, pdp price color orange, so let's check for it. There is the selector, so let's copy it and come back to the code. We set the price with response.css using that selector, extracting the text, then get(). Similarly we set the name, which is very simple since there was only one h1: h1::text, then get(). This much should be sufficient, and finally we can yield the item.

Let's go to the terminal and run it: scrapy list to verify, then scrapy crawl laptops, and let's see. The first request, where Selenium is loaded, is of course going to take time, but after that we very quickly see all the product items. Done. We can see the product names and the prices here; 40 items were scraped, and look at the elapsed time: it took only 13 seconds.

If you are curious about pagination, it also has to be handled in the start_requests method. In that block you are creating 40 Scrapy requests; once that for loop is done, you can click on the next button to get the next 40 items, run an outer loop, and handle all the pagination there in Selenium itself. That's all for today's video. If you have further questions, do let me know in
the comments. I'll see you.
Info
Channel: codeRECODE with Upendra
Views: 901
Rating: 5 out of 5
Keywords: python web scraping tutorial, python scrapy tutorial, scrapy for beginners, Python Web Scraping, selectors in scrapy, web scraping python, how to scrape data, scrapy javascript, browser scraping, scrape web pages, website scraping, python scraping, scrapy tutorial, screen scraping, data scraping, Python Scrapy, scrapy splash, web scrapping, web scraping, web crawler, webscraping, scraping, scrapy, selenium, scrapy selenium, scrape dynamic sites, python
Id: 2LwrUu9yTAo
Length: 9min 1sec (541 seconds)
Published: Sun Sep 05 2021