Crawl and Scrape any website using Scrapy and Selenium

Captions
Okay guys, today we are going to discuss Scrapy: how we can use Selenium with Scrapy, and how, using the two together, we can crawl and scrape any website. As an example we will scrape daraz.pk and save all of its products into a CSV, so that by the end, as our description promises, you can scrape and crawl any website yourself.

Let's start by installing Scrapy. As you can see, I have already shown you where to copy the install command from. Before that, I am assuming you have already set up a virtual environment and that you know Python, because today I will use Python as the core language for working with Scrapy. Basically, Scrapy is a framework for crawling and scraping websites: it helps us create a bot that does the scraping and crawling, mines whatever data we want, and saves that data to files or to a persistent database like MySQL or PostgreSQL. First I copy the command pip install scrapy, then I run it in my code editor; for that, let me share my editor screen and show you today's target. I paste pip install scrapy, press enter, and as you can see in my console it is downloading the base libraries, so it will take some time to install.

Meanwhile: Scrapy has a standard way to scrape any website, and that is by creating a spider. As you can see, Scrapy has now installed successfully, so we can move further. I write the command scrapy startproject (no space inside "startproject") followed by our project name, daraz_pk, and press enter. It creates a folder with that name, we change directory into it, and then we generate a spider that will crawl daraz.pk and scrape its products: scrapy genspider is the command to generate a spider, and I give it the spider name and the domain. Press enter, and let's look at the project to see what that command did for us.

The scrapy startproject command created the daraz_pk folder and a scrapy.cfg file, which, as you can see, holds some settings like the project name and the default settings module. It also created files like __init__.py, items.py, middlewares.py, pipelines.py, and settings.py, and I will explain what each of these files does.

I'll start with settings.py. It has a BOT_NAME, as I'm highlighting: that is our bot, and it will crawl daraz.pk and scrape its products. The next thing is SPIDER_MODULES; right now I have written only the one spider generated by the command. NEWSPIDER_MODULE tells Scrapy where to put any new spiders we generate later. The third thing is ROBOTSTXT_OBEY, which is a flag: if we leave it True, our bot will obey the site's robots.txt file. That file exists for crawlers, so they can read it and know which parts of the website they are allowed to visit, and that's all.
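To recap, these are the commands from this section, plus the four generated settings lines we just walked through (a minimal sketch, assuming the project is named daraz_pk; the exact spelling on screen may differ):

    pip install scrapy
    scrapy startproject daraz_pk
    cd daraz_pk
    scrapy genspider daraz daraz.pk

    # settings.py (generated by scrapy startproject; trimmed to the four lines discussed)
    BOT_NAME = 'daraz_pk'
    SPIDER_MODULES = ['daraz_pk.spiders']
    NEWSPIDER_MODULE = 'daraz_pk.spiders'
    ROBOTSTXT_OBEY = True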
For the basic settings those four lines are enough for us right now, so I move on to pipelines.py. Pipelines are a really important concept in Scrapy. The class name here is DarazPkPipeline and it has one function, process_item. In this function we can process every item that we scrape and pass along: we can change the item or add something to it. For example, on Daraz we have to identify the element that carries a product's name or price, target that element specifically, and parse a string out of it; that string is what we can then process in this function. That is what the pipeline is for.

Let's move to the middlewares (and let me check that my stream is still working; it is, good). A middleware, as the name suggests, sits in the middle of the whole process: we can hook a middleware anywhere in between the request and response cycle. Basically we send a GET request to a website, download its HTML as the response, and then search for our elements inside that HTML; middleware lets us step into that cycle. That's the concept of middleware for now.

Next, quickly, items.py. This is the file where we define our items. Right now I am only going to scrape the names of Daraz products, so I define a name field; in the parse function of our spider class we will extract the name of every product on daraz.pk. The __init__.py file is empty for now.

So daraz.py is our first spider. For now we just print the response, and let's check the basic attributes of our class: as you can see it defines the name of our spider, daraz, and allowed_domains. allowed_domains is set wrong, that's a big mistake, so I update it to daraz.pk, and the same for start_urls: we are starting from the daraz.pk homepage URL, and I add www. to it. Now I put a breakpoint here and configure the Scrapy project to run from my editor. You can use any code editor; the configuration depends entirely on your operating system, and since I am on a Mac, mine may differ from yours. I point the run configuration at the scrapy executable, give it the arguments crawl daraz (that's our spider name), set the working directory to our project folder, and confirm it. I think we are good with that, so let's quickly start debugging our Scrapy project.

So far we have installed Scrapy; now let's debug the project and dig into it. As you can see in my debug console, it says it started a bot with the name daraz_pk, we are using Scrapy version 2.5.0, and it prints several things you can note: bot name, spider modules, robots.txt obey, and so on. Those are all the configuration values. Now we have stopped on our breakpoint, so let me show you what we have in the response. And here is the magic, guys: we have all the HTML of daraz.pk. It takes some time to print because it is so big, and it shows a lot of whitespace and newlines, but as you can see the status is 200, which means our request was fulfilled, and in return we have a massive body of HTML data.
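To recap, here is roughly what the files discussed so far look like (a minimal sketch; the class names follow Scrapy's generator conventions for a project called daraz_pk):

    # pipelines.py — generated skeleton; process_item runs once per scraped item
    class DarazPkPipeline:
        def process_item(self, item, spider):
            # clean or enrich each item here (e.g., strip whitespace from a name)
            return item

    # items.py — declare the fields we plan to scrape (only the name for now)
    import scrapy

    class DarazPkItem(scrapy.Item):
        name = scrapy.Field()

    # spiders/daraz.py — the spider generated by `scrapy genspider daraz daraz.pk`
    import scrapy

    class DarazSpider(scrapy.Spider):
        name = "daraz"
        allowed_domains = ["daraz.pk"]
        start_urls = ["https://www.daraz.pk/"]

        def parse(self, response):
            print(response)  # just inspect the raw response while debugging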
We can't read that raw HTML as it is, so let's move over to the daraz.pk website. We could target any product here, so let's scrape a single category's products: I am clicking on cleaning buckets and tubs, purely at random, I am not targeting any specific category. Let's copy this URL, as short as we can keep it, and replace our start URL with it. Okay, done.

Next we find the product elements by doing some reverse engineering: I inspect the page and try to find a class name or id. As you can see, this here is a product: that is the main container and that is each individual product card. So let's copy this class name and write response.css(...) in the spider. response.css is a function where we can pass any class or id; using CSS selectors we can select any element out of the response, which is basically the HTML of daraz.pk. So this expression is, you could say, the products element. That's good, and then we loop over these products: for product in products, and print each product. For now we are done, so I move the breakpoint here and debug again.

Okay guys, we are debugging the project again, and this time we found our selector but we hit some error, so let me step back a bit: I put a breakpoint at line 10 and try again to see if it stops there. It says we are being redirected somewhere else. And did we find products? No, it says we haven't found any product. I step forward and try again in the debugger: response.css with that class, let's count what it returns; nothing. Let's look for the parent container first: no, we can't find that either. What we can do is look for some element, any element, with XPath, so let's try that other selector function Scrapy offers. Please note the syntax carefully, it's important; as you can see, a little mistake can lead to a bigger problem, so keep an eye on the course. I query for the product class via XPath: it says there is nothing like that. Then I ask, do we have any divs at all? Yes, it returns the divs, so at least we can identify which elements are there. I try another class that should contain the products, try CSS again, and here I feel that I am stuck.

So let's try a Selenium request, because I suspect this is a JavaScript-rendered page, and that is exactly why we use Selenium with Scrapy. Some websites, for example plain PHP or Python server-rendered sites, return the finished HTML directly. Here, the base DOM is loaded first and the products arrive only after the DOM has loaded, so after sending a plain GET request we receive only the bare DOM elements; the products are simply not in our response, which is why our products list is empty. So we have to try it with a Selenium request.
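Here is a sketch of what those failed probes looked like inside parse() (the class names are placeholders; the real ones come from dev tools and weren't legible on screen):

    def parse(self, response):
        # CSS probe: the product-card class from dev tools matches nothing
        products = response.css(".gridItem--product")   # hypothetical class name
        print(len(products))                            # 0 — no products in the static HTML

        # XPath probes give the same picture
        response.xpath("//div[contains(@class, 'gridItem--product')]")  # empty list
        response.xpath("//div")  # plenty of divs exist — the bare DOM is there,
                                 # the product cards just haven't been injected yet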
So I will install the Selenium webdriver integration. I search for the scrapy-selenium GitHub repo; okay, I think this is the one, cool. I go to my console and install it, and it pulls in selenium and scrapy-selenium. We will use this to scrape Daraz, and I update settings.py with the webdriver settings, which is really important: I copy those lines to the beginning of the file. We are going to use the Chrome webdriver. I meant to copy chromedriver from my desktop because I thought I already had it, but I can't find it on my local machine, and it isn't in Downloads either, so let's download it: search for "chromedriver mac download". Please verify your Chrome version before downloading the actual webdriver; check your own version and download accordingly, and don't use a different version, it will never work for you. I am going to download the Mac 64-bit build. Good, we are done with that, and we copy the chromedriver binary into our project folder, then put its path, the absolute path in fact, into the settings; never mess with paths.

We will run it without the headless argument. Headless mode is an important concept of the Selenium webdriver: if we run headless, we never see what the Chrome webdriver is doing, and if we leave that argument out, we can watch what is happening inside the browser. The Selenium webdriver actually opens a browser, an automated browser: it runs commands like a human would, but in fact it is a bot, so it can visit a website, crawl it, scrape it, whatever you can do. A bot can even solve captchas, and I will share a lecture in the coming weeks about solving captchas like Google's or TikTok's.

So let's create a Selenium request. Moving back to the spider, I update all the settings accordingly, import the needed modules as I am doing here, and then simply yield the request, giving it its URL; I copy our current category URL and pass it in. This is to show you that we can drive the Selenium webdriver from Scrapy: it's simple, we installed the selenium module, I imported the SeleniumRequest class, and I pass the URL plus a callback function. I define that callback, def parse_result, which receives a response, and I print the response. We are almost done, so let's run it again and debug whether it works. Soon you will see the automated Chrome window pop up; note that it says "Chrome is being controlled by automated test software", which means the Selenium webdriver is doing its job. Now we get our page again, and this time we can check it right here in the automated browser, where the UI and the elements are fully rendered, so let's debug once more and try to find our data, starting with the products element.
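Here is a minimal sketch of the scrapy-selenium wiring, following that library's README (the chromedriver path and the category URL are placeholders):

    # settings.py — additions for scrapy-selenium
    SELENIUM_DRIVER_NAME = 'chrome'
    SELENIUM_DRIVER_EXECUTABLE_PATH = '/absolute/path/to/project/chromedriver'  # placeholder
    SELENIUM_DRIVER_ARGUMENTS = []  # leave out '--headless' so we can watch the browser

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_selenium.SeleniumMiddleware': 800,
    }

    # spiders/daraz.py — issue the request through Selenium instead of plain Scrapy
    import scrapy
    from scrapy_selenium import SeleniumRequest

    class DarazSpider(scrapy.Spider):
        name = "daraz"

        def start_requests(self):
            # placeholder for the category URL copied from the browser
            yield SeleniumRequest(url="https://www.daraz.pk/...", callback=self.parse_result)

        def parse_result(self, response):
            print(response)  # this HTML now includes the JavaScript-rendered products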
Each product has its own container, and this is the whole card, guys, I hope so. Let's debug and see whether we can now find it simply using the CSS selector through Selenium: yes, now we have all the products, because these are the JavaScript-rendered elements. Now that we have all the products for our discussion, let's copy that selector into the spider, apply a loop, and simply check our products; at the end we will get a CSV list of products.

Next, let's figure out the product-name element. Basically we have to figure out the name, so I find its class by inspecting the name in dev tools, and in the loop I write name = product.css(...); that is our product name. I define an empty dict here, items, where I will put each product name, and at the end we will yield this dict so it ends up in the CSV. Inside the loop I call items.update with the product name, so we are done with the loop body, and at the end of the program we yield. yield is an important concept; we will discuss it properly in the advanced lectures at the end of our Scrapy learning path, hopefully. Let's put a breakpoint on the yield; we are almost done with our scraper, we are near the end.

Let's refresh the debugger. Again it starts an automated browser via the Selenium webdriver, as you can see, and requests the Daraz category. We reach our loop, where I apply a breakpoint so we can analyze what is happening. Did we find a product name? Yes, we found one, that's good; we have the product name element, but currently it is a Selector, not a text value, so let's extract it. It's a div, so let's ask for its text; nothing, so let's look at its attributes and children instead. Now we have an <a> tag with a title, so let me dig down here: I evaluate the expression again in the debug console, select the <a>, and we get the <a> element with text between its tags. Now let's extract only the text: I add ::text at the end as the pseudo-attribute, and now we have the text. Done.

Let's refresh the debug console and update our run configuration to get the CSV. You only have to add one more argument: -o followed by the file name where the products should be saved. That's all; that is how we save all the products into a CSV. That's the Scrapy way, a really small command for a big thing, and that is how Scrapy helps us scrape or crawl any website. I apply it and we are done. Let's check whether we get product names: we do, guys, so we achieved what we started. Now I remove the breakpoint, check that it still works, and yes, it's working; I remove the prints too and simply let it scrape all the products until it says it is done. This is our last debugging run, and you will find that we have a products.csv file while our automated browser works as usual; right now products.csv is empty because the run has only just started.
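Putting it all together, a sketch of the final spider (both CSS classes are placeholders for the ones found in dev tools on the live category page):

    # spiders/daraz.py — final version
    import scrapy
    from scrapy_selenium import SeleniumRequest

    class DarazSpider(scrapy.Spider):
        name = "daraz"

        def start_requests(self):
            # placeholder for the category URL
            yield SeleniumRequest(url="https://www.daraz.pk/...", callback=self.parse_result)

        def parse_result(self, response):
            for product in response.css(".gridItem--product"):      # hypothetical card class
                items = {}
                # grab the text inside the card's title element
                name = product.css(".title--link ::text").get()     # hypothetical class
                items.update({"name": name})
                yield items

To write the results to a file, add the output flag to the crawl command:

    scrapy crawl daraz -o products.csv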
Once the run finishes, we get our product names in the file, and that's all, that's done for today: that is how you can scrape any website using Scrapy and the Selenium webdriver. Thank you very much, guys. If you liked it you can share it, and if you want more details about Scrapy and Selenium you can inbox me. You will find the whole tutorial wrapped up in my repo, so you just have to fork it, and do follow me on Facebook and GitHub. Thank you all very much.
Info
Channel: Leader Malang
Views: 177
Rating: 3.2857144 out of 5
Keywords: leader malang, python developer pakistan, php developer pakistan, django developer pakistan, laravel developer pakistan, odoo developer pakistan, sql developer pakistan, Crawl and Scrap any website, using Scrapy and Selenium, wesbite scraping with selenium, scarp data in mniunets, Install Scrapy and Selenium
Id: gBqET3Pdn54
Length: 33min 5sec (1985 seconds)
Published: Sun May 02 2021