Coding Web Crawler in Python with Scrapy

Captions
What is going on, guys? Welcome back. In today's video we're going to learn how to build a web crawler in Python using Scrapy, so let us get right into it.

Before we get started with the actual tutorial, I would like to mention that this video was sponsored by IPRoyal, and I encourage you not to skip this part, because the company is very relevant to the project we're going to build today. In a nutshell, we're going to build a web crawler, and when you work on large-scale, professional web scraping and web crawling projects, you will run into the problem of getting blocked or banned from certain websites, because your IP address is sending too many requests, too frequently. The way to circumvent that is to use proxy servers: you send your requests through the proxies and receive the responses through them as well, and if you use multiple proxy servers you can rotate them, so the requests come from several different locations and IP addresses, which makes it very unlikely that your own IP address gets banned. The problem with all the free proxy lists out there is that they are unreliable, unsafe, slow, and most of the time already banned, because everyone uses them. I would not trust a free proxy server; I always ask myself what the operator gains from letting me route my connection through them for free, and that is always a little suspicious. That is why you want to use a professional company like IPRoyal. They provide multiple proxy services with over 2.8 million IP addresses, and if you go to their website, iproyal.com (you will also find the link in the description down below), you can see the different types of proxies they offer, the use cases of their services, and all the locations you can use. For hobby projects this might be a little overkill, but if you want to send many requests frequently without getting your own IP address banned, and you want trusted IP addresses from a reliable, professional service, you want to use something like IPRoyal. There is a limited offer: you can get 50% off your residential proxy plan using the NeuralNine coupon code from the description.

Now let's get started by briefly talking about the difference between web scraping and web crawling. In today's video we're going to build a web crawler, but we're also going to do some web scraping. The two concepts are closely connected and often combined, but they are different and have different use cases: the main goal of web scraping is to extract data from websites, whereas the main goal of web crawling is to find links, URLs, branches of a given website. Before the explanation gets too technical, here are some examples. We're going to use books.toscrape.com, a website made specifically to be scraped and crawled. A typical web scraping task on this website would be to take a page like this one and extract only the data we care about.
Such a page has a lot of information: a warning, the number of results, a title, categories, and so on. We can see a bunch of books listed, each with a title, a price and availability information, and a web scraping task would be, for example, to go through all the category pages (religion, historical fiction, classics, philosophy and so on), extract the titles and prices of the books, or also the ratings, and then do something with that data: data analysis, storing it in a data frame or a database, or whatever else. That is web scraping: extracting actual information from the website.

A web crawling task, on the other hand, would be to find URLs, links that match a certain pattern. For example, when we click on any of the categories, the URL follows a pattern: it contains catalogue/category, and every category has that pattern in its URL, but when we click on a book, the URL only contains catalogue, not category. So one web crawling task would be to instruct a crawler to find all the links that match the catalogue/category pattern, following the available links to discover more of those category links. On this site that is quite trivial, because there is a sidebar where all the categories are listed, but on a real-world production website you will often have something like the top ten categories clickable on the front page and a hundred more that you can only find by, for example, opening a book page and discovering ten secondary categories that the book also belongs to. You could instruct a crawler to go into all the different book pages, find the secondary categories, and collect every page that is a category page. That would be web crawling: finding URLs, finding links. A web scraping task would then be to go to a book page, extract the title, the price, the availability and the rating, and do something with that information afterwards.

In this video we're going to use a Python package called Scrapy, which is a web scraping and web crawling framework, so not just a simple library or module but an actual framework: it creates a new project with multiple Python files, settings files and middleware files. It's a little bit like Django, but for web scraping and crawling projects instead of web applications. To use Scrapy we first need to install it, so we open up a command line and run pip install scrapy. Once it's installed, we can use it from the command line to create a new project. I go into the terminal inside PyCharm, navigate to the working directory, and once you are in the directory you want to work in, you run scrapy startproject followed by a name, in this case neuralcrawling; you can choose whatever name you want. Once the project is created, you will see that your chosen directory now contains a new directory with the project name.
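The setup just described boils down to two commands in a terminal (the project name neuralcrawling is the one used in the video; pick any name you like):

    pip install scrapy
    scrapy startproject neuralcrawling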
Inside that directory there is another directory with the same name, as well as a config file, and inside the inner directory there are a bunch of Python files: items, middlewares, pipelines, settings. Some of them will be covered later in the video, but the most important directory here is the spiders directory. Spiders are essentially the constructs we use for the actual web scraping and crawling: we create our own custom spiders to define the crawling behavior, to define which URLs we are interested in, and to extract certain information from the individual pages.

So the first thing we do is right-click on the spiders directory, create a new Python file, and call it, for example, crawling_spider. In there we start with some imports: from scrapy.spiders we import CrawlSpider as our base class (we're going to inherit from it) and also the Rule class, so that we can specify rules for choosing URLs and links, and from scrapy.linkextractors we import LinkExtractor, obviously in order to extract links. Then we create a new class, our custom spider class, which I'm going to call CrawlingSpider. You just don't want to call it CrawlSpider, because that is already the name of the class we're importing and inheriting from, so give it a name that is not already used; in my case CrawlingSpider, inheriting from the CrawlSpider class.

The first thing we set inside the class is a name for the spider, for example name = "mycrawler". This is the identifier of the spider: when we want to use it for crawling, we refer to it by that name, and you can choose whatever name you want. It is the name we will provide on the command line, because later on we're going to run scrapy crawl mycrawler; this needs to be executed inside the project, so we first change into the project directory and then run scrapy crawl mycrawler, which executes the crawling and scraping strategy. Right now it doesn't do anything, but this is the identifier of that particular class.

Next we allow certain domains: allowed_domains is a list of the domains we want to accept for our links, for our crawling and scraping. Since we're working with books.toscrape.com, the only domain we're interested in is toscrape.com; we don't want to follow links to any other domains, not to Facebook, YouTube or Amazon, we want to stay on toscrape.com, so that is the only allowed domain. Then we also need to specify a start URL, because we need a base point to start from; we cannot just start scraping from nowhere, we need one page where the crawler begins, finds links, analyzes them, follows them to other pages, finds more links there, and so on.
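For orientation, the generated project has roughly this layout (crawling_spider.py is the file we add ourselves in a moment):

    neuralcrawling/
        scrapy.cfg
        neuralcrawling/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                crawling_spider.py   (our custom spider)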
So we provide a starting point; in this case I'm going to use http://books.toscrape.com as the starting page, and from there we try to find additional links. Which links we want to find is defined by our rules, so we write rules = and use normal parentheses, because this is going to be a tuple of rules. We start with a simple rule for collecting the category links: we add a LinkExtractor instance and allow the pattern catalogue/category. Remember, when we click on the individual categories on the page, their URLs all contain catalogue/category followed by something else, while a book URL only contains catalogue, not category, so with that filtering rule we should only find category pages. One important little Python detail to take care of: with just a single element this is not yet a tuple, so you need to add a trailing comma at the end. If you have multiple rules, the commas between them are enough, but with just one rule you need that extra comma, otherwise it's not recognized as a tuple.

So now we have a crawling spider that allows the toscrape.com domain, starts at this URL, and looks for all the links that contain catalogue/category; that's the basic idea. We can already execute this crawler: we change into the neuralcrawling directory and run scrapy crawl mycrawler, and this should find all the different... it doesn't. What's the problem? start URL... oh, start_urls, it has to be plural. There you go, and now you can see it starts finding all these URLs: it gets the non-fiction category, fiction, children's books and so on. It extracts every URL on the website that matches this pattern, so this works quite well. But maybe we also want to extract the individual books and get some information from them.

Before we go any further with the rules, I want to show you one nice feature of Scrapy, which is the shell. We can use the shell interactively in the command line to extract certain things from a website and see, for example, which elements we get when we try a selector. Let me open my prepared example so that I don't do anything stupid here: we take this URL, copy it, and in the command line run scrapy shell followed by that URL. It navigates to the page, opens Scrapy interactively, and fetches the web page, and now I can inspect the response interactively. I have this response object; it's a 200, so the request was successful, and it got this page. Now I can use the css function to get certain elements based on CSS selectors.
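Putting the pieces described so far together, a first version of crawling_spider.py looks roughly like this (a sketch; the allow pattern matches the catalogue/category URLs of books.toscrape.com, and mycrawler is the spider name used in the video):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class CrawlingSpider(CrawlSpider):
        name = "mycrawler"                           # identifier used with: scrapy crawl mycrawler
        allowed_domains = ["toscrape.com"]           # never follow links off this domain
        start_urls = ["http://books.toscrape.com/"]  # page where the crawl begins

        rules = (
            # follow every link whose URL contains catalogue/category
            Rule(LinkExtractor(allow="catalogue/category/")),
        )

Note the trailing comma, which turns the single Rule into a one-element tuple; run it from inside the project directory with scrapy crawl mycrawler.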
For example, response.css("h1") gives me all the heading 1 elements; in this case that is just "All products". But I might also want to get the book links, so I click inspect and see that each one is an a element sitting inside an h3 heading. Back in PyCharm I can run response.css("h3") and get all the h3 elements with their respective links inside them. We can also go further: instead of just getting the selector, I can call the get function on it to get the actual HTML code, the actual content, or I can ask for the text by appending ::text to the selector. For h3 that doesn't return anything directly, because the text sits inside the a element, but for h1 it works, since h1 contains text directly, and I can do the same thing with the a selector. The problem with get is that it only returns one element, so if I want all of them I use getall (with a lowercase a at the end), and then I get all the individual links with the text inside them. So you can play around with the shell interactively to see what you have to do to get certain elements, and then build your rules and strategies based on that.

We can also select by class. Somewhere on the page there is a page header, and instead of a specific element like h1 we can select the CSS class: response.css(".page-header").get() returns that element, so we can filter by class as well, and we could also use regex. One last thing I want to show you: we can also use XPath. I can ask for all the links and, from those links, their text, then call extract, and I get the same result as before when we got the text of the links. If you want to know more about XPath, I already have a tutorial on that on my channel that you might want to look up. But this is what you can do interactively in the shell.

Now let's implement that in our spider. Say we're looking for specific things and we don't just want to play around interactively: we want to get all the books, for example, as well as all the categories, and do certain things with them. So I define a second rule where I allow all the URLs that contain catalogue but deny all the URLs that contain category. Here we're interested in all the catalogue links that do not have category in them, so we won't match any category pages, but we will match all the other catalogue URLs, which are the book pages. For those results we define a callback: when the crawler finds these URLs, it should hand them to a method called parse_item, which we pass here as a string. This method now needs to be defined in the class, and it will handle every page found by this rule. So we write def parse_item(self, response), and in there we decide how to scrape these pages.
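To make the shell exploration above concrete, such a session might look roughly like this (selectors assume the markup of books.toscrape.com; the results shown in the comments are abbreviated):

    # started with: scrapy shell http://books.toscrape.com/
    response                                  # <200 http://books.toscrape.com/>
    response.css("h1::text").get()            # 'All products'
    response.css("h3").getall()               # HTML of every <h3>, each wrapping a book link
    response.css("h3 a::text").getall()       # the link texts, i.e. the (truncated) book titles
    response.css(".page-header").get()        # selecting by CSS class instead of tag
    response.xpath("//a/text()").extract()    # link texts again, this time via XPath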
So we crawl these links: the crawler finds all the links that fit that rule, and we feed them into parse_item to extract something from them. The crawling happens in the rules, the web scraping happens in this method, and what we want to get is the title, the price and the availability of each item. (PyCharm complains about the indentation here; we'll figure that out as we go.) So we go to an actual book page and right-click to inspect the elements we want: the title is an h1 inside a div with the class product_main, the price is the p tag with the class price_color, and the availability is a tag with the two separate classes instock and availability, which contains the availability text with the number we want.

In parse_item we yield the results. We use the yield keyword because we want a generator here, not a return value, so that Scrapy receives the items one by one. We yield a dictionary: the title is response.css(".product_main h1::text").get(), exactly what we did interactively in the shell; the price is response.css(".price_color::text").get(); and for the availability, since instock and availability are separate classes, we can just select .availability and take its text. The catch is that this selector returns two or three text nodes, and we actually want a later one, so we index into the results; this is specific to this page, and as far as I remember from preparing the code we want the third instance, index two. That is what we yield.

Now let me briefly figure out the problem the IDE is showing. Is it the indentation? Do I just need another comma after the rules? No, that doesn't seem to be it; it turns out this block simply isn't indented properly, so I press Tab a few times and now it should work. So we now have two rules: find all the category pages, and find all the catalogue links that don't have category in their URL and pass those responses to parse_item, where we extract title, price and availability. That is already our spider. Now we open the terminal and navigate to the project directory.
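The complete spider at this stage looks roughly like this (a sketch; the CSS classes are the ones found on books.toscrape.com, and the availability line is shown in the corrected form the video arrives at after the debugging described below):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class CrawlingSpider(CrawlSpider):
        name = "mycrawler"
        allowed_domains = ["toscrape.com"]
        start_urls = ["http://books.toscrape.com/"]

        rules = (
            # category pages: only followed to discover more links
            Rule(LinkExtractor(allow="catalogue/category/")),
            # book pages: everything under catalogue/ that is not a category page
            Rule(LinkExtractor(allow="catalogue", deny="category"), callback="parse_item"),
        )

        def parse_item(self, response):
            # scrape title, price and availability from a single book page
            yield {
                "title": response.css(".product_main h1::text").get(),
                "price": response.css(".price_color::text").get(),
                # second text node of the availability tag, line breaks and spaces stripped
                "availability": response.css(".availability::text")[1].get()
                                .replace("\n", "").replace(" ", ""),
            }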
We change into neuralcrawling and run scrapy crawl mycrawler. Before exporting anything, let's just run it to see the results, and then we'll learn how to export them into a JSON file. You can see it now scrapes all the books, and the categories as well; it goes through every page it can find, which will take some time, and in the end we have the extracted values for all these instances. Let me terminate this, and when I scroll up you can see these dictionaries: we have the title, we have the price, but the availability doesn't seem to work, so we may have to fix that, maybe by changing the index; whether it works for some instances I'm not sure, we'll check that later. What we want to do now, instead of dumping everything to the screen, is save it to a JSON file: scrapy crawl mycrawler -o output.json. This takes all the scraped information and writes it into the JSON file so that we can analyze it later. Once the run is done, we open output.json and see everything we collected: the title, the price, and, as I said, the availability doesn't work yet; we'll figure out why in a second. All these entries are what we got from the crawling and scraping: we found all the books via the crawling rule, and parse_item produced these JSON objects with the information.

To find out why the availability doesn't work, let's open one of the book pages interactively again with scrapy shell and do exactly what the spider does: response.css(".availability::text")[2].get(). In this case it says index out of range, which shouldn't be happening, so let's drop the index and look at the raw result with getall: there are \n characters in there, and index 0 is just whitespace while index 1 is the one we want. (I think I got a different result when I prepared the video, but fine.) So we should be able to take [1].get() and then do some string formatting: remove the \n characters and the spaces, so .replace("\n", "") and maybe also .replace(" ", ""). Let's see if that works. We exit the shell and crawl again into output.json; now there's a problem on line 20, it seems I forgot to close something, I think I accidentally used some Vim key bindings, so let me fix that and run it again. It starts scraping again; let's break here to see what happens, and we now have the availability as "In stock (16 available)", just with the whitespace removed as well. You can play around with the string formatting further; that's not really the topic of this video.
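For reference, the export run is scrapy crawl mycrawler -o output.json; with lowercase -o Scrapy appends to an existing file (recent Scrapy versions also accept an uppercase -O, which overwrites instead). The interactive probe of the availability field described above might look roughly like this; the exact text-node index and the example output depend on the page's markup:

    # inside a scrapy shell opened on one of the book pages
    response.css(".availability::text").getall()
    # -> a few text nodes; index 0 is only whitespace, index 1 holds the text
    response.css(".availability::text")[1].get().replace("\n", "").replace(" ", "")
    # -> something like 'Instock(16available)'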
In any case, you can now see that it extracts the correct information: we just had to change the index from two to one and replace the line breaks and spaces with nothing, and you could also use some regex to keep only the numerical value if you want the raw number. But this is how it works: you define the crawling rules and the scraping mechanisms, and the spider goes through the website automatically, finds the pages that are interesting, visits them, and parses each kind of page the way you defined. Here we still collect the category links, since we're interested in following them, but we only parse the book pages, the pages where category is not part of the URL. If we look at output.json, it still doesn't look right at first; I think that's just because -o appended the new results to the old file, so you have to replace the JSON file between runs. After doing that, the data looks better: we have the title, the price and the availability, and with a bit more string formatting you could get rid of "In stock" and "available" and end up with just a number.

Now, one problem you might encounter when doing web scraping or web crawling, especially at large scale, is getting blocked by certain pages. They recognize your IP address because you're sending too many requests or sending them too frequently, or they recognize that you're scraping and they don't want that, and they simply block you, so you can no longer access the site from your IP address. To prevent that you can use proxy servers: a server sits between you and the site you want to scrape, you send your request to the proxy, the proxy forwards it from its own IP address to the website, and you get the response back via the proxy as well. With multiple proxy servers you have multiple IP addresses, and it becomes much less likely that you get blocked. This is also why the sponsor of this video, IPRoyal, might be useful to you: they offer professional proxy services. Of course there are also lists of free proxy servers online, but those are unreliable, often slow, and I always ask myself why anyone would give that service away for free and what they gain from it, so I don't trust them much. If you're doing large-scale, professional web scraping, you want professional proxy servers, and IPRoyal offers them; you can check them out via the link in the description.

So how do we do that in Scrapy? It's actually quite simple, and there are several ways; one is to manipulate the HTTP_PROXY environment variable, but we're going to go a different way here: we define a constant called PROXY_SERVER and put the proxy server's address in it.
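A sketch of the proxy wiring that the next part of the video walks through; the middleware class name follows from the assumed project name neuralcrawling, and the proxy address is only a placeholder, not a working server:

    # settings.py: enable the project's downloader middleware plus Scrapy's proxy middleware
    DOWNLOADER_MIDDLEWARES = {
        "neuralcrawling.middlewares.NeuralcrawlingDownloaderMiddleware": 543,
        "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 1,
    }

    # middlewares.py (excerpt of the generated downloader middleware class)
    PROXY_SERVER = "http://127.0.0.1:8080"  # placeholder; put a real proxy address here

    class NeuralcrawlingDownloaderMiddleware:
        # ... other generated methods left unchanged ...

        def process_request(self, request, spider):
            # route every outgoing request through the configured proxy
            request.meta["proxy"] = PROXY_SERVER
            return None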
I'm not going to use an actual proxy address here, since I'm only showing how this is done; as a placeholder let's just go with localhost, and in practice you would put the address of your real proxy server there. Then we go to the settings.py file and find the commented-out DOWNLOADER_MIDDLEWARES section. We uncomment it, so we have DOWNLOADER_MIDDLEWARES with the neuralcrawling downloader middleware in it, and we also add scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware set to 1, which enables the use of proxies. So those are the two things we need in settings: uncomment the section and add that line. In addition, we go to middlewares.py, scroll down to the downloader middleware class, the one we just activated, and manipulate the request processing there: inside the process_request method we set request.meta["proxy"] to whatever proxy we chose. 127.0.0.1, localhost, obviously doesn't work, but once you have a proxy, either a professional one from IPRoyal, for example, or a free one you found online and want to play around with, you just pass it in here. By setting the proxy in process_request inside the downloader middleware, which we activated in the settings file together with the HttpProxyMiddleware entry, the requests are now sent via that proxy. If I try to run this right now we should see that it fails, so let me navigate to the directory and run scrapy crawl mycrawler: as you can see, it doesn't work, no connection could be made, because localhost is not a proxy server. But with a functioning proxy server in there it will work, the requests will be sent via that proxy, and if you rotate multiple proxy servers you can also avoid getting blocked. In addition to that, please don't spam websites; only scrape in a way that is responsible and ethical, and respect the resources of the site. But if you want to fetch some information here and there without getting blocked just because a site recognizes that you're scraping, you can use proxy servers like this.

So that's it for today's video. I hope you enjoyed it and learned something. Don't forget to check out the sponsor of this video, IPRoyal; you will find a link in the description down below as well as the NeuralNine coupon code, which gives you 50% off the IPRoyal residential proxy plan, but only for a limited amount of time, so if you are interested in that offer you should sign up as soon as possible with the coupon code, because it will save you half the money on that plan. Other than that, let me know in the comment section down below whether you liked this video, and as always, don't forget to subscribe and
hit the notification bell so you don't miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye!
Info
Channel: NeuralNine
Views: 105,101
Keywords: python, web crawler, web crawling, python web crawler, python web crawling, python crawler, web scraping, web scraper, scrapy, proxy, python web scraper proxy, iproyal
Id: m_3gjHGxIJc
Length: 34min 31sec (2071 seconds)
Published: Wed Nov 23 2022