How To Use the Scrapy CrawlSpider - An Example

Video Statistics and Information

Captions
Scrapy actually has two different spider classes we can inherit from. The first is the generic Spider, which we've shown plenty of times in other videos, but there's also the CrawlSpider. This does exactly what you might think: it crawls websites, finding links and following them. We can use rules and the LinkExtractor to decide which links we want to follow, and based on those rules we can pass those links off to a callback, our parse function, to scrape the data out. In this video I'm going to show you how.

Once you've started your Scrapy project and generated a spider from the default template, you'll end up with a class that has a name, allowed_domains and start_urls. The first thing we want to do is import the CrawlSpider: from scrapy.spiders import CrawlSpider, and Rule from the same module, then from scrapy.linkextractors import LinkExtractor. These let us set rules for the specific links we want to follow. I'm also going to change the class so it inherits from CrawlSpider rather than the basic Spider, because we don't want the plain spider template any more.

There are a few really key things to note when you're using the CrawlSpider. The first is that you need to set rules, and the second is that you cannot have a parse method called parse: it will override the CrawlSpider's default, so you need to rename it. I'm just going to call mine parse_item, because this is an online shop and we're going to be getting items.

The next thing to do is think about the rules we actually want. We come down to the next part of the code and type rules = (...), and this is always a tuple, so you can put multiple rules in; I'm going to have two in this instance. Let's grab the site real quick. Looking at this website, we have all of these different categories along the top. For demonstration purposes I'm going to ignore the fact that this is a Shopify site (so this isn't really the best way to do it) and that there is an "all" section, which would be another way to do it; for the cases where you don't have that, or you want to be more specific, I'm talking about all of these different categories here. We don't want to go through all of these manually, which would take so much time and effort; instead we can set rules on the LinkExtractor so the CrawlSpider visits these pages for us.

If I click on the first category, you'll see the URL at the top changes. Let's check how many pages this one has — not that many, so let's try a different one. Great, this one has several pages, so I'll copy the URL and paste it at the top of my code so we can all have a look at it. After the start URL we have the word "collections" and then the actual category, the collection itself. So we can set up a rule that tells our CrawlSpider to find all of the links with "collections" in them and follow them.
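At this point the spider looks roughly like this. This is only a sketch: the name, domain and start URL are placeholders, since the video never spells out the exact site.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ShopSpider(CrawlSpider):
        # placeholder values -- use your own project's name and site
        name = "shop"
        allowed_domains = ["example-shop.com"]
        start_urls = ["https://example-shop.com/"]

        # rules = (...) will go here

        # renamed from "parse" so it doesn't override CrawlSpider's
        # built-in parse method
        def parse_item(self, response):
            pass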
If we go back to the site and click on a product — I'll pick one that's in stock — you can see the URL changes again, and I'll paste this one underneath so you can see it. This gives us the "products" part. So I'm focusing on these two words, collections and products, and they're going to make up the core of my rules: when the CrawlSpider finds a URL with "products" in it, we want to call back to our parse_item function down here, and we're also going to let it follow the "collections" links.

Let's write the first rule. It starts with Rule, which we imported up top, and then the LinkExtractor, which says we want to extract links. Inside the LinkExtractor we can use two keywords, allow and deny. I'm going to say allow="collections", which is basically saying we're allowing it to follow any links with "collections" in them. If you don't give a rule a callback, it will automatically follow those links; you don't need to set follow=True (I think you used to need to, but you don't any more). Our second rule is going to allow "products" and set the callback to parse_item, with a comma between the two rules since it's a tuple.

At a quick glance we're saying: this rule follows collections, and this one handles products. However, there's a really important thing to note. If you look at the second URL we copied, it contains both "collections" and "products". That means the first rule will follow these product pages automatically, because the URL matches "collections". And when the CrawlSpider follows links, it only visits each one once: it goes to the product page under the collections rule, does nothing with it, and when it comes to the second rule — the one with our scraping callback — it says we've already been to this link, because we went there under the collections rule. To get around that, you also want to use the second keyword, deny, and put in the word "products". It looks like the two settings cancel each other out, but what it actually means is that the first rule will only follow links that contain "collections" and do not contain "products".

I'm just matching on whole keywords here, but you can use regex in allow and deny, and there are other options too: you can restrict the extractor so it only follows links matched by a selector, like XPath or CSS. I find plain keywords a nice, easy approach for well-structured websites, which most are these days. So this is going to let us follow the collection links all the way through, find the product links, and go from there — the two rules are sketched below.
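A minimal version of the two rules as described. Note that the allow and deny strings are treated as regular expressions by the LinkExtractor, so a plain word matches anywhere in the URL.

    rules = (
        # Follow category pages. deny="products" stops this rule from
        # "using up" the product URLs (which also contain "collections"),
        # so the second rule still gets to visit them.
        Rule(LinkExtractor(allow="collections", deny="products")),
        # Hand product pages to the scraping callback.
        Rule(LinkExtractor(allow="products"), callback="parse_item"),
    )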
Now for parse_item, to get that data out. If you were doing this properly, I'd highly recommend using the Item, the ItemLoader and some kind of pipeline to save your data, but this video is just about the CrawlSpider, so I'm just going to yield a plain dictionary. I'll grab a product URL and use scrapy shell to work out where our data is. Down in the terminal — after removing a stray double quote and spelling scrapy right; this has not gone particularly well — the shell loads and we can work out where to get the data from. Ctrl+L clears the screen. Open up the page again, inspect element, and I'll arrange the windows and zoom in so we can all see. Use the element picker and hover over the bit you want.

Let's go for the brand first. We can see a div with class "vendor", and the brand we want is in an a tag inside it. Back in the shell, that's response.css("div.vendor a::text").get(): div.vendor for the class, a space and then a for the anchor tags inside it, ::text because we want the text from the element, and .get() to actually return it. There's the brand, nice and easy. (If I'm zooming through this, that's because it's not really what this video is about; if you're interested in the basics, I've got a great video for that on my channel, which I'll link somewhere.)

Next is the name. Back on the website, the product name is in an h1 with class "title" — this website really is well structured; are you surprised that I picked it? So that's response.css("h1.title::text").get(); I typed it into my editor instead of the shell by mistake, but pasting it into the shell confirms it works. The final thing we want is the price, which is in a span with class "price". There's actually a space in the class attribute, but that's okay, we can just match on price: change h1 to span, and since it's a class we can do .price — and that works too. I'm not going to do anything extra with this data; I just want to show you that it works. So let's exit the shell (it's a Python shell, after all) and clear up. With those three selectors, our completed parse_item looks like the sketch below.
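Putting the selectors together — the vendor, title and price class names are specific to the demo shop in the video, so check them in scrapy shell against your own target:

    def parse_item(self, response):
        # selectors worked out interactively in scrapy shell
        yield {
            "brand": response.css("div.vendor a::text").get(),
            "name": response.css("h1.title::text").get(),
            "price": response.css("span.price::text").get(),
        }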
There we have our completed spider, so let's try running it. For the sake of not taking forever and scraping a whole load of data I don't need, I'm going to narrow the allow pattern on the collections rule to this one category, so it only follows links on the front page — the page we're starting on — that match it. We should still see that everything works and the data comes back, and once I know it works I'll remove the deny part so we can see exactly what that does too. Okay, let's do scrapy crawl and see what we get back. Great: we've got an item_scraped_count of 71 — let me make the terminal font bigger so that's easier to see. If I scroll back up, you'll see a new page that it found, and this product page just there: we crawled the collection page, found this product on it, and scraped it. We can also see that we're obeying robots.txt, which blocks us from a few pages — that's fine, no problem — but we got lots of products coming out: 71 items scraped.

Back on the website, what I expected was for it to hit the front page, follow the links matching that collection, and then scrape all the products while following every link it could find within it. Counting the products on a page — one, two, three... fifteen, sixteen — that lines up. Right at the top of the output there's one thing I want to show you. Among these URLs there's one with a different collection in it (whiskey stuff / products / staff picks — it's slightly off screen, but you can just see it), and you might think, why did we go to that one? There's a reason: that URL appears on the front page and matched our "products" keyword, so we scraped its data. If you look at the front page again, you'll see that both of our rules ran on it, and we picked up these products as well. So you've really got to know what you're after and think about the best way to go about getting it. If I wanted all of the items from this website using this method, I'd probably quite happily just remove this narrowing keyword and go through all the collection pages; it will ignore anything else. And if you notice other pages you're not interested in, you can make deny a list and have multiple words to deny.

Now I'm going to remove the deny keyword to show you what happens without it. In a new terminal, let's crawl again, this time with an output file, output.json, so we can see exactly which products we get back when we don't stop the first rule from running on all of the links. We got 17 items back that time, and if I scroll up you'll see that we do hit all of these product pages — we hit them all — but because our first rule no longer had the deny "products" keyword, it had already visited them. So the second rule doesn't go there and never triggers the callback to the parse function that actually scrapes the data. If I open the output, you'll see we only have the products that appeared on the front page itself, the same ones as before.

So I guess the most important things are: tweak your rules and do what you can to make them work best for you, depending on what you're actually after; bear in mind what I showed you here, that if you don't have deny in the first rule, it will visit each link the first time, and your callback will then never fire when the second rule tries to go there again; and remember to rename your parse function to something other than parse, because that will conflict with the CrawlSpider's defaults. Hopefully you've enjoyed this video and got something out of it, and you know how to use the CrawlSpider a bit more. If this was all a bit too much for you, or you're not totally sure whether you should be using the CrawlSpider or not, definitely check out this video here for more basic Scrapy stuff, where I go into a lot of this in much more detail.
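For reference, the two runs in the video correspond to the standard Scrapy CLI invocations below; the spider name "shop" is the placeholder from the sketch above, and -o appends the scraped items to a feed file.

    scrapy crawl shop
    scrapy crawl shop -o output.json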
Info
Channel: John Watson Rooney
Views: 2,894
Keywords: web scraping, python, scrapy, web scraping tutorial, web scraping with python, python web scraping, web crawler, web crawling, scrapy rules, scrapy linkextrator, scrapy crawl spider, crawlspider, crawlspider example, crawlspider rules, crawlspider scrapy example, crawlspider scrapy
Id: o1g8prnkuiQ
Length: 14min 32sec (872 seconds)
Published: Sat Oct 30 2021