Extract Links | How to Scrape Website URLs | Python + Scrapy Link Extractors

Captions
Hello everybody, this is a look at link extractors in Scrapy. They are a great way of crawling a site, and you don't have to use the CrawlSpider method: you can use link extractors within your own spider, and that's what we're going to look at today. So without further ado, let's begin some code.

First, let's consider what a link extractor is. A link extractor is an object that extracts links from responses, that is, from the HTTP responses your spider receives. You may have seen link extractors used with crawlers, but you can equally use one within a regular spider. A spider is a bit more precise, and if you want to extract all the links from a subdomain, you can do it with a Scrapy spider and a link extractor.

I've got VS Code open here. What we need to do is import LinkExtractor from scrapy.linkextractors; I'm just copying and pasting this from my notes. This is the key part, and we'll be using it in our main code: from scrapy.linkextractors import LinkExtractor. I'll also put up a short summary of link extractors and some sample code. Within your parse method you can use the link extractor, and as the Scrapy documentation says, you can also use link extractors in regular spiders, which is the good thing: if you're familiar with writing spiders, then it's familiar territory. You can adjust all of the parameters: allow and deny using regular expressions, domains, XPaths, CSS, text and so on. To actually use it, you set the extractor up (I'm going to do it in an __init__ method), then within your parse method you extract the links from the response, and for each link you yield a request to that URL, so it keeps iterating through them.

Let's begin the code; it's probably easier for me to do it than to explain it. I'll paste my logo comment in just under the UTF-8 line and save the file. I've created a workspace, and if you're familiar with VS Code, workspaces are brilliant because they keep everything together, and when you reopen one, everything's there. The file lives under documents/scrapers/ebay2 and is called ebay_spider_2.py, so I'm sure you've now guessed what site we're going to be scraping. The first thing we'll do is import scrapy, then from scrapy import Spider, Request: you don't have to import these, but if you do, it makes your code a bit less verbose, shall we say. Then the crawler process: from scrapy.crawler import CrawlerProcess (you can see VS Code suggests it as you type; remember the capitals). Then the extractor itself: from scrapy.linkextractors import LinkExtractor, again with a capital L, and VS Code has done us a favour and found it again. Autocomplete really does save you some typing, especially where you've got case-sensitive names, and all of these are case-sensitive, so it's important to get the casing correct. Let me also import os, because I'll do my usual thing of checking for the old output file.
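For reference, the top of ebay_spider_2.py as narrated so far might look something like this (a sketch; the logo comment mentioned above is left out):

```python
# Top of ebay_spider_2.py: the imports described in the walkthrough.
import os                                        # used shortly to remove the previous output file

import scrapy
from scrapy import Spider, Request               # optional, but keeps the class definition shorter
from scrapy.crawler import CrawlerProcess        # lets us run the spider without a full Scrapy project
from scrapy.linkextractors import LinkExtractor  # the key import for this video
```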
If the output file is already there, I'll remove it. So that's import os; let's just save the file, and no errors is good.

Now let's do class EbaySpider(Spider). We can use Spider with a capital S directly because we've already imported it above. What's next? name: call it something meaningful. This would matter more if you had created a Scrapy project and were running it with scrapy crawl -o followed by an output file name. Then start_urls as usual, and we'll say https://www.ebay.co.uk, except I'm not going to extract all of the links from ebay.co.uk, because that would be too many; I'm going to do /deals. If you're familiar with eBay, there's a daily page called Deals. Let's just look at it (why didn't I just click "follow link", that would have been easier, but never mind): ebay.co.uk/deals, deals, not eels, sorry about the typo. Here we go, you can see the daily deals, and we just want to extract all of these links. Why would you want to do that? Price comparison, price tracking, you might be after bargains, or you might be a company that's paying to advertise on eBay and you want to see whether your products are being listed in the daily deals. Anyway, that's the page we're going to be scraping.

Back to the code, and let's do the usual: try, and this is where the os import comes in, os.remove; I'm just going to use a text file, because we don't need a CSV, so ebay2.txt. If we get an exception because the file doesn't exist, we don't want it to throw an error, so in the except block we just do pass. So far so good.

What's next? Let's carry on with custom_settings, because this is a very minimalist approach and I'm not using a Scrapy project, so I'm going to type all of this in: CONCURRENT_REQUESTS, because if you don't want to use Scrapy's settings.py file you can just use custom_settings instead, and AUTOTHROTTLE_ENABLED set to True. That will do for starters; you get the idea. Basically, if you want to add any custom settings and you don't want to do it the recommended scrapy startproject way, just add custom_settings within your single spider.

Now for the different bit: def __init__. If you're familiar with object-oriented programming, you'll know __init__ runs when the class is instantiated, when it's first brought to life, so to speak. In there we set self.link_extractor (no autocomplete on the attribute name) equal to LinkExtractor, and there we go, it's found it for us again. You can use allow or deny, and it's completing those for us too; I'm just going to copy and paste the pattern from my notes, because you don't want to watch me type it, it would be boring. Next, def parse(self, response) as usual; we've got an implicit start_requests method, so we'll leave that as it is. And then comes the bit that you've just seen in the Scrapy documentation.
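Putting that narration together, a sketch of the spider class up to __init__ might look like the following. The allow pattern and the CONCURRENT_REQUESTS value are placeholders (the video pastes its pattern from notes without reading it out), and allowed_domains is shown here even though the video only adds it after the first run fails:

```python
class EbaySpider(Spider):
    name = "ebay_deals"                            # matters more when running via `scrapy crawl -o ...`
    allowed_domains = ["ebay.co.uk"]               # added later in the video after the first run returns nothing
    start_urls = ["https://www.ebay.co.uk/deals"]  # just the daily-deals page, not the whole site

    # Runs once when the class is defined: remove last run's output file if it exists.
    try:
        os.remove("ebay2.txt")
    except OSError:
        pass

    # custom_settings stands in for settings.py when you're not inside a Scrapy project.
    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # allow= takes a regex (or list of regexes); unique=True is the tweak added later in the video.
        self.link_extractor = LinkExtractor(allow=r"deals", unique=True)
```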
In parse we write for link in self.link_extractor.extract_links(response), because we want to get the links from the response. For each of those links we open the output file, with open("ebay2.txt", "a+"), a+ because we want to append each time, as f, then f.write. Let's use an f-string, start it with a newline, and write the string of the link, because the link object is not a string as we initially have it. Then, still inside parse, we need to yield. We need to get the indentation correct here: the yield has to sit outside the with open("ebay2.txt") block but still inside the for loop. We say yield response.follow(url=link, callback=self.parse), which is where we tell it where to go next, so it will recursively go through the links. Then we just need the usual if __name__ == "__main__" block with process.crawl; this is where you could run multiple spiders if you wanted to, because you could have several classes within this one file and run each of them (I referred to an example of that in my last video, in the pop-up screenshot). Then process.start(), and let's try it.

The other thing I've had to do is the usual trick of telling VS Code to run the code from the path of the script, which helps when you want to open and write to a file, so sorry about that. Let's run it now. I'm expecting it to get many, many URLs... well, it's not. Why is that? I've forgotten to add allowed_domains, which is not good: allowed_domains = ["ebay.co.uk"]. Let's test again: an error on line 36, in for link in self.link_extractor.extract_links(response), where we open ebay2.txt for append, write the string of the link and then yield. Investigating that, I'd also missed an underscore, so let's fix it and begin testing again. We should get a lot of URLs now, each one written to our text file, and we can see the dupefilter debug running and lots of 200 responses coming back, which is good.

I'm going to cancel this shortly and show a couple of adjustments you can make by modifying some of the parameters you pass. Kill the terminal and view the text file: it has ignored duplicates, but it has still got many, many links from the same page, as you can see. You could process those afterwards, obviously, or you can filter them as you're crawling, which is what I'm going to talk about next. From the Scrapy documentation, unique controls whether duplicate filtering should be applied to extracted links, so I'm just going to add that. If you hover over the parameter in VS Code it shows you the description, which is so good. We've already used allow, so let's just add a comma and unique=True, and if we run it again our code should delete the first text file and start over.
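For reference, the parse method and the run block described above, continuing the EbaySpider class from the earlier sketch (with the missing underscore already fixed), look roughly like this:

```python
    # Continuing the EbaySpider class sketched earlier.
    def parse(self, response):
        # extract_links() returns Link objects, not plain strings.
        for link in self.link_extractor.extract_links(response):
            # Append each extracted link to the output file; the f-string turns the Link into text.
            with open("ebay2.txt", "a+") as f:
                f.write(f"\n{link}")
            # Outside the `with` block but inside the loop: follow each link back into parse.
            yield response.follow(url=link.url, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()    # several spider classes could be registered here if you had them
    process.crawl(EbaySpider)
    process.start()
```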
Let's run it from here; this is again using the Code Runner extension. If you've just installed VS Code and don't have it, it's an extension that lets you just go up and click play, which saves you having to bring up a terminal and fumble around the keyboard. As you can see, it's picking up pgn, that's the page number, and hopefully it's not getting any duplicates, because we've set unique to True. By default we're extracting the href attributes, but you can also tell it to extract other attributes; read the documentation and experiment.

One more thing: let's kill that and look at the output in the text file. As I say, it looks like you've got lots of duplicates, but in actual fact you haven't, because there are multiple links from the same page with different query-string parameters on the end. You can parse those and do what you want with them. You can see on the end we've got text such as "UK only", so you could actually filter by some of this text; if we do Ctrl+F for "uk only", it finds 84 results.

What was I going to wrap up with? One last thing: I don't know if any of you have seen this site, but it's Program Creek, and it gives code examples for whatever methods you're trying to investigate or learn. There are 23 code examples for Scrapy link extractors, which is what I've just demoed. In one of them they use the __init__ method, with restrict_xpaths and the other keyword arguments all set to nothing, and the link extractor built from the keyword args (kwargs), so they're passing kwargs straight through to the LinkExtractor. Alternatively, if you don't use the __init__ approach, you can use rules with a link extractor: you set follow=True, or follow=False if you didn't want it to keep recursively searching for more URLs. In another example they're allowing anything with "browse" or "summary" in it, and if you allow something, then everything that isn't allowed gets denied. Similar again in the next one: allow=allow, deny=deny.
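As a sketch of that rules-based alternative (the class name and the allow/deny patterns here are placeholders, not the exact code shown on that site):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DealsCrawlSpider(CrawlSpider):
    name = "deals_crawl"
    allowed_domains = ["ebay.co.uk"]
    start_urls = ["https://www.ebay.co.uk/deals"]

    # Anything the allow pattern does not match is effectively denied.
    rules = (
        Rule(
            LinkExtractor(allow=r"/deals", deny=r"signin"),  # placeholder patterns
            callback="parse_item",
            follow=True,   # set to False if you don't want to keep recursing into new pages
        ),
    )

    def parse_item(self, response):
        yield {"url": response.url}
```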
Oh yes, one other thing: if you look at the Scrapy documentation, it actually lets you visit the source for each page, so let's go up and I'll show you. We want the source for extract_links, and it's all there in the Scrapy documentation: the source code for scrapy.linkextractors.lxmlhtml. One thing worth considering is that within the link extractor's code they refer to w3lib.url. If you investigate that, you'll find it's a very useful package; this is no longer the Scrapy docs, it's still readthedocs.io, but you can see all of the methods within w3lib, and the comments explain what each one does. If you're familiar with urllib in general, you'll know the names of the different parts of a URL, scheme, netloc, path, query and so on; if you're not, look up urllib, because each one describes a different section of the URL. There's also url_query_cleaner, which is how you replace query parameters. So under the bonnet, this is what Scrapy is actually making use of. For example, file_uri_to_path converts a file URL to a local filesystem path according to the Wikipedia file URI scheme, and path_to_file_uri converts a local filesystem path to a legal file URI. There are lots and lots of things to look at here.

Netloc again: I think I showed that in a very old video, but if you look up "netloc scheme URI urllib", that should find what I'm on about, which is urllib.parse; it parses URLs into components. A long time ago I did a video on this, but essentially the scheme is the URI scheme specifier, then you've got the netloc, the network location part, then the path, then parameters, query, fragment, username, password and hostname. In the documentation example the scheme is empty because there's no https, the netloc is the www part, the domain name followed by the port, the path is the bit that comes after, and then come the parameters, the query and so on. Have a read of "parse URLs into components" under urllib.parse, and have a read of w3lib, because that's what Scrapy's link extractor makes use of: a module containing general-purpose URL functions not found in the standard library. You can see it in turn makes use of os, base64 (an encoding), regular expressions, strings, collections and urllib.parse. So it all links together: urllib.parse features in w3lib, w3lib features in the link extractors, and those all feature within Scrapy.

I hope that helps. I'll put this code on GitHub and put the links there for you; have an experiment, it's the best way to learn. So until the next time, thanks for watching and subscribing and all that. Thanks, bye.
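For reference, the URL-handling pieces mentioned above, urllib.parse and w3lib's url_query_cleaner, can be tried on their own like this (the example URL and the kept parameter are made up for illustration):

```python
from urllib.parse import urlparse
from w3lib.url import url_query_cleaner

url = "https://www.ebay.co.uk/deals?ref=abc&page=2"

parts = urlparse(url)
print(parts.scheme)   # 'https'           -- the URI scheme specifier
print(parts.netloc)   # 'www.ebay.co.uk'  -- the network location (domain, optionally :port)
print(parts.path)     # '/deals'
print(parts.query)    # 'ref=abc&page=2'

# Keep only the query parameters you name; everything else is stripped from the URL.
print(url_query_cleaner(url, ["page"]))   # 'https://www.ebay.co.uk/deals?page=2'
```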
Info
Channel: Python 360
Views: 493
Rating: 5 out of 5
Keywords:
Id: kaBT35Sabss
Length: 28min 16sec (1696 seconds)
Published: Tue Apr 27 2021