Scraping the web with Scrapy (Python Frederick)

Captions
All right, are we good? Yeah? All right. So, I'm Micah Nordland. I'm a full stack web developer; I work for a small nonprofit in Virginia. Learning Python got me into programming, and I currently work as a full stack web developer. Matt Layman asked me if I wanted to come and give a talk, and this was basically my internal process; if you've seen Moana, that's probably my second favorite character from that movie.

So tonight we're talking about web scraping: specifically, the process of getting an HTML page, extracting data from it in a repeatable way, and storing it in the format of your choice. Now, not everybody may be completely familiar with how the web works, so I've got a short primer here. (Sorry, I had a sore throat yesterday, so I'm probably going to lose my voice at some point, but I think we'll be all right.) Everyone's seen Rogue One, right? Good, because this is a spoiler. You recall that the Death Star plans were stored in a certain facility on the planet Scarif, and the rebels needed to transmit them out, but the Empire was not particularly willing to let them do that. Sometimes that can be an analogy for web scraping: you have data that you want to get out of somebody's server, but they don't really want to let you do that. Scrapy is a Python library that lets you get that data out of their HTML and into your cruiser, or database, or whatever that may be.

This is a diagram from the Scrapy documentation; there are lots more diagrams like this in the documentation, if you like diagrams. The link is in a really small font down there at the bottom where you can't read it, but it'll be on the resources slide later. Basically, you're responsible for writing spiders that produce requests and items, and if you have special requirements for how you want the data to be stored, you might write your own item pipeline. I'll go into what all of those are a little later; this is just a big overview of how Scrapy works.

Right, so why should you use Scrapy? There are a number of Python libraries for talking HTTP, like the very excellent requests library. If you're talking HTTP and you're not using requests, you probably should be; it makes everything easier. But Scrapy does a lot more work for you than requests will. As an actual comparison: I wrote a scraper using requests and BeautifulSoup for the HTML parsing, and it came out to about 249 lines of code, while a Scrapy crawler doing a comparable task came out to 71 lines. If you're quick with math you can tell what that difference is, but it's big. It's a lot less work, and a lot less thinking you have to do, to write a Scrapy crawler.

So what is Scrapy? It has about four concepts that I'm going to go over: projects, spiders, items, and item pipelines. It has a command line tool that you can use to create projects, and it sets up the structure for you, which makes starting a Scrapy project a lot simpler. You can have it start your project, you can have it generate a spider scaffold (there's a sketch of one just below), and then you can run the spider by name. The general file structure is going to look like this: scrapy.cfg is for managing deployments with scrapyd, the Scrapy daemon, which is optional. If you're going to have a scraper that runs a lot, or is deployed for a long time, you're probably going to look at either running scrapyd yourself or using a hosted service.
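Here's that spider scaffold sketch, roughly what the generator produces (assuming the default template); the name and domain are placeholders:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # the scaffold leaves parse empty; your extraction logic goes here
            pass

You run it by the name on the class, which is one reason keeping the spider name and the file name the same is handy.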
Scrapinghub is one such hosted service for running Scrapy spiders. All of your items, which I'm going to talk about in a little bit, go in the various files: spiders, and so on. settings.py is where you configure your spiders: it's where you configure the order that middleware runs in, it's where you configure your item pipelines, and it's where you configure a bunch of other things I'm not going to talk about, because this is a really high-level overview of what Scrapy can do and how it can be useful to you.

So, spiders. Spiders are the most important thing; if you don't write anything else, you will be writing a spider. Basically, you give it some URLs to start at, you tell it where it is allowed to go, and then in the parse method you tell it what to do with the content it brings back. A lot of your code is going to go into this parse method (you may of course want to break things out into other methods as you see fit), and it's where you pull things out of, or extract things from, the HTML. You can use two kinds of selectors. One is CSS selectors, which work just like you'd expect them to. The other is XPath, which can be useful for more advanced scenarios. If you're not familiar with XPath, it's a syntax for pulling things out of XML documents, and because XML and HTML are both based on SGML (all the MLs except ML, which is different), it can be used for HTML as well. It's handy where the HTML doesn't have clean classes or IDs that are useful for CSS; XPath lets you say things like "I want this pattern of elements" and so on. And you can use the two interchangeably; you don't have to choose one or the other, you can mix them willy-nilly.

One thing you'll notice down here is that I'm doing a yield of a dictionary. There are different ways you can design your parse method: you can have it return a request, you can have it return a list of requests and items, or you can make it a generator. I find the generator style most useful, and it's the way the Scrapy documentation's quick start has you do it, but it's your choice; whichever makes sense for you. Either way, you're going to be returning, or yielding, a collection (an iterable, more exactly) of items and requests. The requests are for when you're trying to follow links in the HTML, say there's pagination going on and you need to get the next page of results. You can use these things called link extractors, which help you pull out URLs based on a pattern, and you can attach callbacks to those as well. If you subclass the built-in CrawlSpider, which adds a little bit on top of the base spider, you can just define some rules and have them applied automatically.
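As a hedged sketch of the kind of parse method just described (hypothetical site and selectors, sitting inside a spider class like the scaffold above), mixing CSS and XPath and yielding both items and a follow-up request:

    def parse(self, response):
        # CSS selectors work the way you'd expect
        for row in response.css("div.result"):
            yield {
                "name": row.css("h2::text").extract_first(),
                # XPath is interchangeable, and handy when there are no
                # clean classes or IDs to hang a CSS selector on
                "updated": row.xpath(".//time/@datetime").extract_first(),
            }
        # yielding a Request instead of an item tells the engine to fetch
        # another page (pagination, say) and send it back through a callback
        next_href = response.css("a.next::attr(href)").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)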
Running a spider: so you've made your spider, and you have two options. If you've just written a spider without creating a project, you run scrapy runspider and then the name of the file that holds the spider. If you're inside a Scrapy project, you use scrapy crawl with the name you gave the spider when you created it. I find it's helpful to have the spider's name and the file name be the same thing, because that makes thinking easier, and we all need that.

The second thing you'll be creating is items. If you have a small spider, it may not make sense to make an item, but if your spider is going to be larger, or used in a larger project, or you're going to have multiple kinds of items you want to return, you're going to want to define those. Basically an item is just a class that descends from scrapy.Item, and it looks a lot like Django's models for its ORM: it's just field after field, of whatever type you want. Whenever you create one, it makes sure you can only set the declared names, which also makes your output formats more consistent. The way this works is that you can set those fields but you can't set any other fields, so in this case, if I tried to set, I don't know, "head" instead of "headers", that wouldn't work; it would cause an error and I'd know about it immediately. That's where the value of items comes in. (There's a sketch of a small item and a validation pipeline at the end of this part.)

Now, item pipelines. These are the things that process your scraped items: you've gone through the HTML, you've pulled out the data you need, you've filled in your item, and now Scrapy sends it to an item pipeline. These are configurable; you can set the order, the precedence, in which you want them to run. They're useful for getting additional resources: if there are files that need to be downloaded in addition to the content, you can use the built-in files pipeline, and there's also an images pipeline, which I'll use in the demo later, that has a bunch of options for handling the images you download, storing them, and making sure they stay together with the content they belong to. You can also use pipelines for validation: you can decide not to store an item if, in a pipeline, you decide that for some reason it's invalid. And you can, at this point, store items in some sort of format, or pass them on to, say, a service that magically turns an HTML site into an API on the other end, which would be really cool actually; you could push the items out to that service. And this slide here is wrong, and it was wrong the last time I gave this talk, and it's still wrong because I forgot to fix it; I should write that down somewhere. But moving on.

One of the questions I got a lot the last time I gave this talk is: how can you run JavaScript? Because in today's world there is a crap ton of JavaScript, and you may be running up against things like single page apps or progressive web apps that are all JavaScript; there's no content originally, it's just empty, so how are you going to get that content? Or you might be running up against CloudFlare, because let's face it, not everybody has enough bandwidth. There are two specific answers. If you're just trying to get past CloudFlare, there's a Python package that can be used with Scrapy; in fact, if you google it, it's the first answer on Stack Overflow, which is how I found it. It's called cloudflare-scrape, and you can integrate it with your requests and it'll do the necessary delays or whatever, so you can get your content easily. If you have to deal with single page applications, Scrapinghub (again, the organization that develops Scrapy) has this Splash setup that can run the JavaScript for you. It's a headless browser, I think it's Chrome, but it's all contained in a Docker container, and they have instructions for hooking it up to Scrapy so that it's really easy to use, and you can specify how long it needs to wait for the whole page to load and do its thing.
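Here's the item-and-pipeline sketch mentioned above. The field names are borrowed from the quotes demo later in the talk, and the validation pipeline is a hypothetical example that refuses to store items with no text:

    import scrapy
    from scrapy.exceptions import DropItem

    class QuoteItem(scrapy.Item):
        # declaring fields up front means a typo like item["txt"] fails loudly
        text = scrapy.Field()
        author = scrapy.Field()
        about_link = scrapy.Field()

    class RequireTextPipeline(object):
        def process_item(self, item, spider):
            # drop invalid items instead of storing them
            if not item.get("text"):
                raise DropItem("missing quote text")
            return item

You'd turn the pipeline on in settings.py with something like ITEM_PIPELINES = {"myproject.pipelines.RequireTextPipeline": 300}, where the module path depends on what your project is actually called.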
Well, what is CloudFlare? It's a kind of CDN: you point your DNS at CloudFlare, and you have CloudFlare point at your actual web server, and it goes and gets your content and caches it on all of its endpoint nodes around the world. So if your server is hosted here, say in the Northern Virginia region (I guess that would be the North American region of AWS), and for some reason you're not using Amazon's CDN solution, because that probably costs money and CloudFlare is free, you can have that content available in Europe, Australia, and South America, and your content is going to be close to whoever's trying to get at it. It also helps with distributed denial of service attacks, because CloudFlare obviously has huge pipes and can handle that influx, and there are paid plans to mitigate that further as well. It's a really easy way to add some extra uptime, some extra oomph, to your service at very little cost. But one of the things you'll run into when scraping is that CloudFlare tries to prevent bad things from happening to people's websites, and a poorly configured scraper could do those bad things, like a denial of service. So sometimes, instead of serving the page that was requested, it will throw up a page that inserts a delay with some JavaScript, makes you wait, and then does a redirect. The cloudflare-scrape module handles that correctly, so that you appear as a good citizen to CloudFlare, which sits in front of a lot of websites; even I have websites on it, so it's definitely something to pay attention to.

Deploying spiders: as I mentioned before, you can use your own instance of scrapyd, the Scrapy daemon, to host them. I'm not going to talk about that tonight; I just want to let you know that those options exist if you need them. Moving on.

Right, so the reason I learned all this is that I have a friend who has other friends, and those friends have problems, and my friend finds people to fix those problems. In this case he came to me to fix this one. There's a guy over in Italy who has a website that he has been filling with content for years. It's a static site; let's see if I can pull it up here. Nice little site, no content management, and the guy's not really tech-savvy, so it's getting to be a bit much to work with, and he's got a lot of content in here. A lot. But for all that, the HTML itself is actually in pretty good shape, so I knew it would be a good candidate for web scraping, and I had come across Scrapy at work while working on a related thing, so I knew this would be a great opportunity.

Extracting the sermons was a bit difficult, because while the structure was clean, it didn't have a lot of classes or IDs that made pulling things out easy. So I used XPath; like I mentioned before, it's good for those situations where you have a repeating kind of structure but no names you can easily pull things out by, and XPath lets you cope with that really easily. There were also a lot of media files along with the sermons he has on there, so I used the files pipeline to handle downloading those. The files pipeline basically works like this: you fill out a field on your item called file_urls, and when the item reaches the files pipeline, it goes and downloads those URLs, sticks them in a folder of your choice, and then fills out, in another field, which files were downloaded and where they can be found on the file system.
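A minimal sketch of that setup. The pipeline class path and the two field names (file_urls and files) are the built-in defaults; the item class and its other fields here are just for illustration:

    # settings.py: enable the built-in files pipeline and say where downloads go
    ITEM_PIPELINES = {
        "scrapy.pipelines.files.FilesPipeline": 1,
    }
    FILES_STORE = "downloaded_media"

    # items.py
    import scrapy

    class SermonItem(scrapy.Item):
        title = scrapy.Field()
        file_urls = scrapy.Field()   # the pipeline reads the URLs to fetch from here...
        files = scrapy.Field()       # ...and records what was downloaded, and where, here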
For this project I chose to output the scraped data into a JSON file, and it's really easy to do that; I'll show you how when we get to the demo. It's really straightforward to output JSON, or CSV, or whatever format you need it in.

All right, now to the demo. I've already set up a virtual environment for this. If you're not using virtual environments, they're a really great way to keep your projects separate and keep your dependencies separate between projects: if you need one version of a library for one thing and another version for another thing, it keeps them apart, and it also helps avoid polluting your operating system's libraries with the stuff you're working on. It's all around a really good idea. I use them for all my stuff; I'm not affiliated, I don't work on the virtualenv project, but it's a really great practice.

So I'm going to install Scrapy. You just install it with pip, nothing fancy, real easy, and I already have it. Good. Now I'm going to start my project, in this folder. All right, it's created our project here, and now we need to create a spider. (The MacBook went to sleep; I don't know if that's going to affect the recording or not.) So we're going to run scrapy genspider; let's call this the quotes spider, we'll just call it "quotes". Oh, I forgot to give it the domain. Now, quotes.toscrape.com is a website that has been built for testing scrapers and crawlers and what have you, to let you experiment and play, and basically what I'm going to do is implement a real easy spider that scrapes these quotes out of here and puts them into a JSON file.

So let's get into our spider. As you can see, it's configured a lot of stuff for us already: we have our starting URL, we have the allowed domains, and we have the name we're going to run this with. We're going to start in our parse method. The response we get is going to be this HTML page right here, and we need to get this quote body. As you can see, the text of the quote has a nice class on it, "text", which is going to make it really easy to pull out. So in our parse method we're going to use the response parameter with a CSS selector, response.css(".quote") (sorry about the semicolon, I'm used to writing C#), and that gives us a list of quotes; we'll just call it quotes. That's going to be a list of all the elements that are quotes, which is each one of these boxes here; the CSS selector (or XPath, as you will) pulls the elements out of the parsed HTML. So we have our list of quotes, and now we're going to loop through our list of elements. For each quote we're going to pull out the title (whoops, caps lock is on); actually, we need to grab the ::text pseudo-element, and then we'll call extract_first(), because there's only going to be one element and we just want that one element rather than a list.
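Reconstructed as a sketch, the spider at this point in the demo looks roughly like this (quotes.toscrape.com really does mark each quote up with these classes):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            quotes = response.css(".quote")              # one node per quote box
            for quote in quotes:
                title = quote.css(".text::text").extract_first()
                yield {"title": title}                   # renamed to "text" in a moment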
So that's going to be our title, and then we're going to yield a dictionary; we'll just call the key "title". All right, if we haven't forgotten anything we should now be able to run scrapy crawl quotes. Looks like it finished, but you'll notice it didn't output any data. So let's have it output JSON; we just need to tell it to write to quotes.json and run it again. Now we should have a file called quotes.json. Oh wow, that's interesting. Oh, right. Well, let's go back to our spider and correct that; you'd think I knew what I was doing, I practiced this, but we all have our bad days. Let's call that key "text" instead, because this is indeed the text of the quote; quotes don't normally have titles. Run that again, it finishes, and you can see down here we have the text. One thing to note is that subsequent runs don't erase anything that's already in the file, so if you want your data to be just the stuff you most recently scraped, you should remove the file before you run your crawler, your spider. And now you can see we have the text of all the quotes on that page.

Thing is, people like to get attribution for the quotes they made; people also like to know who made a quote, because if it's George from down the street, the quote doesn't mean as much as if it's from, you know, Albert Einstein. So we should collect the authors as well. Let's see, this looks like a class called "author", so let's modify our spider to grab the author too: inside our loop over the quotes we'll select the author text as well, add it to our item, and run it again. Looks like that was successful; we can check our quotes.json, and there we see we have the author as well, so we know who made the quote and what the quote was.

But the people we're handing this quote data over to might want more; they may never have heard of, oh, Jane Austen or something. There's a little "about" link, so let's save that link as well. Let's inspect the element. Looks like there's no class on this one, and we're also going to need the href of the link, so let's head back to our spider and see if we can't grab it. The links are under the quote, but we don't want the tag links (those are also "a" elements); it looks like we need the span, and then the "a" inside it, and that should grab the element we want. So let's call it the about link: from the quote we'll do a CSS selector for the span, then the "a" directly inside that span, and we want the href attribute, and the first item. That should get us the about link for each of the authors. And there we are; looking in the quotes file, yes, we can see we've gotten the about links. But you'll notice they're relative links, and if somebody down the line got this data but wasn't told the URL it came from, they wouldn't be able to use those links. So let's fix that using the urljoin function from urllib; first we have to import it.
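Before that fix, the parse method at this point in the demo stands roughly like this (the selectors match the real quotes.toscrape.com markup; the about link is still relative):

    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").extract_first(),
                "author": quote.css(".author::text").extract_first(),
                # the "(about)" link has no class of its own, so reach
                # through the span that wraps the author name
                "about_link": quote.css("span > a::attr(href)").extract_first(),
            }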
Note that I'm using Python 3 for this demo, so it's from urllib.parse import urljoin; the same function exists in Python 2, it's just in a different place, and I don't remember exactly where that place is. All the work I've done with it has been on Python 3, but I know it works on both, so if you're in a Python 2.7 environment it will work for you there as well; your options aren't limited in that respect. So we're going to call urljoin, and we're going to take response.url, which is the URL of the page that was visited, and join our about link onto it. Let's run it. Yes, looks like it worked; let's go back and look at our quotes, and you can see that the URL is now absolute, and anyone who gets hold of this data can find out where we got it.

There was one more thing I was going to do. Ah yes, items. I already have an item that I built for this, and to save time I'm just going to bring it in and show you really quickly how it's used. How much time do we have? Okay, are you guys interested in learning how to use the pipelines to download images from an e-commerce style site? Okay. So we have the QuoteItem that can be used with the quotes spider; we'd have to add the about link field to it (or was it about_link? yeah). You can use these to make sure you have a consistent structure: if you have multiple spiders and they're all returning the same kind of item, items ensure they're all uniform.

Anyway, let's gen a new spider; this one will be "books". The quotes page is pretty simple, but if we go to books.toscrape.com, you can see it's kind of an e-commerce style site: it has book images, it has book titles, and it has prices. Lots of great info that you might want to store for some reason; maybe you'd write something that tracks the price of your favorite book and buys it immediately once it drops below a certain threshold. Lots of great things you could do with this. So let's write a scraper for it. We want to get a list of all the books, we want their full titles, and for kicks and giggles let's get their images as well, because we might write a nice interface that lets us visually decide which books to put on our watch list.

All right, let's go look at the new spider we just created; it's in books.py, set up pretty much the same way. The new thing we're going to do here is import our BookItem so we can use it; we just do a relative import from items. Now we have our BookItem, so let's take a look at the markup. All right, that looks like a book; it's in a class called "product_pod", which should be fairly useful for pulling this stuff out. First, let's get a list of books from response.css(".product_pod"); that gives us our list of those nodes in the HTML. And then we're going to want to pull out the title.
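For reference, the quotes spider's parse method ends up roughly like this once the urljoin fix is in (the import goes at the top of the module; in Python 2 the equivalent lives at urlparse.urljoin):

    from urllib.parse import urljoin

    def parse(self, response):
        for quote in response.css(".quote"):
            about_link = quote.css("span > a::attr(href)").extract_first()
            yield {
                "text": quote.css(".text::text").extract_first(),
                "author": quote.css(".author::text").extract_first(),
                # response.url is the page we just fetched, so the join
                # produces an absolute URL anyone can follow later
                "about_link": urljoin(response.url, about_link),
            }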
Now, if we look at the link, we see that the text of the link is the shortened form, but the title attribute on that link actually contains the whole thing, so we want to grab that and not the actual text of the link. That was a gotcha I ran into when I first ran through this. So the book title is going to come from book.css; let's see where the title ends up. Another cool thing you can do with Firefox: in the dev tools, if you right-click on an element and go to Copy, you can pull out the CSS selector (or CSS path, maybe), or even the XPath. So if you're doing this kind of thing with Scrapy and you're using Firefox, it makes it really super easy to figure out how to grab a particular element. It does suffer sometimes, well, a lot of the time, from being too specific, so you may need to edit it a bit. I'm going to use the XPath version here just so you can see it used. As you can see, it pulled in a fair bit too much; the article element is the product_pod thing up there, and since this selector is relative to the article, we're just going to cut that part off. Let me check real quick how to get attributes out in XPath, because I've forgotten. Right, yes; as you can see, the documentation really is quite good, and for all the questions you might have, it's going to be a really useful resource for you. Okay, it looks like we need a/@href. Ah, good point, except instead of href we're going to get the title attribute. We'll extract the first one, and then we'll yield a BookItem, setting its title.

All right: scrapy runspider books, and let's have it output to bookstat.json. "File not found: books"? Oh, right: scrapy runspider is what you'd use if you didn't have a project and had just written a spider in its own file somewhere, but inside a project we can use scrapy crawl and refer to the spider by name. (And that's right, items.py is one level up from the spiders directory.) All right, it has scraped, so let's go take a look at bookstat.json, and that's not right, again. To save time here I'm just going to use CSS, which I probably should have done in the first place; I'll swap in the CSS version and run this again. Okay, now we're grabbing the titles.

Now we can start doing the interesting things. Let's go back into our scraper: the title was not the only thing we wanted to grab, we also wanted the image and the price. Let's grab the price first. Oh look, that's in a really easy place to get to; it's got two classes right there. So let's head back and do price equals book.css (that was product_price and then price_color), and it's just the text right there, so we'll extract the first one and put it in the item. Run it again, and we have the prices now. Awesome.
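At this stage the books spider looks roughly like this (a sketch; BookItem is assumed to declare title and price fields, and the class names match the real books.toscrape.com markup):

    import scrapy

    from ..items import BookItem   # items.py lives one level up from the spiders directory

    class BooksSpider(scrapy.Spider):
        name = "books"
        allowed_domains = ["books.toscrape.com"]
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            for book in response.css(".product_pod"):
                yield BookItem(
                    # the link text is truncated for long titles; the title
                    # attribute on the h3's link holds the full thing
                    title=book.css("h3 a::attr(title)").extract_first(),
                    price=book.css(".product_price .price_color::text").extract_first(),
                )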
Now it's time for the tricky bit, which is getting the images, and for this we're going to use the images pipeline. But first I need to make sure I have Pillow, which is required: it's a Python imaging library, and the images pipeline needs it because it can do some thumbnailing and resizing and give you multiple different sizes. We're not going to use those tonight, because that's a little more complex than I want to get into, but let's just make sure I have it. I do. Awesome.

All right, let's go into the settings file. There are two things in here I'm going to need to set. (Unfortunately, ladies and gentlemen, my editor has frozen... ah, there it goes.) Let's grab the name of the images pipeline, scrapy.pipelines.images.ImagesPipeline, and we'll give it priority 1. The lower the priority number, the higher the precedence that pipeline will have; keep that in mind. The images pipeline also has a second setting you'll need to set, IMAGES_STORE, which is a directory of your choice; that directory will be relative to the working directory you run the scraper from, so keep that in mind too. Did I name that right? Almost. I'm just going to put them in book_images. So that's set.

Now, if we look at our items file, I've already set up the BookItem to handle the images pipeline. Basically what we'll do is fill the image URLs into this image_urls field right here, and the images pipeline will download the images and save each one in a file named by the SHA-1 hash of that file. So if the SHA-1 hash of a particular JPEG were, you know, "25", it would get saved as book_images/25.jpg. The pipeline then fills in this images field with the file location and the other attributes of that image.

Okay, back to the spider. Now we need to find where the image is in the HTML. It's got its own class, "thumbnail", so it'll be really easy to pull out, and the image URL is in the src attribute, so that's the one thing we'll grab. So image_url equals book.css with the thumbnail class and the src attribute, and we'll take the first one of those as well. If there were, say, a group of elements that you wanted to select all in a single go, you could also use extract() rather than extract_first(), which gives you a collection of those items. The way I usually do it is to grab whatever the highest-level element is that I want, and then for each of those pull the pieces out; you may decide to do things differently, like pulling each field out in one go and building a collection of items to return as a whole. I find the generator style really easy to think about, but you may have your own preferences, and those are just fine; Scrapy will work either way.

So that's the image URL. Unfortunately it suffers from the same problem we had before with the about-author link, so we will again import urljoin from urllib.parse and join the page URL onto it, assigning the result to image_urls. All right, now I've done everything correctly. And yes, if you had multiple images per item, you could do that as well: in the case of the crawler I wrote for my friend, each sermon had three or four different media files that went with it, so I used the files pipeline with multiple URLs in that collection. In this case there's only one cover image per book, so we'll just store a single URL in that list.
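A sketch of the pieces involved. The class path, IMAGES_STORE, and the image_urls/images field names are the pipeline's defaults; the directory name is the one used in the demo:

    # settings.py
    ITEM_PIPELINES = {
        "scrapy.pipelines.images.ImagesPipeline": 1,   # lower number = higher precedence
    }
    IMAGES_STORE = "book_images"   # relative to the directory you run the crawl from

    # items.py
    import scrapy

    class BookItem(scrapy.Item):
        title = scrapy.Field()
        price = scrapy.Field()
        image_urls = scrapy.Field()   # the pipeline downloads every URL in this list...
        images = scrapy.Field()       # ...and records the resulting paths and checksums here

In the spider's loop, the yield then gains one more field, something like image_urls=[urljoin(response.url, book.css(".thumbnail::attr(src)").extract_first())], a list even though there's only one cover image per book.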
Yes, that list is where the pipeline is going to look for the URLs to download; if you don't fill it out, it won't know which images you want. That's convention over configuration, I suppose, and you could write your own image pipeline that does it differently. All right, this run is going to take a little longer, because it has to download the images as well... looks like that was successful, it didn't complain, and yes, it looks like we have our images. Let's look at the JSON: the items up here are from the run we did before, and down here we have our title, we have our image_urls, and if you scroll over you'll eventually see that it has filled out the path; this is a path relative to the IMAGES_STORE that we set before. And let's see, it has created the directory for us, and these are full-size images, so you can see we've got the whole set of covers here. Our book covers have all been downloaded along with our items, and we can do with them what we will.

So yeah, that's a pretty good overview of the kinds of things you can do with Scrapy; once you have the data, really the sky's the limit as to what you do with it. I recommend that you not scrape websites whose owners have asked you not to, because that's just good netiquette, just a good thing to do. There are people who use this for bad purposes; I recommend not doing that. But on the other hand, you are basically just another web browser, so it's not wrong for you to do this either; just always be sensitive to other people's content, which I'm sure you all know about.

Question from the audience: at some point a crawler is going to want to take links that it finds and descend into them; is the mechanism for that writing multiple spiders and handing off between them, or are you expected to write a spider with more logic, or does Scrapy have guidance if you need to do something more complicated?

Yes, it does. You can pull the links out yourself and return a request: in the generated example, I might get the list of books, run through all the generators, and then do some link extracting on my own. Or I could subclass CrawlSpider. You may have noticed (let me pull this up real fast) that the spiders I wrote tonight all descend from scrapy.Spider; Scrapy has some other base classes you can use, and one of them is CrawlSpider. With it you set a class property called rules, and then whenever it goes through a request and loads a page, it runs these link extractors, which look for URLs that match a pattern (or, alternatively, don't match a certain pattern you set), and it will automatically queue up requests for those to be processed next. The other option is that there may not be actual links on the page, but based on information you scraped from the page you may be able to infer a link you want to go scrape, so you can build a URL and return a request instead of an item, or a list of requests, from your spider. Any kind of iterable containing items and requests is legal to return from the parse method, and any requests in there will be passed to the engine that goes and downloads them, and the responses will come back into your parse method, unless you set a different callback, in which case they'll go there.
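A hypothetical sketch of that CrawlSpider approach, using the books site from the demo; the URL patterns are assumptions about its layout, so treat them as illustrative:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class BookFollowSpider(CrawlSpider):
        name = "bookfollow"
        allowed_domains = ["books.toscrape.com"]
        start_urls = ["http://books.toscrape.com/"]

        rules = (
            # follow pagination links without sending them to a callback
            Rule(LinkExtractor(allow=r"page-\d+\.html")),
            # send pages that look like book detail pages to parse_book
            Rule(LinkExtractor(allow=r"catalogue/[^/]+_\d+/index\.html"), callback="parse_book"),
        )

        def parse_book(self, response):
            # CrawlSpider reserves parse() for its own machinery, so use another name
            yield {"title": response.css("h1::text").extract_first()}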
Any other questions? All right, some resources for you if you want to explore further: scrapy.org is the website for Scrapy, and the docs are at docs.scrapy.org; that sends you to a nice tree of the docs pages with lots more info, and it's where I got a lot of the material for this talk. If you liked that diagram at the beginning, you'll find it there. If you want to look at these slides again, you can see them on my website under slide sets, Scrapy, and I may even throw the demo code up there if you want to download it and use it as a starting point; I'd probably put it under an MIT license or something. Thank you, guys, for being a great audience; I enjoyed presenting. [Applause]
Info
Channel: Matt Layman
Views: 988
Rating: 5 out of 5
Keywords: python, python-frederick, scrapy, web scraping
Id: tdA1cl6LiCw
Length: 66min 8sec (3968 seconds)
Published: Thu Nov 09 2017