Scrapy: Setup and First Project

Captions
Alright, so in this video we're going to be making use of the Scrapy Python package, which will allow us to scrape content from websites. What we'll do in this video is install the Scrapy package, make sure you have it on your machine, and use a very basic project to get us started. Specifically, we'll be going through the first project on the main Scrapy tutorial page in the official Scrapy documentation. I'll leave a link to it below, so if you want more information on what we cover in this video, it can be found there. I'm not going to go through the entire tutorial in this first video on Scrapy, but I will cover scraping in more detail, with more focus on practical uses, in later videos. If you have any questions, feel free to either leave them in the comments or consult that page.

What we're going to do in this video is scrape content from the quotes.toscrape.com website. Specifically, we're really just going to extract all of the HTML on a few of its pages. As you navigate through the site, there are a number of different pages that have some quotes on them, and what we'll be doing is extracting the entire HTML content of a specified URL and then saving that HTML to a file. That's the majority of what we're going to do. In your day-to-day scraping tasks, this is probably something a little too general in scope; you'll most likely be focused on scraping actual content, for instance the authors or the quotes themselves, instead of the entire page. But this first tutorial is just to get us started with Scrapy and up and running.

Let me minimize this and go to the terminal and the file explorer. I'm going to assume that you have both Python and pip installed; pip is Python's package manager. If you have both of those things installed, great, you're ready to go. If you don't, I'll leave links to how you can download both of them in the description as well. Assuming you do have Python and pip, all you need to do is navigate to your terminal if you're on Linux, or your command prompt if you're on Windows, and type pip install scrapy (or pip3 if you're using Python 3). I'm going to run this command, but I already have Scrapy installed, so we're going to see a lot of "Requirement already satisfied" messages because it's already on my machine. If you don't have it installed, pip will go and install Scrapy's dependencies, queuelib, lxml, and everything else listed there, onto your machine, and then you should be ready to go.

Let me clear this. One way we can check whether Scrapy is installed on our machine is to open up a Python shell in the terminal and type import scrapy. If this command runs and nothing is printed out at the end, it means that no warnings or errors were thrown: Python knows where the package is, and we have Scrapy installed and ready to use on our machine. So I'm going to exit out of this shell, knowing that we have that all set up, and give the terminal a clear.
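In case it helps to see this step in one place, here are the commands (a minimal sketch; whether the executables are named pip or pip3, python or python3, depends on your installation):

    # Install Scrapy and its dependencies
    pip install scrapy

    # Verify the install; no output means the import raised no errors
    # (the video does this from an interactive Python shell instead)
    python -c "import scrapy"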
What we're going to do now is start a Scrapy project. I'm in the terminal at the same location as the file explorer here on the right, and I'm going to run a Scrapy command that will automatically generate some files that we'll use to do our scraping. That command is scrapy startproject followed by a name; I'm going to call this project tutorial. You can call it whatever you like if you're following along, but I recommend naming these things consistently so that you can see where these names are used. I'm going to run this command and hit enter, we'll see a message that says the project was created, and indeed we now see this tutorial folder here on the right. If we open up this tutorial folder, we see that Scrapy has automatically generated a number of Python files. We'll go through exactly what those other files can be used for, but in this tutorial all we're going to focus on is the spiders folder, which is at tutorial/tutorial/spiders. It's a bit redundant perhaps, but that's how Scrapy creates its projects.

In the spiders folder we're going to create a specific spider to scrape the content of the quotes.toscrape.com website, and that spider is more or less the brains of how our scraping logic is going to behave. So we'll create a Python file in this directory, which I'm going to call quotes_spider.py, and this file, if we open it up, is going to be where we write the majority of the logic that Scrapy will make use of to scrape the quotes pages. We're going to step through that bit by bit, so let's just get to it.

The first thing I want to do is import scrapy, because we're obviously going to be making use of it, and then define a class, which I'm going to call QuotesSpider. Any time we create a spider file in this folder, Scrapy is essentially going to look here to see how many spiders we have defined, and we can invoke the specific spiders we write. Every spider class we write inherits from the Spider class that Scrapy provides to us, so our spiders inherit a lot of the properties of that base class. If that doesn't make sense: Scrapy essentially provides an interface where a lot of the scraping machinery is already written for us in the Scrapy package, and we just make use of it in the class we define.

What we're going to define now is a variable called name, and I'm going to set it to "quotes". This name is how we refer to the spider when we actually want to run it. After we write this spider, we'll get it to run from the terminal, and the way I'll refer to it there is through whatever I call this name here; Scrapy essentially goes to the spider, looks for this attribute to be defined, and that's how it refers to the spider by name.

The next thing we're going to do is define a list of URLs that we're going to scrape content from. If I go to the quotes site and click "Next", I get the second page of quotes. So I'm going to copy the link from the first page, put it in here, then paste it a second time and change the page number from 1 to 2. Essentially, the spider is going to start off on the first page of quotes, and then we're also going to extract the content from the second page of quotes as well.
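At this point the spider file looks something like this (a sketch of the intermediate state, following the official Scrapy tutorial the video is based on; the URL list will be moved inside a method in a moment, where it actually takes effect):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # How Scrapy refers to this spider, e.g. in "scrapy crawl quotes"
        name = "quotes"

        # The pages whose full HTML we want to save
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]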
As I mentioned, we're really just going to extract the entirety of the HTML of each page. For instance, if I go back to the first page, right-click anywhere on it, and choose "view page source", what we're going to be doing is extracting all of that content, storing it in an HTML file, and saving that HTML file to our working directory.

So let's go back to our spider. We have some start URLs, and what we're going to do now is loop through those URLs and actually make a request. Some of the stuff I'm going to write may be something you haven't seen before, and if that's the case, I'll refer you to the tutorial page for Scrapy. I will explain a little of it, but we'll probably get to the nitty-gritty of precisely what's going on in this line in a later video. Essentially, we're going to loop through the URLs in this list, and for each one make a request using Scrapy. In that request we send the URL that we're on and also a callback function, which will essentially implement the logic that scrapes the content of the site we happen to be passing to the request. So I'm going to write yield scrapy.Request, passing it the URL we're on in the loop and also a callback, which we have yet to implement in this class but which we'll call parse. So what we pass to the request is the URL we're on in this loop, and then also the callback, the function that the request is going to refer to for that URL.

Now we have to actually write this callback and implement it in the class. Actually, what I should do first is put the code we've written so far into a class method, which I'm going to call start_requests, indenting the loop so it's part of that function. Then let's implement the callback: def parse, which takes self, because it's a member of this QuotesSpider class, and secondly a response.

What we want to do in this parse function is implement the primary logic that extracts the information from the page we're on in the loop and saves it to an HTML file. The way we're going to do that is to define a variable called page, which extracts the page number from the URL we're on. The way it does that is a little bit verbose and unclear, so I'm just going to write it out and tell you exactly what's going on: it splits the URL on the "/" characters and takes the second-to-last piece, which for these URLs is the page number. So as we loop through the links, we'll have a page variable that corresponds to 1, 2, and so on. Then we're going to use that page variable to define the file name, the name of the file we'll save the content to.
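Here's how the spider looks with the loop moved into start_requests and the beginning of parse in place (again a sketch following the official tutorial; parse is completed below):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            # Schedule a request for each URL; Scrapy calls self.parse
            # with the response it gets back for each page
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # 'http://quotes.toscrape.com/page/1/'.split("/") ends in
            # [..., 'page', '1', ''], so index -2 is the page number
            page = response.url.split("/")[-2]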
We'll call the file 'quotes-%s.html'. The %s is a placeholder where we can pass a string into that position, and what we want to pass it is page, so the page number just gets put in there: in this case the file names will be quotes-1.html and quotes-2.html. Now, for each of these links, we want to store the HTML content in the file we generate. So I'm going to say with open(filename, 'wb') as f, opening the file for writing bytes, and then f.write(response.body). Essentially what this does is refer to the response variable that was sent into this parse function and extract its body, the HTML content of the entire response. The response is the entire page that was sent back to us from either of those links, and the body is just all of that HTML, so we're literally just writing all of the HTML to filename, which again is defined just above. The last thing I'll do here, just for our own benefit, is self.log('Saved file %s' % filename), again passing in whatever file name we happen to be on for this call to parse. This will be printed out to the terminal where we actually run the quotes spider, just a helpful little message to ourselves so that we know what the spider is doing. And that's pretty much all we need to do for the spider.

So now I'm going to save this file, go back to the terminal, and open up the spiders folder here. I want to make sure that I cd over to the tutorial directory, the project that was created by Scrapy. Then I'm going to say scrapy crawl quotes; "quotes" again refers to the name we gave this spider, and that's how I actually invoke it in this call. I'll hit enter, and we'll see a bunch of output. There's a lot going on: you'll see lines like "DEBUG: Crawled (200)", which means the spider successfully navigated to that particular page. There's a lot of other information that isn't really so useful to us for this particular tutorial, but in coming videos some of it will be quite useful. We also see our log messages, "Saved file quotes-1.html" and "Saved file quotes-2.html", the log statements we put in, here in the terminal. And indeed, if we go to the tutorial folder, we see that there are now two quotes HTML files that were generated by running this Scrapy spider. If we look at them, we see that each is just the HTML from the corresponding quotes page, really just the content of "view page source".
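For reference, here is the completed parse method (a sketch matching the official tutorial; it continues the class shown above) and the commands used to run the spider:

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            # Write the raw bytes of the page to a file in the
            # directory the crawl is run from
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

And from the terminal, inside the generated project directory:

    cd tutorial
    scrapy crawl quotes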
As I mentioned, when you're doing this for your own purposes, you'll probably want to scrape individual or specific content off a page, and in the coming videos we will indeed touch on that in greater detail. This was just about getting Scrapy up and running, so I hope this video was helpful for getting a very quick working example of Scrapy. If you enjoyed this video, please let me know, and if you have any questions or comments, please leave them below as well. Thanks again for watching.
Info
Channel: LucidProgramming
Views: 87,314
Rating: 4.7708592 out of 5
Keywords: python3 tutorial, python tutorial, scrapy, web scraping, python scrapy, screen scraping, web automation, scrape web pages, data mining
Id: OJ8isyws2yw
Length: 15min 58sec (958 seconds)
Published: Sat May 27 2017