Intro To Web Crawlers & Scraping With Scrapy

Captions
This video is sponsored by Kite, a Python extension for VS Code and many other modern editors. It offers intelligent snippets and one-click documentation through a tool called Copilot, which I'll actually be using in this project. To download Kite on any platform, click the link in the description below.

What's going on, guys. In this video we're going to look at Scrapy, a Python framework for crawling websites and extracting data. There are a bunch of reasons you might want something like this: data analysis, data mining, information processing. A lot of services and websites give you data APIs to work with, but not all of them do, so there might be a website where you want some data and there's no API available, and you can scrape the data yourself. Keep in mind that there's a lot of ethics and even legality involved in web scraping, so if you're using it professionally, for a product or for your company, you really want to look into that. If you're scraping data from another website, read their terms and conditions and be ethical about it. That's really all I'm going to say on the subject; I'm not going to tell you what to do and what not to do.

What we're going to scrape is this blog right here, the Scrapinghub blog. I've seen it used in a bunch of tutorials, so I figured it's fine for us to use in this video. It's just a regular blog with a bunch of posts spread across several pages. What I'd like to do is scrape every single post and get the title, the date, and the author; you could of course grab other fields as well, but those are the three I want to target. By the end of this I want to be able to run one command that crawls the whole site and puts all of that into a large JSON file. You could do whatever you like with the data from there: save it to a database, create a CSV, or run it through what are called item pipelines to process it after extraction. There's a lot you can do, but this is an introductory-level walkthrough. So we're going to create a spider script to crawl and extract the data from the blog, and we're also going to work in the Scrapy shell, where we can run selectors directly and call methods on them to pull out data.

Let's jump into VS Code. I have my integrated terminal open down here, and as with any Python project we'll create a virtual environment. I'll use venv for that: python3 -m venv venv creates a folder called venv, and inside its bin folder there's an activate script, so we run source venv/bin/activate to activate the environment. If you're in VS Code, hit Ctrl+Shift+P (Cmd+Shift+P on a Mac), search for "Python: Select Interpreter", and select your virtual environment; mine is called venv. Now we should be all set, so I'm going to install Scrapy with pip (or pipenv, if that's what you use). Once Scrapy is installed we can create a project with scrapy startproject, and I'm just going to call it postscrape (or postcrawl, whatever you want to call it).
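Collected for reference, the setup commands dictated above; the project name postscrape is just the one used in the video, so substitute your own:

```bash
python3 -m venv venv              # create the virtual environment in ./venv
source venv/bin/activate          # activate it
pip install scrapy                # install Scrapy into the venv
scrapy startproject postscrape    # generate the project skeleton
cd postscrape
```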
Then we cd into postscrape and take a look at the folder that was created. There's another folder called postscrape inside of it, along with a scrapy.cfg file, and inside that inner folder we have files for middlewares and pipelines. Item pipelines let you do things with the data after you crawl and extract it, like cleansing it or running validation, but we're not going to get into that; it's more advanced than what I want to do here. All I want to do is create a spider, so inside the spiders folder we'll create a new file called posts_spider.py. This is going to be the main file we work with.

The first thing to do is import scrapy so we can use it, and then create a spider class: we'll call it PostsSpider, and it needs to extend scrapy.Spider. I want to take a look at this Spider class, and I'm going to use Kite for that; the Kite extension gives us a docs link that opens Copilot and shows everything about the class. You can see all the different members, but I just want to point out a couple of things. name is something we need: a string that identifies the spider, and it has to be unique, so let's add a name property and set it to 'posts'. Back in Kite Copilot there's also start_urls, a list of URLs the spider starts crawling from, so let's set start_urls to a list with the first two pages of the blog, http://blog.scrapinghub.com/page/1/ and http://blog.scrapinghub.com/page/2/. To start with, we're just going to save each page into a separate HTML file, which isn't very useful on its own aside from maybe offline viewing, but it will give you an idea of how this works.

We also want a parse method. Looking at the docs, parse takes in a response and is the default callback used by Scrapy to process downloaded responses when their requests don't specify a callback, so it's basically in charge of processing the response and returning the scraped data. Let's define parse; it takes self, since it's a method of this class, and then response, which is basically the data that we scrape. In this case we're just going to copy both of these pages into two new HTML files with the exact same HTML, so we're scraping the entire page; later on we'll target specific elements using selectors and put them into a JSON file. I'll create a variable called page and set it to response.url.split('/')[-2], so we take the URL of whichever page is being scraped, split it on the slashes, and go one from the end, which lands on the page number, 1 or 2. Then we set a filename of 'posts-%s.html', using %s as a placeholder that gets replaced with the page number. To actually create the file we use with open(filename, 'wb') as f, where 'wb' is the write-binary file mode, and then f.write(response.body), which writes out the entire HTML of each page into these files.
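Put together, this first version of the spider looks roughly like the following; the page URLs are reconstructed from the split('/')[-2] step described above, so adjust them if the blog's URL scheme has changed:

```python
import scrapy


class PostsSpider(scrapy.Spider):
    # Unique string Scrapy uses to identify this spider (scrapy crawl posts)
    name = 'posts'

    # The first two pages of the blog; the trailing slash matters for the
    # split('/')[-2] trick in parse()
    start_urls = [
        'http://blog.scrapinghub.com/page/1/',
        'http://blog.scrapinghub.com/page/2/',
    ]

    def parse(self, response):
        # '.../page/1/'.split('/')[-2] -> '1', so this grabs the page number
        page = response.url.split('/')[-2]
        filename = 'posts-%s.html' % page
        # Write the raw HTML of the whole page to posts-1.html / posts-2.html
        with open(filename, 'wb') as f:
            f.write(response.body)
```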
Let's save that and run it with scrapy crawl followed by the spider's name, in this case posts; whatever you put in the name property is what you pass to the crawl command. Once it runs, you can see over here that we now have posts-1.html and posts-2.html. posts-1.html is the first page of the blog; we scraped the entire thing and put it into this file, and I can even open it with Live Server and, right on my localhost, we have that entire page. That's fine for offline viewing, or if for some reason you need to copy an entire website.

Now let's take a break from the file for a bit and work in the terminal, because Scrapy has a shell where we can try out selectors and the methods we can run on them. We start it with scrapy shell followed by the domain or URL we want to crawl, in this case http://blog.scrapinghub.com, and I'll clear the screen with Ctrl+L. From here we can use CSS selectors on that same response object we get in the parse method, so everything we do in the shell is the same stuff we'll later do in parse. To use a CSS selector we call .css; say we want the title: response.css('title') returns a SelectorList, which represents a list of Selector objects that wrap HTML elements. Each selector shows its XPath, which I'll talk about in a bit, and its data, which is the actual element, in this case the title tag with the text inside it, and the title of this site is the Scrapinghub blog. If we want just the element, we can run response.css('title') and call a method on it such as get(); get() takes the first match and returns it, so now we get the actual element. A lot of the time you won't want the tag itself, just the text, and for that we add ::text to the selector (I typed ::title there by mistake; it's a double colon followed by text), and that gives us just the text.

Let's experiment with this a little. The page has h3s, h2s, and paragraphs, so let's try response.css('h3::text').get(). That gets us the text of the first h3, the "keep up to date with scraping" heading right here. If we want the second one we can add a set of brackets with a 1, and for the third a 2, so it works just like a list. If we want all of the h3s we drop the index and use the getall() method, which gives us every h3's text in a list. If we want to include the tags, attributes, and all that, we just leave off ::text, and that returns the actual h3 elements with their IDs, classes, and whatever else is attached. So that's how we select by tag; next let's see how to select by class or ID.
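Condensed, that shell session looks roughly like this; the commented results are only illustrative of what the blog returned at the time:

```python
# Started with: scrapy shell "http://blog.scrapinghub.com"
response.css('title')               # SelectorList of Selector objects
response.css('title').get()         # first match, tags included: '<title>...</title>'
response.css('title::text').get()   # just the text of the <title>
response.css('h3::text').get()      # text of the first <h3>
response.css('h3::text')[1].get()   # text of the second <h3>
response.css('h3::text').getall()   # every <h3> text, as a plain list of strings
response.css('h3').getall()         # full <h3> elements, attributes and all
```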
Let's open the blog along with the Chrome dev tools, because the dev tools definitely come in handy when you're scraping; you need to know the structure of what you're scraping. If we take a look at a post: a post-listing div wraps all of them, and then each post is a div with the class of post-item. Inside the post item we have a post-header and the post content; inside the post header there's an h2 with a link and the text for the heading, and then a byline span with the class of date that has an icon, a link, and the date text. Knowing the structure is important; it's like using CSS or jQuery, where you need to select specific things. Say we want the whole post header: we can do response.css('.post-header'), selecting by class, and call getall() to get all of the post headers, or get() for just the first one. If we want the first link in the post header we add an a to the selector; for just the text of that first link we add ::text; and for the second link, which is the date, we index in with [1]. Pretty easy.

We can also use regular expressions. There's a method called re() for that, so if I take response.css('p::text') to grab all the paragraph text and call .re(), passing an r-string, I can pull out every instance of the word scraping. If I want everything that starts with an s, I follow the s with a word-character class and a plus. And if I want every phrase with the word you in the middle, like "whether you are", "actions you will", "when you should", I put a word-character pattern on either side of you. You can pass in any regular expression and extract data that way.

Now I want to take a look at XPath selectors. XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. Honestly I find it kind of confusing, but these CSS selectors are essentially syntactic sugar for XPath; XPath is what's happening under the hood. You can use it directly, though: response.xpath('//h3') gets all the h3s, and if we want just their text we can add /text() and call a method like extract() (getall() works here as well). With the Chrome tools we can also grab the XPath for a specific element: if I select this author link, right-click, and choose Copy, there's an option to copy the XPath. If I paste that into response.xpath() and call extract(), I get the link with the author, and I can even append /text() at the end to get just the text.
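Roughly what those experiments look like typed out; the regular expressions are my approximations of what was entered in the video, and the copied XPath is left as a placeholder since it depends on which element you right-clicked:

```python
# Selecting by class
response.css('.post-header').getall()          # every post header on the page
response.css('.post-header').get()             # just the first one
response.css('.post-header a::text').get()     # text of the first link (the title)
response.css('.post-header a::text')[1].get()  # second link's text (the date)

# Regular expressions via .re() - patterns are approximate
response.css('p::text').re(r'scraping')        # every occurrence of "scraping"
response.css('p::text').re(r's\w+')            # tokens starting with "s"
response.css('p::text').re(r'\w+ you \w+')     # phrases with "you" in the middle

# XPath equivalents
response.xpath('//h3').getall()                # all <h3> elements
response.xpath('//h3/text()').extract()        # just their text (extract ~ getall)
# response.xpath('<xpath copied from Chrome>/text()').extract()
```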
So you can use XPath selectors as well, and you can actually target things more precisely with XPath; there's more you can do with it, it's just more difficult, at least in my opinion. The next thing I want to do is what I said at the start: get the title, the date, and the author. We'll do it in the terminal first with just the first post, and then I'll show you how to loop through the posts and get each set of data. We can set variables in the shell, so let's set post to response.css('div.post-item') and take the first one with [0]; each post is wrapped in that post-item div, as I showed you, so I'm basically selecting the whole first post and putting it into a variable. If I type post it shows me the selector. To set the title variable I use post.css() instead of response.css(), so I'm querying within that post: go into the post header, from there into the h2, then into the link, and take the text of that link, grabbing the first match and calling get(). Typing title now gives me just the text of the first post's title. For the date, the structure of the HTML makes this either harder or easier; here there's no class on the date link (if there were, I could just select .date or something), so we grab the second link in the post header, index 1, and now typing date gives me the date. The author is the same situation, no class directly on the link, so it's the third link in the post header, index 2, and there we go, the author.

That's how to do it for a single post; now let's loop through them. We say for post in response.css('div.post-item'):, and since we're in the shell make sure you tab over, or hitting Enter will just end the block. Inside the loop we create a variable for each field: title is post.css('.post-header h2 a::text')[0].get(), date is the second link's text, post.css('.post-header a::text')[1].get(), and author is the third, post.css('.post-header a::text')[2].get(). On the last line I print a dictionary using the dict() function, passing in the title, date, and author variables. Run that, and you can see we get a bunch of dictionaries with the title, the date, and the author.
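Typed out, that shell session is roughly:

```python
# In the Scrapy shell, against http://blog.scrapinghub.com

# Single post first: the first div.post-item on the page
post = response.css('div.post-item')[0]
title = post.css('.post-header h2 a::text')[0].get()
date = post.css('.post-header a::text')[1].get()    # second link in the header
author = post.css('.post-header a::text')[2].get()  # third link in the header

# Then every post on the page
for post in response.css('div.post-item'):
    title = post.css('.post-header h2 a::text')[0].get()
    date = post.css('.post-header a::text')[1].get()
    author = post.css('.post-header a::text')[2].get()
    print(dict(title=title, date=date, author=author))
```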
We've looped through all the posts and output that data, so now we want to do basically the same thing inside our file, and instead of printing it, create a JSON file. We're pretty much done in the shell, so I'll quit out of it. Back in the file, I'm going to get rid of the second URL and just use the root URL, because I'm going to show you how we can go through all the pages without adding them manually. In parse we don't need any of the file-saving code, since we're not copying pages like before; instead we loop the way I just showed you, for post in response.css('div.post-item'):, using the same selectors we used in the shell. Since we're in a spider, instead of printing we yield a dictionary: the title comes from post.css('.post-header h2 a::text')[0].get(), then I'll copy that line down twice, change the index to 1 and drop the h2 for the second line, and use index 2 without the h2 for the third, and fix the keys: the date is the second link and the author is the third. Again, these could be class or ID selectors if there were actual classes or IDs on those links.

With that saved, if I just run scrapy crawl posts, all it really does is print the data, the titles and so on, down here in the console. To actually put it into a JSON file we add the output flag, -o posts.json. You can also use the .jl format, which is JSON lines, or CSV, all kinds of stuff, but we'll do a JSON file. Check it out: now we have a JSON array with all of the posts, but notice it's only the posts from the first page.

So now I'll show you how to follow links and scrape data from the other pages. If we go back to the site and look at the pagination, this "older posts" link down here is going to make this pretty easy, because it has a class, next-posts-link I think it was, and an href attribute pointing to the next page we want to scrape. Back in the spider, at the same level as the for loop, we create a next_page variable and set it to response.css() with that link class, but we want the actual attribute, so we use the double-colon attr syntax with href and then get() (I could have shown you this in the shell as well); that gives us the link to the next page. Then we make sure there actually is a next page, because if that link doesn't exist next_page will be None: if next_page is not None, we set next_page to response.urljoin(next_page) to join it into a full URL, and then the last thing we have to do is call
yield scrapy.Request(), which takes in our next page and a callback, and in this case the callback is our parse method, because we want to run parse again on that next page, so we set callback equal to self.parse. I'm going to save this, and now it should scrape the entire site. I'll delete the JSON file we just created and run the crawl again with the output flag, -o posts.json. It takes a little longer because there's more data to go through; right now it's scraping the entire blog. If I open posts.json, there's now tons more data, because it went through every single page and took the title, the date, and the author.

So that's pretty much it. There's a lot more you can do that's much more advanced, but for the amount of code we wrote here, what is it, 21 lines counting the blank ones, so less than 20 lines of actual code, we're able to scrape an entire website and pull out specific pieces of data. I don't know how useful a blog's fields are, but if you go to, say, an e-commerce site and want a list of all the products in a certain category, Scrapy is really good for stuff like that. Hopefully you learned something here and enjoyed it, and that's it; I'll see you in the next video.
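For reference, the finished spider described above comes out to roughly the following; the pagination class name (next-posts-link) is my reading of what was said in the video, so verify it against the blog's current markup:

```python
import scrapy


class PostsSpider(scrapy.Spider):
    name = 'posts'

    # Only the root URL now; pagination is followed automatically below
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Each blog post lives in a div.post-item
        for post in response.css('div.post-item'):
            yield {
                'title': post.css('.post-header h2 a::text')[0].get(),
                'date': post.css('.post-header a::text')[1].get(),
                'author': post.css('.post-header a::text')[2].get(),
            }

        # Follow the "older posts" link until there isn't one
        next_page = response.css('a.next-posts-link::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```

Run it with scrapy crawl posts -o posts.json to write every page's posts into one JSON file, as described above.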
Info
Channel: Traversy Media
Views: 249,529
Keywords: python, web scraping, web crawlers, scrapy, python scrapy, web spiders, data scraping
Id: ALizgnSFTwQ
Length: 28min 55sec (1735 seconds)
Published: Tue Jan 14 2020