Scrapy From One Script: CrawlerProcess

Captions
If you've ever wanted to use Scrapy without a full project, just the spider and nothing else, none of the extra stuff that comes with it, then today's video is really going to help you out. What I'm going to show you is how you can run Scrapy from a single .py script and still get the basic functionality you need: exporting your data and seeing what's being scraped on the screen. We're going to build a basic web scraper to keep it nice and simple, use the CSS selectors that are built into Scrapy, and output our data to a CSV, and I'm going to show you how to do all of that in one .py file without using startproject. So let's get started.

OK, so I've created a virtual environment and installed Scrapy. I'd always recommend using a virtual environment when you're creating new projects; it keeps things from conflicting with each other and just makes everything better. The first thing I'm going to do is import scrapy, because that's what we're going to need. The next thing I'm going to import right away is CrawlerProcess, which we'll use to run the spider directly from this file: from scrapy.crawler import CrawlerProcess. I believe it's capitalised like that; we get no errors, so that's good. What we want to do is basically mimic what a spider is, and add in a couple of bits of information that let us output the data to the screen and also export it to a CSV file. The first thing we always need to do is create a class, because we're going to be inheriting from the Spider class. I'm going to call it WhiskySpider, because that's what we're going to be scraping today, and it inherits from scrapy.Spider as you normally would.
The next thing we need to do is give our spider a name. I was going to call this one "whisky" for no real reason; in fact, we'll do "single_malts", because that's the section of the website we're going to be scraping. There we go. The next thing we need to do is start our requests; this is the same in any Scrapy spider. So I'm defining a function, and this is the start_requests method in Scrapy, and we need to give it self because we're inside our spider class. Out of this we want to yield the actual request, so we do scrapy.Request and give it the URL. Let's go straight over to the website for a minute, have a look at it, and then come back. So here it is; this is the section of the website we're going to scrape. There are 2,800-odd products, and I have it set to show 24 per page, but I think we can do more than that. What I always like to do is click the Next button and see what happens, and obviously, right away, the URL up here has changed. I'm also going to change the page size; I know this site can do 120, so that's what we'll use. So that's our URL: I'm going to copy it, come back here, and paste it in, and where it says page=2 I'm just going to change that to 1, because we're going to start on page one. The next thing we want to do is write our next function, the parse function, which is where we actually take the data off each page. So we define parse, with self, and we give it the response: that's the response that comes back from our scrapy.Request here. If we don't give it the response, we don't have anything with data we can get out of it. The next thing to do is check out the website itself.
Let's see where all of the information we want to scrape lives. Now, if you're new to Scrapy, or web scraping in general, or even CSS selectors, I've got beginner videos on all of those; in this video I'm going to focus on running the crawler with CrawlerProcess, so I might go quite quickly through the parse function. Essentially, you come to your website, do Inspect Element, and use that to find where all the product information is. If I make this a bit bigger, you can see right away on this side we've got a list item with the class product-grid-item, which is where all the products are, so I know I'm going to need that. So I'm going to say products = response.css(...), and it was a list item, li, with the class product-grid. Now, because we've found all of those elements and saved them in the products variable, we can just loop through them: for item in products. That lets us access the individual bits of data within each one. If I open one up there's extra information here; we could grab the link if we wanted to, but in this case it's not really about the data that comes out, it's about the process, so I'm just going to pull the name, the price, and the metadata. We can see this is the name, so I'm going to grab that, product-card-name, and it's a p, a paragraph. We can also see right there one that says meta, and underneath it one that says price. There we go; couldn't have asked for a simpler layout. So we want to yield out item.css(...). In fact, we need to call it something, don't we? So we'll say name, and it's item.css because we're looking inside each item that we're storing, and it was a p tag with the class product-card-name. Now, we actually want the text from this, not the raw element, so we add ::text (that's double colon, then text; sorry, I misspoke there, it's built into Scrapy) and then .get(). That's going to go out and get the text from that element. In bigger and better scraping projects you would absolutely use an item loader and the Item class, clean your data that way, and pass everything through Scrapy's pipelines; but in this case we're doing it all in one script, so we'll happily do .get() and move on. We'll also grab, I'm just going to call it meta for lack of a better name, item.css again, a p whose class was meta, and then exactly the same thing: ::text and .get(). And finally price: item.css once more. I like to use CSS selectors over XPath, it's just what I prefer; you can use the XPath selectors if you'd rather. Again, ::text and .get(). So out of this response, basically the HTML of the page, we're yielding all of these bits of data for everything on it. Now that we can get all of the products and the information we've specified off the first page, we want to loop through all of the pages and get all of the data. There are a few different ways to do this; I'm just going to use a simple range loop, because I can see, if I scroll down, that there are 24 pages. So I'm going to say for x in range(2, 25) (let's move this up here), because remember we're already doing the first page, and then yield scrapy.Request again. We can give it the URL again, but this time as an f-string: I'll copy all of this (we don't actually need that sort parameter on the end, but we'll leave it in anyway), and where it says 1 I'll substitute the loop variable.
So I'm just going to change that to, sorry, x, not two: {x}. There we go. We're basically saying that for x in range(2, 25) we request a whole new page from this URL. Then, right at the end (sorry, this is off the screen a little bit, that's bad of me, let me make it one size smaller), we need to say callback=self.parse. What that means is that for every one of these pages we go through, we call back into our parse function and extract the same information. If you don't put that in, you don't get any data back, so it's very important to add. Now that that's there, we need to set up the CrawlerProcess that we imported at the top. CrawlerProcess lets us run the spider from within this script, without the rest of a Scrapy project. Scrapy is built on the Twisted asynchronous library, which is why we need to use a process, and we need to give the CrawlerProcess its settings first, otherwise nothing is going to work. So I'm going to say process = CrawlerProcess(...), which is what we imported up here (I'll make that one size bigger, there we go), and we need to give it some settings: settings= with a dictionary, and we'll give ourselves some space. In here we put the settings we would normally have in a Scrapy project; because we don't have the rest of the project behind this one script, we need to put them in manually. I'm going to put in the feed settings, which drive the feed exports, so we can actually get our data out. The first one I'm going to write is FEED_URI, which is the name of the file we're saving to; I'm just going to call it whisky.csv. The next one is going to be FEED_FORMAT.
Here we're just going to say it's CSV, so that should work for us. Now we can use our process to run the spider from this script itself: we say process.crawl(...) and give it our spider class, paste that in, and then process.start(). What this is going to do, as I said, is use the Twisted asynchronous library and go out and grab this data for us. So if I run this, we'll go over and correct any mistakes I've made, because I've just written this all out as-is, and see what happens. OK, we can see everything zipping by quite quickly, and down here we have an item scraped count of 2,867, which is exactly what we had on the page. And if I open up here, we have a CSV file, there we go, with all of the data we requested. So there are all the names of the products, the metadata, and then... the name of the product again. There was always going to be something I messed up, and it's because I didn't change the card-name selector in the price field to price. Let's change that real quick, delete our CSV file, and run it again; this time we'll get the actual data we want. There's always something, right? It would be no fun if it all ran straight away with no errors. And again: 2,867 items scraped, and we can see 24 responses, which is right, because we had 24 pages, and somewhere around here: "Stored csv feed in whisky.csv". There we go, let's open that up, and now we have the prices. Much, much better. Now, we can see that our data has some spaces at the front, and maybe we'd also want to tidy the prices, so in this case I'm just going to add .strip() onto each of these .get() calls to remove the whitespace. But basically, that's it. This is a really useful approach if you want to write your own Scrapy spider but don't need a full project, because you weren't going to be using the rest of the project stuff.
By that I mean things like the middlewares, the pipelines, and so on; you just wanted to do it like this, and this is a nice and easy way of doing that. So thank you very much for watching, guys. Hopefully you've enjoyed this video and got some value out of it; if you have, let me know, drop me a comment below, and please subscribe. There's lots of web scraping content on my channel already and more to come. Thank you very much for watching, and I'll see you in the next one. Goodbye!
Info
Channel: John Watson Rooney
Views: 2,971
Rating: 5 out of 5
Keywords: scrapy from script, scrapy crawler process, crawlerprocess, python web scraping, learn scrapy, scrapy, web scraping, web scraping with python, python scrapy, python scrapy tutorial, scrapy tutorial, python scraping, scrapy for beginners, scrapy spider, scrapy python tutorial, python web scraping tutorial
Id: 5Is-QdbKmEI
Length: 12min 46sec (766 seconds)
Published: Sun May 16 2021