Create simple spider with Scrapy - Part 1

Captions
Hello guys, in this short video I am going to show you how you can make simple spiders that extract data from websites with the Scrapy framework. Scrapy is a framework written in Python, and with it you can write spiders that extract data in different formats.

So let's start by creating a directory to hold our project — let's call it scraping — and go there. Now, this is not strictly necessary, but it's really good practice to use a virtual environment for every Python project you have. With Python 3.5/3.6 you can create one just by typing the Python version, -m venv, and the name of the environment; I'll call it scrapyvenv, and this will create a directory named scrapyvenv in the current directory. Now I can activate this environment (this is in bash or zsh) by sourcing the activate file from that directory: source scrapyvenv/bin/activate. And now I'm using the Python from this environment — if I type which python to see which Python interpreter I am using, you can see it's from the directory I just created. Why do we do this? So that when you're installing dependencies and Python packages, you don't mess up your global Python packages. When you have multiple projects — say they use different Scrapy versions or something like that — everything is separated and you get a clean environment every time; they don't step on each other's toes.

Now that I have the environment and have activated it, it's time to install Scrapy. It's as simple as pip install scrapy. (I just realized maybe I should call it "scrappy" — I don't really know how it's pronounced; maybe there's information on their website.) It takes a few seconds to install, and while we wait for that... actually, it's ready.

OK, so now there is a command available named scrapy, and you can do several things with it. The first thing you want, of course, is to start a project: scrapy startproject plus the project name, which will be simplecrawler — we're not going to do anything fancy with this one. This command creates a directory named simplecrawler, and it even gives you instructions on how to continue from here: you go into this directory and you can generate a spider with scrapy genspider. But before we do that, I opened this directory in the editor and I'll just go over the files it created. As you can see, it created the scrapy.cfg file; right now you don't need to know anything about this file, because it's mainly for deployment — meaning when you're running Scrapy on a centralized server or something like that. Here under simplecrawler are all of our spider files. All the settings related to Scrapy are configured in the settings.py file, in a simple format, usually SETTING = something; we'll go over them in detail later. Pipelines, middlewares, items — these are Scrapy concepts for how you organize things in your spiders, and here in the spiders directory will be the actual spiders.

So let's go and type scrapy genspider. Now, I'm thinking to show you how you can scrape this one — it's a shop for sports goods, sportsdirect.com. There are plenty of products; we're just going to scrape some product names with prices, maybe images. Let's see how it goes. I'll call the spider sportsdirect, so I run scrapy genspider with the spider name and a domain.
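(For reference, a condensed sketch of the commands from this setup step, assuming the directory and project names used in the video — scraping, scrapyvenv, simplecrawler:)

    mkdir scraping && cd scraping
    python3 -m venv scrapyvenv              # create the virtual environment
    source scrapyvenv/bin/activate          # activate it (bash/zsh)
    which python                            # now points inside scrapyvenv/
    pip install scrapy                      # install Scrapy into the venv
    scrapy startproject simplecrawler       # generate the project skeleton
    cd simplecrawler
    scrapy genspider sportsdirect sportsdirect.com   # generate the spider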
What is this domain? OK — now we have a new file here called sportsdirect.py, and we have the start_urls of the spider, meaning it will start crawling from this URL downwards, or wherever. In allowed_domains we put only the domains we want crawled: when the spider starts crawling and we pass it links to crawl further, deeper down in the site, we don't have to check every link ourselves. Now, I put a full URL here, but it should be a domain name — the spider will not make any requests outside of the allowed domains. And what we have here is a single empty method, parse, which will be called once our start URL is fetched.

Now, the thing is, I don't really want to crawl the whole site from the start page — that's way too many things. I'll just pick a subsection so we have something saner to work with; let's say men's shirts. Here we have a nice product list, so I'm taking this URL and using it as the start URL. And now, the simplest thing I can do is maybe just print what's on this page — this should just print the HTML I got from the spider. How do you run a spider? You just type scrapy crawl and the spider name, sportsdirect. This will start the spider, it will fetch... and what do we have? Oh, OK — see this, with a lot of domains. Obviously they're doing locale redirecting: since I'm in Bulgaria, they redirect me to bg.sportsdirect.com, and since that is not in allowed_domains, it actually cuts off the requests. So let's do it like this — we don't care which exact subdomain it is, we just want everything on sportsdirect.com. We run it again, and what do we have this time? Here, this is our print, because I'm printing the response. If I want the actual HTML, I should do response.text — and now when I run it again, I have a lot of HTML output.

So how do I extract what I need? Right here is where Scrapy makes it really, really easy to extract data. The response object has two ways of extracting data from HTML: one is .css(), which uses CSS selectors, and the other is .xpath(), which selects elements with XPath expressions. Usually, especially if you know a little bit about CSS and HTML, it's much easier to work with the CSS selectors, and that's what I'm going to show you — but they work equally well; it's a matter of personal choice.

Now, what do we want to extract from here? We have a nice grid of products, each with a picture, a brand name, a product name and a price. Let's try extracting the brand of the product and the price as a starter. How do we approach this? I would just inspect one element here — Chrome, and also Firefox, makes this really easy, because you can just hover like this and see the bounding box of a single element. And if I do like this, I select the div that contains all this information; it has a class, s-productthumbbox. If I select this, all the information I need will be inside these divs, because the price, the brand name, the product name, even the picture are under this div. If you keep on drilling down like that, you can even see the inner divs and their names — this is a picture, somewhere here there will be a span or something with the price... yeah, I'm sure there is a div with the price, and so on. And that's another thing Scrapy has that makes it really easy to see how to actually extract the data: it's called the Scrapy shell.
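(Before moving on, here is roughly what the spider looks like at this point — a sketch, where the exact men's-shirts URL is a hypothetical stand-in for the one picked in the video:)

    import scrapy

    class SportsdirectSpider(scrapy.Spider):
        name = 'sportsdirect'
        # a bare domain, not a full URL, so the bg.* locale redirect
        # is not filtered out as an offsite request
        allowed_domains = ['sportsdirect.com']
        # hypothetical subsection URL; the exact path may differ
        start_urls = ['https://www.sportsdirect.com/mens/mens-shirts']

        def parse(self, response):
            print(response.text)   # just dump the HTML for now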
You just run scrapy shell, and it gives you an interactive Python console. In this console you can type normal Python code, and it also gives you a few useful methods — for example, the fetch method. What does fetch do? You just type fetch and the URL you want, and it fetches the URL and creates the response object — the same response object I will get here in my spider's parse method. So now I can use it to experiment with the page and see what's there.

I've already identified the div that has all the data for a single product that I need, and it has the class s-productthumbbox. If I want to select it with a CSS selector — I can even show you here in Google Chrome's console: document.querySelectorAll, and now I type a dot (this is for a CSS class) and the class name... and it gives me a list of a hundred elements. So if I just use this selector, I will get all the product boxes here on this page; I assume there are around 100 products.

So let's see — I'm typing response.css with the same selector, and again, how many? A hundred, OK. I could work on all hundred of them at the same time, but actually I don't want that, so I just save them into a list; let's call it products. So now I have a list of all the divs that contain the products. Actually, this is good so far, so I can just move it here into my spider. Now, I don't want to extract all the products at once; I want them extracted one by one, so I can make a nice object out of each. I'll pick one — products[0], just the first one — and if I just use it, it gives me this div.

Now, the other nice thing about Scrapy selectors: once you get something with a CSS or XPath selector, you can keep drilling down with more CSS or XPath selectors. So let's say I now want to get the brand name. Let's see how the page looks: you can just click with the mouse and select, and you can see this is a span with a class name, productdescriptionbrand. So — here we have a single product — I can use CSS again: dot for the CSS class, then the class name, and I have a span; you see it's the same span, the same text that's on the page here. So how do I get the text out of this span? Here are some magic selectors that are available in Scrapy — they're not actually valid CSS selectors, but they make this kind of text extraction easier. For example, once I select a span or a div or anything, I can just add a double colon and text here in the selector, and you see now the data is only "Pierre Cardin", which is the brand name of the first product, I guess. But I still get only a selector, not a string, and I want a string with the brand name. So what I do is call extract — extract, as the name suggests, gives the textual representation of the selectors, which in this case is just what's contained in the span element.

Now, as you see, it returns a Python list even though it's only one element. You know why this is? Selectors generally are an abstract concept: they may contain a single element or a list of elements. This is just how selectors work — either CSS or XPath selectors can always yield a list of elements. That's why the extract method always returns a list, and if you know there's only one element in the list, or you care only about the first element, you can use extract_first.
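(A sketch of that shell session — the class names are my best reconstruction from the audio and may differ from the live site; extract()/extract_first() are the 2017-era spellings of today's getall()/get():)

    $ scrapy shell
    >>> fetch('https://www.sportsdirect.com/mens/mens-shirts')   # hypothetical URL
    >>> products = response.css('.s-productthumbbox')
    >>> len(products)
    100
    >>> p = products[0]
    >>> p.css('.productdescriptionbrand::text')
    [<Selector xpath='...' data='Pierre Cardin'>]
    >>> p.css('.productdescriptionbrand::text').extract()
    ['Pierre Cardin']
    >>> p.css('.productdescriptionbrand::text').extract_first()
    'Pierre Cardin'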
So extract_first gives me a nice string that contains the brand. How do I transfer this into my spider? What I can do now is iterate over the products — for p in products — and then p will be a separate selector for each product, and I can use this selector to get the brand. Let's see how we get the product name: it's contained in an element with the CSS class productdescriptionname — here, here's the product name. And what else can we get? We can get the price. Again, we inspect the element, and here you can see this one has multiple classes — CurrencySizeLarge, curprice, and so on. I don't need to be very specific about it; I assume curprice actually means "this is the price", because they have several ways of listing a price. Here they have a discounted price, and the element with the real price — not the original, pre-discount one — carries the curprice class; and here, where it's not discounted, let's see what kind of classes it has... again, it has curprice. So by using only curprice we can get the price no matter whether it's discounted or regular. Let's see — if I just select an element that has curprice... OK, and I get the price, in euros. So now that we have this, maybe we can try to print it and see what we have — let's say we print the brand, the product name, the price. OK, cool, let's run our spider and see what happens... and I get plenty of products printed on the console, with all kinds of brands, product names and so on.

OK, this is great — but how do I save this data to disk? Basically, of course, you could just open a file and save it yourself using some Python, but Scrapy gives you a way to manage this for you automatically. What you do is create items — they're called items in Scrapy — and you define them in items.py. So you open it here, and it has already created some kind of item for you, empty, but I'll rename it and call it ProductItem. Here you need to define each field that the item will hold; for example, I have brand, name and price. Now, you see they're defined as scrapy.Field(), but that's largely irrelevant — these are there just so you don't put in extra fields by accident. You can put anything in each of them: a string, a number, a list or a dictionary, whatever. The only thing this class does is define the names of the attributes of the item, which are used for serializing to JSON or CSV or whatever. And how do I use this in my spider? I just go and import ProductItem from items. Once I import it, I can create a new item — item = ProductItem() — and then assign to each attribute: the brand, the name, the price. Nice. All right.

OK, I created the item, but Scrapy doesn't know anything about it yet, so in order to hand it to Scrapy for processing, I yield the item. This uses a concept from Python called generators — I'm not going into details now about what generators are or how they work — but basically, this will push the item into the Scrapy pipeline that processes items and requests; it's a mixed pipeline that decides what to do with each object based on what it represents. And now Scrapy knows about the items, so I rerun the spider... OK, I need to import from my items... oh yeah, relative imports. Let's run it again and see.
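(Putting this step together — a sketch under the same reconstructed class names. In items.py:)

    import scrapy

    class ProductItem(scrapy.Item):
        # field names define what gets serialized to JSON/CSV/...
        brand = scrapy.Field()
        name = scrapy.Field()
        price = scrapy.Field()

(and in the spider, with the relative import fixed as in the video:)

    from ..items import ProductItem

    def parse(self, response):
        for p in response.css('.s-productthumbbox'):
            item = ProductItem()
            item['brand'] = p.css('.productdescriptionbrand::text').extract_first()
            item['name'] = p.css('.productdescriptionname::text').extract_first()
            item['price'] = p.css('.curprice::text').extract_first()
            yield item   # hand the item to Scrapy's pipeline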
On the console, I can see that Scrapy is printing messages like "Scraped from ..." — the URL it was scraped from — followed by a JSON with the actual content of the item. So you see, I got rid of the print, but the items are still only printed to the console; I can't get them onto disk, so it's not very useful to me. The items go to the pipeline, but the pipeline goes nowhere — it just ends up on the console. And here comes a Scrapy setting for where to save the pipeline output, or the feed. There are a few ways to do it; you can just use -s — an option that lets you set any Scrapy setting — and the setting is called FEED_URI. Let's make it products.jl; .jl stands for JSON lines, because what Scrapy will do is save a single JSON object on each line of the file. This makes it easier to process, especially when processing large amounts of JSON data. So let's rerun the spider: we get this printed, and we also get products.jl. Let's open it, and here you can see we have a single JSON object on each line, with the brand, the name and the price — and now we can process this JSON file however we want.

Now that we have our spider saving all the products from a single page, maybe it's good to add crawling all the pages. We've got five pages of products and we want them all, so how can we approach that? One way is to use the "next" link here on the pager — see, we have a single pager... now we have two pages, actually. So let's inspect this next link and see what it's got; basically, when we click that link we go to the next page, so we can scrape the next page too. Let's go back into the shell, fetch the URL, and see if we can find the link. We can see that the link has two classes, swipeNextClick and NextLink. Let's try NextLink: response.css('.NextLink') — and here we get two links, because there is one such link at the top of the page and one at the bottom. They are basically the same, so I can just use the first one. And what do I need out of this thing? I need the href attribute. How do I get it? Here is another special selector: as with the double-colon text, you can use a double colon and attr, and you get the named attribute — in this case href — and then extract. OK, so now I have the URL of the next link.

We can even check what happens on the last page, because we should not yield any link when we're on the last page and there is no "next" to follow. So what can I do? I take this and say next_page equals the expression here — and what happens if I don't have a link? Let's check: I can fetch the last page again and see what happens when this selector matches nothing, or I can simply put in a non-existent selector. And of course this fails, so we should make the logic a little bit smarter. What we'll do is get all the matching selectors and extract the first one only if there is at least one. So basically, with this selector we get all the links that point to the next page; this will be either an empty list or a list of one or two links — well, in this case there will always be two links whenever a next page exists. If the list is empty it will be falsy, so we won't go down this branch; otherwise we can safely extract the first element, and we get the next-page link. The only question now is how we give it to Scrapy so it can keep crawling.
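(A minimal sketch of that check — the NextLink class name is reconstructed from the audio:)

    # all "next page" links: two on a normal page, an empty list on the last one
    next_links = response.css('a.NextLink::attr(href)').extract()
    if next_links:               # empty list is falsy, so we stop on the last page
        next_page = next_links[0]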
This is where we use a new scrapy.Request with the URL we want fetched. Basically, what this does is again push an object into the Scrapy pipeline, but this time it's a request object, so instead of saving it into our JSON file like the items, Scrapy will push it into the HTTP fetching pipeline, where it will fetch the URL and call the parse method again with the newly fetched page. So Scrapy does this automatically for you: it will fetch the URL that we pass with the url parameter, and it will call the same parse method again, because it's the same kind of page and we want to process it in the same way. (There are ways to define different parsing methods for different pages, but we don't need that right now.)

So let's try it — let's see if we can scrape all the pages now. I rerun the spider... by the way, let's see how many products we get. Do we have a hundred products? OK, first I will delete this file, because of how this works: with this feed, where we just give a URI, it will always append to the end of the file, so when you run multiple times you get duplicate items. We want to save into a new file every time we make a brand-new scrape, so I'm deleting the file. I should get the same products, but this time more, because we should have scraped five pages. So I'm running it... and we can see here in the statistics that it scraped only 100. Why? Why didn't it fetch everything? We can see there is some kind of error: "Missing scheme in request URL". This is a gotcha you need to know: you always need to pass a full URL, like this, with the scheme and the domain and everything. You cannot pass relative URLs, because Scrapy doesn't know what the URL is relative to — and we passed a relative URL. So what do we do about it? Scrapy has a way to make a full URL out of a relative one, using response.urljoin. OK, so something called urljoin — I just searched the documentation, and it's under the response object, of course. urljoin will probably join it the way I want.

Let's see — now I remove my products.jl again and rerun the scraper. It seems like this time we're getting more items, and yes, we see item_scraped_count: 413. Let's look at the JSON file... so we got four pages of a hundred items, and I guess on the last page there are only thirteen — that must be why. Let's count them: yes, thirteen. Quite a few items.

OK, this was quite simple, right? Not so complicated to make a scraper that will scrape some products out of a website. Hope you enjoyed this lesson, tutorial, whatever it is.
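(For reference, a sketch of the finished spider as reconstructed from the video — the start URL and the CSS class names are assumptions, and extract_first() is the pre-2.0 spelling of get():)

    import scrapy
    from ..items import ProductItem

    class SportsdirectSpider(scrapy.Spider):
        name = 'sportsdirect'
        allowed_domains = ['sportsdirect.com']
        start_urls = ['https://www.sportsdirect.com/mens/mens-shirts']  # hypothetical

        def parse(self, response):
            for p in response.css('.s-productthumbbox'):
                item = ProductItem()
                item['brand'] = p.css('.productdescriptionbrand::text').extract_first()
                item['name'] = p.css('.productdescriptionname::text').extract_first()
                item['price'] = p.css('.curprice::text').extract_first()
                yield item

            # follow the "next page" link until there is none;
            # urljoin turns the relative href into a full URL, and a
            # Request with no explicit callback goes back to parse()
            next_links = response.css('a.NextLink::attr(href)').extract()
            if next_links:
                yield scrapy.Request(url=response.urljoin(next_links[0]))

(Run it with scrapy crawl sportsdirect -s FEED_URI=products.jl, deleting products.jl first, since the feed appends; newer Scrapy versions use the FEEDS setting instead of FEED_URI.)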
Info
Channel: Coding Bits
Views: 8,364
Rating: 4.86 out of 5
Keywords: scrapy, python, spider, crawler, scraping, css
Id: 4I6Xg6Y17qs
Length: 37min 47sec (2267 seconds)
Published: Mon Jun 26 2017