Scrapy Items and Itemloader - Beginner Scrapy Project

Captions
In this video I'm going to show you how to use Scrapy Items and the ItemLoader to make your Scrapy projects better. In my last Scrapy video I got a basic spider up and running, scraping some products from a website, and this is where we left off. As you can see, we are yielding back the name, the price and the link, and doing some minor cleaning of the data here. We also had to put in a try and except, because some of the products had no price (they were sold out) and the code was failing. I'm going to show you how using the items and the item loader can make this much tidier and much better.

The first thing we want to do is come over to the items file. This is a .py file that Scrapy creates for you in your project folder, and we can see that it has already created our class for us. I'll make this bigger so it's easier to see. It has some information in it, so I'm going to uncomment this line, because we do want to use it, and say name is equal to scrapy.Field(). We can see that I have a name field in my spider code as well. I'm going to do the same for price and link: price is again a field, and so is link. There we go. What this is doing is creating a sort of template for a basic item, which is called WhiskeyscraperItem here. That is the default name; you can change it if you want to. I'm going to remove the pass, because we don't need that.

What we can do now is import this class into our main spider, use it to hold the data, and then yield the item itself instead. So if we come back to our code, I'm going to remove the except here, because we're not going to be using that anymore; we have a better way of dealing with it. I'm going to remove the try and move this back over so it's in the right place.

First we want to import our item. To do that we write from, then our project name, which is whiskeyscraper (I can never remember), then .items, and we import our class, which is here, so I'll copy that and paste it in. That is basically saying: from our items .py file, import this class. You can do this a slightly different way as well, but this is the easiest.

Now we want to create an instance of this item class. In our code I'm going to put it above our loop here: item is equal to, again copy the class name, and put the parentheses at the end. That creates an instance of our class that we can then use.

To use it, remove your yield. We're going to keep this information but change it slightly: item, with name in square brackets, is now equal to this bit of information, which is where we extracted the name of the product. We can see we have the ::text here and the .get(), so we can do that. I'm going to do the same for the price, and I'm going to leave in the replace for the pound sign at the moment; when we move on to the item loaders in just a second I'll show you how we can do that in a better way. And again, item link is equal to this. The reason this is item here is that it needs to be the same name as this variable up here. You can call it whatever you like, but obviously you can't give it the same name as the class itself, because you'll get an error and it will confuse everything. I call mine item, or if you had multiple items that you were scraping from multiple parts of a web page, you would name them appropriately. Now that we've got that information in there, when it comes to our yield we simply need to yield the item.

So I'm going to save that and come back over here. I've got my terminal open and we're going to run it; I'm going to remove the output option. Now, this is going to error, because there are some products on this website that don't have a price because they're sold out. So I expect to see the code fail, but we'll be able to scroll up through the text and see what the items look like now that we're putting the data into our item class and yielding that. We can see that it failed, but if I come back up here we can see, for example, the link, name and price, and it has that information in it, and it's still removing the pound sign, which is what we wanted.

That's all well and good, but we still have the same problem where we're not getting data and it's failing, and we still have to put in, for example, our .replace() here. If you had lots of data that you wanted to clean up, this could get really messy really quickly. This is where the item loader comes in. We can set up our item loader and tell it to do everything we want, including removing HTML tags and running our own functions, like replacing a pound sign in a string.

So we go back to our items file and, under import scrapy, we write from scrapy.loader import ItemLoader. Typing is proving difficult for me today, one second. ItemLoader, okay. We need to import a few more things, and I'll show you how they work in just a second. From itemloaders.processors we import TakeFirst and MapCompose; we'll just do those two for now. And also from w3lib.html import remove_tags. There we go. This is the function that will remove the HTML tags: if I hover over it, it says "remove HTML tags only". That's good. If I do the same for TakeFirst, it says it returns the first non-null or non-empty value received, so we'll definitely want that one. And MapCompose is the processor that lets us execute functions on the data.

What we want to do is use all of this inside our scrapy.Field() and get it to do things with the data as it comes in. At the moment, as we grab the data we ask specifically for the text, then get it, then do the replace. That is all very well, but in a large project you'd have to put these in manually everywhere and it would be a mess, a bit of a disaster to be honest. That's why we do it this way: we get the class to do it for us, so every time the loader fills an item of this class, it applies these steps to the fields.

To do that we need two processors: the input processor and the output processor. They do a couple of different things. We say input_processor is equal to MapCompose, because this is the one that lets us execute functions on this field, and give it remove_tags. Then output_processor is equal to TakeFirst(). That was really bad, sorry, it was tedious to watch. I'm going to make this one step smaller so the whole thing fits on the screen and you can see it all. What this line is saying is that when we fill the name field in our spider, which is here, we put the value through MapCompose to remove the HTML tags and then through TakeFirst, which again returns the first non-null value.

I'm going to do exactly the same thing for the price, so I'll copy that and put it in here. But we're also going to add our own function to remove the pound sign, so we don't have to do it manually in our spider code. Just like any normal Python function: def remove_currency, and we say we're giving it a value. This is a simple function, so I'm just going to return value.replace with the pound sign, just like we had in our main code, replaced with nothing (let's remove that space), and then I'm going to do .strip() as well. We can call this function on any item field that might hold a currency to remove the pound sign. You could change this to do anything: removing dashes, removing new lines, or just stripping white space. We're going to leave it like that for now.

Inside our price field I'm also going to put my function in here, so now every time a value goes through, it removes the tags and then executes our function. I'm going to leave the link as it is, because we'll be going directly to the href attribute, which we don't need to change.

There are a few more things we need to do to make the item loader work in our spider. So I'm going to come back, and first we import it: at the top I'll say from scrapy.loader import ItemLoader. Now we want to change a few bits of our code. As we pull the data in, we need to put it into the loader. The information comes from the HTML that we grab, and we give it to the item loader, which does everything we defined: it removes the HTML tags, takes the first non-null value, and then sends the clean value into the item. I'm going to remove this line and put it underneath, in our main loop, because this is where all of our information is coming from.

Every item loader needs a selector, and our selector is product. Think of our loop: if we weren't doing it this way, but like in some of my other videos where I use requests and BeautifulSoup, we'd write for product in something, and then product.find. The product is the selector, so whatever I called it here is what we need to pass in.

I'm going to say l is equal to ItemLoader, and then we need to give it our item class. We called it item before and I'm going to call it that again, so item is equal to this, our item class, and then selector is equal to product. Now we need to change our code here a little bit too. I'm going to leave this here for the moment and remove it later. We want to say l.add_css, which says: here's a CSS selector that we want to use, go and get this bit of information. The first bit we're going to grab is the name, so we ask where that info is, and it's in this part here, so take that. We don't want the text part and the .get(), because we're doing that with our MapCompose instead. That's that one done. Then l.add_css again (always add_css because we're using CSS selectors; if you were using XPath you would use add_xpath) for price, and we grab the price with span.price. Then l.add_css once more, and the last one is link. Because we're doing this one slightly differently, instead of having the .attrib lookup there I'm just going to do it inside the selector with ::attr(href), so we're not touching the value, we just grab it as it is.

Underneath here, let's remove these lines, we don't need them now, and instead of yielding the item we yield l.load_item(). So what we're saying is: item loader, here's our item, this is our selector, go and grab this data. It gets it all, sends it through the loader with all of the data cleaning that we've defined, and spits it out the other side.

Now I've saved that, we go back to our terminal and run this again, and this time I am going to output it, to whiskey.csv. I'll let it run, and hopefully all the information comes out the other side, including products that are missing part of it. The ones that don't have a price will still come through and still get put into the item loader; we just won't have the price there, because it doesn't exist. You can see it flashing by at the moment as it grabs all the pages. I think there are eight or so pages. That's just finished, and we can see that it says item_scraped_count 762. If I go back to VS Code and open the file, we can see we've got all the information in a nice long CSV file. I'd actually already run this, which is why it's doubled up. If I find a line somewhere that doesn't have a price, we'll see there's just no information there. I'm sure there is one. There we go: you can see this one didn't have a price, and the field is just empty because the loader couldn't find anything.

So that's essentially it, guys. That's how you use the item class and the item loader in Scrapy. It's definitely worth doing even for small projects, because if you want to come back and expand them, it's all there as a nice base. Don't forget you can create your own functions and give them to MapCompose, which will execute them on those fields, so if you wanted to do anything different you could.

Hopefully you've enjoyed this video and it's been useful for you in some way. This is the second or third one in my Scrapy series and there are more coming. We're going to be doing more scraping projects. I think it's a really cool way to do things, especially with Splash and the items and item loader that I've shown you; we can do a lot of cool things and extract data. So thank you very much for watching, guys. Like the video if you liked it, don't forget to leave me a comment telling me what you think, and subscribe if this is of interest to you. Thank you very much and I will see you in the next one. Goodbye.
Info
Channel: John Watson Rooney
Views: 8,859
Rating: 5 out of 5
Keywords: scrapy items, scrapy item example, scrapy itemloader, scrapy items.py, scrapy item class, scrapy item fields, scrapy item loader, scrapy item tutorial, basic scrapy project, learn web scraping, web scraping tutorial, scrapy web scraping, scrapy beginner, scrapy tutorial, scrapy lessons, online coding lesson, learn python, scrape websites
Id: wyE4oDxScfE
Length: 15min 6sec (906 seconds)
Published: Wed Dec 23 2020