Learn Scrapy with a project-based example

Video Statistics and Information

Captions
This is going to be a full scraping project: it's a storefront website, we're going to take some basic product information out of the HTML, and we're going to handle pagination. So let's get started.

I am loving PyCharm at the moment, so that's what we'll be using. I've got a project started inside my virtual environment, so let's pip install scrapy, cd into our project folder, and run scrapy genspider. I'm calling the spider "tiles", and I'll grab the end of the site's URL for its domain. Now if I come over to the file manager I should see my project folder, and here is the spider itself. Let's close the terminal and make the editor nice and big.

Over on the website we can see some products, and if I scroll down we've got different pages. This is the easiest kind of pagination to work with: when you change page, the URL becomes products/page/2, then /3, and so on, and when you go back to page one you can see it change again. So let's set our start URL to products/page/1.

Next we want to look inside the Scrapy shell so we can work out where the bits of information we want are and how we're going to get them out. In the terminal I'll run scrapy shell followed by the URL from up here, paste it in, and run it. Hopefully we don't get any errors, and we should see our 200 response; if I type response, we get a 200 back. Now we need to look at the website and see how we're going to pull the data out of each of these products, so open it back up and use the inspect element tool. I'll move this down and make it full screen.
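The terminal steps from this section, gathered as a sketch. The project and spider names match the video; the storefront domain is a placeholder, since the real site isn't named here, and the startproject step is an assumption about how the Scrapy project was originally created.

```shell
# Inside an activated virtual environment
pip install scrapy

# Scaffold a Scrapy project (assumed step), then generate the "tiles" spider.
# example-tiles-store.com is a placeholder for the real storefront domain.
scrapy startproject tiles_project
cd tiles_project
scrapy genspider tiles example-tiles-store.com

# Explore the page interactively before writing any spider code
scrapy shell "https://example-tiles-store.com/products/page/1"
```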
Down at the bottom, hover over one of the main products so the inspector highlights the whole thing, and we can see we have a ul with the class "products". If I collapse all of these elements, each li represents one product on the page. So what do we want? We know we want to be inside that ul and then each of these li list items. The list items have a really long, convoluted class name, but fortunately with CSS selectors we don't have to worry about that; the ul class is nice and easy. Back in the shell, let's do response.css("ul.products li"): a dot for the class, a space, then the li. This is where all the information lives, so let's save it in a variable: products = response.css("ul.products li").

In Scrapy you can call .get(), which returns the first match with all of its HTML. Scanning through it, there's a lot of information in here, and it's going to be easy to pull it straight out of the HTML: I can already see the price, a name in the alt attribute, and an h2 with the title as well. So let's think about which bits of information we want. I'll make some notes up in my code — we want the SKU, the name, and the price, three pieces of data — just so I can keep track while we're working down in the Scrapy shell. In fact, we can copy the products selector into the spider right away, because we're going to want it there. We can also use for loops in the shell, since it's a normal Python session.
So let's do for p in products:, indent, and then p.css("h2::text").get(). There we go — that's all of the product names. All I've done is loop through each of the li tags we saved into products, match the h2 with a CSS selector, ask for its text, and call .get() to pull it out. Since we know it works, I'll copy it into the spider as name = p.css("h2::text").get().

The next bit of information is the SKU. Going back to products.get() and looking around, we want to find where the item's SKU lives; I've already spotted it in data-product_sku. If you're still looking, open the inspect element again and have a look — there's data-product_id and data-product_sku, and the SKU is the one we want. It sits on the a tag with the "button" class, so in the shell: for p in products: and p.css("a.button").get(). We can see all of those a tags, and they do carry the data-product_sku attribute. Now we just need to access that attribute, so let's try again with the double-colon syntax: ::attr(data-product_sku). That didn't work the first time — ah, it's attr, not attrs. Run it again and there we go. Interestingly, one product didn't actually have a SKU, but the rest do. Now that I know where the SKU information is, we can put that line in the spider as well.

The last piece of information we wanted is the price. I'm going to delete this for now — I'll tidy it all up in a minute, don't worry.
Back to products.get(): there's the price, and it's in this bdi tag, which is interesting. Looking with the inspect tool, there's a span with the WooCommerce amount classes and then the bdi tag that actually holds the value. We could probably grab the bdi tag directly, but going through the price span makes it a bit easier: span.price bdi. So back in our loop, let's do p.css("span.price bdi::text").get() — I missed the end of my quote mark there — and that's all the pricing information. Now that we know that, let's copy it into the spider too.

Right, we're finally done with the shell now; we'll need the terminal again in a second. Back in the spider we now have everything we need to locate the product information, so let's get rid of the pass. We want to return this information somehow, so we're going to yield it all out, and I'm going to change it into a dictionary: the keys are name, sku, and price, with commas between the entries since it is a dictionary. That's the bare bones of our scraper, so let's test it and see where we're at. Back in the terminal — I'll shrink this down so it's a bit tidier — run scrapy crawl tiles, the name of our spider, and see what we get and whether we need any changes. I can see item_scraped_count is 9, which is good, and there is the dictionary we're returning with the name, the SKU, and the price. So we're happy that this is working.
We're only getting the first page, though, so now we need to work on our pagination. Essentially we're going to say: within the response, if you find the link for the next page, follow it and come back to our parse function; if you don't, the crawl ends. To work that out I'm heading back to the shell, because we want to see how the pages work. Scroll down to the little page buttons — this is very common, and there is almost always a "next" link. Inspecting it, we have an li containing an a tag with the "next" class, all inside an unordered list with the class "page-numbers". So we're going to find that element and grab its href.

Back in the parse function: next_page = response.css("ul.page-numbers li a.next::attr(href)").get(). It's quite long and convoluted, but all it says is: find the first element, then the next one inside it, and ask for its href attribute. My first attempt didn't work, so let's find out where I went wrong — ah, I was looking for the list item with the "next" class, but the class is actually on the a tag, obviously, because it's a link. Fix that and we have our next-page link. Now we can say if next_page is not None: — basically, if it exists — and join it onto the base URL with next_page = response.urljoin(next_page).
Once we've got our URL nice and done, we can yield our scrapy.Request for the next page with callback=self.parse — note there are no parentheses on self.parse, because we're passing the function itself, and it's response.urljoin, not the spider. This might look a little more complicated, but all it's doing is looking for that link in the href on the a tag; if it exists, we build the full URL if we need to and create a request for the next page back to this same parse function. It keeps going until it no longer finds that link.

Let's close out of the shell and crawl again — we should now get the items from every page. Hmm, the scraped count is still 9, and this error tells us why: Scrapy skipped the next page because it thinks it's off-site, which means I made a mistake in allowed_domains up at the top — I'd left a trailing slash on the domain. Fix that and run it again, and now you can see the page numbers go by and we get all the pages and all the products: a total scraped count of 25 items, which I believe matches the site. So make sure you have your allowed_domains correct.

This is all well and good, but you might want to consider tidying up the data you get back by using the Item and the ItemLoader. The Item means the data has to conform to the fields we give it, and with the loader we can remove parts of the data we don't need as it comes in — it makes everything a lot neater and tidier. To do that we need to create an Item first, so I'll come over to the folder browser and open up items.py.
Inside items.py it says we can define the models for our scraped items — this keeps everything neat and tidy by actually defining how our data should look. We can leave the generated class name as it is, or change it; you'd want multiple item classes, separated out neatly, if you were scraping different kinds of items from the same website, but we only have the products from this page. So let's fill it in: sku = scrapy.Field(), plus two more fields for name and price.

We could use the Item on its own, but I'm going to use the ItemLoader as well, because that allows a bit more data cleaning and makes things tidier for us. First, import the processors we'll run on our data: from itemloaders.processors import MapCompose, TakeFirst. MapCompose lets us execute functions on the field data, and TakeFirst takes the first value when the selector finds multiples. We're also going to do from w3lib.html import remove_tags — the function that strips the HTML tags from around the information we want. We actually did that already with ::text and .get(), but we're going to do it through the item loader instead; it's just an easier way.

Back in the main spider file, we need to import our item so we can use it. It's a class, so we can import it directly: from ..items import ScrapyTilesItem — the double dots go back up a level to items.py.
We also want the loader itself: from scrapy.loader import ItemLoader. Now I'm going to remove some of the parse code, because we're doubling up: we'll no longer yield a dictionary, we'll load the data into our item and yield the item back instead. I'll leave the old code to one side and remove it when we're done.

First, create an instance of the item loader — the documentation just calls it il, so I'll do the same: il = ItemLoader(item=ScrapyTilesItem(), selector=p). We give it the item we imported from our items file, and because I like CSS selectors, we tell it that our selector is p — the variable we're using all the way down inside the for loop. Now we can just add the fields in: il.add_css() takes the field to load into and the selector for it. So il.add_css("sku", ...) with the SKU selector from before, il.add_css("name", "h2") for the name from the h2 tag, and il.add_css("price", ...) with the price selector. Finally, yield out il.load_item(); the old dictionary yield is now redundant, so get rid of it. What we've done is create an instance of the ItemLoader class using our item — which has the fields we defined over in items.py — and told it our selector is p, because we're inside the for loop over the products.
Now we want to think about what to do with that data when it comes over — which processors to apply to each field in items.py. The SKU was already just a bare number, so we don't need to worry about tags or MapCompose there, but we do want the output processor. You can use an input and an output processor, and here the output one is what we need: output_processor=TakeFirst(). That makes sure we return a single value, because without it the loader would return a list containing just that one item, which we don't want. For the name we do want to remove the tags, so we use the input processor: as the data comes in, we run the remove_tags function on it via MapCompose, so input_processor=MapCompose(remove_tags) — and we still want the output processor as well, so copy the TakeFirst into that scrapy.Field too.

For the price we do exactly the same, but we also want to write our own function to clean the data up. Remove the pass and define remove_currency. If you noticed, when we scraped the price in the shell it was returning without the currency symbol, but doing it this way with remove_tags we'd end up with the symbol included — so in this function we strip the dollar sign and make sure the value is a floating point number: return float(text.replace("$", "")). Then we chain it into the MapCompose: MapCompose(remove_tags, remove_currency).
Note that we don't call the functions inside MapCompose — we pass remove_tags and remove_currency themselves, and that's fine. Again, this might look a little confusing, especially if you've never used the ItemLoader before, but all we're doing is passing the item information we get from the website through here and doing a little bit of data manipulation on it. So let's open up the terminal, run our crawler again, and see if we went wrong anywhere. The data comes flying by, and scrolling up we can see the name, the price, and the SKU returning nice and neatly after going through our Item and ItemLoader. The benefit of this is that if for any reason one of these products didn't have a SKU or a price or something like that, we'd simply pick up whatever information was there — it wouldn't cause the rest of our program to fail.

That's as far as we're going in this video. We've taken the SKU, the name, and the price from every product on every page using Scrapy as our framework, we've handled pagination, and I've shown you how to use the Item and ItemLoader classes to make it easier and neater to pull that data out. In the next video I'll show you how to save this information into a SQLite database. If you haven't used SQLite before, check out my video from last week — I'll put it up here — which shows more about SQLite and how to use it with Python.
Info
Channel: John Watson Rooney
Views: 2,318
Rating: 4.9652176 out of 5
Keywords: Scrapy, Scrappy, scrapy project, scrapy shell, allowed domains, web scraping, python web scraping, scraping framework
Id: pSyiJKdCKtc
Length: 23min 13sec (1393 seconds)
Published: Sun Aug 29 2021