Hidden APIs with Scrapy - easy JSON data extraction

Captions
So my favorite way by far to scrape a website is to find the API endpoint using the network tab. That negates the need for any HTML parsing or JavaScript rendering: the data is in JSON format, it's clean, and it's nice and easy to get out. I've shown you how to do this with Requests before, and this time I'm going to show you how to do it with Scrapy.

This is the website we're going to be looking at. I actually covered it in my last video as well, using Requests and Insomnia, so check that out if you like; otherwise, let's move on.

First we open the inspect element tool, head over to the Network tab, and see what's happening. Nothing is there at the moment, so I'll scroll down to the bottom of the page and hit the "load more" button. When I do, we can see a request made to the server. It's a GET request, and it has all of the product information we're after; if we scroll down, we can see it's all here. Now we can just take the URL from this request. It doesn't require any of the headers to work, although you could include them if you wanted to. So I've gone to Headers, into the GET request, and copied that URL. It's quite long, but it will work.

Next, we hit New Project in PyCharm; let's call this one scrape-sghut and have PyCharm create a new virtual environment for it. Opening the terminal, we can see down here that we're inside the virtual environment, the venv, so I'm going to install Scrapy. I generally find that PyCharm is sometimes easier or more convenient to use than VS Code, but all of this code will work in any code editor.

Once that's installed, we can run scrapy startproject; let's call this one sghut-yt, "yt" for YouTube. Scrapy then tells us to cd into the new folder, where we can use genspider, so that's exactly what we'll do: cd in, then run scrapy genspider sunglasses with the site's domain, which I think is sunglasseshut.com.

Now that we've created our spider, the project folder pops up on the left-hand side. I'll go all the way through the tree until I get to sunglasses.py; that's the file we'll be using. We don't need to worry about any of the other Scrapy config in this video, so I'll collapse the tree and make the editor font a hair bigger.

In place of the default start URL, I'm going to paste in that really long one we copied, changing the page parameter from 3 to 1. PyCharm is going to have a fit and say the line is far, far too long, so I'll let it fill the paragraph for me to shorten it up a bit. I'm happy with that: page one is at the end, so we'll be requesting the first page.

Now we can start to work with the response. If I come into our parse function, print out response.body, and run scrapy crawl sunglasses (that's the name we gave the spider), we should get back some data. Moving the terminal up, we can see all this information back here, and it looks exactly like what we just saw in our browser, because it is the same response.
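As a rough sketch, the spider at this point looks something like the following. The "..." placeholder stands in for the long query string copied from the network tab, which isn't shown in full in the video:

```python
import scrapy


class SunglassesSpider(scrapy.Spider):
    name = "sunglasses"
    # In the video this is first typed as "sunglasseshut.com", which
    # trips Scrapy's offsite filter later on; the real domain is:
    allowed_domains = ["sunglasshut.com"]
    # Paste the full API URL copied from the network tab here, with
    # the page parameter changed back to 1. The "..." stands in for
    # the very long query string.
    start_urls = ["https://www.sunglasshut.com/..."]

    def parse(self, response):
        # First sanity check: print the raw body and confirm it
        # matches what the browser's network tab showed.
        print(response.body)
```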
This is the API working for us. What we want to do now is navigate through this load of JSON data and get out the bits we want. In this case I'm going to take all of it, but I'll explain how you could trim it down a bit at the end.

First, we put response.body into a variable: data = json.loads(response.body), and it is loads, "load s", for a string. PyCharm will tell me I need to import json, so I just click on it and the import appears up top.

Now that we have our JSON object in data, we can yield from it in our parse function; we don't want pass anymore, we're going to yield from data. We just need the specific keys from the data we were looking at, so let's quickly head back to the browser and scroll through the response again. The first key is plp view. We can ignore attributes, because we're not interested in that one. Then we have products, then products again, and then product, which is where all the information is.

So back in our code: the first key is plp view, the second is products, the third is products, and the final one, product, is a list, which is why we can yield from it. If I save that and run the crawl again, Scrapy's output should report an item scrape count of 18, and that's because the page size in the URL was set to 18.
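A minimal sketch of that parse function, assuming the key names match their spoken form here; the exact spelling and casing (for example plpView) have to be read off your own response in the network tab:

```python
import json

# (inside the SunglassesSpider class from above)
def parse(self, response):
    # json.loads accepts the raw bytes of the response body.
    data = json.loads(response.body)
    # Drill down through the nested keys seen in the browser. This
    # exact path is an assumption - check your own response. The
    # final "product" key holds a list, which is why yield from works.
    yield from data["plpView"]["products"]["products"]["product"]
```

On recent Scrapy versions, response.json() would parse the body in one call instead of importing json yourself.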
Scrolling up, you can see all the data here. In its simplest form, that's pretty much it, but we can expand on it: we could look at using an Item and the ItemLoader to get rid of some bits of the data we don't want. What I'm going to do now, though, is look at the pagination.

The great thing about getting API responses like this is that somewhere in here there is usually some kind of next page URL, or a next page number, or a total pages count, which we have here as well. So I'm going to check whether this next page URL exists, join it up to the main URL, and go to that page. That way Scrapy does all the pages for us, without us having to say how many there are or do any kind of arithmetic to work it out.

Let's copy that key. We can see the value isn't a full link, only a partial one, but that's not a problem. Back in our code, inside our parse function, we say: next_page is equal to data, then the main key, plp view, then next page url. Let's just double-check that: it's not inside products, and it's not inside attributes, so that's where our next page lives.

Now we can write: if next_page is not None, then next_page = response.urljoin(next_page), which takes that partial link and builds the complete URL for us. Then we yield a scrapy.Request with next_page, which is now a full link, and the callback, which is where the response goes once the request completes, is back to our parse function: self.parse.

This essentially creates our loop. We start from the start URL (which I've just hidden here), the response comes into our parse function, we yield from the JSON to get all the product information, and we find the next page URL; if it exists, we build the full URL, follow it, and come back here again. This loops through all the pages until next_page is None: on the last page of the API there will be no next page key or URL, the check fails, and the crawl stops.

Let's get our terminal back, clear all this up, and run it again. Interestingly, I've made a small error here, which is why our spider is saying it won't visit this website because it's off-site. It isn't actually off-site; I put the wrong URL in allowed_domains. See, I've typed "sunglasseshut" when it's actually "sunglasshut", which is why Scrapy filtered the request out. With that fixed, we run it and get loads of data flashing by, page after page. I'm going to stop it before we go too far without doing anything with the data, and instead run it with -o sunglasses.json, so that we create a JSON file with all of the information in it that we can then work with elsewhere.
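Putting the pagination check together, the whole parse function might look like this sketch. Again, the nextPageUrl key and its exact spelling are assumptions read off the response:

```python
import json

import scrapy

# (inside the SunglassesSpider class from above)
def parse(self, response):
    data = json.loads(response.body)
    yield from data["plpView"]["products"]["products"]["product"]

    # .get returns None whether the key is missing or its value is
    # null, so the last page falls through cleanly either way.
    next_page = data["plpView"].get("nextPageUrl")
    if next_page is not None:
        # Build the full URL from the partial link, then send the
        # next page's response back through this same parse method -
        # this is the loop that walks every page.
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

As an aside, response.follow(next_page, callback=self.parse) would do the urljoin step for you in a single call.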
So we'll run that and have a look at what comes out at the end. There are about 58 pages, so this might take a minute or so; we'll come back when it's done.

It's finished, and we have 1,027 items scraped (not requests, sorry) across the whole thing. If we come to our project folder and open up the sunglasses.json file we just collected, we can see we have all this information, nice and neat, in a JSON file.

Looking at it, some records start with a part number key and some start with a colors number. So what you'd want to do is spend a bit more time working out which bits of information you actually want from each product's JSON, and create an Item, maybe using the ItemLoader, to do something with that. I'll show you that in another video, but for now, that's going to do it for this one. If you've enjoyed this, you should try this video here, which shows you the same thing using Requests.
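That refactor is left for another video, but for a rough idea of the shape it takes: a Scrapy Item is just a declared set of fields, and in parse you'd build and yield one item per product dict instead of yielding the raw JSON. A minimal sketch, with hypothetical field names:

```python
import scrapy


class ProductItem(scrapy.Item):
    # Hypothetical fields - pick whichever bits of each product
    # dict you actually want to keep.
    brand = scrapy.Field()
    model = scrapy.Field()
    price = scrapy.Field()
```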
Info
Channel: John Watson Rooney
Views: 3,427
Rating: 4.973856 out of 5
Id: xjieRVnuPcQ
Length: 9min 59sec (599 seconds)
Published: Sun Aug 08 2021