Python Web Scraping: Scraping APIs using Scrapy

Captions
Hi guys, how are you all doing? In this video I'm going to show you a different form of web scraping: this time we're going to scrape APIs. I'm not going to bore you with the details of what an API is, but there is one thing I want you to remember: when scraping APIs, most of the time (but not always) you won't scrape the HTML markup, meaning you won't use XPath or CSS selectors. Instead, you'll be dealing with a big JSON object that you have to parse using Python. As an example, we can scrape quotes.toscrape.com/scroll. This is a very well-known website and you'll find it in pretty much all the tutorials out there on the internet. We'll start with it to cover the basics, because it's very easy, and then we'll tackle a real-world example.

First things first: on this website there is no pagination, but as we scroll down, new quotes get inserted. This means we're dealing with what we refer to as dynamic pages, in other words pages that depend on JavaScript. But this doesn't necessarily mean we have to use Splash or Selenium to scrape it. Let me show you. I'm going to scroll up first, then open the developer tools with Ctrl+Shift+I and click on the Network tab. Since we want to check whether there is an API, we have to apply the XHR filter, which stands for XMLHttpRequest. Now I'm going to refresh the web page. We have one request: quotes?page=1. Let's open it up. Under the Headers tab, here is the request URL, and as you can tell it's completely different from the page URL: it's quotes.toscrape.com/api/quotes?page=1. One thing to note here is that the API URL is pretty much always different from the actual website URL.

Now let's check the Preview tab. As you can see, we have what we call a JSON object, with some key-value pairs. We'll look at them all later, but what's important here is the quotes key, so let's expand it. We have a couple of objects inside it, and each object represents a quote: we have the author object, which contains the name and the link, then the tags, which is a list, and then the quote text. So this was everything for this video; in the next one I'm going to show you how to scrape the first page of that API.

Okay guys, first of all let's start with a fresh project. Inside my projects folder I'm going to scaffold a new project with scrapy startproject, and I'm going to call it demo_api. Now let's cd into the project folder, cd demo_api, and create a new spider with scrapy genspider. I'm going to call the spider quotes and set the URL as quotes.toscrape.com; we can change it later. One quick remark here: when scraping APIs, always use the basic template. If, for example, you choose the crawl spider template, you won't be able to define meaningful Rule objects, because most of the time there will be no links to follow. Now I'm going to press Enter, and as you can see, the spider has been created. Let me show you a trick: if you want to launch VS Code from the Anaconda prompt itself, you can type code, a space, and a period, then press Enter. Let's keep everything as it is. Back in Chrome, let's copy the request URL; then back in VS Code, let's open the spider file and replace the start URL with the one we just copied.
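At this stage the spider file should look roughly like the sketch below. This is only an approximation: the exact boilerplate that scrapy genspider produces varies slightly between Scrapy versions, and the https scheme is an assumption (the site also serves plain http).

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    # The API endpoint copied from the Network tab,
    # not the regular page URL
    start_urls = ['https://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        pass
```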
There we go. Now, down in the parse method, let's print response.body so we can see the response we get back. Ctrl+S to save the file, then let's open up the integrated terminal (you can go to Terminal and then New Terminal) and launch the spider with scrapy crawl quotes. I'm going to press Enter. Now let's see the response we got back. We have a JSON object, and from that JSON object we want to get the quotes key, because it's the one that contains all the quotes.

So the next thing we need to do is convert, or cast, this JSON object into a Python dict so we can extract whatever we want from it. To do that, we need to import a module called json. At the top, let's import json; then down in the parse method let's define a new variable: resp = json.loads(response.body). Basically, json.loads will convert the JSON object we get from response.body into a Python dict. Now, to get the quotes key, we call resp.get('quotes'). Let's store the result in a variable called quotes and then print quotes. Ctrl+S to save the file, and let's execute again: I'm going to clear everything and rerun the previous command. Now let's check the output. As you can see, we got a list which contains all the quotes, including the author key and the tags key.

Since we have a list, we can loop through all the quotes and extract the data points we want. Let's remove the print line and type: for quote in quotes. Inside the loop I'm going to yield a Python dict, with the first key as author, the second as tags, and the last as the quote text. Back in Chrome, in the Preview tab: for each quote we need to get into the author object, and from the author object we want the name key. So back in VS Code, we call quote.get('author') and then get the name key from that author object. Back in Chrome again: next we want to extract the tags key and then the text key. So we call quote.get('tags'), and finally quote.get('text'). Now Ctrl+S to save the file and execute again. Let's check the output: as you can see, we got all the quotes of the first page, with the author, the tags, and the quote text itself. So this was everything for this video; in the next one I'm going to show you how to handle pagination.
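For reference, here is a sketch of the parse method as built up to this point, with the key names taken from the Preview tab. The output field name quote_text is my own label for the text key; the video only says "the quote text".

```python
import json

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        # Cast the JSON response body into a Python dict
        resp = json.loads(response.body)
        # The 'quotes' key holds the list of quote objects
        quotes = resp.get('quotes')
        for quote in quotes:
            yield {
                'author': quote.get('author').get('name'),
                'tags': quote.get('tags'),
                'quote_text': quote.get('text'),
            }
```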
So we're done with parsing the API; the next step is to handle pagination. Let's scroll down a little until we get to the next page. Here's the request, so let's open it up. We have this key page equal to 2, and we have this key has_next equal to true. This means that when we reach the last page, the has_next key will be set to false. So I'm going to scroll down until I reach the last page: five, six, seven, eight, we still have a couple of pages. Let's check. Here's the last request; let's open it up, and as you can see, has_next equals false. The reason I've been focusing on that key is that we can use it as an indicator in our spider to tell whether there is a next page or not: if yes, we can construct, or build, the next page URL; if it's set to false, we've reached the last page and there is nothing left to scrape.

So back in VS Code, outside the for loop, I'm going to define a new variable: has_next = resp.get('has_next'). Next, we need to check whether this variable is set to true; if so, we need to get to the next page and scrape it. So: if has_next. In that case, we can take the current page number and add one to it. Let me show you: back in Chrome, we get the current page number from the page key and add one to it. Back in VS Code, I'm going to set a variable next_page_number = resp.get('page') + 1. Finally, let's yield scrapy.Request. For the URL I'm going to use Python 3 f-string formatting: we start with f, then quotes, then let's copy the API URL and paste it here, and all we need to do is replace the page number with two curly braces containing the next_page_number variable. Then let's set the callback equal to self.parse. Ctrl+S to save the file, open up the integrated terminal, clear everything, and execute again.

Now let's check the output. First of all, as you can see, item_scraped_count equals 100, which means we got 100 quotes, each with the author, the tags, and the quote text. One quick remark before we finish up this video: not all APIs are the same. They don't share the same JSON object or the same structure; they are all different from each other. That's why, when scraping an API, I highly recommend exploring it first and trying to understand its structure and how it works, because that's the main key that will help you scrape it as quickly as possible.
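Putting everything together, the complete spider might look like the sketch below, under the same endpoint and field-name assumptions as before.

```python
import json

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        resp = json.loads(response.body)
        for quote in resp.get('quotes'):
            yield {
                'author': quote.get('author').get('name'),
                'tags': quote.get('tags'),
                'quote_text': quote.get('text'),
            }

        # has_next is true on every page except the last one,
        # so use it as the stop condition for pagination
        if resp.get('has_next'):
            next_page_number = resp.get('page') + 1
            yield scrapy.Request(
                url=f'https://quotes.toscrape.com/api/quotes?page={next_page_number}',
                callback=self.parse,
            )
```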
Info
Channel: Human Code
Views: 6,709
Rating: 4.8811879 out of 5
Keywords: web scraping, scrapy, API
Id: _q6QlaBakTM
Length: 12min 58sec (778 seconds)
Published: Thu Apr 23 2020