Dynamic Site Scraping - Digging Deeper into APIs

Video Statistics and Information

Captions
So I was browsing Reddit and I came across a question where someone was having trouble scraping a site. I opened the site, examined it, and thought it would be a great idea to share this with you. Hi, my name is Upendra, and I'm here to show you one more way to scrape dynamic websites. So let's get started.

The first thing we are going to do is open the developer tools and reload the site. Make sure that you are on the Network tab with the Fetch/XHR filter selected, then just reload the site. This search request looks like it may be something interesting, and yes, there is something here, but not the entire information, so let's scroll down. This site is dynamic, so it loads more items as we scroll, and you can see that as we scroll down, more and more requests are being sent to the same URL.

Let's open one of them. In the Preview tab we can see that it is a JSON response, and we can already see some of the values that vary between requests. For example, there are two parameters here, nbPages and page: nbPages is set to six. If we look at the request just before it, nbPages was also six, but page was different.

Now let's look at the headers. The first thing we can see is that it is a POST request, so we have to make sure that we are sending the correct headers and the correct data in the body. In the request headers, two things stand out. Number one is Accept: application/json, which tells the server "I can accept application/json", so if everything is fine, the response will be JSON. Number two is Content-Type: application/json, which means that we have to send the body in JSON format.
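Replaying the captured request in Python (with the requests library, as the video does later) looks roughly like this. The endpoint URL is a placeholder, and the payload field names (query, page, hitsPerPage) are assumptions modeled on the Algolia-style parameters visible in DevTools:

```python
import json
import requests

# Hypothetical endpoint copied from the Network tab (Fetch/XHR); not the real site.
URL = "https://example.com/api/search"

# The two headers observed in DevTools: we accept JSON back,
# and the body we send is itself JSON.
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}

# Request payload as seen in the "Request Payload" section (field names assumed).
payload = {"query": "rice", "page": 3, "hitsPerPage": 50}

def fetch_page():
    """Replay the POST request the browser made and return the parsed JSON."""
    resp = requests.post(URL, headers=headers, data=json.dumps(payload))
    resp.raise_for_status()  # fail loudly on a non-2xx response
    return resp.json()
```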
Let's come down, and here is the request payload. As you can see, the payload is JSON. Now here is the fun part: we are going to mimic these requests and try to get the first page of data as well. If we come back here, you can see that page number three is there, but can we get the data right from the beginning?

For this we have two options. Number one: right-click, Copy, and "Copy as cURL". cURL is a utility which is built into all Unix systems and macOS, and on Windows 10 I believe it is already built in, but you can install it anyway. So that is one way: copy as cURL, come to the terminal, and paste in the cURL request. Here it is, and we can play around with the data that is being sent. The better alternative, in my opinion, is to use one of the API testing tools, which give you better control over what headers and what data you are sending. The most commonly used API testing tool is Postman, so let's open Postman.

In Postman, you can create a new collection, click on Import, and in the Raw Text tab paste in the cURL command that you just copied. Click on Continue and then click on Import. Now we have set up Postman and imported everything: we can see that all the headers are here, and in the Body tab we can see that this is being sent as JSON. Just to make it readable, we can click on Beautify.

The first thing we are going to do, without making any changes, is click on Send. This confirms that the imported request is working correctly. We can see that we are on page four, the total number of pages is six, the per-page count is 50, and the data is inside this "hits" key. Note that this query was for "rice": we searched for rice and we have 259 results. So what we are going to do is start reducing the current page index. Maybe we can go
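The response fields read off in Postman (page, nbPages, hitsPerPage, nbHits, hits) can be pulled apart like this. The sample dict mirrors the values shown in the video; the field names are assumptions borrowed from Algolia-style search APIs:

```python
# A response shaped like the one Postman displayed after Send
# (values taken from the video; field names are Algolia-style assumptions).
sample_response = {
    "page": 3,            # zero-based, so this is "page four" in the UI
    "nbPages": 6,
    "hitsPerPage": 50,
    "nbHits": 259,        # total matches for the query "rice"
    "query": "rice",
    "hits": [{"name": "item-1"}, {"name": "item-2"}],  # the records themselves
}

def summarize(resp: dict) -> str:
    """One-line summary of where we are in the paginated result set."""
    return (f"page {resp['page'] + 1}/{resp['nbPages']}, "
            f"{resp['hitsPerPage']} per page, {resp['nbHits']} total hits")
```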
to page number two and click on Send. Yes, the data is changing, but note that we sent 2 and we are getting page 3. That means the page index is zero-based: to get to the first page, set the current page index to 0. And yes, now we are on page number one.

Now let's try increasing the page size. Right now it is 50; let's set it to 100 and click Send. Now the number of pages is down to 3, hits per page is 100, and the total is still 259. So what if we send a page size of 300? Awesome: this means that with this one parameter we can get the entire data set in one go. But now I have become a little greedy. What if we set the query to blank and click on Send? Wow, now this is interesting: we got 7923 hits, 27 pages in total. So what I'm going to do is increase the page size a bit. Let's set it to 1000 and see if this site can handle 1000 items per page. Yes, it can. Let's try 8000 directly and see if the website can return all the data in one single request. Awesome. You can see that the total size of the response is 9.29 MB, which is not bad; a web server can handle 9.29 MB, and we have all 7923 items in one single request.

If you're liking this video so far, don't forget to give it a thumbs up, and stick around to the end of the video, because I have an exercise for you whose solution is going to be the next video.

Now, the point is: how do we create the code? This is where Postman actually shines. We have a button for generating the code, and a bunch of language options are available to us. We could use C#; if you want a cURL command, there you have it. In our case we are going to jump directly to Python, and within Python you again have two options: the http.client library or the requests library. I'm going to go with the requests library, so let's copy this code, come to Visual Studio Code, and paste it in. Let me show you what we have right now. This is the URL,
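The payload tweaks described above (zero-based page, blank query, oversized page size) can be sketched as a small helper. Again, the field names mirror the Algolia-style parameters seen in the video and are assumptions for any other site:

```python
# Tweak the payload discovered in DevTools: page is 0-indexed,
# and hitsPerPage can be raised well past the default of 50.
# Field names (query/page/hitsPerPage) are assumptions.

def build_full_export_payload(total_hits: int) -> dict:
    """Build a payload that asks for every record in a single page."""
    return {
        "query": "",                 # a blank query matches everything
        "page": 0,                   # the first page is index 0, not 1
        "hitsPerPage": total_hits,   # e.g. 8000 to cover all 7923 items
    }

payload = build_full_export_payload(8000)
```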
and this is the payload, which contains all the parameters that we adjusted. These are the headers: a dictionary which contains everything. Let me collapse this. I'm going to create a function; let's call it get_json. I've wrapped everything in this function, and instead of printing the response, let's return response.json(). This function is ready.

Let's create one more function, main, and in this function we are going to get our data by calling get_json. Then we are going to create one more function whose purpose is to convert this JSON into a CSV or Excel file, because typically you will need something like that. If you just want JSON, you can save it directly and you are done, but in our case we are going to create a new function, save_to_csv (or export_to_csv), and it is going to accept the data. When we call this function, we will pass only the "hits" key. Inside it, the first thing we do is check whether we actually have data, and then we are going to use pandas. We are going to make use of the json_normalize function of pandas, so this is going to be pd.json_normalize. It takes the data and returns a DataFrame, and then we can simply call DataFrame.to_csv (we could also create an Excel file instead). Finally, we call the main function. Oh, and by the way, let's print something like "done"; that should be enough for now.

Let's run this code and look at the CSV. There we have it: all the items are here, the entire data set. Now, there is one exercise which I want to give you. Here you can see a pictures URL column, which you can use to download the images. Give this a try, and I'll see you in the next video. Bye!
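The script described above can be sketched end to end like this. The endpoint URL is a placeholder, the payload field names are assumptions carried over from the DevTools inspection, and the actual network call is left commented out since it only makes sense against the real site:

```python
import requests
import pandas as pd

URL = "https://example.com/api/search"  # placeholder, not the real endpoint
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}
PAYLOAD = {"query": "", "page": 0, "hitsPerPage": 8000}  # assumed field names

def get_json():
    """POST the captured request and return the parsed JSON body."""
    response = requests.post(URL, headers=HEADERS, json=PAYLOAD)
    response.raise_for_status()
    return response.json()

def save_to_csv(data, path="items.csv"):
    """Flatten a list of (possibly nested) records and write them to CSV."""
    if not data:                   # nothing to save
        return None
    df = pd.json_normalize(data)   # nested dicts become dotted column names
    df.to_csv(path, index=False)
    return df

def main():
    data = get_json()
    save_to_csv(data.get("hits"))  # the records live under the "hits" key
    print("done")

# main()  # uncomment to run the export against a real endpoint
```

Note that json_normalize is what turns a nested field like `{"pic": {"url": ...}}` into a flat `pic.url` column, which is exactly what you want before writing a CSV.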
Info
Channel: codeRECODE with Upendra
Views: 511
Rating: 5 out of 5
Keywords: python web scraping tutorial, Python Web Scraping, web scraping python, how to scrape data, browser scraping, scrape web pages, website scraping, python scraping, screen scraping, data scraping, Scrapy Spider, scrapy splash, web scrapping, CSS Selector, scrapy shell, web scraping, web crawler, webscraping, scrape, scraping, python web scraping, web scraping with python, scraping dynamic sites, web scraping using python, web scraping tutorial
Id: ThKiZjLNN8Y
Length: 9min 2sec (542 seconds)
Published: Sun Sep 12 2021