Async Requests Made Simple - Grequests for Web Scraping with Python

Captions
In this video we're going to look at a really simple and easy way to do asynchronous requests in Python. I'm going to be using a package called grequests, which handles a lot of the async work for us, but I think it's important to start with this one because it shows us what we can and can't do asynchronously before we move on to something like aiohttp or requests-html's async support.

We need to send multiple requests to the server to get the information we want back; we could be looking at multiple product pages, or working through pagination. When we scrape synchronously we loop through a list of URLs, and on each one we wait for the data to come back from the server before we carry on with our code. Doing it asynchronously, we can fire off all of those requests at the same time and then wait for the results to come back, which greatly speeds up our code: instead of going one by one, we ask for everything at once and collect the responses as they arrive.

The upside is that it's much faster. The downside is that not everything can be run asynchronously: we can't parse asynchronously. What we can do is take a list of URLs, fetch the data from them asynchronously, and then parse the results we get back. As I said, I'm going to use grequests for this, a nice small library that covers the basics, and I'll run through a quick example of making a load of requests with it so we can see how it performs against doing it synchronously, i.e. one by one. So let's jump into the code and see how fast we can actually make these requests.

Here we are. I've got a basic scraper that runs synchronously. The first part just generates our URL list; I happen to know there are 50 pages here, so we'll be fetching 50 URLs. We loop through each one, create a separate request, and, as I said before, it does one request, waits, then the next one, and the next, and so on. We're pulling out two arbitrary bits of information just to make it a bit more realistic, and we're using time.perf_counter to tell us how long it takes: we record a start time, and the elapsed time is the current time minus the start.

I'm going to run this, and we'll see the data scroll by on the screen as it works through each and every page. You can see it chunking through one page at a time. I'll let it finish... and there we are: it took 17.84 seconds, you can see that just down there. I'll copy that out, close this, and note it underneath: time taken, 17.84 seconds.
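For reference, here is a minimal sketch of the synchronous scraper described above. The target site, the URL pattern, and the CSS selectors are placeholders I've assumed for illustration; the actual page and fields used in the video aren't shown here.

```python
import time
import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern for the 50 paginated pages mentioned in the video
urls = [f"https://example.com/products?page={page}" for page in range(1, 51)]

start = time.perf_counter()

for url in urls:
    r = requests.get(url)                       # blocking: one request at a time
    soup = BeautifulSoup(r.text, "html.parser")
    # two arbitrary bits of data, with placeholder selectors
    title = soup.find("h1")
    price = soup.find("span", class_="price")
    print(title.text.strip() if title else None,
          price.text.strip() if price else None)

print(f"Time taken: {time.perf_counter() - start:.2f} seconds")
```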
What we want to do now is convert this to use grequests instead of requests and take advantage of the async behaviour it gives us access to. I've got an empty file here. The first part is going to be the same, because that's the data we're interested in, but I'm going to remove the requests import and import grequests instead. If you don't have this package you can pip install it; I'll leave a link down below to the PyPI page so you can check it out if you're interested. I'm also going to copy over part of the parse function, although I'll need to modify it a little.

In between those two pieces I'm going to write a new function that gets the data for us. We'll say def get_data and give it our URLs list. Because of the way async works, you need to give it a load of tasks to do and it then runs them for you, so we say the requests we want to make with a tiny bit of list comprehension: grequests.get(link) for link in urls. That's saying that all of these links we're giving it are the pages it needs to go to; those are the tasks we're creating. Then we say our response is grequests.map(requests); map lets us execute those requests, and we just return the responses we get back. Nice and simple, only a few lines of code. You can see it says there that it concurrently converts a list of requests to responses, so we're saying: here's our list of URLs, map them out, get that data for us, and return it all in our response.

Next we need to change our parse function slightly. I'm going to rename its parameter to response so it reads a bit better, and we get rid of the request inside the loop, because we're not making a request every time any more; we've already got all that information. So we say for r in response, and our BeautifulSoup object is built from r.text; I've used r here because it matches the loop variable. The rest can stay exactly the same. I'll also copy the little bit from the bottom, because that's mostly the same too. Let's double check we've got everything: we call it urls, add in our function (I made a typo there, that should match, there we go), and because we're returning our response from get_data, we say response = get_data(urls), the names match up, and then we parse the data with the response it gives us. That should be it.

So we're basically doing exactly the same thing; we just have one more step in here, and that step is our async part. We can't run the parse function asynchronously, because of the nature of the way it works: it's a blocking function and it's always going to take up the whole time. What we can run asynchronously is the getting of the data from the URLs, the requests we're making, which is what we're doing here. If I save and run this now, it's much quicker: it ran the whole thing in 1.91 seconds. So let's note that: time taken, 1.91 seconds. Much, much faster, and that's because this code is basically dependent on the amount of time it takes to make each request to the server and wait for the response. Using async, in this case grequests, we can make a load of requests concurrently, and as the response comes back for each one we store it in the response returned from our function and then parse it all at the same time.
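Here is a sketch of the grequests version walked through above, under the same assumptions about the URL pattern and selectors; the get_data and parse names follow the video, but everything else is illustrative rather than the exact code shown on screen.

```python
import time

import grequests                 # pip install grequests; import it early so
                                 # gevent's monkey patching happens up front
from bs4 import BeautifulSoup

urls = [f"https://example.com/products?page={page}" for page in range(1, 51)]

def get_data(urls):
    # build all the request "tasks" up front with a list comprehension...
    reqs = [grequests.get(link) for link in urls]
    # ...then let grequests.map execute them concurrently and collect responses
    return grequests.map(reqs)

def parse(response):
    for r in response:
        if r is None:            # grequests.map returns None for failed requests
            continue
        soup = BeautifulSoup(r.text, "html.parser")
        title = soup.find("h1")
        price = soup.find("span", class_="price")
        print(title.text.strip() if title else None,
              price.text.strip() if price else None)

start = time.perf_counter()
response = get_data(urls)
parse(response)
print(f"Time taken: {time.perf_counter() - start:.2f} seconds")
```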
If I run it again, it should hopefully be equally quick... okay, 2.9 seconds that time, so we must have got lucky the first run with 1.91; it's probably my network or something like that. But you can see it's coming in at two to three seconds as opposed to 17 to 18 seconds, which is much, much quicker.

So how can we use this in a more realistic scraper? What I'd suggest is: if you're looping through multiple pages, so you're dealing with pagination, use this to go through all the pages and save the responses. Or if you've got lots of products to work through, you could build a list of all the product links: get the first page, then fetch all of those links asynchronously with grequests, parse the responses, and then do the same for the following pages, on and on (there's a rough sketch of this pattern after the end of the transcript).

Coming up in the next few videos I'm going to be looking at more of this sort of thing, but I'm also going to be using aiohttp, which is the main asynchronous request library; there's more to it, but the concepts are basically the same, so that'll be coming out shortly. Then we're going to look at requests-html as well, because some of that can be run asynchronously; it has the async session. Hopefully you guys have enjoyed this video. I've got some more web scraping content coming out real soon; we're going to expand on this and do more scraping, so stick around for that, and if you liked it go ahead and hit subscribe so you get those notifications too. Thank you very much for watching, and I'll see you in the next one. Goodbye.
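As a rough illustration of that product-link idea (not code from the video), here's one way you might fetch a listing page synchronously, collect its product links, and then request all of those product pages concurrently with grequests before parsing. The listing URL and the a.product-link selector are assumptions.

```python
import grequests                 # imported first so gevent's patching is in place
import requests
from bs4 import BeautifulSoup

listing_url = "https://example.com/products"        # hypothetical listing page

# 1. one synchronous request to get the listing page
listing = requests.get(listing_url)
soup = BeautifulSoup(listing.text, "html.parser")

# 2. collect the product links (placeholder selector)
links = [a["href"] for a in soup.select("a.product-link")]

# 3. fetch every product page concurrently with grequests
responses = grequests.map([grequests.get(link) for link in links])

# 4. parse the responses one by one (parsing stays synchronous)
for r in responses:
    if r is None:
        continue
    page = BeautifulSoup(r.text, "html.parser")
    name = page.find("h1")
    print(name.text.strip() if name else None)
```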
Info
Channel: John Watson Rooney
Views: 3,318
Rating: 5 out of 5
Keywords: async scraping, web scraping, python web scraping, scrape fast, grequests, python grequests, simple async http, asynchronous, asynchronous web scraping, learn web scraping, intermediate web scraping
Id: UDATm1CwIR8
Length: 8min 50sec (530 seconds)
Published: Wed Mar 03 2021