Massively Speed Up Requests with HTTPX in Python

Captions
What is going on guys, welcome back! In today's video we're going to learn how to massively speed up our web scraping by sending multiple requests concurrently using the httpx module in Python. So let us get right into it.

All right, so when we do some basic web scraping in Python, we usually use the requests module, which is an external Python package. We can install it by opening up the command line and typing pip install requests. Then we send a request, we get a response, and we feed that response into something like Beautiful Soup to extract some elements. That is the basic process of web scraping.

Now, sometimes we might have multiple URLs, not just ten but maybe hundreds or thousands of URLs that we need to send requests to, get responses from, and do some web scraping on. This can be quite tedious if we do it one page after the other, because we send a request, wait for the response, then send the next request, wait for the next response, and so on. It takes quite some time, and it is way more efficient to do this concurrently, simultaneously so to say, instead of waiting for one request to be processed before we start the next one. This is what we're going to learn about in today's video.

But we're going to start with a very basic example first, so that we can compare the runtime of two essentially equivalent scripts: one is going to send multiple requests concurrently, and one is going to do it sequentially. We'll start with the sequential one. We import the core Python module time to measure the execution time, and we import requests, which, as I mentioned, you need to install first. Then we define a basic function, let's call it fetch, and inside that function we define a list of URLs. For this video we're going to use the books.toscrape.com website, which is a website made for web scraping, and what I did here is I prepared a couple of links in advance.
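The basic request-response-soup flow just described could be sketched as follows. This is not code shown in the video: the `beautifulsoup4` package, the `book_titles` helper, and the `h3 > a` CSS selector are assumptions about the books.toscrape.com page layout.

```python
# Basic scraping flow: send a request, get a response, feed it into
# Beautiful Soup (pip install requests beautifulsoup4).
import requests
from bs4 import BeautifulSoup

def book_titles(html):
    # Parse the page and pull out every book title; the "h3 > a"
    # selector is an assumption about books.toscrape.com's markup.
    soup = BeautifulSoup(html, "html.parser")
    return [a["title"] for a in soup.select("h3 > a")]

if __name__ == "__main__":
    response = requests.get("https://books.toscrape.com/")
    print(book_titles(response.text))
```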
I'm just going to copy-paste them here. Essentially what I did is I went to the categories, non-fiction for example, saw that there are six pages, clicked on next to see the URL pattern, and then just generated the links. I'm not going to do that in the video because it's quite simple and quite repetitive, but this is the list. You can choose whatever links you want, and you can also choose a different website to scrape, but I'm going to pick this one for the video because it is a website made for web scraping. Note that we're not going to do any actual web scraping here: we're not going to analyze anything or extract any elements, we're just going to send the requests and get the responses, which would afterwards be fed into something like Beautiful Soup to do the actual scraping.

What we want to do now is say that the results are going to be equal to a list comprehension: requests.get(url) for url in urls. So we're just iterating over all these URLs, sending a GET request for each one, and saving the responses in this results list. Then we can print the results; in this case that will only print the status codes, but of course for an individual result we could also say result.text and feed it into a soup object to do some web scraping.

That is the basic function. Now let's say start = time.perf_counter(), then call fetch, then end = time.perf_counter(), and in the end we print end minus start. Now, how many links are those? Eighteen. Let's see how long it takes to get all the responses.
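Put together, the sequential script might look like the sketch below. The `make_urls` helper and the exact category URL are assumptions about how books.toscrape.com paginates (first page is index.html, later pages are page-2.html and so on); swap in whatever links you generated.

```python
# Sequential version: one request after the other (pip install requests).
import time
import requests

def make_urls(base, pages):
    # First page is index.html, later pages are page-2.html, page-3.html, ...
    return [base + "index.html"] + [
        f"{base}page-{n}.html" for n in range(2, pages + 1)
    ]

def fetch(urls):
    # Send one GET request per URL, waiting for each response
    # before starting the next request.
    return [requests.get(url) for url in urls]

if __name__ == "__main__":
    urls = make_urls(
        "https://books.toscrape.com/catalogue/category/books/nonfiction_13/", 6
    )
    start = time.perf_counter()
    results = fetch(urls)
    end = time.perf_counter()
    print(results)       # prints the status codes, e.g. <Response [200]>
    print(end - start)   # roughly one network round trip per URL
```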
I ran the script, and it is now sending a request to each individual URL, waiting for the response before it sends the request for the next URL, and so on. This takes 11.95 seconds. You can see we have these 200 responses, which means success: we got the content we were looking for, and it took almost twelve seconds.

Now let's do it more efficiently using the module called httpx. For that we open up the command line once more and type pip install httpx, which installs the respective library. Instead of importing requests, we now import asyncio, which is core Python, and httpx. And instead of a plain list comprehension, we do it asynchronously. First of all we define the function to be async, so async def fetch, which means that it can be run concurrently. Then we say async with httpx.AsyncClient() as client, which runs an asynchronous client, and with this client we now send the requests. To not shadow the name of the requests module, we say reqs = [client.get(url) for url in urls], so it is basically the same syntax, but instead of requests we use client, and client is an httpx AsyncClient. Then we say results = await asyncio.gather(*reqs), using the unpacking operator, to gather all the results, and we print the results as before. All we need to do now to run this is call asyncio.run; the rest is the same. Let's run it.
You're going to see that this takes 0.765 seconds, and we have the same 200 responses, which means we did the same work in a way shorter time. In fact, we can open up the calculator: 11.95 divided by 0.765 means the sequential process took more than fifteen times longer than this concurrent one.

So that's it for today's video. I hope you enjoyed it and learned something; if so, let me know by hitting the like button and leaving a comment in the comment section down below. Of course, don't forget to subscribe to this channel and hit the notification bell to not miss a single future video. Other than that, thank you very much for watching, see you in the next video, and bye!
Info
Channel: NeuralNine
Views: 10,630
Keywords: python, networking, asyncio, async, asynchronous requests, httpx, python httpx, python async requests, python httpx requests, python concurrent requests, python async http, http, http requests
Id: mrtsk9B9_Ho
Length: 7min 34sec (454 seconds)
Published: Thu Jan 12 2023