Async Python Tutorial: Web Scraping Synchronously versus Asynchronously (10x faster)

Captions
This tutorial follows on from our previous one on asynchronous Python from the absolute foundations. Today we'll be using our newfound skills to scrape episode pages of the Joe Rogan Experience podcast, first synchronously and then asynchronously with asyncio and aiohttp. Then we'll write the beginnings of an asynchronous web server and client, which will allow us to build our own fully fledged asynchronous blog with great functionality, including async database calls, async uploads and much, much more. Let's waste no further time and begin.

We're in PyCharm and have created a pure Python project. This is simply a shortcut for creating a Python virtual environment in the usual way, so feel free to use any IDE or editor you like. We'll create two Python modules for later: client and server. Make sure you pip install aiohttp, aiosqlite and pytest-aiohttp, as we'll be taking a look at testing async frameworks later, plus the requests library. The latest stable version of Python is 3.8, but anything 3.7 or higher will do. Async is moving thick and fast and is very exciting, but is liable to great changes between minor versions of Python. The sys.version_info attribute will give you this information as a named tuple, with the major, minor and micro version numbers clearly labelled.

We have a file urls.txt that has 38 URLs, one per line, of some recent podcast episodes. Our first program will download these to our local drive synchronously, i.e. a request will be sent for the first page, the contents downloaded and saved to its own file locally, then once that's done the next request for the next page is sent. We'll take the time just before we send our first request and take the time again after we finish downloading and saving the 38th URL, so we can work out the time elapsed and print it out.

Our first function, download_file, takes a URL as its only argument. We then call the get function of the requests library, assign the result to response, then return its content attribute. The next function we define saves the content to disk, so it takes two arguments: n is what we add to the base string in order to make a unique file name, and content is the content returned by our previous function. We use a context manager to create and open a file in write-binary mode, then write the contents to the file. By using a context manager we ensure that the file is closed after this block is done running. Lastly we have if __name__ == '__main__', so that the subsequent code runs when we run this module directly and doesn't run if we import it.

We're timing things with perf_counter from the built-in time module, which is really interesting in its own right and preferable to using time.time() in my opinion. On any given platform, perf_counter uses the clock with the highest available resolution for measuring a short duration. It includes time elapsed during sleep, and it has an undefined reference point, so only the difference between results of consecutive calls is valid, which happens to be exactly what we're doing here. If you open up the REPL and enter time.perf_counter() you'll see that you don't get seconds past the Unix epoch at all.

On lines 19 to 21 we're getting our URLs by opening urls.txt, then calling readlines, which returns a list with the contents of each line as a separate string. We need some unique file names for saving to disk, though, so one way of doing this is to use enumerate, which gives us a two-item tuple with 0 at index 0 and the first URL at index 1; the next tuple, on the second run through the for loop, gives us 1 at index 0 and the second URL at index 1, and so on. Inside the loop we call our download_file function first, then we call our write_file function.
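Putting the pieces above together, here's a minimal sketch of the synchronous scraper as described in the walkthrough. The function names, the urls.txt input, the enumerate loop and the perf_counter timing follow the video; the output file name pattern is a placeholder of my own.

```python
import time

import requests


def download_file(url):
    # Send a blocking GET request and return the raw bytes of the page.
    response = requests.get(url)
    return response.content


def write_file(n, content):
    # The base file name here is an assumption; any unique name per URL works.
    with open(f"sync_{n}.html", "wb") as f:
        f.write(content)


if __name__ == "__main__":
    # perf_counter has an undefined reference point, so only the
    # difference between two calls is meaningful.
    start = time.perf_counter()

    with open("urls.txt") as f:
        urls = f.readlines()

    # enumerate yields (0, first_url), (1, second_url), ... giving us a
    # unique number for each file name.
    for n, url in enumerate(urls):
        content = download_file(url.strip())
        write_file(n, content)

    elapsed = time.perf_counter() - start
    print(f"Downloaded {len(urls)} pages in {elapsed:.2f} seconds")
```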
Let's give this a go. It works, and as you can see it's going through the URL list one at a time. The whole thing has taken just over 20 seconds, and our folder has now filled up with 38 separate HTML files.

Now let's write the equivalent asynchronous scraper. Requests itself isn't an async library, so that's where aiohttp comes in. We'll start off like we did before, defining our download_file function. The most obvious change is to go from def to async def, as it's an asynchronous function. Recall that with the synchronous version we called requests.get, passing in the URL. The preferred interface for making HTTP requests in aiohttp is ClientSession. As you can see from the documentation, there are many potential arguments that you can alter, but the defaults are all sensible, and in this case we have no need to change any of them. The fundamentals you ought to be aware of are that the session encapsulates a connection pool and that it supports async context managers; we'll make use of this by writing an async with statement. This is good practice to ensure that the session closes itself after the body has finished executing. Context managers can be arbitrarily nested, and within this first with statement we have a second async with, which directs the session we just instantiated to get the URL; the result is available in the body of the context manager as resp, short for response. We can call the read method of resp to get the content we're after, but we need to make sure not to forget await, as it's an async method we're calling. Finally, we return the content.

Secondly, for the write_file function, we'll change it from def to async def again, but apart from that there's nothing in the body of the function that needs changing. You'll recall from the first tutorial in this series that async functions need to be driven by something else; we don't get our desired outcome just by calling them alone. As part of this we'll define an additional async function, scrape_task, that takes in a number for the file name and a URL to download. Here we await downloading the URL, assigning the result to content, then await writing it to disk.

Lastly we have a main async function. In our sync version we made a list containing all of the URLs of interest, then a for loop went through each URL one by one, downloaded it, then wrote it to disk, not entirely dissimilar to our async scrape_task function. Now clearly that for loop goes through each URL one by one, so if that were our strategy in the async version it would defeat the purpose of making everything else async; it wouldn't run any faster, as we would still be handling each URL one by one. So what we need to do in this async version, in the main function, is make an empty list, tasks. Then we run our for loop, going through each URL one by one, but appending scrape_task(n, url) to this list. After we've populated the list with the 38 tasks we want run, we can await asyncio.wait, which is similar to asyncio.gather if you've come across that before; they do have their differences, but we'll cover that another time. All that's left is to asyncio.run our main function, making sure to take the time before and after.
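Here's a corresponding sketch of the asynchronous version as described above, assuming the same urls.txt input and a placeholder output file name. The walkthrough appends the bare scrape_task coroutines to the list; the sketch wraps them in asyncio.create_task, since passing raw coroutines to asyncio.wait is deprecated on Python 3.8+ and rejected on newer versions.

```python
import asyncio
import time

import aiohttp


async def download_file(url):
    # ClientSession encapsulates a connection pool; the nested async with
    # blocks ensure both the session and the response are closed properly.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.read()


async def write_file(n, content):
    # Body is unchanged from the sync version apart from async def;
    # the file name base is a placeholder.
    with open(f"async_{n}.html", "wb") as f:
        f.write(content)


async def scrape_task(n, url):
    # Await the download, then await writing the result to disk.
    content = await download_file(url)
    await write_file(n, content)


async def main():
    with open("urls.txt") as f:
        urls = f.readlines()

    # Build all the tasks first, then hand them to asyncio.wait so they
    # run concurrently rather than one after another.
    tasks = [
        asyncio.create_task(scrape_task(n, url.strip()))
        for n, url in enumerate(urls)
    ]
    await asyncio.wait(tasks)


if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(main())
    print(f"Finished in {time.perf_counter() - start:.2f} seconds")
```

In practice a single ClientSession is usually shared across all requests; the per-call session here simply mirrors the structure described in the video.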
As you can see, this async version took just over two seconds, as opposed to the 20 seconds the synchronous version took. That's a 10x improvement. In the next installment of this async series we'll build our own asynchronous server and blog page, with full command of async database entries, async uploading of files and much, much more.
Info
Channel: Live Python
Views: 12,212
Rating: 4.9480519 out of 5
Keywords: async python, async python 3, asynchronous web scraping tutorial, async web scraping tutorial, asynchronous python tutorial, async python beginners, async python tutorial, python tutorial, python 3 tutorial, python 3, python
Id: 5tWIxBcvy10
Length: 9min 18sec (558 seconds)
Published: Tue Oct 29 2019