Hi! My name is Yelyzaveta and I’m a Content
Manager here at Oxylabs. In today’s video, we will talk about scraping
multiple website URLs with Python. We will discuss synchronous and asynchronous
approaches to scraping multiple URLs and, by providing sample code, explain why asynchronous
web scraping is more beneficial. Since this video is targeted at more experienced
scraping experts, we would recommend checking out our Step-by-step Python Tutorial Video
for those who are just starting to discover the web scraping world. Now, we can finally get started. Before immersing ourselves in the technicalities, it
is important to understand what these two approaches mean and what their main differences
are. In simple terms, the synchronous approach to scraping
multiple sites refers to the process of running one request at a time and moving on to processing
the next site only after the previous one has completely finished processing. Asynchronous web scraping, on the other hand,
is an approach that allows you to run all of the needed requests concurrently by constantly
switching back and forth between the pages. But keep in mind that scraping multiple website
URLs can also be achieved through multiple threads but, in this video, we will specifically
focus on synchronous and asynchronous approaches. So, what is the main difference between sync and async?
As you may have already realized, the main difference between the two approaches is that
synchronous code blocks the next request from running until the current one finishes, while asynchronous code allows scraping multiple URLs
roughly at the same time. This leads us to the main benefit of the asynchronous
approach – great time-efficiency. You no longer have to wait for the scraping
of one page to finish before starting the other. Now, let’s move on to the web scraping tutorial
part and take a look at the implementation of both approaches. In this tutorial, we are going to scrape the URLs defined in urls.csv using a synchronous approach. For this particular use case, the Python “requests”
module is an ideal tool. Let’s start by creating an empty Python file with a main function. Tracking the performance of your script is always a good idea, so the next step is to add code that tracks script execution time. First, record the time at the very start of the script. Then, type in any code that needs to be measured – in this case, a single “print” statement. Finally, calculate how much time has passed by taking the current time and subtracting the time recorded at the start of the script. Once we know how much time has passed, we can print it, rounding the resulting float to two decimal places.
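A minimal sketch of that skeleton might look like this (the placeholder `print` statement simply stands in for the scraping code we’ll add later):

```python
import time


def main():
    start_time = time.time()

    # Any code that needs to be measured goes here;
    # for now, a single placeholder print statement.
    print("Scraping will happen here.")

    # Calculate how much time has passed and round it to two decimal places.
    elapsed = time.time() - start_time
    print(f"Finished in {round(elapsed, 2)} seconds.")


if __name__ == "__main__":
    main()
```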
As you can see, the urls.csv file contains a single column called “url”. That column contains the URLs that have to be scraped for data. Now, we have to open up urls.csv, load it using the csv module, and loop over every URL from the file.
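Assuming urls.csv sits next to the script, that loading loop could look roughly like this (the "url" key must match the header spelling in your own file):

```python
import csv


def main():
    # Open urls.csv and read it with the csv module.
    with open("urls.csv") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            url = row["url"]  # the key must match the column header in urls.csv
            print(url)  # the scrape call will replace this print
```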
Looks like the job is almost done - all that’s left to do is to scrape it! But before we do that, let’s take a look at the data we're scraping. The title of the book “A Light in the Attic” can be extracted from an <h1> tag that is wrapped in a <div> tag with a "product_main" class. What about the product information? All of it can be found in a table with a "table-striped" class, which you can see in the developer tools panel.
Now, let's use what we've learned and create a `scrape` function. The `scrape` function makes a request to the URL we loaded from the csv file. Once the request is done, it parses the response HTML using the BeautifulSoup module. Then we use our knowledge of where the data is stored in the HTML tags to extract the book name into the `book_name` variable and collect all the product information into a `product_info` dictionary.
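Here's one way such a `scrape` function could look, assuming the page layout described above (requests fetches the page and Beautiful Soup parses it):

```python
import requests
from bs4 import BeautifulSoup


def scrape(url):
    # Request the page and parse the returned HTML.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # The book title sits in an <h1> inside the <div> with the "product_main" class.
    book_name = soup.select_one(".product_main h1").text

    # Each row of the "table-striped" table becomes a key/value pair.
    product_info = {}
    for row in soup.select(".table-striped tr"):
        product_info[row.th.text] = row.td.text

    return book_name, product_info
```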
Great, we've scraped the URL! No results are saved yet, however. For that, we need to add yet another function - `save_product`. `save_product` takes two parameters: the book name and the product info dictionary. Since the book name contains spaces, we first replace them with underscores. Finally, we create a JSON file and dump all the info we have into it.
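A minimal `save_product` along those lines might be (the file-naming scheme here is just one possible choice):

```python
import json


def save_product(book_name, product_info):
    # Spaces would make awkward file names, so swap them for underscores.
    json_file_name = book_name.replace(" ", "_")

    # Dump all the collected product info into a JSON file.
    with open(f"{json_file_name}.json", "w") as json_file:
        json.dump(product_info, json_file)
```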
Now, it's time to run the script and see the data. Here, we can also see how much time the scraping took – in this case, it’s 17.54 seconds. For the next step, let’s take a look at the asynchronous Python web scraping tutorial. For this use case, we will use the `aiohttp` module.
Let's start by creating an empty Python file with a main function. Note that the main function is marked as asynchronous. We use the asyncio event loop to prevent the script from exiting until the main function completes.
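A bare-bones version of that asynchronous skeleton might look like this (the timing code mirrors the synchronous script, and `asyncio.run(main())` would work just as well on modern Python):

```python
import asyncio
import time


async def main():
    start_time = time.time()

    # Any code that needs to be measured goes here;
    # for now, a single placeholder print statement.
    print("Async scraping will happen here.")

    elapsed = time.time() - start_time
    print(f"Finished in {round(elapsed, 2)} seconds.")


if __name__ == "__main__":
    # The event loop keeps the script alive until main() completes.
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
```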
It's always a good idea to track the performance of your script. For that purpose, let's add code that tracks script execution time, just like before. First, record the time at the start of the script. Then, type in any code that you need to measure (currently a single `print` statement). Finally, calculate how much time has passed by taking the current time and subtracting the time at the start of the script. Once we know how much time has passed, we print it, rounding the resulting float to two decimal places.
As before, the urls.csv file contains a single column called `url`, which holds the URLs that need to be scraped for data. We open up urls.csv, load it using the csv module, and loop over every URL in the file. Additionally, we need to create an async task for every URL we are going to scrape. Later in the function, we wait for all the scraping tasks to complete before moving on.
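Put together, the task creation and the final wait could look roughly like this (the `scrape` coroutine here is only a placeholder for the real one we'll write next):

```python
import asyncio
import csv


async def scrape(url):
    # Placeholder; the real scraping coroutine is shown further below.
    print(url)


async def main():
    tasks = []

    # Read urls.csv and create one scraping task per URL.
    with open("urls.csv") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            task = asyncio.create_task(scrape(row["url"]))
            tasks.append(task)

    # Wait for all the scraping tasks to complete before moving on.
    await asyncio.gather(*tasks)
```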
All that's left is to scrape it! But before we do that, we need to take a look at the data we're scraping. As we can see, the title of the book can be extracted from an <h1> tag that is wrapped in a <div> tag with a "product_main" class. Let's also take a look at the product information. It seems that all the product information is displayed in a table with a "table-striped" class.
Let's use what we've learned and create a `scrape` function. The `scrape` function makes a request to the URL we loaded from the csv file. Once the request is done, it parses the response HTML using the Beautiful Soup module. Then we use our knowledge of where the data is stored in the HTML tags to extract the book name into the `book_name` variable and collect all the product information into a `product_info` dictionary.
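One way to write that asynchronous `scrape` coroutine with `aiohttp` and Beautiful Soup (the selectors assume the same page layout as in the synchronous example):

```python
import aiohttp
from bs4 import BeautifulSoup


async def scrape(url):
    # Fetch the page without blocking the other tasks.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()

    soup = BeautifulSoup(html, "html.parser")

    # Same selectors as in the synchronous version.
    book_name = soup.select_one(".product_main h1").text
    product_info = {row.th.text: row.td.text for row in soup.select(".table-striped tr")}

    return book_name, product_info
```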
Great, we've scraped the URL! No results are saved yet, however. For that, we need to add yet another function - `save_product`. Just as in the synchronous script, `save_product` takes two parameters: the book name and the product info dictionary. Since the book name contains spaces, we first replace them with underscores. Finally, we create a JSON file and dump all the info we have into it. Now it's time to run the script and see the
data. And now we can finally compare the performance of the two scripts! As you can see, the difference is huge: while the async web scraping script ran the requests in around 3 seconds, it took almost 16 seconds for the synchronous one. So, in today’s video we looked at two approaches to scraping multiple website URLs with Python
- synchronous and asynchronous. With a practical example, we showed that the asynchronous approach to web scraping is more beneficial due to its noticeable time efficiency. If you have any questions about this or any other topic related to web scraping, feel free to leave a comment below or contact us at hello@oxylabs.io. We hope this video was helpful for you and
encourage you to share it on your social media! Thanks for your time and see you in our next
videos!