How To Scrape Multiple Website URLs with Python?

Captions
Hi! My name is Yelyzaveta and I’m a Content Manager here at Oxylabs. In today’s video, we will talk about scraping multiple website URLs with Python. We will discuss synchronous and asynchronous approaches to scraping multiple URLs and, with sample code, explain why asynchronous web scraping is more beneficial. Since this video is targeted at more experienced scraping experts, we recommend checking out our step-by-step Python tutorial video if you are just starting to discover the web scraping world. Now, we can finally get started.

Before immersing ourselves in the technicalities, it is important to understand what these two approaches mean and what their main differences are. In simple terms, the synchronous approach to scraping multiple sites runs one request at a time, moving on to the next site only after the previous one has completely finished processing. Asynchronous web scraping, on the other hand, runs all of the needed requests concurrently, constantly switching back and forth between the pages. Keep in mind that scraping multiple website URLs can also be achieved with multiple threads, but in this video we will focus specifically on the synchronous and asynchronous approaches.

The main difference between sync and async

As you may have already realized, the main difference between the two approaches is that synchronous code blocks the next request from running, while asynchronous code allows multiple URLs to be scraped at roughly the same time. This leads us to the main benefit of the asynchronous approach: great time efficiency. You no longer have to wait for the scraping of one page to finish before starting another.

Now, let's move on to the web scraping tutorial part and take a look at the implementation of both approaches. First, we are going to scrape the URLs defined in urls.csv using the synchronous approach. For this particular use case, the Python `requests` module is an ideal tool.

Let's start by creating an empty Python file with a main function. Tracking the performance of your script is always a good idea, so the next step is to add code that tracks script execution time. First, record the time at the very start of the script. Then, add any code that needs to be measured; in this case, we are using a single `print` statement. Finally, calculate how much time has passed by taking the current time and subtracting the time recorded at the start of the script. Once we know how much time has passed, we can print it, rounding the resulting float to 2 decimal places.

As you can see, the urls.csv file contains a single column called `url`, which holds the URLs that have to be scraped for data. So we open up urls.csv, load it using the csv module, and loop over each and every URL in the file.

It looks like the job is almost done; all that's left to do is to scrape the pages! But before we do that, let's take a look at the data we're scraping. The title of the book, “A Light in the Attic”, can be extracted from an <h1> tag that is wrapped in a <div> tag with a "product_main" class. What about the product information? As the developer tools panel shows, all the product information can be found in a table with a "table-striped" class.

Now, let's use what we've learned and create a `scrape` function. The scrape function makes a request to the URL we loaded from the csv file. Once the request is done, it loads the response HTML using the BeautifulSoup module. Then we use what we know about where the data is stored in the HTML tags to extract the book name into the `book_name` variable and collect all product information into a `product_info` dictionary.

Great, we've scraped the URL! However, no results are saved yet. For that, we need to add yet another function: `save_product`. It takes two parameters: the book name and the product info dictionary. Since the book name contains spaces, we first replace them with underscores. Finally, we create a JSON file and dump all the info we have into it.

Now, it's time to run the script and see the data. Here, we can also see how much time the scraping took – in this case, it's 17.54 seconds.
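Putting all of those steps together, a minimal sketch of the synchronous script might look like the one below. The urls.csv file name, the `url` column, and the "product_main" and "table-striped" classes come from the walkthrough above; the exact row structure of the product table (a <th> label and a <td> value per row) is an assumption about the demo page's markup.

```python
import csv
import json
import time

import requests
from bs4 import BeautifulSoup


def scrape(url):
    # Request the page and parse the response HTML.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # The book title sits in an <h1> wrapped by a <div class="product_main">.
    book_name = soup.select_one("div.product_main h1").text

    # All product information lives in a <table class="table-striped">;
    # each row is assumed to hold a <th> label and a <td> value.
    product_info = {}
    for row in soup.select("table.table-striped tr"):
        product_info[row.select_one("th").text] = row.select_one("td").text

    save_product(book_name, product_info)


def save_product(book_name, product_info):
    # Book names contain spaces, so replace them with underscores
    # before using the name as a file name.
    json_file_name = book_name.replace(" ", "_")
    with open(f"{json_file_name}.json", "w") as book_file:
        json.dump(product_info, book_file)


def main():
    start_time = time.time()  # record the time at the very start

    # Open urls.csv, load it with the csv module, and scrape every URL.
    with open("urls.csv") as csv_file:
        for row in csv.DictReader(csv_file):
            scrape(row["url"])

    # Calculate the elapsed time and print it, rounded to 2 decimal places.
    elapsed = time.time() - start_time
    print(f"Scraped in {round(elapsed, 2)} seconds")


if __name__ == "__main__":
    main()
```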
For the next step, let's take a look at the asynchronous Python web scraping tutorial. For this use case, we will use the `aiohttp` module.

Again, we start by creating an empty Python file with a main function. Note that this time the main function is marked as asynchronous, and we use an asyncio event loop to prevent the script from exiting until the main function completes. As before, it's always a good idea to track the performance of your script, so let's add the same execution-time tracking code: record the time at the start of the script, run the code that needs to be measured (currently a single `print` statement), then calculate how much time has passed by taking the current time and subtracting the time recorded at the start, and print it, rounding the resulting float to 2 decimal places.

Once again, the file contains a single column called `url` with the URLs that need to be scraped for data. We open up urls.csv, load it using the csv module, and loop over each and every URL in the file. Additionally, we need to create an async task for every URL we are going to scrape. Later in the function, we wait for all of the scraping tasks to complete before moving on.

All that's left is to scrape the pages! The data is the same as before: the title of the book can be extracted from an <h1> tag that is wrapped in a <div> tag with a "product_main" class, and all the product information is displayed in a table with a "table-striped" class.

So let's create the `scrape` function. It makes a request to the URL we loaded from the csv file. Once the request is done, it loads the response HTML using the BeautifulSoup module. Then we use what we know about where the data is stored in the HTML tags to extract the book name into the `book_name` variable and collect all product information into a `product_info` dictionary.

Great, we've scraped the URL! However, no results are saved yet, so we again add the `save_product` function, which takes two parameters: the book name and the product info dictionary. Since the book name contains spaces, we first replace them with underscores. Finally, we create a JSON file and dump all the info we have into it. Now it's time to run the script and see the data.
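Here is a matching sketch of the asynchronous version, under the same assumptions about the page markup. One note: the video drives the coroutine with an explicit asyncio event loop, while `asyncio.run()`, used below, is the modern equivalent.

```python
import asyncio
import csv
import json
import time

import aiohttp
from bs4 import BeautifulSoup


async def scrape(url, session):
    # Request the page and parse the response HTML.
    async with session.get(url) as response:
        soup = BeautifulSoup(await response.text(), "html.parser")

    # Same extraction logic as the synchronous version.
    book_name = soup.select_one("div.product_main h1").text
    product_info = {
        row.select_one("th").text: row.select_one("td").text
        for row in soup.select("table.table-striped tr")
    }
    save_product(book_name, product_info)


def save_product(book_name, product_info):
    # Replace spaces before using the book name as a file name.
    json_file_name = book_name.replace(" ", "_")
    with open(f"{json_file_name}.json", "w") as book_file:
        json.dump(product_info, book_file)


async def main():
    start_time = time.time()  # record the time at the start

    # Create an async task for every URL, then wait for all of
    # the scraping tasks to complete before moving on.
    async with aiohttp.ClientSession() as session:
        tasks = []
        with open("urls.csv") as csv_file:
            for row in csv.DictReader(csv_file):
                tasks.append(asyncio.create_task(scrape(row["url"], session)))
        await asyncio.gather(*tasks)

    elapsed = time.time() - start_time
    print(f"Scraped in {round(elapsed, 2)} seconds")


if __name__ == "__main__":
    # The video keeps the script alive with an explicit asyncio event
    # loop until main() completes; asyncio.run() does the same thing.
    asyncio.run(main())
```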
And now we can finally compare the performance of the two scripts! As you can see, the difference is huge: while the async web scraping script ran the requests in around 3 seconds, it took almost 16 seconds for the synchronous one.

So, in today’s video we looked at two approaches to scraping multiple website URLs with Python: synchronous and asynchronous. With a practical example, we showed that the asynchronous approach to web scraping is more beneficial due to its noticeable time efficiency. If you have any questions about this or any other topic related to web scraping, feel free to leave a comment below or contact us at hello@oxylabs.io. We hope this video was helpful, and we encourage you to share it on your social media! Thanks for your time, and see you in our next videos!
Info
Channel: Oxylabs
Views: 21,834
Keywords: how to scrape multiple urls, how to scrape multiple website urls, web scraping multiple urls, scraping multiple urls with python, data extraction from multiple websites, scrape multiple pages with python, python scraping multiple urls, scraping websites using python, scraping urls with python, the difference between sync and async, sync web scraping, async web scraping, scraping multiple websites, multiple url scraping python, scrape multiple urls, requests-html tutorial
Id: Raa9f5kpvtE
Length: 9min 10sec (550 seconds)
Published: Mon Mar 28 2022