Scraping Amazon With Python: Step-By-Step Guide

Captions
Ever wondered how to extract Amazon product data efficiently? Whether you intend to monitor the performance of your products sold by third-party resellers or track your competitors, you need a reliable web scraper, and today we're going to build one.

To follow along, you will need Python. If you do not have Python 3.8 or above installed, head to python.org and download and install Python. Next, create a folder to save your code files for web scraping Amazon. Once you have a folder, creating a virtual environment is generally a good practice. The following commands work on macOS and Linux. These commands will create a virtual environment and activate it. If you're on Windows, these commands will vary a little, as follows.

The next step is installing the required Python packages. You will need packages for two broad steps: getting the HTML and parsing the HTML to query relevant data. Requests is a popular third-party Python library for making HTTP requests. It provides a simple and intuitive interface to make HTTP requests to web servers and receive responses. The limitation of the Requests library is that it returns the HTML response as a string, which is not easy to query for specific elements such as listing prices while working with the web scraping code. This is where Beautiful Soup steps in. Beautiful Soup is a Python library used for web scraping to pull the data out of HTML and XML files. It allows you to extract information from the page by searching for tags, attributes, or specific text. To install these two libraries, you can use the following command. If you're on Windows, use python instead of python3. The rest of the command remains unchanged. Note that we are installing version 4 of the Beautiful Soup library.

It's time to try out the Requests library. Create a new file with the name amazon.py and enter the following code. Save the file and run it from the terminal. In most cases, you cannot view the desired HTML. Amazon will block this request, and you will see the following text in the response: "To discuss automated access to Amazon data please contact api-services-support@amazon.com." If you print the response status code, you will see that instead of getting 200, which means success, you get 503, which means an error. Amazon knows this request was not using a browser and thus blocks it. It is a common practice employed by many websites. Amazon will block your requests and return an error code in the 500 range or sometimes even in the 400 range.

The solution is simple: you can send the headers along with your request that a browser would. Sometimes sending only the User-Agent is enough. At other times, you may need to send more headers. A good example is sending the Accept-Language header. To identify the user agent sent by your browser, press F12 and open the Network tab. Reload the page, select the first request, and examine the request headers. You can copy this user agent and create a dictionary for the headers. The following example shows a dictionary with the User-Agent and Accept-Language headers. You can send this dictionary to the optional headers parameter of the get method as follows. Executing the code with these changes should show the expected HTML with the product details. Another note is that if you send as many headers as possible, you may not need JavaScript rendering. If you need rendering, you will need tools like Playwright or Selenium.

When web scraping Amazon products, typically you would work with two categories of pages: the category page and the product details page. For example, let's open the over-ear headphones page on Amazon. The page that shows the search results is the category page. The category page displays the product title, product image, product rating, product price, and, most importantly, the product page URLs. If you want more details, such as product descriptions, you will get them
only from the product details page.

Let's examine the structure of the product details page. Open a product URL in Chrome or any other modern browser, right-click the product title, and select Inspect. You will see that the HTML markup of the product title is highlighted. It is a span tag with its id attribute set to productTitle. Similarly, if you right-click the price and select Inspect, you will see the HTML markup of the price. You can see that the dollar component of the price is in a span tag with the class a-price-whole, and the cents component is in another span tag with the class set to a-price-fraction. Similarly, you can locate the rating, image, and description. Once you have this information, add the following lines to the code we have written so far.

Beautiful Soup supports a unique way of selecting tags that uses the find methods. Alternatively, Beautiful Soup also supports CSS selectors. You can use either of these to get the same results. In this guide, we will use CSS selectors, which are universal ways to select elements. CSS selectors work with almost all web scraping tools that can be used for web scraping Amazon product data. We are now ready to use the soup object to query for specific information.

The product name or the product title is located in a span element with its id productTitle. It's easy to select elements using the id, which is unique. See the following code, for example. We send the CSS selector to the select_one method, which returns an element instance. We can extract its text using the text attribute. Upon printing it, you will see that there are a few white spaces. To fix that, add a strip function call as follows.

Scraping the Amazon product rating needs a little more work. First, let's create a selector for the rating. Now the following statement can select the element that contains the rating. Note that the rating value is actually in the title attribute. Lastly, we can use the replace method to get the number. The
product price is located in two places: below the product title and also in the Buy Now box. We can use either of these tags to scrape Amazon product prices. As the price element doesn't have an id, we will have to use a combined CSS selector to get it: span.a-price, narrowed down with a-offscreen. This CSS selector can be passed to the select_one method of Beautiful Soup as follows. You can now print the price.

Let's scrape the default image. This image has the CSS selector landingImage. With this information, we can write the following lines of code to get the image URL from the src attribute.

The next step in scraping Amazon product information is scraping the product description. The methodology remains the same: create a CSS selector and use the select_one method. The CSS selector for the description is as follows. It means that we can extract the element as follows.

One last thing we could scrape from a product page is its reviews. Now, the process of scraping product reviews can be more complex, seeing as one product can have several reviews, not to mention that a single review may feature a lot of information that you might want to capture. Let's start by getting all the review objects. We'll need to find a CSS selector for the product reviews and then use the select method to extract all of them. We can use this selector to identify the reviews and the following code to collect them. This will leave us with an array of all the reviews, over which we'll iterate and gather the required information. We need an array where we can add the processed reviews and a for loop to start iterating.

Let's begin by getting the author's name. The following CSS selector will select the name. We can collect the names in plain text with the following snippet. The next thing to extract is the review rating. It can be found with the following CSS selector. The rating string has some extra text that we won't need, so let's remove that. We
can get the element that contains the title by using this selector. Getting the actual title text will require us to specify the span, as shown below. The review text itself can be found with the following selector and extracted accordingly. One more thing to fetch from the review is the date. It can be found using the following CSS selector. Here is the code that fetches the date value from the object. Finally, we can check if the review is verified or not. The object holding this information can be accessed with this selector and extracted using the following code. Now that we have all this information gathered, let's assemble it into a single object. Then let's add that object to the array of reviews for this product that we created before starting our for loop.

So far, we have explored how to scrape product information. However, to reach the product information, you will begin with product listing or category pages. For example, here is the category page for over-ear headphones. If you examine this page, you will notice that all the products are contained in a div that has a special attribute, data-asin. In that div, all the product links are in an h2 tag. With this in mind, the CSS selector would be as follows. We can read the href attribute of this selector and run a loop. However, note that the links will be relative. You would need to use the urljoin method to turn these links into absolute URLs.

Now let's see how we can handle pagination. The link to the next page is in a link that contains the text "Next". We can look for this link using the contains operator of CSS as follows.

Now let's export Amazon data. The data we're scraping is being returned as a dictionary. This is intentional. We can create a list that contains all the scraped products. This page data can then be used to create a pandas DataFrame object.

Congratulations, you extracted and exported Amazon product
data. Now let's learn some best practices.

Scraping Amazon without proxies or dedicated scraping tools is full of obstacles. Just like many other popular scraping targets, Amazon has rate limiting in place, meaning it can block your IP address if you exceed the established limit. Apart from that, Amazon uses bot detection algorithms that can check your HTTP headers for any suspicious details. Also, you should be ready to constantly adapt to the different page layouts and various HTML structures. Considering these factors, it's recommended to follow some common practices to prevent getting detected and blocked by Amazon. Here are some of the most useful tips.

First, use a real user agent. It's important to make your user agent look as plausible as possible. In the description below, you can find the list of the most common user agents. Second, set your fingerprint. Many websites use Transmission Control Protocol (TCP) and IP fingerprinting to detect bots. To avoid getting spotted, you need to make sure your fingerprint parameters are always consistent. Finally, change the crawling pattern. To develop a successful crawling pattern, you should think about how a regular user would behave while exploring a page and add clicks, scrolls, and mouse movements accordingly. And this is only a small portion of the requirements you should keep in mind when scraping Amazon.

Alternatively, you can turn to a ready-made scraping solution designed specifically for scraping Amazon: Amazon Scraper API. With this scraper, you can scrape and parse various Amazon page types, including search, product, offer listing, questions and answers, reviews, best sellers, and sellers; target localized product data in 195 locations worldwide; retrieve accurate parsed results in JSON format without installing any other library; and enjoy multiple handy features such as bulk scraping and automated jobs.

Let's look at Amazon Scraper API in action. Consider the example of getting product data from product pages. All you need is the product URL, irrespective of
the country of the Amazon store. For example, the following code extracts details for the Bose QC45 from amazon.com. You will get the complete product data returned in JSON format. Another way to get the information is by the ASIN of the product. The only line you need to modify is the payload. Note the optional parameter domain. You can use this parameter to get Amazon data from any domain, such as amazon.co.uk.

Searching for products is very easy. Again, the only code that changes is the payload. Here is the payload for the search for Bose. Notice how it requests 10 pages, beginning with page one. Also, we limit the search to a category ID, which is Amazon's category ID for headphones.

And this is it for today. Let me know in the comments whether you prefer to build your own scraper or use ready-made scraping solutions such as Amazon Scraper API. Thank you, and see you next time.
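
The captions refer to on-screen code that is not part of the transcript. As a rough reconstruction of the parsing steps described above (not the video's actual code), here is a minimal sketch using Beautiful Soup. The selectors for the title, price, image, and listing links follow the names spoken in the captions; the rating selector #acrPopover, the function names, and the sample markup are assumptions made for illustration, and Amazon's real markup may differ or change. The sketch parses an inline HTML string so it runs without a network request; in practice the HTML would come from requests.get with browser-like headers.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the captions.
SAMPLE_HTML = """
<html><body>
  <span id="productTitle">  Example Over-Ear Headphones  </span>
  <span id="acrPopover" title="4.5 out of 5 stars"></span>
  <span class="a-price"><span class="a-offscreen">$199.99</span></span>
  <img id="landingImage" src="https://example.com/img.jpg">
  <div data-asin="X1"><h2><a href="/dp/X1">One</a></h2></div>
  <div data-asin="X2"><h2><a href="/dp/X2">Two</a></h2></div>
</body></html>
"""


def parse_product(html):
    """Extract title, rating, price, and image URL from product-page HTML."""
    soup = BeautifulSoup(html, "html.parser")

    title_el = soup.select_one("#productTitle")
    rating_el = soup.select_one("#acrPopover")  # assumed selector
    price_el = soup.select_one("span.a-price span.a-offscreen")
    image_el = soup.select_one("#landingImage")

    rating = None
    if rating_el is not None:
        # The rating lives in the title attribute, e.g. "4.5 out of 5 stars".
        rating = rating_el.attrs.get("title", "").replace("out of 5 stars", "").strip()

    return {
        "title": title_el.text.strip() if title_el else None,
        "rating": rating,
        "price": price_el.text.strip() if price_el else None,
        "image": image_el.attrs.get("src") if image_el else None,
    }


def parse_listing_links(html, base_url):
    """Collect product links from a category page as absolute URLs."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.select("[data-asin] h2 a"):
        href = anchor.attrs.get("href")
        if href:
            # Listing links are relative, so resolve them against the page URL.
            links.append(urljoin(base_url, href))
    return links


if __name__ == "__main__":
    print(parse_product(SAMPLE_HTML))
    print(parse_listing_links(SAMPLE_HTML, "https://www.amazon.com/s?k=headphones"))
```

Each lookup is guarded against a missing element because select_one returns None when nothing matches, which can happen when a request is blocked or the page layout differs.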
Info
Channel: Oxylabs
Views: 21,560
Id: w3XcMfyUGxY
Length: 23min 2sec (1382 seconds)
Published: Wed Nov 01 2023