How to Bypass 403 Forbidden Error When Web Scraping: Tutorial

Video Statistics and Information

Captions
Hi! In this tutorial, I'll explain the 403 Forbidden error you might get when scraping and how to solve it with three different approaches. Stick till the end, as I'll walk you through the Python code for each method so you can get back to web scraping without a hitch.

The 403 Forbidden error is an HTTP response status code meaning the website's server received your request and understood it, but declined to authorize it. Additionally, any subsequent attempts to reauthorize will be rejected. For the average web user, it usually means they have insufficient permissions to access a target web page. However, when the error appears while web scraping, it might imply that the website detected bot activity and thus banned access to the server.

There are three main ways to solve this error: adjusting and rotating user agents, optimizing request headers, and using rotating proxy servers. It's essential to try out the first two solutions, as poorly set up user agents and basic request headers are the primary signs that give away bot activity on the web. So let's analyze each solution.

A user agent is an HTTP header sent to the target website that contains information about the user's device and operating system. The website examines the user agent and decides to either grant the requester access to the site or block the client, in addition to choosing the content that fits the user's device. Fundamentally, user agents include details about the device, operating system, platform, and application being used. Here's an example user agent of a Windows 10 computer that runs the Chrome browser:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36

Now, the first mistake anyone can make when web scraping is not using user agents at all. Every web browser sends a user agent with each web request by passing it in the HTTP header as a string. Therefore, websites expect such information that identifies a web user, and when it's missing, it can raise suspicions, leading to blocked access.

You can set up user agents manually or use real ones that you can find on the internet. The latter option is the best, as user agents follow certain logical rules that can be accidentally broken when set up manually. The device type strictly dictates the operating system and browser information that should be used. For example, a website would block access to a request coming from an iPhone device that's running the Linux operating system and using the Internet Explorer browser. This is simply impossible in reality, so it can signal the website that there's something off with the request, leading to blocked access and the 403 Forbidden error message.

Popular Python libraries, like the requests library, send a user agent that identifies the library by default. The solution is to use and rotate legitimate user agents so that HTTP requests are identified as coming from an organic user. Let's see a code example.
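Here's a minimal sketch of this approach using the requests library; the user agent strings and the httpbin.org test URL are illustrative stand-ins, not taken from the video:

```python
import random
import requests

# A small pool of realistic browser user agents (illustrative examples --
# swap in up-to-date strings from a maintained list).
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.3 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0",
]

url = "https://httpbin.org/headers"  # hypothetical test target

for _ in range(3):
    # Send each request with a different, randomly chosen user agent.
    headers = {"User-Agent": random.choice(user_agents)}
    response = requests.get(url, headers=headers)
    print(response.status_code)
    print(response.request.headers)
```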
This code rotates through a list of different user agents with each request, then prints the status code and the headers that were sent with the request. Using and rotating user agents can reduce the chances of getting the 403 Forbidden error, but it's by no means a guaranteed way to avoid triggering this response. Suppose you're still receiving the error message with your requests. In that case, the target website has a more complex anti-bot detection procedure, and you should consider adjusting the information included in your HTTP request headers.

So let's talk about setting up the right headers for web requests. The main focus should be on their complexity and consistency with the user agent. As every person uses internet browsers to access the web for everyday web surfing, websites also expect specific information in the headers. Modern web browsers send a lot of headers with each request, like the screen size, operating system, and even the user's location. Most of the time, it's enough to come off as an organic web user by using the basic headers, which include Accept, Accept-Language, Accept-Encoding, Referer, and User-Agent. But if you're still getting the 403 error, it might be time to add more headers.

Let's compare three examples. First, we'll send the request using a simple Python script, so the target web page receives only these basic headers (they're also part of the code sketch shown below). Now compare the request headers the website receives from a Chrome browser, and the headers sent from a Firefox browser. It's evident that the browsers send much more information than our scraper, and you can also notice that Chrome and Firefox send slightly different headers. This sort of complexity should be taken into account when optimizing your own request headers. There's a great resource you should bookmark that explains most of the request headers and shows which browsers support them; you can find the link in the video description below. Use that information to optimize your headers and solve the 403 Forbidden error.

As mentioned before, the request headers must also be consistent with the user agent. For instance, the Safari browser, in both its desktop and mobile versions, doesn't include a Sec-Fetch-User header. Therefore, if your user agent specifies that Safari is used and you have included the Sec-Fetch-User header, the target website will suspect that your request isn't coming from an actual Safari user. So it's crucial to be accurate and consistent with the details you send in request headers.

Let's see an example of how you can set up headers in Python to bypass the 403 error. In this code, we're using a Chrome browser user agent together with headers that are supported by Chrome.
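Here's a minimal sketch of what such a request could look like; the header values below are typical of a desktop Chrome build and are illustrative assumptions, with example.com standing in for your target URL:

```python
import requests

# Basic headers plus Chrome-specific Sec-Fetch-* headers, all kept
# consistent with the Chrome user agent below.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get("https://example.com", headers=headers)
print(response.status_code)
```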
Again, this solution can only lower your chances of receiving the 403 error. If the issue persists, the target website might have blocked your IP address, and the best option to try next is proxies. When a server suspects the requester to be a bot, its course of action is often to block the user's IP address, as that's the simplest solution. Even if you have spotlessly optimized your user agent and request headers, it won't make a difference once the IP address is blocked. Therefore, rotating proxy servers are your best bet to overcome IP restrictions, as each web request can be constructed to come from a different IP address, essentially disguising your actual web activity (there's a minimal code sketch at the end of these captions). Check out the video on our YouTube channel to learn how to rotate proxies in Python; we also have a blog post on the same topic, so make sure to take a look at that. You'll find the link to the blog post in the description below.

As you can see, it's quite a hassle to set up a web scraper that bypasses the 403 Forbidden error. For such cases, we have a solution that automatically implements the three methods presented in this video for easy web scraping operations. Web Unblocker is an AI-powered solution that automatically picks the most suitable combination of proxy servers, headers, cookies, and browser attributes for your target website. Besides that, it has many great features, like JavaScript rendering and session control, but in this video we'll highlight its auto-retry feature. The auto-retry function, powered by machine learning, determines whether the retrieved content and the status code are valid. Speaking of the 403 Forbidden error: when the system recognizes this error as an invalid result, it automatically retries the request with a different combination of proxies, headers, cookies, and browser attributes. You don't have to do anything on your side, but if needed, you can always adjust the query parameters as you wish. Web Unblocker is easy to set up and effortless to use, so you can be sure to save precious time for other projects. If you'd like to learn more, you can find a link to our Web Unblocker solution in the description below.

If you found this video helpful, make sure to like and subscribe, and let us know whether you were able to solve the 403 error. Thank you, and until next time!
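As promised above, here's a minimal sketch of rotating proxies with the requests library; the proxy addresses, credentials, and target URL are placeholders, so substitute endpoints from your own provider:

```python
import random
import requests

# Placeholder proxy endpoints -- replace with real ones from your provider.
proxy_pool = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

url = "https://example.com"  # stand-in for your target URL

for _ in range(3):
    # Route each request through a different, randomly chosen proxy.
    proxy = random.choice(proxy_pool)
    proxies = {"http": proxy, "https": proxy}
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        print(proxy, "->", response.status_code)
    except requests.exceptions.RequestException as exc:
        # Dead or blocked proxy; skip it and move on.
        print(proxy, "failed:", exc)
```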
Info
Channel: Oxylabs
Views: 27,627
Keywords: how to bypass 403 forbidden error when web scraping, web scraping, web scraping with python, python web scraping, web scrapping, cybersecurity, bypass403, bypass, 403bypass, #bugcrowd, #toturial, web scraping tutorial, user agent spoofing, 403 forbidden, 403 error, http code 403 bypass, web scraper, data scraping from websites, oxylabs, oxylab, web scraping tool, oxilabs, oxilab, scraping the web, learn to code
Id: JesHXRoJbzw
Length: 11min 34sec (694 seconds)
Published: Fri Mar 24 2023