How to Scrape Websites Without Getting Blacklisted or Blocked

Video Statistics and Information

Captions
In this video I'll talk about five tips on how to reduce the chances of being blocked when scraping websites. Let's dive right in.

Hello everyone, this is Octoparse. In the last video I talked about how to extract data on a large scale with Octoparse cloud extraction. When you scrape websites, you may come across a situation where the website asks you to prove that you're not a robot, or even worse, you can't get access to the website at all. This is because the website is trying to identify, or has already identified, you as a scraping bot. Once you've been flagged as a scraper, you're no longer able to visit the website.

We all know that web scraping is a method often used to extract data from websites, and it is much more effective than copying and pasting manually. But some of you may not know that it comes at a price for the site owners. A straightforward example is that web scraping may overload a web server, which may lead to a server breakdown. To prevent such a situation, more and more site owners have equipped their websites with all kinds of anti-scraping techniques, which makes web scraping even more difficult. Nevertheless, there are still some methods we can use to get around blocking. Let's take a look.

Switch user-agents. A user agent is like your ID number: it tells the website which browser is being used, and your browser sends it to every website you visit. If you scrape a website, the site will detect a huge number of requests coming from the same user agent, and this may lead to a block. To prevent getting blocked, you can switch user agents frequently. Many programmers add a fake user agent in the request header or manually create a list of user agents to avoid being blocked. With Octoparse, you can enable automatic user-agent rotation and customize the rotation intervals in your crawler to reduce the risk of being blocked.

Slow down the scraping. Most scrapers try to get data as quickly as possible. However, when a human visits a website, the browsing activity is much slower than that of a robot. Therefore, some websites catch a scraper by tracking its exact speed: once they discover that browsing activity is going on too fast, they will suspect that you're not a human and block you. Naturally, to avoid this, you can add some time delay between requests and reduce concurrent page access to one or two pages at a time. Set up a wait time between each step to control the scraping speed. Better yet, set up a random time delay to make the scraping process look more like it's done by a human. Treat the website nicely and you'll be able to keep scraping it.

Use proxy servers. When a site detects a large number of requests from a single IP address, it will easily block that IP address. To avoid sending all of your requests from the same IP address, you can use proxy servers. A proxy server acts as a middleman: it retrieves data on the internet on behalf of the user. It also allows you to send requests to a website using the IP you set up, masking your real IP address. Of course, if you use a single IP set up in the proxy server, it's still too easy to get blocked. You need to create a pool of IP addresses and use them randomly, routing your requests through a series of different IP addresses. To get rotating IPs, there are many services that can help, such as VPNs. Web scraping tools usually make it fairly easy to set up IP rotation in a crawler. For example, Octoparse local extraction allows users to set up proxies to avoid being blocked: you can set the time interval for IP rotation and enter the IP addresses.
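To make these first three tips concrete, here is a minimal Python sketch of a hand-rolled crawler loop (the video itself uses Octoparse's built-in settings, so this is an illustration, not the tool's method). It rotates user agents, waits a random interval between requests, and routes each request through a proxy pool. The target URL, user-agent strings, and proxy addresses are placeholders you would replace with your own.

```python
import random
import time

import requests

# Placeholder values for illustration only: swap in your own target
# URL, user-agent strings, and proxy endpoints.
URL = "https://example.com/products"

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
]

PROXIES = [
    "http://203.0.113.10:8080",  # placeholder addresses (TEST-NET range)
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch(url):
    # Tip 1: pick a different user agent for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # Tip 3: route each request through a randomly chosen proxy.
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy},
                        timeout=10)

for page in range(1, 6):
    response = fetch(f"{URL}?page={page}")
    print(page, response.status_code)
    # Tip 2: wait a random 2-6 seconds between requests so the
    # access pattern looks more like a human browsing.
    time.sleep(random.uniform(2, 6))
```

Octoparse exposes all three of these knobs through its UI; the sketch just shows what the same ideas look like if you script the crawler yourself.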
Another approach is to use cloud extraction. It's supported by hundreds of cloud servers, each with a unique IP address. When a scraping project is set to execute on the cloud, requests are performed on the target website through various IPs, minimizing the chances of being traced.

Clear cookies. A cookie is like a small document containing helpful information about you and your preferences. For instance, say you are a native English speaker: you open a website and change the preferred language to English. The cookie will help the website remember that your preferred language is English, and every time you open the website, it will automatically switch the preferred language to English. If you are scraping a website constantly with the same cookie, it's easy to be detected as scraping bot activity. Octoparse allows you to clear cookies automatically from time to time: you can either customize the time interval for switching user agents or choose to clear cookies when IPs switch.

Be careful of honeypot traps. Honeypots are links that are invisible to normal visitors, but they exist in the HTML code and can be found by web scrapers. They are traps to detect scrapers by directing them to blank pages: once a visitor browses a honeypot page, the website can tell it's not a human visitor and start throttling or blocking all requests from that client. When building a scraper for a site, it's worth looking carefully to check whether there are any links hidden from users using a standard browser. To precisely click and capture web page content, Octoparse uses XPath to locate specific elements on a page. XPath, the XML Path Language, is a query language used to navigate through elements in an XML document, and all web pages are HTML documents in nature. Octoparse provides an XPath engine so that we can use XPath to locate data on web pages precisely, which helps avoid clicking the fake links.

All right, those are all the anti-blocking techniques we're discussing today. If you found this video useful, would you give it a thumbs up and subscribe to our channel? Thank you so much. What other anti-blocking techniques do you use? Share with us in the comments down below. Our next video, about the Octoparse API, is coming soon. Stay tuned!
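As a rough illustration of the last two tips (again a hand-rolled sketch rather than anything Octoparse-specific), you can discard cookies by starting a fresh requests session, and use an XPath expression that skips links hidden with inline styles or the hidden attribute, which is one common way honeypot links are concealed. The URL is a placeholder, and the XPath only catches inline hiding, not links hidden via external stylesheets.

```python
import requests
from lxml import html

URL = "https://example.com/catalog"  # placeholder target

# Tip 4: use a short-lived session so cookies from earlier requests
# are discarded instead of accumulating into a recognizable profile.
session = requests.Session()
response = session.get(URL, timeout=10)
session.close()  # dropping the session clears its cookie jar

# Tip 5: select only links a human could actually see, skipping
# elements hidden with the "hidden" attribute or inline CSS.
# Note: pages may also hide links via "display: none" (with a space)
# or external stylesheets, which this simple expression won't catch.
tree = html.fromstring(response.content)
visible_links = tree.xpath(
    '//a[@href]'
    '[not(@hidden)]'
    '[not(contains(@style, "display:none"))]'
    '[not(contains(@style, "visibility:hidden"))]'
)
for link in visible_links:
    print(link.get("href"))
```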
Info
Channel: Octoparse
Views: 43,811
Rating: 4.9052525 out of 5
Keywords: web scraping, data extraction, web crawler, data collection, automated data extraction, extract data from website to excel, data extraction from website, data scraper - easy web scraping, web scraping without coding, web scraping tools, data scraping tool, scrape without being blacklisted or blocked, ip blocked, get blacklisted, scrape without being blocked, Switch user-agents, Slow down the scraping, Use proxy servers, Clear cookies, Be careful of honeypot traps
Id: B4VPmdteI5A
Length: 6min 33sec (393 seconds)
Published: Tue Jan 14 2020