The Biggest Issues I've Faced Web Scraping (and how to fix them)

Captions
Hey, my name is Forrest. Welcome back. I've done quite a bit of web scraping over the years, and I've definitely gotten my fair share of 403 Forbidden errors, 500 Internal Server errors, CAPTCHAs I didn't account for, or my IP just straight up getting blocked. If you know, you know. Unfortunately, that's the game. Over the years I've learned how to combat these issues: handling complex web technologies like SPAs and AJAX, optimizing my scripts with error handling and adaptive algorithms, utilizing proxy management, AI-driven anonymity, and intelligent rate limiting, and handling extracted data with data storage and big data integration. I know that seems like a lot of buzzwords, but we'll demystify all of it; it's really not as complicated as it sounds once you're on the other side of it. Oh yeah, and doing all of that ethically and legally. I mean, if we have to. What I've done is taken all of that experience, plus some research beforehand just to make sure the way I do things is the actual correct way to do things, and packaged it nicely into this little video. I hope you enjoy it.

To get everyone up to speed, I'm going to quickly (trust me, very quickly) answer these questions: what is web scraping, how does it work, why is it important, and how is it used in the real world? If you already know all of that, feel free to skip around using the timestamps.

Web scraping, simply put, is the process of extracting data from a website. You programmatically send requests to a website, receive the data you specified in your code (if all goes well), parse that data to extract specific data points, and then use said data for whatever need you have. For me, instead of going to every single tech news website and manually pulling articles on software engineering and computer science to curate the DevNotes newsletter, I've written a script that programmatically checks all of those websites based on the categories I've defined and pulls the articles into one consolidated place, so I can go through them and pick what I want to include. That's a real-world example. Other real-world examples include gathering product information for comparison shopping, monitoring stock prices, extracting news articles for sentiment analysis, aggregating real estate listings, and a whole lot more. It's a very useful skill to have, because businesses need this information. They need to leverage publicly available data for analysis, decision-making, and automation, things that could potentially save them or make them millions of dollars, which makes you, the web scraper, a very valuable resource to them.

That's web scraping. Some of the most popular tools are Selenium, Playwright, and Puppeteer, which let you write browser-automation scripts in various programming languages. There are also tools like Beautiful Soup for parsing HTML and XML documents, and a whole lot more; there are a lot of weeds you can get into. What I do is use a tool called Scraping Browser in conjunction with the three I just listed: I put my Puppeteer, Playwright, or Selenium script into it, and it helps manage complex browser interactions, automates proxy rotation to avoid IP bans, and makes sure dynamic content loaded via AJAX or JavaScript is actually rendered and captured, because I've had trouble with that sometimes. I also use a tool called Web Unlocker that helps with bypassing anti-scraping protections: CAPTCHAs, browser fingerprinting, and all that stuff.
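To make that request-parse-extract loop concrete, here's a minimal sketch in Python using the requests and Beautiful Soup libraries. The URL and the CSS selector are hypothetical placeholders, not anything from the video:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target page and selector -- swap in your own.
URL = "https://example.com/articles"

def scrape_article_titles(url: str) -> list[str]:
    # 1. Send the request (a realistic User-Agent avoids some trivial blocks).
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # surface 403/500 errors instead of parsing an error page

    # 2. Parse the returned HTML.
    soup = BeautifulSoup(response.text, "html.parser")

    # 3. Extract the specific data points you care about.
    return [h2.get_text(strip=True) for h2 in soup.select("h2.article-title")]

if __name__ == "__main__":
    for title in scrape_article_titles(URL):
        print(title)
```

Those same three steps (request, parse, extract) underlie everything that follows; the more advanced techniques in this video exist to keep each step working at scale.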
In preparation for this video, I actually contacted the company behind those two tools, as well as an entire suite of other tools: Bright Data. I wanted them to sponsor this video, and they happily obliged. I'm sure you've already heard of them, and I'll talk about them a little more later in the video, but I mention them now because I want to clarify: this video is about web scraping, the issues I've faced, and the solutions I've implemented to, well, solve those issues. I've made this video as if there were no sponsor, just talking from my own knowledge. It just so happens that however many years ago, I went on a trip with a bunch of data scientists, data engineers, and the like, which was sponsored by Bright Data, and ever since then I've been using their suite of tools. They have proxies, they have existing datasets (so you can use their datasets instead of scraping everything yourself), and a whole litany of other things that are just pretty dang good at scraping the web. That's why I use them. But again, in this video you'll learn everything you need to know about these problems and their solutions, regardless of what tools you use.

So that's the overview of web scraping, how it's done, and the tools you use to do it. Now let's get on to the more advanced stuff.

First: handling complex web technologies. Look, I love making websites that are single-page applications, or SPAs, and I love utilizing Asynchronous JavaScript and XML, in other words AJAX (I'm pretty sure that's the only time I've ever said the entire acronym out loud). They make websites more interactive and dynamic, but when it comes to web scraping, I hate them. If you don't know what that is, just scroll all the way to the bottom of this web page right now, like, on YouTube. Scroll down fast and you'll notice that it stops, then it loads some more comments, or some more recommended videos over there. That's dynamic content loading, and I hate it when web scraping because it poses a challenge I just wish I didn't have: instead of being able to go to the website and say "okay, give me all that data," the initial HTML on pages like this doesn't contain all of it. It's loaded asynchronously into the DOM. So what we can do is write a script using Selenium, Playwright, or Puppeteer to navigate to the target website, which, again, loads HTML without all the data we need. Then, in our script, we implement waits or intervals to allow the AJAX-loaded content to appear, and we trigger AJAX calls by interacting with the page, doing things like clicking and scrolling. This is what loads the data we actually need directly into the DOM for us to extract. Even with that, though, I've had some of this content fail to load properly, so what I'll do is port my script into Scraping Browser to make sure the AJAX-loaded content actually renders, or, if I get hit by anti-scraping protections instead, I'll route my requests through Web Unlocker to account for all that.
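Here's a minimal sketch of that wait-and-scroll approach in Python with Playwright; the URL and item selector are hypothetical, and the same idea translates directly to Puppeteer or Selenium:

```python
from playwright.sync_api import sync_playwright

# Hypothetical infinite-scroll page and item selector -- adjust for your target.
URL = "https://example.com/feed"
ITEM_SELECTOR = ".feed-item"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL)

    # Wait for the first batch of AJAX-loaded items to appear in the DOM.
    page.wait_for_selector(ITEM_SELECTOR)

    # Scroll a few times to trigger the AJAX calls that load more content.
    for _ in range(5):
        page.mouse.wheel(0, 10_000)   # simulate a user scrolling down
        page.wait_for_timeout(1_500)  # give the new content time to render

    # Only now is the data we need actually in the DOM to extract.
    items = page.locator(ITEM_SELECTOR).all_inner_texts()
    print(f"scraped {len(items)} items")
    browser.close()
```

The fixed timeout is the crude version; where possible, waiting on a specific selector or network response is more reliable than sleeping for a guessed duration.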
Next up: script optimization, error handling, and adaptive algorithms. We've actually discussed this overall idea in one way or another in three of my last four videos: sometimes you can get away with writing trash code, using incorrect data structures and inefficient algorithms, when your project is on a smaller scale, but on a large-scale project those choices really take their toll. Just like in a typical programming project, the same is true when you're dealing with large-scale data extraction: you have to make sure everything is efficient and reliable.

Let me give you an example. Say you're scraping Amazon or Walmart for price comparisons. These are large websites that want to protect their data, so you'll likely face issues with server timeouts, changes in page structure within the same website (because it's so large), and rate limits. You have to figure out how to address these things if you want any chance of actually scraping this data, and there are many, many ways to do it. To optimize your script, you can use efficient XPath or CSS selectors, which reduces unnecessary parsing and cuts down your workload. To handle errors, you can implement retries for server timeouts, taking the manual aspect out of it, and log unexpected HTML structures for analysis; there's a sketch of this pattern at the end of this section. And finally, you can utilize an adaptive algorithm to detect when a product page's layout changes, then automatically adjust the scraping pattern accordingly. Implementing just these few things will greatly improve the efficiency and effectiveness of your scraping process, which will impress your boss because you got it done so quickly. Or maybe you hold on to that fact, free up some time at work, and now you can read a book or play some video games; I don't care what you do. In all seriousness, it will free up your future time, because your script is now future-proofed in perpetuity. What you do with that time is completely up to you.

Something else that will happen more often when you're scraping data at a large scale: IP bans. Here's another example (you guys know how much I love examples; analogies and examples are how I really come to understand something, so I assume they help y'all too). Say you're scraping a travel site for flight prices. They can see all of these requests coming from the same IP address, they can see the requests coming in at a rate much faster than any human could manage, and they can see you accessing a lot of data. This makes it very easy for them to detect you as a web scraper, flag you, and ban your IP.

The solution? Proxies. But not simply proxies: AI-driven proxy management, using a pool of proxies to distribute requests, which masks your IP address. If nothing else, this is when you really want to use a service like Bright Data, because there's no point trying to whip all of this up yourself. I'm pretty sure they have over 72 million IPs rotating across about 195 different countries. Yeah, don't try to do that yourself. And you can see what else they offer, namely rotating proxies for web scraping at scale. You also want to ensure anonymity, so even if they do somehow detect you, they can't point directly at you; rotating proxies help with that too. And then there's intelligent rate limiting: you dynamically adjust the rate of your requests so the traffic looks a bit more human, which really just helps you avoid triggering those anti-scraping protections we discussed earlier. And when I say you do all of this, I don't mean you personally; I mean you write the algorithm, and the algorithm does it. It learns the optimal rate over time and adjusts the frequency of requests accordingly.
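First, the retry-and-log pattern mentioned above, as a minimal sketch around a plain requests-based scraper. The retry counts, backoff values, and product-title selector are arbitrary illustrative choices, not anything from the video:

```python
import logging
import time

import requests
from bs4 import BeautifulSoup

logging.basicConfig(filename="scraper.log", level=logging.WARNING)

def fetch_with_retries(url: str, max_retries: int = 3) -> str | None:
    """Retry transient failures with exponential backoff instead of failing the whole run."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except (requests.Timeout, requests.ConnectionError) as exc:
            wait = 2 ** attempt  # back off: 2s, 4s, 8s...
            logging.warning("attempt %d on %s failed (%s); retrying in %ss", attempt, url, exc, wait)
            time.sleep(wait)
        except requests.HTTPError as exc:
            # 403/404/500 usually aren't fixed by an immediate retry; log and move on.
            logging.error("giving up on %s: %s", url, exc)
            return None
    return None

def parse_product(html: str) -> dict | None:
    """Log unexpected HTML structures for later analysis instead of crashing."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.product-title")  # hypothetical selector
    if title is None:
        logging.error("unexpected page layout: product-title selector not found")
        return None
    return {"title": title.get_text(strip=True)}
```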
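And here's a minimal sketch of proxy rotation combined with a simple adaptive delay. In practice a service like Bright Data manages the proxy pool for you; the proxy URLs below are placeholders, and the "learning" is reduced to the simplest possible rule: back off when you get rate-limited, creep back up when requests succeed.

```python
import itertools
import time

import requests

# Placeholder proxy pool -- in practice this comes from your proxy provider.
PROXIES = itertools.cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
])

class AdaptiveRateLimiter:
    """Crude stand-in for 'learning the optimal rate': adjust the delay from feedback."""

    def __init__(self, delay: float = 1.0, floor: float = 0.5, ceiling: float = 60.0):
        self.delay, self.floor, self.ceiling = delay, floor, ceiling

    def wait(self) -> None:
        time.sleep(self.delay)

    def record(self, status: int) -> None:
        if status == 429:  # rate-limited: back off hard
            self.delay = min(self.delay * 2, self.ceiling)
        else:              # success: speed back up toward the floor
            self.delay = max(self.delay * 0.9, self.floor)

limiter = AdaptiveRateLimiter()

def fetch(url: str) -> requests.Response:
    proxy = next(PROXIES)  # each request goes out through a different IP
    limiter.wait()
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    limiter.record(response.status_code)
    return response
```

A production version would also randomize delays slightly and retire proxies that keep failing, but the shape is the same: distribute requests across IPs and pace them from server feedback.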
But scraping isn't the entirety of the job. Now you've got to figure out: what the heck do I do with this data? Once you've successfully scraped it, you have to store it, then integrate it for analysis and usage. For example, a market research firm may need to scrape reviews of something from various sites, Yelp over here, Google Reviews over there, whatever else, so they have it all consolidated. And if they're scraping this data, that means there's a lot of it; if there were just three reviews here, four there, and six over there, you could pull those thirteen manually, unless you're a true developer who spends three days automating something that would otherwise take three hours and only needs to be done once.

So the first thing you do is choose a database solution for the large-scale data: NoSQL databases like MongoDB or Cassandra if it's unstructured, or SQL databases like PostgreSQL or MySQL if it's structured, and you implement data partitioning and indexing strategies as needed to improve query performance. Then you use ETL tools to clean, transform, and integrate the scraped data into existing systems (or we should probably say extract, transform, and load the data; that's what ETL actually stands for), ensuring data formats and schemas are consistent for seamless integration. The formats here will typically be JSON, NDJSON, CSV, and XLSX. From there you can leverage big data platforms like Apache Hadoop or Spark for distributed storage and processing. As for delivering said data, you can do it via email for immediate reports, webhooks for real-time data integration, cloud storage solutions like Amazon S3 or Google Cloud Storage for scalable access, SFTP for (obviously) secure file transfers, or Microsoft Azure, if, and really only if, you're already invested in that ecosystem. And now you can finally implement data analytics and business intelligence tools to extract insights from these datasets, using Tableau or Power BI or something.
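As a minimal sketch of that store-then-ETL step, here's the idea using Python's built-in sqlite3 standing in for a production database like PostgreSQL. The sample records, schema, and cleaning rules are all hypothetical:

```python
import sqlite3

# Hypothetical raw scrape output: same product, inconsistent formats across sources.
RAW_RECORDS = [
    {"title": "  Widget Pro ", "price": "$19.99", "source": "siteA"},
    {"title": "Widget Pro", "price": "19.99", "source": "siteB"},
]

def transform(record: dict) -> dict:
    """Clean and normalize so formats and schemas are consistent across sources."""
    return {
        "title": record["title"].strip(),
        "price": float(record["price"].lstrip("$")),
        "source": record["source"],
    }

def load(records: list[dict], db_path: str = "scraped.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (title TEXT, price REAL, source TEXT)"
    )
    # Index the column you query most to improve performance.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_title ON products (title)")
    conn.executemany(
        "INSERT INTO products VALUES (:title, :price, :source)", records
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load([transform(r) for r in RAW_RECORDS])  # extract has already happened: the scrape itself
```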
In doing everything I've laid out here today, you must make sure you do it all ethically and legally. I know I joked about it before, but it's kind of important. For example, if my company needed me to scrape social media sites for consumer sentiment analysis, we'd have to ensure that what I'm doing doesn't violate any privacy laws or the platform's terms of service. I just saw some of you bristle a little, like, "oh, those terms of service are unlawful." That was my next point: some of these terms of service are technically not lawful themselves. If I'm pulling public data displayed on a site without breaching any technical barriers (you know, like a login or something), it may not constitute a legal violation even if it goes against their ToS, because there are protections in place. But guess what, and I know this may come as a surprise to some of y'all: I'm not a lawyer. I'm not here to give legal advice, nor could I even if I wanted to. I'm only here to say that before you start web scraping, make sure you aren't breaking any laws. I don't want a bunch of YouTube comments under this video coming from the iPhone you snuck into jail. That's all I'm going to say, and not just because I'm not a lawyer, but because I use tools that help make sure I can't be breaking these laws. Bright Data, their tools, their platform, they really make sure of this, and they're very proud of it; they have a whole trust center at brightdata.com/trust-center to make sure everything is in compliance, not just legal but ethical (because there's a difference sometimes), when it comes to web scraping, and they make sure I'm completely above board. So that's one more thing off my plate.

And with that, I hope you gained some knowledge here. If you could click the subscribe button, subscribe to the channel, and turn on the bell notifications, I'd greatly appreciate it. Oh, and when I said that, did the subscribe button light up? It's been doing that for some users. Or is it the like button? Let's try that: like the video. Did that work? What about: subscribe to the YouTube channel. Did that work? Also, did you actually hit the like button or the subscribe button, or are you just getting distracted by the animations? Let me know if it worked. Let me know if you actually liked and subscribed. Well, you don't have to let me know, just make sure you do it anyway. I genuinely hope you have a wonderful day. Again, I'm Forrest, Forest, pronounce it how you wish. I'll see you in the next video.
Info
Channel: ForrestKnight
Views: 18,418
Id: vxk6YPRVg_o
Length: 15min 2sec (902 seconds)
Published: Fri Mar 08 2024