Industrial-scale Web Scraping with AI & Proxy Networks

Captions

The internet is packed with useful data, but unfortunately that data is often buried deep within a mountain of complex HTML. The term "data mining" is the perfect metaphor, because you literally have to dig through a bunch of useless, dirty markup to extract the precious raw data you're looking for.

One of the most common ways to make money on the internet is with e-commerce and dropshipping, but it's highly competitive, and you need to know what to sell and when to sell it. Don't worry, I'm not about to scam you with my own dropshipping masterclass. Instead, I'm going to teach you about web scraping with a headless browser called Puppeteer, allowing you to extract data from virtually any public-facing website, even websites like Amazon that don't offer an API. What we'll do is find trending products on websites like Amazon and eBay, build up a data set, then bring in AI tools like GPT-4 to analyze the data, write reviews, write advertisements, and automate virtually any other task you might need. In addition, I'll teach you some tricks with ChatGPT to write your web scraping code way faster, which is historically very annoying code to write.

But first, there's a big problem. Big e-commerce sites like Amazon don't love bot traffic and will block your IP address or make you solve CAPTCHAs if they suspect you're not a human. But that's kind of racist to non-biological life. Luckily, Bright Data, the sponsor of today's video, provides a special tool called the Scraping Browser. It runs on a proxy network and provides a variety of built-in features like CAPTCHA solving, fingerprints, retries, and so on that allow you to scrape the web at an industrial scale. That being said, if you're serious about extracting data from the web, you'll very likely need a tool that does automated IP address rotation, and you can try Bright Data for free using this code.

After you sign up for an account, you'll notice a product called the Web Scraper IDE. We're not going to use it in this video; however, if you're serious about web scraping, it provides a bunch of templates and additional tools that you'll likely want to take advantage of. As a developer myself, I want full control over my workflow, so for that I'm going to use an open source tool from Google called Puppeteer, which is a headless browser that allows you to view a website like an end user and interact with it programmatically by executing JavaScript, clicking on buttons, and doing everything else a user can do. That's pretty cool, but if you use it a lot on the same website, they'll eventually flag your IP and ban you from using it. Then your mom will be pissed that she can no longer order her groceries from walmart.com. That's where the Scraping Browser comes in. It's a remote browser that uses the proxy network to avoid these problems.

To get started, I'm creating a brand new Node.js project with npm, then installing Puppeteer. Well, actually puppeteer-core, which is the automation library without the browser itself, because again, we're connecting to a remote browser. Now go ahead and create an index.js file and import Puppeteer. From there, we'll create an async function called run that declares a variable for the browser itself. Inside this try/catch block, we'll try to connect to the browser. If it throws an error, we'll make sure to console log that error, and then finally, when all of our scraping is done, we'll want to automatically close the browser. You don't want to leave the browser open unintentionally.

Now, inside of try, we're going to await a Puppeteer connection that uses a browser WebSocket endpoint. At this point, we can go to the proxy section on the Bright Data dashboard and create a new Scraping Browser instance. Once created, go to the access parameters and you'll notice a host, username, and password. Back in the code, we can use these values to create a WebSocket URL. You'll have your username and password separated by a colon, followed by the host URL.

Now that we're connected to this browser, we can use Puppeteer to do virtually anything a human can do, programmatically. Let's create a new page and then set the default navigation timeout to two minutes. From there, we can go to any URL on the internet. Puppeteer then has a variety of API methods that can help you parse a web page, like the dollar sign, which feels like jQuery and corresponds to document.querySelector in the browser. It allows you to grab any element in the DOM and then extract text content from it. Or, as an alternative, you can use page.evaluate, which takes a callback function that gives you access to the browser APIs directly. Like here, we can grab the document element and get its outer HTML, just like you might do in the browser console. Let's go ahead and console log the document's outer HTML, and now we're ready to test our scraper out. To make sure everything is working as expected, open up the terminal and run the node command on your file, and you should get the HTML for that page back as a result. Congratulations, you're now ready to do industrial-scale web scraping.

Now I'm going to go ahead and update the code to go to the Amazon Best Sellers page, and my first goal is to get a manageable chunk of HTML. What I'm doing is opening up the browser DevTools in Chrome to inspect the HTML directly until we highlight the list of products that we want to scrape. Ideally, we'd like to get all these products and their prices as a JSON object. You'll notice all the products are wrapped in a div that has a class of a-carousel. We can use that selector as our starting point. Chrome DevTools also has a copy selector feature, which is pretty cool, but usually it's a bit of overkill. Back in the code, we can make sure that the page will wait for that selector to appear, then we can use the dollar sign query selector to grab it from the DOM, and finally evaluate it to get its inner HTML. Now let's go ahead and console log that and run the script once again.

At this point, we have a more manageable chunk of HTML, and I could analyze it myself, but the faster way to get this job done is to use a tool like ChatGPT. We can simply copy and paste this HTML into the chat and ask it to write Puppeteer code that will grab the product title and price and return it as a JSON object. Literally on the first try, it writes some perfect evaluation code that grabs the elements with the proper query selectors and then formats the data we requested as a JSON object. Let's copy and paste that code into the project and then run the node script once again. Now we're in business. We just built our own custom API for trending products on Amazon, and we could apply the same technique to any other e-commerce store like eBay, Walmart, etc.

That's pretty cool, and if we wanted to extract even more data, we could also grab the link for each one of these products, then use the Scraping Browser to navigate there and extract even more data. We loop over each product and use the goto method to navigate to that URL, just like we did before. However, when doing this, I would recommend also implementing a delay of at least two seconds or so between pages, just so you're not sending an overwhelming number of server requests.

Now that we have all this wonderful data, the possibilities are endless. For example, we could use GPT-4 to write advertisements that target different demographics for each one of these products. Or we might want to store millions of products in a vector database, where they could be used to build a custom AI agent of some sort, like an AutoGPT tool that can take this data and then build you an Amazon dropshipping business plan. The bottom line is that if you want to do cool stuff with AI, you're going to need data, but in many cases the only way to get the data you need is through web scraping, and now you know how to do it in a safe and effective way. Thanks for watching, and I will see you in the next one.
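
The connection steps described in the captions might look roughly like the sketch below. It is not the code shown in the video. The wss:// scheme, the @ separator, and the SCRAPER_USERNAME, SCRAPER_PASSWORD, SCRAPER_HOST, and SCRAPER_URL environment variables are assumptions made for illustration; the real host, username, and password come from the access parameters on the dashboard.

```javascript
// Build the WebSocket endpoint: username and password separated by a colon,
// followed by the host. The wss:// scheme and "@" separator are assumptions
// based on standard URL syntax.
function buildEndpoint(username, password, host) {
  return `wss://${username}:${password}@${host}`;
}

async function run(url, endpoint) {
  // Loaded here (not at the top of the file) so buildEndpoint can be used
  // even when puppeteer-core is not installed.
  const puppeteer = require('puppeteer-core');
  let browser;
  try {
    // Connect to the remote browser instead of launching a local one.
    browser = await puppeteer.connect({ browserWSEndpoint: endpoint });
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000); // two minutes
    await page.goto(url);
    // page.evaluate runs the callback inside the page, with browser APIs available.
    const html = await page.evaluate(() => document.documentElement.outerHTML);
    console.log(html);
    return html;
  } catch (err) {
    console.error('scrape failed', err);
  } finally {
    // Always close the browser so it is not left open unintentionally.
    if (browser) await browser.close();
  }
}

// Hypothetical environment variables; nothing runs unless SCRAPER_HOST is set.
if (process.env.SCRAPER_HOST) {
  const endpoint = buildEndpoint(
    process.env.SCRAPER_USERNAME,
    process.env.SCRAPER_PASSWORD,
    process.env.SCRAPER_HOST
  );
  run(process.env.SCRAPER_URL || 'https://example.com', endpoint);
}
```

Saved as index.js in a project where puppeteer-core has been installed with npm, this would be started with the node command, with the environment variables set to your own values.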

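The product extraction and the delayed page-by-page loop from the captions could be sketched as follows, given a Puppeteer page object that is already connected. The ChatGPT-generated code from the video is not part of the captions, so this is only an illustration: the .a-carousel class is taken from the captions, while the li, .product-title, .product-price, and a selectors are hypothetical placeholders to replace after inspecting the page in DevTools.

```javascript
// Hypothetical selectors; only '.a-carousel' comes from the captions.
const SELECTORS = {
  item: '.a-carousel li',
  title: '.product-title',
  price: '.product-price',
  link: 'a',
};

// Runs inside the remote browser via page.$$eval, so it only uses its arguments.
function extractProducts(items, selectors) {
  const text = (node) => (node ? node.textContent.trim() : null);
  return items.map((item) => {
    const link = item.querySelector(selectors.link);
    return {
      title: text(item.querySelector(selectors.title)),
      price: text(item.querySelector(selectors.price)),
      url: link ? link.href : null,
    };
  });
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wait for the product list to appear, then turn it into plain objects.
async function scrapeProducts(page) {
  await page.waitForSelector(SELECTORS.item);
  return page.$$eval(SELECTORS.item, extractProducts, SELECTORS);
}

// Visit each product page in turn, pausing between navigations so the
// target server is not flooded with requests.
async function visitProducts(page, products, delayMs = 2000) {
  const visited = [];
  for (const product of products) {
    if (!product.url) continue; // skip entries without a link
    await page.goto(product.url);
    visited.push(product.url);
    await sleep(delayMs);
  }
  return visited;
}
```

A typical call would be `const products = await scrapeProducts(page)` followed by `await visitProducts(page, products)`. The two-second default matches the delay recommended in the captions. Passing the selectors as an argument keeps the extraction function free of outer variables, which matters because Puppeteer serializes it before running it in the page.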
Info

Channel: Beyond Fireship

Views: 704,762

Id: qo_fUjb02ns

Length: 6min 17sec (377 seconds)

Published: Mon Apr 24 2023