NestJS Web Scraping With Puppeteer

Captions
Hey everyone, today I'm going to show you how we can implement web scraping in NestJS using Puppeteer. In this project we're going to scrape Amazon to get a list of information about any products we want, based off of a search query. You can see here I've searched for pillows, and for each product we get a URL, the title of the product, and the price. This is scraped directly from Amazon, and we can change the product query to whatever we want to get a response back with real-time information about pretty much anything on the internet; we can do it easily with NestJS and Puppeteer. You can see here I've just gotten a result set back for all of these different laptops, and with the patterns we'll learn in this video we can scrape anything we want on the web, as long as it's publicly available, which can be extremely useful when building any sort of application. So let's jump right in.

Okay, to get started we're going to use the Nest CLI to initialize our project. As always, if you don't have the Nest CLI, or you want to make sure you're running the latest version of it, run sudo npm install -g @nestjs/cli@latest, enter your password, and let the installation complete. We'll initialize a new project by running nest new; I'll call this project nestjs-scraping, and for our package manager I'll use pnpm. Now we can cd into our project directory and open it up in VS Code. As always, I'm going to keep the completed code for this project in a GitHub repository, and I'll leave a link to that repository in the description so you can follow along and check out the completed project there.

We have the default Nest project out of the box: a main.ts file with a bootstrap function that bootstraps our application and listens for incoming HTTP traffic on port 3000 by default, and one app.controller with a single GET route that simply returns a "Hello World!" string from the app.service. We can go back to our terminal window and run pnpm run start:dev to make sure our application starts up properly, and you can see our Nest app has successfully started. Then I'll open up Postman so we can launch a test request at our server: a GET request at localhost:3000, where our server is running, on the root path. We can see we get the "Hello World!" response returned from that one GET route in our app controller.

Let's go ahead and remove the app.controller files that we won't be using, as well as the app.service, and then in the app.module remove the references to those files, since we're no longer using them. Then I'll open up a terminal window in VS Code so we can use the Nest CLI to generate a new module called amazon; this will be the module responsible for holding all of our Amazon scraping code. You can see the Nest CLI has created an Amazon module for us and added it to the imports array in our app module. Nothing is in the Amazon module yet, so let's use the CLI to generate a controller called amazon as well; by default this goes into the amazon folder and is also registered in our Amazon module. Then we'll do this one last time to generate a service called amazon too.
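If you'd like a point of reference, here's roughly what the generated module wiring looks like after those three CLI commands; this is a sketch based on the standard Nest CLI output rather than code shown verbatim in the video.

// src/amazon/amazon.module.ts (as generated by nest g module/controller/service amazon)
import { Module } from '@nestjs/common';
import { AmazonController } from './amazon.controller';
import { AmazonService } from './amazon.service';

@Module({
  controllers: [AmazonController],
  providers: [AmazonService],
})
export class AmazonModule {}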
So now we have a controller, a module, and a service: the controller has been added to our controllers array and the Amazon service has been added to our providers array. Let's add our first route so we can actually handle HTTP traffic. I want to add a GET route, so we'll use the @Get decorator (making sure we import it from @nestjs/common) and prefix it with 'products', so that the URL will be /amazon/products. We'll call this function getProducts, and it will reach out to the Amazon service; to get access to it, let's add a constructor where we inject it as a private readonly amazonService of type AmazonService.

Next I want to extract a query parameter from this route that we'll feed into the Amazon product search field, so we can actually search for products and make this dynamic. To extract a query parameter we use the @Query decorator from @nestjs/common; I'll call the query 'product', so we extract that specific parameter, and type it as a string. Now that we have this product query, I want to return this.amazonService.getProducts and pass in the product string. Of course, this method doesn't exist yet, so we'll go into our Amazon service and add an async getProducts method that takes in the product string, and now we can actually implement it.
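At this point the controller and the service stub look something like the following sketch; the names match what's described above, but treat it as an outline rather than the exact code from the repository.

// src/amazon/amazon.controller.ts
import { Controller, Get, Query } from '@nestjs/common';
import { AmazonService } from './amazon.service';

@Controller('amazon')
export class AmazonController {
  constructor(private readonly amazonService: AmazonService) {}

  // GET /amazon/products?product=laptop
  @Get('products')
  getProducts(@Query('product') product: string) {
    return this.amazonService.getProducts(product);
  }
}

// src/amazon/amazon.service.ts
import { Injectable } from '@nestjs/common';

@Injectable()
export class AmazonService {
  async getProducts(product: string) {
    // scraping logic gets filled in below
  }
}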
Okay, so we're ready to start writing the code that scrapes Amazon. To get started, let's install Puppeteer. Puppeteer is the library we're going to use to get a high-level API that lets us control a Chrome browser over the DevTools Protocol; it runs in headless mode, and we can use it to go to a web page, scrape HTML, navigate around, click buttons, and do anything as if we were on the page ourselves. So let's run pnpm install --save puppeteer-core.

Now that we have Puppeteer installed, we need a way for it to connect to a browser. We could use a browser locally, and that would be fine; however, the issue we run into when scraping data at scale is that most large websites have mechanisms in place to prevent automated scrapers from accessing them: they can detect when a scraper is pulling their data and block your IP address. So we want a tool that lets us connect to a remote browser session without getting blocked, and that's where Bright Data comes in, which is also today's video sponsor. It's a great solution for scraping, because it lets us connect to their remote browser, which runs through a proxy network. That means the actual scraping happens through their proxy network and the IP address is always different, so we never get blocked when we scrape external websites. This tool is extremely useful when scraping large websites at scale, and the best part is they've provided a $10 credit to their services so you can follow along with this tutorial at no cost: click the link in the description, set up an account, and head to your dashboard, where you'll see the proxies and scraping infrastructure tab. Click the add button and add a Scraping Browser; it's an all-in-one browser that will let us scrape any website at scale without being blocked, which is exactly what we want.

After you've created your Scraping Browser integration, click on it, then click on "Check out code and integration examples", and if you scroll down you'll find the SBR WS endpoint. This endpoint is what we're going to give Puppeteer to connect to a remote session, so copy that string. Back in our project, create a .env file where we can keep this sensitive data: add a new SBR_WS_ENDPOINT environment variable and paste in the entire string, without quotes. This is what we'll pass to Puppeteer to connect to our remote browser through the Bright Data proxy network, so we won't get blocked when we scrape data.

To inject the environment variable into our application we'll use the NestJS config module, so run pnpm install --save @nestjs/config to install the config dependency. Then, back in our app.module, in the imports array, add ConfigModule.forRoot and set isGlobal to true so the config module is globally available. Next, go to the Amazon service and inject the ConfigService (from @nestjs/config) into the constructor so we can use it to read the environment variable we just provided. I'll also make sure I start the server back up.

Now let's start implementing our getProducts method. The first thing we do is create a new const called browser and connect Puppeteer to our remote browser session through the Bright Data proxy network. We'll await puppeteer.connect (making sure we import puppeteer from puppeteer-core) and provide an options object specifying the browserWSEndpoint; this is the remote browser endpoint from our .env file, so we call this.configService.getOrThrow and read the SBR_WS_ENDPOINT variable. Now we have a browser connected to a remote browser that can run anywhere in the world, which keeps us from getting blocked, because to the website we're just a normal end user whose IP address is always different.

Now that we have the browser, let's open a try/finally block, because no matter what code we execute we always want to call await browser.close() at the end to close the browser session. Inside the try, get a page by setting a new const equal to await browser.newPage(). The first thing I want to do is set a timeout on the page so that if we have any issues navigating, our code won't run forever: call setDefaultNavigationTimeout, which takes the timeout in milliseconds, and pass 2 * 60 * 1000.
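Putting those pieces together, the config wiring and the opening of getProducts might look like the sketch below; SBR_WS_ENDPOINT is the variable name from the .env file described above, and its value is your own Bright Data endpoint.

// src/app.module.ts
import { Module } from '@nestjs/common';
import { ConfigModule } from '@nestjs/config';
import { AmazonModule } from './amazon/amazon.module';

@Module({
  imports: [ConfigModule.forRoot({ isGlobal: true }), AmazonModule],
})
export class AppModule {}

// src/amazon/amazon.service.ts
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import puppeteer from 'puppeteer-core';

@Injectable()
export class AmazonService {
  constructor(private readonly configService: ConfigService) {}

  async getProducts(product: string) {
    // Connect to the remote Bright Data browser rather than launching Chrome locally.
    const browser = await puppeteer.connect({
      browserWSEndpoint: this.configService.getOrThrow('SBR_WS_ENDPOINT'),
    });
    try {
      const page = await browser.newPage();
      // Allow navigations up to two minutes before timing out.
      page.setDefaultNavigationTimeout(2 * 60 * 1000);
      // navigation and scraping steps continue in the next sections
    } finally {
      // Always release the remote browser session.
      await browser.close();
    }
  }
}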
Now we're going to tell the browser to navigate to Amazon. We'll await Promise.all because we want to await two promises: the first is a call to page.waitForNavigation, which tells our code to wait until the navigation finishes, and the second is page.goto, where we supply the URL, in our case https://amazon.com. To make things a bit easier I'll also navigate to Amazon in my own browser and open the dev tools, because we're going to be using the different selectors in the HTML to scrape the data, and since these classes and IDs could change in the future, you should follow along and see how I'm doing it so you can come back and adjust your code as necessary.

The first thing we want to do on this page is target the search bar, click on it, type in the query from our API call, and then click the search button to get to a product page. To run through it manually: we click the search bar, type the product query from our API call, say "laptop", and click the search button, which brings us to the product listing page where we can scrape data. To do this in code, let's look at the HTML for the search bar; the input is what we want to target, and we can see it has an ID of twotabsearchtextbox. I'll copy that ID, and back in our code we can type a query into the text box by calling await page.type, passing in the selector we just copied (making sure to use the hash sign to target the ID) and the value we want to type into the box, which is the product query passed as an argument to our method.

At this point we've typed our query into the search box, and now we want to click the submit button. Back in the HTML, the button has an ID of nav-search-submit-button, so copy that, and in our code call await Promise.all, following a very similar pattern to the first navigation: page.waitForNavigation, so we wait for the page to finish navigating before continuing, and then, to actually execute the click, page.click with the submit button selector. So now we have our query typed into the search bar, we've clicked the search button, and we should be brought to the product listing page.
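Those navigation, typing, and click steps continue inside the try block and look roughly like this; the #twotabsearchtextbox and #nav-search-submit-button IDs are what Amazon's markup showed at recording time and may change.

// Navigate to Amazon and wait for the navigation to settle.
await Promise.all([page.waitForNavigation(), page.goto('https://amazon.com')]);

// Type the search query from the API call into the search bar.
await page.type('#twotabsearchtextbox', product);

// Click the search button and wait until the results page has loaded.
await Promise.all([
  page.waitForNavigation(),
  page.click('#nav-search-submit-button'),
]);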
At this point we want to target the list of products, scrape the individual details from each one, and formulate a response for our end user. For example, let's say we want to grab the title, price, and URL of each product; let's see how to do that next. The first thing we need is a selector that gives us access to an individual result, so let's hover over these elements until we find a container element for all of the results we want. This div that contains all of our search results looks like a good one, so let's use its first class, s-search-results, as our container. Now, in our code, we're going to return await page.$$eval. If we look at the description, what $$eval does is run Array.from(document.querySelectorAll(selector)) within the page and pass the result as the first argument to the page function, the callback, which lets us execute JavaScript directly against the elements matched by the selector we provide.

Let's fill this out so it's a bit clearer. First we provide the selector for the individual elements we want to target. We'll start with the .s-search-results class we grabbed earlier, and now we need a selector for an individual product. If we keep scrolling down a bit we find another div with the class we want: you can see as I hover over it that this is the product entry, and it's the div we want to select, because it contains all of the information we're after, the title, price, and URL. To target it we'll use its s-card-container class, so copy that class and, back in our code, paste in this additional selector to target the individual product. The second argument $$eval takes is the callback function that receives the result items, each one an individual product; we can map over them and extract the data we want from each. So we return resultItems.map, and within the map callback we get access to a single result item, which is the individual product.

Now we need to extract the URL, title, and price from that product. The URL is the easiest: looking at the HTML, what we really want is the first link element we find and its href, because that leads to the product page. We can do that by creating a new const called url and setting it equal to resultItem.querySelector('a')?.href. As I said before, the result item behaves like a DOM element, so we can run JavaScript directly on it, select the first anchor tag we find, and extract the href.

Now that we have the URL, let's find the title. Inside that anchor tag there's a span, and to target it we'll go up and use this class called s-title-instructions-style, because it's specific to the title. Copy the class, and then we want the first span element inside it, which lets us select the actual text itself. So we get the title by calling resultItem.querySelector, pasting in the class we just copied, selecting the first descendant span, and then grabbing the text with ?.textContent, using the question mark to check that the element exists. Notice there are a lot of different ways to go about this; you can choose whichever selectors get you to the end result, sometimes you have to be a little creative when scraping data, and there's definitely more than one way to get access to it.
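In code, the $$eval call starts to take shape like this; the s-search-results, s-card-container, and s-title-instructions-style class names are the ones visible in the video and may differ if Amazon changes its markup.

// Still inside the try block: run the callback against every matched product card.
return await page.$$eval('.s-search-results .s-card-container', (resultItems) =>
  resultItems.map((resultItem) => {
    // The first anchor tag in the card links to the product page.
    const url = resultItem.querySelector('a')?.href;
    // The title text sits in a span under the title-specific class.
    const title = resultItem.querySelector('.s-title-instructions-style span')?.textContent;
    // (price extraction is added in the next step)
    return { url, title };
  }),
);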
Finally, we want the price, so let's inspect the price element. If we take a look we can see this nice span called a-offscreen, which gives us the total price including the currency. To target it we'll use the a-price class and then the a-offscreen span inside it: the price will be equal to resultItem.querySelector with the .a-price .a-offscreen selector, and then we read its textContent to get the text. Now we have the three pieces we want to return to the user. Of course, you could keep going and scrape any amount of data you need for your solution, but for us we'll simply return an object; whatever we return from this $$eval callback gets sent back to the outer scope, so we return the url, title, and price inside that object.

You can see our Nest application is still running successfully, so we're ready to give this a try. Let's launch a GET request at localhost:3000/amazon/products and set the product query parameter to whatever product we want to search for; we'll start with the laptop query we just worked on. I'll send off the request, and what this does is launch a Puppeteer browser session using the remote browser we specified with Bright Data, so a browser is actually launched remotely. We can see we get data back, we're not being blocked, and we can launch this request as many times as we want and it will keep working, thanks to the Bright Data proxy network. We get this really nice list back of all of these products, and for each one we get the URL, the title of the product, and the price, all thanks to our scraping.

We can show how dynamic this is by changing the query. We were just looking for laptops; let's say we want to find a new pillow, so I'll search for a pillow now. Again, this launches a new browser session through the proxy network using a different IP address, so to Amazon it looks like it's coming from a totally different user, and we get all of this data back from Amazon: all of these different products, with the URL and the price for each one. We could even click on any one of these URLs and be brought to the product we see here.

You can really see how versatile this Puppeteer and Bright Data combination is; using them together we can scrape any data we want all over the web without having to worry about getting blocked. I hope you learned a little bit about how easy scraping can be with NestJS and Puppeteer, and I'll see you in the next one. Thanks!
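To wrap up, here's the completed extraction callback assembled from the steps above, plus an example request; it's a sketch, and the .a-price .a-offscreen selector, like the others, reflects Amazon's markup at recording time.

return await page.$$eval('.s-search-results .s-card-container', (resultItems) =>
  resultItems.map((resultItem) => {
    const url = resultItem.querySelector('a')?.href;
    const title = resultItem.querySelector('.s-title-instructions-style span')?.textContent;
    // The a-offscreen span inside a-price holds the full price, currency included.
    const price = resultItem.querySelector('.a-price .a-offscreen')?.textContent;
    return { url, title, price };
  }),
);

// Example request once the server is running:
//   GET http://localhost:3000/amazon/products?product=laptop
// The response is an array of { url, title, price } objects.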
Info
Channel: Michael Guay
Views: 6,074
Id: UN-yK0F38Sg
Length: 26min 24sec (1584 seconds)
Published: Sun Aug 20 2023