Advanced Web Scraping with Puppeteer: Avoid Looking Like a Bot and Pass Authentication!

Video Statistics and Information

Captions
In one of the previous videos, we took a look at Puppeteer and the advantages it has over using the regular fetch API to handle web scraping. The thing is, in some advanced use cases you want extra functionality for Puppeteer, such as not appearing like a bot. If you're using a bot, you are a bot, but you don't want to seem like a bot, so your user agent shouldn't give that away. How do we pretend that our Puppeteer Chromium browser is not a bot? And how do we get past authentication if we want to scrape something that's behind a login? That's what we're going to take a look at in this video. Let's get right into it.

Okay, here we are. This is probably the simplest Puppeteer application you can possibly have. If you don't know what this is or how to set it up, I'd advise you to check out the last video I did on the topic, where we set up a basic Puppeteer project and get into the basics of scraping an online bookstore and getting the data into a JSON file. By the way, that's the data you can see here: we get the title, price, image source, and rating. If you don't know how to do that with Puppeteer yet, check out that video, and if you do, let's get into some advanced concepts.

The first one I want to get into is bot protection. If we go to this website right here, the domain is bot.sannysoft.com, you can see that when I go to it with a normal browser, a lot of the stuff is green: the user agent, the WebDriver, the WebGL vendor, the hairline feature. In a normal user's browser, this is the case. However, let's navigate to this page with Puppeteer too. We have the browser, then we open a new page, go to the URL I've just entered here, and then take a screenshot with the path bot.jpeg. It's going to do this and save a screenshot. So let's say yarn dev to start up the Puppeteer application that we've defined in the package.json, so it runs ts-node
index.ts. It saves the screenshot, and as you can see, when we navigate to it with Puppeteer, well, we look like a bot. Maybe because we are a bot; this is a bot. But we don't want to appear like a bot, that's the secret. So how do we not appear like a bot? The solution might be simpler than you think. There is an npm package called puppeteer-extra-plugin-stealth that I'm going to show you really quick. It's a very well-known package, and essentially all you need to do is install it along with puppeteer-extra, and that allows us to bypass that bot protection.

So let's go into our project and say yarn add puppeteer-extra, and then we also want this plugin, which is called puppeteer-extra-plugin-stealth. Let's copy that, paste it in here, and press enter. That's going to install those two dependencies. Then there's not a lot involved, and we can seem like an actual, you know, not-bot, which is exactly what we want.

We can import puppeteer from puppeteer-extra in this case, not from the regular puppeteer. Then we're going to import the stealth plugin, so let's call it StealthPlugin, just like they do in their documentation. Down here we're doing the same thing, so if you're getting lost, you can rewatch the video or go to the docs. We call require, and here we're going to say puppeteer-extra-plugin-stealth. That's what we want, and that is pretty much all we need to do. We might get an error in a second that I'm going to show you, but first let's say puppeteer.use, so we can use a middleware in here, just like if we were working with Express, for example, pass it StealthPlugin, and invoke that.

Okay, let's try doing the same thing again: we're going to the bot website and trying to take the screenshot. However, we get an error: "An executablePath or channel must be specified for puppeteer-core". And as far as I know, that isn't mentioned anywhere in the documentation or
online. If you actually want to use this package (I think the last update was about four months ago, so they haven't accounted for this yet), the solution is to get the executable path from the original Puppeteer. We can destructure an import from require("puppeteer"), and from there we get executablePath. Now where do we pass that? It's an option in puppeteer.launch, so we can say executablePath is equal to, or not equal to, but in object notation, executablePath, and then invoke that function. That should make the error go away. Let's retry and go to the website. As you can see, done, so it worked. And now if we go to bot.jpeg, you can see, oh wow, we don't seem like a bot. That's pretty cool, and exactly what we want: now we appear as a real user, so we can bypass bot detection.

Let's now go to a website that was one of my first-ever full-stack projects. It doesn't look as amazing in Firefox, apparently, and the database isn't even working anymore (I did this project with Firebase), but it's not about the page, it's about the login. This is a site I own, so I can show you how to scrape it. Be careful when scraping: some companies don't allow it, and it's against their terms of service. That's why I can't show you this with an actual Reddit or Facebook login or whatever, because they would get quite mad that I'm scraping their website. So I'm going to show you this on my personal website that I'm hosting.

Okay, let's navigate to this page right here in Puppeteer. We can close that, we can even delete that, and you can also delete the data; we don't need that anymore. Let's navigate to this domain right here. If you're doing this yourself, please try it with an actual login that makes sense to you. There's no point in logging into this site right here; if you do it, then
I'd just advise you to do it properly and not just with my example site.

All right, so how do we type into this input field? The first step, as always, is that we need the selector. As you can see right here, the input has an ID, email address. We could also select it through an attribute, like the name email right here, but we're just going to use the email address ID. Now here we can say await page, and on the page we have a lot of options we could use. In this case we're going to use type, which literally types. As the first argument we pass the selector, then the text we're going to type. The selector is going to be a hash followed by the email address ID, and for the text, I have defined an ADMIN_EMAIL that we're going to use as the email. You can put it in an environment variable, but if you do, some other work is needed; it's not that easy working with environment variables in Puppeteer, so I just have it in a different file.

And wait, that was really hard to see, so let's await page.waitForTimeout and give it five seconds, so we can actually see how the email gets typed into the field. Let's start Chromium, and as you can see, okay, this is the email that I've defined.

Now we're going to do the same thing with the password. Let's go to the password field and inspect it: the ID is password, great, so we can use that right here. We say await page.type, we want to type into the password field, and what do we want to type? The admin password. Save that, go to Chromium, and see what happens. As we can see, the password was typed in.

Now we need to click the sign-in button. It's in German, but that doesn't really matter. Same thing: we want the selector, so let's see how we can get the button. There's an attribute, type submit, that we could use, so let's say await page.click, and
then we want to select the... no, it's not a hash. Well, how do we address this button with type submit? Okay, I've just looked up the attribute selector here on my left, as you can see, and that's what we're going to use. We write square brackets, and inside them we define the attribute, so type is equal to submit. Let's see if that works. yarn dev, let's open Chromium, and it should click the sign-in button. As you can see, it did, and it signed us into the web page.

Okay, let's have a bit of a longer timer and give it 15 seconds. yarn dev, and as you could see, we logged in successfully as an admin and clicked the sign-in button. This is completely automatic, and now we're in the admin dashboard. If you did this with Reddit or YouTube or whatever, which I really advise against and you should never do, you would be in your account and could do whatever you need to be authenticated for, and scrape whatever you want. So that's pretty cool.

Those were the two concepts I wanted to show you: how to get through authentication, and how to appear like you're not what you really are, which is a bot. This is just a crawler, pretty much, but we don't want to appear as a crawler, so that we can bypass bot protection. If you get into really advanced use cases where you have a CAPTCHA, there's even a CAPTCHA solver on npm, which I think is absolutely hilarious, because, you know, a bot shouldn't be able to solve a CAPTCHA. Well, this might not be it, but there was another package that I saw... no, it's not this either. Well, it is definitely possible to solve CAPTCHAs with Puppeteer, and I find that hilarious. But yeah, those are the concepts I wanted to show you. Thanks for watching. I really hope you enjoyed them and found them useful, and that you build something cool with them. If so, let me know. I'll see you in the next video. Until then, bye-bye.
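
The login steps described in the captions (type the email, type the password, click the submit button) can be sketched in TypeScript roughly as follows. This is a minimal sketch and not the code from the video: `PageLike`, `LoginConfig`, `login`, and `exampleConfig` are names made up for illustration, the login URL, credentials, and the `#email-address` selector are placeholders (the captions only say the ID is "email address"), and only `#password` and `[type="submit"]` come from what is said on screen. The function takes a small structural interface that Puppeteer's `Page` should satisfy, so it can be tried without launching a browser. The stealth and `executablePath` wiring is shown as comments because it needs the three npm packages installed.

```typescript
// Minimal structural view of the Puppeteer Page methods used here.
// A real Puppeteer Page has goto/type/click with compatible signatures.
interface PageLike {
  goto(url: string): Promise<unknown>;
  type(selector: string, text: string): Promise<void>;
  click(selector: string): Promise<void>;
}

// Everything the login flow needs; all names here are illustrative.
interface LoginConfig {
  loginUrl: string;
  emailSelector: string;
  passwordSelector: string;
  submitSelector: string;
  email: string;
  password: string;
}

// Open the login page, fill in both fields, then click the submit button.
async function login(page: PageLike, cfg: LoginConfig): Promise<void> {
  await page.goto(cfg.loginUrl);
  await page.type(cfg.emailSelector, cfg.email);
  await page.type(cfg.passwordSelector, cfg.password);
  await page.click(cfg.submitSelector);
}

// Placeholder values (assumptions): swap in a site you are allowed to automate.
const exampleConfig: LoginConfig = {
  loginUrl: "https://example.com/login", // hypothetical URL
  emailSelector: "#email-address",       // assumed spelling of the ID
  passwordSelector: "#password",         // ID mentioned in the video
  submitSelector: '[type="submit"]',     // attribute selector from the video
  email: "admin@example.com",            // placeholder credential
  password: "change-me",                 // placeholder credential
};

// Wiring with the real packages (requires puppeteer, puppeteer-extra and
// puppeteer-extra-plugin-stealth to be installed):
//
//   import puppeteer from "puppeteer-extra";
//   import StealthPlugin from "puppeteer-extra-plugin-stealth";
//   import { executablePath } from "puppeteer";
//
//   puppeteer.use(StealthPlugin());
//   const browser = await puppeteer.launch({ executablePath: executablePath() });
//   const page = await browser.newPage();
//   await login(page, exampleConfig);
//   await browser.close();
```

Passing the selectors and credentials in as a config object keeps them out of the scraping logic, which is in the same spirit as keeping the admin email in a separate file.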
Info
Channel: Josh tried coding
Views: 40,388
Keywords: puppeteer, puppeteer tutorial, puppeteer nodejs, nodejs puppeteer, web, scraping, web scraping, puppeteer web scraping, advanced, advanced puppeteer, advanced web scraping with puppeteer, tutorial, beginner, josh tried coding, joshtriedcoding
Id: 9zwyfrVv3hg
Length: 11min 11sec (671 seconds)
Published: Fri Dec 02 2022