Learn Web Scraping with Puppeteer/Node.js in 15 Minutes

Captions
Hey everyone, hope you're doing well. Today is a quick introduction to web scraping with Puppeteer. If you enjoy this kind of content, please consider subscribing, and let's get right into it.

If you're not familiar with web scraping, it's essentially a way to manipulate and extract data from websites, usually using software. Puppeteer is a Node library, so it works with Node.js, and it allows you to do exactly that: it lets you drive a browser, usually Chrome or Chromium, and do whatever you might normally do day to day, but programmatically, written in code. It also uses something called a headless browser, which basically means you don't actually see the window, the Chrome instance itself; it just runs in the background.

So why would you want to web scrape, or why would you need Puppeteer at all? There are a few reasons listed on the website here: generating screenshots and PDFs, server-side rendering, automating form submissions, and so on. The main thing I've used it for is automated UI testing: you build an application, you ship it, and then you want software to open up your application, fill out the different forms and check that the data is right. That's one thing Puppeteer can help with.

One of the examples we're going to look at today is checking prices on Amazon. You might be looking out for an item in a shop that doesn't have an API you can call programmatically, so you might have a web scraper check the price of that item every now and again and let you know when it's on discount, or it might even order it for you. That's what we're going to look into today.

Let's get right into the code. I've just created a brand new Node project, and I've just done that by running
npm init -y, which just starts up a new Node project. Then in package.json I've added two scripts, one for price check and one for screenshot. The first one is just to get you familiar with how Puppeteer works, and then we'll build out the price check script just after that. Note that here I'm just playing around with type: module, but you can stay with CommonJS, just in case you had noticed the imports.

I've got the Puppeteer GitHub page here on the right, and getting started is as simple as running npm install puppeteer. There's also the option of using puppeteer-core. These are essentially the same; I think pretty much the only difference is that puppeteer downloads a build of Chrome or Chromium along with it, so you don't need to run it against an existing browser. If you're building a library, or you want to use it with an existing browser, you can use puppeteer-core, which is a bit more lightweight.

In terms of usage, I would recommend checking out the documentation linked from the GitHub page. It's got basically everything you need to know, and it's actually quite good documentation, but we'll get into it in the example.

So let's have a look at how Puppeteer works in general. I've got my screenshot.js file here, and this is one of the examples from their documentation, so you can have a look. We're just importing puppeteer, and we're going to define an async function called screenshot, which we'll call towards the end. In the first line we're launching Puppeteer and defining an instance of a browser, so this will launch the Chromium browser. Using that browser you can create new pages (new tabs) and check the existing tabs. Each page is essentially a tab, and now you have an instance of a
page. So now, with the page, you can do basically whatever you could do in a normal browser. In this case we're going to tell it to go to Amazon, and typically you then want to tell it to wait until things are loaded. There are a few different ways to do this. The goto function takes an options object, and you can pass a waitUntil key. There are a few options here: load and domcontentloaded, which hook into the lifecycle of the page, and then networkidle0 and networkidle2. The first of those waits until there have been zero network connections for, I think, 500 milliseconds, and the second waits until there are at most two. The difference between these two is that one is more suited to single-page applications, where you still have fetching going on, and the other is more geared towards server-side rendered pages. I'm going to stick with networkidle2, because I think this is better for server-side rendered pages, which is basically what Amazon has here.

Now that we're on the page, we can start manipulating it and doing things. We'll have a look at that in a second, in the price check. In this one we're going to use page.screenshot and give it a path, example.png. These methods all come from Puppeteer; this is all part of the Puppeteer API. Finally, we close the browser. You'll notice that these are all asynchronous functions, so we have the await keyword in front of them. So all it's going to do is open up Amazon, take a screenshot and close. The final thing we do is call the function, and that should be it. If I hit run here, you can see it running for a moment or two, and once that's done, there we go, we can come back on the left here, and
if I double-click on this example.png, you can see that there's a screenshot of the Amazon website. You can see it's cut off here: there is a default width and height, which is 800 by 600, I think, and you can change that as well. For the most part we don't need to see anything, because it's headless by default, and we'll see how to tweak that for debugging in the next example. So those are the fundamentals; let's move on to the price check.

For the price check example we're going to start off in exactly the same way: again a browser, a page, and then we go to the URL. I'm extracting the URL into a variable because we're going to be changing it soon; right now it's just amazon.co.uk. And then we're just calling the function. Of course we need to add a bit of logic here to do what we need it to, but what I thought we'd look into first is the headless mode I mentioned earlier. This is running headless, which is why we didn't see an instance of the browser showing up earlier for the screenshot. What we can do is add an options object to the launch function and pass in headless. By default that's true, so we can say, hey, I don't want this to be headless. Now if I run the price check script and leave that for a moment, you can see an instance of Chromium has been brought up with the specific dimensions, and it's opened up Amazon. It had a default page, opened up a new page, went to amazon.co.uk, and then it's just stopped there, because we've not closed the browser or done anything else. That's basically how you might want to go about debugging: if you're getting stuck and you want to see what state the world is in, you can just make sure headless is false and check it that way.

So let's have a quick look at what we can do. Typically, when you're web scraping, there are a few
things you might want to do. If you think about what you do when you're on a website, you're either typing something, clicking something, or reading data, so those are the three things I'll go over now.

If we open up the inspector tools here and go to the console, you can pretty much do all of this in JavaScript already. In the browser you have access to the document, and you can select specific parts of it. For example, if I go over to this input here, I can see it's got an ID of twotabsearchtextbox. I can copy the ID, or alternatively you can usually use Copy selector. Then I can go to the console, type in document.querySelector and pass in the selector, just like that. Now you have access to that box and you can manipulate it in JavaScript the way you usually would. I can update the value to be, say, "something", hit enter, and you can see the value is there. If that was a button I'd be able to call the click function on it, and of course you could read the inner text and so on. If I just evaluate that, I think it's going to show me the value. So you can manipulate anything on the HTML page using JavaScript in the browser.

Now what we want to do is something similar, but over here in Puppeteer, and that's exactly what we're going to do. If we wanted to do the exact same thing, type something, we always start off with the page. On page we've got a bunch of helper methods. One to know, if you ever get stuck on the Puppeteer side, is the page.evaluate function. The evaluate function takes a callback, and within that callback you have access to the DOM: this gets evaluated
or rather executed, within the browser environment. So I could take this snippet, copy and paste it inside here, and run it, and this should just work fine. It should open up a browser and a new tab, load everything, and once it's loaded it should set the text box to "something".

Although you could do this for most things, Puppeteer has a few helper methods to make life easier for you. So instead we can do page.type, and it takes the same thing: we pass in a selector, so let me take this selector here, and then we give it the text to type, "something". The same goes for clicking: if I wanted to click something, I would do page.click. Let's take this button; again we've got an ID, so we copy that. click just takes a selector, and since this is an ID I pass in a hash in front of it. Now if we stop this instance and run it again, it should open up Amazon, type "something" into the search box and click the button to actually search for it. Let's see if that works. There we go: if I accept the cookies here, you can see it searched for "something". So that's typing and clicking.

In our example, what we actually want to do is read some data. So we're just going to click on a random product here; let's go for the Alexa, the Echo Dot. This is the product page. Let's say I want to monitor this and check for a discount, so I want to buy it if it goes below a certain amount, and let's assume they don't have an API. I'm going to copy this URL, because this is the URL I want to go to. We don't need to go through the searching and, you
know, so on, but it's there if we need it. Then I want to look for where the data I need is on the page. I can see the price here and the price here; I can use either one, so let's inspect this one. There we go. This one's got an ID, and usually if something's got an ID it means it's easy to select. I'm going to copy this selector. One caveat: although an ID should be unique in a page, I wouldn't always trust that, so it might be worth checking. You could use document.querySelectorAll, for example, to make sure you've got exactly one result, because although IDs should be unique in a page, this isn't always the case, and that might break the application you're building.

What we're going to do here is use page again. We have this dollar-sign selector method, page.$. If you've ever played around with jQuery you might be used to this; it brings you back an element matching that selector. You can combine this with the evaluate that we saw earlier, as page.$eval. This is one of the things I use most of the time: you pass in a selector, and as a second argument a function to evaluate within the browser. In this case I want to read the inner text. The function receives the element that was found, so we can call the parameter element, and you can return whatever you want from here. I'm going to return element.innerText. I think that should work fine; let me format that. Again, we can test it out here quickly in the console just to make sure it works: paste
that in, then add .innerText, and it returns 39.99, which is all we need. I assign that to a variable, price, then I close the browser just to make sure everything's tidied up, and then I return the price. That should all work fine, if I can spell: return price, there we go. After I call the getPrice function, which again returns a promise, I'm just going to console.log the result. I'm going to remove the headless option here, stop the existing one running, and run it one more time. If I run price check, it should go to the page, fetch the text and print it out. And there we go: that took a bit longer than I'd expect, but it's printed out 39.99, which is exactly what we need.

Now you have access to this value in your Node environment, so at this point you can do whatever you can do in a normal Node environment. You can run this as a cron job, take this value, check whether it's under a certain amount, and send an email or a text. Or you can even fill out the entire form and go ahead and click the Buy Now button. Basically you can do whatever you want.

One thing I forgot to mention earlier, which is also quite useful, is another API they have: await page.waitForSelector. Here we waited for the network to die down, but sometimes you don't have that luxury. Sometimes you click on a button and you want to wait for a selector, something that you know is going to be on the page, to be there. What you can do there is use waitForSelector. Again you pass it a selector, and it will wait, up to a certain
timeout which you can specify, for that selector to appear, and once it does, it returns the element itself. So in addition to using networkidle, I could also have waited for this selector here, basically saying wait for that element to be available, and then called element.evaluate and so on, evaluating it in a similar way to what we have here. That's just something worth noting.

I think that's everything I want to cover in the scope of this video. This has been a very basic introduction. You can see here on the right that the API is quite big, so you can do quite a lot with this, but as far as the absolute basics go, this is going to cover you the majority of the time. I hope you learned something from that. Thank you very much for watching, have a good day, and I'll see you in the next one.
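
The price check walkthrough above can be sketched roughly as follows. This is a minimal sketch, not the code shown in the video: the names getPrice, parsePrice and isBelowThreshold, the productUrl and priceSelector parameters, and the 30-second timeout are all assumptions made for illustration, and the real price selector has to be copied from the product page as described in the captions. Puppeteer is loaded with a dynamic import() so the snippet should work under either ES modules or CommonJS.

```javascript
// Hypothetical helper: convert scraped text such as "£39.99" into a number.
// Returns NaN when the text contains no usable digits.
function parsePrice(text) {
  return parseFloat(String(text).replace(/[^0-9.]/g, ''));
}

// Hypothetical helper: true only for a valid price strictly under the threshold.
function isBelowThreshold(price, threshold) {
  return Number.isFinite(price) && price < threshold;
}

// Sketch of the price check: open the page, wait for the price element,
// read its text inside the browser, and always close the browser afterwards.
// productUrl and priceSelector are placeholders supplied by the caller.
async function getPrice(productUrl, priceSelector) {
  const { default: puppeteer } = await import('puppeteer');
  // Pass { headless: false } to launch() to watch the browser while debugging.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(productUrl, { waitUntil: 'networkidle2' });
    // Assumed timeout of 30 seconds for the element to appear.
    await page.waitForSelector(priceSelector, { timeout: 30000 });
    // The callback runs in the browser and receives the matched element.
    const text = await page.$eval(priceSelector, (element) => element.innerText);
    return parsePrice(text);
  } finally {
    await browser.close();
  }
}
```

A caller could then run getPrice(url, selector).then((price) => console.log(price, isBelowThreshold(price, 35))) from a scheduled job, along the lines of the cron idea mentioned in the video. The try/finally is a design choice of this sketch so that the browser is closed even when the selector never appears.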
Info
Channel: Redhwan Nacef
Views: 1,310
Rating: 5 out of 5
Keywords: Software engineer, software developer, how to, how to code, coding, programming, testing, java, javascript, learn to code, learn coding, learn programming, software, technology, computer science, YouTube, Redhwan Nacef, puppeteer, web scraping
Id: w3F4rUaWNcU
Length: 14min 44sec (884 seconds)
Published: Sat Apr 17 2021