Web Scraping with Puppeteer & Node.js: Chrome Automation

Captions
Hey, welcome back. So as developers, you and I, we love APIs, but sometimes there's not always an API for what we want to do. Sometimes a company or website will make certain data or features only available through the actual website, by interacting with it like a normal human being. In situations like that, wouldn't it be nice if we could programmatically spin up a headless version of Google Chrome, through the command line or through Node.js, and then simulate clicking on certain buttons or filling out certain form fields, maybe pulling text values from certain elements, maybe submitting a form and then accessing data on the resulting page that's perhaps password protected, or some sort of secret data? Well, programmatically controlling the Google Chrome browser like that is a lot easier than you might think, and in this video we're going to use a package called Puppeteer to make it happen.

Before we jump into the action, though, I just want to say that this video is sponsored by me. In the description for this video you'll find heavily discounted coupon links to all of my premium courses. They range in topics from HTML, CSS, JavaScript, Node, Express, React, WordPress and more. Having said all of that, let's jump into the actual action for this video.

Okay, so I have a completely new empty folder opened up in VS Code; I just have this folder on my desktop named example. Here's how I would start using Puppeteer. I'd open up the command line, which is pointing towards my new empty folder, and I would run npm init -y. Now, in order to use this command you will need to have Node.js installed on your computer, so if you don't already have that, you can pause the lesson, visit the official website, and get that installed. Anyways, npm init -y will give us the package.json file, and now we can install whatever we want. What we want to install is Puppeteer, so npm install puppeteer. It's very easy to misspell it, at least it is for me,
but let's go ahead and run that command. The download could take a good minute or two, or even longer on a slow connection, because it needs to download a version of Google Chrome, so that's a pretty sizable download.

Okay, once it completes, to actually begin using it we just want to create a JavaScript file where we can write a bit of code. I'll create a new file; you could name it anything, but I'll name it index.js. Up at the top, let's pull in the puppeteer package: I'll create a const named puppeteer, equals, and then just require in that package. Now, most of the functions and tools that Puppeteer gives us are asynchronous, or they're a promise, or they're something that will take an unknown amount of time, and we need to await them before moving on to something else. In other words, we'll want to use the await syntax, but remember you can only use await inside an async function. So I would just say async function; you can name it whatever you want, maybe start or go or pizza or unicorn, but just create an asynchronous function, and then right below it call that function.

Then, in the function definition, let's say const browser (you can name this anything, this is just what I'm calling it), set that to equal, and then we just want to await puppeteer.launch(). So we're launching the browser. Again, we don't know how long that will take, but we're awaiting it; maybe it takes one millisecond, maybe it takes a thousand milliseconds. Once that completes, we want to create a new page, or new tab, in the browser, so we'd say const page equals await browser.newPage(). Now that we have our new page, let's tell it to navigate to a specific URL. I would just say await page.goto(), and in the goto function give it a string of text and paste in the URL that you want to visit. Just so you and I are on the same page and have an example to work through together, I encourage you to use this URL; there should be
a link to this URL in the YouTube description, so you can copy it to your clipboard and then use it here.

Okay, just for a quick example, let's take a PNG screenshot of that page. It would just be await page.screenshot(); in the parentheses give it an object, and we just need to give it one property called path. Now just make up a file name, so maybe we'll call it amazing.png. Let's give this a save and test it out. In our command line we would run node and then the name of our file; I named mine index.js. You actually don't need the .js, so just the file name: node index. You can see it's taking just a moment... oops, it's actually going to take forever, until the end of time, because up in the code I forgot to close the browser. So you would want to add a final line of code here that says await browser.close(), calling the close method. Let me save that again. Now, we don't actually need to run it again; it successfully created the screenshot. Cool, there it is, amazing.png, and if I open it up, you get the idea. Back in the command line I will press Ctrl+C to stop that task; it was just going to sit there and run forever because I didn't specifically tell it to close.

Now, before we move on to the next example, where we learn how to extract text values from the page, I do want to show you one more screenshot trick. By default, when you take a screenshot it's going to use a set viewport window size. You can customize the size of the window, but what if you have a really long page? For example, here is the Wikipedia page for JavaScript; this page is super long. What if I wanted to take a screenshot of the entire thing? I'll take the URL and use it temporarily for my goto URL, and then when you're taking the screenshot you can give it another property: comma, fullPage, and set that to true. Now, instead of just taking a screenshot of a certain browser
window size, it's going to take a screenshot of the full length of the page. If I run that task again and check my file explorer, cool, it took just a few seconds, but now look at the shape of that file. If I open it up, well, I need to zoom in a bit, but you can see that it is the entire, let me scroll, it's the entire Wikipedia page. I think that's a really cool option; in terms of testing different web pages, it's just a nice tool to have in your toolbox.

Okay, now let's move on. Let me put this URL back to our practice requests URL. We don't need to take a screenshot, so let me get rid of the screenshot line, and let's try something a bit more useful: let's try to extract the names of these three pets. There are only three of them, so this won't be super impressive, but imagine a page with a hundred or 500 different names, and imagine we want to take their names and output them into a .txt file, just a regular text file, on our hard drive. So back in our code, before we even get to the Puppeteer-specific code, let's first set up the Node.js skeleton of taking an array of strings and saving it to a text file where each item in the array sits on its own line. Right here, let's create an imaginary temporary array: I'll say const names equals, and I'll have an array with three items in it, just for a test. What if we wanted to save this to a text file on our hard drive, where each one of these sits on its own new line? Well, before I try to save to my hard drive I need to have the right Node tool available, so up at the very top of this file I would pull in something: const fs equals, and then require. Now, instead of using the default file system module, I would use fs/promises. This way we don't have to write messy callback code; it will just give us a promise. So now we can go leverage this. Right below our
made-up array I would just await, and then use that fs file system module; inside it there's a method called writeFile. When we call writeFile we give it two arguments. The first is the name of the new file we want to create, so let's call it names.txt. The second argument is the string of text you want to save into that file. So I would take our array of names; in JavaScript every array has access to a method called join, which is how you want to glue together the different items in the array into a string, and I want to glue them together with a backslash r backslash n (\r\n), in other words a carriage return and a new line. Let's save this and test it out. Down in the command line, node index, give that a run; you can see that created a new file in the sidebar named names.txt, and if we open it up, perfect.

So now, instead of these made-up imaginary values in my array, let's use Puppeteer to find all of the names on the web page. Every website will be different and will use a different HTML structure, but a good place to start is to right-click on the piece of text you're interested in, choose Inspect, and look at the HTML structure. Here you can see that the name is wrapped inside a strong tag, and that lives inside a div with a class of info. So it would be very easy to write a CSS selector: .info for a class of info, then look inside that for a strong element, and then grab the text content of that, and that would work for all three animals. However, if you're dealing with a more complicated website, where maybe the HTML structure isn't as intuitive as this, let me show you a nice trick in Google Chrome. You can right-click on the element you're interested in in your dev tools inspector and choose Copy (let me pull this up a little bit so you can see), and there's an option called Copy selector. Then, with that in your clipboard, if you
paste it in, you see you have a CSS-like selector. Now, it's specific to just that one element that you right-clicked on; in other words, it has this pseudo-selector of nth-child(1), so that would only select Meows a Lot. But you could very easily adapt this: just get rid of that pseudo-selector and now it would work for all of them. You don't have to use that Copy selector feature; you can write your own selectors, but on more complicated websites that's a nice trick to have up your sleeve. Anyways, the idea is we're just going to write a selector for the elements we're interested in.

So how do we actually do this within Puppeteer? Let's get rid of the fake hard-coded array and instead say const names equals await page. and then we're going to use a method called evaluate. Inside evaluate we give it a function that we want to run; I'll give it an arrow function. In the body of our function we can write any client-side JavaScript that you would normally write; all of the normal functions will be available to you. For example, I would say document.querySelectorAll, which is a typical browser function, then quotes, and I would give it the CSS-like selector that I'm interested in: the elements with a class of info, and then the strong tag inside that. Now, this is going to return something that's similar to an array but not exactly an array; it's a NodeList of elements. I want this to be an actual array so that I can loop through it and get just the text content; I don't care about the HTML element objects themselves, I just want their one property of textContent. What you can do to turn this into an array is, at the very beginning, wrap it in Array.from (uppercase A): call the from method, close out the parentheses at the end, and now we have a true array of those three HTML elements: this strong tag, this strong tag, and this strong tag. So now, at the very end of this line of code, I'm going to use
the array method called map. I'll have an ES6 arrow function with one parameter, x, meaning the current strong tag that we've looped to, and I'm just going to return x.textContent, so literally just that one property of its text. Altogether we now have an array with three items, with just their name or text value. However, at the moment this array is just sort of floating in outer space; we didn't save it anywhere, we didn't do anything with it. So at the very start of this line we just return. The idea here is that whatever the function that we give to evaluate returns, well, remember how this line of code first started: whatever we return is going to get saved into this constant named names. So let's give this a save. I realized, down in the console, that when I was Ctrl+Z-ing or Command+Z-ing earlier to get this back to the right URL, I accidentally removed the parentheses off the close method, so you definitely want that in place. I'll save that; I do need to stop the command line task and then fire it up again, so node index. Okay, if we check the sidebar and go into the names.txt file, perfect, that code worked.

Now, I realize this was not the most intuitive code in the world, but the confusion here has nothing to do with Puppeteer; this is just JavaScript. If this code looks really confusing you might need to practice your JavaScript in general; I do have a series called the 10 Days of JavaScript available on my YouTube channel. But the point I'm trying to drive home is that inside this function you can do anything that you would normally do in client-side browser JavaScript. I do want to stress one thing, though, because this confused me at first: if you tried to test something out within this function by using console.log, you would not be logging to the Node.js command line, you would be logging to the Chrome browser console. So that's something to definitely be aware of: within this function you
are in browser land, not Node.js land, so keep that in mind in terms of scope and which variables are and aren't available to you. That's also why we're returning something here: we're in browser Chrome land, and we're returning what we want back to Node.js land.

Okay, at this point let's move on to the next example and try to save the image files to our hard drive. There's a million different ways that you could do this. For example, in this same evaluate function you could return an object with multiple properties and then destructure them here, but we don't have to include everything in one big evaluate call like that. Anyways, here's what I would do. Right below where we're writing to file, I would say const photos equals, and I want to select those three image elements from the page. We absolutely could use page.evaluate once again; however, Puppeteer gives us other methods that are more focused than evaluate. Evaluate is sort of the generic catch-all function for when you want to run a bit of browser-based JavaScript, but in this case, since we have a very specific task in hand, selecting a collection of elements, Puppeteer gives us a function called page.$$eval (dollar sign, dollar sign, eval). This function is specifically designed for selecting multiple elements; in other words, we won't have to do this awkward thing of saying Array.from. Instead, Puppeteer will do that for us. Let me show you how this works. In the parentheses we give it two arguments. The first argument is a CSS-like selector, so let's just say, in quotes, img: any and all image elements on the page. On a different website you'd be free to write any sort of selector you want. The second argument is a function, so I'll use an arrow function: parentheses, arrow symbol, curly brackets. The way that $$eval works is it's
going to pass the collection of elements that it finds into our function. So let's have a parameter; you could name it anything, let's name it imgs, short for images. The cool part is that it is an actual array, not a NodeList. So within our function we would just return imgs.map, because we're not interested in the HTML objects as a whole; we really just want their src, or source, property. What we're looking for is the string of text, the path towards that image file. So inside of map I would have an arrow function with one parameter, x, for the current image that's been looped to, and then we would return x.src. If you're new to HTML and JavaScript, let me show you how that works in the browser. If you open up your console and select the very first image on the page, so document.querySelector, give us the first image, then you can look inside that for its src property, and you can see that gives you the full web address for that image. In other words, we will now have this constant named photos that is just an array with those different image URLs.

Now, below this code, we want to loop through that array and save the images to our hard drive. We know that this is an array, but instead of looping through it with forEach I'm going to use for...of, because it allows for the await syntax. So I would say for, parentheses, const photo of photos (our array of those image paths), curly brackets. I just made up this name photo; it represents the current one that's been looped to. First, let's create a new page, or sort of a new tab in the browser, that visits the URL for a photo. You could make up any name; I'll call it imagePage. We would await page.goto, so we're navigating to a new URL, and let's navigate to the JPEG or photo URL. Once we've done that, we can try to save it to our hard drive, so we would just
await, and I'm going to use fs, the file system module, and its writeFile method. We give this two arguments. The first is the path or file name that we want to save this file as. I just want to keep its original name, whatever the JPEG is named on the web page. Remember the src, the path that it returns: we're not interested in most of the URL in terms of the file name, but what if we just wanted to grab this final part, cat1.jpeg? If we look at this string of text, it's separated with forward slashes, so we just want the last bit of text that comes after the final forward slash. To get that file name you could say photo, the current image that's been looped to, then photo.split, so we're turning the string of text into an array, splitting it on the forward slashes, and then right at the end of that call pop, which will give us the very final item in that new array. Then the second argument, instead of the b placeholder, is the contents of the file that we want to save to our hard drive. We would say await imagePage, that new page that's pointing towards the image, and then dot, and call a buffer function that Puppeteer offers.

Let's save this and test it out. If I call node index and open up my sidebar, we should see a few new images appear. Whoops, no, instead we have an error, and that's because I forgot to await the call of page.$$eval. So where we said const photos equals, be sure to include await right here, because we have no idea how long this is going to take to complete, and we want to await it before trying to use that array. Let's give that a save and try it again: node index. Okay, perfect, in the sidebar we have three new JPEGs, and if I go to my files, cool, there they are; you can go ahead and open them up.

Okay, let's change gears and move on to the next example. See this button here that says click
me on the page? If you actually click it, it inserts a bit of new text. So imagine, from a Node.js perspective, we want to be able to access this content that only exists once you've clicked the button; when you first visit the page, that text doesn't exist in the DOM. Now, sure, you could hunt through the client-side JavaScript file and search for that text, but imagine if it came from a network request, so that it didn't exist anywhere in any of the code until you actually clicked the button. So how can we simulate clicking a button within Puppeteer? First of all, let's find the selector for this button so we know what we're trying to click. You can right-click and choose Inspect. Cool, so the button has an id of clickme, and the idea is that this div with an id of data is completely empty until you click the button. If I refresh, you can see that div is empty. So from within Node.js we want to simulate clicking on this button, and only then do we want to try to select the text that will be in this div.

Let's jump back into our code. Just a quick note: we want to be sure to write this code before the image loop, because that loop is changing the page that Chrome is pointing towards; it's looping through the images and visiting the URLs for the images. If you try to write the following code after it, it will not work, so let's be sure to write it above that for...of loop. Maybe right above our photos code, I'll say await, and I'll use page.click. Puppeteer gives us a method called click; I think you can imagine what it does. You just give it a CSS-like selector, so quotes, hash symbol for an id, clickme. This simulates the click event. Right after that we could use page.evaluate and then document.querySelector, but just like Puppeteer gives us $$eval, it also gives us $eval with a single dollar sign, which means we don't have to spell out document.querySelector.
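The click step just described can be sketched with the generic page.evaluate route mentioned above. This is a minimal sketch, not the exact code from the video: the helper name clickAndRead is made up for illustration, and the default selectors #clickme and #data are assumptions based on how the ids are spoken in the captions. It takes an existing Puppeteer page object, so it does not launch a browser by itself.

```javascript
// Hypothetical helper: click a button, then read text that only exists after the click.
// `page` is a Puppeteer Page; the default selectors are assumed ids for the practice page.
async function clickAndRead(page, buttonSelector = "#clickme", targetSelector = "#data") {
  // Simulates a real mouse click on the first element matching the selector.
  await page.click(buttonSelector);

  // The callback below runs inside the browser, so `document` exists there.
  // Extra arguments to evaluate are passed into the callback, and only the
  // returned string travels back to Node.js.
  return page.evaluate(
    (selector) => document.querySelector(selector).textContent,
    targetSelector
  );
}

// Usage inside the async start function, before any code that navigates away:
//   const clickedData = await clickAndRead(page);
//   console.log(clickedData);
```

If the inserted text arrived from a network request rather than appearing immediately, it would probably be necessary to wait for it first, for example with page.waitForSelector, before reading the element.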
This is for selecting just one element, or the first instance of that element. For example, we could create a variable; we could name it anything, I'll call it clickedData, equals, and then we can await page.$eval (single dollar sign). You give it two arguments. The first is a CSS-like selector, so let's say #data, that empty div. The second argument is a function, and Puppeteer is going to pass the selected element into it, so I'll say el for short, arrow function, and let's just return; we don't even need to drop down to a new line, because if you stay on the same line the word return is implied. So let's return the text content of that div. Now we have this variable with that secret clicked data that only exists once you've clicked on the element. For a quick test to make sure that we actually have it, right below it let's log it to the Node.js console, so log clickedData. Let's give that a save and test it out: node index. Awesome, down in our terminal we see dogs bark and cats meow. So it's that easy to simulate clicking on an element.

Let's move on to the next task, actually the final task. We have this form, and you can see it's asking what color is the sky on a clear and sunny day. If you get this question wrong on purpose, so if I say green and submit the form, we see sorry, that is incorrect. But if I go back, type in the correct answer of blue, and submit the form, we see congrats, and then (let me zoom in a little bit) this is top secret text. The idea is, imagine this was some sort of sensitive information that you only have access to if you submit a form with correct values. You can use your imagination; this could be any sort of sensitive information. We want to get access to this text on this new page from within Node or Puppeteer. In other words, we need to submit this form with a value in this input field of exactly blue, then we want to wait for the network request, we want to
wait for the page navigation to the new page, and then we want to select this text. Just so we're on the same page, if I inspect this text on the new page you can see it lives in a paragraph with an id of message. So how would we do this? Let's go back to our code and write the code for this right before the for...of loop, so that we're still on that original page and haven't navigated away to any image pages. However, within the code that we're about to write we're going to navigate to that new page after submitting the form, and we don't want to mess up any of our existing code by navigating to that page before it has a chance to run. So right here seems like a good spot. Let's await, and then we'll use page. and a method called type. You give it two arguments. The first is the CSS-like selector of the element that you want to type into; I'll just let you know that it has an id of ourfield. The second is the value you want to type into that input field, so let's say blue. Then, right below that line, you could use the method that actually submits the form, but instead I'm going to simulate clicking on the submit button. I have to do this because I'm hosting this example website on GitHub Pages, which doesn't have any server-side functionality, so I'm just using client-side JavaScript to decide whether you typed in the correct value or not, and in order for that to work you need to actually click the submit button. In the real world, for most websites that you're going to use Puppeteer with, you could just use JavaScript to submit the form. But to simulate clicking on that button I would say await page.click, and the selector for that button is the form with an id of ourform, and then look inside it for a button element. Right after that, let's await the page transition, the network request to actually navigate to the new URL. When you type in blue and submit the form, you can see it takes
you to a new URL of message.html. To wait for that page change we just say await page.waitForNavigation(), so we call that method. Now that we're on that new page, let's select this secret or sensitive data. It's in a paragraph with an id of message, so I would say const, you could name it anything, I'll call it info, equals await page. and then you could use evaluate, or in this case I'm going to use $eval. This way I don't have to write document.querySelector; I just give it a selector as the first argument, so let's say an id of message. The second argument is a function, so let's use an arrow function: it's going to give us the element, arrow symbol, and then I would return that element's textContent property. Finally, right after that line, just to test that we actually got access to this top secret text, let's log it to the Node.js console: console.log, and I named it info.

One last thing before we save this and test it out. While this seems like it makes sense and like it should work, submitting the form and then waiting for the page navigation change, this has actually caused problems for me before. What we need to do to make sure this works is use Promise.all and wait for both of these promises before moving on. I would need to do more research; I'm not exactly sure why that's the case. I would think that waiting for the button click and then waiting for the navigation would work, but it doesn't. So here's what we would do instead. We'll say await Promise.all (uppercase P); in the parentheses you give it an array of promises, and then you can copy and paste our two promises that we were awaiting. You don't need the word await, just page.click: cut that, put it in the array, comma, then take the waitForNavigation call, cut that, and paste it right after the comma. Cool. I've found that this way of doing things is very
reliable and works every time. Let's give this a save and test it out: in the command line, node index. Perfect, there is that information that appears on the new screen only if you entered a value of blue.

Okay, at this point we've worked through all of our examples. Now that you know how to perform a few basic tasks with Puppeteer, the next logical thing you might wonder is, how do I automate something like this? What if you want a task to run every hour, or every 10 minutes, or every 5 seconds? The whole idea of automation is that it continues to run. In terms of scheduling or repeating this task, you have a few different options. First of all, let's get rid of the code at the bottom where we call our start function. Instead, what if we wanted to call it every five seconds? You could use setInterval. You give that two arguments: the first is the function you want to run, which we named start; the second is how long you want to wait, so 5000 milliseconds. I could save that and then fire up the task, node index. The first time you run it, it's going to wait five seconds, but then we should see, there it goes, and then if we wait another five seconds, three one thousand, four one... there you go. That will continue to run forever. That's one option; let me cancel that with Ctrl+C.

So this is just super simple: it runs every interval. However, what if you wanted a bit more control? What if you wanted to run something only on the third hour of the second day of the month, or only in a certain month, something that's in tune with the current calendar date and time? You have a couple of options there. First of all, let me show you the Node.js way of doing something like that. Let me clear this out; I would install a package called node-cron, so npm install node-cron. Install that, and then up at the very top just include it: const cron equals require node-cron. Then down
at the very bottom here I'll just paste this in, but you can type it out. The way this works, usually with something like crontab there are five values; in this case there's an optional sixth value for seconds. We're not going to get into the exact syntax of how this works. If this is interesting to you, if for example you wanted to learn how to run something only on the 14th hour of every day, or on a certain day of the month, then go ahead and Google or search on YouTube for crontab or cron job. But we can test this out: I'll save that, run node index, and again it should run once every five seconds. Cool, there's one; let's wait another couple of seconds; perfect.

Finally, I said we had a couple of options. That's one way of doing it, but it does require you to leave the Node.js task up and running forever, and perhaps Node runs into some unexpected error and doesn't know to restart itself. So probably the most robust way of scheduling an automated task would be at the operating system level. This is where it gets a bit tricky, though. On Windows, I don't believe Windows has crontab. Mac has crontab, but Mac has so many security permission issues that you would need to explicitly give permission to like five different tasks in order to get it to work. However, on Linux, which, let's be real, is what a server is going to be running 99% of the time, it's very easy to set up a cron job using crontab. So if that sounds interesting to you, if you want to set up an automated task on perhaps a VPS, just do a YouTube search for cron job or crontab. If you go that route, then at the bottom of our JavaScript file you would just manually call the function once, and it would be up to the operating system to call and execute our index.js file. It would handle the scheduling, and I think that's the most robust way of handling things.

Cool, that's going to bring this video to a close. Thank you so much for watching; hope you feel
like you learned something, and stay tuned for more web development tutorials.
Info
Channel: LearnWebCode
Views: 20,380
Rating: 4.9702382 out of 5
Id: lgyszZhAZOI
Length: 35min 12sec (2112 seconds)
Published: Mon Jul 26 2021