Node.js Web Scraping - 1 - Scrape A Website with Node.js & Puppeteer

Captions
Alright peeps, so we're going to build a Node.js web scraper today. I'm kind of sick of doing those React videos. I will be back to those, but I think a playlist with a bunch of Node.js stuff is kind of cool too. We're going to do two parts. In the first part we'll build the actual scraper, so we can extract data from a website and write it to a JSON file. I'll show you that here: we're going to go to the site Dotabuff and grab the team standings for the esports section. Oh hey, what's up, Pat Sajak? Let's ignore him. We're going to have our scraper visit this page and grab the rankings, everything you see here. I'm going to have it grab just the team name and the points. You'll see that it would be pretty easy to come back later and grab the additional fields, but that's what we'll grab for now.

The second part is going to be pretty cool: we'll set it up to run via something like a cron job, so basically you can run it automatically whenever you want, every two hours, every day, every other day, whatever. Each time it runs, we'll have it hooked up to a little module we write for email notifications. If an error occurs, we'll send the error message to our email, and if it doesn't error and we get the standings, we'll email the list to ourselves so we can stay on top of it. It'll be pretty cool; we'll set that up with Gmail, I'll put my Gmail account in there, and it should be pretty simple.

I'll show you what's currently working, and then we're going to start from literally nothing. When we run npm start, the scraper runs, you'll see a browser pop up, and then we write the standings data to the data folder. Alright, cool, that was pretty quick. If you check here, I have it print out as a nice array: a top-level array where each index has the team name and the points.
Obviously it just goes down from there, this will be the last-place team, and so on, and basically you could do anything you want with that data. It's just a super simple example. I remember Puppeteer was a pain when I first started, but it's actually kind of cool once you get good at it. So let's just start doing it. Give me a sec while I check out the part-one branch, where there's literally nothing going on.

Alright, so we have an empty folder here. I just want to tell you that I'm going to be jumping into the GitHub repository and copying code as we go, because there's really no need for you to watch me type all of it. I'll paste it in small snippets, we'll talk about each one, and then keep going. But right now let's just do an npm init and at least get that going.

Next, we're going to install Puppeteer. What's nice is that it's just the one package; you don't need anything else. As long as you have Node.js, it's going to install everything you need. As you can see, it downloaded Chromium, which is what it uses behind the scenes. I'll have a link to this repository in the description, and I already put links in there to the getting-started guide and the API docs, so you can read up as much as you want on how it works. The docs are very exhaustive, if that's your thing.

Alright, so we're all set. First we need an entry point into the scrapers, plural, because we're actually going to build two. I want to show you two different approaches: Node.js allows for a class-based syntax, and I honestly like the class-based syntax for scrapers, so I'm going to show you that one first. Alright, that's good.
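The setup just described boils down to a few commands. A sketch, assuming a Unix-like shell (the project folder name is my own placeholder; installing puppeteer needs network access because it downloads a bundled Chromium):

```shell
mkdir dotabuff-scraper && cd dotabuff-scraper
npm init -y              # generates a default package.json
npm install puppeteer    # also downloads a compatible Chromium build
mkdir data               # where the standings JSON will be written
```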
Alright, let me grab the code for this, and then we'll explain it before we go on to anything else. We're not doing anything with the scraper class yet; we'll build that next. Basically, we're going to use async/await, which is really nice: it pretty much lets us make our async code look synchronous, so everything just runs line by line. Since we're using the class-based syntax, we just initialize Puppeteer. I'll jump to the docs and we'll cover this, but all we need to do is launch a browser instance and then create a page. Once that page is in memory, we can start doing whatever we want with it, and that's where we pass the browser and page instances into our class. The class is where we'll end up writing all of the logic to actually visit the page, write the JSON file, et cetera. You should try/catch too, which is nice: this is just a top-level one, so if anything happens in our scraper class, any error that goes off, this will catch it and print it to the console. In part two, obviously, we'll end up emailing that to ourselves.

Let's jump over to the docs. The getting-started example is super basic: just like we said, you launch an instance and a new page, and the page object has a bunch of different methods, or functions if you will, on it. We can use goto, we can take a screenshot, and at the end, obviously, we can do a close. Look at how big this API is; there's a ton of functionality. If we look at goto, you'll see all the options you can pass to it: we'll say page.goto with the team standings URL, and we can also say wait until the network is idle, or until the DOM content has loaded. There are a lot of cool things you can do, and you'll have these docs
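The index.js flow he's describing can be sketched like this. To keep the sketch self-contained, the launcher is injected as a parameter; with the real package installed you would call `run(require("puppeteer").launch, Scraper)`. The names `run` and `Scraper` are my own, not confirmed by the video:

```javascript
// index.js (sketch) -- `launch` is injected so the flow can be exercised
// without a real browser; normally: run(require("puppeteer").launch, Scraper)
async function run(launch, Scraper) {
  try {
    const browser = await launch({ headless: false }); // show the window
    const page = await browser.newPage();
    await new Scraper(browser, page).main(); // all scraping logic lives here
    await browser.close();
  } catch (err) {
    // Top-level catch: in part two this error gets emailed to us.
    console.error("Scrape failed:", err.message);
  }
}
module.exports = run;
```

Injecting the launcher also means the whole flow can be dry-run with a stub browser, which is handy for testing scrapers without opening Chromium.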
to explore yourself whenever you want. But let's start writing that class. Like I said, I want to show you two approaches, so let's create the folders: class-based, and I'll say function-based. I think that's good. Let's open this up and paste in some code from my working example. Actually, before I forget, let's also make the data directory, because that's where we're going to write the JSON file. Alright, the data directory is created.

Honestly, look at this, it's so simple if you're familiar with how classes work at all. Really, JavaScript doesn't have classes, because it does this stuff behind the scenes with prototypes, but if you like how object-oriented programming looks, I honestly love how it looks in Node.js, especially for scrapers. I think it makes things a lot easier, so let's actually get into why I like it.

When we first start off, you have a constructor, like in any object-oriented language. Remember, we're going to pass in the browser and the page, those instances we created in the index.js file, and we set them to class properties, so anywhere in the class, in any method, we can just say this.browser. It's super simple to access. Then we're going to have a standings array, and we'll save the URL to an instance property as well, so everything is super easy to get to.

We're going to use two methods. The first one is called main. It doesn't have to be main, but I use that name because of how other programming languages work: in C++ you typically have your main entry function, so I think the name makes sense, though you could call it anything you want. main is the bulk of what's going on: basically, we're just going to go to the page, and Puppeteer's page object, like I said, gives us a bunch of different things we can use to interact with the page.
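Put together, the class shape he's walking through looks roughly like this. It's a sketch: the URL and the CSS selectors are my assumptions about the Dotabuff page, not taken from the video.

```javascript
// class-based/scraper.js (sketch)
class Scraper {
  constructor(browser, page) {
    // Instances created in index.js, saved as properties so any
    // method can reach them via this.browser / this.page.
    this.browser = browser;
    this.page = page;
    this.standings = [];
    this.url = "https://www.dotabuff.com/esports/teams"; // assumed URL
  }

  async main() {
    await this.page.goto(this.url, { waitUntil: "domcontentloaded" });
    // evaluate() runs its callback inside the page, where `document` exists.
    this.standings = await this.page.evaluate(() =>
      Array.from(document.querySelectorAll("table tbody tr")).map((row) => [
        row.querySelector("td:nth-child(2)").getAttribute("data-value"),
        row.querySelector("td:nth-child(3)").getAttribute("data-value"),
      ])
    );
  }
}
module.exports = Scraper;
```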
We'll use the evaluate function that it gives us, and we're going to grab all of the teams. It's just a table, so we'll grab all the table rows, and within those we'll grab the td elements, the table data cells that contain the name and the points, and that's it. Since we're using the class-based approach, we don't even need to pass the standings around as a parameter. And this second method isn't even really necessary; you could literally have just written the file write inline. We're importing the file system module from Node.js, specifically writeFileSync. The reason I'm using the sync version is that everything is so imperative here; we're not writing world-class software where we need asynchronous behavior to write a JSON file. I just want to show you how easy it is to add methods to handle different parts of your scraper, so that was just a little addition. Once main finishes, we log that we're writing the JSON file, this method takes care of it for us, we stringify the array, and literally that's it, we're all set. Did I paste that already? No, let's do that.

Alright, super simple. Just to export your class, you can do module.exports and assign it right to the class, or you could do something like assigning the class to a variable first and exporting that, but I find assigning the class directly is the simplest way. Let's see what else we need to do here: in index.js we're importing the class, we're creating those instances, and we're passing them in, so I think we should be good.

Let me show you the launch function. It lets you pass a bunch of different options when you launch the browser, so you can have it run exactly how you want. I'll show you; let's go back to the docs, I should just keep them open. Let's look at launch and see what it's got.
Alright, cool, so check this out, there's a bunch of options. We can set the default viewport, we can say what width and height we want, whether it's mobile, whether it has touch. There are so many things you can use. You can say you want dev tools, and they'll automatically pop up; there's a whole list of arguments you can add. I won't cover them all; the only one we're going to use here is headless. I think if you don't include the options object, it runs headless by default, so it launches a browser, but you're never going to see it, because it's headless and runs directly in Node.js. We want to set headless to false just so I can show you that the page actually pops up. Let's try that: npm start. Alright, you can see it popping up, and boom, we have the data. It's as simple as that; we already have the data ready.

So let me explain what we're doing here; that's probably the best way to finish out the video. We'll check out the function-based one too, quickly, but I actually want to show you how I figured this out at first. We go to the page, and I said let's just wait until the DOM content is loaded, so I know the table is loaded. The table body has a bunch of table rows, right? That's what I wanted to select, and I'll show you how I figured that out: I just used my browser. You don't have to use Chrome, you can use any one you want. I came over here and did some reconnaissance, if you will. Sometimes you need to be careful what pages you're on; this is a super simple page, but sometimes you need to be really specific with your selectors. I made sure that this is the only table on the page just by searching for "table". It has a class of table, but if I search in here, I
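The options he's scrolling through live on the object passed to `puppeteer.launch`. A sketch of the shape (option names follow Puppeteer's launch API; the launch call itself is commented out because it needs the package installed and a Chromium download):

```javascript
const launchOptions = {
  headless: false, // default is true: the browser runs with no visible window
  devtools: false, // set true to auto-open DevTools on every page
  defaultViewport: {
    width: 1280,
    height: 800,
    isMobile: false,
    hasTouch: false,
  },
};
// const browser = await require("puppeteer").launch(launchOptions);
```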
just get two results, as you can see right here, so I know this is the only table on the page, and I don't have to get super specific with my selectors. Once we have the table, we just want to get the table rows from the body. I'll collapse this. See, I'm going to select every single one of these rows. Now, there are a few ways you could grab the team name from a row. You could get the innerText value, but see, now you have to start traversing all the way down, and I don't like that. I found a neat little trick: on the top-level td they have a data-value attribute, and we can just grab the name from that. I thought that was pretty easy for our use case. So let me collapse... no, let's go back, actually. We want the second td in each row, and the third one as well, and you'll see they do the same thing there, so they made our lives easy.

Now, when we select all of these, what comes back is a bunch of nodes. Let me see if I can show you that; I can probably type it right here in the console. Alright, so see, this is a NodeList. It's technically not an array; it's a structure provided by the browser that is array-like, but we can't really map over it. So you'll see what I do here, right here: I use Array.from and wrap the entire thing in it, and that literally turns it into an array for us. I can show you: let's get rid of this, because we're just messing around, and say teams equals Array.from of our selection. If I inspect teams now, we have an array, so now we can map over it and make our lives easy. That's where we start extracting the second child and the third child, and I can probably just throw this entire thing in the console, to be honest.
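The NodeList point can be demonstrated without a browser by faking the array-like shape that `querySelectorAll` returns: indexed entries plus a `length`, but no `.map`.

```javascript
// A NodeList is array-like: indexed entries and a length, but no .map().
const nodeListLike = { 0: "row0", 1: "row1", 2: "row2", length: 3 };

// Array.from turns any array-like into a real array we can map over.
const teams = Array.from(nodeListLike);

console.log(Array.isArray(nodeListLike)); // false
console.log(Array.isArray(teams)); // true
console.log(teams); // [ 'row0', 'row1', 'row2' ]
```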
So, this map statement, let's see if that works... and see, it works. We map through every single team we have here, which I call team in the callback function, but you need to picture it as the table row. We say team.querySelector for the second child and the third child, and what I'm doing here is wrapping the pair in an array, so it automatically spits this stuff out, see, simple as that. I mean, imagine you're writing a bigger scraper: this could get very repetitive, but as you get better at Node.js and Puppeteer, you'll realize you can create your own helper functions that you import into your file, so you don't have to type all this stuff every time. That's what I've done in the past, because it gets very verbose very fast. And there's really nothing else going on here, guys; I hope you understand that.

Let's quickly jump to the function-based one, just to show you how you'd do something like that, and let me grab the code. Alright, so this obviously isn't a class, but I named the file with a capital letter anyway; it doesn't matter, you could do lowercase. What we do here is export the main function, and we pass it the same things. We're not doing anything with the browser here, but it might be nice to pass it anyway in case you need it for any additional stuff. Obviously this is a super simple example, but if you need the browser instance somewhere in here, you have it. It's really the same thing, so let's see if we can put them side by side. Really, nothing changes: we declare the URL, we wait for domcontentloaded, and we do the same exact extraction, but instead of using instance variables attached to a class, we're not doing that.
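The function-based variant he's comparing might look like this sketch: same extraction as the class version, but the standings come back as a return value instead of being stored on `this`. The names and selectors are my assumptions.

```javascript
// functions/scraper.js (sketch): the browser is accepted but unused,
// kept only so it's available if extra steps ever need it.
async function main(browser, page, url) {
  await page.goto(url, { waitUntil: "domcontentloaded" });
  return page.evaluate(() =>
    Array.from(document.querySelectorAll("table tbody tr")).map((row) => [
      row.querySelector("td:nth-child(2)").getAttribute("data-value"),
      row.querySelector("td:nth-child(3)").getAttribute("data-value"),
    ])
  );
}
module.exports = main;
```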
Instead, you end up passing values around between functions, or you could even make them global if you wanted. Or, if you want to be super crazy, you could do it how JavaScript really works behind the scenes: use constructor functions and write to the prototype of the scraper. But I don't know, I think that's way too confusing for now. Honestly, I like the class-based version, but it's whatever you guys want.

So I think that's pretty much a wrap, but let's throw an error first, and then I'll see you guys in part two, where we'll do some additional cool stuff; maybe I'll even think of some other things we can throw in. In our class-based version, let's misspell the website URL and see if anything happens. Nice: ERR_NAME_NOT_RESOLVED, and it even prints out the URL we were trying to access. That's what we're going to send in an email via Gmail, so we get it right away, and now we're not in the dark, and things are good.

And I'll also show you headless mode: you won't see anything pop up, but it is doing its work behind the scenes, and I can prove that when the new data is written. See, nothing pops up... there we are.

So thank you very much for checking out the video. Please, please, please subscribe to my channel, and I'd really like people to reach out and tell me in the comment section what else you want to see, because anything JavaScript, I can do it. I've been messing with JavaScript for a long time, so hit me up. Alright, thanks.
Info
Channel: Daniel Zuzevich
Views: 6,694
Rating: 4.6043954 out of 5
Keywords: node.js, puppeteer, web scraping, website scraper, website crawler, javascript, data scraping, node js
Id: ZcbTLaB8Tfw
Length: 25min 12sec (1512 seconds)
Published: Thu Mar 12 2020