Introduction To Web Scraping With Node.js

Captions
So in this tutorial we're going to take a look at a basic example of scraping a web page using Node.js and a couple of associated libraries. The tutorial is broken down into three parts. In the first part we're going to retrieve the contents of a URL, the HTML content that sits at a particular address, and download it so we can examine it. In the second section we'll look at how you can parse that content and pick out some of the data using a Node library called node-html-parser. And in the third section we'll look at how to deal with dynamically generated content, say for example a React or Angular website where most of the content is generated with JavaScript: we'll render the page first using a library called Puppeteer, and then we'll be able to do the same thing again and extract any of the data presented on that generated page. I'm going to take you step by step through installing each of those libraries and how you would use them to download a web page and extract its contents, and by the end of the tutorial you should have the basics of being able to scrape a web page using Node.js and apply it to any web page you want to extract data from. So let's get started.

We're going to be looking at some of the different techniques we can use to scrape data from a web page, but first of all we need some web content to scrape. I'm going to create a simple file that we'll serve up, which will be our source for all of our web scraping tasks. Here in Visual Studio Code I'll create a new file and call it index.html, which will hold some HTML content for us. I'm going to purposely keep this pretty simple, just so we can see exactly what's on the page and we're not working with a complicated document trying to work out where all the bits and pieces go. I'll put a heading level one tag in there, and inside a p tag I'll put some lorem ipsum text; let's copy that a couple of times as well. I'm also going to put an image in there, so let's use a placeholder image, and then we'll add a few more paragraph tags, again with some lorem ipsum text, and we'll leave it at that for the moment.

Now, in order to examine the page we could just open it up in our browser, but because we're going to be making network requests as if we were scraping a real website, we want to serve this content on a local server. So let's first set up our project: we'll create a new package.json file, and then we'll install a package called light-server, which is essentially just an HTTP server, a local server running on your computer. There are loads of these kinds of packages available through npm, so if you've got a preferred one you can use that instead. Once that's installed, if we go over to our package.json file, in our scripts we'll create a new script called serve which runs light-server and passes in index.html so that it opens up automatically. Now in our terminal, if we say npm run serve, it will run that command, and the web page that we've got looks a little bit like this.
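For reference, a minimal sketch of the test page and the serve script described above might look something like this. The heading text and placeholder image URL are assumptions, and the exact light-server flags depend on the static-server package you choose:

```html
<!-- index.html: deliberately simple test content to scrape.
     The heading text and image URL here are placeholders, not the exact ones from the video. -->
<h1>Web Scraping Test Page</h1>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<img src="https://placekitten.com/300/200" alt="Placeholder image">
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
```

And the scripts entry in package.json might look roughly like this (the flag syntax is an assumption; check the docs of whichever server package you install):

```json
{
  "scripts": {
    "serve": "light-server -s ."
  }
}
```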
This is the test content we're going to be scraping: you can see the URL is just localhost with the port number on the end, and you can see all of the elements we added to our document.

With our test document set up, let's take a look at part one of this tutorial, which is getting the content from an external server using Node.js. The first thing we're going to do is install another dependency in our project, and that dependency is axios, which we'll use to make network requests to retrieve the web page we've just created from our local server. Next I'm going to create a new file to hold our JavaScript code, and I'll call it pagescrape.js. Inside it I'll require the axios dependency so we can use it in our code, and I'll open up a new block of code: I'll make it asynchronous so we can use the await keyword inside it, and I'll write it as an IIFE so it runs as soon as the pagescrape file is loaded.

In order to retrieve the contents of the page we just created, we can use the axios.get function to send a network request to download the page we've served up on localhost, and using the await keyword we can store the result in a variable called page. If we log the result that axios gives us back to the console to take a look at what we've got, and run our script from the terminal, you'll see we do get that HTML content back in the response, but axios actually returns a complete object with lots of other information as well. The HTML content we're after is in a property called data, so we just want to extract that data property and save it so we can use it to extract our web content.

So now our data variable holds the HTML content retrieved via the axios GET request, but it is just a string. We can't do anything with it in terms of selecting elements; for example, we can't use a querySelector or a getElementById function to extract parts of the document. This is where the node-html-parser library comes in. I'm going to install it as another dependency of the project, and it will give us those facilities, those functions, to use querySelector and other similar DOM functions on top of the data we've retrieved with axios. Install that now with npm install node-html-parser. We only need to require one function from that library, and that function is called parse, so I'll destructure it from the require statement here.

What we can do now is construct something that's a bit like a DOM in the browser, so we'll parse the data using that parse function we've just imported. Let's say we wanted to get the contents of the heading inside of that document: we could create a new variable called heading, and from the DOM object we've just created (as you can see, it has a lot of the methods you might be familiar with when working inside the browser) we can call querySelector and select the h1 tag in the document, and then just log its text to the console. Let's run our script again. Now you can see in the console output we're getting the value inside the h1 tag. Let's double check that against our index page: you can see those match, and we're accessing this element from the page via our constructed DOM, which was created by the node-html-parser library.
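A minimal sketch of the steps just described might look like this (assuming the page is served locally; the port number is taken from later in the video and may differ on your machine):

```js
// pagescrape.js — a minimal sketch of the static scraping flow described above.
const axios = require('axios');
const { parse } = require('node-html-parser');

(async () => {
  // Download the page; axios resolves with a response object,
  // and the HTML string itself lives on its `data` property.
  const { data } = await axios.get('http://localhost:3002/index.html');

  // Build a queryable DOM-like structure from the raw HTML string.
  const dom = parse(data);

  // Pick out the heading and log its text content.
  const heading = dom.querySelector('h1');
  console.log(heading.text);
})();
```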
Now, you might not be able to do everything you would normally do in the browser with our back-end DOM object, but you can do most things you might need in terms of grabbing data for web scraping. For example, if we wanted to get all of the content inside the p tags, all of that lorem ipsum text, we could create a new variable called content, run querySelectorAll to get all of the p tags, and then reduce that into a single string: we take an accumulator, grab the text property from each of the p tags we've extracted, and in this example just add them together, starting off with an empty string. If we then log that out to the console, we should see all of that lorem ipsum text repeated. Let's run it again. Oops, I just missed out the single quotes here, so let's pop those in. Now you can see in our output we've got the heading level one tag, or at least the text from it, and then all of the lorem ipsum text taken out of all of the p tags in our document.

Let's do one final example where we grab the image from the document itself. We'll create a new variable called image, and from our DOM we'll use another querySelector. At the moment the image is just a standard image tag, but if we wanted to differentiate it we could give it an id or a class, so let's give it a class of kitten, and in our pagescrape code, when we make our call to querySelector, we can pass in that class in exactly the same way you would if you were doing this inside a browser. As I said, there are some things you can't quite do that you would normally do in the browser. For example, if you wanted to get the image's src attribute, in the browser you could simply access the src property of the image you've selected, but you'll notice in our autocomplete that we don't see that property, and if we run that code you'll see we get undefined. The node-html-parser won't parse things like that, but because the attribute is actually there we can call a function called getAttribute and pass in 'src'. This time you can see the source attribute has been returned successfully.

So if you're just scraping static content from a static page, i.e. a page that hasn't been dynamically generated by JavaScript, then downloading the page with axios and parsing it in this way is probably the best approach, and it's quite simple. But what if you've got some dynamically generated content on your page? Say, for example, we go back to our index page and add a script tag, and inside that script tag we create a new h2 element, give that heading level two element some content, and then make sure the element is actually added to the page: we'll use a querySelector to select the heading level one tag that's already on the page and insert the h2 next to it with insertAdjacentElement, passing 'afterend' and the h2. If we save that and go back to our browser for a second, you can see we've now got that heading level two element in there, which wasn't there before, and it's been dynamically added with our JavaScript code. So let's try and extract that with our HTML parser library.
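Before moving on, here's a sketch of the pieces added in this section. First, the paragraph and image extraction added to pagescrape.js (continuing the earlier sketch, after the parse call):

```js
// Gather the text of every <p> tag into a single string.
const content = dom.querySelectorAll('p')
  .reduce((acc, p) => acc + p.text, '');
console.log(content);

// Select the image by its class. Accessing `image.src` gives undefined here,
// because node-html-parser doesn't expose attributes as properties,
// so we read the attribute explicitly instead.
const image = dom.querySelector('.kitten');
console.log(image.getAttribute('src'));
```

And second, the script added to index.html that creates the h2 dynamically (the heading text follows the wording used in the video):

```html
<!-- Added to index.html: dynamically create an <h2> after the <h1>. -->
<script>
  const h2 = document.createElement('h2');
  h2.textContent = 'Dynamically Created Element';
  document.querySelector('h1').insertAdjacentElement('afterend', h2);
</script>
```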
Down here we'll say the heading level two is going to be equal to dom.querySelector, and we'll pass in h2. Then, right at the end, let's clear out some of this so we can see what's going on, and we'll console.log the heading level two's text. If we run that code again, we'd expect to see the new element we created on the page, and its text should be 'dynamically created element'. But when we actually run the code you'll see we get an error saying we can't read the property text of null. If we look at the heading level two variable, the element we've selected is actually null, and that's because the page we retrieved with axios won't have that generated element in it. If we log the actual data to the console just to see that in practice, you can see that the page does have the script tag, but if we scroll up, the only element at the top of the page is the heading level one tag. That's because axios does a good job of retrieving the HTML page, but it won't run any of the JavaScript on the page. So if you're thinking of trying to scrape a page built in React or Angular, or basically using any kind of JavaScript to generate elements on a page, this approach is going to fail if you're trying to access anything other than the static content.

So how do we solve this? Well, we need another library to generate a page with all of the rendered JavaScript before we try to plug it into a DOM parser and access the elements that have been created via JavaScript. There are a few different options for this, but probably the best one to use at this time is called Puppeteer. What it does is run a version of the Chromium browser in headless mode: it loads the web page you're requesting, runs all of the JavaScript inside it, and returns the rendered page to you, so that you can do further processing in terms of parsing and scraping the document.

In our terminal let's go ahead and install Puppeteer with npm install puppeteer, and when that's installed, in our pagescrape file let's import it: we'll create a new variable called puppeteer, which just requires the puppeteer module. We don't have any need for axios in this example, so we can get rid of that completely. To use Puppeteer, it's simply a case of creating a new object which is a reference to the browser we mentioned before: we can say puppeteer.launch, and since this returns a promise I'm going to use the await keyword to make sure it gets stored in the variable browser. Then we can create a new instance of a page by creating a new variable called page; this again returns a promise, so we'll say await browser.newPage, which is the method we want to call. This sets up a new instance of Puppeteer via the Chromium browser and creates a new page, or tab, within that browser. Then we can simply await the page.goto function, which, as the name suggests, points that page at a particular URL. The URL we want to send it to is what we were using before, which is localhost:3002, and we'll add index.html to make sure we specify the page. Then, in a similar way to what we did with axios, we want to get the actual data from that page, which is the rendered web page.
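A sketch of the Puppeteer setup described so far might look like this (again assuming the local URL used in the video):

```js
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chromium instance and open a new tab.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Point the tab at the locally served test page.
  await page.goto('http://localhost:3002/index.html');

  // ...next we pull the rendered HTML out of the page (continued below).
})();
```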
We can do that with a function available on the page object we've created, called evaluate. I'm going to create a new variable called data, and again this returns a promise, so we'll await the result: we'll say page.evaluate, which takes a function, and what we need to return from that function is the entirety of the document that's been generated, so we can return document.body.innerHTML. Here we're extracting all of the HTML inside our body tag from the rendered document and saving it into the data variable. The last thing we really need to do is make sure the browser tab is closed, otherwise the script will hang waiting for that tab, so we can just await browser.close. Now if we run our code again and scroll up, you can see that the result we're getting is the rendered content, and we can tell because we've actually got that h2 tag element in the data variable. So instead of just getting the static content, we're now getting the fully rendered page. If we now try to grab that heading level two tag (let's just clear up all these bits and pieces here), we should be able to access its text content, which, as you can see, is the 'dynamically created element' text that we set with our JavaScript code in our web page. So using this approach with Puppeteer, you can generate those rendered pages before you try to parse them with the node-html-parser library.

Puppeteer offers a lot of other functions as well: you can take screenshots and make PDFs from your rendered pages, so let's have a look at how you do that. To take a screenshot, we simply await the result of the page.screenshot function, which takes a few options, but you can just pass in a path with the file name of where you want to store that screenshot. If we run the code again, you can see we've now got a PNG file, and if we open it up, you can see we've got the content of our page, even with the dynamically created content that was generated via JavaScript. To create a PDF it's exactly the same, except we use the pdf function: we say page.pdf and again pass in a path, say site.pdf. Run that again, and now we've got a PDF file which, if we open it up in our browser, shows the exact same content, again including the dynamically generated content, all saved into a PDF for us.

As I mentioned, there are a lot of other things you can do with Puppeteer as well, but hopefully you've seen enough to get you started in being able to scrape web pages and extract content using the node-html-parser library, and then ultimately, when you've got that data, you can do something with it in your project or application. So that's it for this tutorial. Hopefully you found it useful. Don't forget to subscribe to support the channel and so you don't miss out on any future tutorial updates.
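For reference, the remaining steps described above might be sketched like this, continuing the Puppeteer snippet from earlier (the screenshot file name is an assumption; site.pdf is the name used in the video):

```js
// Continuing inside the async IIFE, after page.goto().
// `parse` comes from node-html-parser, required at the top of the file as before.

// Pull the rendered HTML (including the JS-generated <h2>) out of the page.
const data = await page.evaluate(() => document.body.innerHTML);

// Parse the rendered markup and grab the dynamically created element.
const dom = parse(data);
console.log(dom.querySelector('h2').text);

// Save a screenshot and a PDF of the rendered page.
await page.screenshot({ path: 'site.png' });
await page.pdf({ path: 'site.pdf' });

// Close the browser so the script doesn't hang waiting for the open tab.
await browser.close();
```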
Info
Channel: Junior Developer Central
Views: 2,646
Rating: 4.9560437 out of 5
Keywords: Introduction To Web Scraping With Node.js, web scraping with node.js, web scraping with, web scraping with n, web scraping with javascript, web scraping, web scraping node js, web scraping node, web scraping n, javascript, javascript tutorial, puppeteer, web scraping with node js and puppeteer, node.js puppeteer, puppeteer node js pdf, junior developer central, axios, axios web scraping, axios get web page
Id: EQBhOTt2ASg
Length: 16min 43sec (1003 seconds)
Published: Tue Sep 22 2020