Intro To Web Scraping With Node.js & Cheerio

Video Statistics and Information

Captions
[Music] This video is sponsored by DevMountain. If you're interested in learning web development, iOS, or UX design, DevMountain is a 12-week design and development boot camp intended to get you a full-time position in the industry. To learn more, visit devmountain.com or click the link in the description below.

Hey, what's going on guys. So about a week ago I did a video on web scraping with Python and a library called Beautiful Soup, and a lot of you guys liked that. You liked the fact that I used Python, and I will be doing more Python tutorials, but I also got a bunch of comments asking about Node.js and in particular a library called Cheerio, which is used for web scraping using Node and JavaScript. Now this is the GitHub page here, and as you can see, it says it's a fast, flexible, and lean implementation of core jQuery designed specifically for the server. So we can pick things out of a website, out of the DOM, basically using jQuery. Using CSS selectors, we can use the text and HTML methods and things like .find, .children, .parent, all that stuff, to traverse the DOM, pick and choose what we want, and get that data.

So what I want to do in this video is scrape the sample blog, just like we did in the Python video, loop through the posts, get the title, the link, and the date, and put them all into a CSV file. We'll be using the fs, or file system, module that comes with Node.js for that, and we'll just experiment a little bit and look at some of the methods and so on. I'm not going to spend too much time on it because it's basically just jQuery, and it's really late, and I'm actually going on vacation tomorrow for a couple days with my family, so I'll probably be on vacation when you guys are watching this. But let's go ahead and jump into VS Code, and I'm also going to open up my terminal. Let's close this and minimize this. So I have a folder called web scraping and it's completely empty, and I have my terminal open
in the same folder. First thing I'm going to do is just create a package.json, so npm init -y, and let's clear this up. So that created a package.json file. Now we need to install two things: we're going to install cheerio, and we also want to install something called request, which is a very lightweight HTTP module to make requests. You could use Axios or fetch or something like that, but I think that this is a good choice for this type of thing. All right, so now that those are installed, let's create our file. I'm going to create a file called scrape.js, and we want to bring in our stuff, so let's bring in request, set that to require request, and then let's bring in cheerio.

Okay, so now that we've brought in our dependencies, let's make a request. This takes in a URL, and we're going to get that from (stupid microphone is in the way) we're going to get that from here, so let's copy that, paste that in, and then let me close this up to give us a little more room. This takes a second parameter of a function, so I'm going to use an arrow function here, and this will give us three things: a possible error if there is one, a response, and also the HTML. Then we want to make sure there's no error, so if not error and response.statusCode is equal to 200, which is a successful HTTP response, then let's go ahead and just console.log the HTML and see what we get. All right, so I'm just going to go to my terminal, let's minimize this, and let's run node and the filename, scrape, and we just get the entire page, all the HTML, which is what's expected.

We want to run this HTML through the load method from cheerio, so let's get rid of the console.log and say cheerio.load, and we want to put this in a variable, so I'm going to say const $ equals that. That way we can use this just as if we were using jQuery to select things from the DOM. So from here, let's actually open up the site and let's open up
our Chrome tools, and I'm going to click on this big heading here. You can see that there's a div with the class of site-heading; it has an h1 and it has a span with the class of subheading. So let's head back into VS Code, let's create a variable called siteHeading, and let's set that to $ and then the selector, which is a class of site-heading. All right, now let's see what happens if we console.log this, so siteHeading, and in our terminal let's go ahead and run it. It just gives us this giant object which has a whole bunch of stuff in it. You can see all these methods, all these arrays and objects and stuff. This is a very basic tutorial, so I just want to show you how to get the HTML and how to get the text inside. If we want the HTML, we can simply tack on .html, and let's go ahead and run that, and it gives us the h1 and the span, because that's what's inside of this site-heading. If we want the text, we can get that as well by saying .text, and that'll basically strip out the HTML, so if we run this we'll just get Sample Blog and Traversy Media Sample Blog.

Okay, and then we have methods like find. So let's actually create a variable, I'll call it output, and we'll set it to siteHeading.find, and let's say we want to find the h1 that's in there, and then we want to get the text from it. Then we'll console.log the output, and let's run that, and we get Sample Blog. Okay, so we can do find. What else? Children. So children in this case would basically be the same thing: we're just going to look into siteHeading, which has an h1, and use children, so we can get that, we can get the text, and that should put out the same thing. We can also do next, so let's bring that down and say siteHeading, children, h1, and then .next. Next should be the span, so let's see what that gives us, and that gives us Traversy Media Sample Blog, because that's the text that is in the
span, which is next after the h1. Okay, we also have parent, so let's say we'll get the h1 and then we'll do .parent, and of course the parent is the actual site-heading, so if we get the text of the parent we should get the text from the h1 and the span, and that's what we get. I'm not going to go too deep into this; it's basically just jQuery stuff.

So let's take a quick look at how to loop over things. If we look at the navigation, which is, let's see, nav, and it should be in here, ul, so each li right here in the nav has a class of nav-item and has an a tag inside of it. Let's say we want to get the text from each of these. What we could do is loop through it, so let's get the class of nav-item, get the a tag, and let's do .each. Again, if you know jQuery this is very simple. It takes in an index and then whatever you want to call this, I'm just going to use el, and actually these should go in parentheses because this is a function and I'm using an arrow, so we want to go like that. Of course you could just use function and get rid of the arrow if you want to do that as well. In here let's create a variable called item and set it to, we want to do this syntax, $(el), whatever you put in here should go in here, and let's say .text, and then console.log, so each item should then print out, if I did it right. So let's do it, and there we go: Home, About, Sample Post, Contact. If we want the link, let's say we want each link inside the navigation, let's take the element, our current iteration, and let's get attr, and we want the href attribute. Then we'll go down here and console.log link, and there we go, it outputs all the links. Okay, so pretty simple.

Now what I'm going to do is create a new file so that we can get the posts, loop through them, get the stuff that we need, and put it into a CSV file. So I'm going to create a new file, we'll call this one scrape2.js, and
let's see, in here I'm going to just copy scrape, the initial file we just created, because we want to do all this stuff: make the request, bring in the dependencies. We just want to get rid of all this stuff inside of this. Actually, we want to keep the cheerio.load but get rid of everything else. All right, so let's take a look at the DOM real quick. Each post has a class of post-preview, so that's the selector that we're going to want to use to loop through, and then the title is inside an h2 with the class of post-title, the link is inside an a tag in there, and inside post-meta we have a span with the class of post-date. So that makes it pretty easy. This should be simple.

So we're going to grab the post preview, so class of post-preview, and we want to loop through that with each. That's in some parentheses, let's put the index and el, and I'll put an arrow here. Let's grab the title, so we're going to say const title equals $(el), and then I'm going to use the find method here because I want to find the .post-title. You could put h2 in there as well if you want because there's only one h2, but I'm going to use the class, and we want the text. So let's save that, let's close the sidebar up, and then let's console.log the title. Okay, so we'll run scrape2, node scrape2, and we get all the titles. Now notice there's a shitload of whitespace here. In order to get rid of this, and we did this in the Python one as well but this is a little different, we're going to do .replace, and we want to put a regular expression in here. So we want slashes, and we want to put a backslash s and then a backslash s plus, which will get rid of all the extra whitespace but won't get rid of the single spaces in here, and then we're just going to do g for global so that it gets everything, and we want to replace it with just nothing, an empty string. So let's try that out. If we go and run that, there we
go, so now we have all the titles without the whitespace. All right, so next we want the link, so let's say const, and if you watched the Python one, notice how close this is to that. Even though it's a completely different language and a different library, it just shows that if you learn one language it's easy to pick up others, because you do basically the same thing, it's just a bit of a different syntax, unless you're dealing with really low-level languages where you're doing memory management and stuff like that. But for web development it's pretty easy. So let's do $(el) dot, I lost my train of thought. We're getting all the links, so let's do find for the link, and we want the actual href, so .attr, href. All right, and then let's console.log title and link and run that, and there we go, we get the titles and the links.

Last thing we want is the date, so let's say const date, and that's going to equal $(el), we're going to do .find, and we want to find the class of post-date, and we want to get the text. Okay, so if I were to log all three of these, let's take a look. Sorry about that loud-ass motorcycle outside. So this is good, it's giving us a date, but notice the date has a comma inside of it, so that's going to kind of mess things up for us since this is a comma-separated values file. What we could do is tack on to this .replace and use a regular expression and just put a literal comma. So we want to replace a comma with, let's do a space, so let's see what that gives us. Okay, so actually let's just take the comma out, try that. Yeah, there we go, that's better. All right, so now we shouldn't have an issue with putting this inside of its own column.

Okay, so now we're ready to basically write this to a file. Like I said, to do that we're going to use the file system, or fs, module, which comes with Node, so we don't have to do npm install fs or anything like that. We can just bring it in, so require fs, and we're going to create
a variable called writeStream. Basically we want to open up a stream to write, and we're going to set this to the fs module, and it has this createWriteStream, that's what we want to use here, and then we pass in the file name that we want to use, which is going to be post.csv. Okay, so we have that to work with. Now, just like in the Python video, we have to write the headers, so let's say writeStream.write, and for our headers I'm actually going to put in backticks here so we don't have to do any concatenation or anything like that. So let's say Title, what else was it, the Link and the Date, and then we're just going to put a new line like that. All right, so that'll write the headers. Now down here, let's just copy this, I'm going to replace this console.log, and let's say write to CSV, all right, write row to CSV, and instead of this stuff here we want to write the actual values. Since we used backticks, we can use this syntax where we just put variables inside of a dollar sign and curly braces, so title and link and date, and then we'll just do a new line like that, and we should be good. After the loop, right here, I'm just going to do a console.log and let's say scraping done. Okay, I think that should do it.

So let's open up our explorer, and when we run this file we should now get a post.csv file if everything is correct. So let's run it. Scraping done. Let's go over here, post.csv, okay, it looks good from here in VS Code. Let's open it in, what is it, Numbers, I always think Excel, so I'm going to open it up in Numbers on my Mac and take a look, make sure everything looks okay. So we have our headers, Title, Link, Date, there's the title, there's the link, and there's the date, and we replaced the comma with nothing, so we just took the comma out to avoid any mess-ups with the CSV. All right, so I think that's it guys. Now I know this is very basic, but it gives you a start on how to scrape websites, and like I said in the Python video,
there's a lot of ethics that go into web scraping, because there are a lot of sites that don't want you to scrape their data, so you always have to look into it before you do any kind of scraping on a public site. That's why I didn't use a public site, I just used my own little sample site. Whatever you want to do, it's on you. But that's it guys, so hopefully you enjoyed this and I will see you in the next
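
The clean-up and CSV steps described in the captions can be sketched on their own, without any scraping. This is a sketch under stated assumptions: the helper names cleanTitle, cleanDate, and toCsvRow are hypothetical (the video chains the replace calls inline), and the posts array holds made-up values standing in for what the each loop would yield. Only the two regular expressions, createWriteStream, and the post.csv file name come from the transcript.

```javascript
const fs = require('fs');

// Remove runs of two or more whitespace characters (the /\s\s+/g pattern from the video).
// Single spaces between words are left alone.
function cleanTitle(rawTitle) {
  return rawTitle.replace(/\s\s+/g, '');
}

// Remove the first comma so the date does not split into two CSV columns.
function cleanDate(rawDate) {
  return rawDate.replace(/,/, '');
}

// Build one CSV line from already-cleaned values (hypothetical helper).
function toCsvRow(title, link, date) {
  return `${title},${link},${date}\n`;
}

// Made-up values standing in for the scraped posts.
const posts = [
  { title: '\n      First sample post\n    ', link: '/post-one.html', date: 'January 4, 2018' },
  { title: '\n      Second sample post\n    ', link: '/post-two.html', date: 'March 9, 2018' },
];

const writeStream = fs.createWriteStream('post.csv');
// Report write failures instead of letting an unhandled 'error' event crash the process.
writeStream.on('error', (err) => console.error(`Could not write CSV: ${err.message}`));

// Header row first, then one row per post.
writeStream.write('Title,Link,Date\n');
for (const post of posts) {
  writeStream.write(toCsvRow(cleanTitle(post.title), post.link, cleanDate(post.date)));
}
writeStream.end();

console.log('Scraping done...');
```

Stripping the comma works here because only the date contains one. A title with a comma in it would still break the columns, so quoting each field is probably the safer choice for real data.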
Info
Channel: Traversy Media
Views: 149,477
Rating: 4.9515085 out of 5
Id: LoziivfAAjE
Length: 20min 14sec (1214 seconds)
Published: Wed Aug 08 2018