Web Scraping with Node.js

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

Useful quickstart guide! I would also recommend Nightmare or Electron for more advanced scraping.

πŸ‘οΈŽ︎ 8 πŸ‘€οΈŽ︎ u/benjaminabel πŸ“…οΈŽ︎ Mar 16 2018 πŸ—«︎ replies

The only web scraper I've had success with scraping JavaScript-rich content is Navalia.

πŸ‘οΈŽ︎ 3 πŸ‘€οΈŽ︎ u/mattk1017 πŸ“…οΈŽ︎ Mar 17 2018 πŸ—«︎ replies

Cheerio - works only with static websites.

For dynamic you need the other tools like puppeteer or webdriver with Chrome/Firefox headless.

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/spmaster007 πŸ“…οΈŽ︎ Mar 17 2018 πŸ—«︎ replies
Captions
hi in this tutorial I'll be going over how to do web scraping using node I was kind of inspired by this article I read client-side web scraping with JavaScript using jQuery and regex and I wanted to continue this to kind of make it even better so instead of using client-side web scraping I decide to use node because so I don't have to deal with the cross origin issues when you're using node you can basically get information from anywhere and it's not gonna give you ears in this article what she does if she gets actually information from the free Koch camp website and finds the number of challenges completed so what I thought would be interesting is if we can combine that with web scraping on a different freaking website on the freako camp forum there's forums dot freako camped out org slash users and it's basically like a leaderboard it's gonna rank the users by how many hearts that they received so I thought I'd be interesting to compare how many hearts that they received compared to how many challenges they've passed on free code camp just to see if there's any correlation between number of challenges and how far they've gotten into the free code camp curriculum with the number of likes that they've received on the free code camp forum now you don't necessarily have to even be on free code camp and in the curriculum to be on the forum so there may not be any correlation but I wanted to use some web scraping to find out the first thing we're gonna do is going and go into sublime text here and I'm going to just require some dependency so we're gonna do Const our P equals require and the first thing is going to be called request promise and this is going to make it easier to make Ajax requests from other websites so we're going to also have to do MPM install riku request promise okay well that's installing I'm gonna do another line here Const cheerio equals require it's gonna be called cheerio material makes it so you can use syntax similar to jQuery from within node so that's another thing they'll be easier to navigate the Dom and different things like that when we're doing the web scraping and let's get that installed NPM install cheerio and then the last thing is Const table now this next one is just supposed to make it easier to format our information for this project we're not going to create a web browser or anything it's just going to display the content the results right in the console so this CLI - table is gonna make it easier to do that so I'll do MPM install CLI - table and then we'll just let that install okay so we're going to need to set some variables one of them is going to be the let users we're just gonna have to have an empty array of the users here so let's start our our first web scrape here so first we're gonna have to set our options this is something for the request promise I'm going to set the options the first thing is going to be the the URL we're going to get the data from now we still to figure that out so if we go over here to this list we have to figure out how to get all this information now now here's one of the problems a lot of websites render all the render the page with the JavaScript so like if I go over here into view page source and I search for let's say I'll just search for Quincy because that was one of the people on there you can see it doesn't find that name anywhere even though we can see the name Quincy right here it's not on the source it's because after the page loads it renders all this in JavaScript now one problem is with you when doing Ajax requests or requests to other websites is that it's not going to load all the JavaScript so we can't just load up this page from nodejs and expect to see all this content because it's not going to load up all the JavaScript so what we're going to do is try to find if there's an API that's getting all this information so I'm going to do a command option J or I guess we didn't need to do J by just trying to open up this developer tools here and I'm going to go to the network tab now I'm going to refresh this one more time and it's gonna you can see all the different calls that it's making to get the information on the page it's good it's loading all these different things so let's go up to the top here and I'm trying to figure out if there's a what if there's a list of users okay if I go to this directory items 1 you can see the whole list rep the items and then then it has a few other things if I click the down arrow oh you can see this look this is the list of users see that likes receive it start with 8850 134 and you can see over here 8850 134 and if we scroll over here well let me just drop one of these downs you'll see user and we can actually get the user names here so what we're actually going to do is use the request to get this information from the API and then you can see the user name I'm just gonna copy this user name really quick if we go over to free code camp org and then I put the user name on the end you'll see that we'll get the free code camp profile for that user the user on the form doesn't necessarily have to use the same user name as the user on the free coconut org website but in many cases they do that use the same user so we can use that username to find the number of challenges passed so to find out the number of challenges passed we're gonna actually have to count up every item on this list here but we'll get to that in a second right now we just need to get this link here so I'm going to let see copy copy the link address so now I'll go over here and for URL I'll put this in some quotes and maybe I will zoom out a little bit here so we can see the whole link here and then I'm going to put JSON to true that means the result is going to parse the JSON for us and make it a little easier to deal with the data so now we're gonna do RP and then pass in the options or RP was this request promise that's going to do a Ajax request and return a promise since it's returning a promise we can do a dot then so once the promise is resolved that means once it's able to get the information from that website then it's going to do something so let me pop this down to the next line and then we're gonna try to figure out what it's going to do it's going to give us some information and I'm gonna put all the information in this just a variable called data and now let's we're gonna make this arrow function here okay let's set up some variables we have let promises equal let user data equal so we're also going to make it an array of the user data now we're going to for that user of data directory items and so it's getting this directory items right from this data here if you look at this page over here you can see that the top level of this results is directory items so we have to get into that and I'm going to do user data dot push when I push onto that array which is going to be the the data from the user basically so the name is user dot user dot user name and I figured out what this is going to be the user dot user that username just because you can see first it has to if you look in this data here we first we get the user dot username right there so also I'm going to set the likes received to user dot likes received and now let's see what I'm going to just add a semicolon that's all we're gonna do for that well in this for loop and we've pushed all the data onto there oh and now that I'm putting it in here I just realized that I tried a few different things and I don't actually need this array of promises when I was figuring out this code at first I tried something with this but actually I don't need this anymore so I'm gonna take that out now I'm just gonna go down here there was a time where I was going to try to call a promise for for each user but then I decided to go about it a different way now just for the the purpose of my log so you can tell what's happening I'm gonna log something and instead of doing console that log I'm gonna do process dot STD out dot right and then just put loading one thing this does compare the console that log is it doesn't add a new line at the end so you can put everything on the same line so now we're going to call another function here which I'm going to create in just a second which is just get challenges complete it and push to user array it's a long for you function name but just describe what's happening so now the next function we're about to call is going to do that it's going to get all the challenges completed and then push it to the users array or the users array is right up here so what I'm going to do here is add some semicolons and then I'm gonna do a dot catch so let's see that catch is just to catch any errors that happen from this Ajax call and I'm going to yeah just do this little function here and we're just going to console dot log the error if there's an error now I'm going to create that function I'm actually just gonna copy this so I don't have to type that again so I'm going to put function and I'm gonna have the user data pass 10 still and let's define that function this is where it's gonna go through each user and make another request for each user on the list there's a few ways to do this but I want each request to be in order so that the the hearts will be in order like this so I just want to make sure that the data the data is in the order that it is on here so when you make a different request is asynchronous so you don't know which resolves going to come back first so the way I figured that out to do well let me just show you how I'm gonna do it to get the everything in order you may be able to find a better way to do this but this is just one of the first things I came up with I'm going to do var I equals 0 when I set this is basically a counter variable and then I'm gonna create a function called next so we're actually gonna keep calling this function it's gonna be used some recursion here so if I is less than user data that length if we haven't gotten to the end of the list we're gonna do this I'm gonna set up a new request so options equals now remember this is just like this one up here up here where we have a URL and we have if we want to do anything with the JSON so I'm going to put URL and this is going to be now if we go back over here we can see that it's just going to be free Co camp that org slash and then it's the user name so let's copy that and paste it in here freako camp org slash and then this is where I'm going to add a plus and user data we're gonna be looping over this so I is the index that we're looping through dot name and what I am going to just change this just to these back ticks that's what the best practice is now so let's see now I'm not actually going to do anything with the JSON I'm gonna do something else it's going to be it's going to be trans form body now this is where we're gonna use that cheerio that we brought in so cheerio that load body we're gonna be able to navigate this this HTML kind of like using something like jQuery and now we're going to make the actual request so rp4 request promised pass in the options and it's made dot then function here's the results I'm gonna put into a variable that's just a dollar sign kind of inspired by jQuery just so we know every time something's loaded I'm going to do process st d out dot right and then we'll know if a user is being loaded so now one thing I also want to check for is well if we go back over here to this list of users I'm going to try to find one trying to see if I could remember basically some of these don't have the same user name I'm free coke amp so if I copy this user name here and I put it again at the end of my URL it says we couldn't find a page for /john - free code camp so I want my results to be able to know whether it's actually getting data from the free coke amp user page or not so the way I figured out to do that was just to check if there's a a so if I go into here and if I inspect this you can see it's landing - heading so I'm gonna see if that exists so I'm going to do const FCC account this is going to end up being a boolean equals and then this is where we can do a look for that tag each one dot landing heading and then I'm going to put dot length equals equals zero so if the length is zero that means it doesn't exist on the page but this will be true if they do have an account and false if they don't have an account with that username we're gonna do a conjures test this is to figure out how many challenges they've passed and actually the code for this I got right from the article I made it pretty similar to how how they got the number of challenges on here so first I want to see if they have an account so this is gonna be a ternary operator do they have an account let me scroll down here so if they do have an account then we can find out how many challenges they pass so we're going to look for tbody TR and then do dot length and if not we're just gonna do a string we're going to set the challenges pass to unknown we don't know how many challenges that they passed maybe they have a different username that they use I'm phreak okay org so if we go back over here I'm going to go back to this page so if I the reason why this works if I inspect this you can see we have a TR let me go back so it's T body TR so in the T body and the TR so each TR is its own row on here so basically we're just counting the rows in that and the number of rows is the number of challenges that the person passed so when you're doing web scraping you often have to go into the code and try to figure out exactly what you need to count for it to make sense to get the actual data we could have just gotten this number right here but that number is not just that number includes the challenges passed and they could also get a higher number based on helping people in the the chatroom but I just want the number of challenges passed so I want to count every line in this row so it would include like if these projects if I go to inspect this you'll see this also has T body TR so it's going to include the projects that they pass to so it's gonna include everything on here it doesn't include the headings or anything so when it counts every single line in this table that will give you the the number here and so again when your web scraping you just have to look a lot the code to figure out the best way to do it so luckily this is not rendered with JavaScript it's just we get a page that has all that in there within the HTML so that's why we're able to do this we got the results but we want to push the results to a table so let's set up the table options up here because we're gonna make a tape this table is just going to appear in the console so I'm gonna put let's see next I'm not going to put it there I'm going to put right down here so let table equals new table and this is just from that CLI table thing that we brought in here it needs specific options we have the heading so what's the heading of the table gonna be like and the first column is gonna be username and then the second column so me heart so if I do ctrl command space that should bring up this if you're using Mac so I'm going to put in a heart there and then challenges so I just used that emoji because of how many hearts that they've received and now I also have to make the column width so I'm gonna put CEO l wi d th s widths and I had already experimented with this and I found that 15 5 10 are gonna be the good column widths now let me go back over here table dot push and then I'm going to push on this information this array so it's going to be user data and then I for what index Ron dot name then user data i dot likes received and challenges passed so those are all the information there for the different columns and then I'm going to just increment that I and then I'm going to put return next it's going to keep running this function and if I is less than user data dot length it's going to do all this where it gets the next users information and then eventually I will it will run function that next and I will not be less than user dot length so we're not going to go through this and return and run the return next so this is a basically our base case so our base case when I put else print data so I'm going to call this other function call we're just gonna print the data that's collected so let me go down here oh so we also have to call this return next to begin with so down here after the function next we have to actually call that so I'm going to put return next to call that function so it's going to start that so now we're actually getting close to the end here we're going to define our print data function so I'm gonna put function print data okay I'm gonna put console dot log we've been printing things through the same with a console dot standard out dot right printing system of the same line so we want to get something to the new line so we're just going to do a console dot log just something to put on the to the new line and I'm just gonna use this checkbox here and then I'm going to do another console dot log and I'm going to print table dot to string so that is going to print out our whole table in the console now I'm probably had some kind of errors here I don't know I'm going to actually try running this and see if it works so I'm going to go over to my console and I'm gonna just put node index is so let's see unexpected identifier let's see what I did wrong there always need a comma at the end of here okay now let's see what happens let's see throw air cannot find module request ok well this will miss you NPM install request it looks like we just need the module request ok now let's try it again so it did something you can see has the word loading here but then the program seemed to end right away it's like to kind of figure out why that is so somehow it oh oh ok I see what the problem is I forgot to put the parentheses here so it never called that function correctly to start off going into the function so now let's try it okay unhandled it says a cheerio is not defined so let's see what oh I spelled cheerio wrong here so I'll save that and go into Mike here again okay it's actually the dots mean it's actually doing these calls the request to the website to get the data so this does seem to be working as its wait till this finishes and here's the table so let me just scroll up here so you can see it loaded all that we've got user name the heart challenges you can see when it puts in the emoji it kind of messes up the spacing here that's okay though but you can see here's the heart and you see you see 8850 134 so if we you can kind of if we go over to here you can see it's getting that number 8850 134 and then the way to check that these are right you can see this one has has quincy as less so quincy Larson if you go over here and put that in there so the way a check is to go through and count every row I'm not going to do it I've done it before the check and it was right I'm not gonna do in this video but feel free to do that if you want to check you can also go in count 291 but that's a lot easier to just account the smaller one just to make sure it's calculating it correctly but you can see you can see here too now we can have the number of hearts on the forum with the number of challenges passed on the website and we can see how they're related so you can kind of when you actually look through this it doesn't seem like there's a major correlation like this person has only four hearts but also he's passed four hundred and three challenges stand with a lot of these have passed a lot of challenges but they don't have very many hearts and really the person who has the fewest challenges pay quincy larson but he has one of the the number three to four spot for number of hearts and you can see some of these there's actually quite a few unknowns here I'd be interested to it just makes me wonder hmmm I wonder do they have these people actually use freako camped out or or they're just participating on the forum most likely they do use free coke amped up the main site and just have a different user name but maybe not so that this web scraping is a great way to combine information from two different sites to get the exact information you're looking for I'm sure that some of you watching may have even better ideas of how to do this so if you figured out a better way to do some of these things put your idea in the comments to this video so everyone else can can see that because I love to learn new ways to do things and it'd be great to see what other people are doing in regards to web scraping and that's it my name is beau Karns thanks for watching don't forget to subscribe and remember use your code for good
Info
Channel: freeCodeCamp.org
Views: 121,827
Rating: undefined out of 5
Keywords: web scraping, tutorial, nodejs, node, how to, guide, web development, javascript, es6
Id: eUYMiztBEdY
Channel Id: undefined
Length: 27min 19sec (1639 seconds)
Published: Thu Feb 15 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.