web scraping with NodeJS

Video Statistics and Information

Captions
Hey there everyone, Hitesh here, back again with another video. In today's video we're going to learn how to do web scraping using Node.js. This web scraping is a bit more on the production-level side. So what's the difference between the production side and a simple tutorial? On the production side we get a little more cautious: our script should produce fewer errors, and taking care of that in web scraping is really a tough thing. We'll be using Node.js, and the only thing you need to follow along with me is Node.js installed on your system; the rest I'll take care of here.

Okay, so what is the plan? We want to take a simple link to scrape, and from it we'll extract some information. To make it more useful, we won't just extract that information; we'll also write it to a CSV (comma-separated values) file, which can be opened in Excel, read by Python, or used for any other purpose. Once all the information is pulled and written to a CSV file, it's much more helpful. Later on we're going to modify this application so that it takes an array as input and multiple links can be provided to it, so that more information can be extracted.

So there's a lot we're going to learn here. We're also going to learn a lot about requests: whenever you fire a request at a web page, what comes back, and what you need to take care of while web scraping. We're choosing the all-time favorite, IMDb, because there is a lot we can learn there, and we're going to learn a little about why this website is ridiculously fast. I hope you are all excited already; if you are, let me know in the comment section, and if you want to see more such videos, let me know in the comment section. I would
love to make more such mini-project-style videos for all of you, so make sure you hit that subscribe button, and let's get started.

First and foremost, let's talk about npm. We'll be using Node, so we need to talk a little about the libraries we'll be using. We will be using cheerio in this particular case. There are many others as well; if you wish, we can make videos on those if you request them in the comment section, but this time we'll keep things easy and use cheerio. cheerio is pretty easy: you store everything in this dollar variable, which is very jQuery-ish. We can load the entire web page or an element into it, and then we can use almost jQuery-like syntax, such as h2.title, grab the text, and so on. That's why I love this one: we can use all of our jQuery or JavaScript knowledge to grab things. So we'll be using cheerio; I highly recommend having a look at its documentation.

Another package we will be using is request. The problem with request is that it has been deprecated, so we'll see these messages here. A slightly strange thing I noticed is that another package, request-promise, which is built on top of request, is still working and is absolutely fine. I'm pretty sure they're going to change things later on, but as of now this is all good and very popular, and if things change we will make another video. Okay, so we can see how to require request-promise. To have it, we first need to install both request and request-promise: even though request is deprecated, request-promise still requires it. Then we can go ahead and do all the requesting. The thing we are most interested in is how these options are used, so that we can provide all the header options. Header options are important when you are
doing web scraping. If you don't provide accurate headers, it's going to look like somebody is trying to do scraping, and the site might block your IP, or it can even become a legal matter. This reminds me of a disclaimer: I'm telling you all of this just for learning purposes. Web scraping sits exactly on the line between legal and not legal, so be absolutely cautious. Talk to the team and tell them you want to use their data; it's their data and you're using it for your own benefit. Be cautious, because it can land you in trouble. Right now this is all just educational.

Here you can notice that we have json: true and so on. If you read this documentation further, you'll realize that they support a whole lot of attributes, and these attributes are important if you want to do web scraping, at least on IMDb or many other such websites. So go ahead and read that a little. This is what we'll be using.

Another package we're going to be using is json2csv (spelled with the digit 2). It's a ridiculously simple JSON converter. Something like it exists in almost all languages; not this exact one, but some version of it. You can install it as json2csv, then provide an array to it and convert anything into this kind of syntax. There is a lot it can do, colors and so on; we won't be going into that. So these are the two or three packages we will be using.

Okay, now coming to what we'll be scraping. The first and foremost thing is to extract the data, and usually I prefer to extract the data in the browser first and then put loops or conditionals around it, so that's exactly what we'll be doing here. First, right-click and inspect, and this is what we get. The Chrome inspect-element tools are absolutely awesome; I love them. We'll be extracting some information, and first we're going to grab the title. Let's go ahead and
click to see what this title looks like. There we go. The title is here, but more importantly there is a div at the top with the class title_wrapper. I'm pretty sure that's going to be unique. Inside it there is an h1, and inside that is the title we are looking for. So how can we grab this title? We're going to use jQuery syntax: a dollar sign, then a pair of parentheses, and inside them the selector we want to latch on to. Selectors could be a whole crash course in themselves, but right now I'll show you enough to get a little hands-on; just search for jQuery selectors or CSS selectors and you'll get pretty good at them eventually.

So how are we going to grab the title? First I want to select a div element. This div has a special property I'm looking for: it has a class equal to something. What is that something? It has a class of title_wrapper, so I'll grab that. Once I'm outside of the brackets, it's not the div itself I'm looking for but its child, and the child is written with this arrow (>), so you can pick up anything that is a child. There is similar syntax for elements on the same level, the siblings; this one is the child one. I want to grab the h1 inside it, and then its text. There we go. The problem is that it's giving me some empty space as well. No big deal; we can easily get rid of that with trim (not wrap, trim). There we go, this looks great.

Now I'm going to copy all of this. I'll be using VS Code in a minute, but right now Sublime Text will do. I'll note that this is my title selector and paste it here. Next I'm going to select the rating, so let's go ahead and grab that. So, rating; there we go.
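As a rough sketch of the title selector just described: the helper below assumes `$` is the function returned by cheerio's `load`, and `getTitle` is a hypothetical name used only for illustration. The class name title_wrapper is the one read out in the video and may no longer match IMDb's markup.

```javascript
// Sketch only: `$` is assumed to be what cheerio.load(html) returns.
// `getTitle` is a hypothetical helper name, not something from the video.
function getTitle($) {
  // A div whose class attribute equals "title_wrapper", then its direct child <h1>.
  // .text() returns the text content; .trim() strips the surrounding whitespace.
  return $('div[class="title_wrapper"] > h1').text().trim();
}
```

With cheerio installed, this would be called as `getTitle(cheerio.load(html))`, where `html` is the downloaded page source.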
How are we going to grab the rating? We can see there is again a div, which has this ratingValue class, and then I can latch on to the strong tag, and under this strong tag there is a span tag. We can latch on to that to grab the exact rating. Let's give it a try. Again: dollar sign, parentheses, quotes. I want to grab a div that has a class equal to (oops, not that) ratingValue; I'll copy that from the div here. What else do we need? The div has a strong, and then a span, so: a strong, and inside that a span tag, and I want to grab the text out of it. There we go, perfect; no need to even trim that. I'm going to copy this one; this grabs the rating.

What more information? I would love to get the summary too. In almost every scraper I have designed for IMDb I've grabbed the summary, even for personal purposes. This is my summary text, and we could latch on to it by saying plot_summary and then summary_text. As you can see, summary_text is unique (the credit item is repeated many times, but summary_text is unique in itself), so we can latch on to it directly. But IMDb changes its markup quite a lot, so writing a slightly more extended selector is always good, though not necessary. What are we looking for? A div that has summary_text as its class. So let's grab a div with the property class equal to summary_text and grab the text out of it. Oh, we definitely need to trim that. trim, there we go, looks awesome. So this one grabs the summary text. Now I'm going to show you one more thing which
you might be wondering about. First, let's save the summary; this is the summary selector. A lot of the time people say that you can shorten all this: select the element, right-click, click Copy, and there is Copy XPath as well as Copy selector. Yes, of course you can do that; in fact, sometimes I use it to save time. The problem is that if I go ahead and copy this selector, see how big it is. It will surely work, but remember how big it is, whereas what we have written is much more compact and much more readable. You can surely go for that. In some cases I prefer the XPath selector, which is a query selector, but this is not really something I would go for; I would rather write a custom one if possible. Again, there's too much of it here, so let me delete it. There we go.

So what information have we got? Title, rating, and summary. What else? Let's grab a release date as well. You can grab any information like this. If the information has multiple values, like this one, we can just create an array and push the values into it. No big deal.

So how is this one going to be grabbed? Interestingly, this anchor tag always has a title that says "See more release dates", so we can latch on to it directly. I have checked it already; it is actually pretty unique, and it's rare to find something unique. You could also use a bit of a regular expression here, so that if the href contains something like releaseinfo you grab it. There are a lot of ways, and we can definitely make a more advanced tutorial on that, but let's keep it easy this time. So again, a dollar sign, and let's grab this (oops, not like that). There we go. So how do we latch on to it? A div, and in
this div... was it a div? No, it was just an anchor tag that has a title with this information. So not this one; let's delete that. I want to grab an a tag this time, which has a title, and the title is going to be equal to exactly that. I want to grab the text out of it, and I need to trim it because there is a line break coming through, so I'll say trim. There we go, perfect. I'll copy that, and finally this is the release date selector. So all the things are ready for us. If you want more information, you can extract it from here in the same way; I'll keep it to this.

Now I'll drag and drop my empty folder here, which is called YouTube IMDB. I'll link it in the description as well so that you can grab all of these exercise files. I'll fire up the terminal and say npm init to initialize. Before you do that, make sure Node is installed: if I say node -v, it should give me a version. It doesn't matter which version, but something should be returned. For npm init I'll add -y so that it doesn't ask me any questions and just gives me this file. Okay, so this is what we've got.

Now we are going to grab all the libraries we need. I'll go to the website and grab the name from there; I know all of these, but I need request-promise, because sometimes I misspell it. Let's go ahead and install all of them: npm install request-promise, and we're going to grab request as well. What else do we need? cheerio for sure, and of course json2csv. Looks great. Do we need anything else? I don't think so. Let's go ahead and run that and we
will be creating a file, but the fs (file system) module is available by default in Node, so we can just go ahead and work with it. As of now there is nothing here; the editor says you can create an index.js and work on it. Yes, of course, I would love to create that: index.js.

First and foremost, let's bring everything in. We're going to declare request (you can give it a shorter name as well, but I'm going to call it request), and that comes from (not request, we need require, here we go) require of request-promise. Remember, request is deprecated. We're going to need cheerio as well, so cheerio comes from require of cheerio. We will need the file system as well, so fs comes from require of fs, to create a file in CSV format. This reminds me that we need to bring in the CSV library as well. We'll call it json2csv and require it by the name json2csv, and because we need to parse the information from an array, make sure you take the Parser from it like this; again, it is mentioned in the documentation.

Okay, so how are we going to do that? First I'm going to create a simple constant called movie, which will hold the URL I need to request. This brings us to something else you should know: whenever you make a web request, especially at production grade, it's not just about having selectors that pull the information out. You also need to go into the Network tab and look at a lot more detail. Let me reload this IMDb page so that you can see what's going on. As you can see, a whole lot of network activity is going on, but usually the very first entry is the important one. Remember, the URL is
exactly the same one we are grabbing here. So this request is being made, and you need to understand not only the response headers; you have to make a choice about the request headers. The more exact the request headers you send, the less likely it is that your IP gets blocked or something like that. It should really look like an original request made from a browser, so all this information is important. I won't be copying and pasting all of it, though I have seen many people copy exactly all of this information. The most important information we need right now is all these Accept headers, including the one for language. The browser information is also usually needed, but not as much. I'm going to copy all of this in a minute.

Another thing I'd like to draw your attention to is the response: you need to understand what kind of response you are getting back. Let me find what I'm looking for in the response headers; it's going to take a couple of seconds, so just wait. There we go; don't worry, I'll zoom in for you. In the response headers, remember content-encoding: this is the most important piece of information to look for. It says that the content encoding is not your regular UTF-8; sometimes it is JSON, sometimes it is gzip. What IMDb is doing is gzipping the entire content and sending it to your browser, where it is extracted, and that makes it ridiculously fast as a website. Even the images and everything are gzipped. This is crucial information, so make sure you spend enough time in the request's headers so that you can grab all of it.

So enough of the theory; let's go ahead. First I'll copy this URL with all its information and put the movie here. We are going to transform that into an array of movies later on, but right now
this is all we need.

Making a web request is usually an asynchronous operation, so we need to understand how to run an asynchronous operation in the project itself. Usually there are triggers that do it, maybe on loading; I could have been working in React, where it does the job automatically as it loads up, but right now we will be firing it manually. So how do we define a method with no name? We define it like this, and there we go: this is my method with no name at all. How do you run that method immediately? You just put a pair of parentheses after it, and you have run it immediately. Apart from that, in order to run this whole method we have to make it asynchronous too, and right now this is not proper syntax for running it, so we need to wrap it inside a pair of parentheses. Now this callback method without any name can be run directly by using the trailing pair of parentheses. I know it's slightly weird syntax, but we do this kind of trickery sometimes. If you want it to run asynchronously, you simply add async, and there we go. It is a bit of a weird syntax, so rewind a little and watch it again.

Let me show you one more time. First, how do you define a simple callback method? That's how we define it. How do you run it immediately? First you wrap it up, because it's one unit, so we pack it in parentheses and then run it immediately. If you want it asynchronous, you simply add async, and your method is now asynchronous; as soon as the flow of the code reaches here, it's just going to run.

Okay, so we'll have a whole lot of information, so let's create an array, imdbData or however you want to name it. There we go, empty. We will be pushing all the
information into this array. Remember, json2csv expects an array to be passed in and can convert that array into CSV, so that's why we are creating this array. After that I would simply like to create a response. Remember, the response comes from the request (oops, not cheerio; we will be using request to create it). So let's create a response here. What exactly is this response? It is the data that comes back when the request library makes a request to this URL. Is that clear? Okay, so we have a response, and since this is async, we'll put an await request here.

As I mentioned, you need to provide as much information to this request as you can; we saw that in the request and response headers of the web page as well. So let's provide some of it. First, it requires a uri property, which is the URL you want to request; in this case that is simply movie. After that you provide the headers, so all the header information goes here. Then, most importantly, you mention the exact type you are expecting. Remember content-encoding: if it is UTF-8 you can skip this, since that is already the default, but in this case it is not, so I have to mention it. If it is JSON, I would say json: true; if it is gzip, I have to raise that flag, so I need to set gzip to true. I hope that makes sense.

Moving forward, what headers are we going to provide, and how? For now you can grab as many headers as you need. I'll be grabbing the request headers that are Accept headers, everything from accept through accept-language; I'll copy that and use it. One extra header I usually
pass in such cases is the browser information, which is usually checked. There we go. This is going to throw errors as pasted, so all you need to do is wrap each of these in quotes. Let's quote everything up: there we go, one more, and finally this one as well. Looks nice. And yes, we forgot to put a comma; there we go. So these are the request headers: grab as many headers as you can, and save. This is how you pass the header information. It's on the next line, but anyway. So this is all we have got, and the entire response is now stored in response.

Now comes the thing we are using, cheerio, which makes it easier for us to fire queries that are almost the same as jQuery. We are still inside the async method; I'll go just below here, and now I can use the exact same syntax we saw in the documentation. Let the dollar handle everything: we say cheerio.load and pass it the whole response. Once you have that, the selectors we created become useful. This is the entire set; copy that, paste it, ta-da. It's not going to work like that, so we're going to fix these up, there we go, and we definitely need to add let or const, however you want to work. I usually prefer const: the more constants you have, the easier life is.

So all this information is here, and all you need to do is push it into imdbData. Let's do that: imdbData has a push method, which can add all the information. We're going to create a JSON object with the name title, which will hold the title information, but since we are using modern JavaScript features we don't need to write title: title
and then rating: rating. We can get rid of all of that and simply say title, rating, summary, releaseDate. There we go; save that. So now all the information is being pushed in here.

The final thing we need to do is go outside of this. The whole asynchronous operation is done and everything has been pushed in properly, so I'll go outside of it; only the requests and those parts need to be done asynchronously. Now we need to write all of this information to the CSV file, which we can easily do by passing this array to json2csv. We're going to create a simple object of it, so we'll say json2csvParser (okay, terrible name). Why is there an issue? Am I a bit outside? I am here and I need to go a bit inside, so I'm going to cut this and move it inside. There we go, looks good now.

In that json2csvParser we're going to say new json2csv; again I need to grab this name here and copy it. Let me check one more time, because these are not easy to remember. Let's see how this actually works in the json2csv documentation, because nobody remembers everything; we all check the documentation every single time. Come on, give me the JavaScript module. There we go: we need to grab a Parser, and then we can parse the data. We can use this const csv. Again, the streaming API is probably the better option, and other options are given too, but I'm going to go with this one. All I need to do is grab the Parser from json2csv, which we have already done, and then I can get a csv by calling parse with my data and options, and we can also console.log it. Let's try that. So now we have grabbed this one
here already, and then, just like the documentation, we can say csv equals the json2csv parser's parse method, and inside that we provide the data, which is imdbData. There we go; now the CSV is ready, and all we have to do is write a file. So let's use fs, and I'm going to use (come on, suggest it) writeFileSync. The first thing you need to provide is the path, the name of the file; then the buffer, meaning the data; and then an optional encoding. We'll write to the same directory, so the name is imdb.csv; the data is csv, which we just created; and the encoding is utf-8. Looks okay. We might want to log something to the console, but we don't need it.

Save this and let's see whether it works or needs a little more polish. I'm going to say node index.js (there are a couple of other ways as well). "Cannot access 'fs' before initialization": looks like we haven't initialized fs. How is that even possible? My bad, this needs to go like this. Save that and try one more time. Okay, now it says something is not a function. Why is this parenthesis ending up there? I'll put a semicolon there. There we go, now it looks good. Let's try one more time, and finally something is here. Remember, there are always going to be errors in your program, even if you don't see the part where I solve them; it's not all sunshine and rainbows.

So we are grabbing the title, rating, summary, and release date. The title is coming through nicely, then the rating, a pretty big summary, and the release date as well.

Now we come to the part where we modify this program to grab multiple movies. How can we do that? Let's first rename movie to movies, and this entire thing
now needs to go inside square brackets, and we need to grab a couple more movies as well. I'll press Enter one more time, and let's grab a couple more. Probably this one; it is getting very famous in India. I'll copy that. Usually I don't like such a big string and would shorten it, but anyway. And just for fun I would love to grab another one. Let's see if there is anything... not these ones... yes, that's a good one. Come on, load. There we go: a smaller title and a smaller URL, I love that. So I'm going to grab this one.

Now we need to modify the code so that it loops through them. Looping is going to be super easy. All of this needs to run inside the loop and keep pushing things into imdbData, so it needs to be done again and again, up until the JSON parsing. So I'll select from here up to the JSON parsing and cut all of it, and now I'm going to write a loop into which I can put all of that. It should be fairly simple: we'll use a simple for loop and say let movie of movies. Did we name it movies here? Yes, movies. What do I want to do? Exactly what I was doing before, but this time I keep pushing the data in, so a new object is created every single time, in theory. We have modified it to work with many movies; save that and let's run it one more time. Since we are not appending the data, the file should get a fresh start. There we go: this time we are getting this movie, this movie, and this movie. Looks nice and easy.

So I would call that job done, and we have learned so many new things in just one single scraper. If you'd like me to talk more about these selectors, because this is the
essential part of how things are done, not only in CSS but in scraping and a whole bunch of other things, let me know in the comment section. I would love to make a separate, dedicated video about selectors, and we'll have a lot of fun with that. Make sure you hit that subscribe button, and I'll catch you in the next one. [Music]
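The four selectors built up in the captions can be collected into one function, roughly as below. This is a sketch under assumptions: `$` is taken to be what cheerio's `load` returns, `extractMovie` is a hypothetical name, and the class names (title_wrapper, ratingValue, summary_text) are transcribed from the spoken description of IMDb's 2020 markup, so they may be spelled differently or no longer exist on the live site.

```javascript
// Sketch only: `$` is assumed to be what cheerio.load(html) returns.
// `extractMovie` is a hypothetical helper; class names are as heard in the video.
function extractMovie($) {
  // Title: <h1> that is a direct child of the title_wrapper div.
  const title = $('div[class="title_wrapper"] > h1').text().trim();
  // Rating: div > strong > span; the video notes no trim is needed here.
  const rating = $('div[class="ratingValue"] > strong > span').text();
  // Summary: the div whose class is summary_text.
  const summary = $('div[class="summary_text"]').text().trim();
  // Release date: the anchor whose title attribute is "See more release dates".
  const releaseDate = $('a[title="See more release dates"]').text().trim();
  // Shorthand properties: { title } means the same as { title: title }.
  return { title, rating, summary, releaseDate };
}
```

The returned object is what gets pushed into the imdbData array, one object per movie.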
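The request options, the immediately invoked async function, and the loop over several movies can be sketched together as follows. So that it runs offline, this sketch replaces request-promise with a stand-in called `fakeRequest`; `buildOptions`, the example URLs, and all header values are illustrative assumptions, and real header values would be copied from the browser's Network tab. The option names `uri`, `headers`, and `gzip` are the ones the video passes to request-promise. Note that gzip is a content encoding, which is separate from a character set such as UTF-8.

```javascript
// Hypothetical URLs; the video uses real IMDb title pages instead.
const movies = [
  "https://example.com/title/one/",
  "https://example.com/title/two/",
];

// Options in the shape passed to request-promise in the video.
// Header values are placeholders, not copied from a real browser session.
function buildOptions(uri) {
  return {
    uri,
    headers: {
      // Keys containing a dash must be quoted, as the video discovers.
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "accept-encoding": "gzip, deflate",
      "accept-language": "en-US,en;q=0.9",
      "user-agent": "Mozilla/5.0 (placeholder browser string)",
    },
    gzip: true, // ask the library to decompress a gzip-encoded body
  };
}

// Stand-in for require("request-promise"): resolves to a fake HTML string.
const fakeRequest = async (options) =>
  `<html><!-- fetched ${options.uri} --></html>`;

const imdbData = [];

// An async function with no name, wrapped in parentheses and called at once.
// Keeping the returned promise lets other code wait for the loop to finish.
const finished = (async () => {
  for (const movie of movies) {
    // Requests run one after another because of await inside for...of.
    const response = await fakeRequest(buildOptions(movie));
    imdbData.push({ url: movie, length: response.length });
  }
  return imdbData;
})();
```

When the function expression is not assigned to a variable, the preceding statement should end with a semicolon; otherwise the opening parenthesis is parsed as a call on the previous expression, which is the "is not a function" error hit in the video.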
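The final step, turning the array into CSV and writing it with writeFileSync, might look like the sketch below. The video uses json2csv's Parser class and its parse method for the conversion; to stay runnable without installing anything, this sketch swaps that for a small hand-written `toCsv` function, which is an assumption of the example and does not claim to match json2csv's output exactly. The sample rows and the temporary output path are also made up.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hand-written stand-in for json2csv: quotes every field and doubles any
// embedded double quotes. Column order comes from the first row's keys.
function toCsv(rows) {
  if (rows.length === 0) return "";
  const columns = Object.keys(rows[0]);
  const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const lines = [columns.map(quote).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => quote(row[column])).join(","));
  }
  return lines.join("\n");
}

// Made-up rows in the shape the scraper pushes into imdbData.
const imdbData = [
  { title: "Example One", rating: "8.1", summary: "A summary, with a comma.", releaseDate: "1 January 2020" },
  { title: 'Example "Two"', rating: "7.4", summary: "Another summary.", releaseDate: "2 February 2020" },
];

const csv = toCsv(imdbData);

// The video writes imdb.csv next to index.js; a temp directory is used here.
const outFile = path.join(os.tmpdir(), "imdb-example.csv");
// Arguments: file path, data, encoding. An existing file is overwritten,
// which is why each run in the video starts fresh instead of appending.
fs.writeFileSync(outFile, csv, "utf-8");
```

Quoting every field keeps commas and quotes inside summaries from breaking the column layout when the file is opened in a spreadsheet.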
Info
Channel: Hitesh Choudhary
Views: 30,874
Rating: 4.9087138 out of 5
Keywords: Programming, LearnCodeOnline, web scraping, nodejs, cheerio, data scraping, python, web scraping tutorials, udemy, linkedin, pluralsight
Id: BqGq9MTSt7g
Length: 33min 55sec (2035 seconds)
Published: Fri Apr 10 2020