Build a Web Scraper (super simple!)

Video Statistics and Information

Captions
Hello, friends on the internet. Today I want to show you how easy it is to build a web scraper in under 20 lines of code, and not only that, but also how to adapt the web scraper to scrape whatever you need from a web page. I will be building this project using JavaScript, Node.js and Express, as well as two packages called axios and cheerio. I will be doing this with a beginner's mindset, so if you don't know anything about Node or Express, please do not be worried: I will be taking you through everything step by step and explaining everything we are doing along the way. My aim for this video is to make it accessible to as many of you as possible. A basic understanding of JavaScript is advised but not a hard prerequisite, as I am giving you my full permission to take the 20 lines of code, so just copy and paste them and use them as you wish, after of course understanding what the code does by watching this tutorial.

But before we get started, what exactly is web scraping and what is it useful for? Web scraping refers to the extraction of data from a website quickly and accurately. Imagine, for example, you are working at a company that has asked you to make a list of all the companies working at a particular trade show, and not only that, but their contact names and email addresses. Most people would probably open up the website of the trade show and start writing down the first company, starting at A, then the name, then the email associated with that company, and then move on to the next one, and so on and so on. It could literally take you days to get all the details that you need, and most likely some spelling mistakes would be made. With web scraping you can have all that information in seconds. Many people move on to selling their web scraping tools for money, either by building them as a Chrome extension or API, or by selling them to data-capturing companies, so the option to make money off this tool is there for you too.

Okay, so now that we understand what a web
scraper is and what it can be used for, it's time to get building one. So here we are. I'm just going to create a blank project using WebStorm; please feel free to use whatever code editor or IDE you wish and just create an empty directory. I'm going to click here and call this web scraper, just like so, so that we can start completely from scratch. As you can see, here is my directory, and there are currently no files in it.

Before we get going, I just want to make sure that everyone watching has Node.js installed on their machine. Node.js is essentially an open-source server environment, and we will be using it to create our own server, or in other words our own backend. It's free and allows us to use the JavaScript language to create it, so I am a big fan. I'm just going to head over to the Node.js site. Now, I am using a Mac, so I would of course click here to download this onto my computer; however, here are all the other options you have for installing it, including the source code, so please go ahead and choose the one that you need. I already have this downloaded, so I'm not going to click here, but please go ahead and click whichever version or option is required for you. Okay, great, now let's carry on.

Back in our project, it's time to get coding. The first thing I'm going to do is open up my terminal right here and type a command. The command is npm init. This will trigger initialization and spin up a package.json file. We are creating a package.json file so that we can install packages, or modules, into our project to use. If you want to have a look at all the packages that are available to us, please go ahead and visit npmjs.com. Here are all the packages at our disposal. If you just type one, axios, and click it, you get all the information on how to install it, as well as how many weekly downloads it gets. So there we go, you can literally search through all the packages that are available to you right here
on this registry. As a general rule, any project that uses Node.js, as ours will, needs to have a package.json file, so let's go ahead and create one. I'm just going to press Enter and these prompts will be shown. I'm going to go through them: Enter for the name, version 1 is fine, description I'm going to leave blank, entry point is index.js, that is fine, and then I'm just going to leave all these blank, like so, and click OK. So there we have it. If we go into here, you will see that a package.json file has been generated for us based on the prompts we just answered. Once again, here is our web scraper; the version is one because this is the first version of the app that we are building; the description we left blank; and the main file that we are going to be reading is index.js. So let's go ahead and create that index.js file. I'm just going to create it like so, and there we go.

The package.json file actually does a lot more than just hold our packages and their versions, so if you'd like to know more about it, please pause here and google "beginner's guide to using npm". But for now, let's carry on.

Wonderful. Now that we have that, let's get to installing some packages. The first package that we are going to need is called Express. Express is essentially a back-end framework for Node.js. We're going to install it in order to listen to paths and listen out on our port, to make sure that everything is working. What I mean by this is that if we visit a certain path or URL, it will execute some code, and it will listen out on the port that we define. But enough talking, let me show you how. As I said, the package that we need is called express, so I'm just going to show you it on here. Let's search for the package express, and it will give us the instructions on how to install it. I'm going to copy that, go back to my project and whack the command in here. So, npm i: the i is essentially for install; it's a
shorthand, and I'm going to press Enter and wait for that to install as a dependency of my project. That is now done, and you should see a dependency show up here. There we go: express is our first dependency, and it has shown up here with a version.

What is quite important for you to know is that if this project is not working for you for any reason, it could be (it doesn't have to be, but it could be) because of the version. If that is the case, make sure to delete whatever's in here, write the version that I am using, and install the package again by running npm i for short. That will reinstall the package and will generate a package-lock.json file. As you can see here, this file has been generated since we installed the dependency, and if we look here we will find the express package. I'm just going to find it by typing express, and there we go: you will see the version as well as which registry it has been installed from. Wonderful.

Another reason that this project might not be working is that the Node version you installed could be incompatible. To check your Node version (I'm just going to press Command+K to clear this down here), all you have to do is type node -v and make sure that it's the same as mine. If you want to change the Node version, you can do so. It will require some extra configuration, and you can use the nvm command to install different versions. I'm going to show you how to do this. This might not work for you if you haven't configured your computer correctly, but essentially you can install a certain version onto your computer. So I can install version 0.10.31, for example, and press Enter. Now I'm essentially installing this version as well as having this version. Once that has finished loading, I'm going to show you how to use that version, so let's just wait for that to finish. I can use that version by typing nvm use and then this version
right here, even though by default it has now switched to this version. So instead I'm going to use this version, with nvm use, to switch back to the Node version that we installed. And there we go, we are now using Node version 14.7.6. Wonderful.

So those are two reasons that your project might not work. If you are watching this in the future, perhaps newer versions of Express or newer versions of Node have come out that have made something break. That is just something you need to know, because it is applicable not only to this project but to many projects that you will come across as a developer.

Okay, so we now have the package express. As a reminder, the express package is a back-end framework for Node.js. Another package that we need to use (I'm just going to clear this again) is a package called cheerio. Once again, I'm going to go here and search for the package cheerio, and there we go. Cheerio is a package that we will be using to pick out HTML elements on a web page. It works by parsing markup, and it provides an API for traversing and manipulating the resulting data structure. Cheerio's selector implementation is nearly identical to jQuery's, so if you know jQuery, this might be familiar to you. Now that we know what we will be using it for, let's get to using it to pick out elements from a web page, and we're going to be doing that from this web page right here. So let's go ahead and install it. I'm simply going to copy this and, in WebStorm, install the package cheerio just like we did with express. Once again it should appear in our dependencies right here. Here we go, there is cheerio and the version of cheerio that we installed. Wonderful.

We have one more package to install, and that is axios. So once again, let's go in here and find axios. Axios is a promise-based HTTP client for the browser and Node.js. Axios essentially makes it easy to send HTTP requests to REST
endpoints and perform CRUD operations. This means that we can use it to get, post, put and delete data. It is a very popular package and one that I use quite a lot as a developer on a day-to-day basis. So once again, let's install it; I'm going to show you how to use it in a bit. I'm just going to put that in here and wait for it to install as a dependency. Okay, wonderful. There we have all three of the packages that we're going to need for this project.

Now that we have that, I'm just going to do one more thing, and that is write a script. I'm just going to get rid of that one, because we're not going to need it, and write a start script, so that if I use the command npm run and then start, as that is what we have called the script, I'm essentially going to run nodemon index.js and listen out for changes to the index.js file. That is what nodemon does: it listens out for any changes made to our index.js file. So that is now the setup for our package.json file done. Please feel free to take this from the source code that I have shared with you. Hopefully you understand what all of this means for now, and exactly what we need to get going.

So now let's head over to our index.js file. The first thing that I'm going to do is actually use all the packages that we have just installed. If we go to the documentation, you will see that the first thing we need to do in order to use these packages is to require them in the index.js file. I'm just going to copy that line and paste it in here, like so, and I'm going to do it for all the packages. So we've got axios; we also have cheerio, and the package, again, is called cheerio; and then we also have the package express. So there we go, there are all three of the packages that we need.

The next thing that I'm going to do is initialize Express. To do this, I'm going to get express. What I'm doing here is essentially getting the package and getting all
this wonderfulness, everything that comes with it, and storing it as express. But we need to actually call express in order to release all this wonderfulness, so I can do so by grabbing express and calling it. Now that we have called it, let's save that as something else. I'm going to save it as const app; you can call it whatever you wish. Express comes with great stuff like use, get or listen, and because we've saved it all under app, I'm going to use app.listen to listen out on a port that we decide. Let's decide that our port is going to be const PORT = 8000. So we are saying that we want to listen out on port 8000 to see if any changes are made, and essentially we want our server to run on port 8000. Again, this can be whatever port you wish; that is totally up to you.

So I'm going to listen out on port 8000. The syntax for this looks like this: app.listen, then the port, and then I'm going to pass through a callback. If this is working, I want it to say "server running", because this is my server, "on PORT", and then pass through whatever port we defined up here. So this is looking good: server running on PORT. Let's get to starting our app to see if this has worked. All I'm going to do is use this script, and this script is npm run and then, as I've chosen to call it, start. There we go, and wonderful: our server is indeed running on port 8000, and it will listen out for any changes we make to this file. So if I make a change to this file (let's just call this bob and call this bob, for example) and click save, it will restart due to changes and start again by running node index.js, and then we get the message "server running on PORT 8000". Let's change that back to app, just to make things more readable, and carry on.

So great, that is step one. Now step two: let's get to actually doing some scraping. To do this, I am going to start using some packages, and the first package I'm going to use is axios.
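As a rough sketch of step one (not the exact file from the video), the Express setup described above might look like the following in index.js. It assumes express has been installed with npm i express. The startServer wrapper is a hypothetical helper added here so that nothing starts listening until it is called; the video writes the same lines at the top level of the file.

```javascript
// index.js (step one): a minimal Express server.
// Assumes `npm i express` has been run in this project.

const PORT = 8000; // any free port works; 8000 is the one chosen in the video

// Hypothetical helper: the video writes these lines at the top level instead.
function startServer(port = PORT) {
  // Required inside the function so that defining it needs no packages yet.
  const express = require("express");

  // Calling express() returns the app object that carries use, get, listen, ...
  const app = express();

  // Start listening and log a message once the server is up.
  return app.listen(port, () => console.log(`server running on PORT ${port}`));
}
```

Calling startServer() at the bottom of the file and then running npm run start should log the "server running on PORT 8000" message. This assumes the start script in package.json is nodemon index.js and that nodemon is available in the project.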
Axios works by passing through a URL: it visits the URL, and then I get the response from it. In this case I'm going to get the response data and save it as some HTML that we can work with. So let's pass through the URL that we want to work with. We know that this is the Guardian, so I'm just going to copy that and paste it in here, like so. We can of course make this much more readable, so I'm going to save this as a const url, as I don't plan on it changing, save this string, and then just pass through the url, just like so.

Now that we've passed through that URL, I'm going to do some chaining. If you don't know much about chaining, I do have an asynchronous JavaScript mini-series that I really recommend watching. For now, please just carry on coding along with me anyway. So this will return a promise, and once that promise has resolved, then we get the response of whatever's come back. So: response, and then we're going to get the response data, and let's save this as html. You can call this whatever you wish. Now if I console log html and click save, you will see all this HTML come back to me. This is essentially the HTML from the Guardian home page. You will see it here: Guardian, all Guardian-related stuff.

So this is great, but how do we start picking out certain elements? What if I want to pick out this button, for example? Well, we do so with cheerio, so let's go ahead and do that. I'm just going to delete this for now and use cheerio, the package we just installed. It comes with something called load that will allow us to pass through the html, so all of this, and then we're going to save it as, let's just do, a dollar sign. So there we go: now whenever we use the dollar sign, we're essentially using all of this HTML. I'm going to use the dollar sign, and I can essentially look through all of the HTML elements and look for something
with the... let's go ahead and see what we want to pick out. I'm just going to inspect this page. If we want to pick out, for example, all the titles in here, I can do so: I can pick out each article's title and perhaps the URL that comes with it. Let's go ahead and inspect something. Let's inspect this one. We could look for something that has the, uh, cfc... maybe not this one. Let's make it bigger to have a better view of what we can and can't use. So, for example, if we inspect this h3 tag right here, we can see that it has the class fc-item__title. Let's go ahead and use that, because in it we also see an a tag with an href, which is a URL.

I'm just going to copy this as the class name that we want to look out for. Here we go, and I'm going to paste it like so, making sure to put a dot in front of it, as we are looking for a class name. That is what we are looking for in the HTML, so don't forget to put that dot; that is the syntax that you need. And for each item that we find like this, what do I want to happen? Let's write a function. This is a callback function, and for each item that we find that has the class fc-item__title, I want to get that item (the syntax for doing so is this) and grab its text. We know this is an h3 tag, so it will have some text. If you want to have a look here, there we go, there is some text, and that is what we are grabbing. I also want to grab the href, so I can do so, once again, by grabbing this and getting the attribute href that exists inside it. If I want to be more precise, and I think that might be a good thing to do, I can also find the a tag that exists in that item and then get the attribute href from it. So there we go, that is the syntax for doing so. Let's save this as title, and let's save this as the url that we are looking for. And there we go: for each element that we are finding, we're
getting a title and we're getting something that is the URL.

Now I'm actually going to create an array. Where shall we create this array? Let's just create it up here: const articles, and an empty array. Now, for each item that we find, I want to get a title, I want to get this url, and I'm going to get the articles array, which is currently empty, and use a JavaScript method called push to push something into it. I'm going to create an object, and this object is going to have the title that we just picked out and the url. That's all we really need to do. The next thing I'm going to do, just to show you this is working, is console log out the articles, just like so. And just for good measure, we're going to catch any errors. This is how you catch errors: catch, error, console log error. Okay, great.

So now let's check it out. I'm just going to save that, and let's see what comes back. There we go: we are indeed getting the array coming back. We have literally scraped the web page. Here are the results of our scrape: we are getting back the title and the URL of all the articles that exist on the Guardian home page, and there are a lot. So there we go, we have now successfully scraped a web page, and that's really all there is to it. Hopefully that was easy enough.

Again, if you want to just take this code (let's maybe make it a bit smaller), this is all it is: these are all the lines that you need, along with the setup. You can of course adjust this to scrape whatever you wish. As long as you know what you're looking for on the web page, you can pick out span elements, you can search for a tags, you can search for h3 tags, you can search for things by class name. It is completely up to you. So hopefully this has helped you in creating your own web scraping app. Please do hit me up if you have any questions, or if you just want to chat, do so in
the description below. Thanks very much.
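Putting step two together, here is a minimal sketch of the scraping logic described in the captions, under a few stated assumptions. It assumes axios and cheerio have been installed with npm i axios cheerio. The class name .fc-item__title is the assumed spelling of the class spoken in the video, and the Guardian's markup may well have changed since. The function names extractArticles and scrape are hypothetical, the sketch uses async/await where the video chains .then and .catch, and the trim() call is a small addition to tidy whitespace.

```javascript
// index.js (step two): fetch a page and pick out article titles and links.
// Assumes `npm i axios cheerio` has been run in this project.

// Assumed spelling of the class name spoken in the video; inspect the page
// you are scraping and adjust the selector to match its current markup.
const TITLE_SELECTOR = ".fc-item__title";

// Takes a loaded cheerio instance ($) and returns [{ title, url }, ...].
function extractArticles($, selector = TITLE_SELECTOR) {
  const articles = [];
  $(selector).each((index, element) => {
    const title = $(element).text().trim(); // text inside the matched element
    const url = $(element).find("a").attr("href"); // href of the <a> inside it
    articles.push({ title, url });
  });
  return articles;
}

// Hypothetical wrapper: downloads the page at `url` and scrapes it.
async function scrape(url) {
  // Required inside the function so that defining it needs no packages yet.
  const axios = require("axios");
  const cheerio = require("cheerio");

  const response = await axios.get(url); // resolves once the page is fetched
  const $ = cheerio.load(response.data); // response.data is the raw HTML string
  return extractArticles($);
}
```

To run it, one could call scrape("https://www.theguardian.com/uk").then(console.log).catch(console.error) at the bottom of the file; that URL is an assumption, since the captions only say the page is the Guardian home page. Keeping extractArticles separate from the download step makes it easy to swap in a different selector, such as a span, an a tag or an h3, when adapting the scraper to another site.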
Info
Channel: Code with Ania Kubów
Views: 848,597
Keywords: python web scraping, web scraping tutorial, web scraping, webscrapping, web scrapping nodejs, nodejs, node tutorial, ania kubow, software development, express tutorial, cheerio tutorial, axios tutorial, javascript scraping, axios, cheerio, express, npm, package.json
Id: -3lqUHeZs_0
Length: 23min 25sec (1405 seconds)
Published: Sun Sep 26 2021