Intro To Web Scraping With Puppeteer

Captions
Hey, what's going on guys? In this video I'm going to give you an introduction to web scraping with a tool called Puppeteer. If you're watching this, I'm sure you know what a data API is: you have different endpoints that you can hit with an HTTP request to fetch formatted data, whether from a public or a private API. But what if the data you want isn't available through any kind of API? In many cases you can scrape that data yourself. There are a lot of different tools you can do this with, but Puppeteer is extremely powerful, and it's used for more than just web scraping. It's essentially a headless Chrome browser, so anything you can do in the browser normally, you can basically do programmatically through Puppeteer. You have complete access to the DOM, you can fire off events, parse JavaScript, and create screenshots and PDFs of websites. It's really cool, and I'd encourage you to look further into it if you enjoy this video.

What we're going to do is scrape all of the courses from my home page, traversymedia.com. We're going to take all the courses, get the title, the Udemy link, the course level, and the promo code for each one, put them into a JSON array, and then save that to a file on my system. Obviously there are different things you can do with different data, but I'm just going to show you how you can get that data.

All right, so let's get started. I'd recommend that you follow along. You don't have to, but I would recommend it. This is the Puppeteer website, where you can find all the documentation: pptr.dev. The data we want to scrape is at traversymedia.com. We have these courses, and for each one I want the title, the level, the Udemy link URL, and the promo code. So I want to get all that,
put it into JSON objects in an array, and save it to a file. I'll also show you a couple of other things we can do with Puppeteer along the way.

I have VS Code open with an empty folder called course-scrape, and of course you need Node.js installed. If you don't have it, just go to nodejs.org, download and install it, and you'll get Node as well as npm, the Node package manager. First we're going to run npm init and just add a -y flag so we don't have to go through the questions; that initializes a package.json file, which holds our scripts, dependencies, and all that. Then the only dependency we want to install, with npm install (or npm i), is Puppeteer, so let's do that. That creates our node_modules folder with Puppeteer and all of its dependencies, and Puppeteer gets added to package.json.

For our entry point file, I'm going to create index.js; you can call it whatever you'd like. Then I'm going to create a start script for it. You don't have to; you can just run node index, but I'm going to make it so we can run npm start, and in that script we run node index. If we save that, do a console.log in index.js, and run npm start down in the terminal, we should see that log.

The first thing we want to do, of course, is bring in Puppeteer, so let's require the package. All of the methods we're going to use are asynchronous, so we're going to have everything in an async function. You can call it whatever you want; I'm going to call it run, and then we call it. The first thing to do is launch a browser: we create a variable called browser and set it to await puppeteer.launch(). This essentially just launches a browser programmatically so that we can do things like access different pages and the elements on them, and fire off events.
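The setup steps described above boil down to a few commands. This is a sketch assuming Node.js and npm are already installed; the folder name is the one used in the video:

```shell
mkdir course-scrape && cd course-scrape
npm init -y            # create package.json without the interactive questions
npm install puppeteer  # adds Puppeteer and downloads a bundled Chromium build
```

After that, adding `"start": "node index"` under `"scripts"` in package.json lets you run the scraper with `npm start`.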
To access a page, we need to initialize a page variable, and we can do that with await browser.newPage(). To go to a specific page, we await page.goto() and pass in the URL of the page we want, which in this case is traversymedia.com. Before I do anything else, I just want to go to the bottom and make sure I close the browser, which we do with browser.close(). In between is where we can access DOM elements and do pretty much anything we want.

There are some other cool things I want to show you before we get into targeting content and data. First of all, we can create a screenshot. We do that with await page.screenshot(), passing in an object with a path, which is where we want the image to go; let's say example.png. So if I run npm start, it runs our index file, takes a second or two, and you can see example.png has been created. By default it captures a specific viewport size, not the whole page. What you can do is add another property, fullPage, and set it to true. If we do that and run it again, it overwrites the initial example.png, and now you can see we get the whole website as an image.

You can do the same thing with a PDF. I'll copy that line down, comment the original out, and instead of page.screenshot we call page.pdf. I'll change the extension to .pdf, and instead of fullPage we pass in a format of A4. If I save that and run the file again, we now have a PDF. If I bring that over and open it up, you can see it doesn't have the exact same styling, but it has all the content. So if you need to generate a PDF of a specific web page for some reason, you can do that.
Now I want to get into targeting some of the content, so let's comment that PDF line out and look at how to get the entire HTML of a page. We'll create a variable called html and set it to await page.content(), a method that gives us all of the HTML. I'm just going to console.log it, and when we run it, you can see all the HTML from my home page in the console. Of course, you can do whatever you want with this; I'm just logging it.

Let's comment that out. If you want to get the title, or really anything, like targeting h3s, there's a method called evaluate on the page object. Say we want the title: we call page.evaluate(), which is a higher-order function, so we pass in a function, and inside it we have access to the document object. So I can just return document.title and console.log the title. Then we run it, and we should see the title of the page... or not, because I forgot an await. There we go: Traversy Media, Learn Web Development.

If we want to get, let's say, all of the text on the page, we can do that too. Comment that out, and we'll say const text = await page.evaluate(), once again passing in a function with access to the document object, so I can simply return document.body.innerText, and then console.log the text. If we run it, we get all the text: you can see the stuff in my footer, Patreon, the YouTube channel, all the courses. It's just all the text.
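The three inspection calls just covered (full HTML, title, body text) can be bundled into one small helper. `inspectPage` is a name I'm making up for this sketch; `page` is an already-opened Puppeteer Page:

```javascript
// Pull the raw HTML, the document title, and all visible text from a page.
// The functions passed to evaluate() run inside the browser, not in Node.
async function inspectPage(page) {
  const html = await page.content();                               // full HTML source
  const title = await page.evaluate(() => document.title);         // <title> text
  const text = await page.evaluate(() => document.body.innerText); // visible text
  return { html, title, text };
}
```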
Now let's say I want to get all the links on the page. I'll comment that out, say const links, and set it to await page.evaluate() again. Since we're getting all the links, we're dealing with multiple elements. We could use querySelector to get any single element, but we want all the links, so we're going to use querySelectorAll, the same way you would access the DOM in front-end JavaScript. Now, querySelectorAll gives us something called a NodeList, so inside our function we wrap the querySelectorAll call in Array.from(), which basically creates a shallow array copy from an iterable object such as a NodeList. So in here we say Array.from(document.querySelectorAll('a')). Array.from also takes a second argument, a mapping function, and the parameter we pass in represents each element; for each one we want the href, the URL, so we return e.href. That should give us all the links, so let's console.log links and run it. Let me make the terminal a little bigger so we can see: these are all the links on that page.

Now, we can do basically the same thing to get all the courses; we just need to be a little more specific than all the a tags. I'll copy that code, comment the original out, paste it in, and change links to courses. We're still going to use querySelectorAll, but now we actually have to look at the structure of the website in order to scrape it, because we need to know what selector to put in there. So let's go to traversymedia.com and open up the dev tools, because we know we want these courses.
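The link-collecting callback can be written as a standalone function to make the Array.from pattern clear. In the actual script it is inlined inside evaluate(), since evaluate callbacks are serialized into the page:

```javascript
// Collect every <a> href from a document-like object. Array.from's second
// argument is a mapping function applied to each item of the NodeList.
function extractLinks(doc) {
  return Array.from(doc.querySelectorAll("a"), (e) => e.href);
}

// Inside Puppeteer this is used as:
//   const links = await page.evaluate(() =>
//     Array.from(document.querySelectorAll("a"), (e) => e.href)
//   );
```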
If we look at the HTML, there's a section with the id of courses, so the courses are all contained in that. Looking further in, every course, including the ones down where it says "more", has a class of card around it, and every card contains a card-body and a card-footer. The data we want is split between those two divs. In the card-body we have the title, which is in an h3, and the level, which is in a div with the class of level. Then in the footer is the Udemy link that we want, and inside the div with the class of promo-code is our promo code, which is actually in another class called promo; we want to get that as well. So you really have to dig into the structure of whatever website you're scraping.

Now that we know that, let's jump back in. Instead of just all the a tags, the selector will be the id of courses (remember, that section wraps around all of them) and then all the cards within it: '#courses .card'. And instead of just returning the href, I want to return an object for each course. Since I want to return an object from the arrow function, I put parentheses around it; otherwise it would just be treated as a code block. Whatever properties I put in that object get returned.

We know we want the title. Inside each card there are two other elements wrapping the fields we want, the card-body and the card-footer. So I'm going to take the element, which is the card, call it e, and use querySelector on it to go into the card-body, then the h3 inside it, so '.card-body h3', and finally get the text with innerText. That should give me the title, and we can actually try this
out: let's console.log courses and run it. Okay, it looks like I messed something up: "Evaluation failed: Cannot read properties of null (reading 'innerText')". Oh, I forgot the dot for the class. Let's try that again, and there we go: we have an array with all the course titles, which is pretty cool.

Now I also want to get the level. Remember, the level is also in the card-body, in a class of level. So we can just copy that line down, change title to level, and change the h3 to the class '.level'. That gets us the level, so let's keep going. The other two things we want, the link and the promo code, are in the footer. For the URL we do e.querySelector('.card-footer a') and grab .href; that gives us the Udemy link. Finally we want the promo code, so let's call it promo and say e.querySelector with the promo-code class in the card-footer. Now, if I just use '.card-footer .promo-code', it's also going to include the "Code:" label text, so we need to go deeper, into the promo class, which contains the actual code. So let's add on '.promo', giving us '.card-footer .promo-code .promo'.

We'll save that and run it again. We have the title, the level, the URL... but the promo code isn't showing. Oh, I forgot innerText; the selector alone just selects the element, and we still need to get the text from it. Now if we run it again, yep, there we go: we have all of the promo codes.

All right, cool. Now, there's another syntax we can use to do this without Array.from, and I want to show you that as well; it'll get us the same thing.
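The per-card extraction logic built up above can be shown as a standalone function. In the actual script it lives inline inside the evaluate callback; the selectors are the ones taken from the site's markup in the video:

```javascript
// Given one .card element, pull out the four fields described above.
function extractCourse(card) {
  return {
    title: card.querySelector(".card-body h3").innerText,
    level: card.querySelector(".card-body .level").innerText,
    url: card.querySelector(".card-footer a").href,
    // .promo-code alone would include the "Code:" label, so go one level deeper
    promo: card.querySelector(".card-footer .promo-code .promo").innerText,
  };
}
```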
We'll say courses, and we're still going to use our page object, but instead of evaluate we can use the double-dollar-sign eval method, $$eval, and put what we're targeting right in there: the id of courses and all the cards within, '#courses .card'. So we're doing this without the initial Array.from and document.querySelectorAll. $$eval takes a second argument, a function that receives the matched elements, and we then map over those elements with .map, so we're just doing what we did above in a different way. For each element we pass in e, and the map callback is where we return our object, again with parentheses around the curly braces. We should be able to copy the same properties from before, and it works the same way, because we're passing in e and using querySelector on it. So let's try it: save, run npm start, and yep, we get the same thing.

Now, instead of just logging these to the console, let's actually save them to a file. To do that with Node.js we're going to use fs, the file system module, so const fs = require('fs'). Then let's come back down and save the data to a JSON file. We do that with fs.writeFile, passing in a couple of things: the name of our file, courses.json, and the data, which needs to be valid JSON before we save it, so we run our courses array through JSON.stringify. writeFile also takes a callback with a possible error.
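The $$eval variant just described can be sketched as a helper that takes an open Page. `scrapeCourses` is a made-up name; the selectors come from the video:

```javascript
// page.$$eval(selector, fn) runs document.querySelectorAll(selector) in the
// browser and hands the matched elements to fn, so no Array.from is needed.
async function scrapeCourses(page) {
  return page.$$eval("#courses .card", (elements) =>
    elements.map((e) => ({
      title: e.querySelector(".card-body h3").innerText,
      level: e.querySelector(".card-body .level").innerText,
      url: e.querySelector(".card-footer a").href,
      promo: e.querySelector(".card-footer .promo-code .promo").innerText,
    }))
  );
}
```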
If there's an error, we'll throw it; otherwise we'll just console.log "File saved". That should do it. Let's open up the sidebar and run npm start. It goes and scrapes the data (not really fetching, scraping), puts it into that array, and writes it into this courses.json file, and if I save that file, Prettier will make it pretty.

Now I can do whatever I want with this data. You might do this just for personal reasons; maybe you want to keep up to date with all the codes. My codes are pretty easy to remember, just the course name or topic plus the month and year, but if they were crazy codes, you might want to be able to run a single command and fetch them all. There are a million reasons why you might want to scrape data.

So that's pretty much it. There's a lot more to Puppeteer and a lot more you can do, like firing off events and things like that. As I said, I may do a more in-depth course, but I wanted this to be an introduction to data scraping, so that if there's data you want to get and use in some way but there's no API, you have a way to do it yourself. All right, thanks for watching guys, and I'll see you next time.
Info
Channel: Traversy Media
Views: 87,655
Id: S67gyqnYHmI
Length: 21min 24sec (1284 seconds)
Published: Wed Nov 09 2022