Web Scraping with Puppeteer, NodeJS & Shopify

Video Statistics and Information

Reddit Comments

Hey I really enjoy your style of teaching! I noticed you used to get a ton of views, I'm wondering why you stopped? Also, from what I can tell you mostly teach more intermediate/advanced topics, do you have any beginner JS tutorials? I am struggling to teach myself, and am becoming discouraged from other books/video series.

👍︎︎ 1 👤︎︎ u/SuperSecretDaveyDave 📅︎︎ May 23 2018 🗫︎ replies

Nice production quality, enjoying the videos :)

👍︎︎ 1 👤︎︎ u/ScriptingInJava 📅︎︎ May 23 2018 🗫︎ replies

Anyone getting a Cannot find module 'debug' error when running this? It's coming from puppeteer/lib/helper.js:19:20

👍︎︎ 1 👤︎︎ u/steeleb88 📅︎︎ May 24 2018 🗫︎ replies
Captions
Hey guys, so today we're going to be looking at some web scraping with Google's Puppeteer. Puppeteer is a pretty cool project from Google, from the Chrome team, that basically uses a headless Chromium browser, so you can use it to do web scraping or testing, or you can take screenshots or generate PDFs. It's got a lot of utility and a decent API, but the reason I want to include it in this async/await series is that the usage is all async/await driven. If I scroll down, there you go, you can see everything is async/await across the board, so it really pushes what we're learning about async/await. It's also going to be pretty interesting.

So what we're going to do is look at this Shopify Experts page. Our goal is, for each of these main sections (experts, designers, developers, marketers, and photographers), to get a list of the names on page one. So we're going to have a browser click this, and it needs to give me the names of all these people. We could grab the other details too, but that's not a big deal. In the end we just want a CSV that has each section and the names of the people. That's going to be our goal: to create a scraper that can do that.

To get started, let's pop over here. We need to install Puppeteer. Now, if you install Puppeteer from npm as of today, which is May something, it actually only has 1.3, and we want the latest and greatest, 1.4, because there's a feature in there that I want to show you. So we're going to grab the Puppeteer URL and do npm i right from GitHub, because we want the latest and greatest stuff. This is going to download Chromium, which takes a little bit of time, because it doesn't use the one on your computer; it bundles its own version of Chromium for each version of Puppeteer. Okay, cool, so let's get started on
that. All right, so const puppeteer (I always forget how to spell it; the hardest part is trying to spell puppeteer) equals require puppeteer. Now, to use Puppeteer, again, everything's async/await, so let's get started with the same process we've been doing in each video for async/await: having a main async function. So async function main, like that. Again, we need to wrap this in parens so we can then call it, and then of course we want to wrap the whole thing in a nice try/catch to protect against errors, and we'll just console.log "our error". You might wonder why I'm saying "our error" here. The reason is that I want to make sure it's not a system error being thrown, or an error thrown from some other package, but that it's getting caught in our catch and we're the ones logging it. So that's why I do that.

Cool. To get Puppeteer started, we need a browser object and a page object. The browser is just const browser equals await puppeteer.launch. puppeteer.launch takes a parameter, and the default is what we want, but I'm going to put it in here so that we can manipulate it later: it's headless, either true or false. This is a really cool thing that I'll show you in just a minute, but if headless is set to false, it actually launches Chrome and shows you what it's doing as it's executing its tests. It's pretty cool, I have to say. So now we have a browser. Now we need a page, so const page equals await browser.newPage. There we go, now we have a page.

The other thing I'm going to throw in here is a user agent, because we are web scraping and we really want to act like a normal browser. I'm pretty sure the user agent that Puppeteer gives you by default is sort of detectable as a robot, so we want to change it to something that's not, just in case. I don't even know if Shopify is doing any detection on this; it's just something I like to do. Where'd I get this? I just went to Google and
said "what's my user agent string," and I got it from there. So the way Puppeteer works is we have to start with a goto. We need to go somewhere. Where are we going? We're going to this Shopify Experts page, so we're just going to say await page.goto and put that link in there.

Now let's first discuss really fast exactly what we want to do. We need to get these buttons, then click each one, get the data from the following page, and then come back here and click the next one, right? So after we go to this main page, the way Puppeteer works is you need to tell it when to do the next thing. I can't just tell it to go ahead and click the first button, because that button may not be there yet; this is happening in real time. So what we need to do is tell it to wait for something.

You have a couple of options here. You can say await page.waitForNavigation, which waits until the network has finished its activity for the document. I don't like to use that as much because it's a bit ambiguous; I like to wait for a particular thing to show up on the page. In our case, I want to wait until one of these sections (it doesn't really matter which one), .section, shows up. That's what we're looking for. So I'm going to say page.waitForSelector, and it takes a CSS selector, really a querySelector statement. I'm going to say .section, so we're going to wait for the section to show up before we move on, and then I'll do a console.log that says it's showing.

Right, let's test this as it is right now so you can see what's going to happen. I'm going to set headless to false so you can actually see it happen in real time. So we're just going to run node index, and Chromium's launching. It opened the page, and nothing else really happened except our console says it's showing, and that's because we didn't click anything, we didn't do anything, but
it did open the page and detect that our selector, which is just the first one, is showing, and that's really all we wanted to know. If I close this out by hitting Ctrl+C, it actually closes Chromium, since it's running in that process. Okay, so we're good to move on to the next action.

Our next thing is that we want to grab each of these sections (let me decrease the size of this browser so we can see). As for the selector, these are each section classes, or section elements, however you want to put it. In Puppeteer, the way you can grab multiple things, in essence doing a querySelectorAll, is using something called page.$$ (remember, everything's going to be awaited), so await page.$$. It's a little bit of weird syntax; think jQuery for a minute. $ is similar to document.querySelector, which gets a single element, and $$ is how they're doing document.querySelectorAll. So $$ is getting many, and $ is getting just one. What do we want to get many of? We're going to get many of .section. We can say const sections equals that, and if we console.log sections.length, we'll get to see. Let's turn headless to true, because I don't want it to pop up; it's going to do the same thing, it's just not going to show me the browser. I say node index, and we can see there are five sections. Great, so we're properly reading the page right now.

So after we have each section (I'm going to make a mistake right now just to show you what's going to happen), what you'd think is that you want to loop over the sections. Now, we learned from our first async/await video that we can't do sections.forEach, because this creates a new function, it's not asynchronous, and, as we learned, if anything fails inside of it we can't get out of it. So we never want to use a forEach loop
when we're doing async/await stuff. We can, however, use a for...of loop, because it doesn't create an anonymous function. So we can say for const section of sections. Now we're looping through using a for loop. If you don't know the for...of loop, it's a loop over an iterator; that's probably the best way I could describe it. It doesn't have i as an index, it just has section here. So we loop over the sections, and in our first section what we want to do is get that first button. Let's look at the page again: you can see that we have our first section, and what we need to do is crawl into the section and get this button. If we look at this button, you can see that it's a marketing button. So great, we have a section element.

Real quick, I'm going to hop over to Puppeteer's website just so you can see what the return values of some of these things are. If we look up page.$$, we can see it takes a selector and returns a promise of an array of element handles. If we click on ElementHandle, you can see element handles represent elements that can be clicked, which is what we're looking for. So we have a section; how do we go from a section element down into its parts? Elements have this thing called elementHandle.$ (again, think jQuery), which allows us to get new elements from inside. So we can say const button equals section.$ with, I think we said, an a with the marketing button class. That's how we get access to the button. We have to use await; you can't forget that. It's super important that we don't forget that.

Now that we have access to that button, we can click it, so we can say button.click. We have access to the button itself, we have awaited getting that button, and now we can click it. Let's see what happens. I'm going to
turn headless back on (well, off, I should say) so we can see what happens. node index. There we go, we have clicked it, and you can see we've gone to the first page. Actually, it seems like it clicked the last one, is what happened in this case, and we'll go over that bug in a second. But it did click a link; it just clicked the last link. Okay, great.

After we click the link (let's pull up that page here, let's do "see our top experts"), we want to pull all these names. So again we need to get some sort of list type of element here. Chrome is doing a wonderful thing right now where, if you come back to it, you have to close it and open it again. Thanks, Chrome. All right, so this is a ul called expert results, and in here you can see each of these is an li that has the results in it. So that's what we need.

After we click the button, just like we did before, we need to wait for something to show up; we're going to wait for this ul to show up. So we say await page.waitForSelector, and this is an ID selector, so hashtag, there we go. We're going to wait for that to exist, then we're going to get our ul. Actually, we don't need to get the ul, we can just get the li's. So const lis equals, and we can get all the different li's on this page the same way we did before: await page.$$, and then we say expert results li, and this is going to get all of our li's. Now, same thing we did before, we can do a for...of: for const li of lis. There we go, now we're looping over them. Let me make a comment here: loop over each li in our page. All right, you follow along there? Great.

As we're looping through these li's, from the li itself we want to dive in and get this name in particular. Let's take a look at a selector on there. It seems to be
the only h2 on the page, so we can use that h2, and the name is going to be either the title property or the inner text; I'll just use the inner text. There's a new method: given a particular element, you can extract properties from it using the eval method. So we can say const name equals li.$eval. It takes a selector, in our case h2, and then a function that gets passed that element, like that, and then we can return h2.innerText. That's how you can do that.

But to be even more clear, because this is ES6, there's a trick. Whenever you see this pattern, one parameter with one return in it, you can get rid of all of this and these parens (I need that paren, though), and it's the same thing. It's saying this is a function that is passed this and returns this. It's called an implicit return, and it cleans up your code quite a bit when you have that pattern. So this is the name, and we can go ahead and console.log the name so you can see it. Again, there's still a bug where we're only clicking the last one; we'll get there.

If we do node index, you'll see it loads the page, goes there, and, oh, we get a ton of unhandled promise rejections. That means we didn't await something, and here we did not await. So there we go. Clear, run it again: "failed to find element matching selector h2". So what is it saying? Aha, okay, this got me when I was preparing this, and it got me again. There are actually more li's on the page under that ul li selector. Let's hop over to here (let me make this bigger and clear this) and do document.querySelectorAll (why is it cut off? whatever) with pound expert results li. Okay, this gives me a lot of stuff that's inside of the card, and remember, I wanted each card. So it's not matching my pattern, which means I actually need to do greater-than li,
which is a direct descendant selector. If I do that, now I get a list of just the cards. Of course, instead of that I could have just said li with the expert card class; that's another way to do this, to be more explicit, but I know that this works. I'm just testing right now in the browser. So what we need to do is change expert results li to expert results direct descendant li, to give me just the li's of the cards that I need. Now each of those should properly have an h2. Let's clear this and node index: go to the page, do the thing, there we go, and (ignore that error, we'll get there in a second) it got all the names of all the people. Okay, so we're on our way to what we're looking for.

Now, what is this error? It's saying "cannot find context with" blah blah blah. What's going on here, to be clear, is that this for...of over the li's finishes, and we're still in this outer for...of loop. After it finishes we get to the next section, and it's like, okay, great, let's find the button in the section and wait for the thing. But we're already out of that context; we've moved pages. We're not in this context anymore, so this section no longer exists. In fact, we're on this inner page and we need to go back. That's part of the problem.

So (we'll leave the logging in here) we actually need to go back to the original page if we want to do any more work there. Naively, we can just grab this whole thing, put it at the end, and say great, let's go back to the beginning, wait for the section again, and now we're back on the original page, so now we can finally do it, right? That's what you'd think. It's not the case. I want to show you: it's going to do that thing, run through, and it still throws the same error. What the hell? Well, the problem is that this section refers to this original
context, specifically that context, and just because we went back to that page doesn't mean we're still in that context. So we can't use a for...of loop to loop over sections when, inside of that loop, we're changing the page. We're not allowed to do that. We can use the for...of loop here because we're not leaving the page that we're on, so the context of this li stays the same, the context being the page that we're on. But this section belongs to this page, and we are changing the page that we're on when we click this button, so on the second iteration of this loop it will never hit there.

So instead of looping over sections this way, we need to loop differently. I'm just going to loop with a normal for loop over the number of sections. There are five sections, so I can say let i equals 0; i is less than 5 (excuse me, oh my god, damn semicolon); i++. I don't write these that often, really. Anyway, now we're looping over these as normal, but we need to get the section, so const section equals sections[i]. Now I have a section.

Also, now that we're dynamically looping over these things, I want to print out the name of the section that we're on. I want to print out the name of the button, basically. And this is sort of a weird intricacy. Down here, given an element, I was able to use $eval, but a selector is required. So if I already have the element that I want, I can't use $eval to get content from the DOM. It's weird, you can't do it. So the way to get the name of the button to print out is I actually have to say const buttonName equals, and this is a bit weird, page.evaluate. This is again an intricate detail: page.evaluate does not take a selector; it takes an element handle, which we have, as the last parameter. So the first
parameter that we have is really the function that we're going to be handling, and then the thing that gets passed in is actually the button. Let me write this all out; I'm going to write this whole function the way it's supposed to be. The button is getting passed in: we're passing the button as the last parameter, and it gets passed in as the first parameter of the function. Then we do my implicit return: button.innerText is what we want.

So again, what's going on here is that there's no way to evaluate on an element without a selector. Down here with this li, we're evaluating directly on it, but we're going into the li to get the h2. There's no way to do that with the button that we already have, so we have to use page.evaluate instead. The first parameter is this function; the second parameter is what's going to be passed into the function, and in our case that's an element handle, the button that we have access to. It gets passed into the function, the function returns button.innerText, and that's how we're getting the button name. So I can console.log buttonName; I just want to print out which section I'm on. There we go, fine.

Let's go back here and node index. Okay, so we got another context error, and we got a "Promise pending" because I probably forgot an await. Yes, of course. All right, so await all these things. We're getting the context error because of this down here. That means we really can't go to the page at the end, and it's sort of silly to do anyway because we're already doing it up here, and now it doesn't really matter because we're not using these anymore. So let's just do this: let's move this to the beginning of the loop, up here. Each time, we need to do this, right? We need to go here, we need to wait for the thing, and then we've got to get
the sections, like this, and then we've got to get the section that we're on. (And we don't need these anymore.) So on each iteration of the loop we're going to go to the main page, wait for it to load, grab all the sections, grab the section we're on, grab the button, grab the name of the button, print it out, click that button, wait for that page to load, grab all of the list items on that page, loop over the list items, grab the inner text of the h2 in each one, and print it out. And now that should remove our context error, I hope. node index. There we go, now we're printing everything out. Beautiful, we just printed all of it out.

Let's format that a little bit better. Before we print the button name, we should probably console.log just some backslash n's here so that we get some spacing. I'm also going to turn headless to true so that the browser stops popping up. Now let's run it one more time. There we go, beautiful, awesome. So now we're doing the scrape as we intended.

I don't like hard-coding this five in here, so let's make our script a little bit better. I'm going to do these three once up here, just so I can get sections.length, because I don't like hard-coding things. So again, a little bit of repeated code, but if you wanted to make that a function, you could. Let's test one more time. Okay, so it still works. Awesome, fantastic.

Last thing I want to do: let's make this slightly more useful and put this into a CSV. So I'm just going to npm i fs-extra. Again, fs-extra is the file system utility for Node, but it's got a bunch of extra stuff in it and it returns promises, so it's awesome. We're just going to say const fs equals require fs-extra. Again, it's perfect for async/await stuff. When the page begins, we're just going to say await fs.writeFile
and we're going to write to just out.csv, and we're going to write name and... wait a minute, let's call it section comma name, that's fine. Then we need a backslash n for a new line, because this is a CSV. Then down here, when we have this, we're just going to say await fs.appendFile; we're going to add to out.csv, and we're going to add, in double quotes, the section name, which we actually called buttonName, close that quote, comma, then in quotes again the name, and of course a backslash n. Now we've written a CSV file. Then when this whole thing is over, down here, we can just say console.log done, and then await browser.close, which closes the browser, and that says we're all done.

Okay, let's see what happens. Clear, node index. "Cannot find module puppeteer"? Okay, so we just reinstall Puppeteer for some reason. All right, so it's doing our log. Great, it says done, so we know when we're done. Let's go out to out.csv, and there you go: section, here you go, there's the name, there are the other people, and they're all separated out by section. So now we have a useful utility that we've just generated, and we've used Puppeteer to write a bit of a scraper.

So there you go, that's how you can build a scraper with Puppeteer. Again, you can use this for testing or scraping or PDF generation or anything like that, but this just shows you the basics of using it. And this is Puppeteer version 1.4, I think it was. There you go.
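The steps walked through in the captions can be pulled together into one rough sketch. This is not the exact code from the video: the start URL and the spelling of the selectors ('a.marketing-button', '#ExpertsResults') are guesses reconstructed from the spoken captions, csvRow and loadSections are hypothetical helper names, Node's built-in fs.promises is swapped in for fs-extra, and the rows are written in one go at the end rather than appended one at a time. The part it is meant to show is the loop: the sections are re-queried on every pass with a plain for loop, because element handles taken before a navigation stop being usable once the page changes.

```javascript
// Sketch only: selectors, URL, and helper names below are assumptions, not the video's exact code.
const fs = require('fs');

const START_URL = 'https://experts.shopify.com/'; // assumed address of the Experts page
const SELECTORS = {
  section: '.section',           // one per category on the main page
  button: 'a.marketing-button',  // assumed spelling of the button class
  results: '#ExpertsResults',    // assumed spelling of the results list ID
};

// Hypothetical helper: build one CSV line, doubling any embedded double quotes.
function csvRow(section, name) {
  const quote = (s) => `"${String(s).replace(/"/g, '""')}"`;
  return `${quote(section)},${quote(name)}\n`;
}

// Go to the main page, wait for it, and return fresh section handles.
async function loadSections(page, url) {
  await page.goto(url);
  await page.waitForSelector(SELECTORS.section);
  return page.$$(SELECTORS.section);
}

// Collect [sectionName, expertName] pairs using an already-open Puppeteer page.
async function scrapeSections(page, url) {
  const rows = [];
  const count = (await loadSections(page, url)).length;
  for (let i = 0; i < count; i++) {
    // Re-query every pass: handles from before a navigation are no longer usable.
    const sections = await loadSections(page, url);
    const button = await sections[i].$(SELECTORS.button);
    const buttonName = await page.evaluate((el) => el.innerText, button);
    await button.click();
    await page.waitForSelector(SELECTORS.results);
    const lis = await page.$$(`${SELECTORS.results} > li`);
    for (const li of lis) {
      const name = await li.$eval('h2', (h2) => h2.innerText);
      rows.push([buttonName, name]);
    }
  }
  return rows;
}

async function main() {
  const puppeteer = require('puppeteer'); // required lazily so the helpers load without it
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    const rows = await scrapeSections(page, START_URL);
    const body = rows.map(([section, name]) => csvRow(section, name)).join('');
    await fs.promises.writeFile('out.csv', 'section,name\n' + body);
    console.log('done');
  } catch (e) {
    console.log('our error', e);
  } finally {
    await browser.close();
  }
}

// Only hits the network when asked to: node index.js run
if (process.argv[2] === 'run') main();
```

Saved as index.js, this would be started with node index.js run. The live site has very likely changed since the video was recorded, so the selectors should be checked in the browser console with document.querySelectorAll before trusting the output.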
Info
Channel: optikalefx
Views: 59,929
Rating: 4.9853158 out of 5
Keywords: puppeteer, node, node js, node.js, shopify, web, scraping, web scraping, api, javascript, js, programming, learn, code, async, await, async-await, promises, es6
Id: IvaJ5n5xFqU
Length: 27min 54sec (1674 seconds)
Published: Thu May 17 2018