Build a Web Scraper with Node.js and cheerio - IMDB Movie Search

Hello friends, welcome to Coding Garden with CJ. We're live today, and I'm going to be building a web scraper with Node.js. [Music] Awesome, let's get into this.

Okay, so here's the idea. I want to build an app that scrapes data from IMDb. I'm going to write some code that makes a request for a movie search, and then I'll pull out the titles of the movies, the images, and the year, and turn that into a JSON object that I could use in a front-end app. Then for any given movie I want to scrape its page and pull off things like the poster, the rating (not only the star rating but also the theater rating), the genre, the plot, maybe even the cast. Essentially I'm going to write code that picks out those things and turns them into a JSON object, so I could use it in some other project.

So let's get started. I'll start from scratch. On IMDb you can search for something, so I'll search for Star Wars. By default it searches titles, names, keywords, all kinds of stuff. I want to be very specific, so I'm going to build something that only searches for movies. If I go to the category search, you'll notice that changes the URL up here to only search for movies, and this URL is what I want to use to search for things. So instead of Star Wars, if I search for Fight Club in the URL, I should get back just the movies that match that search. Our top result is Fight Club, but it shows some other stuff in there too.

The way I'm going to do this is I'm going to write code that programmatically picks out the different pieces of the page. If we look at the element inspector and look at this title, you'll notice that this table represents all of the results, and it has a class of findList. Inside of there there's a tbody, and every single result is inside of a table row. And then each
result has an image, and then also some actual text and a link to the actual page. So essentially I'm going to make a request for this page, then programmatically pick out those parts and make an API with it. Let's do that.

I'm going to create a directory, let's call it IMDB scraper, and in this directory I'll initialize a Node project. If I do npm init -y, it creates a package.json so I can install some dependencies. I'm going to be using node-fetch to make the requests. If you're familiar with fetch in the browser, it's exactly the same thing, but I can use it on the server side. So I'll install that, and then I'll create an index.js file; that's where my initial scraping code is going to go, at least. Then we'll open this directory in the editor. Yep, cool.

So in index.js I'm just going to bring in fetch, from the node-fetch library, and start playing around with it. I want to make a request against this URL, so let's say url equals that, in quotes. Then we can pick this apart: right there is the search term. So let's do this: if the URL ends with &q=, we can just put the thing we're searching for on the end of that. Let's make a function, searchMovies, that takes in a search term. Then we're going to use fetch: I'll fetch against that URL and put the search term on the end of it, which will make a request for the search results for that movie. Then we get back the response, and here we want to turn the response into text. A lot of times when you're dealing with APIs you turn it into JSON, but in our case it's just HTML, so if I use the text() method it's going to give me back the actual HTML body. Then let's just return that. Right down here, if I say searchMovies for [Music] Star Wars, then I should get back the body of the
page. This is my break timer. We're just getting started, but look away from the screen, take a quick stretch, get ready to learn about some scraping. Cool. And I'm just going to log the body. So I've written code that makes a request to IMDb for a specific movie and then logs all the HTML to the console. Let's see that: if I run this file, it makes the request and spits out all of the HTML from the page.

Now what I need to do is pick out of that HTML the things that I care about, and for that I'm going to use a thing called cheerio. cheerio is a server-side library that has a very similar API to jQuery, I think almost the exact same API, but you can use it inside of server-side code. You do it like this: you bring cheerio in, you load an HTML string, and then you have access to all of the jQuery methods you know and love to extract stuff out of that HTML. So in this folder I'm going to install cheerio, and up at the top here I'll also bring in cheerio.

Let's look at their API. You do $ = cheerio.load and pass in the HTML, so I'm going to do that and pass in the body. Then you can use $ to access elements. The things I want are anything with the class findResult, and I can test that out in the browser; I know this page has jQuery. If I say give me all the things that have the class findResult, it's 79 items, and each one represents a search result.

So let's do the same thing in our code. We'll get all the findResults and say each, with a function. I believe you have access to this; I don't know, let's see. If I log this.textContent... maybe I have to do this.text, I don't know, we'll see. If I run this code again, instead of logging the body it's going to say undefined. Yeah, because there is no textContent. Can I do text? text is not a
function. I believe I have access... will it pass me the element? Let's see if I can do element.text. No. Let's say $element equals cheerio.load of the element, then $element.text. We get a whole lot of nothing. This is going very well. Let's look at their API real quick and see what I can do. I know there is an each method, you do a .each, because essentially I want to go over... oh, I see. Okay, so it's a function that takes the index and then the element. So this is index and then element, and if I throw element into the $ selector, that should give me something back. Let's run the code. Hey, look at that! I made the request for the page and extracted the text of each search result. Very cool.

But let's inspect that further, because I don't want to just grab the text, I want to be very precise. If I select the image, then I can grab the source. So let's say $image is equal to $element.find, so now we're searching within this specific result. I want to find a td, then an a, and then an img, and that should give me the image. Then let's log $image.attr, to get the attribute, src. Hopefully this will grab the poster. Yeah, so this grabs the poster URL of each result.

So we got the image. What else do we need? We also want to grab the title. That's the class result_text, with an a inside. So let's say const $title is equal to... we want to find the td with the class result_text and then the a that descends from it: td.result_text a. Then we'll grab the text of that. I guess the year comes after that. What would be a good way to grab the year? I think I'm just going to grab the image and the title. So let's first start off with movies, which is just an empty
array, and then for each one we're going to create a movie, which has an image, equal to $image.attr('src'), so that grabs the source of the image, and I also want the title, which is going to be $title.text(). Cool. Then here we'll say movies.push(movie), and right there I should be able to log movies and we'll have a nice JSON object of all the results. Let's do it. Boom! So now instead of just getting back a bunch of HTML, I have some structured JSON, which has a title and an image. Very cool.

Here's what I'm going to do: I'm going to throw this into an Express route, so I can make a request and get back this JSON data from Express. First let's move this into a separate file. Let's call this movies.js... let's call it scraper.js. We'll move the searchMovies functionality, basically all of this, into scraper.js. Instead of calling searchMovies here, I want to put this cheerio code in after the fetch, because we parse the response as text, then we grab the body, and then instead of logging movies I'm just going to return movies. So now elsewhere I can call this function and get back an array of movies. Right here I'm going to say module.exports equals { searchMovies }, and that makes it available to the outside world, so I could use it inside of an Express route.

So let's install Express. Cool. Then we'll create a basic Express app: we'll bring in Express, we'll create an app, and we'll listen on a port. port is going to be equal to process.env.NODE_ENV... no, process.env.PORT, sorry, or 3000. Then we're going to make this app listen on that port, with a callback function where we can just say listening on the port.

Galvanize says "what a handsome man". Thank you so much, Galvanize. I'm curious who that is, because it could be literally anybody I work with logged in on the Galvanize account. Cool. So we're creating the Express app, we're
going to listen on the port, and we'll create a basic GET route: when you GET /, with req and res, we're going to return a nice little object with a message that says "Scraping is fun". Cool, so I've got my basic Express app. I'll go ahead and install nodemon: npm install nodemon. After that gets added, first I'm going to have a start script, which is just node index.js, so to start the app we run that file. But in development we want to do nodemon index.js. Cool. So now I can do npm run dev, and that should start up my server listening on 3000. In the browser, if I go to localhost:3000: "Scraping is fun". So we got back some data, awesome.

So let's make a search route. Let's do app.get('/search')... actually, let's do /search/:title, so you search for the title of a movie. Our URL might look like /search/star wars or /search/fight club or /search/office space. It could be literally anything, and we'll use that title to do the scraping. In this file I'm going to bring in the scraper that we created, so let's say scraper equals require with the relative path to that scraper file. Here's what we want to do: when we get a request to /search, we're going to call scraper.searchMovies and pass in req.params.title, which will be whatever was in the title. Then we should get back some movies, and we'll just res.json the movies. Cool. And just like that, if I do /search/star wars, we get back an array of movie results. They have titles, they have images. Fun times.

One other thing we might want is the actual link to the movie itself. You'll notice the anchor tag: when I click this it takes me to the actual movie page, and I want to grab that ID right there, so we know how to make the request for this specific movie. So let's also add that in our
scraping. Over in my scraper, right here, this is grabbing the text of the title, but we also want to grab the href attribute and parse it. So we'll say href is going to be $title.attr('href'), and then we want to pull the ID out of it. Let's take a look at this in the console. Let's say url is equal to that thingy, and if I say url.match, I'm going to write a regular expression that pulls the ID out of this URL. I want to match title, then a slash, and I need to escape that slash, then zero or more things, followed by another slash, which I'll also escape. Let's match that, and you'll notice the item at index 1 in that array is the actual IMDb ID, and that's what I want. So we're going to take the match at index 1, call it imdbID, and throw that in our result as well. Cool. So now when I make a request for movies, hopefully, if it doesn't break, we also get the IMDb ID of each individual movie. Very cool.

Now let's make an endpoint that gives us all the information about a specific movie. Let's say we have a GET request against /movie/:imdbID. Then we want to say scraper.getMovie with req.params.imdbID, and we'll get back a single movie and return it. Cool, but I need to define this getMovie function, so let's do it. I'm going to create a function getMovie that takes in an imdbID, and it's going to do a very similar thing, a little bit different. We'll make a fetch against... actually, we need a different URL.

It's my break timer. Take a quick stretch. Hmm, take a drink of your LaCroix, which will look very weird because it's the same color as my green screen.

Okay, so yeah, I need a new URL. Whenever we request the page for a movie, this is what the URL looks like. So let's rename this one to searchUrl, and then I want another variable, movieUrl, that's going to be equal to that. Essentially we just append the IMDb ID
to the end of that, and we should be able to get the page. So: movieUrl with the imdbID on the end, and then we should get back the body. For now let's just res.json... oh no. Yeah, why not, res.json, and we'll throw the body in there, so we send back a JSON object that has... oh no, never mind, sorry, sorry. Let's just log the body in our console, so we'll see it come through, and return the body. So now if we hit this endpoint, we're just going to get back some HTML. Let's return an object that has a body property. Now if we try to get Episode IV, I can do /movie/ and the IMDb ID. "getMovie is not a function": I have to export it. So down here, make sure we export that, and try again. Something happened. Okay, so it's making the request for the movie, but right now it's just a bunch of HTML.

Let's go through the same process we did before to pick out some information. Let's look at this title. We'll notice that it's inside an element with a class of title_wrapper, and it is an h1 with some text inside of it. So if we do: Mr.
jQuery, please give me the thing that has the class title_wrapper, I want the h1 inside of it, and then grab the text of that. That gives me the title of the movie, but I don't want the year on the end of it. Let's look again. Yeah, so the h1 has that span inside of it. Is there a way to grab just the text of an element but not the child elements? I don't know, let's see. Let's search: "cheerio get text without child elements", "get text in parent without children using cheerio". That's exactly what I want to do. That .first().contents().filter(), grab the thing that is the text... looks interesting. I think I can do this with jQuery too, so if I throw that on the end of it... lame, that did absolutely nothing. Let's try it inside the route; maybe cheerio is a little bit different. So we'll create our $, which is cheerio.load, and we pass in the body. Then we'll say const $title is this .title_wrapper h1, and const title is going to be $title with all of this stuff on there, so $title.first()... Then let's return an object that has a title property. Let's see what happens. Now if I go to localhost:3000/movie/ and the IMDb ID: cool, I grab the title. It's got a bunch of weird space on the end, so let's trim it. I can do .text().trim(), and that should get rid of the extra space. Cool.

So we grabbed the title. What else do we want? Let's grab the rating. If you look at the subtext, there is a meta tag with itemprop equals contentRating. Here we can do a really interesting selector: give me the meta tag where the itemprop equals contentRating. That should give me that element, and then we want to grab the content attribute: PG. Cool, so that's the code I need. I'll throw this here: we grab the title, grab the rating, that's going to be that, and let's throw the rating in there. Let's see what happens.
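
As a side note on the earlier search step, the ID extraction can be tried on its own without any scraping. This is only a sketch: the example href and the base movie URL are assumptions (the stream never spells them out), and the pattern uses `[^/]*` rather than the "zero or more things" pattern described on stream, so that it stops at the first slash.

```javascript
// Hypothetical href, shaped like the links in the search results.
const exampleHref = '/title/tt0076759/?ref_=fn_al_tt_1';

// Capture whatever sits between "title/" and the next slash.
function extractImdbId(href) {
  const match = href.match(/title\/([^/]*)\//);
  // match[0] is the whole matched text; match[1] is the captured ID.
  return match ? match[1] : null;
}

// Assumed base URL for a movie page; the ID is appended to the end.
const movieUrl = 'https://www.imdb.com/title/';

const imdbID = extractImdbId(exampleHref);
console.log(imdbID);            // tt0076759
console.log(movieUrl + imdbID); // https://www.imdb.com/title/tt0076759
```

Returning null when nothing matches avoids a crash on links that are not title links, which the on-stream version would hit when indexing into a failed match.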
Rating: PG. This is fun, so much fun. So we grabbed the rating. Let's also grab the runtime. It's a time element with itemprop duration, so we can do a very similar selector: time where itemprop is duration, and I just want to grab the text content of that. Well, weird: two hours and one minute, but it's also giving me 121 minutes... 125 minutes, where is that even coming from? The datetime here does say... I think this is the playtime, 121 minutes. Interesting. Let's look at another movie and see what that looks like. React Developer Tools has crashed; looks like they're using React on IMDb. But let's look at that: playtime 98 minutes. I honestly just want that first one. Maybe we can do the same thing we did when we were selecting the header text. Let's try it: const runtime is going to be equal to that, and then we do the same thing to grab only the first text content. Let's see what runtime we get back. Sweet, two hours, one minute. That's beautiful, it's very nice and clean.

Okay, so we got the rating and the runtime. Let's grab the genres. This is very weirdly formatted. So they are... okay, I got it: it's anchor tags, no, spans with itemprop genre. It's going to be very similar: a span with an itemprop of genre, and that's all of them. What we could do is iterate over them and grab each value. Let's do that over here: span itemprop genre, .each, with a function that's going to give us the index and the element. Let's create an array of genres, initialized to empty, and inside of here we're going to say genre is going to be the element wrapped in the cheerio selector, so the element is now a cheerio object, or jQuery object, then .text(). Then we'll say genres.push(genre), and we'll return the array of genres. Well, look at that: Action, Adventure, Fantasy. Hello Jason Michael. This is so much fun, we're scraping the web, too much fun. So we got the genres, and then let's grab
the release date. So this is an anchor tag... itemprop datePublished. They make this way too easy. So this is a meta with itemprop equals datePublished, then .text... oh no, not text, we want to grab an attribute. What's that attribute? The content attribute, yeah, that's what I want. Wow. Okay, so we got our genres; we'll call this one datePublished, and that's going to be equal to that, and then we'll return it. Does it work? Yes it does: May 25th, 1977.

Let's grab the IMDb rating: itemprop ratingValue. Cool, so we want the span with an itemprop of ratingValue, and then we want its text. Too easy, too damn easy. Okay, so const imdbRating equals that, and I pass that in. Actually, now that I'm thinking about it, we should probably pass the IMDb ID back when we send this to the front end, or whoever makes this request, just so they have it. So yeah, we got the IMDb ID and the IMDb rating.

Let's grab the movie poster. I guess we just do div class poster, a, img: a div with a class of poster, with an a child, with an img child. So that grabs that, sweet, and then we want the src attribute, and that is an image, so that's what I want. Let's call this the poster; it's going to be equal to that. Cool. So now I make the request and get back the image. I wonder if we can get it bigger, though. Whoa, when I click on it it's way bigger. I wonder if that's actually in the DOM, though. Let's see. When I click on it, how do we select this? This is some serious stuff. "Zoom wrap"... If we look at the image URL, though, how does that compare to the image URL we got back? It's the same, but with a slightly different ending piece. So here's my thought: if we take this image and replace this part with that, that's the big image. Hmm, we could also select it, though. Jason is saying the URL has the IMDb ID, so tt0 six five seven nine... I guess I'm not seeing it. I guess I could try to select this. It has a very weird class on it, though. Oh, PSWP image.
Let's see how many of those are on the page. The page is making all kinds of requests. Please stop, stop. Grab the thing with the class PSWP image: nothing. Is this taking us to a new URL? Oh, it is taking us there. Okay, I won't worry about this for now, but later on we might come back to it so that we get a higher-resolution image. And I really wish it wouldn't show this; I think I can turn off errors. Cool. I don't know if you can hear it, but my computer's fans are going crazy. For whatever reason this IMDb site is taking up a lot of resources.

The next thing we want is the plot. div class summary_text, too easy. So we want the div with the class summary_text, and we'll grab the text of that. Cool, and we'll trim it to get rid of the white space on the edges. Awesome. So over here we'll say summary is that, and we'll also return the summary. Very cool: "Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the evil Darth Vader." That's a great description; whoever wrote that, it's great.

Okay, let's grab the director. This might be curious, though, because what about movies that have multiple directors? Is that a thing? I guess I don't know much about movies, but do movies have multiple directors? Anybody know a movie that has multiple directors? [Music] Fight Club... it only has one director, though. Yeah, okay, well, we'll grab one; if there are multiple, we'll handle that later. So we want a span... no, yeah, so this span has itemprop equals director, itemtype is a Person, and the text is in there. Let's just see if I grab something like that: span itemprop equals director. Cool, and just trim that.

"King Kong has multiple directors." Let's check it out. Break timer. Oh, I realize I'm out of frame; I have my camera in a very specific place. Oh yeah, let's look up King Kong. "The Big Lebowski." It says Peter
Jackson. Maybe in the 1976 version? Hmm. Oh yeah, the Coen brothers. What's that Coen brothers movie... I'll search for The Big Lebowski, though. Hey, there we go. I need to filter, get rid of the errors. This one does have multiple directors.

I haven't shown this yet, but the code I'm writing is extremely generic. Right now I've just been testing this with Star Wars, but if I throw another ID in there, this will still give me back the data for that specific movie, which is awesome. Yeah, so you're talking about the old King Kong; it looks like that one had, oh, 1933, yeah, multiple directors.

So let's look at our selector. If I select itemprop director, it might return multiple, I'm thinking, and then we can iterate over them instead of only returning a single one. So yeah, there are multiple itemprop directors. Did I add that code yet? No. So here, instead of doing just text, we want to do an each on it and build an array of directors. Does this have a length property? Yeah. So let's say directors equals that, and... now I'm just going to return an array of directors. Even if there is only one director, it'll make it easier if I'm using it on the front end. Yeah, there are a ton of Kong movies: King Kong vs.
Godzilla, Gojira... Okay, so we want to do the same thing we did with genres. For each director we'll say director is going to be this element's text, and we'll trim that text. We'll have an array of directors, we'll push the director into directors, and then we'll add our directors to the result. So now if we get The Big Lebowski: directors Joel Coen and Ethan Coen (uncredited). Very cool. And if we throw Kong in there: Merian C. Cooper and Ernest B. Schoedsack. Really, "Not Rated"? That's fun. But isn't this awesome? I can throw literally any IMDb ID in there, and because the page is structured in the same way, it can extract all the information.

Let's see what else we might want to add. Hmm, let's add writers. What just happened? Something weird happened. Cool. itemprop creator are our writers, so we should be able to do the same thing: itemprop creator, and that gives me back a few things. So let's say const writers is an array, and then I'll say each with a similar function: const writer is going to be this thing, and then we'll push into the writers array, writers.push(writer), and we'll throw the writers on there. "RKO Radio Pictures", interesting.

Let's find some other IMDb IDs to test. Let's do Rogue One: Chris Weitz, Tony Gilroy, Lucasfilm... interesting. I don't see those on the page, but it looks like they're still being selected. Let's copy that, because if I'm running this here... really? I don't believe you. span itemprop is creator. Weird, weird, weird. I think maybe it was running my code against an iframe or something. Oh, interesting, so there are more writers down here. Oh, this is the production company. Interesting: so even though this is a production company, it's still listed as the creator. We need to be more specific with our writer selector. So instead of just itemprop creator, we want credit_summary_item as the parent class. If I do credit_summary_item, which is a class, followed by
a span, then it's only two, and that's what we want, because basically there are multiple things on the page that have that same class, so that's why those were appearing. So now if we do that, we should only see writers and not... yes, so then we don't see production companies.

Let's also grab the stars. Hmm, itemprop actors. My guess is the cast also has the same itemprop. Yeah, itemprop actor. I don't want to grab the cast just yet; let's get the stars. I think I might have to do a similar thing: credit_summary_item, itemprop actors. Is it? Of course, yeah. So I smell a refactor, because this function is exactly the same for each one of these. I want to write it one more time, and then let's refactor it to a function. stars is an empty array, and I do that selector with basically this exact same thing, but instead of writer this is going to be a star, and we're going to push into the stars array. We can return the stars too: directors, writers, stars. Awesome. Oh, it has a comma in it, though. I don't like that. Could I... I don't care enough to remove it. I could do just a replace, but my thought is, what if that name actually does have a comma in it? I don't know.

Okay, what else do we need? We have the title, we have the poster, the rating, the date published, the directors, the writers, the stars. What else? Is the summary different from the storyline? It is. Let's look at, say, Rogue One: "The daughter of an Imperial scientist joins the Rebel Alliance in a risky move to steal the Death Star plans." But then the storyline: "All looks lost for the Rebellion against the Empire as they learn of the existence of a new super weapon, the Death Star." Let's grab the storyline. itemprop description, a div; also an ID, titleStoryLine, so we can use that ID, that'll be a good selector. So let's do the thing that has the ID titleStoryLine, and we want to find the div inside of it with the itemprop description, and then the paragraph inside of that. So that, followed by a div where itemprop is
description, followed by a p tag. Got it, cool. If we do the text of that, that's that, and if we trim it, cool. So let's call this the storyline, and then we'll return that too. Break timer. Hmm, and my floating LaCroix. Cool.

Can anyone think of anything else I might want for a movie API? We could get some box office stuff, I think that'd be cool. Yeah, let's do this, let's grab the budget. It's going to be a tricky one. Up until now selecting things has been pretty easy, because they have the exact class that we need or something like that.

So, a question in the chat: do companies track things using scraping? Probably, and there's probably a Terms of Service that says you shouldn't do this. This is just for fun; I'm not going to host this API anywhere. But you should read the Terms of Service of a website. Typically, though, if you're not pounding the site with tons and tons of requests, you're going to go unnoticed, because you're going to look like just somebody else browsing the website. If I were to host this on a server, I would probably cache the requests. Right now, every single time a request comes in for this movie with this ID, I'm making a request to IMDb, scraping it, and sending back the results. You could scrape it the first time and cache the results, so that you're not sending as many requests to the website. But I definitely would not build a commercial product based on scraping some really popular website. It is fun, though, especially if you want a simple project that uses some data from a website. I'm not a lawyer, don't take my advice, this is not legal advice. Be cautious and be friendly; don't make tons and tons of requests.

This is going to be extremely hard to target. One trick I haven't been using: if you select an element in the element inspector and say Copy selector, this will give you a very precise selector to grab that element. So yeah, this is
grabbing the twelfth child of titleDetails, but I don't know that on every single page the twelfth child is the budget. I don't know that, but I want that. Let's do a selector on... oh, weird. Okay, I'm not going to do it, it's too hard, because it's probably going to change from page to page, and there's not exactly a precise way of selecting that.

Let's do the production companies, though. That'll be the last thing we do, and then maybe I'll build a small front end that uses this API. So there's that itemprop creator again. Oh, the itemtype is Organization, interesting. Let's do this: we want a span selector where... I've actually never done multiple attribute selectors, so let's just do itemtype, where itemtype is Organization. There are three of them, and that's each one of those. Cool, so let's grab that. It's going to be a very similar story to the things we did before, so let's refactor it. Oh, YouTube trailer URL, yeah, I'll do that next, after this refactor. Thanks for the reminder.

So notice each one of these functions is doing exactly the same thing; the only difference is that it's pushing into a different array. Here's what I'm going to do: I'm going to use closures. I want a function, let's call it getItems. This is going to take in the item array, and it's going to return a function that does this, but now this is going to be item, and in here we're going to say itemArray.push(item). Okay, really weird syntax, but here's what I can do now: right here I can say getItems with directors. This is a cool thing about JavaScript: this invocation returns a function that still has access to the directors array and can do that same thing. I want to do the same thing with writers, so I'd say getItems with writers, and then getItems with stars. If I did it correctly, it should still work. Need a semicolon there. So we should still be able to get directors, writers, stars. Yeah, works
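
The closure refactor described here can be sketched without cheerio. In this rough version, `toText` and the `each` helper are hypothetical stand-ins for `$(element).text()` and cheerio's `.each`, introduced only so the example runs on its own; the `(index, element)` argument order follows what the stream describes.

```javascript
// Returns a callback with the (index, element) shape described for .each,
// closing over the array it should fill.
function getItems(itemArray, toText) {
  return function (index, element) {
    const item = toText(element).trim();
    itemArray.push(item);
  };
}

// Hypothetical stand-in for cheerio's .each: calls back with (index, element).
function each(elements, callback) {
  elements.forEach((element, index) => callback(index, element));
}

// Fake "elements" so the sketch runs without any HTML.
const directorElements = [{ text: ' Joel Coen ' }, { text: 'Ethan Coen' }];
const writerElements = [{ text: 'Chris Weitz' }];

const directors = [];
const writers = [];
each(directorElements, getItems(directors, (el) => el.text));
each(writerElements, getItems(writers, (el) => el.text));

console.log(directors); // [ 'Joel Coen', 'Ethan Coen' ]
console.log(writers);   // [ 'Chris Weitz' ]
```

Each call to getItems produces a separate callback bound to its own array, which is why the same function body can serve directors, writers, and stars.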
beautifully but now I want to do that same thing on this selector because this is going to give me all of the what were they called production companies yeah so let's call this companies so const companies is an empty array and we want to do that and do the each but we'll pass in companies and then we'll return companies in the result cool Lucasfilm Allison is your mayor productions yeah and let's try to find the YouTube trailer URL so trailer trailer trailer I wonder if it is linked to YouTube or if it's just this let's see what happens when I play this okay this is gonna be tricky because I think when when I click this it actually is it's loading a totally separate URL I don't I don't know if this is on the same page can I close it they call it the desktop Shh let's just see if this has a selector well if it has a we might not be able to embed it but we definitely can add the link to the trailer on IMDB so itemprop equals trailer and we should be able to grab the hrf of that so so this is an anchor tag with itemprop trailer so that didn't work Oh itemprop not itemtype okay I'll check out so Jason is saying Star Wars a new hope actually has an embedded YouTube video let's see if we can grab that otherwise we'll grab that so yeah let's just look at that real quick just open a new URL for new hope oh there aren't videos down here maybe maybe I can grab that um I'm not seeing Oh seat oh yeah and just to mention if you are joining for the first time I do have a poll where you can request videos for me to make or livestreams to do and I will potentially do those things so you can add options or vote for existing options I'm actually working on an app where you'll actually have to sign in because I think a few people actually logged in multiple times and voted for stuff but it doesn't matter but yeah the app is coming soon look out for that I do not see the YouTube embed not sure am I looking in the right place Jason YouTube are we thinking this thing because I think 
this this right here is technically still just like a video like IMDB video embed I think I think I'll show that as the last thing Compte Privé I'll say your first name it might not be your name but yeah basically I wrote this scraper but then I could deploy it and then we can just call this API from where it's deployed I'll probably deploy it and then take it down because I don't want people getting my server in trouble for scraping a bunch of data but I'll show how to do that definitely Jason was wrong that's okay it was a fun detour I think we'll still grab that same URL so the itemprop trailer and then I want the source no the href yeah the href which will take us to the the trailer and I'll have to prepend that with imdb.com that's okay so let's say const trailer is gonna be that thingy and then in our object the trailer is going to be a template string with the trailer prepended by HTTP colon slash slash dub-dub-dub imdb.com that should do it see what happens get the movie yeah so imdb.com and if I click this Oh his name means private account in French are you are you streaming or listen watching from France yeah this is awesome so here's what I'll do hmm okay let's add a little bit of caching so basically if I request the same movie twice it doesn't have to scrape it twice it just scrapes it the first time and then sends back the same result so let's do that we'll do that inside of the scraper so let's call this we'll call this movie cache and that's just going to be an object and then we'll call this search cache and that also will be an object we'll do the search caching first so when you make a search for a term after we parse it so but before we return the movies here I'm going to set a property on search cache so search cache at search term is going to be equal to that array of movies cool and basically if you call search movies I'm gonna first check and see if there is a if there's something something in the search cache for the term we're just going to
return that and we'll do Promise dot resolve so the first time you make a search for Star Wars we'll do the scraping and return that but we'll store it in memory and then the second time you do a search for Star Wars we'll just immediately return that result and then we'll do the same thing for getting a movie so after we get the movie let's store it in a variable so say const movie equals that and then we'll return the movie like basically like we were doing before but before we return it we'll say movie cache at IMDB ID equals movie and then we'll do the same check at the top of this function so if you have already requested this movie we're just gonna Promise dot resolve that otherwise we'll make the request so now we're actually caching the data so the first time we make the request it scrapes the website the second time it will just serve it from memory and I could go one step further and like insert it into a database or something like that I'm not gonna do that for now look away from the screen and let's also just add some logs in here so right here I'm just going to log serving from cache and then we'll throw in the search term and then same thing up here so serving from cache and throw in the IMDB ID cool so now when we make a search for slash Star Wars notice this is gonna take a few seconds because it's scraping the web but every time after that instant notice how like make the request boom done and if we look at our logs it's been it's serving it from the cache so the first time we made the request it had to scrape and the second time we just send it right back and then we'll do the same thing with a movie so that takes a few seconds because it has to scrape but then the next request instant and it's served that movie from the cache so the first time we're hitting their server and we actually are scraping them but every time after that it's just serving it from the cache cool that was fun last thing I guess is I'll just deploy this real
quick so I use a deployment tool called Now you install it on the command line and you just type now it supports static sites and it also supports nodejs sites and my app is basic enough that I should just be able to say now and send it up to the cloud if let's make sure my package JSON has a start script I have all of my dependencies listed in there yeah that should be it hey autumn thanks for joining ok yeah so now is pretty sweet so like from this directory basically it needs a package JSON but if I just type now this will deploy it to the web if you check out their website they tell you how to install it and how to get it set up you have to log in and stuff like that but I am now on the Internet so I'm gonna go ahead and create so by default it gives me like a just like a random URL but I can give it a an alias so it'll be if it's not taken oh it's already in use by some other account let's do IMDB Ripper bet that's curious so somebody else has deployed an IMDB scraper at that same URL and so now we're on the internet we can go to this URL and scraping is fun and if I do a search slash Fight Club take a few seconds get back some results and then if I want to actually get the information about Fight Club I can do movie slash ID and we get back that info so if I wanted to use this on the front end I would probably add CORS I think I'll do that I've been streaming for an hour and 10 minutes I kind of want to end soon I think I'll do a basic front-end app that just makes a request to this to this API so let me install cors and then inside of our Express app I'll bring in cors and then we'll just use it and now any front-end client can make requests here so I updated the app I need to redeploy it yeah I could have added CJ to the name I don't I'm gonna I'm gonna take this down once I'm done with the livestream I don't want anybody like using this and then like charges my account a bunch with now they do offer free deployment so if you if you sign up on
the free plan the only thing is you're so your source code is available to the world I actually have the paid plan so you won't see my source but it's really good for just like hobby projects IMDB shrimper like so dan says this is fun you can literally do this to like any website it's so much fun so yeah so now I've redeployed we have CORS enabled if we look at the network tab when I make a request here so make a request if we look at the headers access control allow origin star so now I can actually make this request from a front-end so let me do that real quick I'll do it in a separate directory let's create an IMDB scraper client let's go in there nope up there there we'll create an index dot HTML file open it up inside of Atom do-do-do-do do-do-do-do-do do-do-do-do-do okay basic HTML file I'm gonna add Bootswatch if you haven't heard of it it's pretty sweet it's themes for bootstrap so the default bootstrap theme looks like bootstrap but these all use the exact same classes as bootstrap but they just look different cooler various styles let's use Cyborg so if you go to preview you can see what all of your component your bootstrap components will look like I like dark themes it's my favorite so you can grab the CSS and let's just pull that into our app right there I will add a main area we'll give this a class of container and then let's just give the yeah so in the chat they're wondering why Python is used in scraping examples um I don't know but there might be some popular libraries I love doing it with JavaScript especially that cheerio library it's extremely similar to jQuery so if you know jQuery you can pretty much scrape websites I'll add a quick style tag we'll push the body down so I'll say main has a margin top of 2 ems and then inside the main I want a form okay I have this these snippets built-in if I do that I get this a giant form I don't want all of that I just want a basic input box so let's get rid of all of that and so this will be for search
we'll say search the type is text the ID will be search placeholder will be search for a movie yeah I mean Python is used a ton in in mathematics really mainly because there's some really good open source math mathematics libraries server I I mean I wouldn't say that so Dan is saying a server code in Python is that you're saying Python can be written more precisely I would say it's it's almost the same as JavaScript it all depends on like the server-side framework that you're using if you're used to node and Express Flask for Python is very similar you your your routes have a method a URL and a route handler it's both Express and Flask are based off of Sinatra which is a Ruby framework but it started this idea of like a very simple back-end framework that allows making a back-end app for for serving up requests yeah and so uh Compte Privé I'm probably not saying that right but you mentioned using selenium webdriver for scraping so in my example we're using a website that is basic basically totally server-side rendered so by just making a basic request like we're doing in here we're just making a basic request for the text of the site and then scraping that sometimes you're dealing with a website that has JavaScript enabled and like only things will be present after JavaScript actually kicks in then it gets a little bit harder because you can't just request the text of the website you have to actually request you have to use a headless driver like selenium or there's also PhantomJS and that's essentially a web browser that will actually load the JavaScript code and then you have access to grab the elements inside of it a little more complex than this and actually takes up more resources but if you're scraping sites that are just HTML this this is definitely the way to go I digress okay we have a form and I'm gonna start up lite-server look at that so I got a nice little form I want to be able to say Star Wars click Submit call my API list out the
movies let's do it so down here I'm gonna add a script source we're just gonna call it app.js we'll create an app.js when the page loads I'm gonna get access to the form so I'm going to do a document query selector the thing that is the form and I want to listen for the submit event of the form so I'm gonna add an event listener for submit and I'll create this function form submitted function form submitted it'll take in the event I want to prevent the default action it's so funny because I teach and every time I teach this I stop here and what do I do why is the page refreshing you got to prevent the default action so event dot prevent default okay and then we'll just log form submitted cool so back to the browser yeah look at the dev tools if I click the button form submitted sweet so when I click the button I want to grab the value of this input and send it to the API so let's just close my server-side code I could use form data but I'm just going to get this this input so let's call this search input same thing document query selector grab the thing that is an input and inside of here I'll say search term is search input value and then let's just log the search term so when I type Star Wars and I click Submit it logs the thing from the search box and now we can use that so I will say get search results we'll pass in the search term let's create a function called get search results that takes in the search term we're gonna use fetch here so I'm going to return fetch let's grab my handy-dandy API URL I think it's IMDB scooper this thingy so I'm gonna be making a request against that so let's store that in a variable base URL is that and we're gonna make a request when we're searching it's going to be a base URL search slash search term cool and then we get back the response we'll turn that into JSON and then we get back the results and for now we just will log those to the console so just like that locally I'm gonna say Star Wars click Submit
and we get back the results so this called my API and then now has a bunch of results let's add them to the page let's look at the bootstrap docs I think I might just do list items because I know the the thumbnails are like super small like if we look at this this image so let's do let's just do like a list group that has the movie title and then like the image inside of it so let's do this and then right below the form we'll have the list and basically we're going to dynamically create those list items so let's just give this an ID of results store that in a variable we'll call it results list is document query selector grab the thing with the ID of results this will just get returned and then we'll basically say so after we call get search results then we will show the results and I'll create a function to do that so function show results that takes in the results and basically we want to say results dot forEach that's going to give us a movie and we want to essentially append that movie to the results section let's do some DOM manipulation so we'll say list item is going to be document create element create a list item we'll say list item dot text content actually let's let's create an image and then append it to the list item so image is document create element image and then we'll set the source of the image to be movie image I believe let's let's log those results make sure that that's the right property Star Wars yeah so source should be dot image and then we'll append the image to the li and then we'll set the li's text content to be movie dot title and then we'll append the li to the results list so results list appendChild li let's see what happens I'm curious if setting text content will overwrite the children let's see Star Wars go it overwrote oh if we if we look at the actual DOM yeah so let's append the image after setting the text content it'll be weird the image will be on the right-hand side I don't know that's kind of weird let's just we'll append
some children to it so let's do an li that has a div inside of it that div has an image and then a span actually here append the image and then we'll create a span and then we'll set the span's text content to be the movie title and then we'll append the span to the li okay this will do what we want I think yeah look at that we've got nice little images and then nice little titles and I've been streaming for an hour and 23 minutes I'll maybe go for an hour and thirty last thing I'm gonna do build one more page that shows the individual movie so basically let's make this instead of a span let's make it an anchor tag and then the href is going to be we'll say movie dot html question mark IMDB ID equals movie dot IMDB ID see what happens so search for Star Wars okay and so yeah now if I click this link it takes me to slash movie I'll create that page I'll grab the ID and then make the request to my server to get a single movie let's do it so create a file movie dot HTML HTML file we'll add in the same Bootswatch basically all these all these same styles so throw that in there it will have a main section class of container we'll add a script here let's call this movie.js and in movie.js when the page loads let's log window dot location dot query let's see so now this page exists and oh I think it's dot search so when this page loads we're logging the thing in the URL so basically so I'll search for Star Wars if I open this page it logs that specific ID if I open this page it logs that specific ID so when the page loads grab the ID and then make the request for that specific movie let's do it I'm gonna do a nice little regular expression so if I do window dot location dot search we basically want to look for IMDB ID equals and then anything after that cool and that pulls out the ID so that's what I want so on this page let's say IMDB ID equals that at one cuz it's an array then we'll just log the IMDB ID so yeah grab the ID and then now let's make the request for a
single movie so let's do a creative function called get movie where you pass in the IMDB ID it's gonna return a fetch against our base URL I could create like a shared file I'm just gonna copy this this variable into here because the JavaScript for these two pages are totally separate so make the request against the based URL which and then it's movie slash IMDB ID and then we'll turn that into JSON cool and then after we get the movie we want to show the movie so when the page loads will say get movie with IMDB ID and then show the movie and I will create a function called show movie that takes in the movie and right now just logs it okay this is wrong that should be like that okay so now when the page loads yeah it makes the request it grabs the movie logs it to the console and this should work with pretty much any movie so again if I search for Star Wars a new hope last Jedi Star Wars solo a Star Wars story rogue one each one of these pages when it loads logs the data for that specific movie now we want to add it to the page let's do that so for that I will look at the bootstrap Docs maybe there's something good I can use like mm a card maybe we could show like a pretty big card with the movie poster in it the name of the movie the summary I dunno there's there specific classes for like showing like director some value something else some value and it like shows the left the left thing bold what that's called though let's just search for text utilities text text alignment I think I want typography in line elements no I mean I kind of want oh I don't I guess I couldn't show it I don't want to show a table I don't think cuz I could have like the different properties spending way too much time thinking about this but let's do totally forget what they're called I mean I'm gonna look at the because I know this was a thing in bootstrap 3 I just want to find it real quick maybe under CSS was it under typography might be definition I think it is let's search search this 
definition description list alignment yes thank you mister missus galvanize I really don't know who's watching under the galvanize account right now yes so this will say like director and then the value or directors writers that kind of thing let's use this oh it's just a DTD D it's good enough for me though that's all I want and I think DL stands for ya description description list I found it by searching for definition so that's cool ok cool so when the page loads we do all this stuff I'm gonna go ahead and grab the main element that's where I'm gonna append all of the movie information so we'll grab the main and then we'll create a div we'll create a section document dot create element section we will append that section so I'll do main dot append child section and now instead of doing a bunch of Dom manipulation I'm just gonna set the HTML so I'm gonna say section dot outer HTML so this will essentially replace the section with what I'm about to about to set it as and so then we're gonna do a section inside of that let's have so do a section class of row and then we'll have like a div where we'll put the image and inside of there I'll have the image the source is going to be a template literal of movie dot I think I called it poster let's let's log the movie and make sure we can get all the properties this will have a class of column small twelve so take up the whole row show that movie and we'll give this a class of image - responsive I might have to look that up let's see what happens so when so let's search for let's go to Fight Club submit and then click this there it is I think image - responsive is not a thing what if I do image that's responsive no let's search the bootstrap Doc's responsive images image - fluid oh that's a new bootstrap thing I guess a bootstrap for I'm really familiar with bootstrap 3 do I have to set the parent this isn't working let's let's look at the the dev tools my my my hope is that this image takes up the whole row but let's 
let's see what we're doing here so I've got a section a class of row did I do the right row classes so column small 12 that takes up the whole thing I see max-width I actually want to set width to a hundred percent yeah it's kind of pixelated that's okay though I'll just create a custom class lame if something exists you guys let me know but for whatever reason so image has a width of a hundred percent and a height of auto so now when the page loads cool full image right above the image let's throw the title so that's just gonna be movie dot title so right here let's just throw an h1 that's the only thing about writing my HTML like this I don't have snippets there might be a plug-in for Atom that gives you snippets inside of template literals though movie dot title and let's undefined cool that's not true did I spell title lowercase title Fight Club there is the text alignment properties I want to make everything centered or at least this thing centered text-center so on the h1 we'll give it a class of text-center cool no not cool does it only work on paragraphs hmm it has the class text-align:center important but it's not centering oh well I won't worry about that below that is where we're gonna use those description lists to throw in all of the other details so we want a description list like this that's going to be we put it inside of a div because the div itself will be full width column small 12 so take up 12 columns there's all of that I'm not going to use all of it but we'll look at it for a second basically I just want that and we'll do this so let's show the let's look at all the properties we have so we have the throw the rating and then right there I'm just gonna throw a movie dot rating rating I kind of don't like how that looks disappointed in myself that's fine let's just throw a bunch of properties on there let's do a fun thing though so notice this is gonna get really repetitive because I'm gonna do movie dot rating movie dot plot all that
all that kind of stuff we're gonna do this dynamically okay here's what we're gonna do we are going to create an array of basically properties that we want to show so let's call this properties will do rating what else do we want we want the runtime well let's do that it's good I'll do this for now it'll it will maybe correct it later so rating we want run time we want date published we want the summary I'm curious the difference between summary and storyline I think storyline was like written by like a contributor and then summaries just like a general summary so we'll do you summary storyline we'll do I think it's it so we'll take these things and essentially we're going to map over them and create a bunch of these DTD DS so let's do properties actually will reduce it will reduce it to a big string of DTS and Deedee's so we're gonna reduce this let's call this the HTML and then we're gonna have the individual property I'll use a fat arrow and so let's say description HTML is equal to this it starts off as an empty string and we're gonna return basically this for each thing so this will be just one big template string and we'll say HTML plus equals that and then we'll return the HTML but we want this to be the property and we want this to be movie at that property and then we'll just throw this description URL or HTML right about here ok let's see what happens so now in the page loads we get rating run time date publish summary storyline it's fun that was easy huh let's let's fix this so instead of just the properties because I want this to be like capitalized I don't yeah I want this to be a little more custom so let's say the property is that and the title is rating and then the that property is runtime and the title will be run time I guess I don't know if that's a compound word same thing for this so property is gonna be date published the title will be let's call it like released that and that quick break oh wow I went longer than I said I was but I'm almost 
done this was fun and we'll also do summary title will be summary and then lastly storyline and the title is story line cool and so now when I reduce these this is going to be property dot title and then property dot property I should probably give it a better a better name well but now on the left hand side it actually gives us the thing very cool let's add a button up here that goes back to the the search page so just like right at the top here I'll add a button go back to search and this can just be an anchor tag with an href of slash so I clicked that takes me back to the search page I can search for Big Lebowski did I spell that right cool click it give it a second we get the info so there's the title the poster all the info hey Bob welcome welcome welcome yeah that's great one thing I'm gonna do is format this date I came across a pretty cool date library recently you've probably heard of Moment.js but it has a lot of overhead it's pretty large I wouldn't say it's complex but it does a lot of things that a lot of JavaScript developers don't like so for one there's this library called Luxon which is actually by one of the core developers of Moment it's a modern library used in a similar way but I also found this date-fns library which I like so I'm going to use it documentation can I find a CDN for CDN and download what oh I guess I have to pass in a version let's just go to cdnjs date-fns there they are let's grab this and so I'll add this as a script tag right here throw that in there and then over here let's use it so if you look at their docs we want to use format so actually first I need to use parse so I should be able to say date functions parse and I want to do that let's do it before we map it so let's say movie date published equals date functions dot parse movie date published let's see what that does for us and now we see like this this big long thing but then we can format it actually I don't think I do that so
let's do this we'll store it in a variable so that's that's one of the things about Moment it like it it mutates the date which a lot of JavaScript developers don't like so this has a bunch of like one-off functions that you have to use individually so let's call this date and then I can pass date in to date functions dot format pass the date in and give it the format so what format do I want to give it let's look for a format let's just do the full month MMMM the day of the year with the suffix I guess it is third th st that kind of thing so month/day/year we want the full year yeah let's see that March 65th 1998 I think I grabbed the wrong thing so I don't want not the day of the week not the day of the year the day of the month sorry so it should be Do March 6 1998 cool all right can I turn on super chat for fan support I don't know what that means Bob yeah I don't know how to turn on Super Chat maybe you can tell me but I am about to end the stream let's do a review of like everything that I just did cool because it was a lot so let's start at the back end basically I built a scraper that goes to IMDB and scrapes their pages to pull off the data so let's look at that code in Atom again and so I have a basic Express app that is using those scraper functions that I built to basically accept a request and then scrape it scrape the IMDB and send back the result as JSON so I have a route that says you can search for a title this will use my scraper to scrape the IMDB search page and then give you back an array of movies and then I have another route where you can pass it a specific IMDB ID and that will scrape the movie page to grab all of its information if we look at the actual scraping code I'm using node-fetch to make the request and cheerio to parse it so basically I make the request to an actual page that returns HTML so like this URL here and then throw some search term on the end is the actual IMDB page with all the
results but then I take that HTML so this makes the request to that page I then take that HTML and load it into cheerio so cheerio is a server-side library that basically implements the jQuery API so inside of my server-side code I can use selectors on this HTML to pull out the things that I want so when I make the request to the search page I pull out each individual result because they all have that findResult class I then pull out their image their title and their IMDB ID and then create a little object that has those properties and then push it into an array so when we make the request to our API I think it's IMDB scraper search slash Fight Club it scrapes that page and then creates this JSON object which we can now use basically in in any app the other function I wrote was to scrape an individual movie so you pass in the IMDB ID and then it requests the individual movie page so for instance a page like this and then it pulls out the title the poster the rating the runtime the genres the plot the directors the writers the stars all that stuff and so this has some more code in it but it's the same thing so make the request to the page get back the HTML load it into cheerio and then just use DOM selectors to pull things out so pull the title out grab the text I had to do something special here because you'll notice the title also had that year on the end of it I just wanted the title area get the rating get the runtime grab all of the genres date published essentially these are just selectors that I could actually run on the page itself which gives me back information so that's the rating but then at the end of it all I just create one nice big object with all those properties that I pulled out and then send that back to the user so if you then make a request to slash movie slash that ID it'll give you back a JSON object that represents that page we scraped all this stuff out of it so that's the back end I added some simple little caching here so this is just an
object and whenever you initially make that first search request for something you haven't searched for before I put it into that object where the key is the search term and then the value is the result or the array of movies and then same thing for a movie so in its cache I put the key is the ID and the value is the movie so the next time a request comes in for that same thing instead of scraping the page all over again I just immediately serve it back from the cache and that makes it pretty fast so because I've requested fightclub before it just comes back instantly doesn't have to scrape the page again and then I made a front-end for it so the the main page just has a nice little search box when we submit that search form we make a request to the API that I created basically the scraper we get back all of the results and then add them to the page as list items so like when we're searching for Fight Club we get that but on the front end it looks like this so if I search for Fight Club it makes the request out to that endpoint it's kind of ugly right now but it does show each of the little posters and then a link to the movie itself and then I built a separate page where when you click on an individual movie it calls our in point to get the information about a specific movie and then adds that to the page and of course this is pixelated I could probably find a better image but that's that's okay for now and then adds all of the information that was coming back from the API that was a lot that was so much fun though thanks everyone for tuning in um a quick reminder if you check out check me out on coding garden obviously you're already there but coding garden I do have a video that says requests for videos that has a link wait where we at go back that has a link to a poll and if you have any requests for videos you'd like me to make or live streams please add them there also please remember to subscribe I can't make money until I get to a thousand subscribers so if 
you like what you see subscribe tell your friends I think that's it thanks for watching any any last questions in the chat also thanks everybody for tuning in super fun to interact in the chat and just real quick I'll push this code up to github so you guys can all have access to it I'll throw links to the code in the description of the video but other than that Here I am look at that and here's this [Music] good bye everyone thanks for watching see ya
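The closure refactor described in the stream (one get items function that returns the pushing callback for directors, writers, stars and companies) might look roughly like this. It is a minimal sketch, not the code from the stream: `textOf` and the placeholder elements are hypothetical stand-ins so the example runs without cheerio, where the callback would normally receive `(index, element)` from `.each` and read the text with `$(element).text().trim()`.

```javascript
// Hypothetical stand-in for cheerio's $(element).text().trim().
const textOf = (element) => element.text.trim();

// Returns a callback that still has access to itemArray through the closure.
function getItems(itemArray) {
  return (index, element) => {
    itemArray.push(textOf(element));
  };
}

const directors = [];
const writers = [];

// Placeholder objects standing in for the elements a selector would match.
const directorNodes = [{ text: ' Director A ' }];
const writerNodes = [{ text: 'Writer A' }, { text: ' Writer B ' }];

// Each call to getItems captures a different array.
const pushDirector = getItems(directors);
const pushWriter = getItems(writers);

directorNodes.forEach((element, index) => pushDirector(index, element));
writerNodes.forEach((element, index) => pushWriter(index, element));

console.log(directors, writers);
```

The point of the pattern is that the four near-identical callbacks collapse into one function, and only the array passed in changes.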
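The in-memory search cache described in the stream can be sketched as follows. This is an illustration under assumptions: `scrapeSearch` is a hypothetical stand-in for the real request-and-parse step, and `Object.create(null)` is used here instead of a plain object literal so that a search term like "constructor" cannot hit an inherited property.

```javascript
// In-memory cache keyed by search term.
const searchCache = Object.create(null);

// Hypothetical stand-in for the real scraping step; counts how often it runs.
let scrapeCount = 0;
function scrapeSearch(searchTerm) {
  scrapeCount += 1;
  return Promise.resolve([{ title: `Placeholder result for ${searchTerm}` }]);
}

function searchMovies(searchTerm) {
  // Serve from memory when this term has been scraped before.
  if (searchCache[searchTerm]) {
    console.log('Serving from cache:', searchTerm);
    return Promise.resolve(searchCache[searchTerm]);
  }
  // Otherwise scrape once, remember the result, then return it.
  return scrapeSearch(searchTerm).then((movies) => {
    searchCache[searchTerm] = movies;
    return movies;
  });
}
```

Wrapping the cached value in `Promise.resolve` keeps the return type the same on both paths, so callers can always chain `.then`. The same shape would apply to a movie cache keyed by IMDB ID. Note that this cache never expires and only fills after the first request finishes, so two simultaneous first requests would both scrape.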
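Pulling the ID out of the query string on the movie page, as described in the stream, could be done along these lines. The parameter name `imdbID` and the placeholder ID are assumptions for illustration. The second function shows `URLSearchParams` as an alternative that was not used in the stream.

```javascript
// Regular-expression approach: capture everything after "imdbID=".
function getImdbID(search) {
  const match = search.match(/imdbID=(.*)/);
  return match ? match[1] : null;
}

// Alternative using the built-in URLSearchParams, which stops at the next "&".
function getImdbIDFromParams(search) {
  return new URLSearchParams(search).get('imdbID');
}

// In the browser this would be called with window.location.search.
console.log(getImdbID('?imdbID=tt0000000'));
console.log(getImdbIDFromParams('?imdbID=tt0000000&other=1'));
```

The regular expression is greedy, so any extra query parameters after the ID would be captured too. That is fine for a page that only ever receives one parameter.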
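The reduce step that turns a list of properties into description-list markup might be sketched like this. The property names are assumptions, since the exact keys depend on what the scraper returns, and the values are not HTML-escaped.

```javascript
// Hypothetical property names paired with the titles shown on the page.
const properties = [
  { title: 'Rating', property: 'rating' },
  { title: 'Run Time', property: 'runTime' },
  { title: 'Released', property: 'datePublished' },
  { title: 'Summary', property: 'summary' },
  { title: 'Story Line', property: 'storyLine' },
];

// Reduce the property list to one big string of dt/dd pairs.
function buildDescriptionHTML(movie) {
  return properties.reduce((html, property) => {
    html += `<dt>${property.title}</dt><dd>${movie[property.property]}</dd>`;
    return html;
  }, '');
}

// Placeholder movie object, not real scraped data.
const exampleMovie = {
  rating: 'r',
  runTime: 't',
  datePublished: 'd',
  summary: 's',
  storyLine: 'l',
};

console.log(`<dl>${buildDescriptionHTML(exampleMovie)}</dl>`);
```

Adding another detail to the page then only means adding one entry to `properties`.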
Info
Channel: Coding Garden
Views: 28,226
Rating: 4.9638991 out of 5
Keywords: beginner, coding, programming, debugging, educational, full stack web development, css, backend, vscode, learn to code, full stack, devtips puppeteer node.js funfunfunction, debug, mechanical keyboard, javascript, learn node.js, frontend, node.js, live coding, web development, learning, how to code, html, frameworks, lesson, full stack javascript, live streaming, learn javascript, learn web development
Id: U0btOGPwrIY
Length: 112min 56sec (6776 seconds)
Published: Mon Feb 26 2018