Build a Web Scraper with Ruby (in less than 30 minutes)

Captions
Before we get started, let's look at a quick preview of what we'll be building in this video. I'm going to jump into my terminal and run my scraper function. Right now the scraper is running through the entire website: you can see we're iterating through the pages and extracting all of the job listings. Once it's finished, we have a jobs variable; if we run count on it, we have 2,287 job listings stored in that variable. And if we look at jobs.first, we can see a cleanly sorted object containing all the information I want to extract for each job listing on this job board.

First things first, I'm going to jump into my terminal and create a new directory for this project: mkdir scraper, then cd into our new scraper directory. Next I'll add a Gemfile with touch Gemfile, and then create the file we'll write our scraper code in, which I'll call scraper.rb. Now if we do ls we can see we have a Gemfile and our scraper file, so at this point I'll open the project up in Sublime.

Our scraper is going to use a couple of gems, so the first thing I'll do is jump into the Gemfile we just created. I'll set the source to rubygems.org and then add the gems. There are really only two we need: one is HTTParty and the other is Nokogiri. I'm also going to add the byebug gem so we can interact with the functions we'll be creating. So I'll add gem 'httparty', gem 'nokogiri', and gem 'byebug', and that's pretty much all we need here. Now we can run bundle install to install our new gems. Perfect, that's set up and our Gemfile.lock has been created.

Next we can jump into our scraper.rb file and start building the scraper. Before writing any scraping logic, I'm going to require the dependencies we just added to the Gemfile: nokogiri, httparty, and byebug. Then I'll create a new method called scraper, which is pretty much where all of our scraper functionality is going to live.

Inside the scraper method, the first thing I'll do is create a variable called url and assign it the URL we want to target. Then, using HTTParty, we can make a GET request to that URL: I'll create another variable called unparsed_page and set it to HTTParty.get(url). That makes the GET request, and what we get back is basically the raw HTML of that page.
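Here's roughly what the setup looks like so far, sketched from the steps above (gem versions left unpinned, as in the video):

    # Gemfile
    source 'https://rubygems.org'

    gem 'httparty'   # makes the HTTP GET requests
    gem 'nokogiri'   # parses the raw HTML so we can query it
    gem 'byebug'     # drops us into a debugger to poke at variables

    # scraper.rb
    require 'nokogiri'
    require 'httparty'
    require 'byebug'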
Next we can bring in Nokogiri and parse that page. Let's create another variable called parsed_page and set it equal to Nokogiri::HTML, passing in that unparsed_page variable. So just to recap, all we've done so far is say: here's the URL we want to target, we've made a GET request to it and gotten the raw HTML back, and then using Nokogiri we've parsed that HTML into a format we can start extracting data out of.

At this point I'm going to throw a byebug in here, which essentially sets a debugger that lets us interact with these variables. Once that's added, we can jump into our terminal; you can see we're inside our scraper directory. I'm going to run ruby scraper.rb, with a call to scraper added at the bottom of the file so the method actually runs. When we run it, the method executes, those values get set, and we hit our byebug, which lets us interact with our variables.

So we've hit our byebug, we're inside our scraper method, and all three of those variables should be set. For example, url equals the URL we're targeting; unparsed_page is just a bunch of raw HTML from that page; and if we look at parsed_page, we can see that same content is now formatted by Nokogiri, and from here we can use Nokogiri to interact with this data.

This is where things get pretty cool, because with Nokogiri we can target items on the page by their HTML elements, classes, and IDs, and start extracting specific data. Let's jump back over to the web page we're targeting for a second and open up the inspector. Each page of this site lists 50 jobs, so we can look at the classes on each of those blocks. If we inspect one, we can see a class called listing-card assigned to the outer wrapper of each block of data we're trying to get at. So using Nokogiri, we can target all of the jobs on this page via that listing-card class. Remember that the variable we created for the parsed page is called parsed_page; using Nokogiri's .css method, we can target the div with that listing-card class, and what we get back is a list of all of those blocks of HTML. Let's assign that to a variable I'll call job_cards: job_cards = parsed_page.css('div.listing-card'). Now if we run count on job_cards, we should have 50.
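At this stage, scraper.rb might look something like the sketch below. The job-board URL is a placeholder; substitute the actual site you're targeting. Calling .body pulls the raw HTML string out of the HTTParty response:

    require 'nokogiri'
    require 'httparty'
    require 'byebug'

    def scraper
      url = 'https://example-job-board.com/listings'   # placeholder target URL
      unparsed_page = HTTParty.get(url)                # GET request; raw HTML response
      parsed_page = Nokogiri::HTML(unparsed_page.body) # parse into a queryable document
      byebug                                           # pause here and explore the variables
    end

    scraper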
Yep, so what we have now is a list of all of those blocks. That means if we take a look at job_cards.first, for example, we're looking at the listing for the Chief Financial Officer job offered by Coinme in Seattle, Washington. So how do we extract the data out of this listing? Let's look at a quick example. I'll create another variable called first_job and set it equal to the first item pulled out of job_cards. Now first_job is equivalent to that Chief Financial Officer position we're targeting, and it's still in that Nokogiri format. To pull out the data we want, which is the job title, the company, the location, and the URL to apply, we can dig a little deeper using .css and target each piece by its HTML element and its class or ID.

If we jump back into the page and inspect a little further, we can see the title is wrapped in a span with a class of job-title. So back in our terminal, inside .css we can do span.job-title, and we get back the HTML for that specific item. To grab the actual text out of it, Nokogiri lets us just call .text, and just like that we've plucked the title out of the listing. Using that same logic, we can see the company is wrapped in a span with a class of company, the location is a span with a class of location, and for the apply link we have an anchor tag with a button class on it. So we can use all of those to target each item: change the selector to company and we grab the company name, change it to location and we grab the location.

Now let's say we want to grab the URL; this is where it gets a little different, but it's actually not that hard. We want to target the anchor tag inside the block. If we look at what comes back for anchor tags, we can see it's actually an array, and this entire block is the first item, so we can just grab the first item of the array. Looking at that, we can see something called attributes, so let's add on .attributes, and to go a little deeper we can try href and then add on .value, and we can see we're getting the final URL for that listing. The actual href is a relative path that gets tacked onto the site's base URL, but you can add a little code to piece those together yourself.

Now let's take some of what we just did in our terminal and add it into our scraper function, so I'll go back into Sublime. What we want first, if you remember from the terminal, is a variable that represents all 50 job listings on a page, with the data for all of those listings passed into an array. So let's say job_listings = parsed_page.css('div.listing-card'); again we're targeting that listing-card class, and that gives us the data for 50 jobs. From there, we can iterate over our job listings using .each.
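Inside the byebug session, that exploration looks roughly like this. The selector names (listing-card, job-title, and so on) are the class names read off the inspector in the video, and the example values come from the CFO listing:

    job_cards = parsed_page.css('div.listing-card')
    job_cards.count                            # => 50

    first_job = job_cards.first
    first_job.css('span.job-title').text       # => "Chief Financial Officer"
    first_job.css('span.company').text         # => "Coinme"
    first_job.css('span.location').text        # => "Seattle, WA"

    # Anchor tags come back as a NodeSet, so take the first node,
    # then read the href out of its attributes hash:
    first_job.css('a').first.attributes['href'].value
    # => a relative path; prepend the site's base URL to get the full link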
We'll do job_listings.each do |job_listing|, singular, and set up an iterator like that. Inside this iterator, all we have to do for each job listing is basically what we just did in our terminal. I'm going to create a new variable called job, since it'll hold each individual job, and we can jump back to our terminal and grab some of what we just did there. We want the job title, so we'll set title to job_listing.css with the job-title selector and grab the text from that. Similarly, we'll do the same thing for company and for location, and we also want the URL of the job listing, so finally we'll grab what we just did to get the final URL. In terms of getting the full URL, we could do a couple of different things; for example, we could just prepend the site's base URL onto the relative path and add the two together to create the full URL. Do whatever you want for that, but that's basically all there is to it.

If everything is working properly so far, what we should have now is a way to iterate over the 50 jobs on a page and extract the data we're targeting from each of them. Let's move our byebug up into the iterator so we can see what's happening for each individual job, and rerun this. I'll continue through the byebug there, so now we're back in our directory, and I'll run the scraper again with ruby scraper.rb. What should happen is we hit our byebug inside the job_listings iterator, on the first of our 50 listings. We put the byebug after the job gets built, so if everything works as we're hoping, we should have a job object with all the data we're trying to extract. It looks like we have our title, our company, our location, and our URL. If we continue, we go to the second job listing on the page, and again we have a title, a company, a location, and a URL. Everything is working perfectly so far, so I'm going to exit rather than step through all 50.
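For reference, the iterator we just wrote and ran might look like this in scraper.rb (base URL and selectors are the same stand-ins as above):

    job_listings = parsed_page.css('div.listing-card')

    job_listings.each do |job_listing|
      job = {
        title:    job_listing.css('span.job-title').text,
        company:  job_listing.css('span.company').text,
        location: job_listing.css('span.location').text,
        # Prepend the site root to turn the relative href into a full URL:
        url:      'https://example-job-board.com' +
                  job_listing.css('a').first.attributes['href'].value
      }
      byebug   # pause on each pass to inspect the job hash
    end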
You can see we're back in our scraper directory. Over here in the code, let's create an array we can use to store all 50 listings from that iterator. I'll create a jobs variable and set it to Array.new to make a new empty array, and inside our job_listings iterator we'll scoop every new job into that jobs array. Then down here we can move our byebug outside of the iterator. So what we just did is create an array, and as we iterate through the 50 listings on the page, we pass each job into it. By the time we hit our byebug, we should have 50 items in the array, each one a job with data on the title, company, location, and the URL to apply.

Let's jump back over and run scraper.rb again. It iterates through all 50 and hits our byebug, and now we have a jobs variable with all 50 listings in it. If we run count, we see 50 items, and if we look at jobs.first, we have a nice clean sorted set of data for each job listing. So just like that, we've built a super simple web scraping tool with Ruby, and we've scraped this web page and pulled all the data from the job listings on it.

That's pretty cool, but what if we want to scrape all 2,287 jobs from the entire website, not just this page? Our web scraper has to be a little more intelligent. So now let's make a few tweaks: we'll take pagination into account and scrape all 2,000 or so listings on the site instead of just the 50 per page. There are a couple of things we'll want to know to make this work. The first is how many listings are served on each page; we already know it's 50. The second is the total number of listings on the site; we already know it's 2,287. The last thing is what the URL structure looks like for the pagination: if I click to the next page, we can see the URL changes a little and we get a query parameter of page set to the page number.

Using that information, let's make a few tweaks to our scraper function. After we assign job_listings, I'm going to jump in right there and add a few new variables. The first is per_page, which is how many job listings there are per page. We know it's 50, but we want to future-proof this: if the number of listings per page on the website ever changed, we wouldn't want our scraper to have the value 50 hard-coded into it.
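Before the pagination changes, the single-page version assembled so far might look like this sketch:

    def scraper
      url = 'https://example-job-board.com/listings'
      unparsed_page = HTTParty.get(url)
      parsed_page = Nokogiri::HTML(unparsed_page.body)
      job_listings = parsed_page.css('div.listing-card')
      jobs = Array.new   # empty accumulator for the page's listings

      job_listings.each do |job_listing|
        job = {
          title:    job_listing.css('span.job-title').text,
          company:  job_listing.css('span.company').text,
          location: job_listing.css('span.location').text,
          url:      'https://example-job-board.com' +
                    job_listing.css('a').first.attributes['href'].value
        }
        jobs << job   # scoop each job into the array
      end

      byebug   # jobs.count => 50, jobs.first => the first job hash
    end

    scraper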
Since we've already assigned job_listings up here and plucked out all of the listing cards, we can just use job_listings.count. That still gives us 50, but if that number changes to 10 or 20 or 100 in the future, our function will be set to the appropriate value when it runs. Then I'm going to create a variable called total and set it to the total number of job listings on the site. Looking back at the page, that's 2,287, but again we want to future-proof this: if another listing got added to the site, a hard-coded total would be inaccurate. Since we've already created our parsed_page variable at this point, we can try to grab the number from the page based on the initial request, and keep it dynamic so that when this method runs it always has the correct total, just like it always has the correct per_page.

This is probably going to be a little hacky, but let's see if we can use our parsed_page variable to get to that final number of total job listings on the site. We know we're going to call something on parsed_page, probably using .css, so let's jump back over to the page and find that number. Inspecting it, we can see it's in a div with a class of job-count, so we can start with css('div.job-count'). Jumping into the terminal to see what that gives us, we get a Nokogiri object, so let's go another level deeper with .text. Now we've grabbed the string we're targeting, but we still need to extract the number out of it. As a quick solution, and again this is kind of hacky and there are probably better ways, we can call split on it to break it at the spaces, grab the number by its index in the resulting array, gsub off the comma, and finally convert it to an integer. That's probably not the greatest approach, but it gets the number out, and as long as that string doesn't change, it keeps grabbing the correct value as jobs are added to the site. So let's grab that chain and throw it in for total; that gives us 2,287 while staying flexible as jobs get added and that number changes.

I also want to create a page variable to set where we're going to start. We can set it to 1, because looking back at the site, the URL structure is the listings path plus the page number, so we can start at page 1. So now we have our starting point, page 1; our per_page variable, 50 listings per page; and the total number of listings on the site, 2,287.
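Those three variables might look like this in code. The div.job-count selector is read off the inspector in the video, and the position of the number in the split string is an assumption to verify against the actual markup:

    per_page = job_listings.count   # 50 today, but stays correct if the site changes

    # Assuming the counter div's text reads something like "2,287 Jobs":
    total = parsed_page.css('div.job-count').text.split(' ')[0].gsub(',', '').to_i

    page = 1   # pagination starts at ?page=1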
The next thing we want to do is tell our scraper what the last page is going to be, because we're going to increment our page number and iterate through every page of this job board, scraping each set of 50 listings. We want an end point so the loop isn't running infinitely. So let's create a variable called last_page. We know the site currently has 46 pages, so we could set this to 46, but again we want to keep it dynamic, so that as more jobs get added and that number goes up to 47, 48, and so on, our scraper can detect it from the data that comes back when it runs. Let's see if we can use the values above to get to that same answer of 46. For this variable, we can take our total, make sure it's a float, divide it by our per_page value, also as a float, and then round the result. Let's jump into the terminal and check that this is on the right path: total equals 2,287, that works; per_page equals 50, that works; and based on that data, the division gives us a last_page value of 46, so that works too. All of our numbers are correct, and on top of that they're set up so that as the site's values change over time, our numbers can adjust accordingly.

Now that we have those values, we can adjust the way we loop through our job listings to account for pagination. I'm going to wrap the scraping logic in a while loop, pulling in our starting point, the page variable, and our end point, last_page, and we'll say: while page is less than or equal to last_page, do everything inside this while loop. Inside the while loop, we want to add what's going to be our pagination URL. It's the same listings URL, but instead of hard-coding the page number, we interpolate our page variable. This starts at page 1, and after each iteration of the while loop we increase the page number by one.

So what's going to happen is we start at page 1, and on each of these pages we run HTTParty and Nokogiri again. I'm going to adjust the variable names a little and call these pagination_unparsed_page and pagination_parsed_page. The same thing that happened above now happens here, but with the incrementing paginated page: we pass a URL, make a GET request with HTTParty, use Nokogiri to parse the HTML, and then scrape the data out of it, repeated across a number of different pages based on the pagination. You can kind of think of the initial request up top as the recon mission, and this loop as where the meat and potatoes actually happens. We basically want to grab our 50 listings per page again, adjusting the names of the variables a little.
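Sketched out, the pagination loop looks roughly like this. As in the video, .round lands on 46 for these numbers, though .ceil would be strictly safer whenever a partial final page would round down:

    last_page = (total.to_f / per_page.to_f).round   # 2287 / 50 => 45.74 => 46

    while page <= last_page
      pagination_url = "https://example-job-board.com/listings?page=#{page}"
      pagination_unparsed_page = HTTParty.get(pagination_url)
      pagination_parsed_page = Nokogiri::HTML(pagination_unparsed_page.body)
      job_listings = pagination_parsed_page.css('div.listing-card')

      job_listings.each do |job_listing|
        # ...same per-listing extraction as before, pushing each job into jobs...
      end

      page += 1   # advance to the next page of listings
    end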
So we're calling .css on our pagination_parsed_page, which changes with each loop, and we just need to make sure we're using that variable everywhere inside the loop. Hopefully that's not too confusing: we're repeating what we did in the single-page version, but now accounting for pagination, making the same HTTParty request and the same Nokogiri parse for each page before scraping the data out. Everything else can stay basically the same, and I think we're ready to try rerunning this. It's going to take a little longer, since it'll be iterating over about 46 pages instead of making just one request, so I'm going to put in a couple of puts statements so we can see what's happening from the terminal as it runs. Let's print the pagination URL, then a little prompt that says "Page" with the current value of our page variable, and a spacer. Then down where we add the job, let's do another puts that says "Added" with the job's title, and another spacer.

Cool. Now, if everything we changed works accordingly, this will run over about 46 pages and collect data on about 2,287 job listings. Let's jump into the terminal and see what happens. I'll exit the debugger; we're inside our scraper directory, and I'll run ruby scraper.rb. Ideally this iterates over all 46 pages and grabs the 2,287 jobs, and as it runs we'll be able to see which page number we're hitting and which jobs are getting added to our jobs array. When it's done iterating through everything, we should hit our byebug after the while loop and be able to take a look at our array of data on 2,000-plus jobs.

So let's run it and see what happens. It's moving quickly, but you can see the page number increasing and all of the job titles being added to our array; we're on page 19, 20, 21, so I'll just let it run until it hits all 46 pages. Now we've hit our byebug over here on line 36, which is the indicator that we've finished scraping the entire site, and we should have a jobs variable that's an array of our 2,000-plus job listing objects. We can see this big massive array; if we count it, it should be 2,287, and yep, there it is. We now have data on every job listing from the site: the title, company, location, and URL for each one.

So just like that, we've built a simple but powerful little scraping tool, and we've seen how we can scrape this entire website and pull data on 2,287 jobs. That's pretty much it, and as you can imagine, once you start playing with this you can do a lot more than scrape job listings. If you want to learn more or grab any of the code from what we just did, I'll include a link below, so check that out. Otherwise, that's all there is to it; have fun.
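Putting all the pieces together, the finished scraper might look roughly like this sketch. The URL and CSS selectors remain stand-ins for the real site, so treat them as assumptions to swap out:

    require 'nokogiri'
    require 'httparty'
    require 'byebug'

    def scraper
      # Recon request: fetch page 1 to learn per_page, total, and last_page.
      url = 'https://example-job-board.com/listings'
      unparsed_page = HTTParty.get(url)
      parsed_page = Nokogiri::HTML(unparsed_page.body)
      job_listings = parsed_page.css('div.listing-card')

      per_page  = job_listings.count
      total     = parsed_page.css('div.job-count').text.split(' ')[0].gsub(',', '').to_i
      last_page = (total.to_f / per_page.to_f).round
      page      = 1
      jobs      = Array.new

      while page <= last_page
        pagination_url = "https://example-job-board.com/listings?page=#{page}"
        puts pagination_url
        puts "Page: #{page}"
        puts ''

        pagination_unparsed_page = HTTParty.get(pagination_url)
        pagination_parsed_page = Nokogiri::HTML(pagination_unparsed_page.body)
        listings = pagination_parsed_page.css('div.listing-card')

        listings.each do |job_listing|
          job = {
            title:    job_listing.css('span.job-title').text,
            company:  job_listing.css('span.company').text,
            location: job_listing.css('span.location').text,
            url:      'https://example-job-board.com' +
                      job_listing.css('a').first.attributes['href'].value
          }
          jobs << job
          puts "Added: #{job[:title]}"
          puts ''
        end

        page += 1
      end

      byebug   # jobs.count => ~2,287 once the loop finishes
    end

    scraper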
Info
Channel: zayne
Views: 29,070
Keywords: ruby, web scraper, nokogiri, httparty, ruby on rails, web scraping, how to build a web scraper, how to build a web scraper with ruby, nokogiri web scraper
Id: b3CLEUBdWwQ
Length: 27min 34sec (1654 seconds)
Published: Mon Apr 09 2018