Web Scraping With C# - Craigslist Scraper Tutorial

Video Statistics and Information

Captions
Hey, what is going on everybody, it's Rob with Diligent Dev, and today we're going to do a little web scraping using C#. We're going to go to Craigslist and look for web development gigs we might be able to apply to, to supplement our income with some freelance jobs.

To get started, we'll be using ScrapySharp. I'll leave a link to the GitHub repo in the description below; you can go check it out and read the readme. The next thing to install, if you don't have it already, is the .NET Core SDK. I'm on a Mac and you might be on Windows, so download the right version and get it installed. Also install Visual Studio Code, which is what we'll be using as our code editor.

I've created a blank folder in my Documents folder and opened it in Visual Studio Code. The first thing we'll do is open a new terminal and type `dotnet new console`, which creates a new console app for us automatically. Looking at the files we get, there's the project file and a Program.cs, which will be the main file that runs our scraper.

Next, if you don't have it installed already, go to Extensions, search for C#, and install that extension. Once it's installed, go to the debugger and you'll see it's already set up for .NET Core to launch the console. If you don't see this, click the button, select .NET Core from the drop-down, and you'll get the launch file. The only thing we'll change in this file is to use the external terminal instead of the internal console. I'm on a Mac, so it might look a little different for you; you just want it to launch your computer's command prompt or terminal. Now that that's set up, let's run it and see what it does. Mine is running, and you can see it printed "Hello World", since that's the only code in Program.cs at the moment.

Now that we have the project and debugger set up, let's add ScrapySharp. Go back to the terminal and type `dotnet add package ScrapySharp`; that pulls in the NuGet package so we can use it in our project. With the ScrapySharp NuGet package added, let's import the dependencies we'll be using. At the top, under `using System`, add `using System.Collections.Generic`, along with `HtmlAgilityPack`, `ScrapySharp.Extensions`, and `ScrapySharp.Network`.

Now let's create a global variable we'll use for scraping. Right underneath your class declaration, above your Main method, type `static ScrapingBrowser _scrapingBrowser` and set it to a new ScrapingBrowser. Then let's create a function that returns the HTML from a given page: `static HtmlNode GetHtml`, taking a string url. Inside this method we create a webPage variable and set it equal to `_scrapingBrowser.NavigateToPage()`, which takes a Uri; to that Uri we pass the url. Then we return `webPage.Html`. With that in place, we'll get rid of the `Console.WriteLine("Hello World")` and call GetHtml instead.
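Here's a minimal sketch of what Program.cs looks like at this point, assuming ScrapySharp's ScrapingBrowser API and a hypothetical CraigslistScraper namespace:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace CraigslistScraper
{
    class Program
    {
        // One browser instance shared by every request the scraper makes
        static ScrapingBrowser _scrapingBrowser = new ScrapingBrowser();

        static void Main(string[] args)
        {
            // Placeholder; we'll paste the Craigslist gigs URL here in a moment
            var html = GetHtml("");
        }

        // Navigates to a page and returns its root HTML node
        static HtmlNode GetHtml(string url)
        {
            WebPage webPage = _scrapingBrowser.NavigateToPage(new Uri(url));
            return webPage.Html;
        }
    }
}
```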
For now we'll just pass it an empty string. Next, let's go to the browser. I've gone to the New York Craigslist computer gigs page; to show you how I got there, this is the main New York Craigslist page, and under "gigs" I clicked "computer". We'll copy that URL, go back to Visual Studio Code, and paste it in.

At this point we may as well run the program to see what kind of HTML comes back and make sure everything is working. I'll set a breakpoint on the return statement in GetHtml and run it. It hits the breakpoint, we can see the web page and the HTML, and it looks like we're getting good results, so I'll stop it there.

The next thing we'll do is write another method. It returns a `List<string>`, we'll call it GetMainPageLinks, and we'll pass it the url as well. We'll say `var homePageLinks = new List<string>()`; this is where we'll store the URLs of all the different Craigslist listings you see here. Next we get the HTML from the page using our GetHtml function, passing it the url. Then we set a variable called links equal to the HTML we got back, filtered by a CSS selector: all the anchor tags on the page. That grabs every link on the page.

Next we write a foreach: `foreach (var link in links)`. Inside it we write an if statement: if `link.Attributes["href"].Value.Contains(".html")`. Basically, we take every anchor tag, pull the href attribute off it, and check whether it contains a .html extension. The reason is that the page returns a lot of links that are not links to gigs, and this ensures we only keep gig pages. After that we say `homePageLinks.Add(link.Attributes["href"].Value)` — "href" has to be in quotation marks — grabbing the value and adding it to the list. To reiterate, all this does is collect every link and take the href, the URL on the anchor tag, off each one. The last thing to do is return homePageLinks.

Let's replace the GetHtml call up in Main with the GetMainPageLinks method, put a breakpoint on homePageLinks, and run the console to see what comes back. Looking at homePageLinks in the variables pane, you'll see there are 240 of them, all pointing to gigs, so we'll stop the console.
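Here's a sketch of GetMainPageLinks as built above, using ScrapySharp's CssSelect extension; the null check on href is a small addition, since not every anchor is guaranteed to have one:

```csharp
static List<string> GetMainPageLinks(string url)
{
    var homePageLinks = new List<string>();

    // Fetch the listing page and select every anchor tag on it
    var html = GetHtml(url);
    var links = html.CssSelect("a");

    foreach (var link in links)
    {
        // Keep only hrefs containing ".html" -- those point to actual gig pages
        var href = link.Attributes["href"];
        if (href != null && href.Value.Contains(".html"))
        {
            homePageLinks.Add(href.Value);
        }
    }

    return homePageLinks;
}
```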
Now we'll write a method to get the page details from each individual gig page. If we go back to the browser and click on one of these listings, what we want to grab is the title and the description of each gig. To hold those, at the bottom of the file, underneath the second-to-last curly brace, we'll write a new public class called PageDetails. Inside it we'll have three properties — if you type `prop` and hit Tab, the editor creates a property stub for you. We'll make the first a string called Title, a second string called Description, and one more string called Url, if I can type — there we go.

Now, right above our GetHtml method, I'll write another method: a `static List<PageDetails>` called GetPageDetails, which takes a `List<string>` of urls. First we create a list of PageDetails. Then we loop through each url and grab the details from each page: `foreach (var url in urls)`, then `var htmlNode = GetHtml(url)` — which, if you remember, just hands us back the HTML node elements. Then `var pageDetails = new PageDetails()` to set up a new object.

Now we want to grab the page title. We'll say `pageDetails.Title` equals the htmlNode we grabbed earlier, its OwnerDocument, then its DocumentNode — so we're at the root element of the page — then SelectSingleNode, passing it an XPath of `//html/head/title`, and finally `.InnerText`. The way I got that XPath: back in the browser, right-click and hit Inspect, and you'll see the root of the XPath is the html element; from there I drilled down into the head and then into the title, right here.

Next we need the description. I'll set a variable called description equal to `htmlNode.OwnerDocument.DocumentNode.SelectSingleNode()` with another XPath: `//` takes us to the root element of the document, then we go into the body, drill down into a section, into another section, another section, and finally one more section, and take the InnerText of that last section.

To make sure everything works, I'll set a breakpoint here. Up in Main we'll set a variable called mainLinks from GetMainPageLinks, then call GetPageDetails and pass it mainLinks. We should be good to run; I'll take the earlier breakpoint off. For some reason it's complaining about pageDetailsList — it wants me to return it, so let's add the return statement. Alright: we have a title here, and if we step over the next line, we get a description too.

Just to reiterate that description XPath: back in the dev tools, collapsing the head, we went from html to body, then into a section, then another section, then another, and in that last section you'll find the description we're after. Essentially you just keep drilling down into the elements nested in the HTML. I reran the debugger and put a breakpoint here.
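Here's a sketch of the PageDetails class and the method so far. The `//html/body/section/section/section/section` XPath matched Craigslist's markup at the time of the video; the exact path depends on the page's current structure:

```csharp
public class PageDetails
{
    public string Title { get; set; }
    public string Description { get; set; }
    public string Url { get; set; }
}

static List<PageDetails> GetPageDetails(List<string> urls)
{
    var pageDetailsList = new List<PageDetails>();

    foreach (var url in urls)
    {
        // Fetch the gig page and work from the root of its document
        var htmlNode = GetHtml(url);
        var pageDetails = new PageDetails();

        // The <title> element in the head holds the gig's title
        pageDetails.Title = htmlNode.OwnerDocument.DocumentNode
            .SelectSingleNode("//html/head/title").InnerText;

        // Drill down through the nested sections to reach the posting body
        var description = htmlNode.OwnerDocument.DocumentNode
            .SelectSingleNode("//html/body/section/section/section/section").InnerText;

        // Cleanup and filtering of the description come next
        pageDetails.Description = description;
        pageDetails.Url = url;
        pageDetailsList.Add(pageDetails);
    }

    return pageDetailsList;
}
```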
The reason we set the description to a variable instead of assigning it straight to the PageDetails object is that we want to get rid of some junk text at the front of it. We'll grab that ugly text, copy it, come back down here, stop the debugger, and say `pageDetails.Description = description.Replace()`, pasting in the text we copied and replacing it with an empty string. The last things to do are set the page's Url equal to the url and add our pageDetails to our list of PageDetails.

Now let's run it. This takes a second, so I'll be back once it's done. The scraper has finished, so let's look at our list of PageDetails. Opening the first element and expanding it, you'll see we have a good description, a good title, and a good URL — but you'll also notice there are 240 of them. If you've ever been on Craigslist, you know there's a lot of spammy stuff to filter through, so this alone does us no good and saves us no time when looking for gigs. Let's add a filter.

Up in our Main method, we'll write the following: `Console.WriteLine("Please enter a search term")`, and right after it a `Console.ReadLine()`, which lets us type in some text; we'll set that equal to a searchTerm variable. Then we pass the search term to GetPageDetails by adding a `string searchTerm` parameter down here.

With that set up, before we add anything to the list, we create a variable — something like searchTermInTitle — and set it equal to `pageDetails.Title.Contains(searchTerm)`. To make it a little better, we'll call ToLower() on both the title and the search term, so casing is ignored. We'll create another one, searchTermInDescription, as `pageDetails.Description.ToLower().Contains(searchTerm.ToLower())`. One checks whether our search term is in the title, the other whether it's in the description, and since both are booleans we can say: if searchTermInTitle or searchTermInDescription, add this page to our details list (a sketch of these changes follows below).

I'll launch it again and open my terminal; it now says "Please enter a search term". We'll type in something popular, like WordPress, hit Enter, and since this takes a while, I'll be back when it's done. Our new and improved scraper has finished, and you can see we whittled the results down from 240 to twelve. Opening up the first one, it falls right in line with what we were looking for: they want someone who's a web developer and designer, and since we typed in WordPress, I'm sure if we hover over the description we'll see WordPress in there.
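Here's a sketch of the pieces that change: the prompt added to Main and the cleanup-and-filter logic inside the GetPageDetails loop. The listing URL and the string passed to Replace are placeholders standing in for the values used in the video:

```csharp
// In Main: ask the user what kind of gig to search for
Console.WriteLine("Please enter a search term");
var searchTerm = Console.ReadLine();

// Hypothetical listing URL -- use the gigs page you copied from your browser
var mainLinks = GetMainPageLinks("https://newyork.craigslist.org/d/computer-gigs/search/cpg");
var lstPageDetails = GetPageDetails(mainLinks, searchTerm);

// Inside the GetPageDetails loop, replacing the unconditional Add:
// strip the junk text copied from the debugger (placeholder string below)
pageDetails.Description = description.Replace("<boilerplate text from the debugger>", "");
pageDetails.Url = url;

// Case-insensitive checks for the search term
var searchTermInTitle = pageDetails.Title.ToLower().Contains(searchTerm.ToLower());
var searchTermInDescription = pageDetails.Description.ToLower().Contains(searchTerm.ToLower());

// Only keep gigs that actually mention the search term
if (searchTermInTitle || searchTermInDescription)
{
    pageDetailsList.Add(pageDetails);
}
```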
Still, filtering alone doesn't do us much good, because we'd have to open the debug tools and dig through the results. So next, let's write all of these gigs to a CSV file. To do that, we'll import another NuGet package called CsvHelper. Back in the terminal — I've got a lot in here, so I'll clear it — we'll run `dotnet add package CsvHelper`. With the NuGet package added, let's go to the top of the file and import a few more dependencies: `using System.IO`, `using System.Globalization`, and the CsvHelper package itself.

Back down at the bottom of the file, right above GetHtml, I'll write one more method: `static void ExportGigsToCsv`. We'll pass in our list of PageDetails, and also the search term, which will help name the file. Inside, we'll say `using var writer = new StreamWriter()`, and here you set your file path. I'm using string interpolation and pointing it at my own desktop, but you should set it to your own path: we write `$@` for a literal interpolated string, then the path to my desktop, then the search term, an underscore, another interpolation with `DateTime.Now.ToFileTime()`, and `.csv` at the end. Right underneath that, another using statement: `using var csv = new CsvWriter()`, passing it the writer we just initialized and `CultureInfo.InvariantCulture`. Then we simply say `csv.WriteRecords()` and pass it our list of PageDetails. A sketch of the whole method follows at the end of this section.

With the function written, all we need to do is call it. Up in Main, we set the result of GetPageDetails to a variable, lstPageDetails, then call ExportGigsToCsv and pass it lstPageDetails and our searchTerm. I'll run it, and once it's done I'll be right back.

As you can see, I ran the scraper, entered the search term WordPress, and it created the CSV on my desktop. Opening it up and widening the columns a little, you'll see we have a good title and a good description, and if you scroll to the right, the URL, in case you want to go check the job out on Craigslist. You may also notice it duplicates some gigs; that's something we could have coded out, but to keep this video on the shorter side, I won't. If you enjoyed this project, let me know, and if you have any questions or concerns, put them in the comments below the video. I'll also link a GitHub repo with the code we wrote today.
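As promised, here's a sketch of the export method, assuming CsvHelper's CsvWriter API; the desktop path is a placeholder you'd swap for your own:

```csharp
// Requires: using System.IO; using System.Globalization; using CsvHelper;
static void ExportGigsToCsv(List<PageDetails> pageDetailsList, string searchTerm)
{
    // Placeholder path -- point this at your own desktop or output folder
    using var writer = new StreamWriter(
        $@"/Users/yourname/Desktop/{searchTerm}_{DateTime.Now.ToFileTime()}.csv");
    using var csv = new CsvWriter(writer, CultureInfo.InvariantCulture);

    // CsvHelper writes one row per PageDetails, one column per property
    csv.WriteRecords(pageDetailsList);
}
```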
Info
Channel: Diligent Dev
Views: 9,565
Rating: 4.9076924 out of 5
Keywords: C#, .NET, Web Scraping, Web Scraper, Craigslist, Developer, Software Development, Programming
Id: gWfVr66GQq4
Length: 27min 18sec (1638 seconds)
Published: Sun Jan 26 2020