Developing a Web Crawler in C#

Video Statistics and Information

Captions
Hi developers, I'm Houssem Dellai, Microsoft MVP. Today we will learn how to create a crawler to collect data from the web.

A use case for a crawler: let's say you want to create an application that shows the prices of new cars. As a first step, you would go to the car vendors and get this data from their API. Unfortunately, there is no API, or they might have an API but it isn't rich: it doesn't have all the information you want, like the options, the price of all the models and so on. A solution might be creating a crawler, because they do expose this rich data on their website, in their HTML web pages. We go to these HTML web pages, download them, read them and try to extract the data that we need. So we are not getting data from an API endpoint, but from the HTML pages of these vendors.

That means we need tools to, first of all, download the HTML and then parse the HTML. The tools we will be using today are the HTTP Client Libraries and the HTML Agility Pack. So let's create our application, or our method, that will start a crawler to collect data, to get the prices of cars for example.

Let's get started by firing up Visual Studio. I start by creating a new project, and let's choose the console application so that we can see that in action. Let's call it CrawlerDemo.

To create a crawler, first of all we need to download the HTML content, and for that I use the HTTP Client Libraries. I go to my project, right-click, Manage NuGet Packages, and look for the HTTP Client Libraries, which is here: Microsoft.Net.Http. Let's install this library. This one will help me download the HTML content of the website using the HttpClient object.

That's done. Now I go to my Main method and create that object: HttpClient httpClient = new HttpClient(). Of course you need to add the namespace System.Net.Http. Then we use that object to
call the method GetStringAsync and give it a URL. Let's say the URL I'm targeting is the one we have here, so that's going to be my url. Now I'll try to get the HTML content from this URL, and I put it in a variable called html. Because GetStringAsync is an async method, I need to add the await keyword, and with the await keyword I should use the async keyword. But on the Main method here I cannot use the async keyword. If I do, my program will not run: it will not recognize that this method is the Main method, because we should not change its signature.

To solve that problem I create a new method. Let's call it StartCrawlerAsync, Async because it's an async method. That method should be marked async and return Task. Now let's paste that code here, and here it accepts the await because I have the async keyword.

So now I have got the HTML. Let me try to execute this application. Here I should add Console.ReadLine so that I make sure this code executes before my program exits. Let's hit Start, and I put a breakpoint here so that I can see the value of html. It executed successfully, and you see I'm getting a value in html, which is the HTML of the page. If I view it as text, you see it's HTML content: we have all the scripts, links and div elements. This is the HTML content of the web page. If I go to the web page, right-click and choose View page source, you will see almost the same code as displayed in Visual Studio.

Now that we have downloaded the HTML content, we need to parse it. We could use XmlDocument, the XML object, so that we can parse it as an XML document, but sometimes that doesn't work well, because HTML might contain some features that are not accepted in XML; XML and HTML do not work the same way. For that we'll use another
solution to parse that HTML, which is the HTML Agility Pack. Let's install that library by going to our project, right-click, Manage NuGet Packages again, and look for HtmlAgilityPack. Let's install the HTML Agility Pack and click OK. Good.

Now, coming back to our program, I use the HTML Agility Pack to parse this HTML content. For that I write HtmlDocument htmlDocument = new HtmlDocument(). Here it added the namespace HtmlAgilityPack, which contains this HtmlDocument class. We use htmlDocument to load the HTML, which is of type string in my case. Now the htmlDocument object understands the HTML content and we can go and parse it: htmlDocument.DocumentNode, then look for the Descendants, for example.

Before doing that, let's understand a little bit the structure of the HTML page we have here. The information I want to extract is these cars with this information: the name of the model, the image and the price, with the link to each car. So before creating the crawler we need to understand the HTML of this page. Let's right-click and click Inspect. By clicking Inspect we get this window, which will help us understand the structure of this page. I go through the nodes one by one and try to find the right div, the right HTML element, that hosts each car. I see that this div here is for the first car, the next div is for the second car, and so on. Let's start with the first div. If I expand it, you can see I have almost all the information I need for this car: the link to the car, an image of that car, the name of the model and also the price.

Now let's try to extract this information from the HTML page. I go back here, and because I need to get a list of all these cars, I try to get all these divs. Here we need to set a criterion, because our HTML page contains some
other divs for other things. If I look here, for example, you see it has other divs for other content, but we want exactly the divs for our cars. So we can set a criterion, which in most cases is the name of the class used for these cars. Here you see all my divs have a unique class, which reads "article new car article last model", and the other divs do not use that class. So now I'm sure that if I look for the divs that have this class name, I get exactly the divs for these cars only.

Let's go here and try to find those divs. The htmlDocument.DocumentNode has some methods that can help me find those divs, and here I have something called Descendants. With Descendants I can specify the element I'm looking for, which is a div. So I'm telling the HTML node that I'm looking for elements that are a div in my HTML code. But because there are many divs, I want to specify that I'm looking for the divs that have this class name. So let's copy it, and let's specify that we want these exact divs by adding a Where here. Let's say: where the HTML node has that specific class name. Because the class is an attribute inside the div, you call GetAttributeValue with the name "class". So we are looking for the attribute class, and saying that the attribute class should be equal to this value here; let's paste it. Now we have all these divs. Let's convert them to a list, and let's put the result in a variable called divs, for example. Good.

Now that we have all these divs, let's extract information from each div, because each div is meant to be for a certain car. For the first one, for example, you see we have this information: the link, the image, the name of the model and the price. Let's try to parse that information. For that I go through my divs one by one, and for each div I try to
extract information. For the first div I try to get, for example, the name of the model, which in this case is the BMW Série 1 5P. Let's just run this app and see the value of each div, so that you can understand it better. Let's hit Start. Here I have the value of divs; it's 12, which is right because on my website I have 12 cars. Let's check again: it's 12. And each div has the HTML inside that div. You see it embeds information about the price and also other information, the model name and so on. So it's like embedding all these elements inside that div.

Because the name of the model is inside this h2 element, we go to that h2 element and get its text content. The way to do that is by going to div.Descendants. Descendants is responsible for getting all the descendant elements from inside this exact div. The descendant we are looking for is the h2, because the name of my model is inside the h2 element. So let's call it with "h2". If I stop at the h2, I get the whole element, the h2 with its content, but I want only the text inside this element. The way to get the text is by calling InnerText here... oh sorry, I can't call it yet, because Descendants is of type IEnumerable. Inside my div I only have one h2, so I'm sure that if I call FirstOrDefault I'll get this only h2 element. When I have that element, I call the InnerText property, which gives me the text inside that h2 element, which in my case is BMW Série 1 5P. Let's put it inside a variable, call it name for example.

Let's run our application and put another breakpoint here so that you can see the value of name each time. You see the first value is BMW Série 1 5P. Let's press F5 again, and now I'm getting the name of the second car, which is BMW Série 3. So that's good. Now we have
got the name of the model. Now we want some other information; let's say we want the price of this car. Here it's indicating the price in DT, so let's try to get that value. Let's create a price variable, and here again call div.Descendants. Let's look at the price here: it's inside another div, which is inside my big div. Because I have only one div inside my div, as I have only one h2, I again call FirstOrDefault on the div descendants. Inside that div I can call InnerText, and in that case I get all the text embedded inside this div, which in this case is "À partir de 87 900 DT". OK, that was in French. So let's call InnerText to get that value.

Let's run this application again and see if we get the right price. We see here the first price is "À partir de" this value, which is the same exact value we are looking at: 87,900. Press F5 to see the second price, which is 109, and that's right. Good. So now we have the name and we have the price of that model.

Now we also want to get the image URL, because each model has a unique image. Let's create the imageUrl variable, for which we go to the div's Descendants and look for where we have the image URL. Here it's inside the img element, so let's go to the FirstOrDefault img element. Then you see the link is inside the src attribute, so let's call ChildAttributes. ChildAttributes enables us to access the attributes inside an element. The attribute we want to access here is the src attribute, so let's pass "src". Because it returns a list of attributes, let's again call FirstOrDefault, and then get the Value of the attribute, which will be our image URL.

In the same way we can get, let's say, the link to this car, because as you see this div embeds a link to this car, which is inside the a element here
and inside the href attribute. So let's create a variable link equal to div.Descendants, and look for the "a" descendants, because the link is inside an a element. Then go to the FirstOrDefault one, then look for ChildAttributes, specifically the href one, and again call FirstOrDefault, then we call Value.

Let's execute this code now and see it in action. Let's check the first car, which is BMW Série 1, with this price and this image. It has that name, that price and the right image URL, and you also get the link to the page that gives more information about that car. So that's extremely good: now we have extracted content from the HTML of this web page.

Here I'm getting some recommendations from ReSharper, because my div might in some cases be null. For that I need to check each value for null. There is a new syntax in C#, I think since version 6, with this mark here (the ?. operator) to say that if the div is not null, I can go on to the Descendants. It's like writing: if div is not null, then inside that if I can go to the Descendants, and again check that div.Descendants is not null, and then I can go to FirstOrDefault. So instead of writing the if statements we can just put this mark before the dot, and each time the compiler will resolve that for us. This is a really useful feature in C# 6. That's done here.

Another thing we want to do is collect this data, because with this code, each time I collect the data of a car I lose the information about the previous car. So let's create a list in which we have information about all these cars, not only the current one. Let's make this application even better by creating a model. Let's say I have a public class, call it Car for example. This Car class will have the information related to the model, the car model, and also the price. The price here is still a string, but of course you can go
and convert it to an integer. Let's call this one Price, and let's add two other properties of type string: one for the link, Link, and another one for the image URL, ImageUrl. Good.

Now we can store that data inside a list. Here we create a list of cars, and each time I collect this information I create a car: let's say var car = new Car(). Then I can reuse this code here, and instead of declaring a variable each time, I set the Model of the car to that value, then the value of the Price, then the ImageUrl, and the last one will be the value of the Link. Good. Now we have collected the data for the car, and we need to add that car to our list by calling cars.Add and passing that car.

Let's execute this again and look at the value of cars. We have 12 cars, and each car has its information: the image URL, the link, the model and the price. So that's good.

Actually I want to add something here, because you see I don't get useful information from this window: I only get the index of each car, but I want to see the name of the car right here so that I don't have to go to each car and expand its properties. The way to do that is by going to the Car class and adding the DebuggerDisplay attribute. Inside this DebuggerDisplay attribute we can specify that I want to display the model and the price. We should put the property names inside curly braces: Model, then a comma, then Price. This Model and this Price should have the same names as the properties here. Now if I hit Start, you will see a better display of the cars list: we get the model and the price for each car. This is of course a better display for the cars object. Good.

So now we have gone to this web page and downloaded the HTML content by using the HTTP Client Libraries, then used the HTML Agility Pack to
extract information from this web page. So now we have extracted the data that we need from this web page. Good.
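The captions describe the code without showing it, so here is a minimal sketch of what the finished console application might look like. It is a reconstruction under stated assumptions, not the code from the video: the target URL and the exact CSS class name are not given in the captions, so Url and CarDivClass are placeholders, and the CrawlerDemo class and ParseCars helper are names introduced here for illustration. Main blocks on the task with GetAwaiter().GetResult() instead of the Console.ReadLine approach used in the video, so that the program exits on its own. It needs the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Shows "Model, Price" in the debugger instead of only the type name.
[DebuggerDisplay("{Model}, {Price}")]
public class Car
{
    public string Model { get; set; }
    public string Price { get; set; }   // kept as text, e.g. "À partir de ... DT"
    public string Link { get; set; }
    public string ImageUrl { get; set; }
}

public static class CrawlerDemo
{
    // Placeholders (assumptions): the captions do not give the real URL or class name.
    private const string Url = "https://example.com/new-cars";
    private const string CarDivClass = "car-article";

    public static void Main()
    {
        // Main is not async here, so block until the crawler task completes.
        StartCrawlerAsync().GetAwaiter().GetResult();
    }

    public static async Task StartCrawlerAsync()
    {
        using (var httpClient = new HttpClient())
        {
            // Step 1: download the raw HTML of the page as a string.
            string html = await httpClient.GetStringAsync(Url);

            // Step 2: parse it and print what was found.
            foreach (Car car in ParseCars(html))
            {
                Console.WriteLine($"{car.Model}: {car.Price} ({car.Link})");
            }
        }
    }

    // Extracts one Car per <div> whose class attribute equals CarDivClass.
    public static List<Car> ParseCars(string html)
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        List<HtmlNode> divs = htmlDocument.DocumentNode
            .Descendants("div")
            .Where(node => node.GetAttributeValue("class", "") == CarDivClass)
            .ToList();

        var cars = new List<Car>();
        foreach (HtmlNode div in divs)
        {
            // ?. (C# 6) yields null instead of throwing when an element is missing.
            cars.Add(new Car
            {
                Model = div.Descendants("h2").FirstOrDefault()?.InnerText,
                Price = div.Descendants("div").FirstOrDefault()?.InnerText,
                Link = div.Descendants("a").FirstOrDefault()
                    ?.ChildAttributes("href").FirstOrDefault()?.Value,
                ImageUrl = div.Descendants("img").FirstOrDefault()
                    ?.ChildAttributes("src").FirstOrDefault()?.Value
            });
        }
        return cars;
    }
}
```

Keeping the parsing in its own ParseCars method is a choice made for this sketch: it lets you try the extraction on a saved or hand-written HTML string without hitting the network. On a real site the class comparison may need to be loosened (for example, checking that the class attribute contains a given name) if the divs carry several classes.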
Info
Channel: Houssem Dellai
Views: 22,392
Rating: 4.7709923 out of 5
Keywords: c#, collect, html, collect data, visual studio, crawler, web crawler
Id: oeuvL1_5UIQ
Length: 32min 43sec (1963 seconds)
Published: Sat Jul 23 2016