How to make a FAST WebScraper C#

Captions
As you can see, it's about five times as fast as Selenium on the static page and about 15 times as fast on the dynamic page. The website that I will be scraping can be found on my GitHub, where you can download it: click Code > Download ZIP, wait for it to download, extract it with an archiving tool (I'm using WinRAR), open up the solution, and once it is open, press F5.

First, create a new project and select Console App; I'll just call it TestProject. Next, add a NuGet package to the project: open the package manager, search for Html Agility Pack, and install it. Then create a new class called User. This User class is going to represent a row of the table on the website, and it has three properties: first name, last name, and age. Create a class for the web scraping logic; I'll just call it FastWebScraper. Add a method which returns a list of users and receives a URL as a parameter. Since this method returns a list of users, start off by creating a new list, which is going to contain all the users from the website, and then return the users at the end. Create a new HtmlWeb instance; this is the class that handles all the loading and parsing of HTML. To load an HTML document from a URL, call web.Load and pass in the URL.

Now I can actually scrape the website, so let's go back to it. I want the XPath of a row, so I'll right-click it and choose Copy > Copy XPath. Then call doc.DocumentNode.SelectNodes and pass in the XPath. The thing is, I don't actually want the first row, since it only contains column names, so I'll add position() > 1 to the XPath. Then, for each row in the table except the first one, I'll create a new User and add it to the user list. FirstName is going to be equal to node.SelectSingleNode with the first table data cell in the row, followed by .InnerText. I'll do the same for the last name and the age, except they're going to
be the second and the third cells, and since age is actually a number, I'm going to parse it as an int.

This website actually has pagination, so you can find more users on the next page. To handle this, I'll wrap my code in a do-while loop. I'll add a new variable for the next button outside of the do-while loop, so I can use it in the while statement, and initialize it to null. After I've scraped all the users, I will set the next button. I'll open up the website again so I can get the XPath for the next button. As you can see, it has an attribute called href, which is a link to the next page, and this is the link that I'm going to retrieve from the next button. Right-click it again, choose Copy > Copy XPath, and pass it in.

The scraper should only go to the next page if it actually can, so I'm going to retrieve the class attribute as a string, with a default value of null. As you can see, the Previous button is currently disabled, and it also has a "disabled" class, so that's what I'm going to look for: if the class contains "disabled", stop the while loop. I'll also move the load statement inside the do-while loop, so it loads the page every time it goes to the next page. But first I have to update the URL that should be loaded, and to do this I'll use the Uri class to extract some of the URL components: Uri.Scheme, then "://", then Uri.Authority, with the link from the next button appended at the end. In case you're wondering what these Uri properties are: Scheme is just "https" or "http", and Authority is the domain name, which in my case is localhost plus a port number. And that's about it; now it should be able to scrape all the users from the website. Let's see if it actually did: it contains 11 users, which is all the users on the website. There is just one problem with this solution.
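The static scraper with pagination described above might be sketched like this. Note that the XPath expressions, the URL handling, and the property names are assumptions based on the narration, not the exact code from the video; copy the real XPaths from your browser's dev tools:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Represents one row of the users table.
public class User
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
}

public class FastWebScraper
{
    public List<User> GetUsers(string url)
    {
        var users = new List<User>();
        var web = new HtmlWeb();      // handles downloading and parsing the HTML
        HtmlNode nextButton = null;

        do
        {
            var doc = web.Load(url);

            // Every table row except the header row (position() > 1).
            var rows = doc.DocumentNode.SelectNodes("//table//tr[position() > 1]");
            if (rows != null)             // SelectNodes returns null when nothing matches
            {
                foreach (var row in rows)
                {
                    users.Add(new User
                    {
                        FirstName = row.SelectSingleNode("td[1]").InnerText,
                        LastName  = row.SelectSingleNode("td[2]").InnerText,
                        Age       = int.Parse(row.SelectSingleNode("td[3]").InnerText)
                    });
                }
            }

            // Hypothetical XPath for the "next" link.
            nextButton = doc.DocumentNode.SelectSingleNode("//a[@id='next']");
            if (nextButton != null)
            {
                // Rebuild an absolute URL: scheme + "://" + authority + href.
                var uri = new Uri(url);
                url = uri.Scheme + "://" + uri.Authority
                    + nextButton.GetAttributeValue("href", "");
            }
        }
        // Stop when there is no next button or it carries the "disabled" class.
        while (nextButton != null
            && !nextButton.GetAttributeValue("class", "").Contains("disabled"));

        return users;
    }
}
```

`GetAttributeValue` takes a default that is returned when the attribute is missing, which is why the `class` check is safe even on links without a class attribute.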
The problem arises when websites are not static but dynamic. If I go to the dynamic page, pagination still works and you can still see all the users, but if I point my web scraper at it, it no longer works. Here's the reason: if you look at the actual HTML that is retrieved, in the Network tab, under the document response, you can see the table is actually empty at first. The static page's HTML, by contrast, contains all the users. The dynamic page makes another request, which can be found right here, and as you can see it returns a nicely formatted JSON document containing the same users as the static page; they are just loaded differently, by the script at the bottom of the page that fetches the users and populates the table. If I open that request up in a new tab, you can see it's very clean JSON, and it still has pagination.

To scrape data from a dynamic page like this, I'm going to create a new class, which I'll call DynamicWebScraper, with the same method structure: once again I'll create a list for all the users and return the list at the end. This web scraper is just going to use the built-in HttpClient and make a GET request straight to the JSON endpoint I showed before: client.GetAsync with the URL plus the part after /dynamic, replacing the page index with a variable I can change, then .Result.Content.ReadAsStringAsync().Result. That's basically it; this loads all the JSON as a string. I'm also going to create a new class which will represent the JSON object you can see right here. I'll then parse the JSON string using Newtonsoft.Json (I'm just going to install it quickly) with JsonConvert.DeserializeObject<UserResponse>, passing in the response string. Now I can add the users from the user response to my user list, and then I will put it in a do-while loop.
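A sketch of that dynamic scraper, including the do-while loop described next, might look like the following. The query parameter name and the JSON property names (`Users`, `HasNext`) are assumptions based on the narration; match them to the actual JSON endpoint:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using Newtonsoft.Json; // NuGet package: Newtonsoft.Json

public class User
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
}

// Shape of the JSON endpoint's response (assumed property names).
public class UserResponse
{
    public List<User> Users { get; set; }
    public bool HasNext { get; set; }
}

public class DynamicWebScraper
{
    private static readonly HttpClient client = new HttpClient();

    public List<User> GetUsers(string baseUrl)
    {
        var users = new List<User>();
        int index = 0;
        bool hasNext;

        do
        {
            // Hit the JSON endpoint directly instead of rendering the page.
            // "?index=" is a hypothetical query parameter for the page number.
            var res = client.GetAsync(baseUrl + "?index=" + index)
                            .Result.Content.ReadAsStringAsync().Result;

            // Deserialize the JSON string into the response type.
            var response = JsonConvert.DeserializeObject<UserResponse>(res);
            users.AddRange(response.Users);

            hasNext = response.HasNext; // keep paging while the endpoint says there's more
            index++;
        } while (hasNext);

        return users;
    }
}
```

Because this skips HTML rendering and parsing entirely and just deserializes JSON, it is the faster of the two approaches whenever a site exposes such an endpoint.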
Like the other web scraper, once again I'll add a variable for hasNext so I can use it in the while statement, setting it equal to the response's HasNext property, and then, obviously, index++ so it moves on to the next page. That's about it. Now you can switch to the new web scraper, and it should work. And it looks like it does: it didn't crash, and it contains all the users again.

I have also implemented the first solution in Selenium, so let's see if Html Agility Pack and HttpClient are faster. As you can see, it's about five times as fast as Selenium on the static page and about 15 times as fast on the dynamic page, and the code isn't even much more complicated. Overall, I would say it's a very strong tool to have, whether for fairly simple sites or whenever you want to make very fast web scrapers. Please leave this video a like if you enjoyed it, and subscribe for more videos.
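The speed comparison in the video could be reproduced with a simple Stopwatch harness; this is a minimal sketch assuming the FastWebScraper class described earlier, and the URL is a placeholder for the locally hosted demo site:

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        // Time a full scrape of the static demo page (hypothetical URL).
        var sw = Stopwatch.StartNew();
        var users = new FastWebScraper().GetUsers("http://localhost:5000/static");
        sw.Stop();

        Console.WriteLine($"Scraped {users.Count} users in {sw.ElapsedMilliseconds} ms");
    }
}
```

Running the same harness around the Selenium version of the scraper gives the roughly 5x (static) and 15x (dynamic) differences mentioned above.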
Info
Channel: Scrapax
Views: 2,937
Rating: 4.8571429 out of 5
Keywords: Webscraper, Html agility pack, c#, web scraper, fast web scraper, tutorial, selenium, httpclient
Id: wbBuB7-BaXw
Length: 11min 46sec (706 seconds)
Published: Sat Mar 06 2021