How to build a web scraper using Go and Colly - Golang + gocolly tutorial

Video Statistics and Information

Captions
Oh my god, this is so much work. Why do I even need this information? I am never going to be done before bedtime. Literally a monkey could do this work. But wait, I have an idea.

Good morning, good day, good evening, good night, hello fellow coders, and welcome to my channel. My name is Thomas and I love to automate the boring stuff. Today we are going to extract some juicy data from a website using a web scraper. But what actually is a web scraper? Let's say you visit the YouTube website. What your browser shows is a bunch of videos, but technically your browser is rendering HTML content, which typically looks something like this. And now it's time to introduce the scraper, and from the Comic Sans font I used you can already guess that he's the star of the show. A web scraper takes the HTML content and walks through it line by line. When you create one, you tell it what to look out for. For instance, you can tell it to look out for every href attribute within an anchor tag, and if it finds one, it should extract the link. So every scraper has at least one combination of a criterion to look out for and a command to execute when the criterion is met. Now, when you run your scraper, it walks through the whole HTML content, looks for the href, and extracts the link.

We are going to build our web scraper using gocolly. Gocolly is a very simple to use and fast library for Golang, so let's jump into the IDE. We start with a completely empty project. The first thing we need to do is download the Colly dependency: simply type go get github.com/gocolly/colly. You can already see that it downloads a bunch of dependencies Colly needs. Let's start by importing Colly. When we want to build our scraper, the first thing we need to do is create a new collector. The NewCollector function accepts a list of options that we can pass to it. We will be passing the allowed domains option, which tells the collector which domains it is allowed to scrape. Using the collector's Visit function, we can tell it to visit a specific URL. The final thing we need to tell our collector is what criteria to watch out for and how to behave when the criteria are met. This is what the OnHTML function is for. Its first argument is the selector, which corresponds to the criteria: for instance, if we want to execute a command whenever we find a div element with the myclass class, we can simply type it in like this. The second parameter is the callback function, which gets executed every time the criteria are met. Instead of using a class selector, you can also search for IDs, or for all anchor tags with an href attribute. Within the callback function, Colly provides us with an HTML element which we can act upon, so let's simply get the text of the element and print it to the console. And that's basically it. This is how you write a Golang web scraper using the gocolly library: you give it a website to visit and some criteria to watch out for, and you tell it what to execute when the criteria are met. A minimal version of that skeleton is sketched below.
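The following sketch mirrors what is described above. The domain, URL, and selectors are placeholders, not the values from the video:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Create a new collector, restricted to a single (placeholder) domain.
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
	)

	// Run the callback for every div with the "myclass" class.
	c.OnHTML("div.myclass", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
	})

	// Other selectors work the same way, e.g. IDs or anchors with an href:
	// c.OnHTML("#my-id", func(e *colly.HTMLElement) { ... })
	// c.OnHTML("a[href]", func(e *colly.HTMLElement) { fmt.Println(e.Attr("href")) })

	// Tell the collector which URL to visit.
	c.Visit("https://example.com/")
}
```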
But let us now tackle a real-world example so you get a better understanding of how things work, and I'm going to show you some tricks that you can use when you write your own scraper. Let's have a look at the website we are going to scrape today. You can already see that the URL is basically just the base URL with an identifier at the end; the identifier is called an ISIN. The website provides us with a bunch of information about ETFs, most importantly the tracking difference right here, hence the name of the webpage. In addition to that, we also have some other important information, like for instance that the earnings are getting distributed, or that the total expense ratio is 0.65 percent. If we open the developer console, we get access to the HTML of this page. While writing a scraper, this is usually the first thing you do: you go through the HTML manually and see how it is structured around the information you are looking for. So let's quickly inspect the first of these information boxes and see what it looks like in the HTML. We can already see that the information we want to extract sits within a paragraph with a "desc" class. If we have a look at the parent element, we can see that it's a div with a "desc-float" class. This seems to be the case for all the information boxes we want to scrape, so this is a good indicator of where to start our scraping process. Another thing I'm interested in is the title of the ETF, and we can see right here that it is an h1 element with the "page-title" class. Let's start with that and extract this information first.

Back in the IDE, let's first put in the allowed domains. Next, let's define a scrapeUrl variable where we put in the URL of the page we are going to scrape; let me quickly copy and paste it from my notes. Next, let's change the criteria on which the OnHTML function should be triggered: the title is within an h1 element with the page-title class. Finally, let's pass scrapeUrl to the Visit function and give it a shot. Okay, we just extracted our first piece of information from the page, so the scraper seems to work. But there are a few things we can already improve. Let's have a look at all the functions the collector provides us. The OnRequest function gets fired every time a request to a given URL is made; let's use that to simply print out the URL we are visiting. Another very helpful function is the OnError function: whenever we encounter an error, we can simply print it to the console. Oh, this should be a Printf. Now, if we run the main.go file, we can see which page the collector is scraping; the sketch below shows this stage.
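Put together, the logging callbacks and the title extraction look roughly like this. The page-title selector comes from the video; the log messages and the scrapeUrl variable are my own stand-ins:

```go
// Log every request the collector makes.
c.OnRequest(func(r *colly.Request) {
	fmt.Println("visiting", r.URL.String())
})

// Print any error the collector runs into.
c.OnError(func(r *colly.Response, err error) {
	fmt.Printf("error while scraping: %s\n", err.Error())
})

// The ETF title sits in an h1 element with the "page-title" class.
c.OnHTML("h1.page-title", func(e *colly.HTMLElement) {
	fmt.Println(e.Text)
})

c.Visit(scrapeUrl)
```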
Now that we have the title, let's see if we can gather more information on how to extract all the other infos. If we look at the first few divs with the desc-float class, we already saw that they contain a paragraph with the desc class. Every single one of these paragraphs has three children: first a span, second a line break, and third another span. Every first span has a "desc-title" class, so I assume we can use this to extract the description, like "Earnings" and "Asset class". For the value, I think we need to differentiate between the ones with an icon, like "Physical" and "Distributing", and the ones with only text, like the fund size. Let's have a closer look at the icon ones first. The value can be extracted as the text of the span with a "cat-value" class; I assume this stands for category value. Now let's look at the divs that don't have these icons. Again, we can see that they have three children: a span, a break, and a second span. This is a good sign, because the HTML structure is the same as for the icon ones. The first span, containing the description, also has the desc-title class, so we can definitely use it to extract the description. But the value span is missing a class, so I guess our best bet is to just get the third child of the paragraph and extract its HTML text.

So let's head back to our IDE. First we need a second OnHTML function to trigger our collector every time it hits the desc-float class; inside that div we want the paragraph with the desc class. This way we get the paragraph as our Colly HTML element and can check all of its children. One cool thing about gocolly is that it provides us with the pretty awesome goquery library. This is basically like jQuery, so we can use it to do fancy stuff like finding or filtering elements. If we have a look at the HTML element, we can use its DOM field, which actually is a goquery selection. This is exactly what we want, so for now let's store it in a new variable called selection. As for the description, we saw that it's the HTML text of the span with the desc-title class, so let's simply use our selection and call the Find function to search for it, then print the text to the console. Running the application now prints out all the descriptions, but we have one problem: as you can see, the values are in German. This is because the website we are scraping is a Swiss website, and presumably the default language is German. But of course Colly provides us with a way to change this to English: we can use the request in the OnRequest function to set headers. In our case we need the Accept-Language header with the en-US value. If we run the code again, we can see that nothing has changed, though. This is because we call the OnRequest function after the OnHTML functions. To fix our problem, we simply need to put the OnRequest function above the OnHTML functions; this way the header gets set before the OnHTML callback functions are called. If we run the code again, we can see that the extracted information is now in English. It's probably a good idea to put the OnError function up here as well.

Now that we have the description, let's store it in a variable for later and focus on extracting the value. As we saw while examining the HTML of the page, every single paragraph with the desc class has three children, and we want to extract the value from the third one. To do so, let us get the child nodes from our selection: the Nodes property gives us the actual HTML nodes, so the two spans as well as the line break. We can now check whether the number of children is 3 and only extract information if that's the case. This is simply a safety measure in case the HTML contains some more paragraphs with the desc class. We can use the FindNodes function to extract the node we are searching for. FindNodes takes a variadic list of nodes as a parameter and returns a selection, which we can use to extract the text. As for the parameter, we know that the value is sitting in the second span, which is the third child. Let's store the value in a variable and simply print the description as well as the value to the console. Congratulations, we just scraped the webpage and extracted information. Put together, the callback looks roughly like the sketch below.
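Here is a sketch of that second callback plus the language header. The class names follow the video's description of the page; the combined selector is one way to express "a desc paragraph inside a desc-float div" and is an assumption on my part:

```go
// Set the Accept-Language header so the page is served in English.
// As noted above, register this before the OnHTML callbacks.
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("Accept-Language", "en-US")
})

// Triggered for every paragraph with the "desc" class inside a
// div with the "desc-float" class.
c.OnHTML("div.desc-float > p.desc", func(e *colly.HTMLElement) {
	// e.DOM is a goquery selection, so jQuery-style helpers are available.
	selection := e.DOM

	// The description is the text of the first span (class "desc-title").
	description := selection.Find("span.desc-title").Text()

	// Each paragraph should have exactly three child nodes:
	// the title span, a <br>, and the value span.
	childNodes := selection.Children().Nodes
	if len(childNodes) == 3 {
		// The value sits in the third child, i.e. the second span.
		value := selection.FindNodes(childNodes[2]).Text()
		fmt.Println(description, value)
	}
})
```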
But we do not need all of this information, and as of now we simply write everything as plain text to the console. Let's get some more structure into this and define a struct so we can put the information in one place. Let's first clear the terminal and head back to the top of the file. I'm going to call the struct etfInfo and give it every field I'm interested in: the title, the replication, the earnings, the total expense ratio, the tracking difference, and lastly the fund size. Now let's create an instance at the very beginning of the main function and fill it in during the scraping process. First, let's add the title right here. In the second OnHTML function we now need to fill in the rest, so I guess the best bet is to switch over the descriptions and put in the value whenever the description matches what we saw on the page. So for instance, if the description is "Replication", we set the replication value in our etfInfo object, and we do the same for the total expense ratio, the tracking difference, the earnings, and the fund size. Lastly, I want to print out the whole object as soon as the collector has finished scraping. For this I can use the OnScraped function of the collector. Let's simply create a new encoder and use standard output as the io.Writer. Of course we want to encode our etfInfo instance, and let's give it some indentation so it looks a little nicer. If we run the code, we can see that the information lies within our beautifully indented JSON object.

But we are left with one problem: right here, the tracking difference value did not get set in our etfInfo instance. This is because the tracking difference description has a leading and a trailing space. This is something that happens all the time while scraping a web page: you have to make sure that the information you want to retrieve gets cleaned in some way or another. The easiest way of doing this would be to add spaces right here in the switch case. That would actually fix the problem, but it's not very elegant, so let's instead write a cleanDesc function that gets rid of the whitespace. For that we can use the TrimSpace function of the strings package. Now we can simply wrap the output of the HTML text in cleanDesc to remove the spaces here and here, and run the code again. You can see that the tracking difference now gets extracted from the page. Cleaning the data is usually a huge part of the scraping process, and I'm very glad that it came up in this example; usually you need to clean a lot more than only these two whitespace characters.

Even though we are basically done with our web scraper, I'm going to show you one more cool thing you can do. Most of the time you don't want to scrape only one page but several. If you take our website, for example, in most cases you don't want the information for one ISIN but rather for several ISINs. I'm going to show you how you can define a list of ISINs, run the scraper once, and let it do the rest for us. In order to do so, let's define an ISIN slice and fill it with some ISINs; I'll simply copy and paste them from my notes. Next, let's make an etfInfo slice; this will hold every etfInfo we fill with data during the scraping process. Now let's take the scrapeUrl variable and turn it into a function: first we remove the ISIN from the URL, then we append the ISIN parameter and return the whole URL for the collector to scrape. So instead of visiting just one URL, we can now iterate over the ISINs and let the scraper visit every single one of them. Let's put the encoder at the end of the main function and of course let it encode the whole etfInfo slice. Here in the OnScraped function we now need to append the etfInfo to the etfInfo slice and create a new instance for the next scraping process. Executing the main file now gives us three JSON objects with all the information we scraped from the tracking difference page. One thing to look out for is that the tracking difference in the last ETF is actually an empty string. But if we check the page, we can see that there really is no tracking difference provided, so our scraper seems to work fine. The full program, with everything put together, looks roughly like the sketch below.
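As a recap, here is a condensed sketch of the finished scraper. The struct fields and selectors follow the video; the base URL, the ISIN list, and the exact description strings in the switch are placeholders you would replace with the real values from the page:

```go
package main

import (
	"encoding/json"
	"os"
	"strings"

	"github.com/gocolly/colly"
)

// etfInfo holds the fields we extract for a single ETF.
type etfInfo struct {
	Title              string
	Replication        string
	Earnings           string
	TotalExpenseRatio  string
	TrackingDifference string
	FundSize           string
}

// cleanDesc trims the leading and trailing whitespace that some
// descriptions on the page carry.
func cleanDesc(s string) string {
	return strings.TrimSpace(s)
}

// scrapeUrl builds the page URL for a given ISIN.
// The base URL here is a placeholder for the real site.
func scrapeUrl(isin string) string {
	return "https://example.com/etf/" + isin
}

func main() {
	// Placeholder ISINs; substitute the ones you are interested in.
	isins := []string{"ISIN-1", "ISIN-2", "ISIN-3"}

	etfInfos := make([]etfInfo, 0, len(isins))
	etf := etfInfo{}

	c := colly.NewCollector(colly.AllowedDomains("example.com"))

	// Request the English version of the page.
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("Accept-Language", "en-US")
	})

	c.OnHTML("h1.page-title", func(e *colly.HTMLElement) {
		etf.Title = e.Text
	})

	c.OnHTML("div.desc-float > p.desc", func(e *colly.HTMLElement) {
		selection := e.DOM
		childNodes := selection.Children().Nodes
		if len(childNodes) != 3 {
			return
		}
		description := cleanDesc(selection.Find("span.desc-title").Text())
		value := cleanDesc(selection.FindNodes(childNodes[2]).Text())

		// Map each description onto the matching struct field.
		// The case strings are assumptions about the page's labels.
		switch description {
		case "Replication":
			etf.Replication = value
		case "TER":
			etf.TotalExpenseRatio = value
		case "TD":
			etf.TrackingDifference = value
		case "Earnings":
			etf.Earnings = value
		case "Fund size":
			etf.FundSize = value
		}
	})

	// After each page finishes, store the result and reset for the next ISIN.
	c.OnScraped(func(r *colly.Response) {
		etfInfos = append(etfInfos, etf)
		etf = etfInfo{}
	})

	for _, isin := range isins {
		c.Visit(scrapeUrl(isin))
	}

	// Print everything we collected as indented JSON.
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(etfInfos)
}
```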
And now it's your turn: find a website you want to extract some data from and write your own web scraper using Golang and gocolly. And that is it for today. Thank you so much for watching. If you enjoyed this video, please give it a thumbs up. If you're new, subscribe to the channel so you don't miss out on any new content, and of course ring the bell. Until next time, keep on coding!
Info
Channel: Thomas Langhorst
Views: 132
Keywords: go, golang, webscraping, web scraping, webscraper, web scraper, colly, tutorial, how to
Id: bfVxq-oQA3c
Length: 13min 43sec (823 seconds)
Published: Sun Dec 05 2021