Golang tutorial: How to scrape websites with Golang & Goquery | Golang project

Captions
In this Golang tutorial we will scrape the TechCrunch website with Go and the goquery package. It's a small Golang project for today. From TechCrunch I want to get the titles of these posts, the URLs of these posts, and these text excerpts. After that I want to save the scraped data to a CSV file. It's a quite common and relatively easy task. So in this tutorial we will look at how to write data to text files, how to write data to CSV files, how to perform requests to websites, and how to use the goquery package. Feel free to ask your questions and suggest anything in the comments, and please don't forget to leave a like and share the video on social networks. It helps to promote the channel.

Okay, first of all I want to create a new Go module: go mod init, and I want to use my GitHub path (github.com, then techcrunch, for example). If you have just started to learn the Go language, consider watching my Golang imports tutorial. It's about modules and package paths in Go. So let's create a new main.go file, and its package name will be main too. Then let's declare the main function. It's the entry point of our program.

Now, to scrape the TechCrunch website, first of all I have to perform a GET request to it and get its HTML code. After that I have to parse this HTML code with the goquery library, and at last I will save the scraped data to a CSV file.

So let's start with getting the HTML code of the main page of the TechCrunch website. Let's define the url variable in the main function. It's a string, and string literals in Go, as you know, use only double quotes. Now I want to send a request to this URL. We can do it with the net/http package, so let's import it. To perform a GET request I have to call the Get function from the http package: http.Get. The Get function will return two objects: the server's response and an error if something goes wrong. I want to save these return values to the variables response and error. It will be a good idea to check whether the error variable has a value or not. If it has a value, it means that something is wrong with the server and we didn't get the response. Anyway, I want to see the value of the error variable. So let's check it: if error is not nil, it means that the error variable has a value, and I want to print this error. And of course I have to import the fmt package.

Okay, I got the response from the server, and now I want to check the status code of the response. If it is not OK, that is, the status code is more than 400, I want to see it. So let's check it here: if response.StatusCode is more than 400, fmt.Println. If the server responds with a 200 status code, it means that everything is okay; 301 and 302 mean redirects; 403 is forbidden; 404 is page not found; 500 codes are related to internal server errors. So if the status code is more than 400, we will see the message.

Okay. The HTML source code of the website is stored in the Body field of the response variable, and I have to close the body of the response after I complete the scraping process. To do it I have to use the defer keyword and call the Close method of the Body field, somewhere here. So I want to defer the call of the Close method of the Body field of the response variable: defer response.Body.Close().

Now let's install the goquery library, or package, that will parse the HTML source code of the page. Let's add the URL of its repository to the import statement. Okay, and to install it let's use the go get command; -v means verbose output, because I want to see the installation process. "Imported and not used." Okay, let's use it.

Let's assume that we got the HTML code of the website, and we have to convert it to a tree of Go objects, a tree of structs, that will allow us to search through it. To do it I have to call the NewDocumentFromReader function from the goquery package and pass the body of the response into it: goquery.NewDocumentFromReader. The NewDocumentFromReader function will return two objects: the document and an error. Let's use them: document, error. Let's check the error the same way we did earlier. Right now we've got some code redundancy because we have the error checking twice. So let's create the check function that gets an error, and I want to copy this here: if the error variable is not nil. And let's call it here and here. So let's use the doc variable somehow and install the goquery package. It's installed, and now we've got the document.

Let's examine the website for what exactly I have to search for. I wanted to get the title of each post, its URL, and this excerpt text. Inspect, and we can see that these posts are article tags inside the div with a river CSS class. So let's get this river div first. I am using the doc variable, I am calling the Find method, and I need the river div: .river. The dot here means a CSS class. And I want to know whether goquery finds something or not, so I'm going to call its Size method. Let's run the code, and I got one.

To get the HTML code of the found element as a string I can use the Html method. If the Find method returns a set of elements, list items for example, the Html method will return the HTML code of the first element of this set. The Html method will also return an error. Let's check it: check(error). And let's run the code. Okay, we got the HTML code of the river. And by the way, in this HTML code we can see that there are no article tags with the post-block CSS class. There are just divs with a post-block CSS class.

So I think that for convenience it's better to save this HTML code in a file and then work with this local file in the inspector. So let's create a new function. Let's say it'll be the writeFile function, and it will take two arguments: the data to save and the file name. data, fileName: they are both strings. Then I have to create a file to save my data in. I have to use the Create function from the os package. Let's import it. Then os.Create: I want to create the file with this file name, and it will return a file object and an error: file, error. Let's check the error, and I want to defer the file closing, so somewhere here: file.Close. And then let's write our code into the file. On the file object I am calling the WriteString method and passing the data variable into it. That's it. Let's call it: writeFile, river, and the file name index.html. Okay, let's open it in Chrome, and let's open the inspector. Yep, we can see that each post is a div with a post-block CSS class.

Let's get all these posts. So let's comment this out. I am calling the Find method again, and I need to get the divs with the post-block CSS class. Also I want to see the size of this selection object. Size returns one value. Okay, 20 posts.

Okay, this time the Find method returned the selection object as a set of elements, and for each element I want to get the title, the URL, and the excerpt. To do it I have to call the Each method. The Each method iterates through the selection object, and it gets a function as an argument. Each will execute this function for each element of the selection that we got from the Find method. This function has a standard signature. It gets two arguments: the index of an item, which is an integer, and the item itself, which is a pointer to a goquery Selection object. Then in the body of this function I have to describe what actions should be performed with the item variable, with the element of the selection object.

I want to get the title of a post and its URL. To get the title I have to get the h2 tag. I am calling the Find method of the item variable, and I want to find the h2 tag. By the way, we can remove this river variable. Then the title: the title is the text of the h2 tag, so just call the Text function. I want to get rid of white space at the beginning and at the end of the title, and I need something like the strip method in Python. Golang has the TrimSpace function in the strings package. Let's import it, strings, and then let's wrap the call of the Text function with the strings.TrimSpace function. Okay, we got the titles.

Now let's get the URLs: the url variable. The URL is the value of the href attribute of the child a tag. So let's find the a tag and get its href attribute. The Attr method also returns a second, boolean value: whether the attribute exists or not. I don't want to use it, so I want to use the underscore variable. So let's look at the URLs. "title declared but not used." Okay: title, url. We can see the URLs.

Okay, and I also wanted to get this excerpt. It's a div block with a post-block__content class (post-block, two underscores, content). So let's get it. I want to use the item variable again, and I have to find the div with this CSS class. I want to call the Text method, and I want to get rid of white space.

Okay, we got all the data for each element of the selection. Now I want to combine them into a slice. It will be a slice of strings. So let's get rid of this Println. Let's create the posts variable, and it will be a slice of strings: the title, the URL, and the excerpt.

Now we are ready to save the posts variable into a CSV file. So let's import the csv package, encoding/csv. Then somewhere before the call of the Find method I want to create a writer object: writer, csv.NewWriter. The NewWriter function gets a file object as an argument, so I have to create a new file: file, error, os.Create, and let's say it'll be the posts.csv file. Let's check the error variable, and the NewWriter function gets the file object. Then in the body of the Each function let's call the Write method of the writer object: writer.Write, and it gets our posts variable. Also we have to push the output buffers to the file to be sure that all data is saved to the file. We can do it by calling the Flush method here, outside of the Each function: writer.Flush. And that's it. Let's run the code. Sorry, a misprint: string, of course. Okay, and we got the CSV file, and we got our data, our 20 posts.

If you like the video, please leave a like and subscribe to the channel. Thanks for watching.
Info
Channel: Red Eyed Coder Club
Views: 1,165
Keywords: golang, golang web scraping, golang scrape website, go programming, golang tutorial, go tutorial, red eyed coder club, go lang, learn golang 2021, golang goquery, goquery golang, goquery tutorial, golang tutorial web scraping, web scraping with golang, goquery, golang goquery tutorial, golang project, golang projects, learn golang
Id: 4VIoT50mzzo
Length: 22min 27sec (1347 seconds)
Published: Tue Oct 19 2021