Make a web scraper using golang

Captions
In this video we are going to make a web scraper using Go. But before that, what is web scraping? The web is composed of web pages and websites that contain data of all sorts; when you pull this data from a web page or a website, that act is referred to as web scraping. The process is automated using specially written programs called web scrapers, which are designed to scrape data in large amounts. With web scrapers you only have to give them an idea of what data you're interested in, and then you can step back and let them do the work for you. In case you need to scrape data automatically from multiple websites across the web, you can alter your program so that it goes around the web and checks websites for the desired data: if a website meets the data requirement it is scraped, otherwise it is skipped, and the process repeats. Such programs are called web crawlers.

Now that we have some idea about the topic, let's make a web scraper. We start by creating a file; we can call it goscraper.go. We are going to use the Colly framework for this. It is a very well written framework and I highly recommend that you read its documentation. To install it we can copy the single-line command and throw it into our terminal or command prompt; it takes a little while and it gets installed. Since I already have it, I will close the terminal and switch back to VS Code.

Here we begin by specifying the package, that is main, and then we write our main function, func main. No errors. The first thing we need is a file name, so we can say fName, and that could be "data.csv". Now that we have a file name we can create the file: we say file, err and then os.Create, throw in the file name fName, and it will create a file named data.csv. Now that we have created the file we need to check whether there was any error during the process, so we say: if err is not equal to nil, we log that error with log.Fatalf. Fatalf basically prints the message and exits the program, so we can say "Could not create file", report that the error was err, and return.

The last thing you do with a file is close it, so we can say file.Close(). But we don't need to close this file right away; we need to close it once we are done working with it. Go has something special for that: we can just write defer. Once you write defer, whatever follows it will be executed afterwards and not right away. So here, once we are done working with the file, Go will close it for us and we don't have to worry about closing it manually. All right, so we have our file ready, and as I hit save you see Go imported the necessary packages for us, so that was helpful.

Now that we have our file set up, the next thing we need is a CSV writer: whatever data we fetch from the website, we are going to write into a CSV file, and for that we need a writer. We say writer, then csv.NewWriter, and specify that we are going to write into the file. Hit save, and encoding/csv was imported, as you can see. The next thing we do with the writer, again, is that once we are done writing we flush everything from the buffer, which can then be passed on to the file. For that we write writer.Flush(), but again this has to be performed afterwards and not right away, so we add the defer keyword. Save it, and no errors. So we have our file structure and we have a writer ready.
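Taken together, the setup described so far might look roughly like the sketch below; it is a minimal reconstruction based on the steps above, so the exact error message and identifiers are assumptions rather than the literal code from the video.

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	// Name of the CSV file that will hold the scraped data.
	fName := "data.csv"

	// Create the file and bail out if that fails.
	file, err := os.Create(fName)
	if err != nil {
		// Fatalf prints the message and exits the program.
		log.Fatalf("Could not create file, err: %q", err)
	}
	// defer: close the file once we are done working with it.
	defer file.Close()

	// CSV writer that will write rows into the file.
	writer := csv.NewWriter(file)
	// Flush the buffered rows into the file before main returns.
	defer writer.Flush()

	// ... the Colly scraping code goes here ...
}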
Now we can get our hands dirty with web scraping, or rather with Colly. We start by instantiating a collector, so we say colly.NewCollector, and if you noticed, Go imported Colly for us. Here we need to specify what domains we are working with, so we say colly.AllowedDomains; for this example I am going to scrape data from Internshala, so I throw in internshala.com. There is an error of course, "c declared but not used", because I'm not using it yet; we'll get rid of that in time.

Now that we have a collector, the next thing we need to do is point to the web page we want to fetch data from. Here's how we are going to do that. Here I have Internshala open; it provides internships, you might know about it. Let's go to internships, and we see all these internships; this is what we are interested in, what internships there are. If you press F12 we have the code for the page, and as you hover over it, it shows you what each element represents, there is a visual cue. So let's go into the wrapper and keep going; usually we go into the content. We have the footer (footer means the end), and here we have the content, so we open that, then "internship for women", then container-fluid. We need to go inside that; the container-fluid is the max width, as you can see, so we open it. This is the search criteria, and then here we have a reference, so we go inside it; this is the search criteria container, and this is the internships part.

So this is basically how it works: you open a web page or a website, you open its code, and you look for the part you need. What happens here is that this element represents the search criteria container; if we needed anything from that part of the page, we would dive into it. But we need internships, and for that we have the list container, so we dive into that. The basic idea of web scraping is that you open a website or a web page, you find out what data you need and how it is structured, and then from your program you pull data from that specified structure. Since we are interested in internships, we open the list container, we have a div, and we have this internship list container; we open that again, and here we have the individual internships. If we open one of them (of course, because it has to be a list), this is the data we are interested in, as you can see: the start date, the duration, the stipend, work from home, what company it is, what category. As you can see, it all sits under internship_meta.

So here's how we go about it. We instantiated our collector; now we need to point to this structure from our code, to internship_meta. We say c.OnHTML and throw in the selector, .internship_meta, so it points to this part of the web page. Then we write a function, func, with a pointer to that HTML element, so we can say *colly.HTMLElement, and this is where we will fetch our data from. The code we just wrote is pointing at that structure.
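Assuming the domain and selector from the walkthrough above (and the Colly import path github.com/gocolly/colly), the collector setup inside main might look roughly like this sketch:

	// Instantiate the collector and restrict it to the target domain.
	c := colly.NewCollector(
		colly.AllowedDomains("internshala.com"),
	)

	// Run a callback for every .internship_meta block found on a visited page.
	c.OnHTML(".internship_meta", func(e *colly.HTMLElement) {
		// Data extraction and CSV writing go here (shown in the next snippet).
	})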
With this function we will write that data into our CSV file. Remember we had the writer, so we can say writer.Write. What will it write? We are going to write a slice of strings, so we can say []string. The next thing is to specify precisely what we need, so here we can say e.ChildText. Of course, as you can see, there are different things you can fetch; Colly is quite a web scraping framework, you can build spiders, web crawlers, everything with it. For this example we'll say e.ChildText. What ChildText does, as you can see, is return the concatenated and stripped text of matching elements. So here we can throw in a tag, and what tag do we need?

Here we have the internship_meta; if we open it, we can see all the details, but what is something common from which we can get the information? If you open it (I have already done this), we have this profile here; you open it and you have an a tag, it has a link but it also has this text. Similarly, if you open this, you see a again, and this is whatever company it might be. Then we have an image here, we don't need that. We open this again, this is the location, and we open it again; we open the span, and in it we have an a, and in the a we have "work from home" as the text. So with e.ChildText we throw in the tag a, and what this does is it gets the text for that given tag: every child text of an a, we can fetch that. So we say e.ChildText with a, and of course we need a comma because we are writing a CSV row. Similarly we need something else; I already went through the web page so I know how it works: we will also need ChildText for span, which will throw in the stipend for us. Now that is working and we don't have any errors.

So this is basically it: here we had the collector from Colly, and here we pointed to the web structure and specified what we needed from the web page. The next thing is we need to visit this website and fetch all the data. Of course we can do it right now, but this is only one page; if you look, we need to do it for 312 pages, so we write a loop: for i is 0 and i is less than 312, increment i. The first thing we need inside the loop is a little feedback, so that it displays what page it is scraping: we say fmt.Printf with "Scraping page %d" and a newline, and throw in i, so this line will essentially print the page that is being scraped. Then we visit the site; again we use Colly and say c.Visit, and we throw in the address. This is not the full address, actually; this is page one. If I zoom in here you can see we have internships/page-1, and if I press enter you get this page, that is page one, and again if I write 2 here, this is page two. That's the complete address, so we can copy that and paste it into c.Visit, but we don't need the 2 there; we need to add i. For that we need string conversion.
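Filling in the callback and the page loop from the description above, the rest of main could look like this sketch (continuing inside the same main function; fmt and strconv also need to be imported, and strconv.Itoa, explained next, handles the number-to-string conversion):

	// Write one CSV row per internship: the concatenated <a> text
	// (profile, company, location, ...) and the <span> text (stipend).
	c.OnHTML(".internship_meta", func(e *colly.HTMLElement) {
		writer.Write([]string{
			e.ChildText("a"),
			e.ChildText("span"),
		})
	})

	// Visit every listing page, printing a little feedback for each one.
	for i := 0; i < 312; i++ {
		fmt.Printf("Scraping page %d\n", i)
		c.Visit("https://internshala.com/internships/page-" + strconv.Itoa(i))
	}

	fmt.Println("Scraping complete")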
We use strconv and convert the integer to a string with i, so strconv.Itoa basically takes i, converts it into a string, and appends it to the address so that we can access it. This does our work, and it runs for all the pages of the website.

One other thing, because not all websites show you how many pages there are. Let's say this information was not available; then what do you do? One simple thing you can do is this funny trick: you find the number by binary search. You arbitrarily pick some number, say 500 (or 300, 400, something), you write that, and you see that 500 won't display a page, it says page not found or something like that. Then you say, okay, 500 isn't working, let's try 250, and that works. So next you pick something between 250 and 500, say 350; 350 is also not working, so something between 250 and 350, say 300, and that works. This is roughly how you can get to know the number of web pages for a website. Of course, you can study the code from the page too, but that was a fun way; it was actually the first way that popped into my mind instead of reading through the code.

So we run the loop, and we are pretty much done. Lastly, let's print some information that the data is scraped, "Scraping complete", because nothing is more soothing than a 100% complete progress bar popping up. We'll also see what Colly brings back to us, if anything.

So this is the code. We hit save and we have no errors; we don't have to bang our heads yet. Let's run it; let's build it first. For that you just say go build. It seems like it did nothing, but it actually created a file here, goscraper, for us, and we can execute that: we just type ./ and then Tab for completion, and when we press enter our program is working, scraping all the pages one by one. Let's stop it right there and check the data. It is creating a file, data.csv, and if we click on it, here is the data; as you can see, it's lots of data. We scraped only eight to ten pages and it's already quite some data.

All right, we made a web scraper; now what? Web scraping is merely a tool to get hands on data; it's really how you choose to use this data that defines your work. To give an example, consider the internship data that we gathered. Let's say I want an internship and I don't know where to go. I'm good at speaking, I can code, so maybe I should go into IT, maybe I should go into marketing, I don't know. So we'll study this data, draw an insight from it, and then make an informed decision based on that. We have this data; the first thing we do is, by the power of regex, trim it and make it more readable. The next thing we do is get rid of anything we are not interested in: right now it has the company name and where the job is located, and of course it's COVID, so we are only interested in work-from-home jobs, so we remove anything unnecessary. Now we have the job category and the stipend. Next we make a count of how many jobs there are for each category; say for graphic design there are 780 jobs with a total stipend of 29,19,500.
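The counting program itself isn't shown in the video, but purely as an illustration, assuming the scraped rows have already been cleaned into a hypothetical two-column file (category, numeric stipend), a per-category tally could be sketched like this:

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"strconv"
)

func main() {
	// Hypothetical cleaned file with two columns: category, stipend.
	f, err := os.Open("cleaned.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		log.Fatal(err)
	}

	counts := map[string]int{} // jobs per category
	totals := map[string]int{} // total stipend per category
	for _, row := range rows {
		if len(row) < 2 {
			continue
		}
		stipend, err := strconv.Atoi(row[1])
		if err != nil {
			continue // skip rows whose stipend is not a plain number
		}
		counts[row[0]]++
		totals[row[0]] += stipend
	}

	for category, n := range counts {
		fmt.Printf("%s: %d jobs, total stipend %d\n", category, n, totals[category])
	}
}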
All right, after making this count, I'm going to remove any job category where the stipend is zero. Our data is more fine-tuned now. Next I will categorize it so that every job for, say, graphic design gets grouped together, and of course it needs a little manual input because of how the data is gathered; you'll get a lot of noise in the data, so you've got to use some tools. I wrote a little program for myself, and with that program and a little manual input I was able to categorize it. You do that recursively until you get this list of data, and once you have that you can plot it.

So naive me, not knowing where to go, looks at this graph and sees the total jobs on Internshala, or rather internships. I see two peaks, IT and marketing. I was interested in either; both have more than 1,500 jobs, so that's a lot, maybe I can land a job finally. But when I switch to average salary I see that IT jobs are pretty well paid, a little over 7,500; marketing jobs, however, not so much. Similarly I can see the difference in other sectors, say content management or telecalling. Teaching has not a lot of jobs, but the stipend is huge, so maybe I can go there. So this is one example of how you can use your data; it was just a simple personal example for the sake of understanding, so that you can make an informed decision like that.

Other examples of web scraping include price comparison. Here I have the same product opened across three e-commerce giants, but if you notice, the price for the same product is different on each website. You don't have to go manually looking for the product price; you can just use websites such as Google Shopping that employ web scrapers: you simply type in the product name and you get the product price comparison, and that is done by employing web scraping. Pricey is a similar website that does the same work. You can also use web scraping for lead generation, to get contacts of customers or users. You can also use it, say, for Yahoo Finance: you fetch this data, find a pattern in it, and accordingly make an informed decision. Those were some examples of how you can use web scraping; of course it goes beyond that, and you can use it for data collection for data science or research.

That's all, that was the video. Thanks for watching. If you liked the content, consider subscribing; it's a new channel and I'm going to post a ton of Go videos. Thanks.
Info
Channel: Sourav Singh
Views: 10,563
Id: 3KsE7zMm-AI
Length: 22min 10sec (1330 seconds)
Published: Thu Oct 22 2020