Ep.01 :: Go Web Scraper/Crawler

Video Statistics and Information

Captions
Hey, this is Ajinkya and you're watching JS Funk. We are going to start a series on Golang; these are more like episode-based videos, just for fun, and in today's video we are going to start creating a web crawler. If you're wondering what a web crawler is: it's the part of a search engine that scans web pages looking for links and then follows them. It would normally store the found URLs in some database; our version is just going to scan through the URLs and print them to the screen. It's not going to be quite an enterprise-grade application, but we will deal with some of the basic concepts and see a satisfying result get printed on our screen.

Let's get started with creating our project. I'm at my src folder; if you're wondering how I came up with this path, make sure to watch the installation video linked in the description, and make sure you have a proper Golang setup on your machine. Also make sure you have the latest version of Go, because we are going to use Go modules, which are only available from 1.12 onwards. I'm going to create a directory here. Inside this directory we have nothing, so I'm simply going to create a main.go file. I'll also initialize git, so that I'll be able to share this code afterwards, and I'm going to introduce Go modules here; go mod has to be given an init command when we are starting for the first time. As you can see, we now have two files here: main.go and go.mod. Once we add a few external packages (we'll see that later on), this go.mod gets a companion file called go.sum, which is kind of a security file. I'm not exactly sure what it is for; I'm also a beginner in Golang, kind of just starting out too, so please feel free to post in the comments.

So let's get started by opening our project in Visual Studio Code. The basic file in Go is main.go, which has to have the package name "package main", and it has to have a function called func main (func stands for function). Let's just print out a hello world and run this code in our terminal. Okay, everything is fine so far.

What we're going to use is the net/http package from the Go team, and we are going to give it a base URL: http.Get will fetch us an HTML page, so we pass it a URL. Let's define that base URL first; for now it is going to be a plain HTTP URL. If you hover over Get you will notice it has a return type: it returns two values, and we are going to use Go's multiple return values here, which is kind of like variable destructuring, so we get a response and we get an error. Handling errors in Go is really simple: you just check with an if condition, and if the error is not equal to nil, you do something with it. In our case we are just going to fmt.Println that error to the screen and exit with code 1. Since we are going to deal with errors multiple times in this video, I'm just going to create an easy function here called checkError, which accepts a value of type error, and we're going to reuse this func checkError everywhere, just passing the error in.
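Here is a minimal sketch of where the code stands at this point: an http.Get wrapped with the checkError helper. The URL is a placeholder, not the one used in the video:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// checkError prints the error and exits with status 1, matching the
// simple error handling described above.
func checkError(err error) {
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}

func main() {
	baseURL := "http://localhost:8080" // placeholder; any HTTP URL works

	// http.Get returns two values: the response and an error.
	response, err := http.Get(baseURL)
	checkError(err)

	fmt.Println(response)
}
```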
So far so good; I'm just going to reduce this window height. Okay, so we have this HTTP response; let's print it to the screen with fmt.Println(response) and see what we get. Let's run go run main.go. (This is my firewall prompt, okay.) It reads out the entire page and gives us something: a big http response object. You might also get an error here, since YouTube is an HTTPS page; since we are on localhost we might not, but let's try. Okay, so far so good; the HTTPS one says it's fine too.

Now if I print response.Body, this is the object that we receive. response.Body is an io.Reader; if you hover over Body you will see it is of type io.Reader, and it has to be closed afterwards. As I mentioned, response.Body is not a string; it's more like a reference to a stream of data. So we are going to use the ioutil package from Go and store it into memory: body, err := ioutil.ReadAll(response.Body). The ReadAll method stores everything into the RAM of your machine. This body needs to be closed too, so we close it at the end of our function with a Close call, then fmt.Println the body, and let's see if this works: go run main.go. Okay, "body.Close undefined": what I forgot was that it is the response's body that has Close, not this variable. This body is not the response stream; it's a byte array. You'll see that now: we are receiving the whole body as a byte array, so we need to convert it to a string. But before that, I'm just going to ignore ReadAll's error; you could do an error check here as well, but if you want to ignore it, you simply assign it to an underscore, which discards that error for us. Running again, you can see this is the exact HTML that we are receiving. We have a lot of URLs in here; this is one big HTML file, and it is what http.Get gives us for this URL.

I also mentioned earlier that if you try to read an HTTPS URL, by which I mean if you host this code as a service on the internet and try to access an HTTPS URL, this default HTTP client can give you an error. So I'm going to use my own HTTP client and simply ignore the SSL certificates. What I'm going to do is create a new variable here called client, which is going to be a pointer to an http.Client. An http.Client has many fields; one of them is Transport. Transport basically lets us modify the original HTTP client, so we are just going to modify the transport part of it. I'll create another variable, transport, which is going to be of type http.Transport (let me check; yes), again a pointer to it. An explanation of pointers will come in future videos.
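A sketch of the custom client being described, with certificate verification disabled; the URL is again a placeholder, and skipping TLS verification is insecure outside local experiments:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

// A reusable client whose transport skips TLS certificate verification.
var netClient = &http.Client{
	Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	},
}

func checkError(err error) {
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}

func main() {
	response, err := netClient.Get("https://example.com") // placeholder
	checkError(err)
	defer response.Body.Close()

	// ReadAll buffers the whole body into RAM as a byte slice; the
	// error is discarded with the blank identifier, as in the video.
	body, _ := ioutil.ReadAll(response.Body)
	fmt.Println(string(body)) // convert the byte slice to a string
}
```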
An http.Transport has many fields too, but the one we are interested in is TLSClientConfig, so I'm going to add that here, and the config is a variable that we are going to create now: a tls.Config object. Again, tls.Config has many options, but the one we are interested in is InsecureSkipVerify. (I only knew the starting letters of these names, but since I'm using VS Code it has IntelliSense and it gives me the right keywords.) I'm going to set InsecureSkipVerify to true, and with that the client is simply going to skip the SSL verification. Since I want this client to be reusable, I'm going to declare it at the top, and let's also combine these imports together; I'll quickly merge them into one block. So we have the client which we created; now we just need to pass it this transport. Perfect. I'm going to use this client instead of http.Get; I'll call it netClient, so netClient.Get.

So far so good: we have created a basic client, we have added basic error handling, and we are reading the whole body from the response. The main thing remaining now is to extract all the links, or I should say all the anchor links, from this body. To extract those anchor links, Go has an HTML package; not the net/http one, it's golang.org/x/net/html, this one. If you land on x/net/html you will see it has html.Parse, and it needs a reader. Once you pass your reader object to this html.Parse method, it returns you a tree structure of the DOM, and you have to traverse that tree with a loop, finding the anchor tags and extracting the href and text attributes from them. In most languages that is not fun, and I would say in Go as well it's not fun, so I have extracted out a simple package called extract-links which does exactly that, and we are going to use it. In a future video I'll show how I created this extract-links package, but for now let's just add it to our project.

To add a package you say go get and simply paste the package URL, which adds it to the library; we are using the latest version of extract-links, which is 0.0.1. Now if you look, a go.sum has been created automatically here, and the go.mod file has this version added. To use this package you simply go to the imports and add it like this, and now extract-links is available to us as the extractlinks identifier. So instead of using ioutil, I'm simply going to say extractlinks.All; it has this method, and it takes a reader object, which is our response.Body. Once we pass that, we get all the links. This call also returns an error, and we'll just deal with it the same simple way. Now let's just print these links out to the screen.
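The video leans on the author's extract-links package, whose exact import path isn't spelled out in the transcript. For reference, here is a sketch of the kind of traversal it wraps, using golang.org/x/net/html directly; the Link type and extractLinks helper are hypothetical stand-ins, not the package's actual API:

```go
package main

import (
	"fmt"
	"net/http"
	"os"

	"golang.org/x/net/html"
)

// Link is a hypothetical stand-in for what extract-links returns:
// an anchor's href attribute plus its text.
type Link struct {
	Href string
	Text string
}

// extractLinks walks the parsed DOM tree depth-first, collecting the
// href and text of every <a> element it finds.
func extractLinks(n *html.Node, links []Link) []Link {
	if n.Type == html.ElementNode && n.Data == "a" {
		link := Link{}
		for _, attr := range n.Attr {
			if attr.Key == "href" {
				link.Href = attr.Val
			}
		}
		if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
			link.Text = n.FirstChild.Data
		}
		links = append(links, link)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = extractLinks(c, links)
	}
	return links
}

func main() {
	response, err := http.Get("https://example.com") // placeholder
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	defer response.Body.Close()

	doc, err := html.Parse(response.Body) // root of the DOM tree
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Println(extractLinks(doc, nil))
}
```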
One more thing I forgot to mention: rather than having this response.Body.Close() at the end, you can do something like this. You already have this response.Body, which you got from up here, so you can simply say defer response.Body.Close(). defer is like a way of saying wait: wait until the end of the execution, until this function exits, and then run this call. So I'm just going to move it to right after the error check.

We are using extractlinks.All; let's see on screen how it looks, so let's say go run main.go. It returns this huge object; we are receiving so many URLs from this current page, and it is also returning us the paths of the URLs. If you look at the browser now, you have these side links; you can see at the bottom of the screen, when I hover, I get to see those same links: Movies, Gaming, Premium, Fashion and so on. All these links are available here, and that is what extract-links is doing: it simply reads one page at a time and gives all the links back to you.

But what I want is to print them one by one, so we are simply going to loop over them: we get the index, then we have this link, both coming to us from a range expression. Instead of fmt.Println I'm going to say fmt.Printf: index is a %v, then link is another %v, and I just pass in i and then link.Href, with a \n at the end for the newline. Let's see it in action. Okay, so this is what we are receiving, and if you look at this link object, it has a Text field as well; I'm just going to print that out too and see how it looks. So we are receiving an Href and we are receiving a Text. If you notice, I used the plus in %+v here, which is kind of like a "give me more details" flag; I'm not completely sure about it, but it works every time.

So far, what we have done is: we have our custom transport with the netClient, we are doing the GET for the URL, and we have this extractlinks call. The missing things are: first of all, we need to get this URL from the arguments; secondly, we have to move this part into its own separate function; and thirdly, once we extract all the links, we have to visit each of those extracted links again. So let's see how we are going to do that. First of all, let's get the base URL from the command-line arguments. What I'm going to do is simply use os.Args: arguments := os.Args. Once we have access to the arguments, let's print them out to see what we are receiving, so just Println the arguments and check it out. So in the arguments we are receiving the path we are calling this from, and then the actual arguments. What we are interested in is everything after index zero, and the easiest way of getting that from this array is to slice it: arguments[1:], everything after the first element. Let's output this again. As you can see, we are getting the correct arguments, so let's simply do a check here: if the length of arguments is equal to zero, then we throw an error; we just tell the user, hey, you're missing certain
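A short sketch of the argument handling just described:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// os.Args[0] is the program path; the real arguments start at 1.
	arguments := os.Args[1:]
	if len(arguments) == 0 {
		fmt.Println("missing URL argument, e.g.: go run main.go https://example.com")
		os.Exit(1)
	}

	baseURL := arguments[0]
	fmt.Println(baseURL)
}
```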
arguments, and we also have to exit here, so we'll just say os.Exit(1). I think I will call this baseURL and not arguments, because that is what the argument is going to be. Let's check it out; just for the sake of displaying it on the screen, I'll Println it. "Cannot use baseURL as type string", blah blah blah: the Get call is having an error now, because this is not a plain URL string yet. Let's pass it a correct value. "Cannot use baseURL (type of string) as argument": let's see what we are getting here. I think you must have already spotted it; I made a silly mistake. arguments is still inherently a slice, and we have to get the base URL from arguments[0]. Now we should have the correct base URL; let's try that again. All right, the base URL is correct, and that's it, this works.

Now let's create our function to separate this code out. I'm just going to call it func crawl, which accepts an href, and it's going to be a string. I'm quickly going to move the fetching code into crawl and remove it from main. Now the netClient is not available: as you can see, crawl is trying to access netClient while netClient is declared inside main, so we need to move these variables outside. I'm quickly going to turn them into globally declared variables, which should now be available to us. href is the URL that is going to be getting crawled every time, and what we can do is simply print it: fmt.Printf with a %v, displaying the href that we are crawling, plus a \n for a new line.

So far we have a crawl function which crawls the URL it is given: once we crawl to the URL it fetches the HTML, and once we have the HTML we are extracting all the links from that body. So for any given URL, we get the body and we have all the links. Now we have to crawl all of those URLs too, right? So eventually we are going to call the crawl function again. Doing it inside this for loop, calling the function right here and passing the link, would not really make sense (we'll fix it properly in a minute), but basically what I'm doing here is passing link.Href. So if we look at this: we simply call the crawl function from here, with the starting URL, and the crawl function keeps going from there. Let's say this works, and we'll find out if we are making any mistake. Okay, what is the compiler problem... I just quickly renamed this to crawl. All right, let's run this in our terminal.

Now, the problem that we are facing here is: we are crawling this, but as you might have seen before, we also have certain URLs that are just paths, without any information about the host. Some of these URLs look like a forward slash, or a forward slash with some path, while some of them have the complete information, like https://youtube.com/ plus some video link. For the path-only ones, the host part is what we need to prepend.
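Roughly where the refactor lands at this point; a sketch that builds on the netClient, checkError, and extractLinks pieces from the earlier sketches, rather than on the author's extract-links package:

```go
// Declared globally so that crawl can reach it, as described above.
var netClient = &http.Client{
	Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	},
}

// crawl fetches one page and prints every link found on it.
func crawl(href string) {
	fmt.Printf("Crawling: %v\n", href)

	response, err := netClient.Get(href)
	checkError(err)
	defer response.Body.Close()

	doc, err := html.Parse(response.Body)
	checkError(err)

	for _, link := range extractLinks(doc, nil) {
		fmt.Println(link.Href)
	}
}
```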
So how are we going to do that? We are going to write a new function here. (Also, recursively looping into the crawl like this is not a good idea, it's not efficient, and we'll fix that afterwards; but first let's fix the URLs.) What I'm going to do is create a new function called toFixedURL, which accepts the href that we need to fix and the base URL; both of these are going to be strings. If the href doesn't have a host, then we are going to take the host information from the base URL. We have another package from the Go team for URL parsing, net/url, and it has a method called Parse. As you can see, we pass it a raw URL and it returns us a URL object, and if you look at this URL struct, it has everything, all the information that we need. So let's do that. I'll create a new variable, uri; never forget that url.Parse also returns an error as a second value; and we pass it the raw URL, our href. If the error is not equal to nil, in this case we just simply return an empty string; we have no other option. (We also have to define a return type on the function.) And we have to create the object from the base URL as well, so I'll parse that into base; same thing, if we have any error we simply return empty. So far we have a uri object and a base object, both of type url.URL struct, and we need to return the joined URL.

If you inspect here, what I did was print uri.Host and base.Host with some nice prints (Go's version of console.log). Let's try to run this: as you can see, uri.Host is empty, while base.Host is js.org (I went ahead and changed the host in the meantime). So far we have identified the cause: the uri doesn't have the host, so what we can do is take the host from base and use the path from the uri. We can simply go ahead and fix that by calling a new method on base called ResolveReference. It takes a complete URL object, which in our case is the uri, and it returns us a new URL object, toFixedURI. What this is actually doing is taking the host from base and the path from uri. If uri has its own host, then it simply uses the host from uri; otherwise, it builds base's host plus uri's path, and that is what gets returned. Now if we call toFixedURI.String(), it fixes the URL for us, and that's it. Let's try to run this again. Okay, this was the initial crawling URL, and I think it is still not fixed; let's see why. We forgot to pass the fixed URL to the crawl, so we simply remove the raw link and pass the fixed URL instead. Let's run this again, and now you will see that we are crawling correct URLs.

But now we are stuck in an infinite loop. What this infinite loop is: we are crawling through the URL, we are crawling to the page, but we are not queuing the found links up. We are going to fix that with concurrency. Basically, to avoid this issue we need to keep some kind of queue, where we put the links that we find at the back of the queue, and we visit the page that is at the front of the queue. We can fix this with channels and concurrency.
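A sketch of the toFixedURL helper as described, using net/url; ResolveReference keeps the href's own host when it has one, and otherwise joins the base's scheme and host with the href's path:

```go
package main

import (
	"fmt"
	"net/url"
)

// toFixedURL resolves a possibly-relative href against baseURL,
// returning "" when either string fails to parse.
func toFixedURL(href, baseURL string) string {
	uri, err := url.Parse(href)
	if err != nil {
		return ""
	}
	base, err := url.Parse(baseURL)
	if err != nil {
		return ""
	}
	return base.ResolveReference(uri).String()
}

func main() {
	fmt.Println(toFixedURL("/path", "https://js.org"))          // https://js.org/path
	fmt.Println(toFixedURL("https://x.io/a", "https://js.org")) // https://x.io/a
}
```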
Concurrency is available in Go with just the go keyword, but let's first create a channel, and let's see what the compiler is complaining about; it needs make, and that should fix it. So let's create a global variable called queue, which is going to be a channel of string. This queue, which is a channel now, is going to receive the base URL initially from a concurrent function, so we are going to run a function here, an anonymous one for now, and the queue receives its base URL, in this case just argument zero; we can remove the direct crawl call. Once the queue receives this, we have to execute the crawl as well, and that is going to run concurrently. What I mean by this concurrency is that we need to keep the sending asynchronous; otherwise we would exhaust all our resources while visiting a single link. So whenever we receive a URL from the queue, we are going to crawl it; that is what this setup is going to help us with. I know my explanation might not be perfect, but that is what I am building right now. So let's loop with a for loop over the range of the queue; from the range we receive an href, and we just pass that href to crawl.

So we have the basic queue ready: we have a loop, and we are running crawl on the URLs, the hrefs, from the queue. Once we have the href we start fetching it, and over here in crawl we don't need to call the function again; we simply need to put the found URLs at the back of the queue, and we are going to do just that: the queue receives this link.Href. Now let's try to run this. Okay, it is stuck, and the reason for it is this send right here. You might have spotted the issue already (some people even ask this question in interviews): why might this not be working? The reason is that we are looping, and this send is being executed synchronously, while we are pushing links faster than we are visiting them, so this part needs to be asynchronous. To fix that, we simply wrap the send in a concurrent function, a go func, passing the link in like this, and that should do it. Perfect, a simple quick fix. And while we're here, I think we should apply the URL fix in this spot itself: we can remove the separate toFixedURL call, and the base for the fix is the href we are currently crawling, which is this guy. Makes sense; we have the absolute URL now, and so far it looks good, so let's try to run this.

Okay, it is going through. Now, I have just interrupted this: somehow it went to js.org, then somewhere else, then it came back to js.org, and from there it went to Twitter, and from Twitter to this site, and so on and so forth. But if you look at this, we have some repeated URLs here. To fix them we need to have some kind of check: if this URL is visited, we don't visit it again. So we simply create another global variable called hasVisited (or we could just call it visited), and it is going to be a map of string to boolean. So let's just make a map, which is going to have
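A sketch of the channel-based queue at this stage, again building on the earlier helper sketches; the goroutine around the channel send is the deadlock fix the video arrives at:

```go
// An unbuffered channel of URLs acts as the crawl queue.
var queue = make(chan string)

func crawl(href string) {
	fmt.Printf("Crawling: %v\n", href)

	response, err := netClient.Get(href)
	checkError(err)
	defer response.Body.Close()

	doc, err := html.Parse(response.Body)
	checkError(err)

	for _, link := range extractLinks(doc, nil) {
		absolute := toFixedURL(link.Href, href)
		// The send must run in its own goroutine: a synchronous send
		// on an unbuffered channel blocks until someone receives, and
		// the only receiver is busy inside this very function.
		go func(u string) { queue <- u }(absolute)
	}
}

func main() {
	go func() { queue <- os.Args[1] }()

	// Take URLs from the front of the queue and crawl them, forever
	// (until interrupted), as in the video.
	for href := range queue {
		crawl(href)
	}
}
```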
a key of string and a value of boolean; I'll just call it hasVisited. So we are going to do that check over here: only if we have not visited this particular href do we call the crawl function, and inside the crawl function we update that hasVisited map: we pass the href and we set it to true. Perfect, so let's see this in action again. Okay, it is going through each and every page now; it is going through people's pages; these are all the new URLs it is finding now. All right, I'm going to stop here by interrupting with Ctrl-C.

One thing I noticed is that I just want to crawl a single domain. We can add that with another condition here: we can simply have an isSameDomain kind of function. What I mean by that is: if the href we are visiting is from js.org only, then we visit it; otherwise we won't crawl it. We can do that check over here; it's completely optional, but I'm just going to add it. This is something additional, so I'll add it quickly: it is basically a similar function to toFixedURL, so I went ahead and created it already, and I'll just paste it in here (a sketch of it follows the transcript below). If you look at this function, it is almost identical to our toFixedURL function; the only difference is that it returns a boolean instead of a string. If the uri host and the parent URL host are not the same, we return false; if they are the same, we return true. So, additionally, we say: and isSameDomain, which needs the href and the base URL. The base URL is this argument, which now means we need to store it as a baseURL variable, pass it in initially over here, and pass it over here as well, so that we visit only URLs from the same domain.

Let's try it and check: in our case we should not visit anything external now, so we should not be seeing anything related to GitHub or PayPal. Okay, this works, and these are the only URLs it has found. Let's try another URL; YouTube, maybe. Yeah, this is going to be a long list, because YouTube has many URLs. So this is working; our application, our web crawler, is working fine.

And that's it from this video. If I have made any mistakes or given any wrong information, please post it in the comments. In the next video we are going to learn, or maybe solve, a problem related to recursion, and after that we'll jump into a few more Golang problems. Please don't forget to like and subscribe to my channel. Thank you, bye bye.
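For reference, a sketch of the hasVisited check and the isSameDomain helper described at the end; the video sets the hasVisited flag inside crawl, while here it is marked in the loop for brevity:

```go
// Tracks URLs that have already been crawled.
var hasVisited = make(map[string]bool)

// isSameDomain mirrors toFixedURL, but compares hosts instead of
// joining them.
func isSameDomain(href, baseURL string) bool {
	uri, err := url.Parse(href)
	if err != nil {
		return false
	}
	parent, err := url.Parse(baseURL)
	if err != nil {
		return false
	}
	return uri.Host == parent.Host
}

func main() {
	baseURL := os.Args[1]
	go func() { queue <- baseURL }()

	for href := range queue {
		// Skip anything already seen or outside the starting domain.
		if !hasVisited[href] && isSameDomain(href, baseURL) {
			hasVisited[href] = true
			crawl(href)
		}
	}
}
```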
Info
Channel: Ajinkya X
Views: 13,471
Keywords: javascript evangelist, jsfunc, golang, go programming, golang web crawler, go web scrapping, golang scrapper, web crawler, go spyder
Id: 2wmkHFTaXfA
Length: 44min 33sec (2673 seconds)
Published: Fri Nov 22 2019