Use wget to download / scrape a full website

Captions
Hi everyone. In this video we'll take a look at how we can use wget as a simple tool to download an entire website, slash scrape an entire website. For the purposes of keeping this video really short and simple, I'll try to outline three examples. We'll take a look at a basic example, kind of like a hello world if you will, then we'll take a look at some more advanced functionality, kind of like going through a very common set of parameters that we would use for screen scraping or downloading files off a website, and then finally we'll take a look at some more advanced parameters.

Now I do want to clarify that "download" and "scrape" can have different implications, but keep in mind that this is a very simple example of using wget. I've covered other tools like Scrapy and even off-the-shelf scraping tools on the channel, so you might want to take a look at those for more advanced capability. But this is probably the simplest way you can download a site, keep a site offline, or even scrape some parts of a site. I say scrape with a bit of hesitation, because in most cases when you talk about scraping you're talking about extracting content from web pages: crawling the pages, parsing them, taking content or structured content out and storing it in some other structured way. So I just want to emphasize that this video is not about in-depth scraping capability; that's not really something you would use wget for. It's more for the case where you want to download the web pages and keep the website content offline, do some further processing locally on your disk, or build other tools to monitor whether websites have changed, and so on.

We are going to exclusively use the wget tool, so if any of wget's parameters are unclear or you want to investigate further, you can head over to this URL; all of this is in the description of the video below.

To get started, let's start off with a simple example. I've taken a random website, and I say random somewhat tongue-in-cheek, because I'm pointing at Scrapy, which is one of the tools I've covered in an earlier video. Incidentally, Scrapy is an open-source web scraping tool, so it's kind of funny that we're using wget to scrape or download Scrapy's site. Anyway, that wasn't the original website I had in mind, but I thought it would be interesting from a video demo standpoint. So let's actually go and see what the website looks like. This is basically the content we are trying to extract in all of these examples. You might argue that this isn't the best example, since there are hardly any images here, but again I wanted a simple site for a demo; feel free to substitute this URL with your preferred one.

All right, let's head over to the console. Let me just copy that again, and on our console, in a folder which currently does not contain anything, let's paste the wget command. All right, connection established; let's try that again. So what you saw was wget in action, but it didn't do anything amazing: it just downloaded a simple HTML file which, if you open it, obviously opens up in the browser here, and you can see that it's pointing to a local resource. Now keep in mind that not all the contents of this page are pulled down to your computer; for example, CSS files, JPEG images and so on are not downloaded, which actually brings us to the next example.
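A minimal sketch of this first example, assuming the demo URL is https://scrapy.org/ (substitute your own):

    # Example 1: fetch a single page; wget saves it as index.html in the current folder
    wget https://scrapy.org/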
So that was a very simple example. If you've never used wget for HTML, chances are you've used it in the past to download zip files and other software installers, but it's interesting to think of it as a tool for downloading web pages.

Now things get more interesting in example two. Here we are going to use a few parameters. You'll notice that I'm pointing at the same URL. What I've specified here is that it needs to recursively navigate the contents of this initial page and then keep traversing through the site, more like a web crawler would do. The no-clobber option is not mandatory, but it basically means that if a URL has already been crawled and a local page was created, don't crawl it again and recreate that page; it's typically helpful when you have connectivity issues or you want to stop and restart a couple of times, for example during development or testing. Then things get more interesting with page requisites: remember, the first time we extracted this page it was just the HTML, and it did not download or keep an offline copy of any of the other resources like images and CSS, so setting this ensures that all those other resources are downloaded too. The HTML extension option is helpful when you're crawling, scraping and downloading files which typically have an extension like JSP or ASPX or CGI scripts; when you store them on your local hard disk and click on them, you want the extension to be .html so that the file automatically opens in the browser. Convert links ensures that any links, HTML anchors and so on, point to a local file path as opposed to pointing back to the server URLs. Then we have some escaping of characters in file names, we specify that it needs to stay only within this domain or subdomain, and finally no-parent specifies that it needs to stay at this hierarchy level and not go up a level, say for example into the French or German language sections.

So let's run this example now. Let me just cancel that and clear the folder, just so it stays clean and we know what's going on. All right, now you can see that it has completely scraped and downloaded all these pages, and if we go down here you'll notice we have the index.html page; let me close the old one, and this is our latest downloaded page. Now just be mindful that this site is actually a poor choice for this video, because the page still relies on external CSS files and various other external resources, and since we have locked it down to only the scrapy.org domain, it's not going to download those resources locally. Additionally, if you follow this same example against the same site, you'll notice some of the images haven't been downloaded, because this is an example of an image that lives in a different domain: we have restricted the crawl to the scrapy.org domain, whereas this image points to the readthedocs.org domain. So your mileage may vary depending on how much you want to scrape; I'm just highlighting that based on the parameters you provide here, it may or may not capture the content to the fullest degree and keep everything offline, but obviously you can tweak these parameters to your heart's content.
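Pieced together from the options described above, the example-two command would look roughly like this; the exact flags (in particular --restrict-file-names for the character escaping) and the scrapy.org target are assumptions, since the on-screen command isn't captured in the captions:

    # Example 2: mirror a site for offline browsing
    # --recursive              follow links and crawl the site like a web crawler
    # --no-clobber             don't re-download pages that already exist locally (handy when restarting)
    # --page-requisites        also fetch the images, CSS and other resources each page needs
    # --html-extension         save server-generated pages (.jsp, .aspx, CGI output) with a .html extension
    # --convert-links          rewrite links to point at the local copies instead of back at the server
    # --restrict-file-names=windows   escape characters that are awkward in local file names (assumed flag)
    # --domains scrapy.org     stay inside this domain
    # --no-parent              never climb above the starting directory (e.g. into other language trees)
    wget --recursive --no-clobber --page-requisites --html-extension \
         --convert-links --restrict-file-names=windows \
         --domains scrapy.org --no-parent \
         https://scrapy.org/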
Now, a quick recap given where we are right now: what we have done is download the files, it's all there on our local hard disk, and that's brilliant; it allows for offline browsing and further processing offline. But in most cases you're trying to download or scrape a much larger site, and these parameters alone typically will not suffice, which is where I use the third example.

This one has some additional parameters, and in most cases you'll definitely want to keep them. In the previous examples you noticed that the requests were being sent immediately, one after the other. That's OK for small sites or if you are scraping only a small subset, but when you're scraping larger sites they might blacklist your IP. One of the ways you can get around that, or just be a good netizen if you will, is to allow for a wait, so wget waits for five seconds before it sends the next request. You can also specify a rate limit, i.e. how much data you're downloading per second; by default it's in bytes, so you can add the k suffix, and I believe you can also use m for megabytes. Again, this helps ensure you don't get blacklisted. Some sites are a little more intelligent and check whether the user agent is in a familiar list, and might not serve you content otherwise, so there are ways you can specify the user agent. And then there's the recursion depth: you can check the docs for more up-to-date information, but I believe the default recursion is only five levels deep, so if you want to recurse further down, or restrict it to something less like maybe two, you can set the level here.

The last thing I will point out is that in most cases, when I'm downloading a larger website, I typically want to run it in the background but at the same time have a log file I can send the output to. wget provides some built-in functionality for this: we can send the output, and see what the progress is, to a log file, and run the whole thing in the background. So let's take a look at that in action. Just before I do, let me split this into two panes so we can see the files, and let's run it. All right, so that's running in the background; if you take a look you'll notice the process is still running and we have sent the output to this file here. If I look at the contents of the folder you'll notice we have a new file, so let's tail it. Here we can see the log being updated as new requests are made and progress happens; wget is sending the results to the log file. So I can just close all these consoles and come back in a couple of hours, or even much longer if it takes that long, and then see what the progress is. That's quite handy, and I would say in the vast majority of cases this is the kind of template I would use.
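As a sketch of that template, with the politeness and logging options layered on top of example two; the specific values (the five-second wait, the rate limit, the user agent string, the depth of two, and the log file name) are illustrative assumptions:

    # Example 3: the same mirror, but slower, shallower, and run in the background with a log
    # --wait=5                 pause five seconds between requests (or use --random-wait, see below)
    # --limit-rate=200k        cap the download rate; k = kilobytes, m = megabytes
    # --user-agent=...         present a browser-like user agent to sites that filter unfamiliar clients
    # --level=2                limit recursion depth (wget's default is 5 levels)
    # -b                       run in the background
    # -o wget.log              send all output and progress to a log file
    wget --recursive --no-clobber --page-requisites --html-extension \
         --convert-links --restrict-file-names=windows \
         --domains scrapy.org --no-parent \
         --wait=5 --limit-rate=200k \
         --user-agent="Mozilla/5.0" \
         --level=2 \
         -b -o wget.log \
         https://scrapy.org/

    # follow the progress while the crawl runs
    tail -f wget.log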
One final note before we wrap up this video. You'll want to run this on different sites and see how it performs, but some sites do detect that there's a bot, an automated process, making these requests. So instead of having a fixed five-second wait, you can remove that wait and instead use a random wait (wget's --random-wait option), which works on some sites. Again, this is a quick and dirty tool, in a manner of speaking, and it's not as versatile as a full-fledged web scraping tool; I've covered some of those in other videos in the past. But it's something I've found quite handy when I wanted to scrape some content really quickly and either keep it offline or do some post-processing and content extraction on it. All right, that's it for this quick video. Thanks everyone for watching.
Info
Channel: Melvin L
Views: 36,820
Rating: 4.5177307 out of 5
Keywords: Scrape, web scrape, wget, Linux, Tutorial, offline, site offline, download site, download full site, Linux hacks
Id: GJum2O2JM6M
Length: 14min 35sec (875 seconds)
Published: Thu Jan 18 2018