How To Crawl A Website Using WGET

Captions
Hello, welcome to another How To Code Well tutorial. Today we're going to take a look at wget. wget is the non-interactive downloader that you can get for any kind of Linux distribution as well as Mac. Now, if you're using a Mac, like I am, then you need to install it, and the best way to install it is using Homebrew: it's literally brew install wget. I've already got it, so I'm not going to install it here, but once you've got it you can start playing with it. It's very, very powerful. You can download pretty much anything off of the internet, any kind of web page, to your machine, obviously as long as you've got access to it, and you can also treat it like a spider, so you can crawl web pages, search for things, and only download certain bits and pieces and file types and so forth. We're going to take a look at all of that in a minute.

Let's just take a look at the very basics of wget and actually download a file. It's as simple as typing wget and then putting in the destination, so I'm just going to put in the address of my blog, peterfisher.me.uk. That downloads the index page of that site into the current working directory, so if I do ls we can see that we have the index file here, and if I do du -h on the file we can see that it is 68K. We can cat it, of course, to actually see the contents. I'm just going to clear that down.

So that's downloading a single file, but how about downloading the entire site? Maybe you want to make a mirror of a site. First of all I'm going to remove that file and clear that down, and then we're going to do it recursively. Let's do wget again, but this time with -r. Using the -r argument does it recursively: if I press Enter it's going to download not only that page but all of the other pages that are connected to it, as well as all of the links and resources, so the JavaScript, the CSS, everything, recursively. That's what the -r does. I'm going to kill this now and do an ls to see what we've got, and in this working directory we now have the website, peterfisher.me.uk. I'm just going to clear that down to make some room and cd into it, and as we can see we have the directory structure of the website: the blog, contact, hire, how-to-code-well and so forth. If I cd into how-to-code-well, for example... not a directory. Well, that should have been a directory. Let's go into something else, say blog... nope, apparently these are just plain files, which is a bit odd. Let's do ls -al. Okay, we have got a couple of directories here, so let's try wp-json; maybe wget hasn't actually gone through those directories to build everything up yet. If I do an ls we have the index of wp-json, and again we've got wp-content and so forth, and inside wp-content I can see the plugins and the themes. Now, what I'm showing you here isn't sensitive stuff; it is literally just how the web handles it, this is what the web sees, so anyone can access it. Let's just clear that down. So that's all well and good: you can mirror your own sites, and you can do this against Google or against other websites too and just pull down a full copy of them.
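For reference, the commands used in this part look roughly like this (the domain is the presenter's own site from the video, so substitute your own, and the file sizes will of course differ):

  brew install wget                    # macOS only: install wget via Homebrew
  wget https://peterfisher.me.uk/      # download just the index page into the current directory
  ls                                   # the downloaded file (typically index.html) appears here
  du -h index.html                     # check its size (about 68K in the video)
  cat index.html                       # view the raw HTML
  wget -r https://peterfisher.me.uk/   # recursive download: linked pages, CSS, JavaScript and so on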
Now, there is a way of preventing wget from doing that, but it is a very handy tool if you want to quickly take a backup of something. One thing we will look at right now, though, is how to use wget as a way of spidering over your website to discover broken links, and that is a very useful, powerful thing. We're going to add a lot of arguments to this, so let's do wget --help, and as you can see there is a whole plethora of options and arguments that we can supply to wget: FTP options, HSTS options, all sorts of stuff around HTTPS and how directories are handled, loads and loads of options. If you put these together in a certain order, you can use wget as a great way to spider your website, find out which pages are actually broken, and output the results to either another terminal window or a log file, and that is what we're going to deal with now.

We've already had a look at -r, which does the download recursively. There is also a --spider argument, which is this one down here, and it means don't download anything. In the example I showed you before, when wget found something it would download it and create a directory structure. We don't want to do that; we only want to find out whether something is broken, so --spider is a very good argument for that. We're also going to use -nd, which is no directories. I'm just going to try and find that in the help output... there's so much stuff here... here we go: no directories, so don't create the directories. Again, we don't want anything written to disk, we just want an output that says what is broken.

Next we're going to use -l, and -l means the level. With -r we're going recursively, and -l tells wget how many levels deep to go. By default I believe wget goes five levels deep; you can change that, and I'm just going to do one level deep, just for argument's sake. Here it is in the help: -l is the maximum recursion depth, with zero for infinite. Infinite recursion is not what we want, so we're just going to do one for now. The next thing we're going to use is -w, which is all the way up here, and that is wait: it means we wait a number of seconds between retrievals, so wget hits one link, then waits maybe a second, then looks for the next link, and so forth. Next is -o, which I believe is down here somewhere, and -o gives you an output file: it puts all of the log messages into a file rather than printing them to the screen, which is very handy, because we literally just want to find out what is broken and what isn't, so you give it a log file. Here we go: -o writes the log messages to a file. And the last thing, of course, is -nv, and that is no verbose, which basically means the output is condensed down to exactly what we need. So we're going to use -r, then --spider, then -nd, then -l for the level, then -w to wait a couple of seconds, then -o for the output file, and finally -nv. I'm going to play around with these in a minute just to give you an idea of the differences these arguments make.
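For reference, here is roughly how those options are described in wget --help (the exact wording differs a little between wget versions):

  -r,  --recursive          specify recursive download
  -l,  --level=NUMBER       maximum recursion depth (inf or 0 for infinite)
       --spider             don't download anything
  -nd, --no-directories     don't create directories
  -nv, --no-verbose         turn off verboseness, without being quiet
  -w,  --wait=SECONDS       wait SECONDS between retrievals
  -o,  --output-file=FILE   log messages to FILE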
So let's go ahead and do that. The first thing I'm going to do is just clear down what we've got, so let's go up a level, do an rm -r on that directory and clear the screen. That's great. Now we're going to do wget, and we're going to do -r again because we're doing it recursively, but this time --spider as well, because we don't want to download anything. We're going to do -nd, as well as -nv because we don't want any verboseness coming back out of this; we literally just want to find out what is broken and what isn't, and I'll play around with that in a minute. -nd, of course, is no directories, so we're not creating any directories, that's fine. We're also going to do -l for the number of levels, so let's do one, then -w to wait for about a second after each request, and then -o, which means we supply an output file, so here we're going to have wget.log. Then we pass in the address of the site that we want to crawl, in this case peterfisher.me.uk, and what that's going to do now is crawl that website. Let's press Enter, and you'll notice that you don't actually get any output. It is still running, but there's no output at all. What you need to do is create a new terminal window; if you do an ls you can see that we have the log file, and what I like doing is just tailing it, so tail -f to follow it. As you can see, it slowly starts churning away. Did you see that? It added another line; it'll do another one after another second, check it, and then continue, and it's going to do that all the way through the website.

So this is how you can crawl your website, and at the end of it you'll find out what is broken and what isn't broken, if there is anything broken at all. It's basically checking each link, and you can see we've got the 200 OK response, so that is an okay response. What we're looking for, of course, is a 404, a 404 meaning that the page cannot be found. This is very handy. What you can do is put this on a cron, so you can run it every month, or after a release, and just ensure that nothing is coming back as anything different from a 200 OK, because anything else means there is a problem area that has been found. So this is how you can use a crawler to crawl over your stuff, and it basically gives you a quick feedback loop. What you can do is check this log file and only return the 500 errors and 404 errors and so forth, and then, if you come across one, fire off an automated email to yourself just to say, look, after that last release we have some issues on the site. It's very good for finding out the general status of your site. Now, I don't wish for everyone to just hammer my website for this, but do check your own website, and do check other websites as well.
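Put together, the broken-link crawl from this section looks roughly like this. The wget.log name is just the log file chosen in the video, the cron line is only a hypothetical illustration, and the exact wording of the log lines depends on your wget version, so adjust the grep patterns to whatever your log actually contains:

  wget -r --spider -nd -nv -l 1 -w 1 -o wget.log https://peterfisher.me.uk/

  # in a second terminal window, follow the log as the crawl runs:
  tail -f wget.log

  # once it has finished, pull out anything that was not a 200 OK, for example:
  grep -v '200 OK' wget.log
  grep -i -B 2 '404' wget.log

  # a hypothetical monthly cron entry running the same check at 03:00 on the 1st:
  # 0 3 1 * * wget -r --spider -nd -nv -l 1 -w 1 -o /tmp/wget.log https://peterfisher.me.uk/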
Now what we're going to do is play around a little bit with the output. I'm just going to close this, so Ctrl+C to cancel it, go back to the other tab and close that as well. Right, so we've got our command here with this huge plethora of arguments. What happens if we change the -nv, if we actually want some more verboseness? Let's remove -nv from the command and press Enter again, go back to the other tab, clear it down and tail the log again. What we can see now is the output that you would normally get from wget, so it's all of this dot-dot-dot progress stuff and then the file sizes, and can you see, it's just hugely weighted. We don't need any of this stuff, and that's why we have the -nv flag, because there's just no point; what you're looking for is the status of each link. Now, what you could do, after this has completed, is grep the log file to look for different status codes, which is very, very handy. Let's just put that back to what it was. And again, the -w is the wait, so if I was to increase that, it would leave a longer period of time between requests.

Thanks ever so much for watching. If you haven't done so already, do subscribe, and also do check out my Docker course, Docker in Motion, by Manning Publications. I'll leave a link in the description as well as a card in this video. Thanks ever so much, see you again, happy coding, cheers, bye.
Info
Channel: Peter Fisher
Views: 17,086
Rating: 4.9288888 out of 5
Keywords: wget, wget spider, website spider, crawl a website using wget, wget web crawler, how to use wget, what is wget, install wget on a mac, install wget, how to install wget, create a wget web spider, how to build a web spider, how to download a website with wget, how to download a website, how to backup a website, using wget, wget commands, wget tutorial, wget mac, wget command in linux, wget site, peter fisher, how to code well
Id: dbybH_h4lCs
Length: 14min 40sec (880 seconds)
Published: Tue Oct 24 2017