How to use the Wayback Machine

Captions
Hi, my name is Alexis Rossi. I'm a librarian and the Director of Collections at the Internet Archive. The Internet Archive is a non-profit library that was founded to bring all of human knowledge online and make it accessible to everyone for free.

So let's talk about the Wayback Machine. The Wayback Machine is an archive of over 500 billion web pages that we've been saving since 1996. From the archive.org front page, you can search the Wayback Machine directly right here, or from anywhere on the website you can go into the media nav, in the Web drop-down, and search here. But for our purposes today, I'm going to click the Wayback Machine logo to go to the Wayback front page. You can also type archive.org/web into your browser to go straight here.

The Internet Archive itself was founded in 1996, and our original mission was to archive the internet, hence the name. Today it seems pretty obvious that the internet is a big part of our daily lives, but in 1996 that was not so obvious. So we were kind of the only people archiving the internet back in the day, and the Wayback Machine is one of the only records we have of what those early days of the web were like. We were motivated to save web pages because they're essentially ephemeral: they can be changed or deleted at any time, and there are no artifacts left behind, unlike something like a newspaper or a book. In fact, today the average lifespan of a web page is only about a hundred days before it either changes or is deleted.

The quickest way to search the Wayback Machine is to type in a URL, so let's go with eff.org, the Electronic Frontier Foundation. What you're looking at here is essentially a calendar of all of the times that we have captured the page eff.org over the last 24 years. We call this the calendar page, and this graph right here, which you can scroll to see all the way back to 1996, is a record of how many times we've crawled this page over time. The height of the line indicates how many times we crawled the site; it
doesn't denote change or importance, just frequency of crawling. 2021 is highlighted here, so that's the calendar that you see below. Since we're in January, you only see a few days of information here, so let's go back to 2020, and now you see a full year of crawling of this particular page.

There is a circle for every day that we crawled this page; in this case, we crawled it every day of last year. But you'll notice that the circles are different sizes. The size of a circle indicates how many times we crawled the site on a given day. So this circle is eight snapshots, and if I look at a bigger circle, this is 59 snapshots.

You'll also notice that the circles are different colors. This one is green, this one is blue, and we have a red one all the way over here. At the bottom of the page you'll see a legend that explains these further, but basically a blue circle indicates that we got 200 responses; in other words, the page, eff.org in this case, loaded with no problems. A green circle like this one indicates that one of these captures was a redirect, and a red circle indicates that there was an error retrieving the page, a 500 error in this case.

You'll notice there are a lot of times listed here, and these times are all in UTC. This is every time we've crawled the page during that day. As I go through these dates, you may notice that at the top of the calendar there is some text that's changing, right here. So again, as I go through, you'll notice it's not just showing me the date and the time; it's also showing me why the page was saved. In this case, the red capture here, the error capture, was saved through our Save Page Now collection. If I go up here and click it, you're taken to the collection on archive.org where this page is stored, and you may find some good explanation up here about why the page was saved, who is collecting it, and how it got into the archive in the first place.

Let's go back and look at that again. So that one is from Save Page Now: a human being said, "Save this page." This
next one is from an Archive-It partner; we'll talk about Archive-It in a moment. A few more of those. Okay, and then this one is from a focused crawl, NBC News. Let's look at this capture.

This is why you come to the Wayback Machine: you want to see what these pages looked like back in the day, even if "back in the day" was less than a year ago. So this is the page as it was captured on May 6, 2020. You can see the date up here. But what I'd like to point out is that you can also find out a page's provenance when you're on the page. Over here you'll find the About this capture pull-down, and you'll see again that this is from the NBC News focused crawl. This gives you a little bit of information about what focused crawls are for, tells you that the organization that crawled it is us, the Internet Archive, and then down here you see a bunch of timestamps for other pages.

So let's talk about what a web page is actually made up of. The page eff.org might contain this text, for example, but there are additional assets that make up this page. The logo is a separate image file. This is a separate image file. This is a separate image file. You get the idea. There are also probably JavaScript files that run things on this page, and there might be a CSS file that tells us how to lay out the page. Pretty much every page that you go to on the internet is going to have many, many files, or assets, or in this case URLs, that make up that page.

So when we look at About this capture and we look at the Timestamps section, what this is telling you is all of the different URLs that we had to capture to replay this page to you accurately. We captured the page eff.org at the time that was listed in the calendar, the one we clicked on, but the other files were captured at different times. Here we are telling you, for example, that this file was captured one hour and 19 minutes before this particular page. Down here, this was captured 33 minutes and 14 seconds after the page that we're on right now. This gives you a quick way to
see whether the images and other assets that make up the page were captured at around the same time; in other words, how likely it is that the page we're showing you is accurately represented. This can be very helpful information if you're doing deep research. Most people probably don't need to care about the Timestamps section, or even the provenance necessarily, but we do want you to know it is there in case you need further information about where this capture came from.

So this isn't a still image. This is a copy of the code and other assets that made up this website at the time. There's one big proviso here: the Wayback Machine contains things that we could download. That might be something like text or an image, even a video file. But there are certain dynamic elements of websites that are far more difficult or impossible for us to capture. Search is one of those things, and probably the one you're most likely to run into. So the search function on this website, eff.org, is not going to work, and no, you can't go back and search Google like it's 2005.
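The capture offsets described in the Timestamps section can be reproduced by hand. The sketch below assumes the 14-digit UTC timestamp form (YYYYMMDDhhmmss) that appears in Wayback Machine replay URLs; the function names and the sample timestamps are made up for illustration and are not taken from the video.

```python
from datetime import datetime, timezone

# Assumed format: the 14-digit UTC timestamp seen in Wayback replay URLs.
WAYBACK_TS_FORMAT = "%Y%m%d%H%M%S"


def parse_wayback_timestamp(ts: str) -> datetime:
    """Turn a 14-digit timestamp string into a timezone-aware UTC datetime."""
    return datetime.strptime(ts, WAYBACK_TS_FORMAT).replace(tzinfo=timezone.utc)


def describe_offset(page_ts: str, asset_ts: str) -> str:
    """Describe when an asset was captured relative to the page itself."""
    delta = parse_wayback_timestamp(asset_ts) - parse_wayback_timestamp(page_ts)
    seconds = int(delta.total_seconds())
    direction = "after" if seconds >= 0 else "before"
    hours, remainder = divmod(abs(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s {direction}"


if __name__ == "__main__":
    page = "20200506120000"  # hypothetical capture time of the page
    print(describe_offset(page, "20200506104100"))  # an asset captured earlier
    print(describe_offset(page, "20200506123314"))  # an asset captured later
```

Small offsets suggest the replayed page is close to what a visitor saw at the time; large offsets mean some images or scripts may come from a noticeably different version of the site.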
And you can click through, so now we're on a completely different URL. You see the different URL here, and you see the date: we did also capture this on May 6, 2020. You can continue exploring, or you can go back to the home page, and I'll show you a few different options that you have up here in the header.

Above the provenance section you see your share buttons; you can share on Facebook or Twitter. This takes you to the help documentation, and this icon here lets you save this capture of this page to your web archive on archive.org. That is located in your account, and I will show that to you later. So let's go ahead and click that, and we get a little success message, so it will be waiting for us in our account later.

There are different ways to navigate through the dates that you're looking at. Again, this is the date that you're currently on. You can use these to go backwards and forwards a month, backwards and forwards a day, or, well, you can't go forward yet, but backwards to 2019. You can also scan through dates here and click anywhere in here to navigate through time. If you have another URL you would like to search, you can go ahead and do that right here. You can use the logo to go home to the Wayback Machine home page again, or you can click the captures link here to go back to the calendar page for eff.org, which I'm going to do because we have several other things to look at here.

So we're on the calendar page right now, which we've already talked about, but next to that you'll see this Collections page. This is a bit sparse, so I'm going to click into 2020 so we see the Collections page for last year instead. This page shows you why the page was crawled: these are the collections that crawled the page, and when they crawled it over time. You can click through to any of these to find out more about the collection and who runs it, and of course you can click on different years to see how that has changed over time.

The Changes page for this URL allows you to visualize how much a page
has changed over time. If a day is gray, it hasn't changed much since the last crawl. If a day is blue, there were significant changes. You'll see the legend for these colors down here: gray is a low rate of change, blue is a high rate of change.

The compare feature allows us to compare two different captures and see what has changed in a side-by-side display. Let's try this feature out very quickly. Let's just choose two captures from the same day, but they can be from different days. Our two capture dates and times in this case are displayed up here. We go ahead and click Compare, and it takes a moment for this to finish running. You'll see the changes highlighted here. You can see the banner has changed, but if you scroll down the page you'll also see that different stories are now available. As you can imagine, this can be very helpful for showing small changes within pages. Now let's close this window (this opened in a new tab) and go back to the Changes interface. We have the data for all of the years down here, and it can take a little bit of time for these to process if there's a lot of information.

Next, let's look at the Summary. The Summary applies to the entire domain. Calendar, Collections, and Changes are specific to the URL that you're looking at here, the single page that you looked up. However, Summary and Site Map apply to the entire domain, so the information for Summary and Site Map applies to every page that lives on eff.org, not just the single home page.

The Summary page lets you explore what kind of content is hosted on eff.org (text, images, JavaScript, all of those sorts of things), and you can do that through time. You can limit this information by the type of files that are available. You can also limit this by year and see how things have changed. Play with that a little bit, and you can see the pie chart changing here. You can also delve further into what types of things you'd like to see here; for example, this is just images. And again, we can change this to include all time. Further
down the page you'll see information for the entire domain about new pages, different MIME types, that sort of thing. Lots to explore there.

And finally, let's look at the Site Map. The Site Map, again, is for the entire domain of eff.org, and you can choose which year you would like to look at. The center of the ring is the root, in this case of course eff.org, and as you move outward you're moving outward through the tree of their website. Note that right above my cursor you can see the URLs change. Scrolling through time lets you see how the complexity of a site has changed, so 2000 versus 2020: you can see the site's become much more complex. And of course, once you get to a page that you want, you can click through and we will take you to the replay page for that. I'll leave you to explore that more on your own.

Let's click the logo and go back to the Wayback Machine front page. Now, as I said, the most efficient way to search is by URL, but if you don't know the URL you can also search by keyword. Searching by keyword here is different than it is on Google or another search engine. You're not searching every keyword on every page; you're actually searching for entire sites, not pages, and you're searching things like titles, the URL, meta tags, link text, things like that. This means that if you search for just a word instead of a URL (let's search for "crochet"), you're not searching for that word on every page from the last 24 years. You're only going to get entire websites that are about crochet. From here you can choose whichever site looks the most likely to you, and from here, of course, this is all the same: you can click through to see other pages on the website, you can learn about the capture. Everything here is the same as what we were just looking at. So keyword search is a good way to find an old site when you've forgotten the URL.

Let's go back to the front page again; just click the logo. We've spent some time exploring how you
can find things that are already in the Wayback Machine, but you can also add things to the Wayback Machine yourself. If you want to do large-scale, ongoing web archiving projects, I recommend that you check out archive-it.org. This is our subscription-based archiving service that allows you to conduct your own large crawls. But for most people, Save Page Now is all you really need.

The Save Page Now feature lets you archive pages that are live on the web right now. You may want to archive pages because you are saving your own research to make sure it will still be available later. You might want to archive something that is happening right now that you think is newsworthy. Or you may just have redone your personal website and you want to make sure we have a copy of it for posterity. Whatever reason you have for saving a page, all you need to do is type your URL in right here. In this case I'm going to save creativecommons.org and click Save Page. Just to note, you will see more options here if you are logged in to the archive.org website with your free account.

Now, when you save a page using this tool, it saves the one URL plus any assets needed to render the page, so it includes the images, the CSS files, and things like that. It does not save an entire website. However, you can increase the number of pages we save by choosing to save the outlinks from this one page: we will find all of the links on this one page and crawl those links as well. If you uncheck this box, we will not save the page if it results in any sort of an error code. You can check here to save a screenshot of the page. That will be a still image; it's not clickable, but it is a good representation of what the website looked like at a particular time. You can save this to your web archive; these pages will show up in your account, and we'll look at that in a moment. And you can ask to be emailed the results of this crawl. We will email it to the email address that is associated with your archive.org account. The email includes the
URLs we crawled, any errors we ran into, and the screenshot if you chose that option.

Now let's go ahead and save the page with our selections made. This dialog here tells us what's happening. If it's a very media-rich page, or you've chosen to save outlinks and it's a fairly link-heavy page, it may take a while to complete this, but when it's done you'll see green "done" messages for each resource saved. So our seed page is done (by seed I mean the original URL that I gave to save), and then down here you'll see progress as we save all of the outlinks. This page is still in progress; oh, there, it's done. This page has run into an error. This page is still saving, and this is the first time this page has been archived. So as I said, this can take a moment. You will get a full report sent to you via email if you chose that option. Up here, this is the link to the saved web page in the archive; that's the clickable version. And then down here, that's the screenshot, the still image.

Because I chose to save this in my web archive (we also saved the eff.org page to our My Web Archive earlier), I'm going to go to my account and choose the My Web Archive option from the drop-down. Down here we see the Creative Commons page that we just saved, as well as the May 6th version of the eff.org page that we added to our My Web Archive earlier. And of course these are clickable, so we can see the page that we just crawled.

There are a few more things to talk about on the Wayback Machine home page. If you're a developer, right here you'll find a link to the Wayback Machine Availability API. This allows you to programmatically check whether we have any archived versions of a URL. This API has allowed our partners, for example, to fix more than 10 million broken links in Wikipedia references and to add support for broken pages to the Brave web browser, among other things. If you prefer to take the Wayback Machine with you, you'll find extensions and add-ons for your browsers (Chrome, Firefox, and Safari), as well as an
app for your iPhone and an app for your Android phone.

I hope you've got lots of information now to help you get started using the Wayback Machine. If you need further help, you can go to our help site at help.archive.org, and if you have further questions you can always email us at info@archive.org.
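For developers, a check against the Availability API mentioned in the video might look like the following sketch. It assumes the public endpoint https://archive.org/wayback/available with `url` and optional `timestamp` query parameters, and a JSON reply that nests the best match under `archived_snapshots` and `closest`. That shape reflects the public documentation as best understood here and is not shown in the video, so verify it against the current docs. The helper names are illustrative only.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; not stated in the video itself.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; timestamp is an optional YYYYMMDD[hhmmss] hint."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the replay URL of the closest capture, or None if none is listed."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_availability(url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Query the live API (needs network access) and return the closest capture."""
    with urlopen(build_availability_url(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    print(check_availability("eff.org", "20200506"))
```

Keeping the URL building and the response parsing in separate functions makes it possible to test the logic without calling the live service.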
Info
Channel: Internet Archive
Views: 158,880
Id: ts1tu1BiSuY
Length: 21min 7sec (1267 seconds)
Published: Wed Jan 13 2021