
Wayback Machine Hits 400,000,000,000

300 points | tweakz | 12 years ago | blog.archive.org

56 comments

[+] leorocky|12 years ago|reply
And not one of those hits is from Quora due to their robots.txt:

https://web.archive.org/web/http://www.quora.com/

Good job Quora, preserving all that crowd-sourced content away from the crowd, keeping it from everyone not logged in. Hats off to getting into YC so you can post your job openings on the HN home page and get some press. This doesn't add anything to your image though; it just takes a little away from YC.

On a better note, a great big thank you to the Wayback Machine for all of the public good it does. Now there's an organization that is amazing and wonderful and enriches our lives in an open and honest way with information.

[+] GuiA|12 years ago|reply
Thank you. I've hated Quora since day one precisely because of this - they're the archetype of a company taking all its value from the community and giving absolutely nothing back. Sadly, every time I've voiced this opinion in the Bay Area, I'm met with blank stares.

My dream would be a Quora-like service run by the Wikimedia Foundation (or a similar organization), but that'll likely never happen.

[+] dredmorbius|12 years ago|reply
Man. I hate on Quora. Didn't realize they were YC. Shame.
[+] unfunco|12 years ago|reply
You can append ?share=1 to show the content. (A workaround which should not be required but sadly is.)
[+] keenerd|12 years ago|reply
PSA/ranty thing: Just because something is archived in the Wayback Machine, do not trust that archive.org will keep it there for all time. If you need something, make a local copy! A few months ago TIA (The Internet Archive) changed their stance on robots.txt: they now retroactively honor robots blocks, so any site can completely vanish from the archives.

Let's say I died tomorrow. My family lets my domain slip. A squatter buys it and throws up a stock landing page with a robots.txt that forbids spidering. TIA would delete my entire site from their index.

I've already lost a few good sites to this sort of thing. If you depend on a resource, archive it yourself.

edit - Official policy: https://archive.org/about/exclude.php

If I'm reading it properly, once a site is blocked they never check again in case of a change of heart? No procedure for getting re-indexed at all?
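For concreteness, the blocking mechanism being discussed is just a plain robots.txt file; a squatter's two-line blanket Disallow is enough. A minimal sketch with Python's stdlib parser, using the `ia_archiver` user-agent the Archive's crawler has historically announced (example.com and the path are placeholders):

```python
from urllib import robotparser

# A typical squatter robots.txt: block every crawler from everything.
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Under the retroactive policy, this answer hides past captures too,
# not just future crawls.
print(rp.can_fetch("ia_archiver", "http://example.com/any/old/page"))  # False
```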

[+] gojomo|12 years ago|reply
A fair rant, but to correct some misperceptions:

The retroactive application of robots.txt is not a new policy; it's been in place for at least 11 years, and I believe it arrived very soon after the Wayback Machine was first unveiled.

An updated robots.txt does not irreversibly delete prior captures, so if the robots.txt changes again, access to previously-collected material can be re-enabled.

This policy has served to minimize risk and automate the most common removal scenario: a webmaster who wants a total opt-out of both current crawling and past display. But the collateral damage to unrelated content from prior domain owners has grown as the web has aged and more domains have changed hands. (The tradeoff that made sense in 2002 probably doesn't make sense in 2014.)

Figuring out a system that can automate most legitimate exclusions, while rejecting or reversing those that lack a firm basis in content ownership or personal privacy, is a thorny task, but it would be worth pursuing if/when the Wayback Machine has the necessary staff resources.

(My proposal since 2008 has been a DMCA-inspired 'put-back' procedure, where an original content owner can assert, formally, that they are the content owner and do not want the current-day robots.txt applied to captures before a certain date. Then, the current domain-owner would have to counter-notify that to maintain the block. This idea hasn't had legal review, but would reverse some current damage, and any bad-faith blockers would have to go on record with a false claim to maintain the block, potentially exposing them to a third-party legal challenge, with minimal risk to IA.)

[+] frik|12 years ago|reply
Archive.org honors the robots.txt, at least during the indexing period - okay. But current domain owners should not be allowed to remove historic content of the domain at a later date just by modifying the robots.txt.

A lot of information is lost as domain squatters take over domains and set new robots.txt files. On Wikipedia, for example, you find a lot of reference links that point to archived URLs on archive.org. Every now and then a vital information source is lost - it's a bit surreal, like book burning. I really like archive.org; this is the single feature I dislike a lot.

[+] dredmorbius|12 years ago|reply
I'd like to rave about an underappreciated but absolutely brilliant piece of the Internet Archive's infrastructure: its book reader (called, I gather, "BookReader").

TIA includes copious media archives including video, audio, and books. The latter are based on full-image scans and can be read online.

I generally dislike full-format reading tools: Adobe Acrobat, xpdf, evince, and other PDF readers all have various frustrations. Google's own online book reader is a mass of Web and UI frustrations.

I'm a guy who almost always prefers local to Web-based apps.

TIA's book reader is the best I've seen anywhere, hands down.

It's fast, it's responsive. The UI gets out of the way. Find your text and hit "fullscreen". Hit F11 in your browser to maximize it; you can then dismiss the (subtle) UI controls off the page and you are now ... reading your book. Just the book. No additional crap.

Page turn is fast. Zoomed, the view seems to autocrop to the significant text on the page. Unlike every last damned desktop client, the book remains positioned on the screen in the same position as you navigate forward or backward through the book. Evince, by contrast, will turn a page and then position it with the top left corner aligned. You've got to. Reposition. Every. Damned. Page. Drives me insane (but hey, it's a short trip).

You can seek rapidly through the text with the bottom slider navigation.

About the only additions I could think of would be some sort of temporary bookmark, or the ability to flip rapidly between sections of a book (I prefer reading and following up on footnotes and references, which often requires skipping between sections of a text).

Screenshot: http://i.imgur.com/Reg8KLB.png

Source: http://archive.org/stream/industrialrevol00toyngoog#page/n6/...

But, for whoever at TIA was responsible for this: thank you. From a grumpy old man who finds far too much online to be grumpy about, this is really a delight.

This appears to be an informational page with more links (including sources):

https://openlibrary.org/dev/docs/bookreader

[+] jsmthrowaway|12 years ago|reply
Wow, I had no idea. If you'd asked me where to read older texts, I'd have said "Project Gutenberg."

This is miles better.

[+] rajbot|12 years ago|reply
Thanks for the kind words :)
[+] meritt|12 years ago|reply
Has there been any High Scalability article on their infrastructure? We have a similar need: storing a large volume of text-based content over a period of time, with versioning. On top of that we have various metadata. We're currently storing everything in MySQL -- a lightweight metadata row, plus a separate compressed table for the large (~400KB on average) BLOB fields.

We're looking at ways to improve our architecture: simply bigger and faster hardware? Riak with LevelDB as a backend? Filesystem storage with a database for the metadata? We even considered using version control such as git or hg, but that proved far too slow for reads compared to a PK database row lookup.

Any HN'ers have suggestions?

[+] mmagin|12 years ago|reply
I am a former Archive employee. I can't speak to their current infrastructure (though more of it is open source now - http://archive-access.sourceforge.net/projects/wayback/ ), but as far as the Wayback Machine goes, there was no SQL database anywhere in it. For the purposes of making the Wayback Machine go:

- Archived data was in the ARC file format (predecessor to http://en.wikipedia.org/wiki/Web_ARChive), which is essentially a concatenation of separately gzipped records. That is, you can seek to a particular offset and start decompressing a record, so you could get at any archived web page with a triple (server, filename, file-offset). The data was thus spread across a lot of commodity-grade machines.
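That record layout is easy to reproduce: a gzip decompressor stops at the end of one member on its own, so given a file offset you can pull a single record out of a concatenated file without touching its neighbors. A rough sketch of the idea (function names are mine, not the Archive's):

```python
import gzip
import io
import zlib

def build_arc_like(records):
    """Concatenate separately gzipped records, remembering each
    record's starting byte offset -- the essence of the ARC layout."""
    buf, offsets = io.BytesIO(), []
    for rec in records:
        offsets.append(buf.tell())
        buf.write(gzip.compress(rec))
    return buf.getvalue(), offsets

def read_record(blob, offset):
    """Decompress exactly one gzip member starting at `offset`;
    the decompressor halts at the member boundary, ignoring the
    records that follow."""
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # gzip framing
    return d.decompress(blob[offset:])
```

In the real system the blob would be a file on whichever storage node the index names, and you'd `seek(offset)` instead of slicing bytes.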

- A sorted index of all the content was built that would let you look up (url) to get a list of capture times, or (url, time) to get (filename, file-offset). It was implemented as a sorted text file (sorted first on url, then on time) and sharded across many machines by simply splitting it into N roughly equal pieces. Binary search across a sorted text file is surprisingly fast -- in part because the first few points you probe in the file stay cached in RAM, since you hit them frequently.
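Binary search over a sorted text file works by seeking to a byte offset, skipping forward to the next line boundary, and comparing that line's key. A minimal sketch of the technique (my own simplified line format, not the actual index schema):

```python
import io

def _line_at_or_after(f, pos):
    """Return (start, line) for the first complete line beginning at a
    byte offset >= pos, or (size, b"") if pos is past the last line."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # consume through the previous newline
    start = f.tell()
    return start, f.readline()

def lookup(f, url):
    """Binary-search a seekable file of sorted b"url time file offset\\n"
    lines for every line whose first field equals `url`."""
    f.seek(0, 2)
    lo, hi = 0, f.tell()
    # Find the smallest offset whose next full line has key >= url.
    while lo < hi:
        mid = (lo + hi) // 2
        _, line = _line_at_or_after(f, mid)
        if not line or line.split(b" ", 1)[0] >= url:
            hi = mid
        else:
            lo = mid + 1
    _, line = _line_at_or_after(f, lo)
    matches = []
    while line and line.split(b" ", 1)[0] == url:  # scan the run of matches
        matches.append(line.rstrip(b"\n"))
        line = f.readline()
    return matches
```

Each probe touches O(log n) points in the file, and as the comment notes, the early probe points are the same on every query, so they stay hot in the page cache.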

- (Here's where I'm a little rusty.) The web frontend would get a request and query the appropriate index machine. Then it would use a little mechanism (network broadcast, maybe?) to find out which server that (unique) filename was on, and request the particular record from that server.

(Edit: FYI, my knowledge is 5 years old now. I know they've done some things to keep the index more current than they did back then.)

At the very least, I'd think about getting your blobs out of MySQL and putting them in the filesystem; filesystems are good at this stuff. You can do something as simple as using the SHA-1 hash of the content as the filename, and then, depending on your filesystem's performance characteristics, add a couple of levels of directory tree: e.g. da39a3ee5e6b4b0d3255bfef95601890afd80709 goes into the directory da/39/. Then you stick da39a3ee5e6b4b0d3255bfef95601890afd80709 into the 'pointer' field in the table row that replaces the actual data. Obviously this design assumes the content of _that_ file doesn't change; if you want to change the data for a row, you write a new file in the filesystem and update the 'pointer'.

[+] ddorian43|12 years ago|reply
Hypertable (store data and metadata in different access groups - basically different files on disk, but within the same table) on top of QFS as the distributed file system, in Reed-Solomon replication mode (for more efficiency).

You can keep n versions (microsecond timestamps); data is kept sorted and compressed on disk for fast range reads, and you can compress and group blocks differently across access groups (access groups are just groups of columns).

It was based on Bigtable, which was written to store the crawl data (like you and the Wayback Machine?) for Google search.

[+] redact207|12 years ago|reply
I'm using Azure, which supports blobs of up to around 2GB each, has snapshots, and lets you attach metadata to blobs.
[+] alternize|12 years ago|reply
Awesome! It's a great tool for going back in time to check out our past websites, full of blinking GIFs and whatnot.

I didn't know that they also maintain the "HTTP Archive", showing website latency over time as well as some interesting live-statistics: http://httparchive.org/

[+] ersii|12 years ago|reply
As far as I know, Internet Archive does not maintain "HTTP Archive" (http://httparchive.org/). HTTP Archive was founded by and is being maintained by Steve Souders (Chief Performance Officer at Fastly - http://www.fastly.com/). He's previously held titles at both Google ("Head Performance Engineer") and Yahoo!. He's also a co-founder of the popular web-development debug add-on Firebug.

Sources: http://httparchive.org/about.php and http://stevesouders.com/about.php

[+] scott_karana|12 years ago|reply
Thanks, it's nice to see factual data backing trends in web design. :)
[+] Kenji|12 years ago|reply
Can anyone explain to me how displaying those sites on demand is not copyright infringement? I'm seriously curious, I don't know much about copyright laws.
[+] dvirsky|12 years ago|reply
Not a lawyer, but I'd guess this falls under fair use, which (per the Wikipedia article) covers both search-engine use and library archiving of content. http://en.wikipedia.org/wiki/Fair_use
[+] shmerl|12 years ago|reply
It should be fair use (non-commercial, library-purpose usage, etc.).
[+] Vecrios|12 years ago|reply
I still cannot fathom how they are able to store huge amounts of data and not run out of space. Anyone care to explain?
[+] dwhly|12 years ago|reply
From a conversation with Brewster a few years ago: the doubling of disk-drive density has allowed them to stay relatively neutral with respect to space for the Wayback Machine. It still occupies approximately the same physical footprint as it has for the last 10 years - essentially a set of racks about 15-20 feet long altogether, I think?

However, the new TV news search capability requires substantially more space than even the web archive, IIRC - or is certainly heading that way.

[+] jackschultz|12 years ago|reply
Funny story about the Wayback Machine and how it helped me. I had let my blog go into disrepair for a couple months, and eventually, when I went back to it, I found that since I hadn't kept up with security updates, I wasn't able to access any of my old posts.

When I went back to start writing again (this time using paid hosting so I didn't have to deal with that), I was disappointed that I wasn't going to have ~20-30 posts I had before. On a hunch, I checked the Wayback Machine and found that it had archived about 15 of my posts! Very excited that I could restore some of my previous writings.

[+] ultrasandwich|12 years ago|reply
> Before there was Borat, there was Mahir Cagri. This site and the track it inspired on mp3.com created quite a stir in the IDM world, with people claiming that “Mahir Cagri” was Turkish for “Effects Twin” and that the whole thing was an elaborate ruse by Richard D. James (Aphex Twin). (Captured December 29, 2004 and December 7, 2000)

Okay this just blew my mind. Anyone else follow Aphex Twin's various shenanigans? Was this ever investigated further?

[+] sutro|12 years ago|reply
Nice work on this over the years, gojomo et al.
[+] mholt|12 years ago|reply
Cool, but on a lot of sites (including some of my own, from 10+ years ago to recently) it hardly gets any of the images. Am I the only one experiencing this?
[+] buren|12 years ago|reply
If you're using AWS S3, check your resource policy.
[+] rietta|12 years ago|reply
Wow, that's one billion more pages than there are stars in our Milky Way galaxy. That's a lot!