
A Thread about Internet Archive's “Silent Killer”

589 points | danso | 6 years ago | twitter.com | reply

261 comments

[+] commoner|6 years ago|reply
The Internet Archive is absolutely essential to Wikipedia, whose articles are required to be verifiable to reliable sources. When pages go offline, the link rot makes it harder for articles to be verified. However, if the Wayback Machine has an archived copy, that copy can then be cited as the source and made available to readers and editors. The Wayback Machine now automatically archives every new external link added to the English Wikipedia.

https://en.wikipedia.org/wiki/Wikipedia:Link_rot

The Wikimedia Foundation's budget is about 10 times that of the Internet Archive. If you see the fundraising banner on Wikipedia and want to help out the site, but don't think the Wikimedia Foundation needs the donation, consider donating to the Internet Archive instead.

https://archive.org/donate

[+] ignoramous|6 years ago|reply
> The Wayback Machine now automatically archives every new external link added to the English Wikipedia.

Links submitted to Hacker News should be auto-archived, too: I often stumble upon dead links [0] that once generated insightful discussion on news.yc. Adding an "archive" link next to "web" would work nicely.

[0] Launch HNs, in particular.
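
Auto-archiving on submission would be straightforward to bolt on: the Wayback Machine exposes a public availability API, and fetching https://web.archive.org/save/<url> triggers a capture. A minimal sketch of the glue code (the helper names are mine, not any existing HN feature; the sample dict mirrors the availability API's documented response shape):

```python
from typing import Optional

# Public Wayback Machine endpoints.
AVAILABILITY_API = "https://archive.org/wayback/available?url={}"
SAVE_ENDPOINT = "https://web.archive.org/save/{}"

def save_request_url(url: str) -> str:
    """URL to fetch in order to trigger a fresh capture of `url`."""
    return SAVE_ENDPOINT.format(url)

def closest_snapshot(api_response: dict) -> Optional[str]:
    """Extract the closest archived copy from an availability-API
    response, or None if the page was never captured."""
    snap = api_response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

# Shape of a real availability-API response:
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20190101000000/https://example.com/",
            "timestamp": "20190101000000",
            "status": "200",
        }
    }
}
print(closest_snapshot(sample))
```

On submission, a site would fetch `save_request_url(link)` once, then show the `closest_snapshot` URL next to the story forever after.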

[+] Bartweiss|6 years ago|reply
This is a great point, thank you.

It's been awesome to see archiving and Archive links become standard practice on Wikipedia. For a long time, links just rotted and were replaced. Then later, dead links would be switched to Archive content if it was available. But it's so much easier and more reliable to archive a page when it's referenced, and provide both links even while the page is still up.

It's one thing to automate dead link detection, but archiving defends against a lot more. Universities seem to redo their sitemaps and redirect all their old links about once a year. And reference pages get edited, which goes unnoticed for a long time and then leads to frustrating [Not in citation given] tags. It's incredible what a boon the Internet Archive has been to wiki sourcing.

[+] filmgirlcw|6 years ago|reply
Fantastic point. I personally would much rather support the Internet Archive than the WMF.

Also, there is a 2:1 donor match happening now, effectively tripling each donation. Moreover, if your company matches 501(c) donations (mine does, at least up to something like $20k a year), you can have them match your donation too, effectively giving the archive 4x your original donation.

[+] slim|6 years ago|reply
Wikipedia should put up a banner to raise funds for the Internet Archive every year.
[+] acqq|6 years ago|reply
> consider donating to the Internet Archive

It is immensely important, especially as there is a strong trend toward hiding content behind paywalls and registration pages. Even if something is accessible today...

And there are always forces with an interest in making some inconvenient data just disappear.

[+] rbanffy|6 years ago|reply
> consider donating to the Internet Archive instead.

If you happen to have power to do that at a large cloud provider, consider donating storage space and data movement services.

Having a backup of the Internet Archive on cloud providers would offer an additional safeguard against data loss.

[+] lonelappde|6 years ago|reply
Wikipedia requires available sources, not reliable sources. Any published website is acceptable, regardless of its fact-checking standards.
[+] tetha|6 years ago|reply
I've discovered their efforts to archive old DOS games... fully playable in the browser through a DOSBox build in WASM, as far as I know. That's a really impressive cooperation between very old and very new technology - 16-bit up to the edge of JS (- edit: or rather, the edge of browser-based computing).

And on that train of thought, I just had a little flashback about storage sizes. There was some time when 3.14 MB was a big unit of measurement, and some 50MB drive was huge. The classical example: Monkey Island, Indiana Jones or Day of the Tentacle on ~12 floppy disks. But you had to choose which 1 or 2 to install because you didn't have enough space on your hard drive, or you had to swap disks every few screens :)

Or, they are archiving a lot of video playthroughs in the let's-play style from video sites, in case those sites go through a meltdown like Viddler did.

I guess I'm rambling. Point is: This isn't just a storage dump. There are also interesting projects around the internet archive to make the old things accessible on new systems. Very worthwhile donating to.

[+] theandrewbailey|6 years ago|reply
> There was some time when 3.14 MB was a big unit of measurement

I don't know what storage medium that refers to. Plain old 3.5" floppies only went up to 1.44 MB.

[+] elweston2|6 years ago|reply
Most of that effort to archive games is not IA's. It comes from two other groups who sometimes work with IA, when they're not all fighting. One group goes through all of the games and configures them to work correctly in DOSBox. The other just catalogs them. Getting the metadata on that stuff is kind of tough now. What was the exact date a game came out? Does it still have artwork? Is there some odd protection scheme going on? Is there an IMG file for the disks? Or is it just some random pile of files? Can you get a copy from eBay? Etc., etc. IA typically notes who is doing the archive work. IA also respects takedown notices for these games, as many of them have sort of come back from the dead and are sold again, or the company just does not want its IP on some random site. For a good portion, though, no one really knows who owns them anymore. Some have a clear lineage. Some have been sold over and over to random companies, and depending on the contracts, no one knows.

That 'scene' is also full of drama and very insular. Donate to IA to help them build better infrastructure and better search. But donating because they host old abandoned DOS games is not, I'd say, a good reason; it's misplaced, because the people doing the majority of the work are doing it because they like it, not for money. IA does have some work in that area, such as getting DOSBox and MAME running in a browser. Just be clear on what you are donating to.

[+] redisman|6 years ago|reply
I encourage everyone to also think about what you enjoyed on the internet or in computer software in the '80s, '90s, and '00s, and see if it is still available somewhere.

I found out that the local games scene I used to love as a kid had been almost wiped off the face of the earth. These were small non-commercial and shareware games, localized in just one language, so it was already a niche. The free hosting services of the '90s are gone, so those sites are down, and no one wanted to keep paying for hosting for 20+ years on sites that get very few visitors nowadays.

The only way to get these games again was to find a Discord group and a friendly stranger who agreed to seed a torrent (which had 0 seeds when I found it). I'm looking to upload them to a couple of different places and compile a basic website catalogue (static site on CDN) one of these days. For the layperson, these games are already gone from the internet.

The Internet Archive does a great service, but it is breadth-first and quite surface-level. The depth has to come from people who were familiar with the sites at their peak. And there's a big chance that no one is doing that for your specific interest.

[+] foxthatruns|6 years ago|reply
BlueMaxima's Flashpoint is a fabulous archival project that is saving as many Adobe Flash games/animations as they can before browsers pull support at the end of 2020. Really cool, since Flash games are a similarly concrete slice of culture/history that will just be gone if they're not archived.

https://bluemaxima.org/flashpoint/

[+] dicknuckle|6 years ago|reply
Torrents can be seeded from HTTP sources, known as a WebSeed. I've done it with Dropbox and they don't care as long as they don't get a copyright infringement notice. Easy way to use free storage as a CDN.
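
For context on how that works: a WebSeed (BEP 19) is just an extra `url-list` key in the torrent's bencoded metainfo, pointing clients at a plain HTTP mirror. A toy sketch of the encoding (a minimal bencoder, not a full torrent builder; the example URL is made up):

```python
def bencode(obj) -> bytes:
    """Minimal bencoder covering the types a metainfo file uses."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%b" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # Bencoded dicts must have keys in sorted order.
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError(f"cannot bencode {type(obj)}")

# A WebSeed is just this key alongside the usual announce/info keys:
webseed = {"url-list": ["https://example.com/f"]}
print(bencode(webseed))  # b'd8:url-listl21:https://example.com/fee'
```

Clients that support BEP 19 will fetch ranges of the file over plain HTTP from that URL, which is why any dumb HTTP host (Dropbox included) can act as a seed.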
[+] oneepic|6 years ago|reply
YouTube has many channels with playthroughs of old games as well. (Old for me is the '90s and early '00s, since I'm in my twenties.) You can find things like the old Lego PC games, old educational games like Physicus/Bioscopia, kids' games like Putt-Putt (not least because they have a speedrunning community), and much more. It's been a great resource to go back to the stuff I played growing up.
[+] tus88|6 years ago|reply
If you are into retro computing, drivers for old sound/graphics cards and motherboard BIOS updates it is invaluable (when it works).
[+] starsinspace|6 years ago|reply
Although efforts like the Internet Archive are noble (and I find it occasionally useful), I'm not sure it's always so great that everything anyone does online will be permanently archived.

I know many people feel that everything should be available forever. But for me... it's pushing me away from doing much on the web. I liked it in the 90s when things were more ephemeral. When you could make mistakes and not have them easily found by anyone with a few clicks, forever.

[+] VonGuard|6 years ago|reply
You're right, let's burn the library down because one book has a libelous chapter in it.

This argument is so horrible as to be actively harmful to Archive's work. Jason Scott is a god, and if we didn't have him, we'd have to invent him.

WE DO NOT GET TO CHOOSE WHAT THE FUTURE FINDS INTERESTING.

We live in the only point in human history where we can actually save all of humanity's knowledge and culture, and we can do so without having to worry about physical space or staff to work the "library." It's a remarkable time we live in, and yet, 99% of our society either doesn't care, thinks this work is stupid, or actively works against it through horrific copyright laws.

We know more about how Rembrandt painted and lived than we do about how Atari 2600 programmers worked and lived. I can go to Rembrandt's house and see where he lived, where he painted, how he worked, where he slept and ate and mixed his paints and taught his classes.

Atari's old HQ is just another office building. The source code to those games is mostly gone (thankfully, it's assembly and easier to disassemble). We need to save our culture and digital heritage, else we forget where we come from.

Deleting some old tweets is one thing, but actively worrying about Archive's work is just harmful to us all. We need 10,000 more Archives, dammit. It's supremely important work that is helping stem the tide of lost culture due to stock market forces. Geocities is gone forever because Yahoo! didn't find it profitable. This cannot keep happening.

[+] gravitas|6 years ago|reply
My personal, obscure ISP user page (think the ~user/ era) from 1995 is preserved in all its drop-shadow, blink-tag, marquee glory at archive.org with me doing nothing; it was just captured by whatever natural processes. The things I said on mailing lists, random forum posts, etc. - it's all archived. That '90s stuff isn't/wasn't as ephemeral as folks think, in my opinion; it's out there somewhere. $0.02 :)
[+] DuskStar|6 years ago|reply
> I'm not sure it's always so great that everything anyone does online will be permanently archived.

But you see, even if the Internet Archive didn't exist, someone would probably still be saving a copy of the things you do. It'd just be a megacorp or surveillance agency instead of a more egalitarian organization.

So the choice isn't "things on the internet are ephemeral" or "things on the internet are available forever to everyone", it's that or "things on the internet are available forever to some subset of the rich and powerful".

[+] garaetjjte|6 years ago|reply
Maybe if everything were archived forever, we would come to understand that everybody makes mistakes, and stop paying so much attention to old posts? Though I admit this is a very optimistic view of human behavior.
[+] Avamander|6 years ago|reply
Some mistakes are also worth recording. I like seeing bad predictions of the 2000s from the 1970s for example.

Not to mention that quite a lot of what is archived today has been made by companies, and there's no "right to be forgotten" that companies could ever deserve. For example, I've uncovered quite a few mistakes in currently public datasets/websites based on archived sites; who knows how many mistakes are made now and never fixed because we lose the original sources. Point being, the lack of an original source doesn't mean the information gets lost; it just becomes a big version of the kids' game "telephone," where everyone recites what they heard and it gets distorted in the end.

[+] strenholme|6 years ago|reply
>I'm not sure it's always so great that everything anyone does online will be permanently archived

The real problem here is the runaway cancel culture, where we attack people for things they said or did years or decades ago which were (at the time) perfectly acceptable and reasonable.

The most egregious example I have seen so far is cancel culture advocates who think we should disregard the late Richard Feynman’s legacy because he said some rude things to a lady back in 1946, even though the lady herself was not offended, since she did sleep with him later that same evening.

There’s a point where we just have to say “That was a long time ago, no one at the time was offended, get over it.”

[+] Bartweiss|6 years ago|reply
The consolidation and permanence of the web are definitely concerning.

Moving from "somebody knows this happened" or "this is in a file drawer somewhere" to "there's a searchable record of this" expands everyone's access to the info, and can do a lot to stave off forgetfulness and bit rot. But the people who gain the most access are the ones who weren't involved in the first place, and the intersection of "uninvolved" and "cares enough to check" tends to be people who are actively hostile. Hence doxxing, stolen photos, and callouts over years-old tweets.

But that's a broad result of digitization. If a reporter or opposition researcher wants to embarrass someone, they can already look through digitized student newspaper essays, find interview subjects off class rolls, or simply comb through Twitter for long-forgotten offenses. (This holds for both good and ill - it applies to both serious skeletons and misleading or trivial issues.)

The Internet Archive, then, seems like sousveillance offsetting surveillance. For those who can point time, money, and connections at a target, it's enough that evidence exists, and more than enough that it's available online. But for the general public, it's much harder to keep track of countless sources or publicize news. If you can't dedicate interns and an archive to tracking every news story you read, you can't find or prove edits. (And while most newspapers noted corrections or morning/evening revisions, silently changing online stories has become common practice even for the likes of the BBC.) If you can't point out a webpage or tweet to thousands of people at once, the evidence is likely to be taken down before it's recognized. There are a lot of dedicated sites like NewsDiffs working on this problem, but Internet Archive provides a general-purpose answer to "let an average person see the history of a page or create a trusted record of it".

I worry that this just amounts to an eye for an eye, and still increases the total amount of scrutiny we're all under. But as long as more content is becoming permanent, it still seems better to have symmetrical access to it.
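
The silent-edit detection NewsDiffs does is conceptually simple once you have two dated captures of a page; a toy sketch with made-up article text:

```python
import difflib

# Two hypothetical captures of the same article, a day apart.
old = ["The minister denied the report.",
       "Officials declined to comment."]
new = ["The minister confirmed the report.",
       "Officials declined to comment."]

# unified_diff surfaces exactly what was silently rewritten.
diff = list(difflib.unified_diff(old, new,
                                 fromfile="capture-2019-01-01",
                                 tofile="capture-2019-01-02",
                                 lineterm=""))
print("\n".join(diff))
```

The hard part isn't the diff, it's having the earlier capture at all, which is what the Wayback Machine provides for free.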

[+] krapp|6 years ago|reply
It's not actually true that everything anyone does online will be permanently archived. If it were, there would be no need for the Internet Archive.

The truth is, only the things someone has an interest in archiving will be archived, and only so long as someone has an interest in maintaining those archives. Just look at the recent announcement about Yahoo Groups... no one was, and likely no one is, going to permanently archive most of that. Sites, content and history get lost all the time.

[+] NeedMoreTea|6 years ago|reply
I think it would be reasonable to establish a bar, similar to the offline world, where everything above it is archived and everything below is optional opt-in.

In the offline world the National Libraries get a copy of every book, magazine and newspaper published, by law. At least that's the way the UK and US do it. They archive a lot of other stuff as well, including music, audio, adverts, but that's more informal, and there is no requirement to preserve.

Personally, I'd like the things politicians and personalities (by dint of having chosen to live large) say online archived, along with everything from businesses (to later hold them to account) and the sites of anyone in the business of influence - think tanks, parties, lobbyists, activists, "grass roots" organisations, etc. Individuals, anon forums, HN, reddit subs, and other places for shooting the breeze should be allowed to stay ephemeral. In fact, I think conversation is freer that way - some will choose to say less, say different things, or say nothing if everything everyone says is forever...

[+] enumjorge|6 years ago|reply
Not only that, but I also wonder if we're overestimating the value of keeping all of this data around. Who's going to have the time to search and curate these mountains of information when we're generating tons more of it every day? I imagine the ideal goal is to allow future historians to learn about our past selves, but I think there's a tipping point where only those with lots of resources can afford to meaningfully consume it. Those are typically wealthy companies or individuals, and I'm generally less excited about what they'd do with our information.

Obviously there's value in archiving some information, but a save-all or even save-most approach starts sounding a little hoarder-ish. Sure, you might one day make use of that November 1997 TV guide, but chances are you won't, and in the meantime you're paying the opportunity cost of storing it.

Maybe we need to take a page from Marie Kondo and only keep that which sparks joy and learn to let go of the rest. There's a chance someone will need a bit of info that no longer exists, but we'll probably be ok.

[+] crucialfelix|6 years ago|reply
In the not-so-distant future, many people will record their entire lives: movements, utterances, biometrics, audiovisual and sensory data. Then they are going to freak out when dead people's lives start getting deleted, because nobody is going to pay to host all this crap.
[+] joe_the_user|6 years ago|reply
So, copyright conditions are apparently another silent killer [1].

A website can be archived, vanish from the web, and then vanish from the archive for technical copyright reasons (a new owner's robots.txt file at the root). So "archiving the archive" might be useful. Or something.

Ezboard was an old discussion site that contained much of interest - it was archived, and now the archive is not accessible.

https://archive.org/post/389127/ezboard-content-suddenly-not...
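
The retroactive-exclusion mechanism is plain robots.txt matching against the archive's crawler: two lines published by a new domain owner are enough to hide every old capture. A sketch using Python's stdlib parser (`ia_archiver` is the Wayback crawler's user agent; the URL is made up):

```python
from urllib.robotparser import RobotFileParser

# What a new domain owner might publish, years after acquiring the name:
rp = RobotFileParser()
rp.parse([
    "User-agent: ia_archiver",
    "Disallow: /",
])

# Every historical capture under the domain is now blocked for the
# archive's crawler, while other bots remain unaffected.
print(rp.can_fetch("ia_archiver", "https://example.com/old-forum/thread/42"))
print(rp.can_fetch("SomeOtherBot", "https://example.com/old-forum/thread/42"))
```

Note the asymmetry: the rule is evaluated against today's robots.txt, not the one that existed when the pages were captured, which is exactly why content vanishes retroactively.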

[+] mastazi|6 years ago|reply
> A shutdown announcement put these at risk. We worked with the founder, @tedr (RIP), who'd left the company, to save as much as we could. 3.5 Terabytes

I work for a company that owns a website containing information that in my opinion is valuable to the public. The website may go offline forever in the coming months. How can I get in touch with the Archive, to ensure that the content is saved? Parts of the content are not easy to index (e.g. there are "hidden" pages that you will only find if you have the exact URL), I can assist with that.

[+] jedberg|6 years ago|reply
It seems they could save some money by moving a bunch of infrequently accessed data to warm storage. The entire archive does not need to be accessible 24/7.

I would be perfectly ok if I was trying to see a copy of a web page from five years ago, and it said that I had to make a request and it would be available in five or ten minutes.

I think I could wait five or ten minutes for a web page to get pulled from the archives.

[+] bpaddock|6 years ago|reply
The Food and Drug Administration (FDA) has been archiving the FDA.gov site at archive-it.org .

This leaves a lot of dead links on the FDA site. Sometimes they tell you to look in the archives for the old information without giving you a link to it, and sometimes they don't; they just expect you to know.

Now why can't the FDA afford the space to keep their pages forever on their own site? Fill in your favorite conspiracy theory...

Some of the information that has been removed, such as the 2015 hearings on fluoroquinolone antibiotics, is important health research, to give just one example.

https://archive-it.org/organizations/1137

[+] tyingq|6 years ago|reply
Pretty amazing what they do with 1/10th the revenue of Wikimedia and quite a lot more data to manage.
[+] notacoward|6 years ago|reply
The real "silent killer" I see here is the reliance on mirroring. One-failure protection, with a 2x expansion factor. As it happens, I work on large storage systems, where 2x is our maximum expansion factor and for that we get resistance to as many as nine simultaneous failures. Across power and network failure domains, with multiple kinds of background scrubbing to detect loss of that redundancy. Oh, and 60PB is something we might add to an existing cluster for a day to absorb transient I/O load. There's also a bunch of monitoring and automation stuff that should be considered "table stakes" for storage at these scales. Seems like an opportunity to use what I and others have learned for a good cause, to make this valuable resource more efficient and more durable all at once.
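
To make the mirroring-vs-erasure-coding comparison concrete, here is the arithmetic behind that claim, phrased as a (k, m) code: k data shards plus m parity shards cost (k+m)/k times the raw data and survive any m simultaneous losses. Mirroring is the degenerate k=1, m=1 case. (The parameters below are illustrative, not IA's or the commenter's actual configuration.)

```python
def expansion_factor(k: int, m: int) -> float:
    """Storage cost of a k-of-(k+m) erasure code relative to the raw data."""
    return (k + m) / k

def failures_tolerated(k: int, m: int) -> int:
    """Any m of the k+m shards can be lost without losing data."""
    return m

# Mirroring: 2x the storage, survives one failure.
print(expansion_factor(1, 1), failures_tolerated(1, 1))   # 2.0 1

# A 9-of-18 scheme: the same 2x storage, survives nine failures,
# matching the trade-off described above.
print(expansion_factor(9, 9), failures_tolerated(9, 9))   # 2.0 9
```

The catch, and presumably why IA mirrors, is that wide codes require spreading shards across many independent machines and paying reconstruction I/O on every failure.
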
[+] maxton|6 years ago|reply
I donated a week or so ago. The Internet Archive has come in handy many times for me, not just for the Wayback Machine but also things like their live music archives. They're an indispensable resource.
[+] nym|6 years ago|reply
Set up a monthly $5 donation. It's not much, but I know how valuable the archive is... they're doing the work the Library of Congress SHOULD be doing.
[+] duelingjello|6 years ago|reply
IA and the WBM are great and essential, like a Library of Congress/Smithsonian. What's frustrating about some old websites, like Microsoft's or Borland's FTP download areas, is that dynamic links weren't followed and can't be followed now, and some sites used user-agent filtering. CDN links also weren't captured well.

There are so many retro patches that just don't exist publicly. For example, a number of files from SciTech's site have zero captures in the WBM. Most FTP sites weren't captured adequately in the WBM either. There are spots of FTP archives hosted here and there on IA and elsewhere, but they're not like the WBM for static content sites, and a single snapshot archive lacks the history and the changes, before and after. It is what it is, unless folks donate their vintage personal/work local mirrors to add to the collective.

[+] phendrenad2|6 years ago|reply
The Internet Archive is great, but it risks becoming a single point of failure if we rely on it too much. It would also be good to take some of the server load off of them. One possible solution is for smaller archives to exist. So if you're interested in archival work, and you have some spare time and cash, consider not only donating to IA, but also setting up your own archive site with content on whatever category or topic you find interesting enough to archive.
[+] rikroots|6 years ago|reply
The British Library's web archive[1] does similar archiving work, but limits itself to 'British' sites - in other words, sites with a .uk domain. I've had good dealings with the admins before, when I submitted some of my personal sites for inclusion in the archive.

Interestingly, the British Library uses web crawlers based on the Internet Archive's Heritrix web crawler[2], which demonstrates how important IA's work is for many other archival organisations' work.

[1] https://www.webarchive.org.uk/ [2] https://github.com/internetarchive/heritrix3

[+] echelon|6 years ago|reply
A large fraction of the data could be saved if we extricated text content from the HTML skeleton that contains it.

I wish the Semantic Web had taken off. "Pages with styling" was suboptimal. Web apps are such a weird evolutionary branch we've descended into, one that doesn't relate to documents at all.

Content should instead have fallen under a type of ontology: news item, blog post, technical reference, comment, status update, ... If we'd adopted such a markup grammar and styled around it, we could parse out meaning, have stronger links in the graph, and compress better.

The Semantic Web would have happened if the commercial web hadn't outpaced it.
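
Even without that ontology, the basic "extricate the text from the skeleton" step is cheap to sketch with Python's stdlib parser (the sample page is made up):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content, discarding the HTML skeleton."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep non-whitespace text nodes; tags and attributes are dropped.
        if data.strip():
            self.chunks.append(data.strip())

page = ("<html><body><div class='hero'>"
        "<h1>Title</h1><p>Body text.</p>"
        "</div></body></html>")
ex = TextExtractor()
ex.feed(page)
print(" ".join(ex.chunks))  # Title Body text.
```

For this tiny page the markup outweighs the text several times over, which is the storage argument in miniature; what tag stripping loses, of course, is exactly the structure a semantic vocabulary would have preserved.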

[+] toyg|6 years ago|reply
TIA should get money from the UN. Or be the sole beneficiary of a flat tax on network ports - I bet even just $5 on every small router sold in the US (which is basically nothing) would generate a ton of money for them.
[+] alwillis|6 years ago|reply
Let’s not forget that we can all participate in archiving web data using IPFS[1], which the Internet Archive is also using.

And coming soon, you’ll be able to get paid to make content available using IPFS and FileCoin[2].

[1]: https://ipfs.io/

[2]: https://filecoin.io/

[+] mirimir|6 years ago|reply
It's too bad that there's not a vaguely-somehow-related-but-not-really and impossible-to-censor service that retains stuff that sites have excluded using robots.txt or whatever.