Internet Data Is Rotting

[+] gambler|6 years ago|reply

This needs to be solved on the protocol level. Of course, the players who have control over our protocols are exactly the people who don't want this to be solved at all.

The next best thing would be to redefine what "bookmarking" is. When I bookmark a page, I want it to be permanently stored on my local machine and full-text indexed. In fact, it's rather ridiculous that after 25 years browsers don't have anything of this sort. Unfortunately, the most popular browser in the world is controlled by the same people who control our protocols.

If I ever get the energy, I will attempt to write a browser extension for this.

[+] dlkf|6 years ago|reply

> As of last fall, its Wayback Machine held over 450 billion pages in 25 petabytes of data. This would represent .0003% of the total internet.

> Universities, governments and scientific societies are struggling to preserve scientific data in a hodgepodge of archives, such as the U.K.‘s Digital Preservation Coalition, MetaArchive, or the now-disbanded collaborative Digital Preservation Network.

Like any conservation work, the benefits are incredibly easy to ignore - until something goes awry / stops getting funding and suddenly it's too late. Consequently it's easy to have a myopic view of the issue.

These organizations are doing very important work, and I hope that internet users and governments don't take them for granted.

[+] mnl|6 years ago|reply

We should aim at browsing the Internet by date. We're moving everything there ignoring the fact that as it is there is no built-in permanence. We're accustomed to that after Gutenberg: it wasn't that easy losing every copy of an important document. Now it is, things disappear and we're getting into a drifting cultural bubble impossible to trace back.

The Internet Archive is doing God's work, but it's not enough. If you don't have the URL of a site that is gone, you probably won't find any reference to it after every online hyperlink to it has disappeared as well. It might become then inaccessible after a while, stored yet gone anyway.

[+] dredmorbius|6 years ago|reply

"Browsing by date" is pretty much Brewster Kahle's ideal, and is what the Internet Archive's Wayback Machine approximates, thanks in large part to WARC storage.

I'd also like to see a distinction between the idea of web servers, which are really publishers, and where archives are kept. Ideally not all in one single store, a/k/a the Internet Archive, but replicated fairly widely.

[+] thaumasiotes|6 years ago|reply

> We're accustomed to that after Gutenberg: it wasn't that easy losing every copy of an important document.

The effort has been made. Look at what happened with Tibet:

> Caroe pulled off a rather notorious subterfuge in order to buttress the British claim to Tawang: he published the Simla Convention for the first time in 1938 with a note misrepresenting that it had included settlement of the border (and alienation of Tawang); and he arranged for the publication of official Survey of India maps that, for the first time, showed the McMahon Line as the official boundary. To advance the narrative, he also corresponded with commercial atlas publishers to put the McMahon Line on their maps as well.

> In a telling indication of Caroe’s jiggery-pokery, to avoid the awkward question of why he was first publishing the Simla Convention twenty-four years after the fact in 1938, he instead arranged for the surreptitious printing of a spurious back-dated edition of Aitchison, deleting the original note about the Chinese government’s non-signature, and replacing it with a lengthy note stating, quite falsely, that “The [Simla] Convention included a definition of boundaries…”

> Since 1) the McMahon Line had been concluded in secret bilateral negotiations between Tibet and Great Britain outside the Convention and 2) the Chinese had officially refused to recognize any bilateral agreement, boundary or otherwise, between Tibet and Great Britain and 3) had declined to sign the Simla Convention itself and 4) had notified Great Britain in 1914 that the specific sticking point was “the boundaries” this was hoo-hah.

> The replacement copy was distributed to various libraries with instructions to withdraw and destroy the original edition.

> The subterfuge was only discovered in 1963 when J.A. Addis, a British diplomat, discovered a surviving copy of the original edition at Harvard and compared it to Caroe’s version.

( http://www.unz.com/plee/the-myth-of-the-mcmahon-line/ )

Wikipedia confirms this, if you look hard, in a shockingly non-judgmental way:

> Simla was initially rejected by the Government of India as incompatible with the 1907 Anglo-Russian Convention. The official treaty record, C.U. Aitchison's A Collection of Treaties, was published with a note stating that no binding agreement had been reached at Simla. Since the condition (agreement with China) specified by the accord was not met, the Tibetan government didn't agree with the McMahon Line.

> The Anglo-Russian Convention was renounced by Russia and Britain jointly in 1921, but the McMahon Line was forgotten until 1935, when interest was revived by civil service officer Olaf Caroe. The Survey of India published a map showing the McMahon Line as the official boundary in 1937. In 1938, the British published the Simla Convention in Aitchison's Treaties. A volume published earlier was recalled from libraries and replaced with a volume that includes the Simla Convention together with an editor's note stating that Tibet and Britain, but not China, accepted the agreement as binding. The replacement volume has a false 1929 publication date.

( https://en.wikipedia.org/wiki/Simla_Accord_(1914) )

[+] zshbleaker|6 years ago|reply

Things got far more worse in China right now.

Baidu Tieba, which could be considered as Reddit for China, just made all posts before 2017-01-01 inaccessible. And a number of other online forums are doing the same thing due to political reasons.

[+] cbluth|6 years ago|reply

would you elaborate on the political reasons?

[+] mwest|6 years ago|reply

Back in the mists of time, I used to use wwwoffle proxy. It was great for low-latency links, but also had the benefit of keeping an offline archive of whatever you'd browsed.

Project's still there, although not sure how well it does with the modern web.

http://www.gedanken.org.uk/software/wwwoffle/

There are a bunch of more modern variations too:

https://archivebox.io/ - "Your own personal internet archive"

https://getpolarized.io/ (as seen on HN previously)

https://github.com/kanishka-linux/reminiscence

https://github.com/fake-name/ReadableWebProxy

[+] dredmorbius|6 years ago|reply

Sadly, a lot of old-school proxies (squid, privoxy) are stymied by SSL/TLS connections.

I think we're due for the idea that a proxy can be designated as a trusted intermediary, most especially if it's run on a personal basis. I'm sure this presents security issues, but it also avoids some.

[+] patrick5415|6 years ago|reply

Yeah it’s annoying that links get broken. But maybe it’s better this way. There’s something about modern tech that has turned all of us into digital horders. I (we?) have backups and backups of backups and redundant RAID servers with every version of every file so that no byte shall ever perish. I have essays still that I wrote in highschool nearly 20 years ago. To what end? I’m partial to material minimalism. Why not data minimalism?

[+] Hextinium|6 years ago|reply

My common response is that there are real world costs to having too much stuff, data is pretty small to keep around. You can fit a RAID 5 box with 30 TB of space in a shoebox for $15/month, that is enough to keep pretty much any content that you ever consume so that if you want anything it still exists. My parents and grandparents hoarded files and documents to no end and I have been going through some of them and there is mostly garbage but there is some real gems in the rough. Willingly disposing of the internet through our own negligence is something I don't advocate for because there is some sort of value preserved in what we save for the next generation.

[+] ddebernardy|6 years ago|reply

Oh, that's a simple one: lack of time. I too have some very old data in an archive that's on some old hard drive somewhere in a storage. Assuming the HD still works, I could go in there, destroy the data, and trash it. But I'm fairly certain I will never find a moment to do so.

Managing what data you keep is just too time consuming. Hoarding it all is simpler.

[+] hashkb|6 years ago|reply

> Then there is also a problem of software preservation: How can people today or in the future interpret those WordPerfect or WordStar files from the 1980s, when the original software companies have stopped supporting them or gone out of business?

This issue in particular we have great solutions for (open formats / text), but they are of course less profitable than only-my-app-can-read-this formats.

[+] Crinus|6 years ago|reply

FWIW those particular formats are widely understood even if they are proprietary (well, at least in WordStar's case). And as long as the software runs (be it natively or via an emulator or VM), you can always open and convert/print the files (e.g. you could use vDOS to run WordStar or whatever and use its printer emulator functionality with Windows' PDF printer to create a PDF from the WordStar files).

[+] Causality1|6 years ago|reply

I read somewhere that the lifespan of the average hyperlink is only about two years.

I count myself lucky I was introduced to the HTTRACK archiver program many years ago and thus have complete offline copies of many of my favorite websites of the early 00's.

[+] adossi|6 years ago|reply

Can you give some examples of these 'favorite websites'? I'm interested in knowing what kind of website would be so interesting that I would want an entire offline copy of it. (Besides maybe Wikipedia)

[+] rchaud|6 years ago|reply

I run a music review website in my free time and I'm extremely envious that you were able to archive that stuff. The indie music sites of the early '00s (pre-social media) were a goldmine of non-corporate DIY journalism and analysis that simply doesn't exist anymore.

[+] dredmorbius|6 years ago|reply

According to the Internet Archive, it's closer to 100 days.

https://blogs.loc.gov/thesignal/2011/11/the-average-lifespan...

https://blog.archive.org/2013/10/25/fixing-broken-links/

[+] cbluth|6 years ago|reply

you should upload them somewhere for posterity

[+] Crinus|6 years ago|reply

Similarly, also even today i use wget --clone and the Firefox addon Save Page WE to save interesting pages (the latter works for single pages, but it is useful for blog articles, etc).

[+] dev_dull|6 years ago|reply

I’m okay with internet rot and you should be too. I’m not sure where we got the idea that “our data must be preserved forever”. This can be especially harmful for teens and young adults whose indiscretions now follow them forever.

Think of the privilege you had when you were younger. You could do something stupid and nobody could whip out a high def camera to record it and make it part of your history forever.

Let it rot.

[+] oceanplexian|6 years ago|reply

I'm OK with it because otherwise you are whitewashing history.

For example I have recordings of the Colbert report going back to ~2005. Some of his skits released during that time would be classified as "hate speech" in 2019. Of course he, and mainstream broadcasting companies would love it if you didn't think about that. There are plenty of news clips and interviews where mainstream politicians (on Left AND Right) casually dismiss gay marriage. Powerful tech influencers like Mark Zuckerberg would love it if his IMs would disappear from the Internet. The examples go on and on.

[+] asdff|6 years ago|reply

I encountered rot today when I was trying to repair a set of speakers. I found the guide on some audiophile forum, step by step with pictures for every removed screw and multiple angles. Only the pictures were originally hosted on photobucket and since retroactively removed.

Some reddit users are egregious about it, installing scripts that overwrite their comments after x amount of time, seemingly oblivious to the fact that every edit on reddit can be found through archival tools. The solution is to take better care to not conflate your anonymous online persona to your real life persona—just don't post identifying information publicly online and you will be head and shoulders above many users on the internet in terms of privacy. There's no need to purge the internet of it's collective knowledge and history.

We are very lucky to have the wayback machine preserving this stuff from dissapearing into the void, but it doesn't cache everything on the internet, especially if that forum I visited had shut down and became impossible to find in a search result.

Side note: Is there an extension or bookmarklet available to automatically pull a web archive?

[+] dlkf|6 years ago|reply

You are potentially conflating two different issues.

> Think of the privilege you had when you were younger. You could do something stupid and nobody could whip out a high def camera to record it and make it part of your history forever.

This is an excellent point, and I agree 100%. But it's not a priori obvious that we need to embrace internet rot to solve this problem. Perhaps we can work towards a future where privacy rights and digital literacy protect individuals from offense archaeology, and at the same time, work to preserve the knowledge, ideas, and sheer human weirdness that get posted every day (as long as the creators don't want it removed. My understanding is that the Internet Archive has explicitly stated that they are not interested in preserving stuff where there has been a takedown request.)

One might argue that we have to make a choice. This is touched on in the article - redundancy presents a tradeoff between persistence benefits and security/privacy risks. But we should examine this tradeoff more closely and see if there isn't a healthy middle ground before we ask for one side wholesale.

[+] GuiA|6 years ago|reply

Letting MySpace rot is fine.

As the article points out, a more concerning issue is that “universities, governments and scientific societies are struggling to preserve scientific data”.

[+] 1000units|6 years ago|reply

I think that ship has sailed. All our most personal data is being archived forever competently by multiple parties.

[+] dredmorbius|6 years ago|reply

Among the handy tools that can be used to save and access present, at-risk, and/or rotted data, are bookmarklets.

I've recently added two to my browser, "open in Wayback Machine" and "Save in Wayback Machine". Respectively:

    javascript:void(window.open('https://web.archive.org/web/*/'+location.href));

    javascript:void(window.open('https://web.archive.org/save/'+location.href));

This makes opportunistic archival and reference easy. There are also Wayback Machine / Internet Archive browser extensions.

(These are from the Internet Archive, not my work.)

For bulk archival, lists of URLs can be submitted to the IA's save address:

    https://web.archive.org/save/<URL>

(Used in the bookmarklet above as well.)

This can be automated with a simple shell script using any console or script-based HTTP agent, such as curl, wget, lynx, etc.

[+] sun_n_surf|6 years ago|reply

I actually have a different problem -- not sure it is one that I can legally solve.

I have 10 years of lovingly curated YouTube videos playlists, which, now when I look into the older ones, are a barren wasteland of "Video removed" or "Video not available". It is heartbreaking. Is there any way I can prevent this from happening?

[+] boardwaalk|6 years ago|reply

I’d download and store the videos locally with youtube-dl.

[+] darepublic|6 years ago|reply

I think it's good that data is lost . Only items where someone gives enough of a duck to save it should be preserved. It's not as if physical paper content, which ends up recycled or in a landfill 99.999 of the time is any different. It's true that digital formats change but fighting that is the cost of preservation. A museum of software needs to also preserve the context on which software was run in order to save it from the mists of time, albeit temporarily.

[+] jplayer01|6 years ago|reply

I feel like you've never had to look for information or how to do something, and the only decent source is entirely gone or consists mostly of pictures, which are also mostly gone. Maybe somewhere at some point in time somebody saved it, but that copy gets lost and never makes it's way online again so you can find it. A lot of information is being lost in this way and I'm not sure why we should be fine with that.

[+] DFXLuna|6 years ago|reply

This seems like a good time to plug running a storage server on your local network. You can pick up old workstations off eBay for 100$. Stick a couple of drives in it, load it up with data to preserve and then put encrypted backups in the cloud. Back blaze B2 is something like .001$ per gigabyte.

It's a fun experiment with clear, practical use.

[+] kwhitefoot|6 years ago|reply

Not quite that cheap 0.005$ per GB. USD 6 per month for personal unlimited backup though.

Trying the trial now.

Thanks for bringing it to my attention.

[+] colonelpopcorn|6 years ago|reply

I hate to be this guy, but isn't this why printers and physical books exist?

[+] papln|6 years ago|reply

If books are all we need, why did anyone bother creating the Internet?

[+] dredmorbius|6 years ago|reply

The recent shutdown of Google+ was another case of this.

As one of the people helping coordinate information among those still using the platform and hoping to migrate off of it, discovering the Archive Team's GoogleMinus project this past January was a huge boost. That ended up being the largest archival project undertaken to date, 1.6 PB, and succeeded in capturing 98% of all G+ profiles, now stored at the Internet Archive.[1]

While it had long been obvious that the project was ill-stared, the shutdown announcement came as a surprise, and Google's tools, communications, and support for both individuals, and far more importantly, groups, looking to continue their existence off the platform, was abysmal.

I don't fault Google for killing the service -- I was suprised it survived as long as it had. I do fault Google for how they did so. And that episode was hardly the worst in history.

One of the lesser-known parts of G+ were its Communities. In the process of the shutdown we came to realise that there were over 8 million of these, about 50,000 with 1,000 or more members, of all descriptions. Many frivilous or worse, but many also not. And all stuck in a very hard spot by Google's actions.[2]

Even preservation of individual data does very little for groups, and is one of the issues we're considering in the post mortem of the G+ mass migration, intended to be of use to others.[3]

________________________________

Notes:

1. For those preferring not to have their content archived, the IA WBM respect DMCA requests, and as Google+ posts are all listed under the user's account, requesting removal is exceedingly straightforward.

2. Characteristics of number and size are collected here, compiled by me, based in part on on data provided by Friends+Me: https://social.antefriguserat.de/index.php/Migrating_Google%...

3. Discussion at Reddit and elsewhere. Compilation at the PlexodusWiki. https://old.reddit.com/r/plexodus/comments/boa97x/g_migratio... https://social.antefriguserat.de/index.php/G%2B_Migration_Po...

[+] bcaa7f3a8bbc|6 years ago|reply

Another problem of the present-day WWW is, even archiving all the data is far from enough to preserve the history! That's because the Web has a dual-role, (a) as a protocol, or a medium of communication, and (b) as the software, or the user-interface.

A good history preservation should allow you to somehow "browse" it, as if the historical system is still alive. How the website worked, how it was used, that's all parts of the history. If old operating systems and programs are preserved, there is no reason not to preserve websites in this way.

Back in the old days, many systems are federated and/or distributed, which means the software and the protocol are two separate entities. You use a newsreader, which talks the NNTP protocol to obtain news from a Usenet newsgroup. If you want to preserve history, you can (a) archive the newsreader program with source code, and (b) archive all the data on the NNTP server. That's exactly what has been done already, if you load a Usenet archive to your newsreader, pretty much you would have the experience similar to how Larry Wall browsed the Usenet back in the late 80s, at worst you need to rewrite a compatible "mock" server, but that's all. On the other hand, little of the early BBS systems have been preserved, once the server is gone, everything is gone.

The transformation to the web, means now the platform (a web community) = protocol (backend database format) = user interface (HTML/CSS), they're all tightly coupled together. It creates several problems:

(1) The "internal state" cannot be archived. A website is a system with constantly updating parameters, and often they are not stored. Simple examples: (a) On Hacker News, I cannot see what was shown on the frontpage yesterday retroactively, (b) A user changes his/her avatar, now we had no idea how the old avatar used to look like, and (c) an early user has been banned from the forum, now his/her personal profile is inaccessible, (d) on some social media platforms, sometimes a old post may be raised from the dead by renewed interests (look, how stupid this comment was!), and now suddenly it's flooded by new posts, leaving no trace of how it used to look like.

(2) The "reader/user interface" cannot be archived. You must have seen something like this: You changed the website frontend, superficially, lots of "conservative" users complained, but the point is: now the old frontend and its "look-and-feel" is lost. If it was a simple CSS file, there are chances to bring it back, but if it was a major rewrite of frontend code, now history is gone forever. And in the lifetime of a website, the design and architecture is likely to be changed many times.

As a result, even if a website and all its content is still alive, it may already be a shadow of its past for a long time, don't even mention to preserve it! And currently, there are two ways to archive the web, both are flawed:

(1) Preserve the HTML at the surface. It's good for single pages, but you cannot browse a website in this way at all. None of the button on the website would work.

(2) Preserve the database. For example, using the API to save posts, or dumping the database - the frontend and reader are not preserved. Using Hacker News as an example, now every single post is archived, but it's far from a full experience, at least you should be able to click someone's username and see all the posts.

Now more and more websites are powered by JavaScript, makes the problem even worse. You are now literally running a program on your computer without any control over it. Once the platform is gone, no archive can save you.

What is the solution? I guess there's no full solution, but there are some possibilities:

(1) Wikipedia-like websites already have builtin version control, but it's very difficult to browse the historical version of the entire website. Systems like this can improve the frontend / user interface to allow a user to "lock on" a historical date.

(2) When building an all-Javascript website, spend some energy to build a plain HTML version as well, it may help avoiding the upcoming digital dark age.

(3) If you are going to close a website, perhaps it may be a good idea to make your internal backups of database and codebase at different years publicly available with sensitive information removed, and allow everyone to setup and run a replicated version. It's infeasible for a big website, but it may be a workable idea for a small community.

And I can imagine the archeologist from the 22th century digging into the old backup tapes of Reddit and attempt to rerun the system.

But ultimately, it's a problem that is needed to be addressed by the protocol and software with archive and preservation in mind.

---

BTW: A few weeks ago, I've written a lengthy comment on the fundamental conflict of history preservation and personal privacy, using the Usenet as an example, you may find it interesting.

* https://news.ycombinator.com/item?id=19562650

[+] return1|6 years ago|reply

And that s a great thing! There is no reason to maintain everything, in fact the entire function of our brains is to filter out useful information from a cataclysm of sensory input. The internet figures out what to keep and what to throw away, the hard thing seems to be to willingly make it forget stuff.

[+] 1121redblackgo|6 years ago|reply

If it is rotting, then it will be a great time for scavengers and carrion feeders--maybe even the 'era of'.

[+] scarejunba|6 years ago|reply

I don't care. In fact, I prefer it this way. Death in species is an evolutionary advantage. So too with culture. We mustn't let it ossify.

For the whitewashing thing, it will happen anyway. Only vigilance can protect against rewriting. Websites can be altered. There is no provenance.

I'm not convinced infinite recall is useful.

[+] mnl|6 years ago|reply

I don't think it's for us to decide.

We've been mourning the loss of the Library of Alexandria for circa 1,500 years and I can't notice that we're getting radically wiser during the last 25 ones.

Letting all go is definitely the cheaper and convenient attitude... for us. But we might be leaving nothing to build upon for the future generation. They should have the same opportunity to ignore what they want that we have.

113 comments