top | item 1584988

Ask HN: How would you store 5TB of data for 50 years, untouched?

78 points | icey | 15 years ago | reply

Let's say you were thinking of putting digital data in a time capsule where it couldn't be touched for 50 years. How would you store it? How would you ensure it was readable at the end of the 50 years?

172 comments

[+] iamelgringo|15 years ago|reply
Here's what I would suggest:

Take a bunch of fibrous, cellulosic material, pound it into a pulp and then squeeze it into very thin, flexible sheets of material. Let them dry.

Then, take some form of pigment or dye, and with a very fine stylus impregnated with the dye, visually encode the data on the cellulosic sheets using a set of graphemes. Each grapheme would roughly represent a phoneme in a spoken language.

It would take quite a while to encode all that data. I'd suggest building some type of mechanical device to automate the task of transferring dye onto the cellulosic sheets. I'd also want to bundle these individual cellulosic sheets into stacks of 200-500 for organization's sake. I'd probably cover them with a more durable material such as animal hide or perhaps a thicker layer of cellulosic material.

I'd then take all these bundles of data-laden cellulosic material, and I'd build a structure to protect these bundles from the elements. Developing a cataloging or indexing system for these bundles shouldn't be too hard. I'm sure it's been done before.

Regardless, you could either preserve the materials or let the public have free access to the information. You'd run the risk of damaging the data, but if you had a mechanical replication system, you could simply make multiple copies of each data set, and ensure the safety of the data that way.

Sheets of fibrous, cellulosic material should last several thousand years if kept in the right environment.

You know. Now that I think about it. It's probably much too complex a system to handle something like that. I really don't think it would work.

[+] KC8ZKF|15 years ago|reply
It is important to remove the lignin from your fibrous material, or to otherwise ensure the flexible sheets have a basic pH, else the sheets will deteriorate.
[+] Nogwater|15 years ago|reply
Out of curiosity, how large would the time capsule need to be to contain 5TB of data encoded that way?
[+] aristus|15 years ago|reply
Hah, I'm building a time capsule right now. Acid-free paper, laser printed, and encased in epoxy resin. I'm targeting 300 to 500 years.

Eventually I want to do the same thing with a 20ft cargo container and a bunch of concrete.

[+] ezl|15 years ago|reply
ha, yeah right.

no way that would ever work.

[+] tzs|15 years ago|reply
Since some of the solutions people are proposing assume you have a lot of space for storage, I'll assume that too.

1. Get a 9600 bps modem. Use it to encode your data, and record the output as an audio file.

2. Take this audio file, and split it up into 60 minute segments.

3. Record these 60 minute segments onto two-sided vinyl LPs, 30 minutes per side. This will take about a million LPs.

4. Print on acid-free paper, using ink that will survive 50 years too, instructions on how a 9600 bps modem works. Describe the encoding in detail, sufficient so that someone using the equivalent of MATLAB or Mathematica or something 50 years from now on the computers they will have then could easily write a program to decode a modem signal.

5. Also print and include instructions for making a record player. As with the modem, the important part is describing how the signal is encoded on the LP. They'll have no trouble building a record player 50 years from now. (Assuming they don't just photograph the LPs with the 3D terapixel camera on their iPhone 54, and then write an app to extract the signal from the photo...)

6. Store all of this somewhere. LPs will last 50 years easily in a typical office environment, so you probably don't have to resort to something like a hermetically sealed vault buried in an old salt mine or anything extreme like that.
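The encoding in step 1 can be sketched in software. Below is a toy two-tone FSK scheme for illustration only; a real 9600 bps modem (e.g. V.32) uses QAM with echo cancellation, so the tone frequencies here are purely hypothetical:

```python
import math

# Toy FSK encoder: 0 -> 1200 Hz, 1 -> 2400 Hz (hypothetical tones,
# NOT the modulation an actual 9600 bps modem uses).
SAMPLE_RATE = 44100
BITRATE = 9600
FREQ = {0: 1200, 1: 2400}

def bits(data: bytes):
    """Yield the bits of `data`, most significant bit first."""
    for byte in data:
        for i in range(8):
            yield (byte >> (7 - i)) & 1

def encode(data: bytes):
    """Return raw audio samples encoding `data`, one tone per bit."""
    samples_per_bit = SAMPLE_RATE // BITRATE  # ~4 samples/bit at 9600 bps
    out = []
    t = 0
    for bit in bits(data):
        f = FREQ[bit]
        for _ in range(samples_per_bit):
            out.append(math.sin(2 * math.pi * f * t / SAMPLE_RATE))
            t += 1
    return out

samples = encode(b"hello")
print(len(samples))  # 5 bytes * 8 bits * 4 samples/bit -> 160
```

The written spec in step 4 would describe exactly this kind of mapping, so a future reader could invert it.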

[+] zandorg|15 years ago|reply
They put ZX Spectrum loading sounds (kind of like a modem, except entirely audio) on this LP, XL1:

http://en.wikipedia.org/wiki/Pete_Shelley

It is software which, when loaded in through the Speccy's audio-in port, does funny light shows in time to the album's songs.

BTW, the LP speed doesn't matter, the Speccy picks it up anyway.

Pretty innovative. And the best thing is you can now get the LP in a TAP-style emulator format! So it survived over 25 years.

[+] phreeza|15 years ago|reply
Here's the catch: Transmitting 5TB at 9600bps takes exactly 132 years, 1 month, 14 days, 21 hours, 24 minutes and 27 seconds. So the time capsule would be opened before you're done loading it.
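The arithmetic behind that figure, using decimal terabytes (the down-to-the-second precision of the original is part of the joke):

```python
bits_total = 5 * 10**12 * 8      # 5 TB in bits (decimal terabytes)
seconds = bits_total / 9600      # at 9600 bits per second
years = seconds / (365.25 * 24 * 3600)
print(round(years, 1))           # -> 132.0
```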
[+] AlexMuir|15 years ago|reply
Jesus, you'd struggle to get the data off those the day after you did it, nevermind in 50 years.
[+] phreeza|15 years ago|reply
Nice idea. Probably no need for an actual modem in the encoding; I think that should be feasible in software nowadays?
[+] bkrausz|15 years ago|reply
If a constraint was that it's going in a literal time capsule that would be buried underground (i.e. physical damage to the area is not controllable) I would get a couple of SSDs and a couple of backup tapes, and save redundant copies on different types of media. Given enough space I'd also stick a couple of machines capable of reading the data just in case.

Removing that constraint and completely ignoring cost, I'd also set up a low-risk savings account with $1M in it and put the data on S3 and Rackspace Cloud. I'd store access credentials in the capsule. Odds are pretty good one of those two will be around in 50 years (and you'll have a chunk of money left over in interest).

Try to keep everything ASCII, with really good text descriptions of data formats.

Realistically 50 years is not a long time: I would bet we'll still have legacy access to USB, SATA, and probably ext3 & NTFS (though probably not IDE). Tons of computer folk who used these technologies will still be alive to work them. English will still be the primary language in the US.

An interesting problem is what to do when the timescale allows these things to change. What if nobody remembers USB, or what spinning platters are. Or the English language?

[+] fhars|15 years ago|reply
Neither tape nor SSDs will last 50 years. Within about 10 years, tapes will lose magnetization through thermal movement and capacitors in SSD storage cells will flip due to cosmic radiation. Over some decades the plastic the media and/or casings are made of will just decay (a serious problem for museums of modern art and design: http://www.getty.edu/conservation/science/plastics/index.htm...). The only media with a proven track record of preservation over decades are acid-free paper, parchment and non-organic materials like steel, stone and clay. But getting 5TB on a stone tablet has its very own challenges. [Edit: And using acid-free paper won't buy you anything if you print using the plastic-based toner common in modern laser printers; at least use an inkjet with inorganic pigment (and not dye-based) ink. If you look out for pitfalls like this, you might be able to implement your requirements with http://ronja.twibright.com/optar/ and only 125 metric tons of acid-free paper, which you should be able to buy for less than $200,000.]

Some people claim that MO media and DVD-RAM can guarantee 30 years, but this is still an estimate; they have not been around long enough to actually know.

The only "reliable" way to store digital data for more than five years known today is to copy it to new media well in advance of the old media losing it, and even that is difficult if the amount of data is growing faster than storage technologies are getting faster. (I don't know if I should trust Eric Schmidt, but a few days ago he claimed that humanity currently generates as much data every two days as it did up until 2003, http://techcrunch.com/2010/08/04/schmidt-data )
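The 125-ton figure can be sanity-checked, assuming Optar's roughly 200 kB per A4 page (an assumed capacity; the real figure depends on dpi and error-correction settings) and ordinary 80 g/m² paper:

```python
capacity_per_page = 200 * 1000            # bytes per page (assumption)
pages = 5 * 10**12 / capacity_per_page    # 25 million pages for 5 TB
grams_per_page = 0.210 * 0.297 * 80       # A4 sheet at 80 g/m^2 ~ 5 g
tons = pages * grams_per_page / 1e6       # metric tons of paper
print(round(tons))                        # -> 125
```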

[+] robryan|15 years ago|reply
Even better: run a program on multiple servers that can move the data around between different online storage services, and can seek out and buy more online storage if required over time.
[+] phreeza|15 years ago|reply
My answer to the last question would probably be to include a reader right with it. AC current will likely be around a lot longer than 50 years. So if you have some kind of computer containing the data, you just need to plug it in and it starts a self-explaining film (ideally with pictograms, explanations in several languages, etc.) that should make the data usable...
[+] sliverstorm|15 years ago|reply
Convert the data to a string of letters.

Have a child.

Name your child that string of letters.

Now preserving your data is the government's problem - they have to produce a birth certificate and keep track of him/her in their databases.

[+] lzw|15 years ago|reply
They'll just rename your child.

The thing about government is, all the laws apply to you. But they can do anything they want.

[+] kabdib|15 years ago|reply
A lot of people are talking about 50 years like it was a super long time, and propose solutions that are really intended for hundreds to thousands of years. I think it's overkill. [Also I think a lot of you are under 30 :-) ]

For only 50 years, I'd probably risk making many thousands of DVDs and CDs, using different manufacturers and drives. Store with tons of redundancy and ECC, don't use inner / outer tracks for anything important, etc.

Also, are all of the data equally important? You can afford to store the more critical pieces in more expensive and less compact, but more robust formats.

I think the real enemy is obsolescence, and that keeping the data simple (and providing decompression programs and indices in easily understandable formats) is likely more important than worrying about bit-dropout, which seems largely manageable over your specified time.

For 500 years, I'd print it, or micro-inscribe it. (One problem with printed matter is that it has other inherent value, e.g., fuel for heating the yurts of cold barbarians).

For 5K years, micro-inscription and (if you are worried about technological crashes) an archive in the sky. You could populate a host of satellites in various orbits, timed to re-enter at intervals of (say) a decade over a few thousand years (hard to be exact with atmospheric drag and climate change, but you get my drift). Getting something from orbit down to the ground is not hard, getting /noticed/ and picked up as an interesting artifact is probably harder.

For 5M years, add a metric buttload of ECC and stick it in the DNA of some critter that doesn't get out much. A bottom-feeder in a radiation-shielded environment would be cool. Say, a lobster.

[+] parallax7d|15 years ago|reply
I love the lobster idea. It would be even better if the lobsters' survival were based on the integrity of the data. This would provide evolutionary ECC.

You would also need some mechanism to signal people in the far future that the lobsters were data carrying devices. Otherwise they wouldn't have any reason to randomly decode sea creatures. Perhaps you could program the lobsters to develop spots on their shells every century which denote the first 10 prime numbers.

[+] jbert|15 years ago|reply
The DNA idea is great. You'd need a long "this is a message" intro, say a long ATATATATAT repeating sequence (much too long to occur by chance).

As you say, you can use forward error correction to preserve the data. The hard part is describing the data format to the reader.

From the ATATA...intro, a reader knows they have a message. But now they need to know how to interpret it. You need a way of encoding information (english text?) in DNA and you also need a way of describing that encoding mechanism in DNA too...

Basically, over 5k years you should look up all the protocols SETI people have thought up. And/or re-read Gödel, Escher Bach.
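A toy sketch of the scheme: 2 bits per base with an arbitrary symbol mapping (the mapping itself is exactly the thing that would have to be described to the reader), plus the AT-repeat preamble. A real design would add the forward error correction discussed above, omitted here:

```python
# Preamble marks "this is a message"; far longer in practice,
# so it can't occur by chance.
PREAMBLE = "AT" * 32
BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}  # arbitrary mapping

def to_dna(data: bytes) -> str:
    """Encode bytes as a DNA sequence, 2 bits per base, MSB first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return PREAMBLE + "".join(bases)

seq = to_dna(b"hi")
print(seq[:8], len(seq))  # preamble start, total length
```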

[+] ugh|15 years ago|reply
Compact Cassettes are now in their 47th year of production, still going strong in developing countries. I’m willing to bet that you will still be able to buy cassettes and players in twenty years.

The story for CDs seems to me to be similar and they are still popular everywhere. I give them at least another forty to fifty years.

[+] extension|15 years ago|reply
Why is everybody worried about the future people knowing how to read the data? Barring some unprecedented catastrophe, we should still have detailed technical specs of today's formats in 50 years.

Just bury 250TB worth of SSD storage, along with a device that activates every year and copies from one 5TB block to the next. Any single SSD will only be in use for a year. If the drives can survive 49 years before their first use, it will work.
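The capacity arithmetic behind this, assuming decimal terabytes:

```python
TB = 10**12
data = 5 * TB
years = 50
total = data * years            # one fresh 5 TB block per year
print(total // TB)              # -> 250, the 250 TB figure above
idle_before_first_use = years - 1
print(idle_before_first_use)    # -> 49 years for the last drive
```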

Storing the data in some ridiculous format is just going to discourage anyone from ever reading it. I'm sure the people of tomorrow have better things to do than OCR millions of sheets of paper just to see grandpa's porn collection.

[+] iamelgringo|15 years ago|reply
How many digital devices that were created in 1960 are still running today? Most digital devices back then were built using vacuum tubes. How many vacuum tube makers still exist? The military was buying vacuum tubes from Czechoslovakia in the '80s to keep the SAGE early warning system running. ref: http://en.wikipedia.org/wiki/Semi_Automatic_Ground_Environme... That's because there were no American manufacturers of vacuum tubes after the late '70s.

Sure, we still have the technical specifications for how to build it, but manufacturing the individual components would be a giant pain in the ass.

[+] NathanKP|15 years ago|reply
Who says that in 50 years SSD storage won't be a ridiculous format? It might be just as hard to get data off an SSD as it would be to get data off an LP, or some other device.
[+] bugsy|15 years ago|reply
> I'm sure the people of tomorrow have better things to do than OCR millions of sheets of paper just to see grandpa's porn collection.

That's actually a pretty good thought - whatever it is, label it as porn.

People have spent millions of dollars restoring vintage erotica films.

[+] goodside|15 years ago|reply
Encrypt the data with a key long enough that, by Moore's law, you'd expect computers to be able to break it in 50 years. Submit the data to Wikileaks. Destroy the key.
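A toy sizing calculation for that key, assuming compute doubles every 18 months and that today's brute-forceable symmetric key length is around 64 bits (both numbers are illustrative assumptions, not facts about current cryptanalysis):

```python
doubling_period = 1.5                 # years per doubling (assumption)
extra_bits = 50 / doubling_period     # ~33 doublings in 50 years
breakable_today = 64                  # bits brute-forceable now (assumption)
key_bits = breakable_today + round(extra_bits)
print(key_bits)                       # -> 97
```

Each extra bit doubles the work, so one bit per compute doubling keeps the break date roughly fixed.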
[+] gojomo|15 years ago|reply
Contract with the Long Now Foundation's Rosetta Project to put the data on their 'Rosetta Disks', readable by any civilization with high-powered optical telescopes:

http://en.wikipedia.org/wiki/Rosetta_Project

Supposedly one holds 13,000 pages of text in human languages. If we assume your data is similar text, and one page is 58 lines of 66 characters (as are plain text IETF RFCs), you'll need:

(5TB / (3828 bytes)) / 13000 = 110473 disks
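The disk count checks out if you read 5TB as binary terabytes (a sketch of the same arithmetic):

```python
import math

TB = 2**40                       # the figure works with binary terabytes
page_bytes = 58 * 66             # 3828: an RFC-style plain-text page
pages = 5 * TB / page_bytes
disks = math.ceil(pages / 13000) # 13,000 pages per Rosetta Disk
print(disks)                     # -> 110473
```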

[+] zokier|15 years ago|reply
My answer is that you don't put digital data in a time capsule. Digital data is easy to copy, and that's what you want to leverage. Put two pairs of servers in two data centers and keep them running, migrating to new tech when needed. I'd assume 5-10 generations of hardware would be necessary.
[+] waterlesscloud|15 years ago|reply
This is the best answer.

I also think things like NASA datasets, other govt agency datasets, etc, should be placed on torrents for anyone who wants to make a copy. Let the self-replicating nature of the internet serve as the backup backup plan.

If you put those Apollo datasets online, it's a guaranteed certainty that some hacker somewhere will have them in 50 years.

[+] edw519|15 years ago|reply
1. Convert all the data to decimal (digits 0-9).

2. Put a decimal point in front of this long string. The result will be a rational number between 0 and 1. Call it x.

3. Get a titanium rod exactly 12 inches long.

4. Using a fine laser, etch a line in the rod precisely 12x inches from the end.

5. Done. Precise, durable, elegant, compact, and green.

EDIT: </sarcasm>

[+] JadeNB|15 years ago|reply
Ah, you added the `</sarcasm>` tag while I was responding… Anyway, it was an excuse to break out Frink (http://futureboy.us/fsp/frink.fsp), which reports that the resolution required, which (I think) is (1 foot)/(50 terabytes/byte), is 6 * 10^(-15) m, i.e., on the order of the diameter of a proton. Honestly, I thought it would be much smaller.

Another problem is that any rod etched in this way will have two decodings. :-)

[+] bayes|15 years ago|reply
It's certainly possible to imagine universes in which that would work.

But in ours, where stuff is made of atoms, I can't see you positioning the mark on the rod any more precisely than the width of an atom, which I think is about 10 to the -10 meters. So I'm guessing you could only encode 30 or 40 bits, even with super-advanced etching and measuring equipment.
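The same estimate in a couple of lines, assuming (as above) roughly 10^-10 m of positioning precision:

```python
import math

rod = 0.3048                    # 12 inches in metres
atom = 1e-10                    # rough atomic spacing, per the comment
positions = rod / atom          # distinguishable mark positions
capacity = int(math.log2(positions))
print(capacity)                 # -> 31 bits, within the 30-40 bit estimate
```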

[+] phreeza|15 years ago|reply
Just out of curiosity: whats the background of this question? Are you planning to actually do this? What kind of data?

If so, you could maybe give us some more information on the constraints involved (although I must admit thinking about it without any constraints is fun, too)

[+] icey|15 years ago|reply
There isn't really much in the way of background. I was thinking about people who decide to use cryopreservation, or the potential of sending out spacecraft for long periods of time where they may be out of communication range but are meant to return, or even something as simple as a school's time capsule.

The constraints were chosen in order to remove the easiest answers (file sizes, period of time, etc).

Ultimately I think it's an unsolved problem that will become more important over time. My family has photo albums from over 50 years ago, but that doesn't have the kind of bandwidth we need for larger datasets (audio, video, etc).

So I guess it's just a thought experiment I thought was interesting.

[+] bluemetal|15 years ago|reply
If this digital data didn't have to be put away somewhere, I would make the most of intelligence. Put someone in charge with the skills to transfer the data to newer media as they become popular and to ensure that the copies are not corrupted in the process. This person would be paid in whatever way leads to the most loyalty; I would also leverage their sense of pride. Maybe have a few different people each tasked with protecting overlapping segments of the data to help ensure nothing is ever lost.

Ideally some kind of artificial intelligence would come about sometime in the future to assume the role of data keeper - hiring people to do any work it couldn't do from within the computer and running off some kind of fund that had been set up. Maybe one day there will be a market for creating intelligent services like this; I hope I have something to do with them.

[+] nhnifong|15 years ago|reply
Transmit the data with a laser to a mirror 25 light years away.
[+] dmoney|15 years ago|reply
Put it on regular hard drives and write a note to your future self to come back for them once time machines are invented.
[+] Sapient|15 years ago|reply
The beauty is in the simplicity.
[+] nivertech|15 years ago|reply
At the rate both our culture and our technology are changing, communicating with people 50 years in the future is like communicating with aliens. So you may apply the same principles:

http://en.wikipedia.org/wiki/Communication_with_Extraterrest...

Another proven way of communicating your knowledge across thousands of years is to start your own ethno-religious group/nation, like, for example, the Jews.

If you want to combine both approaches - try Scientology ;)

[+] ecaradec|15 years ago|reply
Paper? It's possible to print binary data on paper now. I'm not sure what 5TB would look like, though. It's probably better to keep the data at hand and migrate it as the world and the technology evolve.
[+] fxj|15 years ago|reply
An A4 page is 8.3 x 11.7 inches.

Use a 1200 dpi printer to print b/w dots. This gives 8.3 * 11.7 * 1200 * 1200 bits/page, or about 17 MB/page.

5 TB is then about 142 books of 1000 double-sided pages each (the size of a small personal library).
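The arithmetic spelled out (it lands within a book or so of the figure above, depending on rounding):

```python
bits_per_page = 8.3 * 11.7 * 1200 * 1200   # 1200 dpi b/w dots on A4
mb_per_page = bits_per_page / 8 / 1e6
print(round(mb_per_page))                  # -> 17 MB per page

sides = 5 * 10**12 / (bits_per_page / 8)   # printed sides for 5 TB
books = sides / 2000                       # 1000 sheets, both sides
print(round(books))                        # -> 143, within rounding of 142
```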

case closed.

[+] js4all|15 years ago|reply
5 TB on paper needs too much room to store.
[+] ghshephard|15 years ago|reply
Use archival CD-R Media - Good for 300 Years

Start by looking at these guys: http://www.falconrak.com/pro_archival_cd-r_gold_ep.html

[+] ascuttlefish|15 years ago|reply
Any idea where they get the number 300 years? In the archival world, we're leery of such blatant marketing claims, especially since CD technology is only a few decades old.
[+] zokier|15 years ago|reply
You'd also want to store a few CD drives with them. And a few computers to attach the CD drives to. And there the problem arises: both bearings and electrolytic capacitors are essential for computers, and I doubt they'd last 50 years.
[+] AlexMuir|15 years ago|reply
50 years isn't all that long - there are plenty of tapes and records that are still perfectly usable from then. As long as you included a decent amount of redundancy you'd be alright with a few hard drives surely? There's always the issue of software being able to read the data, but we have no problems opening images and documents from 27 years ago now. In 50 years time there'll probably be a niche industry producing software that converts old formats - just as there is now converting VHS/Cinefilm.
[+] mevodig|15 years ago|reply
The major film studios, faced with a related problem a long time ago, opted for a method called YCM separation, where they separate the image into yellow, cyan and magenta and record it onto very stable black and white polyester film stock. Properly stored, this supposedly has a lifetime of 500 years or more.

A modern laser film recorder is capable of a resolution of 4096x3112 and 10 bits per pixel, so that's about 16MB of data per 35mm frame with black and white film.
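A quick check of the per-frame figure, plus how much film 5TB would need (assuming the standard 16 frames per foot of 35mm motion-picture film):

```python
frame_bits = 4096 * 3112 * 10          # resolution x 10 bits per pixel
frame_mb = frame_bits / 8 / 1e6
print(round(frame_mb))                 # -> 16 MB per frame

frames = 5 * 10**12 / (frame_bits / 8) # ~314,000 frames for 5 TB
feet = frames / 16                     # 16 frames per foot of 35 mm film
print(round(feet))                     # roughly 20,000 feet (~6 km)
```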

[+] yannis|15 years ago|reply
Print it out on acid-free paper (I have books that are over 300 years old), so this will definitely work.

After 50 years you can OCR the data etc. (or ask your personal robot to do it for you) and print it using a variant of TeX/LaTeX. TeX has already survived for 34 years, so another fifty years is almost guaranteed ;) Knuth predicted some years back that TeX will last for about 100 years.