
Archival Storage

362 points| rbanffy | 1 year ago |blog.dshr.org | reply

193 comments

[+] entrepy123|1 year ago|reply
It's kinda mind-blowing that we have (so-called) AI, quantum computing, 6K screens, M2 NVME, billions of networked devices, etc., but regular data *can only be expected to last about 5 years* due to the propensity of moving disk failure, SSD impermanence, bitrot, etc., and is only overcome with great attention and significant cost (continually maintaining a JBOD or RAID or NAS, or painstakingly burning to M-Disc bluray etc.) or handing it over to someone else to manage (cloud) or both. I mean maybe you get lucky with a simple 3-2-1 but maybe you don't, and for larger archives of data that is simply not necessarily a walk in the park either.

Absolutely mindblowing.

[+] Rygian|1 year ago|reply
Emphatic Yes.

I'd like to expand. What I find mindblowing about it is that, as a regular consumer:

* When you need more space you can't just plug in another disk or USB stick. You also have to choose on which device you want to use it, and you have to tell all your software to use it. And that may involve shuffling data around.

* As a corollary, you need to remember in which device you put which stuff.

* As an extra corollary, any data loss is catastrophic by default.

* File copy operations still fail, and when they fail, they do so without ACID-strong commit/fallback semantics.

* Backups don't happen by default, and are not transparent to the end user.

* Data corruption can be silent.

Bonus, but related:

* You can't share arbitrary files with people without going through a 3rd party.
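
The point above about file copies failing without commit/fallback semantics can be partly worked around in user space. A minimal sketch, assuming a POSIX-like filesystem: copy to a temporary file in the destination directory, verify it by checksum, and only then rename it into place. The names `safe_copy` and `sha256_of` are made up for illustration, and the read-back check may be served from the OS cache, so it is not proof of what is physically on the disk.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_copy(src, dst):
    """Copy src to dst so that dst only ever appears as a verified full copy."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # The temporary file lives next to dst so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, prefix=".partial-")
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())  # ask the OS to push the data to the device
        if sha256_of(tmp_path) != sha256_of(src):
            raise IOError(f"verification failed while copying {src}")
        os.replace(tmp_path, dst)  # the "commit": atomic rename on POSIX
    except BaseException:
        # The "rollback": drop the partial file and leave any existing dst untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If anything goes wrong before the rename, the destination is left as it was, which is about as close to all-or-nothing as a plain file copy gets without filesystem support.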

[+] creer|1 year ago|reply
Well, professionally, tape is it - there is technology and it lasts more than 5 years. Unfortunately, the market for tape has evolved such that it's not very friendly to the non-pros. Not impossible but not friendly. That probably has to do with the lack of perceived market for that among non-corporate - or perhaps the impression that clown storage is where it's at for non-corporate.

To be fair some more, JBOD/RAID and hard drives do work pretty well. Past the 5 year horizon to be sure.

Product mgt and corp finance have also fallen in love with subscriptions - and clown storage is such an awesome match for that! Who needs to sell long term terabyte solutions when you can rent them out. Easy to argue against that logic of course, but not easy to fight.

[+] bob1029|1 year ago|reply
I think it's even more mind-blowing that we can hold back the tide of entropy for as long as we can and with so little energy expended.

Solid state electronics and magnetic media are beyond magical. The odds of keeping terabytes of data on rails are astronomically bad.

[+] rietta|1 year ago|reply
Yes, it is. I actually have been mulling over a fictional world set in the future where the period between the 20th and 25th centuries is a mysterious time that so little is known about. The story follows a professor who is obsessed with the "Bit Rot Era" and finding out just what happened to that civilization.

I have a prototype first chapter written that cold opens with an archeological dig '...John Li Wei looked up from his field journal where he had just written “No artifacts found in Basement Level 1, Site 46-012-0023”, wiping sweat from his brow. "Did you find something, Arnold?" he asked, his voice weary. "Three days in this godforsaken jungle, and we've got nothing but mud to show for it. Every book in this library’s long since turned to muck.” Arnold gestured towards the section of the site he had been laboring for the last 30 minutes, digging through layer after layer of brown muck, with fragments of metal hardware that once supported shelving. A glint of metal caught the filtered light. “Arnold, that’s just another computer case,” John sighed, his shoulders slumping slightly. He could already imagine the corroded metal and the disintegrated components inside. Useless. “Help me pull this out.” The two men strained against the clinging earth, their boots sinking into the mud with each heave. As they finally wrestled the heavy, corroded metal case free, a piercing shriek cut through the jungle sounds – beep, beep, beep, beep....'

[+] mtillman|1 year ago|reply
Important to remember that M-Disc Blu-ray is marketing only and not spec whereas M-disc DVD is an actual spec for archival purposes.
[+] RGamma|1 year ago|reply
I often think we're living in a dark age (an age that is characterized by little surviving cultural output)... Depends on how things will go, of course, but I ain't holding my breath.
[+] remus|1 year ago|reply
In some ways it is surprising, but the examples you gave are only currently straightforward because of massive investment over many years by thousands of people. If you wanted to build chatGPT from scratch I'm sure it would be pretty hard, so it doesn't seem so unreasonable that you might pay someone if you care about keeping your data around for extended periods of time.
[+] vodou|1 year ago|reply
How good/bad would it be to have a poor man's tape archival, using standard cassette tapes (C90, C120, etc)?

For example, using something like ggwave [1]. I guess that would last way more than 5 years (although the data density is rather poor).

[1] https://github.com/ggerganov/ggwave

[+] WalterBright|1 year ago|reply
I buy new disk drives every year, copy everything onto them, and retire the old ones as backups.
[+] pdimitar|1 year ago|reply
I share your disappointment. I explained it for myself with this: nobody cares if we the netizens have our data backed up. The corps want it for themselves and they face zero accountability if they lose it or share it illegally with others.

So it's up to us really. I have a fairly OK setup, one copy on a local machine and several encrypted compressed copies in the cloud. It's not bulletproof but has saved my neck twice now, so can't complain. It's also manual...

We the techies in general are dragging our feet on this though. We should have commoditized this stuff a decade ago because it's blindingly obvious the corps don't want to do it (not for free and with the quality we can do it anyway). Should have done app installers for all 3 major OS-es, zero-interaction-required unattended auto-updates -- make it so grandma would never know it's there and working. The only thing it asks is access to your cloud storage accounts and it decides automatically what goes where and how (kind of like disk RAID setups I suppose).

[+] UltraSane|1 year ago|reply
Data on LTO tapes stored correctly should last 40 years.
[+] klysm|1 year ago|reply
Making stuff last for a long time is very difficult. It's cheaper to make things last for a short time and allow improvement
[+] alnwlsn|1 year ago|reply
I've thought about the "hundreds of years" problem on and off for a while (for some yet to be determined future time capsule project), and I figure that about all we know for sure that will work is:

- engraved/stamped into a material (stone tablets, Edison cylinders, shellac 78s, vinyl, voyager golden record(maybe))

- paper, inked (books) or punched (cards, tape)

- photography; microfiche/microfilm (GitHub Arctic Code Vault), lithography?

I actually looked into what it might take to "print" an archival grade microfilm somewhat recently - there might be a couple options to send out and have one made but 99.99% of all the results are to go the other way, scanning microfilm to make digital copies. This is all at the hobbyist grade cheapness scale mind you, but it seems weird that a pencil drawing I did in 2nd grade has a better chance of lasting a few hundred years than any of my digital stuff.

[+] PaulHoule|1 year ago|reply
Cost calculations are often different at the enterprise scale from the individual scale. Hypothetically

https://en.wikipedia.org/wiki/Linear_Tape-Open

is an affordable storage medium if you need to store petabytes but for what the drive costs

https://www.bhphotovideo.com/c/product/1724762-REG/quantum_t...

you could buy 400 TB worth of hard drives. Overall I'd have more confidence in the produced-in-volume hard drives compared to LTO tapes which have sometimes disappeared from the market because vendors were having patent wars. Personally I've also had really bad experiences with tapes, going back to my TRS-80 Color Computer which was terribly unreliable, getting a Suntape with nothing but zeros on it, when the computer center at NMT ended my account, the "successful" recovery of a lost configuration from a tape robot in 18 hours (reconstructed it manually long before then), ...
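
The enterprise-versus-individual cost argument above comes down to a break-even capacity: tape only wins once the cheaper media has paid off the drive. A rough sketch follows; the function name and every price used with it are placeholders rather than quotes from any vendor, and it ignores things like HDD enclosures, power, and replacement cycles.

```python
def break_even_tb(tape_drive_cost, tape_cost_per_tb, hdd_cost_per_tb):
    """Capacity (in TB) above which tape becomes cheaper than hard drives.

    All inputs are assumptions supplied by the caller; only the fixed cost
    of the tape drive and the per-TB media costs are modeled.
    """
    if hdd_cost_per_tb <= tape_cost_per_tb:
        raise ValueError("tape media must be cheaper per TB for a break-even to exist")
    return tape_drive_cost / (hdd_cost_per_tb - tape_cost_per_tb)
```

For example, with a hypothetical $6000 drive, $5/TB tape and $15/TB hard drives, `break_even_tb(6000, 5, 15)` gives 600 TB, which is well beyond what most individuals store.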

[+] sshagent|1 year ago|reply
My day job (company died a couple of weeks ago). We had > 100,000 LTO tapes in the end. With data archived way back in 2002 until present. We were still regularly restoring data. In our busiest years we were doing what averaged to 177 restores per day (365 days a year). Barely any physically destroyed tapes.

I see a few articles citing robotic failures as a big issue, but really someone can just place a tape in the robot if critical recovery is needed and the robot has died.

[+] JeremyNT|1 year ago|reply
Tape is reliable and suitable for long term archiving, but it still needs care and feeding.

Having some kind of parity data recorded so losing a single tape does not result in data loss, routine testing and replacement of failing tapes, and a plan to migrate to denser media every x years are all considerations.
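
As a toy illustration of the parity idea, here is single-parity XOR across equal-sized blocks, where each block stands in for one tape. This is only a sketch of the principle; real tape libraries and archival software may use different and stronger erasure codes, and the function names are invented for the example.

```python
def xor_parity(blocks):
    """XOR equal-length byte blocks together into one parity block."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    parity = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block from the survivors plus the parity block."""
    # XOR-ing the survivors with the parity cancels them out, leaving the lost block.
    return xor_parity(list(surviving_blocks) + [parity])
```

With one parity block, exactly one lost tape can be rebuilt; losing two at once needs a code with more redundancy.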

Spinning rust just feels simple because the abstractions we use are built on top of a substrate that assumes individual drive (or shelf) failure. Everybody knows that if you use hard drives you'll need people to go around and replace failing hardware for the entire lifetime of the data.

[+] wmf|1 year ago|reply
This is mentioned in the article.

There's an old presentation from Google where they mentioned that they were the only ones who read back their tapes to make sure they work.

[+] jewel|1 year ago|reply
If you're using cloud storage for backups, don't forget to turn on Object Lock. This isn't as good as offline storage, but it's a lot better than R/W media.

At work we've been using restic to back up to B2. Restic does a deduplicating backup, every time, so there's no difference between a "full" and an "incremental" backup.
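
For anyone wondering why "full" and "incremental" stop being different with a deduplicating backup, here is a toy content-addressed store. It is not how restic is actually implemented (restic uses content-defined chunking, encryption, and its own repository format); the `ChunkStore` class and the tiny fixed chunk size are assumptions made purely for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # absurdly small on purpose, so the example data splits into chunks


class ChunkStore:
    """Toy deduplicating store: chunks are keyed by the hash of their content."""

    def __init__(self):
        self.chunks = {}     # digest -> chunk bytes
        self.snapshots = {}  # snapshot name -> ordered list of digests

    def backup(self, name, data):
        """Record a snapshot and return how many chunks were newly written."""
        new_chunks = 0
        digests = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # chunks already stored are skipped
                self.chunks[digest] = chunk
                new_chunks += 1
            digests.append(digest)
        self.snapshots[name] = digests
        return new_chunks

    def restore(self, name):
        """Reassemble a snapshot from its chunk list."""
        return b"".join(self.chunks[d] for d in self.snapshots[name])
```

Every snapshot lists all of its chunks, so each one restores on its own like a full backup, while only chunks the store has not seen before cost any new space, like an incremental.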

[+] rtkwe|1 year ago|reply
I wish tape archival was easier to get into. But because it's niche and mainly enterprise, drives usually start in the multiple thousands of dollar range unless you go way down in capacity to less than a modern SSD.
[+] nntwozz|1 year ago|reply
I basically use the 3-2-1 backup strategy.

The 3-2-1 data protection strategy recommends having three copies of your data, stored on two different types of media, with one copy kept off-site.

I keep critical data mirrored on SSDs because I don't trust spinning rust, then I have multiple Blu-ray copies of the most static data (pics/video). Everything is spread across multiple locations at family members.

The reason for Blu-ray is to protect against geomagnetic storms like the Carrington Event in 1859.

[Addendum]

On 23 July 2012, a "Carrington-class" solar superstorm (solar flare, CME, solar electromagnetic pulse) was observed, but its trajectory narrowly missed Earth.

[+] kemotep|1 year ago|reply
3-2-1 has been updated to 3-2-1-1-0 by Veeam’s marketing at least.

At least 3 copies, in 2 different mediums, at least 1 off-site, at least 1 immutable, and 0 detected errors in the data written to the backup and during testing (you are testing your backups regularly?).
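
The rule is simple enough to check mechanically. A small sketch, with a made-up `Copy` record and `check_3_2_1_1_0` function (none of this comes from Veeam), that reports which parts of the rule an inventory of backup copies satisfies:

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """One copy of the data; all fields are illustrative assumptions."""
    medium: str             # e.g. "hdd", "tape", "cloud"
    offsite: bool = False
    immutable: bool = False
    verify_errors: int = 0  # errors found the last time this copy was tested


def check_3_2_1_1_0(copies):
    """Return a dict mapping each part of the rule to True or False."""
    return {
        "3 copies": len(copies) >= 3,
        "2 media": len({c.medium for c in copies}) >= 2,
        "1 off-site": any(c.offsite for c in copies),
        "1 immutable": any(c.immutable for c in copies),
        "0 errors": all(c.verify_errors == 0 for c in copies),
    }
```

The "0 errors" entry is only as good as the last restore test, which is the part the rule is really nagging about.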

[+] Dylan16807|1 year ago|reply
> The reason for Blu-ray is to protect against geomagnetic storms like the Carrington Event in 1859.

The danger of such an event is the volts per kilometer it induces in long wires.

An unplugged hard drive will experience no voltage and a super tiny magnetic field. Nothing will happen to it.

[+] ievans|1 year ago|reply
Do you store your SSDs powered? They can lose information if they're not semi-frequently powered on.
[+] lizknope|1 year ago|reply
I've got files going back to 1991. They started on floppy and moved to various formats like hard drives, QIC-80 tape, PD optical media, CD-R, DVD-R, and now back to hard drives.

I don't depend on any media format working forever like tape. New LTO tape drives are so expensive and used drives only support small sized tapes so I stick with hard drives.

3-2-1 backup strategy, 3 copies, and 1 offsite.

Verify all the files by checksum twice a year.

You can over complicate it if you want but when you script things it just means a couple of commands once a week.
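
A minimal sketch of what such a checksum script could look like, assuming SHA-256 and an in-memory manifest (in practice you would save it next to the archive, for example as JSON). The function names are made up for illustration and this is not the commenter's actual script.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def verify(root, manifest):
    """Return (missing, changed) lists of relative paths versus the manifest."""
    missing, changed = [], []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            missing.append(rel)
        elif file_digest(full) != expected:
            changed.append(rel)
    return missing, changed
```

Build the manifest once when the files are written, then run `verify` against each copy on whatever schedule you like; any file it flags gets restored from one of the other copies.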

[+] hn_throwaway_99|1 year ago|reply
This article touches on a lot of different topics and is a bit hard for me to get a single coherent takeaway, but the things I'd point out:

1. The article ends with a quote from the Backblaze CTO, "And thus that the moral of the story was 'design for failure and buy the cheapest components you can'". That absolutely makes sense for large enterprises (especially enterprises whose entire business is around providing data storage) that have employees and systems that constantly monitor the health of their storage.

2. I think that absolutely does not make sense for individuals or small companies, who want to write their data somewhere and ensure that it will be there in many years when they might want it without constant monitoring. Personally, I have a lot of video that I want to archive (multiple terabytes). The easiest approach I've found, with a level of risk I'm comfortable with, is (a) for backup, I just store on relatively cheap external 20TB Western Digital hard drives, and (b) for archival storage I write to M-DISC Blu-rays, which claim to have lifetimes of 1000 years.

[+] nadir_ishiguro|1 year ago|reply
I personally don't believe in archival storage, at least for personal use.

Data has to be living if it is to be kept alive, so keeping the data within reach, moving it to new media over time and keeping redundant copies seems like the best way to me.

Once things are put away, I fear the chances of recovering that data steadily reduce over time.

[+] sigio|1 year ago|reply
Only 'online' data is live/surviving data... So I keep a raid5 array of (currently 4) disks running for my storage needs. This array has been migrated over the years from 4x1 TB, to 2TB, to 4TB, 8TB and now 4x 16TB disks. The raid array is tested monthly (automated). I do make (occasional, manual) offline backups to external HDD's ( a stack of 4/5 TB seagate 2.5" externals), but this is mostly to protect myself from accidental deletions, and not against bitrot/failing drives.

Tapes are way too slow/expensive for this (low) scale, optical drives are way too limited in capacity, topping out at 25/50GB, and then way too expensive to scale.

[+] Dylan16807|1 year ago|reply
You don't need constant monitoring if you have extra disks. If your budget is at least a thousand dollars, you can set up 4 data disks and 4 parity disks and you'll be able to survive a ton of failure. That's easily inside small company range.
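
To put a rough number on "survive a ton of failure": if the array can lose any 4 of its 8 disks, data is lost only when 5 or more fail before anyone intervenes. A back-of-the-envelope sketch, assuming each disk fails independently with the same probability over the unattended period (an optimistic assumption, since real failures tend to be correlated) and that the parity scheme tolerates any `parity_disks` failures:

```python
from math import comb


def loss_probability(total_disks, parity_disks, p_fail):
    """Probability that more than parity_disks of total_disks fail.

    Binomial model: every disk fails independently with probability p_fail.
    """
    return sum(
        comb(total_disks, k) * p_fail**k * (1 - p_fail) ** (total_disks - k)
        for k in range(parity_disks + 1, total_disks + 1)
    )
```

Comparing `loss_probability(8, 4, p)` with `loss_probability(5, 1, p)` for the same per-disk `p` shows how much the extra parity disks buy for the same 4 disks of usable data.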
[+] wmf|1 year ago|reply
My takeaway is that for personal/SMB use you have to use the cloud.
[+] globular-toast|1 year ago|reply
This article is specifically about digital archival. That is, keeping bit-perfect copies of data for 100+ years. But I think for regular people this is not so obviously useful. People want to keep things like texts (books), photographs, videos etc. Analogue formats are a much better option for these things, for a couple of reasons:

* They gracefully degrade. You don't just start getting weird corruption or completely lose whole files when a bit gets flipped. They might just fade or get dog-eared, but won't become completely unusable,

* It's a more expensive outlay and uses scarce physical space, so you'll think more carefully about what to archive and therefore have a higher quality archive that you (and subsequent generations) are more likely to access.

The downside I guess is backups are far more difficult, but not impossible, and they will be slightly worse quality than the master copy. But if you lose a master copy of something, would it really be the end of the world? Sometimes we lose things. That's life.

[+] squeedles|1 year ago|reply
Simple wins. Always.

I've backed up on just about everything going back to QIC-150s, but today I just use a set of 4TB drives that I rsync A/B copies to and rotate offsite. That gives me several generations as well as physical redundancy.

The iteration before that, I made multiple sets of Blu-Rays, which became unwieldy due to volume, but was write-once with multiple physical generations. I miss that, but at one point I needed to restore some files and even though I used good Verbatim media, a backup from a couple months prior was unreadable. All copies had a mottled appearance and the drive that wrote it (and verified) was unable to read it. Did finally find a drive that would read it, but finally pushed me over the edge.

I wonder how the author's 18yo media will compare to modern 5yo media. It's been a long time since we have had the rock solid Taiyo Yuden gold disks ...

[+] wuschel|1 year ago|reply
This made me smile. I have a very similar configuration. Simple but effective. The only thing that worries me is that bitrot might get me. Then again, my body will bitrot, too. So no point worrying too much about some random data in some turbulence in time.
[+] mburns|1 year ago|reply
Why not use mdisc and effectively solve the “has my cd/dvd degraded beyond the point of being readable” question entirely.
[+] UltraSane|1 year ago|reply
LTO tape is excellent for archival storage because that is what it was designed for. It uses a two layer error correction code that means it has an incredibly low bit error rate so you will still be able to read a tape that was stored correctly 40 years later. Just remember to also store a compatible drive!
[+] CartwheelLinux|1 year ago|reply
When the HD DVD vs. Blu-ray wars were going on, China had their own implementations of optical storage, and it has been evolving ever since. Much of it is undocumented in languages other than Chinese.

Companies in China use these alternative optical discs, some of which store up to 1TB of data.

The only reference I can find to it on English Wikipedia is the CBHD

https://en.wikipedia.org/wiki/China_Blue_High-definition_Dis...

[+] sigio|1 year ago|reply
1TB would be nice, but never heard of these, and the quoted wikipedia page lists:

Like HD DVD, CBHD discs have a capacity of 15 GB single-layer and 30GB dual-layer and can utilize existing DVD production lines.

So sounds relatively equivalent to bluray, which is way too small to back up modern HDDs (10TB+)

[+] xhrpost|1 year ago|reply
The quote of LTO tape being much less prone to read failures (10^-20) vaguely reminded me of an old article stating that something like 50% of tape backups fail. I'm not in that side of the industry so can't really comment as to if there is some missing nuance.

https://www.quora.com/What-percentage-of-restores-from-a-tap...
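
For a sense of scale, here is what a 10^-20 bit error rate would mean for a single read, as a quick sketch. The 18 TB figure in the usage note is just an illustrative amount of data, and the function names are invented. This only models bit errors on the medium, so restores that fail for other reasons (lost tapes, software, process) may well be part of the missing nuance.

```python
import math


def expected_bit_errors(terabytes, bit_error_rate):
    """Expected number of bad bits when reading this much data once."""
    bits = terabytes * 1e12 * 8  # decimal terabytes to bits
    return bits * bit_error_rate


def prob_at_least_one_error(terabytes, bit_error_rate):
    """Chance of hitting one or more bad bits (Poisson approximation)."""
    # expm1 keeps precision when the expected count is tiny.
    return -math.expm1(-expected_bit_errors(terabytes, bit_error_rate))
```

Reading 18 TB at that rate gives an expected error count on the order of 10^-6, so if something like half of restores really do fail, the cause is unlikely to be raw bit errors on properly stored tape.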

[+] lowbloodsugar|1 year ago|reply
That’s a lot of work.
[+] demaga|1 year ago|reply
Yes, exactly. As a data hoarder myself I've been thinking 'what data is _really_ important to me?'. And the answer is - not that much of it. The work, mental space, time, money you have to invest into storing your own data is so much effort, it is probably not worth it.
[+] dharmab|1 year ago|reply
The local external backups certainly are. S3 Deep Archive is about one evening to set up rclone and set up a regular job to run a backup.

If you don't need Linux support you can also pay Backblaze a flat fee for an easy solution, at a slightly higher price.

[+] damnitbuilds|1 year ago|reply
"It is backed up to a Raspberry Pi, also on the DMZ network but not directly accessible from the Internet."

I find it strange that a discussion of storage talks about backing things up to a Raspberry Pi, as if that means anything.

[+] codemac|1 year ago|reply
A lot of things are stated as conclusions in this article, where SOTA has reversed or in some cases invalidated the conclusions. Unfortunately they are not published, and will probably remain trade secrets for another decade.

The biggest conclusion that is invalidated is that your archival workload cannot be bin packed with your hot workloads. With the ever reducing IO/byte of HDD, this has radically changed where the bytes go.

[+] 8jef|1 year ago|reply
My recipe for large files: 3 copies. Right now, 1st copy on external 8 to 16TB NTFS desktop hard drives, and 2nd copy on 14 to 16TB internal ext4 drives. These drives I power up only for copy purposes, once a month or so. At present time, my drives are 5 to 7 years old, and still good.

Main working copies I keep on 4 to 8TB NTFS SSDs (mix of sata and nvme), plugged into a PC I'm using regularly, but intermittently.

[+] 0cf8612b2e1e|1 year ago|reply
Don’t forget the offsite storage. I try to ship an old copy every year or so to an acquaintance so I have a catastrophic recovery option.
[+] jrib|1 year ago|reply
are you concerned about something like a fire destroying all the copies?