top | item 45359693

MDGeist | 5 months ago

I always assumed the really slow tiers were tape.

derefr|5 months ago

My own assumption was always that the cold tiers are managed by a tape robot, but managing offlined HDDs rather than actual tapes.

jasonwatkinspdx|5 months ago

Yeah, I don't know about S3, but years back I talked a fair bit with someone who did storage work for HPC, and one thing he talked about was building huge JBOD arrays where only a handful of disks per rack would be spun up at a time, basically pushing the limits of what could be done with SCSI extenders and the like. It wouldn't surprise me if they're doing something like that, batch-scheduling the drive activations over a window of minutes to hours.
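
The batch-scheduling idea above can be sketched in a few lines. Everything here is illustrative (the rack size, the active-drive cap, the window length, and the function name are all assumptions, not anyone's actual design):

```python
# Hypothetical sketch: stagger drive spin-ups so at most `max_active`
# disks in a rack are powered at once, spreading a batch of activations
# over a fixed time window. All numbers are made up for illustration.

def activation_schedule(num_drives, max_active, window_minutes):
    """Return (drive_index, start_minute) pairs such that no more than
    `max_active` drives share the same start time."""
    num_slots = -(-num_drives // max_active)      # ceiling division
    slot_length = window_minutes / num_slots      # minutes per batch
    return [(d, (d // max_active) * slot_length) for d in range(num_drives)]

# 96 drives per rack, 8 powered at a time, batch spread over an hour:
sched = activation_schedule(96, 8, 60)
# drives 0-7 start at t=0, drives 8-15 at t=5 min, ..., drive 95 at t=55 min
```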

Demiurge|5 months ago

There was an article or interview with one of the lead AWS engineers, and he said they use CDs or DVDs for cold glacier.

chippiewill|5 months ago

I think that's close to the truth. IIRC it's something like a massive cluster of machines that are effectively powered off 99% of the time with a careful sharding scheme where they're turned on and off in batches over a long period of time for periodic backup or restore of blobs.

hobs|5 months ago

Not even the higher tiers of Glacier were tape afaict (at least when it was first created); it's just the observation that hard drives now hold far more data than you can reasonably access in a useful amount of time.

temp0826|5 months ago

In the early days when there were articles speculating on what Glacier was backed by, it was actually on crusty old S3 gear (and at the very beginning, it was just on S3 itself as a wrapper and a hand wavy price discount, eating the costs to get people to buy in to the idea!). Later on (2018 or so) they began moving to a home grown tape-based solution (at least for some tiers).

pjdesno|5 months ago

The “drain time” for a 30TB drive is probably between 36 and 48 hours. I don’t have one in my lab to test, or the patience to do so if I did.
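The 36-48 hour figure checks out as back-of-envelope arithmetic, assuming a sustained sequential read rate somewhere in the 180-230 MB/s range (a plausible number for large modern HDDs, though the exact rate is an assumption):

```python
# Drain time = capacity / sustained read rate.
# 30 TB here means 30 * 10^12 bytes (decimal, as drive vendors count).

CAPACITY_BYTES = 30 * 10**12

def drain_hours(mb_per_s):
    """Hours to read the whole drive at a sustained rate of mb_per_s MB/s."""
    return CAPACITY_BYTES / (mb_per_s * 10**6) / 3600

# ~46 h at 180 MB/s, ~36 h at 230 MB/s -- bracketing the 36-48 h estimate.
```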

g-mork|5 months ago

There might be surprisingly little value in going tape, given all the specialization required. As the other comment suggests, many of the lower tiers likely just represent IO bandwidth classes: a 16 TB disk with 100 IOPS shared among 100 customers can offer each of them only 1 IOPS over 160 GB, or 0.1 IOPS over 16 GB if shared among 1000. Just scale that thinking up to a building full of disks; it still applies.
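
Making the division explicit (the helper name and the even-split assumption are mine; real systems oversubscribe rather than hard-partition):

```python
# Evenly splitting one disk's capacity and IOPS among N customers.
# A naive model: real allocators oversubscribe and statistically multiplex.

def per_customer_share(capacity_tb, disk_iops, customers):
    """Return (GB of space, IOPS) each customer gets under an even split."""
    return capacity_tb * 1000 / customers, disk_iops / customers

print(per_customer_share(16, 100, 100))   # (160.0, 1.0)  -> 160 GB at 1 IOPS
print(per_customer_share(16, 100, 1000))  # (16.0, 0.1)   -> 16 GB at 0.1 IOPS
```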

smueller1234|5 months ago

I realize you're making a general point about space/IO ratios and the below is orthogonal, no contradiction.

In practice, the per-disk IO capacity you'll actually be able to "sell" to users in a large distributed storage system is a lot lower than that. There's constant maintenance churn to keep data available:

- local hardware failures
- planned larger-scale maintenance
- transient, unplanned larger-scale failures
- (etc.)

In general, you can fall back to reconstructing data from the erasure codes to keep serving during degradation. But that's (a) enormously expensive in IO and CPU, and (b) it means you're carrying higher availability and/or durability risk because you've lost redundancy.
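
A toy example of why degraded-mode reads are so IO-expensive, using single-parity coding (the simplest erasure code; production systems use Reed-Solomon, but the read fan-out is the same idea):

```python
# With one parity block per stripe, reading a block on a failed disk means
# fetching EVERY surviving block in the stripe and XORing them together:
# one logical read fans out to k physical reads.

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three disks
parity = xor_blocks(data)            # parity block on a fourth disk

# Disk holding data[1] fails; recover its block from everything else.
# The single logical read became three physical reads plus CPU work:
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```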

Additionally, it may make sense to rebalance where data lives for optimal read throughput (and other performance reasons).

So in practice, there's constant rebalancing going on in a sophisticated distributed storage system that takes a good chunk of your HDD IOPS.

This + garbage collection also makes tape really unattractive for all but very static archives.

pjdesno|5 months ago

See comments above about AWS per-request cost - if your customers want higher performance, they'll pay enough to let AWS waste some of that space and earn a profit on it.