top | item 38933665

(no title)

x1f604 | 2 years ago

> Consider a company that stores users’ emails in the cloud — that is, on a vast array of servers. You can think of the whole collection of emails as one long message. Now suppose one server crashes. With a Reed-Solomon code, you’d need to perform a massive computation involving all the encoded data to recover your emails from that one lost server. “You would have to look at everything,” said Zeev Dvir, a computer scientist at Princeton University. “That could be billions and billions of emails — it could take a really long time.”

I have to take issue with the above characterization. It seems to imply that a server crash means the user has to wait for the data to be reconstructed, or that it will necessarily take a long time for the data to be reconstructed. But I don't think either of these claims are true in the general case.

We can look at Backblaze for a real world example of how an actual file storage company uses Reed-Solomon for error correction:

> Every file uploaded to a Backblaze Vault is broken into pieces before being stored. Each of those pieces is called a “shard.” Parity shards are added to add redundancy so that a file can be fetched from a Backblaze Vault even if some of the pieces are not available.

> Each file is stored as 20 shards: 17 data shards and three parity shards. Because those shards are distributed across 20 storage pods in 20 cabinets, the Backblaze Vault is resilient to the failure of a storage pod, power loss to an entire cabinet, or even a cabinet-level networking outage.

> Files can be written to the Backblaze Vault when one pod is down, and still have two parity shards to protect the data. Even in the extreme and unlikely case where three storage pods in a Backblaze Vault are offline, the files in the vault are still available because they can be reconstructed from the 17 pieces that are available.

So BackBlaze splits each file into 20 shards, with 3 of those being parity shards so that only 17 out of 20 shards are necessary to reconstruct the original file.

Regardless of whether you store each email in a separate file, or if you store all your emails in one giant file, the point is that your emails will be divided into 20 pieces across 20 separate physical machines, so that the loss of any one machine (or even an entire cabinet) will not impact your access to your emails.

I would be extremely surprised if any real company that was actually in the business of storing user data (e.g. AWS, Azure, GCP, Backblaze etc) would store user data in such a way that the crash of a single server would require a "really long time" for the user data to be recovered. Rather, I think it's most likely that the loss of a single server should not have any noticeable impact on the time that it takes for a user to access the data that was stored on that server.

As for the second claim, I don't think it should take "a really long time" to recover even billions of emails. I know that (depending on the parameters) the Intel ISA-L Reed-Solomon implementation can achieve a throughput of multiple GB/s on a single core. So even if you were storing all your emails in a single, really huge file that was tens of gigabytes in size, it still shouldn't take more than a few minutes to recover it from the available shards and to regenerate the shard that was lost.

discuss

22c|2 years ago

I agree that it's a strange example to give. This would also be like saying imagine you had a single Reed-Solomon code for your entire hard drive. That would indeed be very painful to recover data, but we don't have a single Reed-Solomon code for hard drives. You'd pick a block size that is suitable for your application.

vaidhy|2 years ago

There is theoretical inefficiency and practical inefficiency. Something might be O(n^3).. but if your n is small (as in backblaze case, where you do it on a file by file basis, rather than for your filesystem), it is still useful.

In other cases, your optimal algorithm might have a large constant cost (setup cost etc) which for small n might make it practically inefficient. n^2+c1 and n^3 + c2, but c2 >>> c1 happens a lot.

shermantanktop|2 years ago

The article offered that example as an extreme, impractical, but easy-to-imagine case to show the utility of using codes over smaller data segments. I read this article as a discussion about data entropy, data encoding, and information theory.

Nowhere did they suggest that concatenating zillions of emails could be a real world system, or that such a system would be good or practical, or that any actual real system used this approach.

What you describe with Backblaze is using redundant storage to sidestep the problem, so it's apples and oranges.

epcoa|2 years ago

Sidestep what problem? Backblaze is a practical application of Reed-Solomon coding. And the article text is " With a Reed-Solomon code, you’d need to perform a massive computation involving all the encoded data to recover your emails from that one lost server. " How is it apples and oranges?

Reed-Solomon coding is redundant, that's the whole point.

rcxdude|2 years ago

This would be true if you were to optimize for the very extreme case of running an error correction code over all of your data at once. This would give you the absolute best case tradeoff between redundancy and data storage, but would be completely intractable to actually compute, which is the point they are making. In practice error correction is used over smaller fragments of data, which is tractable but also doesn't give you as good a tradeoff (i.e. you need to spend more extra space to get the same level of redundancy). From what I understand one of the appeals of the codes mentioned in the article is that it might be tractable to use them in the manner described, in which case you might only need, say 3 extra servers out of thousands in order to lose any three, as opposed to 3 extra out of 20. But it seems like it is not likely.

(In practice, I would say existing error correction codes already get you very close to the theoretical limit of this tradeoff already. The fact that these 'magical' codes don't work is not so much of a loss in comparison. While they would perhaps be better, they would not be drastically better).

Joker_vD|2 years ago

Does it mean that when Blackblaze needs to retrieve me my file, it has to issue 20 parallel network requests, wait for at least 17 of them to complete, then combine the responses into the requested file and only then it can start streaming it to me? That seems kinda bad for latency.

Twirrim|2 years ago

Yes, you pay a cost for latency, but you get a phenomal amount of durability at much lower stretch factor.

If they make sure they no two shards occupy the same hard disk, they could lose up to three hard disks with your data shared on it and still be able to recreate it. Even if they lose just one, they can immediately reproduce that now missing shard from what they already have. So really you'd need to talk losing 4 hard disks, each with a shard on, nearly simultaneously.

So that's roughly the same durability as you'd get storing 4 copies of the same file. Except in this case it's storing just 1.15x the size of the original file (20:17 ratio). So for every megabyte you store, you need 1.15 megabytes of space instead of 4 megabytes.

The single biggest cost for storage services is not hardware, it's the per rack operational costs, by a long, long stretch. Erasure encoding is the current best way to keep that stretch factor low, and costs under control.

If you think about the different types of storage needs there are, and access speed desires, it's even practical to use much higher ratios. You could, for example, choose 40:34 and get similar resilience to as if you had 8x copies of the file, while still at a 1.15x stretch factor. You just have that draw back of needing to fetch 34 shards at access time. If you want to keep that 4x resilience that could be 40:36 which nets you a nice 1.11x stretch factor. If you had just 1 petabyte of storage, that 0.03 savings would be 30 terabytes, a good chunk of a single server.

22c|2 years ago

No, you are confusing file retrieval with file recovery. The reconstruction only needs to happen if some form corruption is detected (typically in the case of a bad/dead disk).

x1f604|2 years ago

I don't know exactly how Backblaze does it, but in the normal case, reconstruction is not computationally expensive because the 17 data shards are just pieces of the original file that can be served directly to users.

It's only when a data shard is lost that computation is necessary to regenerate it using the parity shards.

sargun|2 years ago

This is actually better for latency, perhaps counter intuitively. Let’s say that each server experiences some high latency requests. Normally, if one server stored that file, you’d get high latency, this scheme on the other hand cuts down on overall latency

mnw21cam|2 years ago

The requests are parallel and therefore complete in the same(ish) amount of time as a single request, so the latency isn't increased.