top | item 11826615

"haven't figured out consistency yet" is the issue. Seemingly very small decisions have huge impacts that you don't expect, and the state of the art is far enough along that you don't beat the existing solutions except by exploiting newly identified workload characteristics (or acceptable ways of losing data, loosening consistency, etc., based on the workload) that you planned ahead of time to make use of.

One example: both Gluster and Ceph have erasure-coded storage. Gluster's looks just like the replicated storage, only it involves more nodes and less overhead. Ceph's is severely limited in comparison: it's append-only, it doesn't allow use of Ceph's omap kv store, "object class" embedded code, etc. The reason is that distributed EC is subject to the same kind of problem as the RAID5 write hole: if a Gluster client submits an overwrite to 3 of a 4+2 replica group and then crashes, the overwritten data is unrecoverable and the newly written data never made it.
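To make the 4+2 case concrete, here is a toy model in Python. It does no real erasure coding; it only counts fragments, on the simplifying assumption that decoding a stripe needs any 4 fragments that all belong to the same version. The function names are made up for illustration and are not Gluster or Ceph APIs.

```python
from collections import Counter

K, M = 4, 2  # 4 data + 2 coding fragments, matching the 4+2 example


def recoverable_versions(fragment_versions, k=K):
    """Return the stripe versions that still have at least k fragments.

    Simplified model: decoding needs any k fragments that all belong to
    the same version of the stripe.
    """
    counts = Counter(fragment_versions)
    return sorted(v for v, n in counts.items() if n >= k)


def overwrite_in_place(stripe, nodes_reached):
    """Client overwrites fragments in place, crashing after nodes_reached."""
    after = list(stripe)
    for node in range(nodes_reached):
        after[node] = "new"
    return after


def append_only(stripe, nodes_reached):
    """Client writes new fragments next to the old ones instead."""
    return list(stripe) + ["new"] * nodes_reached


stripe = ["old"] * (K + M)
print(recoverable_versions(overwrite_in_place(stripe, 3)))  # []
print(recoverable_versions(append_only(stripe, 3)))         # ['old']
```

With the in-place overwrite, three fragments are old and three are new, so neither version reaches four. With append-only writes the old stripe stays intact until the new one is complete, which is roughly why the append-only restriction sidesteps the problem.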

Torus won't hit that particular issue because it is log-structured to begin with, which has all kinds of advantages. But garbage collection is really hard! Much harder than seems remotely reasonable! Getting good coalescing and read performance is really hard! Much harder than seems remotely reasonable! There's one big existing log-structured distributed storage system which has discussed this publicly: Microsoft Azure. They have a few papers out which hint at the contortions they went through to make block devices work performantly, and Azure writes first to 3-replica and then destages to the log! They still had performance issues!

https://github.com/coreos/torus/blob/master/Documentation/re... points to a bunch of HDFS research and replacements; HDFS is designed for the opposite (large files with high bandwidth and nobody-cares latency) of what I presume Torus is targeting (high IO efficiency, low latency). Mostly the same for the Google papers they cite. There's no mention of Azure's storage system papers, nor of Ceph, nor anything about the not-paper-publishing-but-blogging stuff from Gluster or sheepdog; nor from academic research into VM storage systems (there's tons about this!).

Can they fix a bunch of this? Sure. But the desires they list in, e.g., https://github.com/coreos/torus/blob/master/Documentation/ar... go towards making things worse, not better. They aren't talking about how etcd can be in the allocator path but not the persistence path[1] and how every mount should run a repair on the data to deal with out-of-date headers. They talk about adding in filesystems, but not any way of supporting read-after-write (which is impossible with the primitives they describe so far, and really hard in a log-structured system without synchronous communication of some kind). They discuss network partitions between the storage nodes, and between the client and etcd; they don't discuss clients keeping access to the disks but losing it to etcd.

[1] Using etcd for allocation would be a reasonable choice, but putting it in the persistence path is not. Right now a database in your container would require two separate write streams to do an fsync:

1) data write. The docs don't say, and I didn't check, whether replication is client- or server-driven, but assuming sanity the network traffic is client->server1->server2->server1->client, with a write-to-disk happening before the server2->server1 step.

2) etcd write. Client->etcd master->etcd slaves (2+)->etcd master->client, with a disk write to each etcd process's disk before the reply to etcd master.

This is a busy-neighbor, long-tail latency disaster waiting to happen.
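A rough way to see the long-tail effect of the two write streams is a small Monte Carlo sketch. Everything numeric here is an assumption made up for illustration (0.2 ms per uncontended step, a 1% chance of a 20 ms stall per step), as is the assumption that the data write and the etcd write run back to back. Waiting for every etcd follower is also a pessimistic simplification, since Raft only needs a quorum.

```python
import random

STALL_MS = 20.0    # hypothetical stall caused by a busy neighbor
NORMAL_MS = 0.2    # hypothetical cost of an uncontended hop or disk write
STALL_PROB = 0.01  # hypothetical chance that any one step stalls


def step(rng):
    """One network hop or disk write under the made-up latency model."""
    return STALL_MS if rng.random() < STALL_PROB else NORMAL_MS


def data_write(rng):
    # client->server1->server2->server1->client: 4 hops plus one disk write
    return sum(step(rng) for _ in range(5))


def etcd_write(rng, followers=2):
    # client->leader and leader->client
    total = step(rng) + step(rng)
    # leader->follower, follower disk write, follower->leader, in parallel;
    # here the leader waits for the slowest follower
    total += max(sum(step(rng) for _ in range(3)) for _ in range(followers))
    return total


def simulate(n=10000, seed=0):
    """Return per-fsync latencies without and with the etcd round trip."""
    rng = random.Random(seed)
    data_only, with_etcd = [], []
    for _ in range(n):
        d = data_write(rng)
        data_only.append(d)
        with_etcd.append(d + etcd_write(rng))  # streams run back to back
    return data_only, with_etcd


def stalled_fraction(samples):
    """Fraction of fsyncs that hit at least one stall."""
    return sum(s >= STALL_MS for s in samples) / len(samples)


data_only, with_etcd = simulate()
print("fsyncs hitting a stall, data path only:", stalled_fraction(data_only))
print("fsyncs hitting a stall, data + etcd:   ", stalled_fraction(with_etcd))
```

Each extra hop or disk write in the fsync path is one more chance to land behind a busy neighbor, so the fraction of slow fsyncs grows with the number of steps even when each step is usually fast.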

betawaffle|9 years ago

These are very good points, but also probably much more constructive as GitHub issues, where they can be either answered or addressed. In the meantime, hopefully I can talk to a few of these:

>haven't figured out consistency yet

I don't recall this being the case, but I'm not the authority on the matter.

>I presume Torus is targeting (high IO efficiency, low latency)

The best-possible performing storage solution is _definitely_ not our primary goal, though we'll take it where we can get it. The most important goals for the project are ease of use, ease of management, flexibility, and correctness when the use-case desires it. Please note that the block device interface is only one of many planned. The underlying abstraction was designed (and will be improved) to support other situations.

>Using etcd for allocation would be a reasonable choice, but putting it in the persistence path is not. Right now a database in your container would require two separate write streams to do an fsync

In Torus today (with the block device interface specifically), and with the caveat that I'm not the authority, so I may be slightly wrong, calling sync(), fsync(), and friends results in what I think you would consider an "allocation". Writes happen against a snapshot of the file (in this block storage case, the block volume is the "file"), and then a sync() makes those changes visible as the "current" version. Syncs hit etcd, writes do not.
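The write/sync split described above can be sketched as a toy model. This is not Torus code: `FakeMetadataStore` and `Volume` are hypothetical names, and the in-memory dict only stands in for etcd so that the number of metadata commits can be counted.

```python
class FakeMetadataStore:
    """Hypothetical stand-in for etcd: tracks the current version per volume."""

    def __init__(self):
        self.current = {}
        self.commits = 0

    def publish(self, volume, version):
        self.current[volume] = version
        self.commits += 1


class Volume:
    """Toy block volume: writes land in a private snapshot until sync()."""

    def __init__(self, name, meta):
        self.name = name
        self.meta = meta
        self.committed = {}  # block index -> data visible as "current"
        self.pending = {}    # writes made since the last sync
        self.version = 0

    def write(self, block, data):
        # Writes touch only the snapshot; no metadata-store traffic.
        self.pending[block] = data

    def sync(self):
        # Publishing the snapshot is the only step that hits the store.
        self.committed.update(self.pending)
        self.pending.clear()
        self.version += 1
        self.meta.publish(self.name, self.version)

    def read_current(self, block):
        return self.committed.get(block)


meta = FakeMetadataStore()
vol = Volume("vol0", meta)
vol.write(0, b"hello")
vol.write(1, b"world")
print(meta.commits, vol.read_current(0))  # 0 None
vol.sync()
print(meta.commits, vol.read_current(0))  # 1 b'hello'
```

In this model any number of writes costs zero metadata commits, and each sync costs exactly one, which is the "syncs hit etcd, writes do not" behavior as I read it.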

I would really encourage you to submit feedback like this in GitHub issues. The project is still in _very_ early stages, and legitimate feedback can actually make a difference.