gregsfortytwo | 9 years ago
* Single reader/writer. If you try to set up more, Bad Things happen (even if it's single-writer, you don't get read-after-write consistency).
* It sure sounds like network partitions will allow all kinds of badness.
* If copy 1 goes down, you can keep operating on copy 2, then lose copy 2, have copy 1 come back up, and then warp back in time? Maybe that's prevented because of append-only and you just lose the data entirely because it was only replicating to one node due to the "temporary" failure.
Most egregiously, the Architecture description of an INode implies that persisting a write to disk requires persisting the INode. INodes are persisted in etcd. Which means your entire cluster's write-to-disk throughput is limited to what you can push through a Raft consensus algorithm.
Look, there are all kinds of reasons one could legitimately decide that none of the existing scalable block storage systems satisfy your use case. Maybe containers really are different enough from VMs. But the blog post claims that there just aren't any solutions; the research papers cited in the Documentation page are mostly old and are about systems in a very different part of this sub-space; and what developer documentation exists does not encourage me that this is a good idea.
Granted, I've been working on Ceph for 7 years and am a bit of a snob as a result.
zzzcpan | 9 years ago
Good work on Ceph, by the way. I've been following your work since it was a PhD if I remember correctly.
gregsfortytwo | 9 years ago
One example: Both Gluster and Ceph have erasure-coded storage. Gluster's looks just like the replicated storage, only it involves more nodes and lower overhead. Ceph's is severely limited in comparison: it's append-only, it doesn't allow use of Ceph's omap kv store, "object class" embedded code, etc. The reason is that distributed EC is subject to the same kind of problem as the RAID5 write hole: if a Gluster client submits an overwrite to 3 of a 4+2 replica group and then crashes, the overwritten data is unrecoverable and the newly-written data never made it.
Torus won't hit that particular issue because it is log-structured to begin with, which has all kinds of advantages. But garbage collection is really hard! Much harder than seems remotely reasonable! Getting good coalescing and read performance is really hard! Much harder than seems remotely reasonable! There's one big existing log-structured distributed storage system which has discussed this publicly: Microsoft Azure. They have a few papers out which hint at the contortions they went through to make block devices work performantly, and Azure writes first to 3-replica and then destages to the log! They still had performance issues!
https://github.com/coreos/torus/blob/master/Documentation/re... points to a bunch of HDFS research and replacements; HDFS is designed for the opposite (large files with high bandwidth and nobody-cares latency) of what I presume Torus is targeting (high IO efficiency, low latency). Mostly the same for the Google papers they cite. There's no mention of Azure's storage system papers, nor of Ceph, nor anything about the not-paper-publishing-but-blogging stuff from Gluster or sheepdog; nor from academic research into VM storage systems (there's tons about this!).
Can they fix a bunch of this? Sure. But the desires they list in e.g. https://github.com/coreos/torus/blob/master/Documentation/ar... go towards making things worse, not better. They aren't talking about how etcd can be in the allocator path but not the persistence path[1] and how every mount should run a repair on the data to deal with out-of-date headers. They talk about adding in filesystems, but not any way of supporting read-after-write (which is impossible with the primitives they describe so far, and really hard in a log-structured system without synchronous communication of some kind). They discuss network partitions between the storage nodes, and between the client and etcd; they don't discuss clients keeping access to the disks but losing it to etcd.
[1] Using etcd for allocation would be a reasonable choice, but putting it in the persistence path is not. Right now a database in your container would require two separate write streams to do an fsync:
1) data write. The docs don't say, and I didn't check, whether replication is client- or server-driven, but assuming sanity the network traffic is client->server1->server2->server1->client, with a write-to-disk happening before the server2->server1 step.
2) etcd write. Client->etcd master->etcd slaves (2+)->etcd master->client, with a disk write to each etcd process' disk before the reply to etcd master.
This is a busy-neighbor, long-tail latency disaster waiting to happen.