> Is hardware agnostic and uses TCP/IP to communicate.
So no RDMA?
It's very hard to make effective use of modern NVMe drive bandwidth over TCP/IP.
> A logical shard is further split into five physical instances, one leader and four followers, in a typical distributed consensus setup. The distributed consensus engine is provided by a purpose-built Raft-like implementation, which we call LogsDB
Raft-like, so not Raft, a custom algorithm?
Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
> Read/write access to the block service is provided using a simple TCP API currently implemented by a Go process. This process is hardware agnostic and uses the Go standard library to read and write blocks to a conventional local file system. We originally planned to rewrite the Go process in C++, and possibly write to block devices directly, but the idiomatic Go implementation has proven performant enough for our needs so far.
The document mentions it's designed to reach TB/s though.
Which means that for an IO intensive workload, one would end up wasting a lot of drive bandwidth, and require a huge number of nodes.
Modern parallel filesystems can reach 80-90GB/s per node, using RDMA, DPDK etc.
> This is in contrast to protocols like NFS, whereby each connection is very stateful, holding resources such as open files, locks, and so on.
This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
No mention of the way this was developed and tested - does it use some formal methods, simulator, chaos engineering etc?
We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it’s just a transport layer change) but right now we didn’t need to. We’ve had to make some difficult prioritisation decisions here.
PRs welcome!
> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
We’re used to building things like this, trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated, right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.
If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.
> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.
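To make the duplicate request cache idea concrete, here is a small sketch. It is an illustration under assumptions, not how NFSv3 servers or TernFS implement it: `requestID`, `counterServer` and `Increment` are hypothetical names, and the cache here is unbounded, whereas a real one would evict old entries. The operation (increment) is deliberately non-idempotent, so a retransmitted request must be answered from the cache rather than applied again.

```go
package main

import (
	"fmt"
	"sync"
)

// requestID is a hypothetical client-chosen identifier, unique per logical
// request and reused verbatim when the client retransmits.
type requestID uint64

// counterServer applies a non-idempotent operation (increment) and remembers
// the reply for each request ID so that retries are not applied twice.
type counterServer struct {
	mu      sync.Mutex
	value   int
	replies map[requestID]int
}

func newCounterServer() *counterServer {
	return &counterServer{replies: make(map[requestID]int)}
}

// Increment bumps the counter once per distinct request ID and returns the
// value seen by that request. A duplicate ID gets the cached reply.
func (s *counterServer) Increment(id requestID) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.replies[id]; ok {
		return v // retransmission: replay the original reply
	}
	s.value++
	s.replies[id] = s.value
	return s.value
}

func main() {
	s := newCounterServer()
	fmt.Println(s.Increment(1)) // 1
	fmt.Println(s.Increment(1)) // 1 (retry of the same request)
	fmt.Println(s.Increment(2)) // 2
}
```

If every request were idempotent by construction, the `replies` map (and the state it forces the server to keep) would not be needed, which is the reward mentioned above.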
Not to mention you simply want a large distributed system implemented in multiple clouds / on prems / use cases, with battle tested procedures on node failure, replacement, expansion, contraction, backup/restore, repair/verification, install guides, an error "zoo".
Not to mention a Jepsen test suite, detailed CAP tradeoff explanation, etc.
There's a reason those big DFS at the FAANGs aren't really implemented anywhere else: they NEED the original authors with a big, deeply experienced infrastructure/team in house.
Out of curiosity, you seem knowledgeable here, is it possible to do NVME over RDMA in public cloud (e.g., on AWS)? I was recently looking into this and my conclusion was no, but I'd love to be wrong :)
Sounds more like an object system (immutable) with the veneer of a file system for their use cases. I sort of read the doc - sounds like data is replicated and not erasure encoded (so perhaps more expensive?).
I think many people have said this, but "file systems" get a lot easier if you don't have to worry about overwrites, appends, truncates, etc. Anyway, always interesting to see what people come up with for their use cases.
Append-only is pretty much the only way to do robust replicated storage at scale, else you get into scenarios where, instead of a given logical block having two possible states (either existing somewhere or not existing anywhere), the block can exist with multiple values at different times, including an unbounded number of invalid ones, for instance in case a client died halfway through a block mutation. Immutability is just plain a very strong invariant.
It also does not at all preclude implementing a read-write layer on top of it, for instance with a log-structured FS design. That's however the solution to a problem these people are, it seems, not having.
Over 500PB of data, wow. Would love to know how and why "statistical models that produce price forecasts for over 50,000 financial instruments worldwide" require that much storage.
I would imagine, to a lesser extent, government policy changes and news articles, and to a larger extent online discussions on topics relevant to these instruments. Models then attempt to extract signals with predictive value from all the noise. It probably contains a non-trivial amount of history to correlate words to market performance in the past, say 20 years or more.
But it's really just a guess, I haven't worked in this domain.
I have worked on exabyte-scale storage engines. There is a good engineering reason for this type of limitation.
If you had 1 KiB average file size then you have quadrillions of metadata objects to quickly search and manage with fine-granularity. The kinds of operations and coordination you need to do with metadata is difficult to achieve reliably when the metadata structure itself is many PB in size. There are interesting edge cases that show up when you have to do deep paging of this metadata off of storage. Making this not slow requires unorthodox and unusual design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts of conventional architectures we assume will always fit in memory.
A mere trillion objects is right around the limit of where the allocators, metadata, etc can be made to scale with heroic efforts before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so avoiding that design frontier makes a lot of sense if you can avoid it.
It is possible to break this barrier but it introduces myriad interesting design and computer science problems for which there is little literature.
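A back-of-the-envelope check of the numbers above, as a sketch. The 1 EiB capacity and the 128 bytes of metadata per object are assumptions picked for illustration (the comment only says "exabyte-scale" and "many PB"); `objectCount` and `metadataBytes` are hypothetical helper names.

```go
package main

import "fmt"

const (
	KiB = uint64(1) << 10
	PiB = uint64(1) << 50
	EiB = uint64(1) << 60
)

// objectCount returns how many objects of avgSize bytes fit in capacity bytes.
func objectCount(capacity, avgSize uint64) uint64 {
	return capacity / avgSize
}

// metadataBytes returns the total metadata size, assuming a fixed per-object
// metadata footprint (an assumption; real systems vary widely).
func metadataBytes(objects, perObject uint64) uint64 {
	return objects * perObject
}

func main() {
	objects := objectCount(EiB, KiB)
	fmt.Println(objects) // 1125899906842624, i.e. about 1.1 quadrillion

	meta := metadataBytes(objects, 128)
	fmt.Println(meta/PiB, "PiB of metadata") // 128 PiB of metadata
}
```

So each exabyte of 1 KiB files means roughly a quadrillion objects, and under these assumptions the metadata alone is on the order of 100 PiB, far beyond what fits in memory.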
(Disclaimer: I'm one of the authors of TernFS and while we evaluated Ceph I am not intimately familiar with it)
Main factors:
* Ceph stores both metadata and file contents using the same object store (RADOS). TernFS uses a specialized database for metadata which takes advantage of various properties of our datasets (immutable files, few moves between directories, etc.).
* While Ceph is capable of storing PBs, we currently store ~600PBs on a single TernFS deployment. Last time we checked this would be an order of magnitude more than even very large Ceph deployments.
* More generally, we wanted a system that we knew we could easily adapt to our needs and more importantly quickly fix when something went wrong, and we estimated that building out something new rather than adapting Ceph (or some other open source solution) would be less costly overall.
The seamless realtime intercontinental replication is a key feature for us, maybe the most important single feature, and AFAIK you can’t do that with Ceph (even if Ceph could scale to our original 10 exabyte target in one instance).
CephFS implements a (fully?) POSIX filesystem while it seems that TernFS makes tradeoffs by losing permissions and mutability for further scale.
Their docs mention they have a custom kernel module, which I suppose is (today) shipped out of tree. Ceph is in-tree and also has a FUSE implementation.
The docs mention that TernFS also has its own S3 gateway, while RADOSGW is fully separate from CephFS.
Ceph isn't that well suited for high performance. It's also young and more complex than you'd want it to be (i.e. you get a block storage system, which you then have to put an FS layer on afterwards).
If you want performance, then you'll probably want Lustre, or GPFS, or if you're rich a massive Isilon system.
It'd be helpful to have a couple of usage examples that illustrate common operations, like creating a file or finding and reading one, right after the high-level overview section. Just to get an idea what happens at the service level in these cases.
Yes, that would be very useful, we just didn't get to it and we didn't want perfect to be the enemy of good, since otherwise we would have never open sourced :).
But if we have the time it would definitely be a good addition to the docs.
Is anyone else bored of seeing the endless line of anti-human-scale distributed filesystems?
It's like the engineers building them keep trying to scratch their own itch for a better filesystem that could enable seamless cross-device usage, collaboration, etc. But the engineers only get paid if they express themselves in terms of corporate desires, and corpos aren't looking to pay them to solve those hard problems. So they solve the horizontal scaling problem for the thousandth time, but only end up creating things that require a full time engineer (or perhaps even a whole team) to use. Hooray, another centralizing "distributed" filesystem.
Great to see another distributed file system open sourced! It has some interesting design decisions.
Have a couple of questions:
- How do you go about benchmarking throughput / latency of such a system? Curious if it's different compared to how other distributed filesystems benchmark their systems.
- Is network or storage the bottleneck for nodes (at least for throughput)?
- From my observations from RDMA-based distributed filesystems, network seems to be the case.
- How does the system respond to rand / seq + reads / writes? A lot of systems struggle to scale writes. Does this matter for what workload TernFS is designed for?
- Very very interesting to go down the path of writing a kernel module instead of using FUSE or writing a native client in userspace (referring to 3FS [1])
- Any crashes in production? And how do you go about tracking it down?
- What's the difference in performance between using the kernel module versus using FUSE?
> Most of the metadata activity is contained within a single shard:
>
> - File creation, same-directory renames, and deletion.
> - Listing directory contents.
> - Getting attributes of files or directories.
I guess this is a trade-off between a file system and an object store? As in S3, ListObjects() is a heavy hitter and there can be potentially billions of objects under any prefix. Scanning only on a single instance won't be sufficient.
It's definitely a different use case but given they haven't had to tap into their follower replicas for scale, it must be pretty efficient and lightweight. I suspect not having ACLs helps. They also cite a 2MB median file size, so they're not expecting exabytes of tiny files.
I wonder if a major difference is listing a prefix in object storage vs performing recursive listings in a file system?
Even in S3, performing very large lists over a prefix is slow and small files will always be slow to work with, so regular compaction and caching file names is usually worthwhile.
> There's a reason why every major tech company has developed its own distributed filesystem
I haven't worked at FAANG, but is this a well-known fact? I've never heard of it. Unless they're referring to things like S3? Are these large corps running literal custom filesystem implementations?
Not sure about everyone, but probably yes. Imo the killer feature is durability and replication. You can use single disks if you don't care about data loss, but once you need to start replicating data you need a distributed filesystem.
Tectonic is Facebook's, Google's is Colossus. I'm not sure about the others.
TernFS is Free Software. The default license for TernFS is GPL-2.0-or-later.
The protocol definitions (go/msgs/), protocol generator (go/bincodegen/) and client library (go/client/, go/core/) are licensed under Apache-2.0 with the LLVM-exception. This license combination is both permissive (similar to MIT or BSD licenses) as well as compatible with all GPL licenses. We have done this to allow people to build their own proprietary client libraries while ensuring we can also freely incorporate them into the GPL v2 licensed Linux kernel.
> The firm started out with a couple of desktops and an NFS server, and 10 years later ended up with tens of thousands of high-end GPUs, hundreds of thousands of CPUs, and hundreds of petabytes of storage.
So many resources for producing nothing of real value. What a waste.
Great project though, appreciate open sourcing it.
In theory, what they are doing of value is that at any time you can go to an exchange and say "I want to buy x" or "I want to sell y" and someone will buy it from you or sell it to you... at a price that's likely to be the accurate price.
At the extreme if nobody was providing this service, investors (e.g. pension funds), wouldn't be confident that they can buy/sell their assets as needed in size and at the right price... and because of that, in aggregate stocks would be worth less, and companies wouldn't be able to raise as much capital.
The theoretical model is:
- You want to have efficient primary markets that allow companies to raise a lot of assets at the best possible prices
- To enable efficient primary markets, investors want efficient secondary markets (so they don't need to buy and hold forever, but feel they can sell)
- To enable efficient secondary markets, you need many folks that are in the business of XTX
... it just so happens that XTX is quite good at it, and so they do a lot of this work.
If you're storing the blocks in one place, it's not decentralised.
The metadata would be crucial for performance, and given that I assume you'll want a full chain of history for every file, your metadata table will get progressively bigger every time you do any kind of metadata operation.
Plus you can only have one person write metadata at one time, so you're gonna get huge head-of-line blocking.
Ha ha, I forecast, SPY goes up, and I’ve already made more money than XTX or any of its clients…
Look I like technology as much as anyone. Improbable spent $500 million on product development, and its most popular product is its grpc-web client. It didn't release any of its exotic technology. You could also go and spend that money on making $500m of games without any exotic technology, and also make it open source.
charleshn|5 months ago
> Is hardware agnostic and uses TCP/IP to communicate.
So no RDMA? It's very hard to make effective use of modern NVMe drive bandwidth over TCP/IP.
> A logical shard is further split into five physical instances, one leader and four followers, in a typical distributed consensus setup. The distributed consensus engine is provided by a purpose-built Raft-like implementation, which we call LogsDB
Raft-like, so not Raft, a custom algorithm? Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
> Read/write access to the block service is provided using a simple TCP API currently implemented by a Go process. This process is hardware agnostic and uses the Go standard library to read and write blocks to a conventional local file system. We originally planned to rewrite the Go process in C++, and possibly write to block devices directly, but the idiomatic Go implementation has proven performant enough for our needs so far.
The document mentions it's designed to reach TB/s though. Which means that for an IO intensive workload, one would end up wasting a lot of drive bandwidth, and require a huge number of nodes.
Modern parallel filesystems can reach 80-90GB/s per node, using RDMA, DPDK etc.
> This is in contrast to protocols like NFS, whereby each connection is very stateful, holding resources such as open files, locks, and so on.
This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
No mention of the way this was developed and tested - does it use some formal methods, simulator, chaos engineering etc?
jleahy|5 months ago
We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it’s just a transport layer change) but right now we didn’t need to. We’ve had to make some difficult prioritisation decisions here.
PRs welcome!
> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
We’re used to building things like this, trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated, right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.
If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.
> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.
AtlasBarfed|5 months ago
Not to mention a Jepsen test suite, detailed CAP tradeoff explanation, etc.
There's a reason those big DFS at the FAANGs aren't really implemented anywhere else: they NEED the original authors with a big, deeply experienced infrastructure/team in house.
foota|5 months ago
harshaw|5 months ago
I think many people have said this, but "file systems" get a lot easier if you don't have to worry about overwrites, appends, truncates, etc. Anyway, always interesting to see what people come up with for their use cases.
rostayob|5 months ago
Balinares|5 months ago
It also does not at all preclude implementing a read-write layer on top of it, for instance with a log-structured FS design. That's however the solution to a problem these people are, it seems, not having.
rickette|5 months ago
guerby|5 months ago
taneliv|5 months ago
But it's really just a guess, I haven't worked in this domain.
Beijinger|5 months ago
Have you seen their portfolio?
PS: Company seems legit. Impressive growth. But I still don't understand what they are doing. Provide "electronic liquidity". Well....
mrbluecoat|5 months ago
> TernFS should not be used for tiny files — our median file size is 2MB.
jandrewrogers|5 months ago
If you had 1 KiB average file size then you have quadrillions of metadata objects to quickly search and manage with fine-granularity. The kinds of operations and coordination you need to do with metadata is difficult to achieve reliably when the metadata structure itself is many PB in size. There are interesting edge cases that show up when you have to do deep paging of this metadata off of storage. Making this not slow requires unorthodox and unusual design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts of conventional architectures we assume will always fit in memory.
A mere trillion objects is right around the limit of where the allocators, metadata, etc can be made to scale with heroic efforts before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so avoiding that design frontier makes a lot of sense if you can avoid it.
It is possible to break this barrier but it introduces myriad interesting design and computer science problems for which there is little literature.
Eikon|5 months ago
I initially developed it for a usecase where I needed to store billions of tiny files, and it just requires a single s3 bucket as infrastructure.
heipei|5 months ago
pandemic_region|5 months ago
stonogo|5 months ago
ttfvjktesd|5 months ago
rostayob|5 months ago
Main factors:
* Ceph stores both metadata and file contents using the same object store (RADOS). TernFS uses a specialized database for metadata which takes advantage of various properties of our datasets (immutable files, few moves between directories, etc.).
* While Ceph is capable of storing PBs, we currently store ~600PBs on a single TernFS deployment. Last time we checked this would be an order of magnitude more than even very large Ceph deployments.
* More generally, we wanted a system that we knew we could easily adapt to our needs and more importantly quickly fix when something went wrong, and we estimated that building out something new rather than adapting Ceph (or some other open source solution) would be less costly overall.
jleahy|5 months ago
cmdrk|5 months ago
Their docs mention they have a custom kernel module, which I suppose is (today) shipped out of tree. Ceph is in-tree and also has a FUSE implementation.
The docs mention that TernFS also has its own S3 gateway, while RADOSGW is fully separate from CephFS.
KaiserPro|5 months ago
If you want performance, then you'll probably want Lustre, or GPFS, or if you're rich a massive Isilon system.
eps|5 months ago
It'd be helpful to have a couple of usage examples that illustrate common operations, like creating a file or finding and reading one, right after the high-level overview section. Just to get an idea what happens at the service level in these cases.
bitonico|5 months ago
But if we have the time it would definitely be a good addition to the docs.
mindslight|5 months ago
It's like the engineers building them keep trying to scratch their own itch for a better filesystem that could enable seamless cross-device usage, collaboration, etc. But the engineers only get paid if they express themselves in terms of corporate desires, and corpos aren't looking to pay them to solve those hard problems. So they solve the horizontal scaling problem for the thousandth time, but only end up creating things that require a full time engineer (or perhaps even a whole team) to use. Hooray, another centralizing "distributed" filesystem.
maknee|5 months ago
Have a couple of questions:
- How do you go about benchmarking throughput / latency of such a system? Curious if it's different compared to how other distributed filesystems benchmark their systems.
- Is network or storage the bottleneck for nodes (at least for throughput)?
- How does the system respond to rand / seq + reads / writes? A lot of systems struggle to scale writes. Does this matter for what workload TernFS is designed for?
- Very very interesting to go down the path of writing a kernel module instead of using FUSE or writing a native client in userspace (referring to 3FS [1])
[1] https://github.com/deepseek-ai/3FS/blob/main/docs/design_not...
hintymad|5 months ago
I guess this is a trade-off between a file system and an object store? As in S3, ListObjects() is a heavy hitter and there can be potentially billions of objects under any prefix. Scanning only on a single instance won't be sufficient.
jeffinhat|5 months ago
I wonder if a major difference is listing a prefix in object storage vs performing recursive listings in a file system?
Even in S3, performing very large lists over a prefix is slow and small files will always be slow to work with, so regular compaction and caching file names is usually worthwhile.
chatmasta|5 months ago
I haven't worked at FAANG, but is this a well-known fact? I've never heard of it. Unless they're referring to things like S3? Are these large corps running literal custom filesystem implementations?
foota|5 months ago
Tectonic is Facebook's, Google's is Colossus. I'm not sure about the others.
allset_|5 months ago
https://cloud.google.com/blog/products/storage-data-transfer...
loeg|5 months ago
mdaniel|5 months ago
coolspot|5 months ago
TernFS is Free Software. The default license for TernFS is GPL-2.0-or-later.
The protocol definitions (go/msgs/), protocol generator (go/bincodegen/) and client library (go/client/, go/core/) are licensed under Apache-2.0 with the LLVM-exception. This license combination is both permissive (similar to MIT or BSD licenses) as well as compatible with all GPL licenses. We have done this to allow people to build their own proprietary client libraries while ensuring we can also freely incorporate them into the GPL v2 licensed Linux kernel.
hardwaregeek|5 months ago
rickette|5 months ago
Balinares|5 months ago
rostayob|5 months ago
d12bb|5 months ago
So many resources for producing nothing of real value. What a waste.
Great project though, appreciate open sourcing it.
candiddevmike|5 months ago
EugeneG|5 months ago
At the extreme if nobody was providing this service, investors (e.g. pension funds), wouldn't be confident that they can buy/sell their assets as needed in size and at the right price... and because of that, in aggregate stocks would be worth less, and companies wouldn't be able to raise as much capital.
The theoretical model is:
- You want to have efficient primary markets that allow companies to raise a lot of assets at the best possible prices
- To enable efficient primary markets, investors want efficient secondary markets (so they don't need to buy and hold forever, but feel they can sell)
- To enable efficient secondary markets, you need many folks that are in the business of XTX
... it just so happens that XTX is quite good at it, and so they do a lot of this work.
scandox|5 months ago
miovoid|5 months ago
rob_c|5 months ago
Beijinger|5 months ago
jleahy|5 months ago
sreekanth850|5 months ago
nunobrito|5 months ago
bananapub|5 months ago
arcade79|5 months ago
VikingCoder|5 months ago
jleahy|5 months ago
YouAreWRONGtoo|5 months ago
[deleted]
771753|5 months ago
eigenvalue|5 months ago
MadnessASAP|5 months ago
No need for an underpinning, it is the underpinning.
KaiserPro|5 months ago
The metadata would be crucial for performance, and given that I assume you'll want a full chain of history for every file, your metadata table will get progressively bigger every time you do any kind of metadata operation.
Plus you can only have one person write metadata at one time, so you're gonna get huge head-of-line blocking.
mrtesthah|5 months ago
doctorpangloss|5 months ago
Look I like technology as much as anyone. Improbable spent $500 million on product development, and its most popular product is its grpc-web client. It didn't release any of its exotic technology. You could also go and spend that money on making $500m of games without any exotic technology, and also make it open source.