Every time someone builds one of these things and skips over "overcomplicated theory", aphyr destroys them. At this point, I wonder if we could train an AI to look over a project's documentation and predict whether it's likely to lose committed writes just based on the marketing / technical claims. We probably can.
People always think "theory is overrated" or "hacking is better than having a school education", and then proceed to shoot themselves in the foot with "workarounds" that break well-known, well-documented, well-traversed problem spaces.
The only post in this thread that actually summarized the core findings of the study, namely:
- ACKed messages can be silently lost due to minority-node corruption.
- A single-bit corruption can cause some replicas to lose up to 78% of stored messages.
- Snapshot corruption can propagate and lead to entire-stream deletion across the cluster.
- The default lazy-fsync mode can drop minutes of acknowledged writes on a crash.
- A crash combined with network delay can cause persistent split-brain and divergent logs.
- Data loss can occur even with “sync_interval = always” in the presence of membership changes or partitions.
- Self-healing and replica convergence did not always work reliably after corruption.
…was not downvoted, but flagged. That is telling: documented failure modes are apparently controversial. It also raises the question: what level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?
So what is next? Nominate NATS for the Silent Failure Peace Prize?
You can have DeepWiki literally scan the source code and tell you:
> 2. Delayed Sync Mode (Default)
> In the default mode, writes are batched and marked with needSync = true for later synchronization filestore.go:7093-7097. The actual sync happens during the next syncBlocks() execution.
However, if you read DeepWiki's conclusion, it is far more optimistic than what Aphyr uncovered in real-world testing.
> Durability Guarantees
> Even with delayed fsyncs, NATS provides protection against data loss through:
> 1. Write-Ahead Logging: Messages are written to log files before being acknowledged
> 2. Periodic Sync: The sync timer ensures data is eventually flushed to disk
> 3. State Snapshots: Full state is periodically written to index.db files filestore.go:9834-9850
> 4. Error Handling: If sync operations fail, NATS attempts to rebuild state from existing data filestore.go:7066-7072
https://deepwiki.com/search/will-nats-lose-uncommitted-wri_b...
It's not even "overcomplicated theory", it's just "commit your writes before you say you committed your writes". It's actually way, way more complicated to build a system that tries to be correct without doing that.
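To make the distinction concrete, here is a minimal Go sketch of the two orderings (function names are invented for illustration; this is not NATS's actual code):

    package main

    import "os"

    // ackAfterFsync is the boring-but-correct ordering: the caller only
    // hears "committed" once the bytes have reached stable storage.
    func ackAfterFsync(f *os.File, msg []byte, ack func()) error {
        if _, err := f.Write(msg); err != nil {
            return err
        }
        if err := f.Sync(); err != nil { // fsync before acknowledging
            return err
        }
        ack() // a crash after this point cannot lose the message
        return nil
    }

    // ackBeforeFsync is the lazy ordering: a crash in the window between
    // ack() and the next periodic fsync silently loses an acknowledged write.
    func ackBeforeFsync(f *os.File, msg []byte, ack func()) error {
        if _, err := f.Write(msg); err != nil {
            return err
        }
        ack() // "committed" before the data is durable
        // ...a background timer is expected to call f.Sync() later...
        return nil
    }

    func main() {
        f, err := os.CreateTemp("", "wal")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        _ = ackAfterFsync(f, []byte("msg"), func() { println("acked durably") })
    }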
For anyone dealing with databases, and especially distributed databases, I highly recommend reading the Jepsen page on consistency models: https://jepsen.io/consistency/models
It provides a dictionary of terms that we can use to have educated discussions, rather than throwing around terms like "ACID".
There is also this [1], which Aphyr collaborated on, and which you might find interesting if you haven't seen it yet.
[1] https://antithesis.com/resources/reliability_glossary/
Wow. I’ve used NATS for best-effort in-memory pub/sub, which it has been great for, including getting subtle scaling details right. I never touched their persistence and would have investigated more before I did, but I wouldn’t have expected it to be this bad. Vulnerability to simple single-bit file corruption is embarrassing.
https://jepsen.io/blog/2025-10-20-distsys-glossary
https://github.com/nats-io/nats-server/discussions/3312#disc...
(I opened this discussion 2.5 years ago and have gotten an email from GitHub every once in a while ever since. I had given up hope, TBH.)
Why? Why do some databases do that? To have better performance in benchmarks? It would be one thing if there were a safer default, or at least extensive documentation of the tradeoff. But especially when you run stuff in a small cluster, you get bitten by things like this.
It's not just better performance on latency benchmarks; it likely improves throughput as well, because the writes will be batched together.
Many applications do not require true durability and it is likely that many applications benefit from lazy fsync. Whether it should be the default is a lot more questionable though.
I always wondered why the fsync has to be lazy. It seems like the fsyncs could be bundled together, and the notification messages held for a few milliseconds while the write completes, similar to TCP corking. There doesn't need to be one fsync per consensus round.
The kind of failure that a system can tolerate with strict fsync but can't tolerate with lazy fsync (i.e. the software 'confirms' a write to its caller but then crashes) is probably not the kind of failure you'd expect to encounter on a majority of your nodes all at the same time.
Yes, exactly.
Curious about the differences between content on aphyr.com/tags/jepsen and jepsen.io/analyses. I recently discovered aphyr.com and was excited about the potential insights!
Highly recommend you check out the interview series; they are a lot of fun.
> They will refuse, of course, and ever so ashamed, cite a lack of culture fit. Alight upon your cloud-pine, and exit through the window. This place could never contain you.
https://aphyr.com/posts/340-reversing-the-technical-intervie...
Just use redpanda.
> > You can force an fsync after each messsage [sic] with always, this will slow down the throughput to a few hundred msg/s.
Is the performance warning in the NATS docs possible to improve on? Couldn't you still run fsync on an interval and queue up a certain number of writes to be flushed at once? I could imagine latency suffering, but batch throughput could be preserved to some extent?
Yes, and you shouldn't even need a fixed interval. Just queue up any writes while an `fsync` is pending; then do all those in the next batch. This is the same approach you'd use for rounds of Paxos, particularly between availability zones or regions where latency is expected to be high. You wouldn't say "oh, I'll ack and then put it in the next round of Paxos", or "I'll wait until the next round in 2 seconds then ack"; you'd start the next batch as soon as the current one is done.
Yes, this is a reasonably common strategy. It's how Cassandra's batch and group commit modes work, and Postgres has a similar option. Hopefully NATS will implement something similar eventually.
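A minimal sketch of that group-commit loop in Go (names invented for illustration): one goroutine owns the file, drains whatever queued up while the previous batch was syncing, and acks the whole batch after a single fsync.

    package main

    import "os"

    type writeReq struct {
        data []byte
        done chan error // each writer blocks on this until its batch is synced
    }

    // groupCommitter serializes writes: while one fsync is in flight, new
    // requests pile up in the channel; they are then written and synced
    // together, amortizing one fsync over the whole batch (cf. TCP corking).
    func groupCommitter(f *os.File, reqs chan writeReq) {
        for req := range reqs {
            batch := []writeReq{req}
        drain: // grab everything that arrived during the last sync
            for {
                select {
                case r := <-reqs:
                    batch = append(batch, r)
                default:
                    break drain
                }
            }
            var err error
            for _, r := range batch {
                if _, werr := f.Write(r.data); werr != nil {
                    err = werr
                    break
                }
            }
            if err == nil {
                err = f.Sync() // one fsync for the entire batch
            }
            for _, r := range batch {
                r.done <- err // only now do the writers get their ack
            }
        }
    }

    func main() {
        f, _ := os.CreateTemp("", "log")
        defer f.Close()
        reqs := make(chan writeReq, 1024)
        go groupCommitter(f, reqs)
        done := make(chan error, 1)
        reqs <- writeReq{data: []byte("hello"), done: done}
        if err := <-done; err == nil {
            println("durably committed")
        }
    }

Per-write latency is bounded by at most one in-flight fsync plus your own, and throughput grows with batch size, which is why starting the next batch as soon as the current one finishes tends to beat a fixed timer on both axes.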
NATS is a fantastic piece of software, but the docs are impractical and half-baked. It's a shame to have to reverse-engineer the software from GitHub to understand the auth schemes.
> By default, NATS only flushes data to disk every two minutes, but acknowledges operations immediately. This approach can lead to the loss of committed writes when several nodes experience a power failure, kernel crash, or hardware fault concurrently—or in rapid succession (#7564).
I am getting strong early MongoDB vibes. "Look how fast it is, it's web-scale!". Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
Coordinated failures shouldn't be a novelty or a surprise these days.
I wouldn't trust a product that doesn't default to safest options. It's fine to provide relaxed modes of consistency and durability but just don't make them default. Let the user configure those themselves.
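Concretely, opting in to strict durability in NATS looks something like this (a sketch based on the sync_interval setting quoted elsewhere in the thread; check the NATS docs for the exact syntax and supported values):

    jetstream {
        store_dir: "/var/lib/nats"

        # The default is interval-based fsync (per the report, every 2 minutes).
        # "always" fsyncs after every write: durable acks, much lower throughput.
        sync_interval: always
    }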
I don't think there is a modern database that has the safest options all turned on by default. For instance, the default transaction isolation level for PG is read committed, not serializable.
One of the most used databases in the world is Redis, and by default it fsyncs every second, not on every operation.
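For reference, that behavior is the appendfsync policy in redis.conf (it only matters once the append-only file is enabled):

    appendonly yes         # enable the append-only file (AOF)
    appendfsync everysec   # default policy: fsync the AOF about once per second
    # appendfsync always   # fsync on every write: safest, slowest
    # appendfsync no       # let the OS flush when it likes: fastest, least safe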
I don't know about Jetstream, but redis cluster would only ack writes after replicating to a majority of nodes. I think there is some config on standalone redis too where you can ack after fsync (which apparently still doesn't guarantee anything because of buffering in the OS).
In any case, understanding what the ack implies is important, and I'd be frustrated if jetstream docs were not clear on that.
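For what it's worth, the primitive Redis exposes for stronger acks is WAIT, which blocks the client until earlier writes have reached N replicas (replication, not fsync; Redis 7.2 added WAITAOF for local-fsync acknowledgement):

    redis> SET order:42 paid
    OK
    redis> WAIT 2 1000   # block until 2 replicas have the write, or 1000 ms pass
    (integer) 2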
> Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
The trouble is that you need to specifically optimize for fsyncs, because usually it is either no brakes or hand-brake.
The middle ground of multi-transaction group-commit fsync seems to have disappeared because of SSDs and the massive IOPS you can pull off in general; now it is about syscall context switches.
Two minutes is a bit too much, though (also, fdatasync vs. fsync).
Not flushing on every write is a very common tradeoff of speed over durability. Filesystems, databases, all kinds of systems do this. They have some hacks to prevent it from corrupting the entire dataset, but lost writes are accepted. You can often prevent this by enabling an option or tuning a parameter.
I like that, and it allows me to build things around it.
For us, when we used it back in 2018, it performed well and was easy to administer. The multi-language APIs were also good.
> I wouldn't trust a product that doesn't default to safest options
This would make most products suck, and require a crap-ton of manual fixes and tuning that most people would hate, if they even got the tuning right. You have to actually do some work yourself to make a system behave the way you require.
For example, Postgres' isolation level is weak by default, leading to race conditions. You have to explicitly enable serializable isolation to avoid them, which costs performance. (https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...)
Wait, isn't that the whole point of acknowledgments? This is not acknowledgment, it's "I'm a teapot."
NATS data is ephemeral in many cases anyhow, so it makes a bit more sense here. If you wanted something fully durable with a stronger persistence story, you'd probably use Kafka.
For example, https://github.com/williamstein/nats-bugs/issues/5 links to a discussion I have with them about data loss, where they fundamentally don't understand that their incorrect defaults lead to data loss on the application side. It's weird.
I got very deep into using NATS last year, and then realized the choices it makes for persistence are really surprising. Another horrible example is that server startup time is O(number of streams), with a big constant; this is extremely painful to hit in production.
I ended up implementing from scratch something with the same functionality (for my purposes) as NATS server + JetStream, but based on socket.io and sqlite. It works vastly better for my use cases, since socket.io and sqlite are so mature.
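Not the actual design described above, just a sketch of why sqlite makes the durability half easy: with these pragmas, an INSERT that returns without error has already been fsynced (the go-sqlite3 driver import is an assumption; any database/sql sqlite driver works):

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3" // assumed driver
    )

    func main() {
        db, err := sql.Open("sqlite3", "stream.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // WAL allows concurrent readers; synchronous=FULL fsyncs the WAL on
        // every commit, so a nil error from Exec means the row is on disk.
        stmts := []string{
            `PRAGMA journal_mode=WAL`,
            `PRAGMA synchronous=FULL`,
            `CREATE TABLE IF NOT EXISTS messages (
                seq     INTEGER PRIMARY KEY AUTOINCREMENT,
                subject TEXT NOT NULL,
                data    BLOB NOT NULL
            )`,
        }
        for _, s := range stmts {
            if _, err := db.Exec(s); err != nil {
                log.Fatal(err)
            }
        }

        // Only acknowledge to the client after this returns without error.
        if _, err := db.Exec(`INSERT INTO messages (subject, data) VALUES (?, ?)`,
            "orders.created", []byte("hello")); err != nil {
            log.Fatal(err)
        }
        log.Println("durably committed; safe to ack")
    }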
Pros: unlimited streams with the durability of object storage – JetStream can only do a few K topics.
Cons: no consumer groups yet, it's on the agenda.
I'm not seeing full self-hosting yet, and the "Book a call" link is an instant nope for many techies.
I understand that you need to make money. But you'll have to have a proper self-hosting offering with paid support as well before you're considered, at least by me.
I'm not looking to have even more stuff in the cloud.
When I worked with bounded Redis streams a couple of years ago, we had to implement our own backpressure mechanism, which was quite tricky to get right.
To implement backpressure without relying on out-of-band signals (distributed systems beware), you need a deep understanding of the entire Redis streams architecture and how the pending entries list, consumer groups, consumers, etc. work and interact, so you don't lose data by overwriting yourself.
Unbounded would have been fine if we could spill to disk and periodically clean up the data, but this is Redis.
Not sure if that has improved.
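For anyone unfamiliar with the moving parts being described, roughly (stream, group, and consumer names invented):

    # Bounded stream: XADD trims the oldest entries past ~100k whether or not
    # every consumer has processed them; that is the backpressure hazard.
    XADD jobs MAXLEN ~ 100000 * payload "..."
    XGROUP CREATE jobs workers $ MKSTREAM
    XREADGROUP GROUP workers w1 COUNT 10 BLOCK 5000 STREAMS jobs >
    XACK jobs workers 1526569495631-0   # unacked entries linger in the pending entries list (PEL)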