
Jepsen: TigerBeetle 0.16.11

241 points | aphyr | 8 months ago | jepsen.io

86 comments



nindalf|8 months ago

Very impressed with this report. Whenever I read TigerBeetle's claims on reliability and scalability, I'd think "ok, let's wait for the Jepsen report".

This report found a number of issues, which might be a cause for concern. But I think it's a positive: they didn't just fix the issues, they also expanded their internal test suite to catch similar bugs in future. With such an approach to engineering, I feel like in 10 years TigerBeetle will have achieved the "just use Postgres" level of default database in its niche of financial applications.

Also great work aphyr! I feel like I learned a lot reading this report.

jorangreef|8 months ago

Thanks!

Yes, we have around 6,000 assertions in TigerBeetle. A few of these were overtight, hence some of the crashes. But those were the assertions doing their job, alerting us that we needed to adjust our mental model, which we did.

Otherwise, apart from a small correctness bug in an internal testing feature we added (only in our Java client, and only for Jepsen to facilitate the audit), there was only one correctness bug found by Jepsen, and it didn’t affect durability. We’ve written about it here: https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...

Finally, to be fair, TigerBeetle can survive (and is tested to survive) more faults than Postgres can, since it was designed with an explicit storage fault model and using research that was not available when Postgres was released in ‘96. TB’s fault models are further tested with Deterministic Simulation Testing, and we use techniques such as static memory allocation, following NASA’s Power of Ten Rules for Safety-Critical Code. There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.

For more on this, see the section in Kyle’s report on helical fault injection (most Raft and Paxos implementations were not designed to survive this) as well as a talk we gave at QCon London: https://m.youtube.com/watch?v=_jfOk4L7CiY

SOLAR_FIELDS|8 months ago

I always get excited to read Kyle’s write ups. I feel like I level up my distributed systems knowledge every time he puts something out.

jitl|8 months ago

Really happy to see TigerBeetle live up to its claims as verified by aphyr - because it's good to see that when you take the right approach, you get the right results.

Question about how people end up using TigerBeetle. There's presumably a lot of external systems and other databases around a TigerBeetle install for everything that isn't an Account or Transfer. What's the typical pattern for those less reliable systems to square up to TigerBeetle, especially to recover from consistency issues between the two?

jorangreef|8 months ago

Joran from TigerBeetle here! Thanks! Really happy to see the report published too.

The typical pattern in integrating TigerBeetle is to differentiate between control plane (Postgres for general purpose or OLGP) and data plane (TigerBeetle for transaction processing or OLTP).

All your users (names, addresses, passwords etc.) and products (descriptions, prices etc.) then go into OLGP as your "filing cabinet".

And then all the Black Friday transactions these users (or entities) make, to move products from inventory accounts to shopping cart accounts, and from there to checkout and delivery accounts: all of these go into OLTP as your "bank vault". TigerBeetle lets you store up to 3 user data identifiers per account or transfer to link events (between entities) back to your OLGP database, which describes these entities.

This architecture [1] gives you a clean "separation of concerns", allowing you to scale and manage the different workloads independently. For example, if you're a bank, it's probably a good idea not to keep all your cash in the filing cabinet with the customer records, but rather to keep the cash in the bank vault, since the information has different performance/compliance/retention characteristics.

This pattern makes sense because users change their name or email address (OLGP) far less frequently than they transact (OLTP).

Finally, to preserve consistency on the write path, you treat TigerBeetle, the OLTP data plane, as your "system of record". When a "move to shopping cart" or "checkout" transaction comes in, you first write any data dependencies to OLGP (and, say, S3 if you have related blob data), and only then commit the transaction by writing to TigerBeetle. On the read path, you query your system of record first, preserving strict serializability.
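A toy, in-memory sketch of that write path, with dicts standing in for Postgres, S3, and TigerBeetle (all names here are invented for illustration; they are not real client APIs):

```python
olgp = {}   # stands in for Postgres, the OLGP "filing cabinet"
blobs = {}  # stands in for S3 (related blob data)
oltp = []   # stands in for TigerBeetle, the OLTP "bank vault"

def checkout(order_id, cart_account, checkout_account, amount, customer):
    # 1. Write the data dependencies to OLGP first (the order row).
    olgp[order_id] = {"customer": customer, "amount": amount}
    # 2. Write any related blob data (e.g. a receipt) to object storage.
    blobs[order_id] = f"receipt for order {order_id}"
    # 3. Only then commit by writing the transfer to the system of record.
    #    A user data identifier links the transfer back to the OLGP row.
    oltp.append({
        "debit_account": cart_account,
        "credit_account": checkout_account,
        "amount": amount,
        "user_data": order_id,
    })

checkout(1001, "cart:42", "checkout:7", 2500, "alice")
```

The ordering matters: if the process dies before the final TigerBeetle write, no transaction was committed, and the OLGP/S3 rows are harmless orphans.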

Does that make sense? Let me know if there's anything here we can drill into further!

[1] https://docs.tigerbeetle.com/coding/system-architecture/

DetroitThrow|8 months ago

This is a particularly fun Jepsen report after reading their fuzzer blind spots post.

It looks like the segfaults on the JNI side would not have been prevented even if Rust or some other memory-safe language were being used. The lack of memory safety bugs is decent evidence that TigerBeetle's approach to Zig programming (TigerStyle, iirc, lol) does what it sets out to do.

matklad|8 months ago

See https://news.ycombinator.com/item?id=44201189. We did have one bug where Rust would've saved our bacon (instead, the bacon was saved by an assertion, so it was just slightly crispy, not charred).

EDIT: But, yeah, totally, if not for TigerStyle, we'd die to nasal demons!

FlyingSnake|8 months ago

Love the wonderfully detailed report. Getting it tested and signed off by Jepsen is such a huge endorsement for TigerBeetle. It’s not even reached v1.0, and I can’t wait to see it hit new milestones in the future.

Special kudos to the founders who are sharing great insights in this thread.

jorangreef|8 months ago

Yes, Kyle did an incredible job and I also love the detail he put into the report. I kept saying to myself: “this is like a work of art”, the craftsmanship and precision.

Appreciate your kind words too, and look forward also to sharing something new in our talks at SD25 in Amsterdam soon!

12_throw_away|8 months ago

A small appreciation for the section entitled "Panic! At the Disk 0": <golf clap>

ryeats|8 months ago

I think it is interesting, but obvious in hindsight, that the distributed system under test needs to report the time/order in which things actually happened, rather than relying on wall-clock time, to enable accurate validation against an external model of the system.

matklad|8 months ago

Note that this works because we have strict serializability. With weaker consistency guarantees, there isn't necessarily a single global consistent timeline.

This is an interesting meta pattern where doing something _harder_ actually simplifies the system.
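A toy illustration of why the single timeline helps (an invented example, not Jepsen's actual checker): under strict serializability, the auditor can simply replay operations in the order the system reports them and compare each read against one sequential model.

```python
def check_history(history):
    """history: list of (op, key, value, result) in system-reported order."""
    model = {}
    for op, key, value, result in history:
        if op == "write":
            model[key] = value
        elif op == "read" and result != model.get(key):
            return False  # a read disagrees with the single global timeline
    return True

ok = check_history([
    ("write", "a", 1, None),
    ("read", "a", None, 1),
    ("write", "a", 2, None),
    ("read", "a", None, 2),
])
```

With weaker consistency, the checker would instead have to search over many permitted orderings, which is far more expensive.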

Another example is that, because we assume that the disk can fail and need to include a repair protocol, we get state synchronization for a lagging replica "for free", because it is precisely the same situation as when the entire disk gets corrupted!

cmrdporcupine|8 months ago

The articles link to the paper about "Viewstamped Replication" is unfortunately broken (https://pmg.csail.mit.edu/papers/vr-revisited.pdf connection refused).

I think it should be http://pmg.csail.mit.edu/papers/vr-revisited.pdf (http scheme not https) ?

And now I have some Friday evening reading material.

jorangreef|8 months ago

It should be fixed soon!

The VSR 2012 paper is one of my favorites as is “Protocol-Aware Recovery for Consensus-Based Storage”, which is so powerful.

Hope you enjoy the read!

eevmanu|8 months ago

I have a question that I hope is not misinterpreted, as I'm asking purely out of a desire to learn. I am new to distributed systems and fascinated by deterministic simulation testing.

After reading the Jepsen report on TigerBeetle, the related blog post, and briefly reviewing the Antithesis integration code in the GitHub workflow, I'm trying to better understand the testing scope.

My core question is: could these bugs detected by the Jepsen test suite have also been found by the Antithesis integration?

This question comes from a few assumptions I made, which may be incorrect:

- I thought TigerBeetle was already comprehensively tested by its internal test suite and the Antithesis product.

- I had the impression that the Antithesis test suite was more robust than Jepsen's, so I was surprised that Jepsen found an issue that Antithesis apparently did not.

I'm wondering if my understanding is flawed. For instance:

1. Was the Antithesis test suite not fully capable of detecting this specific class of bug?

2. Was this particular part of the system not yet covered by the Antithesis tests?

3. Am I fundamentally comparing apples and oranges, misunderstanding the different strengths and goals of the Jepsen and Antithesis testing suites?

I would greatly appreciate any insights that could help me understand this better. I want to be clear that my goal is to educate myself on these topics, not to make incorrect assumptions or assign responsibility.

aphyr|8 months ago

Yeah, TigerBeetle's blog post goes into more detail here, but in short, the tests that were running in Antithesis (which were remarkably thorough) didn't happen to generate the precise combination of intersecting queries and out-of-order values that were necessary to find the index bug, whereas the Jepsen generator did hit that combination.

There are almost certainly blind spots in the Jepsen test generators too--that's part of why designing different generators is so helpful!

matklad|8 months ago

To add to what aphyr says, you generally need three components for generative testing of distributed systems:

1. Some sort of environment, which can run the system. The simplest environment is to spin up a real cluster of machines, but ideally you want something fancier, to improve performance, control over responses of external APIs, determinism, reproducibility, etc.

2. Some sort of load generator, which makes the system in the environment do interesting things.

3. Some sort of auditor, which observes the behavior of the system under load and decides whether the system behaves according to the specification.
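A toy version of the three components (all names invented): an environment running the system under test, a load generator, and an auditor that replays the workload against a reference model.

```python
import random

def environment():
    """The 'system under test': a counter with a deliberately injected bug."""
    state = {"n": 0}
    def step(op):
        if op == "inc":
            state["n"] += 1
        else:  # "dec"
            state["n"] -= 2  # injected bug: should decrement by 1
        return state["n"]
    return step

def load_generator(rng, length):
    """Makes the system do interesting things: here, a random workload."""
    return [rng.choice(["inc", "dec"]) for _ in range(length)]

def auditor(ops, observed):
    """Checks observed behavior against the specification (a model counter)."""
    model = 0
    for op, seen in zip(ops, observed):
        model += 1 if op == "inc" else -1
        if seen != model:
            return f"divergence on {op!r}: model={model}, system={seen}"
    return None

rng = random.Random(0)
step = environment()
ops = load_generator(rng, 20)
observed = [step(op) for op in ops]
# auditor(ops, observed) reports the first divergence, if the generator
# happened to exercise the buggy operation at all.
```

The bug is only found if the generator happens to emit a "dec", which is exactly the blind-spot problem being discussed: the auditor can be correct and still miss bugs the generator never triggers.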

Antithesis mostly tackles problem #1, providing a deterministic simulation environment as a virtual machine. The same problem is tackled by Jepsen (by using real machines, but injecting faults at the OS level), and by TigerBeetle's own VOPR (which is co-designed with the database, and for that reason can run the whole cluster on just a single thread). These three approaches are complementary and are good at different things.

For this bug, the critical parts were #2 and #3: writing a workload generator and auditor that can actually trigger the bug. Here, it was aphyr's 1,600 lines of TigerBeetle-specific Clojure code that triggered and detected the bug (and then we patched _our_ equivalent to also trigger it). Really, what's buggy here is not the database, but the VOPR. The database having bugs is par for the course; you can't avoid bugs through sheer force of will. So you need a testing strategy that can trigger most bugs, and any bug that slips through points to a deficiency in the workload generator.

jorangreef|8 months ago

(Note also that 90% of our deterministic simulation testing is done by the VOPR, TigerBeetle's own deterministic simulator, which we built in-house and which runs on a fleet of 1,000 dedicated CPU cores 24/7. We also use Antithesis, but as a second layer of DST.)

To understand why the query engine bug slipped through, see: https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...

koakuma-chan|8 months ago

Curious if they got any large bank or stock exchange to use TigerBeetle.

jorangreef|8 months ago

Joran, creator and CEO from TigerBeetle here!

At a national level, we’re working with the Gates Foundation to integrate TigerBeetle into their non-profit central bank switch that will be powering Rwanda’s National Digital Payments System 2.0 later this year [1].

At an enterprise level, TigerBeetle already powers customers processing 100M+ transactions per month in production, and we recently signed our first $2B fintech unicorn in Europe with a few more in the US about to close. Because of the move to realtime transaction processing around the world [2] there’s been quite a bit of interest from companies wanting to move to TigerBeetle for more performance.

Finally, to your question: some of the founders of Clear Street, a fairly large brokerage on Wall Street, have since invested [3] in TigerBeetle.

[1] https://mojaloop.io/how-mojaloop-enables-rndps-2-0-ekash/

[2] https://tigerbeetle.com/blog/2024-07-23-rediscovering-transa...

[3] https://tigerbeetle.com/company

SOLAR_FIELDS|8 months ago

Not a bank or exchange but I work for a very large fintech and we are using it on our newer products.

nindalf|8 months ago

I think if they had, they'd brag about it on their homepage. So far the biggest endorsement from there is from some YouTuber. A popular YouTuber, no doubt, but a YouTuber nevertheless.

ManBeardPc|8 months ago

TigerBeetle is something I’m interested in. I see there is no C or Zig client listed in the clients documentation. I thought these would be the first ones to exist, given it is written in Zig. Do they exist, or are they maybe still WIP?

rbatiati|8 months ago

Hi there! We do have a C client, the libtb_client. It's written in Zig and exposes an FFI interface that is consumed by all other TigerBeetle clients.

We don't publish the pre-compiled libs yet, but it will be available soon when we stabilize the API. For now, it can be built locally from source.

Example of using the C client: https://github.com/tigerbeetle/tigerbeetle/blob/main/src/cli...

andyferris|8 months ago

I found the line about TigerBeetle's model assuming entire-disk-sector errors but not bit/byte errors rather interesting. As someone who has created error-correcting codes, this seems out of line with my understanding. The only situation in which I can see it working is where the disk or driver encodes and decodes the sectors... and (on any disk/driver I would care to store an important transactional database on) it would be reporting tonnes of (possibly corrected) faults before TigerBeetle was even aware.

Or possibly my mental model of how physical disks and the driver stack behave these days is outdated.

matklad|8 months ago

Just to clarify, our _model_ totally assumes bit/byte errors! It's just that our fuzzer was buggy and wasn't actually exercising those faults!

wiradikusuma|8 months ago

If memory serves, TigerBeetle is/was not free for production? I can't find the Pricing page, but I kinda remember reading about it somewhere (or it was implied) a while back.

jorangreef|8 months ago

The DBMS is Apache 2.0, and our customers pay us (well) for everything else: to run, integrate, migrate, operate, and support it.

For more on our open source thinking and how this is orthogonal to business model (and product!), see our interview with the Changelog: https://m.youtube.com/watch?v=Yr8Y2EYnxJs

Ygg2|8 months ago

TigerBeetle is impressive, but it's a single-purpose DB. Unless you fit within the account ledger model, it's extremely restrictive.

SOLAR_FIELDS|8 months ago

That is 100% correct. You use TigerBeetle when you need a really good double-entry accounting system that is open source. You wouldn't use it for much else. Which makes it great software: it's purpose-made to solve one problem really well.
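A toy sketch of the double-entry idea (invented example, not TigerBeetle's API): every transfer debits one account and credits another by the same amount, so the total across the ledger is invariant.

```python
balances = {"inventory": 100, "cart": 0}

def transfer(debit_account, credit_account, amount):
    # Double entry: the same amount leaves one account and enters another.
    balances[debit_account] -= amount
    balances[credit_account] += amount

transfer("inventory", "cart", 30)
# Money moved between accounts; none was created or destroyed.
```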

saaaaaam|8 months ago

That's a slightly redundant criticism though - it doesn't present itself as anything other than a single-purpose database designed for financial transactions.

That's like saying that rice noodles are no good for making risotto. At the core they are both rice...

jorangreef|8 months ago

Joran from TigerBeetle here!

Yes, TigerBeetle specializes only for transaction processing (OLTP). It’s not a general-purpose (OLGP) DBMS.

That said, we have customers from energy to gaming, and of course fintech.

UltraSane|8 months ago

I find drills pretty useless to drive nails.