top | item 39680579

Deterministic simulation testing for our entire SaaS

203 points | wwilson | 2 years ago | warpstream.com | reply

26 comments

[+] necubi|2 years ago|reply
This is so, so cool. Basically the holy grail as a distributed systems engineer. Like the author, I've also avidly consumed every Jepsen report but the effort of actually implementing Jepsen tests for my systems always seemed too high.

Very excited to see this technology democratized and made available to more companies!

[+] mtremsal|2 years ago|reply
This is quickly becoming my favorite technical blog. Congrats Richie and Ryan. I didn't fully understand Antithesis the first time I ran into it; now it makes sense.
[+] zellyn|2 years ago|reply
Hey WarpStream folks… does your blog have an atom/rss feed?
[+] Fomite|2 years ago|reply
Question from another field that does a lot of simulation - why is deterministic simulation testing, rather than something stochastic, considered the gold standard?
[+] _dain_|2 years ago|reply
[ I work at Antithesis ]

Concurrent/distributed system bugs can be really finicky because they may depend on subtle timing conditions to manifest. So you might see a bug once, then try to re-run the test using the "same" inputs, and the bug doesn't appear a second time. This might be because e.g. threads aren't scheduled the same way as before, so some 1-microsecond-wide window of vulnerability for a race condition was missed. If you can't reliably reproduce the bug, it's much harder to study and fix.

Determinism lets you perfectly reproduce the bug as many times as you want. Perfectly as in, exactly the same thread+process scheduling, exact same memory and disk access times, exact same network packet transit times and orderings ... exact same everything. Then once you have returned to the bug, you can rewind time, to do things like explore counterfactual scenarios by varying the random seed from that moment on.

We do have randomness of course, otherwise it wouldn't be a very good fuzzer. But we save all the seeds, so it's a controlled, reproducible randomness.
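A minimal sketch of what "controlled, reproducible randomness" can look like (all names here are hypothetical, not Antithesis's actual implementation): every nondeterministic choice in the simulation is drawn from a single seeded PRNG, so replaying with the same seed replays the exact same trace, and varying the seed from some point on explores counterfactual schedules.

```python
import random

def simulate(seed: int) -> list[str]:
    """Run one simulated execution. Every 'random' event (scheduling,
    network delay) is drawn from a PRNG seeded once, so the whole
    trace is a pure function of the seed."""
    rng = random.Random(seed)
    events = []
    for step in range(5):
        # Model sources of nondeterminism as PRNG draws.
        thread = rng.choice(["A", "B"])
        delay_us = rng.randint(1, 100)
        events.append(f"step {step}: thread {thread} after {delay_us}us")
    return events

# Same seed -> byte-identical trace: a failing seed can be replayed
# as many times as needed while studying the bug.
assert simulate(42) == simulate(42)
# Changing the seed (from the start, or from a saved midpoint state)
# explores alternative interleavings of the same workload.
```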

[+] cpgxiii|2 years ago|reply
From yet another field where deterministic simulation is often a goal (robotics), the ideal is a simulation test system that is deterministic for a given initialization (e.g. a random seed) so that for an initialization that causes some error to occur, you can reliably reproduce and resolve the error. Of course, you then need to run that system with a range of initializations to have confidence that you didn't just get lucky with the initialization.

In practice, this can be quite hard to do in the presence of uncontrolled non-determinism (e.g. thread/process/GPU scheduling)* and it is often more pragmatic to invest the time in better stochastic testing and logging than deterministic reproduction.

* Yes, these can be made closer to deterministic. But doing so often comes with reduced performance, such that the system you are testing would no longer match the system being deployed, defeating much of the purpose of the test in the first place.
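The workflow described above - deterministic per seed, then swept across many initializations for confidence - can be sketched like this (the system under test here is a toy stand-in, and all names are hypothetical):

```python
import random

def deterministic_run(seed: int) -> bool:
    """Hypothetical system-under-test: fully deterministic for a fixed
    seed. Returns True on success, False if an injected fault exposes
    a bug on this particular initialization."""
    rng = random.Random(seed)
    # Stand-in for a full simulated execution with fault injection;
    # roughly 1% of seeds hit the "bug" in this toy model.
    return rng.random() < 0.99

def sweep(num_seeds: int) -> list[int]:
    """Sweep a range of initializations and collect the seeds that
    fail, so each failure can be replayed exactly later."""
    return [s for s in range(num_seeds) if not deterministic_run(s)]

# Any failing seed reproduces the identical execution on every replay,
# which is the property that makes the failures debuggable.
for s in sweep(1000):
    assert deterministic_run(s) == deterministic_run(s)
```

The sweep is what guards against "getting lucky" with a single initialization: confidence comes from coverage over seeds, while determinism per seed keeps each discovered failure reproducible.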

[+] figassis|2 years ago|reply
> Antithesis has created the holy grail for testing distributed systems: a bespoke hypervisor that deterministically simulates an entire set of Docker containers and injects faults, created by the same people who made FoundationDB.

I remember the Antithesis founder was having a hard time explaining what exactly they did.

[+] mamidon|2 years ago|reply
I remember that too, the ambiguity for me was how their fuzzing could explore an arbitrary state space efficiently enough.

The deterministic hypervisor is 'simple' enough albeit a pretty heavy engineering lift.

[+] Klaster_1|2 years ago|reply
This article and previous Antithesis ones mention testing distributed systems and, as someone who works at a company specialized in exactly this, I am excited. However, I wonder if Antithesis could help with nondeterministic failures observed in unit and integration tests I encounter in my Jasmine and TestCafe suites. Most of the time, these are quite hard to reproduce - if at all possible - and a significant portion of failures is caused by genuine application bugs. I wish there was a tool that helped with these.
[+] fuzzy_biscuit|2 years ago|reply
Slightly tangential, but when I went to go look at pricing information on mobile, the rates were clipped/overflowed out of bounds.
[+] oldstrangers|2 years ago|reply
Whoops, what are your device details? I'll take a look!
[+] profstasiak|2 years ago|reply
hopefully by the year 2300 we'll have a good way to test landing pages
[+] winrid|2 years ago|reply
BTW, why base pricing on r4 instances vs something more cost effective like r5?
[+] richieartoul|2 years ago|reply
I think I just followed the official recommendations I found (which are probably stale now). I'll update it to r5, but it doesn't really matter. The price difference between the two is like 5%, but hardware only ends up representing a tiny fraction of Kafka's cost at scale (the real cost comes from EBS and inter-zone networking).

I could make the hardware free for Kafka in the comparison, and WarpStream would still come out significantly more cost effective. Cloud networking is really expensive.

[+] wolframhempel|2 years ago|reply
I've bookmarked it, just because the site is so pretty.