

tomasol | 10 months ago

The biggest motivator for me is that the WASM sandbox provides true deterministic execution. Unlike engines such as Temporal, using hashmaps is 100% deterministic here. Attempting to spawn a thread is a compile error. It also performs well - the bottleneck is the write throughput of SQLite. Last but not least - all the interfaces between workflows and activities are type-safe, described in a WIT schema.
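To illustrate the hashmap point: in native Rust the default `HashMap` hasher (`RandomState`) is seeded randomly per process, so iteration order can differ between runs. A minimal sketch, assuming the only source of variation is that seed, shows the order becoming reproducible once the hasher state is fixed. The `FixedSeedMap` alias and `iteration_order` helper are hypothetical names for illustration; this is not a description of how Obelisk or wasmtime supply the seed inside the sandbox.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

/// A HashMap whose hasher always starts from the same fixed state,
/// unlike the default `RandomState`, which is seeded randomly per process.
/// (Hypothetical alias, for illustration only.)
type FixedSeedMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

/// Insert the keys in the given order and return the map's iteration order.
fn iteration_order(keys: &[&str]) -> Vec<String> {
    let mut map: FixedSeedMap<String, usize> = FixedSeedMap::default();
    for (i, k) in keys.iter().enumerate() {
        map.insert(k.to_string(), i);
    }
    // With a fixed hasher state and identical inserts, this order is the
    // same on every call within the same build of the program.
    map.keys().cloned().collect()
}
```

Calling `iteration_order(&["a", "b", "c", "d"])` twice yields the same sequence both times. The order is still arbitrary, and the algorithm behind `DefaultHasher` is unspecified and may change between Rust releases, so the guarantee holds per build, not across recompilation.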


AlotOfReading|10 months ago

WASM isn't quite deterministic. An easy example is NaN propagation, which can be nondeterministic in certain circumstances. Obelisk itself seems to allow nondeterminism via the sleep() function. Just create a race condition among a join set. I imagine that might even get easier once the TODO to implement sleep jitter is completed.

It's certainly close enough that calling it deterministic isn't misleading (though I'd stop short of "true determinism"), but there are still sharp edges here with things like hashmaps (e.g. by recompiling: https://dev.to/gnunicorn/hunting-down-a-non-determinism-bug-...).

tomasol|10 months ago

Thanks for bringing that up. Regarding the NaN canonicalization, there is a flag for it in wasmtime [1], I should probably make sure it is turned on.

Although I don't expect it to be an issue in practice, Obelisk checks that the replay is deterministic and fails the workflow when an unexpected event is triggered. It should also be possible to add an automatic replay of each finished execution to verify determinism, e.g. while testing.
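A rough sketch of what such a replay check could look like, assuming the execution history is a flat list of events that is compared position by position. The `Event` and `ReplayError` types and the `verify_replay` function are hypothetical and are not Obelisk's actual data model.

```rust
/// One recorded step of a workflow execution (hypothetical event type).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    ActivityScheduled(String),
    ActivityCompleted(String),
    Finished,
}

/// Reasons a replay can diverge from the recorded history.
#[derive(Debug, PartialEq)]
enum ReplayError {
    /// The replay produced a different event at `index`.
    Mismatch { index: usize, expected: Event, actual: Event },
    /// The replay produced more or fewer events than were recorded.
    LengthMismatch { recorded: usize, replayed: usize },
}

/// Compare a replayed event sequence against the recorded one and
/// report the first divergence, if any.
fn verify_replay(recorded: &[Event], replayed: &[Event]) -> Result<(), ReplayError> {
    for (index, (expected, actual)) in recorded.iter().zip(replayed).enumerate() {
        if expected != actual {
            return Err(ReplayError::Mismatch {
                index,
                expected: expected.clone(),
                actual: actual.clone(),
            });
        }
    }
    if recorded.len() != replayed.len() {
        return Err(ReplayError::LengthMismatch {
            recorded: recorded.len(),
            replayed: replayed.len(),
        });
    }
    Ok(())
}
```

In a test suite, every finished execution could be re-run against its own history and passed through a check like this, so a nondeterministic workflow shows up as a `Mismatch` at a specific index and not as a silent divergence.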

[1] https://docs.rs/wasmtime/latest/wasmtime/struct.Config.html#...

Edit: Enabling the flags here: https://github.com/obeli-sk/obelisk/pull/67
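For readers wondering what NaN canonicalization means in practice: many different bit patterns all count as NaN, and which one an operation produces can vary. Canonicalization maps all of them to one fixed pattern. The sketch below is plain Rust, not wasmtime's implementation; the `canonicalize` helper and the chosen constant are assumptions for illustration.

```rust
/// A fixed quiet-NaN bit pattern for f64: sign 0, quiet bit set, payload 0.
/// (Chosen here for illustration.)
const CANONICAL_NAN_BITS: u64 = 0x7ff8_0000_0000_0000;

/// Replace any NaN with the single fixed pattern; leave other values alone.
fn canonicalize(x: f64) -> f64 {
    if x.is_nan() {
        f64::from_bits(CANONICAL_NAN_BITS)
    } else {
        x
    }
}
```

Two NaNs such as `f64::from_bits(0x7ff8_0000_0000_0001)` and `f64::from_bits(0xfff8_0000_0000_0000)` differ in their raw bits, so code that inspects or stores those bits can observe a difference. After `canonicalize`, both have identical bits.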

tomasol|10 months ago

> Just create a race condition among a join set.

All responses and completed delays are stored in a table with an auto-incremented id, so the `-await-next` will always resolve to the same value.

As you mention, putting a persistent sleep and a child execution into the same join set is not yet implemented.
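A simplified model of the ordering argument above, assuming responses are appended to a log with a monotonically increasing id and that awaiting the next response just reads the log in id order. `Response`, `ResponseLog`, and its `await_next` method are hypothetical stand-ins, not Obelisk's schema or API.

```rust
/// A stored response row; `id` mimics an auto-incremented primary key.
#[derive(Debug, Clone, PartialEq)]
struct Response {
    id: u64,
    child: String,
    value: i32,
}

/// Append-only log of responses, kept in insertion (id) order.
struct ResponseLog {
    next_id: u64,
    rows: Vec<Response>,
}

impl ResponseLog {
    fn new() -> Self {
        ResponseLog { next_id: 1, rows: Vec::new() }
    }

    /// Persist a response in arrival order and return its assigned id.
    fn append(&mut self, child: &str, value: i32) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.rows.push(Response { id, child: child.to_string(), value });
        id
    }

    /// Return the response at position `cursor` in id order. A replay that
    /// starts from cursor 0 sees the same sequence as the original run.
    fn await_next(&self, cursor: usize) -> Option<&Response> {
        self.rows.get(cursor)
    }
}
```

The race between children is settled once, at the moment the rows are written. If `child-b` happens to finish before `child-a`, it gets the lower id, and every later replay reads `child-b` first regardless of timing.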

genuine_smiles|10 months ago

> An easy example is NaN propagation, which can be nondeterministic in certain circumstances.

Which circumstances?

jcmfernandes|10 months ago

Somewhat similar to Golem - https://github.com/golemcloud/golem - correct?

So, I like this idea, I really do. At the same time, in the short term, WASM is relatively messy and, in my opinion, immature (as an ecosystem) for prime time. But with that out of the way (it will eventually come), you'll have to tell people that they can't use any code that relies on threads, so they had better know whether any of the libraries they use does. How do you foresee navigating this? Runtime errors suck, especially in this context, as fixing them requires either live patching code or migrating execution logs to new code versions.

tomasol|10 months ago

Yeah, looks like Golem went a similar route - using the WASM Component Model and wasmtime.

There is always this chicken and egg problem on a new platform, but I am hoping that LLMs can solve it partially - the activities are just HTTP clients with no complex logic.

Regarding the restrictions required for determinism, they only apply to workflows, not activities. Workflows should describe just the business logic. All the complexities of retries, failure recovery, replay after a server crash etc. are handled by the runtime. The WASM sandbox makes it impossible to introduce non-determinism - it would cause a compile error, so there is no need for runtime checks.