As someone with workloads that can benefit from these techniques, but limited resources to put them into practice, my working thesis has been:

* Use a multi-threaded Tokio runtime that's allocated a thread per core.
* Focus on application development, so that tasks are well scoped / skewed and don't _need_ stealing in the typical case.
* Over time, the smart people working on Tokio will apply research to minimize the cost of work-stealing that's not actually needed.
* At the limit, where long-lived tasks can be distributed across cores and all cores are busy, the performance will be near-optimal as compared with a true thread-per-core model.

What's your hot take? Are there fundamental optimizations to a modern thread-per-core architecture which seem _impossible_ to capture in a work-stealing architecture like Tokio's?

jandrewrogers|4 months ago

A core assumption underlying thread-per-core architecture is that you will be designing a custom I/O and execution scheduler that is purpose-built for your software and workload at a very granular level. Most expectations of large performance benefits follow from this assumption.

At some point, people started using thread-per-core style while delegating scheduling to a third-party runtime, which almost completely defeats the purpose. If you let Tokio et al. do that for you, you are leaving a lot of performance and scale on the table. Scheduling is an NP-hard problem; the point of solving it at compile time is that it is computationally intractable for generic code to create a good schedule at runtime unless it is a trivial case. We need schedulers to consistently make excellent decisions extremely efficiently. I think this point is often lost in discussions of thread-per-core. In the old days we didn’t have runtimes; it was just assumed you would be designing an exotic scheduler. The lack of discussion around this may have led people to believe it wasn’t a critical aspect.

The reality is that designing excellent workload-optimized I/O and execution schedulers is an esoteric, high-skill endeavor. It requires enormous amounts of patience and craft, and it doesn’t lend itself to quick-and-dirty prototypes. If you aren’t willing to spend months designing the many touch points for the scheduler throughout your software, the algorithms for how events across those touch points interact, and analyzing the scheduler at a systems level for equilibria and boundary conditions, then thread-per-core might not be worth the effort.

That said, it isn’t rocket science to design a reasonable schedule for software that is e.g. just taking data off the wire and doing something with it. Most systems are not nearly as complex as e.g. a full-featured database kernel.
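
To make the simple case concrete, here is a minimal sketch of a static schedule for a "take data off the wire and do something with it" workload, using only the Rust standard library. It is an illustration under assumptions, not anyone's production design: the names `shard_for` and `count_sharded` are hypothetical, the "work" is just counting events per key, and the threads are not pinned to cores (std has no portable API for that). The point is only that each key is owned by exactly one worker, so there is no shared state and nothing to steal.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc;
use std::thread;

/// Map a key to one of `shards` workers (hypothetical helper).
/// Within a process, the same key always maps to the same worker.
fn shard_for(key: &str, shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % shards as u64) as usize
}

/// Count events per key with one worker thread per shard (hypothetical workload).
/// Each worker owns its own map and its own queue; no locks, no work stealing.
fn count_sharded(events: &[&str], shards: usize) -> HashMap<String, u64> {
    assert!(shards > 0, "need at least one shard");

    let mut senders = Vec::with_capacity(shards);
    let mut workers = Vec::with_capacity(shards);
    for _ in 0..shards {
        let (tx, rx) = mpsc::channel::<String>();
        senders.push(tx);
        workers.push(thread::spawn(move || {
            // State private to this worker; only this thread ever touches it.
            let mut local: HashMap<String, u64> = HashMap::new();
            // The loop ends once the sending side of the channel is dropped.
            for key in rx {
                *local.entry(key).or_insert(0) += 1;
            }
            local
        }));
    }

    // The "schedule" is fixed up front: route each event by key.
    for &event in events {
        senders[shard_for(event, shards)]
            .send(event.to_string())
            .expect("worker exited early");
    }
    drop(senders); // close all queues so the workers can finish

    // Shards hold disjoint key sets, so merging is a plain insert.
    let mut merged = HashMap::new();
    for worker in workers {
        for (key, count) in worker.join().expect("worker panicked") {
            merged.insert(key, count);
        }
    }
    merged
}
```

In a real deployment the shard count would typically come from something like `std::thread::available_parallelism()`, and each worker would be pinned to a core through a platform-specific call or a crate, neither of which is shown here. A skewed key distribution would overload one worker, which is exactly the kind of workload-specific decision the comment above argues a generic runtime cannot make for you.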