top | item 47147679

(no title)

foxmoss | 5 days ago

This metaphor totally gets muddied once you consider that some of the most optimized programs run on a single thread in an event loop. Communication between threads is expensive; epolling many IO streams is less so. Not quite sure what implications this has in life, but you could probably ascribe some wisdom to it.
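The single-thread-plus-event-loop pattern can be sketched in a few lines; this is a minimal illustration (not anyone's production code) using Python's selectors module, which uses epoll on Linux. A socketpair stands in for two independent IO streams:

```python
# One thread multiplexes many streams via a single selector (epoll on
# Linux) instead of dedicating a thread per connection.
import selectors
import socket

sel = selectors.DefaultSelector()

# Two connected socket pairs stand in for independent IO streams.
a1, b1 = socket.socketpair()
a2, b2 = socket.socketpair()
for s in (b1, b2):
    s.setblocking(False)
    sel.register(s, selectors.EVENT_READ)

a1.send(b"hello")
a2.send(b"world")

received = []
while len(received) < 2:
    # A single poll call watches every registered stream at once.
    for key, _ in sel.select(timeout=1):
        received.append(key.fileobj.recv(1024))

print(sorted(received))  # [b'hello', b'world']
```

The point of the sketch: adding a third, or a thousandth, stream is one more `register()` call, with no new thread and no cross-thread communication.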



jillesvangurp | 5 days ago

I have a 16 core M4 Max and running at a fraction of the potential maximum speed just isn't very optimal on modern CPUs like that.

Threading is hard, especially when threads share a lot of state. Memory management with multiple threads sharing data is hard and ideally minimized. What is optimal very much depends on the type of workload as well: not all workloads are IO dependent, or require sharing a lot of state.

Using threads for blocking IO on server requests was popular 20 years ago, e.g. in Java. But these days non-blocking IO is preferred in both single- and multi-threaded systems. E.g. Elasticsearch uses threading and non-blocking IO across CPU cores and cluster nodes to provide horizontal scalability for indexing. It tends to stick to just one indexing thread per CPU core, of course, but it has additional thread pools and generally more threads than CPU cores in total.

A lot of workloads where the CPU is the bottleneck but that have some IO benefit from threading, by letting other threads progress while one is waiting for IO. And if the amount of context switching can be limited, that can be OK. For loads that are embarrassingly parallel with little or no IO and very limited context sharing, one thread per CPU core tends to be optimal. It's really when you have more threads than cores that context switching becomes a factor. What's optimal there depends very much on how much shared state there is and whether you are IO or CPU limited.
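The one-worker-per-core rule for embarrassingly parallel work can be sketched like this; it's an illustrative toy, not a benchmark. (In CPython the GIL means truly CPU-bound work wants a process pool rather than threads, but the pool-sizing idea is the same either way.)

```python
# Size the worker pool to the core count so workers rarely have to
# context-switch with each other, and give each a share-nothing chunk.
import os
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    # Stand-in for a CPU-heavy, share-nothing task.
    return sum(x * x for x in chunk)

data = list(range(1000))
n = os.cpu_count() or 1
chunks = [data[i::n] for i in range(n)]  # one disjoint chunk per core

with ThreadPoolExecutor(max_workers=n) as pool:
    total = sum(pool.map(work, chunks))

print(total)  # same as sum(x * x for x in range(1000))
```

Because the chunks share no state, the only coordination cost is the final `sum` over the per-worker results.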

In general, concurrency and parallelism tend to be harder in languages that predate the era when threading and multi-core CPUs were common and that lack good primitives for them. Python only recently started addressing the GIL obstacle, and a big motivation for creating Rust was just how hard doing this stuff is in C/C++ without creating a lot of deadlocks, crash bugs, and security issues. It's not impossible with the right frameworks, a lot of skill, and discipline, of course. But Rust is getting a well-deserved reputation for being very optimal and safe for this kind of thing. Likewise, functional languages like Elixir are more naturally suited to running on systems with lots of CPUs and threads.

dspillett | 4 days ago

> I have a 16 core M4 Max and running at a fraction of the potential maximum speed just isn't very optimal on modern CPUs like that.

To further muddy the waters: if your process is not bottlenecked at the CPU, a modern chip might be more optimal in terms of power draw (directly, and through secondary effects like increased cooling needs) when running at a fraction of its speed. Moving at a low clock, but fast enough not to become the bottleneck compared to other factors, instead of bursting to full speed for a bit and then waiting, can be optimal.

Of course there are a bunch of chip-specific optimisations here if you like complexity. Some chips are better off running all cores slowly, while others that can completely power down idle cores are better off running a few cores faster, to optimise power use while getting the same job done in the same amount of wall-clock time.

FpUser | 4 days ago

>"just how hard doing this stuff is in C/C++ without creating a lot of deadlocks, crash bugs, and security issues"

In my opinion this is mostly a problem for novices, or for people who only know how to program inside a very limited and restrictive environment. I write multithreaded business backends in modern C++ that accept outside HTTP requests for processing and do some heavy math lifting. Requests expected to take a short time are processed immediately; long-running ones go to separate thread pools, which also manage throttling of background tasks, etc.

I did not find any of it particularly hard. All my "dangerous" stuff is centralized, was debugged to death years ago, and is used and reused across multiple products. Stuff runs for years and years without a single hiccup. To me it is a non-issue.
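The pattern described, short requests answered inline and long-running ones handed to a separate, throttled pool, might look roughly like this. This is a hypothetical sketch, not the commenter's C++ code; the names (`handle_request`, `long_task`) and the semaphore-based throttle are all illustrative:

```python
# Short requests are answered inline; long-running ones go to a separate
# pool, with a semaphore providing back-pressure (throttling).
import threading
from concurrent.futures import ThreadPoolExecutor

background = ThreadPoolExecutor(max_workers=4)
throttle = threading.Semaphore(8)  # cap outstanding background work

def long_task(x):
    try:
        return x * 2  # stand-in for heavy math lifting
    finally:
        throttle.release()

def handle_request(x, long_running=False):
    if not long_running:
        return x + 1  # short request: processed immediately
    throttle.acquire()  # block new submissions when the pool is saturated
    return background.submit(long_task, x)  # caller gets a Future

fast = handle_request(20)
slow = handle_request(21, long_running=True)
print(fast, slow.result())  # 21 42
```

Centralizing the "dangerous" parts (the pool, the semaphore) in one place, as the comment suggests, is what lets the request handlers themselves stay lock-free.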

I do realize that the situation is much tougher for those who write OS kernels, but that is a very specialized skill and they would know better what to do.

imtringued | 4 days ago

Event loops are great, but composition is hard. The OS (e.g. Linux) lets you feed custom event types into its event loops via eventfd(), but the performance is worse than if you built the event dispatch yourself.

That bad performance leads to a proliferation of hand-rolled event loops that don't mesh together, which in turn leads people to standardize on large async frameworks like tokio.

jiehong | 4 days ago

I think this is a bit like a factory:

- you get a queue as input (a belt);

- you process it;

- you output a queue (also a belt);

So you're doing one thing, over and over, synchronously, blocking in between.
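The factory metaphor above maps almost directly onto code: an input queue (belt), a worker that does one thing, and an output queue (also a belt). A minimal sketch:

```python
# A worker pulls from an input belt, processes one item at a time, and
# pushes to an output belt, blocking in between.
import queue
import threading

inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()

def worker():
    while True:
        item = inbox.get()       # block until the belt delivers
        if item is None:         # sentinel: shift is over
            break
        outbox.put(item * item)  # do one thing, put it on the out-belt

t = threading.Thread(target=worker)
t.start()
for i in range(3):
    inbox.put(i)
inbox.put(None)
t.join()

results = [outbox.get() for _ in range(3)]
print(results)  # [0, 1, 4]
```

Chaining stages is then just pointing one worker's outbox at the next worker's inbox, which is exactly how belts compose in a factory.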