I have a 16 core M4 Max and running at a fraction of the potential maximum speed just isn't very optimal on modern CPUs like that.

Threading is hard, especially when threads share a lot of state. Memory management with multiple threads sharing data is hard, and sharing is ideally minimized. What is optimal very much depends on the type of workload as well. Not all workloads are IO dependent or require sharing a lot of state.

Using threads for blocking IO on server requests was popular 20 years ago in e.g. Java. But these days non-blocking IO is preferred for both single-threaded and multithreaded systems. E.g. Elasticsearch uses threading and non-blocking IO across CPU cores and cluster nodes to provide horizontal scalability for indexing. It tends to stick to just one indexing thread per CPU core, of course. But it has additional thread pools and generally more threads than CPU cores in total.
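To make the non-blocking IO point concrete, here is a minimal single-threaded sketch in Python. The names `fake_request` and `serve_all` are made up for illustration, and `asyncio.sleep` stands in for a real socket read. It only shows that several waits can overlap on one thread; it says nothing about how Elasticsearch is implemented.

```python
import asyncio

async def fake_request(name, log):
    # Hypothetical handler: record when it starts and when its "IO" completes.
    log.append(f"start {name}")
    await asyncio.sleep(0.01)  # stands in for a non-blocking socket read
    log.append(f"end {name}")

async def serve_all(names):
    # Run all handlers concurrently on one thread via the event loop.
    log = []
    await asyncio.gather(*(fake_request(n, log) for n in names))
    return log

if __name__ == "__main__":
    print(asyncio.run(serve_all(["a", "b", "c"])))
```

Every request starts before any of them finishes, so the waiting overlaps without needing one thread per request.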

A lot of workloads where the CPU is the bottleneck but that also do some IO benefit from threading, because other threads can progress while one is waiting for IO. And if the amount of context switching can be limited, that can be OK. For loads that are embarrassingly parallel with little or no IO and very limited context sharing, one thread per CPU core tends to be optimal. It's really when you start having more threads than cores that context switching becomes a factor. What's optimal there depends very much on how much shared state there is and whether you are IO or CPU limited.
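A rough sketch of the "one worker per core" idea for an embarrassingly parallel job, in Python. The helpers `chunk` and `sum_of_squares` are hypothetical names. Note the assumption: on CPython builds with the GIL, threads will not speed up pure-Python CPU work, so this only illustrates the partitioning pattern. For a real speedup there you would use a process pool, with a module-level function instead of the lambda.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def chunk(items, n):
    """Split items into at most n contiguous, nearly equal slices."""
    size, extra = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when there are fewer items than workers
            out.append(items[start:end])
        start = end
    return out

def sum_of_squares(numbers, workers=None):
    # Default to one worker per reported CPU core; no shared mutable state.
    workers = workers or os.cpu_count() or 1
    parts = chunk(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker reduces its own slice; partial results are combined at the end.
        return sum(pool.map(lambda part: sum(x * x for x in part), parts))
```

Each worker only touches its own slice, which is what keeps the shared state (and the need for locking) close to zero.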

In general, concurrency and parallelism tend to be harder in languages that predate the time when threading and multi-core CPUs became common and that lack good primitives for this. Python only recently started addressing the GIL obstacle, and a big motivation for creating Rust was just how hard doing this stuff is in C/C++ without creating a lot of deadlocks, crash bugs, and security issues. It's not impossible with the right frameworks and a lot of skill and discipline, of course. But Rust is getting a well-deserved reputation for being very optimal and safe for this kind of thing. Likewise, functional languages like Elixir are more naturally suited for running on systems with lots of CPUs and threads.

dspillett|4 days ago

> I have a 16 core M4 Max and running at a fraction of the potential maximum speed just isn't very optimal on modern CPUs like that.

To further muddy the waters: if your process is not bottlenecked at the CPU, a modern unit might be more efficient in terms of power draw (directly, and through secondary effects such as increased cooling needs) when running at a fraction of its speed. Moving at a low clock, but fast enough not to become the bottleneck compared to other factors, can be better than bursting to full speed for a bit and then waiting.

Of course there are a bunch of chip-specific optimisations here if you like complexity. Some chips are better off running all cores slowly, while others that can completely power down idle cores are better off running a few cores faster, to optimise power use while getting the same job done in the same amount of wall-clock time.

FpUser|4 days ago

>"just how hard doing this stuff is in C/C++ without creating a lot of dead locks, crash bugs, and security issues"

In my opinion this is probably a problem for novices, or for people who only know how to program inside a very limited and restrictive environment. I write multithreaded business backends in modern C++ that accept outside HTTP requests for processing and do some heavy math lifting. Requests that are expected to take a short time are processed immediately; long-running ones go to separate thread pools, which also manage throttling of background tasks, etc.
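For readers who have not seen that split before, a minimal sketch of the pattern, written in Python rather than C++ and not taken from the poster's code. `Dispatcher`, `handle`, `expected_short`, and `max_pending` are hypothetical names; throttling here is assumed to mean "reject when too many background jobs are pending".

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Dispatcher:
    def __init__(self, background_workers=2, max_pending=8):
        # Separate pool for long-running work so it cannot starve short requests.
        self._pool = ThreadPoolExecutor(max_workers=background_workers)
        # Caps queued plus running background jobs (the throttle).
        self._slots = threading.BoundedSemaphore(max_pending)

    def handle(self, job, expected_short):
        if expected_short:
            return job()  # run inline on the calling thread
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("too many background jobs pending")
        try:
            future = self._pool.submit(job)
        except Exception:
            self._slots.release()  # do not leak a slot if submission fails
            raise
        # Free the slot once the job finishes, whether it succeeded or raised.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Keeping the pool and the semaphore inside one small class is one way to centralize the "dangerous" parts so the rest of the code never touches a thread directly.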

I did not find it particularly hard. All my "dangerous" stuff is centralized, was debugged to death years ago, and is used and reused across multiple products. Stuff runs for years and years without a single hiccup. To me it is a non-issue.

I do realize that the situation is much tougher for those who write OS kernels, but this is a very specialized skill and they would know better what to do.

dspillett|4 days ago

A key difference is that it sounds like you need to create and otherwise interact with that sort of code on a regular basis.

Most devs spend most of their time, all of it even, on tasks that are either naturally sequential or don't benefit from threading enough over the safer option of multiple independent processes, so when they do come across a problem that is inherently parallelizable and needs the highest performance it is not a familiar situation for them. Familiarity can make some rather complex processes feel simple.

The same can be said for event-loop-driven concurrency: for those who don't work that way often, the collection of potential race conditions there can feel daunting, so they appreciate their chosen platform holding their hand a bit.
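As an illustration of the kind of race an event loop still allows, here is a small Python `asyncio` sketch. The account example and the function names are invented for this purpose. The check and the update are separated by an `await`, so another task can run in between even though everything is on one thread.

```python
import asyncio

async def unsafe_withdraw(account, amount):
    if account["balance"] >= amount:
        await asyncio.sleep(0)  # yields to the loop, as any real IO await would
        account["balance"] -= amount  # the earlier check may be stale by now

async def safe_withdraw(account, amount, lock):
    async with lock:  # check and update happen as one unit
        if account["balance"] >= amount:
            await asyncio.sleep(0)
            account["balance"] -= amount

async def run_demo():
    racy = {"balance": 100}
    await asyncio.gather(unsafe_withdraw(racy, 80), unsafe_withdraw(racy, 80))

    guarded = {"balance": 100}
    lock = asyncio.Lock()
    await asyncio.gather(
        safe_withdraw(guarded, 80, lock), safe_withdraw(guarded, 80, lock)
    )
    return racy["balance"], guarded["balance"]

if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

Without the lock both withdrawals pass the balance check and the account goes negative; with the lock the second withdrawal sees the updated balance and is skipped.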