
Is parallel programming hard, and, if so, what can you do about it?

184 points| nequo | 2 years ago |mirrors.edge.kernel.org

191 comments

[+] ascar|2 years ago|reply
I see a lot of confusion between parallel programming [1] and concurrent programming [2] in the comments here.

The former, which is what this book is about, deals with the problem of parallelizing a single sequential program. There is usually strong interaction or dependency between elements, and progress needs synchronization. E.g. timestep iterations in real-time simulations that need synchronization with data communication after each timestep. These simulations also tend to get way too big to be run on a single machine, let alone a single thread, and get scaled up to millions of cores/threads in supercomputers.

Concurrent programming is what most developers working with the internet are more familiar with. You have mostly independent tasks that you want to run concurrently. "A concurrent system is one where a computation can advance without waiting for all other computations to complete." [2] E.g. nginx serving thousands of user requests at the same time.

The problem domains have a lot of overlap on the basics (e.g. threading), but the focus is very different. Things like synchronization (mutexes, barriers), cache locality and memory bandwidth & latency play a central role in parallel programming, while concurrent programming focuses more on the engineering challenge of distributing independent tasks across multiple threads or machines.

[1] https://en.wikipedia.org/wiki/Parallel_computing

[2] https://en.wikipedia.org/wiki/Concurrent_computing

[+] vivegi|2 years ago|reply
The book covers it in Appendix A.6 (p 424) in the v2023.06.11a PDF file.

> A.6 What is the Difference Between “Concurrent” and “Parallel”?

> From a classic computing perspective, “concurrent” and “parallel” are clearly synonyms. However, this has not stopped many people from drawing distinctions between the two, and it turns out that these distinctions can be understood from a couple of different perspectives.

> The first perspective treats “parallel” as an abbreviation for “data parallel”, and treats “concurrent” as pretty much everything else. From this perspective, in parallel computing, each partition of the overall problem can proceed completely independently, with no communication with other partitions. In this case, little or no coordination among partitions is required. In contrast, concurrent computing might well have tight interdependencies, in the form of contended locks, transactions, or other synchronization mechanisms.

> This of course begs the question of why such a distinction matters, which brings us to the second perspective, that of the underlying scheduler. Schedulers come in a wide range of complexities and capabilities, and as a rough rule of thumb, the more tightly and irregularly a set of parallel processes communicate, the higher the level of sophistication required from the scheduler. As such, parallel computing’s avoidance of interdependencies means that parallel-computing programs run well on the least-capable schedulers. In fact, a pure parallel-computing program can run successfully after being arbitrarily subdivided and interleaved onto a uniprocessor. In contrast, concurrent computing programs might well require extreme subtlety on the part of the scheduler.

[+] crabbone|2 years ago|reply
I don't know who invented this nonsense distinction. The first time I was introduced to this idea of "concurrent programming" being a separate thing was when Go was released. So I associate this nonsense with Go, but it could have happened earlier; I simply never heard about it before then.

Anyways. The way I see it used today, it's applied to language runtimes that are incapable of, or severely crippled at, parallel / concurrent execution, e.g. Python, JavaScript, etc. In such environments programmers are offered a mechanism that has many of the downsides of parallel / concurrent programming (e.g. unpredictable order of execution) without the benefits (i.e. nothing actually happens at the same time, or only sleeping can happen at the same time, etc.)

I feel like this distinction, while a nonsense idea at its core, became so popular due to the popularity of crippled language runtimes and their users needing to validate their worth by adding features their runtimes are inherently incapable of implementing.

A similar situation happened with ML-style types. Python, for example, works very poorly with this add-on, but the desire to match features of other languages led Python developers to add those types anyway. Similarly TypeScript and a bunch of similar languages, especially on the Web.

[+] aidenn0|2 years ago|reply
I (and apparently the author of TFA) disagree with your definition of parallel programming. TFA gives "embarrassingly parallel" programs as one example of a way to make parallel programming simple.

The distinction I learned was: any time you have multiple logical threads of execution you have concurrency, any time you have multiple computations happening simultaneously, you have parallelism.

Multithreaded programming on a single core computer is concurrent, but not parallel. Vector processing is parallel, but not concurrent.

[+] dahart|2 years ago|reply
I know we’re kind of stuck with the term of art “concurrent”, and will forever have to explain the difference between the synonymous words “concurrent” and “parallel” — The Merriam-Webster definition of “concurrent” uses the word “parallel”, for example.

Wouldn’t it be nice if we could come up with a better word for this that doesn’t literally overlap with ‘parallel’ and doesn’t need to deviate so far from its dictionary definition?

Personally I think of JavaScript as ‘asynchronous’, and I know this as a term of art means a programming model, but it’s a lot easier to see that async can be done with a single thread and isn’t necessarily parallel, right?

[+] Salgat|2 years ago|reply
Parallel programming is simply a type of concurrent programming. Concurrent means that two tasks can both progress in a given duration. On a single core computer every thread runs concurrently. Parallel expands on concurrency to mean that two concurrent tasks can also run at the exact same time. On a multi-core computer threads can run in parallel. In many cases concurrent programming and parallel programming have little to no difference, and you program with the assumption that every task can run in parallel (for example, whenever you use async/await or the threadpool).
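The single-core case above is easy to see in a short Go sketch (illustrative code, not from the thread): with `GOMAXPROCS(1)` there is one execution context, so nothing runs in parallel, yet two goroutines that each wait on the other still both make progress — concurrency without parallelism.

```go
package main

import (
	"fmt"
	"runtime"
)

// pingPong runs two goroutines that must interleave to finish:
// each waits for the other's message. With GOMAXPROCS(1) there is
// no parallelism, yet both tasks advance -- that is concurrency.
func pingPong(rounds int) int {
	ping := make(chan int)
	pong := make(chan int)
	go func() {
		for i := 0; i < rounds; i++ {
			ping <- i // hand a value to the other task
			<-pong    // wait for its reply
		}
		close(ping)
	}()
	count := 0
	for range ping {
		count++
		pong <- 1
	}
	return count
}

func main() {
	runtime.GOMAXPROCS(1)     // a single OS-level execution context
	fmt.Println(pingPong(10)) // prints 10
}
```

Neither goroutine could finish without the other being scheduled in between, which is exactly what "both progress in a given duration" means.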
[+] remontoire|2 years ago|reply
Watching geohot code a general matrix multiply from 0.9 GFLOPS and optimising it to 100 GFLOPS by only tinkering with cache locality makes me wonder how much effort should be put into single-threaded performance before ever thinking about multi-threading.
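Much of that kind of win comes from loop order alone. A small Go sketch (illustrative, not geohot's code): both functions compute the same product, but the second walks memory sequentially, which row-major caches and prefetchers reward.

```go
package main

import "fmt"

// matMulIJK is the textbook triple loop: the inner loop strides down
// a column of b, which is cache-hostile for row-major storage.
func matMulIJK(a, b [][]float64) [][]float64 {
	n := len(a)
	c := make([][]float64, n)
	for i := range c {
		c[i] = make([]float64, n)
		for j := 0; j < n; j++ {
			for k := 0; k < n; k++ {
				c[i][j] += a[i][k] * b[k][j]
			}
		}
	}
	return c
}

// matMulIKJ swaps the inner loops: rows of b and c are now walked
// sequentially. Same arithmetic, same result, far friendlier to the
// cache hierarchy on larger matrices.
func matMulIKJ(a, b [][]float64) [][]float64 {
	n := len(a)
	c := make([][]float64, n)
	for i := range c {
		c[i] = make([]float64, n)
	}
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			aik := a[i][k]
			for j := 0; j < n; j++ {
				c[i][j] += aik * b[k][j]
			}
		}
	}
	return c
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	b := [][]float64{{5, 6}, {7, 8}}
	fmt.Println(matMulIKJ(a, b)) // [[19 22] [43 50]]
}
```

The ikj form is only the first rung; blocking/tiling and vectorization are where the rest of those GFLOPS come from.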
[+] jetbalsa|2 years ago|reply
I've seen stuff like that before with a game called Factorio, the only game I've ever seen that is optimized so hard that your RAM speed can noticeably affect large bases, as can a faster L2 cache. Their blog series [1] covers a large part of how they did this. But for a game written mostly in Lua they sure did a good job on it.

1: https://www.factorio.com/blog/post/fff-204

[+] bob1029|2 years ago|reply
Fintech has mostly determined that 1 thread can get the job done. See LMAX disruptor and related ideas.

What problems exist that generate events or commands faster than 500 million per second? This is potentially the upper bar for 1 thread if you are clever enough.

Latency is the real thing you want to get away from. Adding more than one CPU into the mix screws up the hottest possible path by ~2 orders of magnitude. God forbid you have to wait on the GPU or network. If you have to talk to those targets, it had better be worth the trip.
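The Disruptor's core trick is a ring buffer where the producer and consumer never contend on a lock. A toy single-producer/single-consumer version in Go (a sketch in the Disruptor's spirit, not its actual design — the real thing adds sequence barriers, batching, and cache-line padding):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Ring is a minimal single-producer/single-consumer queue: only the
// producer advances head, only the consumer advances tail, so no
// lock is ever taken. Capacity must be a power of two.
type Ring struct {
	buf  []int64
	mask int64
	head atomic.Int64 // next slot to write
	tail atomic.Int64 // next slot to read
}

func NewRing(sizePow2 int64) *Ring {
	return &Ring{buf: make([]int64, sizePow2), mask: sizePow2 - 1}
}

// Push returns false when the ring is full.
func (r *Ring) Push(v int64) bool {
	h, t := r.head.Load(), r.tail.Load()
	if h-t == int64(len(r.buf)) {
		return false
	}
	r.buf[h&r.mask] = v
	r.head.Store(h + 1) // publish only after the slot is written
	return true
}

// Pop returns (0, false) when the ring is empty.
func (r *Ring) Pop() (int64, bool) {
	h, t := r.head.Load(), r.tail.Load()
	if t == h {
		return 0, false
	}
	v := r.buf[t&r.mask]
	r.tail.Store(t + 1)
	return v, true
}

func main() {
	r := NewRing(8)
	for i := int64(1); i <= 3; i++ {
		r.Push(i * 10)
	}
	for {
		v, ok := r.Pop()
		if !ok {
			break
		}
		fmt.Println(v) // 10, 20, 30
	}
}
```

With one writer and one reader the hot path is two atomic loads, one store of data, and one atomic store — which is how a single consumer thread can keep up with event rates that would drown a lock-based queue.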

[+] IshKebab|2 years ago|reply
It's quite rare for problems to be dominated by hot loops in the same way that matrix multiplication is.

Think about something like speeding up a compiler or a web server or a spreadsheet. There's no 50-line function that you can spend a few hours optimising and speed up the whole thing.

That's part of the reason why Python programs (except maths heavy stuff like ML) tend to be so slow despite everyone saying "just write your hot code in C". You can't because there is no hot code.

[+] nradov|2 years ago|reply
That appears to be one of the major design goals for the Mojo programming language. It allows the developer to code at a high level of abstraction and indicate where parallel execution should be used. Then the execution environment automatically optimizes that at runtime based on the actual hardware available. That hardware may change drastically over the full lifecycle of the application code so in the future it will automatically take advantage of hardware advances such as the introduction of new types of specialized co-processors.

https://www.modular.com/mojo

[+] ascar|2 years ago|reply
Cache locality is certainly an important aspect, and especially for matrix multiplication even just changing the loop order (and thus the access pattern) has a huge performance impact. However, an O(n^3) algorithm will always lose out to an O(n^2 log n) algorithm once the input gets big enough. And the difference between the two might be as simple as sorting your input data first.

I teach parallel programming to graduates and the first exercise we give them is a sequential optimization for exactly that reason. Think about if your algorithm is efficient before thinking about all the challenges that come with parallelization.

[+] Hbruz0|2 years ago|reply
Link for that?
[+] Const-me|2 years ago|reply
As a developer, I often choose higher-level APIs not listed in that article.

On Windows, OSX and iOS the OS userland already implements general, and relatively easy to use, thread pools. On Windows, see CreateThreadpoolWork, WaitForThreadpoolWorkCallbacks, etc. It’s easier to use threads with locks while someone else is managing these threads. On Apple, the pool is called “grand central dispatch” and does pretty much the same thing.

The modern Windows kernel supports interesting synchronization APIs like WaitOnAddress and WakeByAddressSingle, which allow you to implement locks without the complexity or performance overhead of maintaining special synchronization objects.

The Linux kernel implements performant and scalable message queues, see mq_overview(7). And it has synchronization primitives like eventfd() and pidfd_open() which let you integrate locks or other things into poll/epoll based event loops.
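The OS thread pools mentioned above (CreateThreadpoolWork, Grand Central Dispatch) share one shape: you submit closures and wait for the batch, and someone else owns the threads. A minimal Go sketch of that shape (illustrative names, fixed workers over a jobs channel):

```go
package main

import (
	"fmt"
	"sync"
)

// runOnPool executes every job on a fixed set of worker goroutines
// and returns the sum of their results. Callers never create or
// manage threads themselves -- the pool does.
func runOnPool(workers int, jobs []func() int) int {
	in := make(chan func() int)
	out := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range in {
				out <- job()
			}
		}()
	}
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in)  // no more work
		wg.Wait()  // all workers drained
		close(out) // let the collector finish
	}()
	sum := 0
	for v := range out {
		sum += v
	}
	return sum
}

func main() {
	jobs := make([]func() int, 10)
	for i := range jobs {
		i := i
		jobs[i] = func() int { return i * i }
	}
	fmt.Println(runOnPool(4, jobs)) // 0+1+4+...+81 = 285
}
```

The point of the comment stands: whichever platform you are on, prefer a pool someone else maintains over hand-rolled thread lifecycle management.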

[+] OnlyMortal|2 years ago|reply
If you’re in C++ land, you might take a look at Boost ASIO. It’s not just for IPC and would give you portable code.
[+] codys|2 years ago|reply
Please don't use POSIX message queues (`mq_*`, `mq_overview`, etc.) when you're writing a program with a single address space (like something that uses threads in a single process).

POSIX message queues (the `mq_*` functions) are much slower than optimized shared-memory queues built on ordinary atomics, and they have semantics that are unexpected to most (they persist like files after process termination, because of what they are designed to be used for; they have system-level limits on queue size and item size; etc.).

A simple benchmark vs Rust's `std::sync::mpsc` queue shows `std::sync::mpsc` is 28.6 times faster with 1 producer and 1 consumer, and 37.32 times faster with 2 producers and 1 consumer.
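The in-process multi-producer/single-consumer shape being benchmarked is what Go channels give you directly — everything stays in one address space, no kernel object involved. A sketch (illustrative, not the benchmark code):

```go
package main

import (
	"fmt"
	"sync"
)

// mpscSum: several producers share one in-process channel and a
// single consumer drains it -- the same shape as std::sync::mpsc,
// as opposed to a kernel-mediated POSIX message queue.
func mpscSum(producers, itemsEach int) int {
	ch := make(chan int, 64)
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 1; i <= itemsEach; i++ {
				ch <- i
			}
		}()
	}
	go func() {
		wg.Wait() // all producers done
		close(ch) // release the consumer
	}()
	sum := 0
	for v := range ch {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(mpscSum(2, 100)) // 2 * (100*101/2) = 10100
}
```

Because nothing crosses the kernel boundary per message, this kind of queue avoids the syscall and persistence overheads the parent comment warns about.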

[+] uguuo_o|2 years ago|reply
In scientific computing (computational fluid dynamics, computational electromagnetics, etc.), parallel programming is a must. Most of the algorithms are not embarrassingly parallel and need a significant amount of communication between threads during runtime. We mostly use one of the many MPI [0] libraries available for desktop and high-performance computing machines. Using these programming paradigms is difficult, but it tends to result in fantastic scalability for all types of engineering problems.

[0] https://en.m.wikipedia.org/wiki/Message_Passing_Interface

[+] OscarCunningham|2 years ago|reply
So this might be a naive question, but is it literally that you're simulating a fluid in parallel by having each thread simulate a portion of the space? And then the message passing between threads is the data about the fluid on the boundary?
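That is essentially the shape of it: each rank owns a slab of the domain and exchanges only its boundary ("halo") cells with neighbours each timestep. A toy 1D Go sketch (channels standing in for MPI messages, and a simple 3-point averaging stencil standing in for a real solver):

```go
package main

import (
	"fmt"
	"sync"
)

// stepSequential applies one 3-point averaging step to the whole
// field u, keeping the endpoints fixed.
func stepSequential(u []float64) []float64 {
	n := len(u)
	v := make([]float64, n)
	v[0], v[n-1] = u[0], u[n-1]
	for i := 1; i < n-1; i++ {
		v[i] = (u[i-1] + u[i] + u[i+1]) / 3
	}
	return v
}

// stepDecomposed computes the same step with two "ranks", each owning
// half the field. Before updating, each rank sends its edge cell to
// the neighbour and receives a halo cell back.
func stepDecomposed(u []float64) []float64 {
	n := len(u)
	mid := n / 2
	v := make([]float64, n)
	toRight := make(chan float64, 1) // rank 0's edge -> rank 1
	toLeft := make(chan float64, 1)  // rank 1's edge -> rank 0
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // rank 0 owns u[0:mid]
		defer wg.Done()
		toRight <- u[mid-1]
		halo := <-toLeft // u[mid], owned by rank 1
		v[0] = u[0]
		for i := 1; i < mid; i++ {
			right := halo
			if i+1 < mid {
				right = u[i+1]
			}
			v[i] = (u[i-1] + u[i] + right) / 3
		}
	}()
	go func() { // rank 1 owns u[mid:n]
		defer wg.Done()
		toLeft <- u[mid]
		halo := <-toRight // u[mid-1], owned by rank 0
		v[n-1] = u[n-1]
		for i := mid; i < n-1; i++ {
			left := halo
			if i-1 >= mid {
				left = u[i-1]
			}
			v[i] = (left + u[i] + u[i+1]) / 3
		}
	}()
	wg.Wait()
	return v
}

func main() {
	u := []float64{1, 4, 2, 8, 5, 7, 3, 6}
	fmt.Println(stepSequential(u))
	fmt.Println(stepDecomposed(u)) // identical output
}
```

Real CFD codes do this in 3D, so the halos are faces of a subvolume, and the surface-to-volume ratio is what makes the communication cost tolerable as the problem scales.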
[+] klabb3|2 years ago|reply
I’m way-above-average interested in concurrent programming but this 600+ page brick will probably remain on my reading list until I am stranded on a deserted island. Did anyone here read the whole thing? Can one make a reasonable summary or is this more of a lexicon of different techniques?
[+] giovannibonetti|2 years ago|reply
Functional programming can be a great way to handle parallel programming sanely. See the Futhark language [1], for example, which takes high-level constructs like map and reduce and compiles them to the appropriate machine code, either for the CPU or the GPU.

[1] https://futhark-lang.org/
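The reason map and reduce parallelize so well is that map has no cross-element dependencies and reduce only needs an associative operator. A CPU-only Go sketch of the same two building blocks (illustrative names, nothing Futhark-specific):

```go
package main

import (
	"fmt"
	"sync"
)

// parMapReduce applies f to every element in parallel chunks, then
// combines the per-chunk results with an associative op and its
// identity id. Associativity is what lets the chunks run in any
// order -- the property Futhark-style compilers exploit.
func parMapReduce(xs []int, f func(int) int, op func(int, int) int, id, workers int) int {
	partial := make([]int, workers)
	chunk := (len(xs) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if lo > len(xs) {
			lo = len(xs)
		}
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			acc := id
			for _, x := range xs[lo:hi] {
				acc = op(acc, f(x)) // fused map + partial reduce
			}
			partial[w] = acc
		}(w, lo, hi)
	}
	wg.Wait()
	acc := id
	for _, p := range partial {
		acc = op(acc, p) // combine the chunk results
	}
	return acc
}

func main() {
	xs := []int{1, 2, 3, 4, 5, 6}
	sq := func(x int) int { return x * x }
	add := func(a, b int) int { return a + b }
	fmt.Println(parMapReduce(xs, sq, add, 0, 3)) // 1+4+9+16+25+36 = 91
}
```

A compiler like Futhark's generates essentially this decomposition for you, for CPU threads or GPU workgroups, from the declarative map/reduce form.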

[+] davidgrenier|2 years ago|reply
This looks brutal; for a lot of people, John Reppy's book Concurrent Programming in ML (as in SML, not machine learning) is going to be much more accessible. Pick the CSP-style library in the programming language of your choice:

Go with goroutines and channels

Clojure with core.async

F# with Hopac

It would be a very interesting project to roll your own in C# using Microsoft Robotics Studio's CCR (Concurrency and Coordination Runtime) (though I speculate those are buffered channels by default).
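The CSP style the list above shares is easy to show concretely in Go, the one language in it with the model built in: each stage owns no shared state and talks only over channels (illustrative pipeline, not from Reppy's book):

```go
package main

import "fmt"

// gen emits its arguments on a fresh channel, then closes it.
func gen(nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		for _, n := range nums {
			out <- n
		}
		close(out)
	}()
	return out
}

// square is a pipeline stage: it reads from one channel and writes
// transformed values to another, never touching shared memory.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for n := range in {
			out <- n * n
		}
		close(out)
	}()
	return out
}

// collect drains a channel into a slice.
func collect(in <-chan int) []int {
	var r []int
	for n := range in {
		r = append(r, n)
	}
	return r
}

func main() {
	fmt.Println(collect(square(gen(1, 2, 3)))) // [1 4 9]
}
```

core.async, Hopac, and CML all express this same composition; the differences are mostly in buffering defaults and whether channel operations are first-class values you can combine (as in CML's events).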

[+] PheonixPharts|2 years ago|reply
There are multiple replies like this one, but it's a bit shocking to see that on Hacker News people don't know the difference between concurrent programming and parallel programming.

Concurrency means that you can have multiple tasks running in the same time period; parallelism means you have multiple tasks running at the same time.

The most obvious demonstration of this is that you can (and many languages do) have single threaded concurrency.

I did some grad course work in parallel programming and there's really no way to not make it "brutal", because to really do it in a way that increases performance you need to really understand some low-level performance issues.

[+] Paradigma11|2 years ago|reply
https://github.com/Hopac/Hopac is such an impressive piece of software. Too bad it never really took off like it deserved, but with more popular competition like Rx or plain tasks/async (which is enough for most stuff) that was pretty unavoidable.
[+] cmrdporcupine|2 years ago|reply
Concurrent state management is hard, and I remain disappointed that software transactional memory -- and pushing a DB-style approach to non-DB workloads generally -- hasn't caught on.

I think most "application" type programs would benefit from being able to manage state in terms of higher level transactional operations and let their runtime take care of serializing and avoiding deadlock and race conditions. Developers in those spaces have come to rely on garbage collection in their runtimes, I see no reason why they couldn't come to also rely on a fully isolated MVCC transaction model (and only get access to non-transactional memory in exceptional circumstances). Bonus points if said transactions tie back to the persistent RDBMS transaction as well.

There are too many minefields in manual lock management and the tools remain fairly low level. Ownership management in languages like Rust helps, following a discipline like an actor/CSP approach helps, etc. but in the end if there's shared state and there's parallel work, there's potential for trouble.

[+] mrkeen|2 years ago|reply
It's a good day when I get to use STM.

While the primary benefit is in thinking "these things should happen together or not at all" rather than thinking about locks, there's another feature I always forget about, which is retrying.

Retrying lets you 'nope' out of a transaction and try again as soon as something changes, without busy-waiting.
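Go has no STM, but the block-until-something-changes behaviour of `retry` maps roughly onto a condition variable (a sketch of the idea, not of Haskell's STM semantics — real STM also re-runs the whole transaction atomically):

```go
package main

import (
	"fmt"
	"sync"
)

// Account emulates the classic STM example: Withdraw "retries" --
// it sleeps whenever the balance is short, and is woken only when a
// Deposit changes the state it read. No busy-waiting.
type Account struct {
	mu      sync.Mutex
	cond    *sync.Cond
	balance int
}

func NewAccount() *Account {
	a := &Account{}
	a.cond = sync.NewCond(&a.mu)
	return a
}

func (a *Account) Deposit(n int) {
	a.mu.Lock()
	a.balance += n
	a.mu.Unlock()
	a.cond.Broadcast() // state changed: let blocked withdrawals retry
}

// Withdraw blocks until the balance covers n.
func (a *Account) Withdraw(n int) {
	a.mu.Lock()
	for a.balance < n { // the transaction's precondition
		a.cond.Wait() // give up and sleep until a Deposit broadcasts
	}
	a.balance -= n
	a.mu.Unlock()
}

func main() {
	a := NewAccount()
	done := make(chan bool)
	go func() {
		a.Withdraw(50) // blocks: balance starts at 0
		done <- true
	}()
	a.Deposit(80)
	<-done
	fmt.Println(a.balance) // 30
}
```

What STM adds on top of this pattern is composability: you can combine two such operations into one atomic transaction without inventing a lock ordering, which is exactly what condition variables cannot give you.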

[+] cj|2 years ago|reply
This reminds me when I was going through the YC accelerator in 2012.

We were building a web-based email client, and PG didn’t like the idea. He pulled our team aside during one of the batch-wide Tues night dinners and suggested we pivot to building something that could take single threaded programs and quickly/easily make them multi-threaded.

No one on our team knew anything about threading (none of us had even graduated college). So we kept working on the email client, now defunct.

PG brought us in a back room where the Stripe founders were waiting (I think they were speaking that night). We were brainstorming ideas with them, and one of the brothers recommended we build something having to do with phones and calling people (don’t remember the idea anymore).

PG was not very happy when he heard we weren’t going to pivot and kept on working on a new email client :)

[+] imtringued|2 years ago|reply
The biggest problem with parallel programming is the fixed communication cost. Threads are expensive to launch and stop. This leads to thread pooling, which is not a bad idea, but then you have to make sure that nothing blocks the thread pool. Java is going to fix this problem with virtual threads.

Even if threads are cheap, you still have to decide when to parallelize something. If the section is short enough, splitting up the work will make your program slower because you have to wait for the new thread to be scheduled and then for the launching thread to be scheduled.

These are just the problems to get you started. They are not big barriers but they are big enough to make it not be the default.

The next problem is dynamic runtime behaviour. A parallel program can exhibit far more weird behaviours due to the nature of interleaved execution. This means that you will want a strong ownership model for your data and so far only Rust does it competently. Instanced locks are difficult to get right. Static locks are almost trivial but only if you can guarantee that your critical section never calls code that invokes the same lock. Recursive locks are a bad idea but not using them means you need to have two sets of methods. One is the public synchronized method that library users call and the other is the private unsynchronized method that actually does most of the work. It's very ugly to work with locks.
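The two-method pattern described above looks like this in Go, whose `sync.Mutex` is non-recursive (illustrative sketch):

```go
package main

import (
	"fmt"
	"sync"
)

// Counter uses the pattern from the comment: the exported method
// takes the lock once, and internal calls go through an unexported
// *Locked variant, so the non-recursive mutex is never re-acquired.
type Counter struct {
	mu sync.Mutex
	n  int
}

// AddTwice is the public, synchronized entry point.
func (c *Counter) AddTwice(delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.addLocked(delta) // calling a public locking method here would deadlock
	c.addLocked(delta)
}

// addLocked does the actual work and assumes c.mu is already held.
func (c *Counter) addLocked(delta int) {
	c.n += delta
}

func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

func main() {
	var c Counter
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.AddTwice(1)
		}()
	}
	wg.Wait()
	fmt.Println(c.Value()) // 200
}
```

It is ugly, as the comment says, but the discipline is mechanical: public methods lock, private `*Locked` methods assume the lock, and the two sets never cross.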

The other problem is that a lot of problems are genuinely difficult to parallelize. It is better to parallelize hierarchically where each hierarchy is still single threaded. This way you can maintain the illusion of mostly single threaded code. The alternative often requires a bespoke architecture. There are hardly any generalized solutions. You need to be an expert at parallel programming.

[+] anttiharju|2 years ago|reply
Here's course materials about the topic from my university, I assume they're available globally even if you can't sign up for the course: https://ppc.cs.aalto.fi/
[+] namirez|2 years ago|reply
Parallel programming is not hard; state management is hard. And for problems that are not embarrassingly parallel, unfortunately there is no silver bullet.
[+] spacechild1|2 years ago|reply
I have almost finished the book (to be honest, I have skipped the "Validation" chapter) and I found it quite interesting. In particular I liked that it also covered more "exotic" topics such as sequence locks, hazard pointers and RCU. However, in general, I found it a bit too focussed on Linux kernel code. The code examples are full of kernel macros, such as READ_ONCE and WRITE_ONCE, instead of C11/C++11 atomics. In the end, I appreciated it because it gave me a new perspective, but I'm not sure if this is helpful for people who are not already familiar with parallel programming...
[+] z3t4|2 years ago|reply
The symptoms that led up to me learning about the need for locks in multi-threaded servers were painful. The state could change from one line of code to the next: how could it change when I had just set it a microsecond ago? Because the user clicked the button twice, and the calls executed simultaneously in different threads, because my new server had many cores. Then I learned about async programming, but I no longer remember what was so difficult about it. Just that it took about a year, and I basically had to rewire how my brain visualizes program execution.
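That double-click bug reduces to two handlers racing on one flag, and the fix is to check and set the flag under a lock. A Go sketch of the before-the-lesson bug's fix (hypothetical order-handler names):

```go
package main

import (
	"fmt"
	"sync"
)

// Order guards its "already processed" state with a mutex, so the
// second of two simultaneous clicks observes the first one's write
// instead of a stale read.
type Order struct {
	mu      sync.Mutex
	charged int
}

func (o *Order) HandleClick() {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.charged > 0 { // state may have changed since the click
		return
	}
	o.charged++ // charge exactly once
}

func main() {
	var o Order
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // the user double-clicked
		wg.Add(1)
		go func() {
			defer wg.Done()
			o.HandleClick()
		}()
	}
	wg.Wait()
	fmt.Println(o.charged) // 1
}
```

Without the mutex, both handlers can read `charged == 0` before either increments it, which is exactly the "it changed between two lines of code" experience described above.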
[+] atum47|2 years ago|reply
I took the parallel programming class back in college. Even though we implemented several good examples using parallelism like matrix multiplications and different methods of achieving it using the GPU or distributed computers, the main goal of the class was to analyze and see what problems could be processed in parallel and also how to divide the problem in small chunks in order to do so. It was a lot of fun.
[+] DayDollar|2 years ago|reply
Ah, the old debate and approach. Centralize the control, decentralize problem handling, centralize the knowledge needed for untangling parallelism, then realize you just created a centralized authority, undoing the parallelism.

Then there is the physics-engine view of all things. Data is just sand washing against the shoreline of parallel computation instances, and interaction is just sticking these data particles together, taking them out of the general flow to be operated on in unison, carrying the resolution authority within them as long as they are "clumped" together. Data is just particles; the programs acting on it are just a limited number of program cells traversing it, and interaction clumps it together, making its processing slower but also centralized to one computation node. A centralized node processing a huge clump of interacting data even throws a sort of "relativity" shadow, as other, smaller, non-interacting data is processed faster and leaps ahead in the interactions.

[+] dejvid_|2 years ago|reply
It is not hard if you use proper libraries. Take a look at the Scala ZIO library; you can solve any concurrency problem with ease. With the toolset ZIO provides, you can tackle any concurrency problem in a typesafe way, and it is mostly correct if it compiles :)
[+] samsquire|2 years ago|reply
Thanks for another thing I need to add to my reading list in addition to The Art of Multiprocessor Programming.

I am really interested in parallel, asynchronous, multithreaded, coroutine, and futures programming, so it's what I spend my days thinking and blogging about. I hope you sense my excitement in this comment. I'm looking for a programming model that parallelises easily and doesn't require much effort, so this PDF seems relevant to me. I really should try to be a user of languages like Erlang, Inko, Pony and Go, but I am too interested in the mechanisms of these languages!

I am also learning from Erlang and Go, nginx and LMAX disruptor.

I don't focus on number-crunching parallelisation; I let libraries and frameworks such as BLAS parallelise matrix multiplication. I'm interested in rote system parallelisation architecture. For example, PHP and Node.js are not parallel languages, but because PHP is hosted in FastCGI processes it can be executed multiple times by nginx, so it is in effect parallel across requests. Unfortunately, PHP and Node.js cannot create threads or use a thread pool within a request.

I want heavy CPU tasks of a request to not block other requests or the event loop and heavy IO requests to not block the event loop. I am a pre-beginner in Rust but I think you can use Rayon for CPU heavy tasks and Tokio for async IO parallelisation.

Here's a system diagram that I'm thinking about lately: https://github.com/samsquire/ideas5/blob/main/NonblockingRun...

The design is that we have three groupings of thread types. The application starts up some application threads which are not associated with a request; these service multi-producer multi-consumer thread-safe ringbuffers in lightweight threads with a Go/Erlang-like lightweight process runtime. (My simple lightweight thread runtime is https://github.com/samsquire/preemptible-thread) We also multiplex multiple network client sockets across a set number of kernel threads which I call control threads. Their responsibility is to dispatch work ASAP to a work-stealing thread pool, which has its own group of threads. So we pay a thread synchronization cost ONCE per IO, which is the dispatch from the control thread to a thread pool thread. (Presumably this is fast, because the thread pool threads are all looping on a submission queue.)

We split all IO and CPU tasks into two halves: submit and handle reply. I assume you can use liburing or epoll in the control threads. We do the same with CPU tasks, and use ringbuffers to communicate between threads. We can always serve clients' requests because we're never blocked on handling someone else's request. The control thread is always unblocked.

I think this article is good regarding Python's asyncio story: https://charlesleifer.com/blog/asyncio/

I think the best multithreaded architecture is to never rely on synchronization, because it doesn't scale. Try and separate your task so the work is more like a tree than a graph, so that you don't need to communicate between branches. You can shard your data and work independently and merge at the end, similar to mapreduce.

[+] jefc1111|2 years ago|reply
My answers to the two questions in the post title: "Yes" and "Read Concurrent Data Processing in Elixir [1]"

I have dealt with concurrency in PHP (good times with Laravel Horizon) in a couple of different ways and in NodeJS (never feels good, to me). The BEAM has some great primitives for concurrency and the book mentioned above walks the reader through them at a good pace. I read it cover to cover and it really enriched my mental model of how to use concurrency and async.

[1] https://pragprog.com/titles/sgdpelixir/concurrent-data-proce...

[+] PheonixPharts|2 years ago|reply
Concurrency is not the same as parallelism. You can have single threaded concurrent programs.