
marmight | 5 years ago

For folks interested in threads in C, I also recommend reading "Threads Cannot Be Implemented as a Library" [1]. The summary: Pthreads mostly works correctly as a library (1) because its functions execute memory-fence instructions and (2) because the C compiler treats its functions as opaque, i.e., as functions that might read or write any global variable. However, these properties fail to guarantee thread-safe code under a number of conditions, such as in the presence of certain compiler optimizations. Thus, the paper argues that the compiler must be aware of threads, and that threads cannot be implemented purely as a library.

[1] https://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf


jcranmer | 5 years ago

The language basis for threading is now the C11/C++11 memory model [1], even if you don't directly use it--it's baked into the compiler IR. I would argue that in the 2020s, you should be using C11/C++11 threading instead of direct pthreads if you're writing new code.

(And yes, Hans Boehm was one of the people instrumental in getting the threaded memory model adopted.)

[1] If you really want to be pedantic, it's the current verbiage that rules, independent of the language version you request. This is one of those situations where describing the semantics formally is so challenging that many changes to the specification aren't meant to change how things are compiled, only how the behavior is described in the text.

wahern | 5 years ago

In some ways the POSIX API is better. For example, the POSIX API permits you to statically define and initialize pthread_mutex_t, pthread_once_t, and similar objects. Without static initialization you're stuck in a catch-22 because to initialize such objects at runtime you have to rely on some other implicitly serialized control flow, such as by invoking initializers from main(), which is particularly cumbersome and error prone for libraries.

C++ has it a little easier because, AFAIU, the language itself supports static constructors--i.e. the ability to run [mostly] ad hoc code to initialize static objects at runtime (specifically load time).

The C11/C++11 threading API lacks static initializers because, when the specification was being developed, Windows' lightweight threading primitives didn't support such a capability. Basically, the C11/C++11 API had to take a lowest-common-denominator approach, which in most cases meant dropping useful features from the POSIX API so Microsoft didn't have to reimplement its primitives. (The hope was that C11- and C++11-compliant implementations would become widely available immediately if the standard made it easy to wrap existing OS primitives, but unfortunately implementations were--and arguably still are, especially for C--many years late.)

Also, the memory model is mostly distinct from the threading APIs. The memory model was more important for nailing down atomics in C11 and C++11. The threading APIs are much higher level, and because they rely on opaque types and function calls, their implementation and the necessary compiler support (ISA memory barriers, compiler code motion barriers, etc) was largely transparent.

dragontamer | 5 years ago

Though that's only 10 pages--short for a paper--I'll take a shot at a simpler explanation.

Compilers are expected to optimize your code, and one of the primary ways they do that is by rearranging your statements.

Consider the following:

  int i=0;
  i++;
  sleep(1); // Yeah, sleep isn't a proper memory barrier.
  i++;
  sleep(1); //  But in my experience, beginners understand sleep. So shoot me.
  i++;
  sleep(1);
  i++;
  sleep(1);
  i++;
  sleep(1);
The compiler will often "rearrange" these ++ statements to all happen on the same line, ultimately as follows:

  int i=0;  i++;  i++;  i++;  i++;  i++;
  sleep(1);
  sleep(1);
  sleep(1);
  sleep(1);
  sleep(1);
Then

  int i=5;
  sleep(5);
Simple enough. Except... what if Thread#2 had the following:

  while(i<2); // Infinite loop waiting for i to equal 2
  foo();
Then in Thread#3:

  while(i<3); // Infinite loop waiting for i to become 3
  bar();
Then in Thread#4:

  while(i<4); // Infinite loop waiting for i to become 4
  baz();
Then in Thread#5:

  while(i<5); // Infinite loop waiting for i to become 5
  foobar();
As we can see here, "i" is a synchronization variable--a fact we only know if we know how the other threads work. Once i no longer steps from 1 to 2 to 3 to 4 to 5, the threads no longer synchronize and the code gains a race condition (all the waiting threads may fire at once, since i starts off as 5).

-----------

For better or worse, modern programmers must think about the messages passed between threads. After all, semaphores are often i++ and i-- statements at the lowest level (maybe with a touch of atomic_swap, or a lock prefix, depending on your architecture).

Modern code must mark when a variable matters for inter-thread synchronization, to selectively disable the compiler's optimizer (funny enough, the same marking is also needed to strongly order the L1 cache and the out-of-order core of modern processors).

As such, proper threading requires a top-to-bottom, language-level memory model: the "knowledge" that the i++ statements cannot be optimized or combined past the sleep statements.

---------

This is no longer an issue on modern platforms. Today we have the C++11 memory model, which strongly defines where and when optimizations can occur, with "seq_cst" memory ordering as the default.

There is also a faster, but slightly harder to understand, memory model of acquire and release. This acquire/release model pays off on more weakly ordered systems like ARM and POWER9.

Your mutex_lock() and mutex_unlock() calls contain these memory barriers, which tell the compiler, CPU, and L1 cache to order the code the way the programmer expects. No optimizations are allowed "over" a mutex_lock() or mutex_unlock() call, thanks to the memory model.

But back in 2004, before the memory model was formalized, it was impossible to write a truly portable POSIX-threads implementation. (Fortunately, compilers at the time recognized the issue and solved it in their own ways: Windows had InterlockedExchange calls, and GCC had its own memory model. But the details were non-standard and non-portable.)