> Where the memory model of ARM differs from X86 is that ARM CPUs will re-order writes relative to other writes, whereas X86 will not.
May. Not will. The difference here is important, because the actual memory ordering presented is an issue of hardware implementation choice (and of course the local vagaries like cache line alignment, interrupt order and the behavior of other CPUs on the bus). You can't just write some sample code to demonstrate it and expect it's going to work the same on "ARM".
In fact I'd be really curious how Apple handles this during the Mac transition. I wouldn't be at all surprised if, purely for the sake of compatibility, they implement a strongly ordered x86-style cache hierarchy. Bugs in this world can be extremely difficult to diagnose, and honestly cache coherence transistors aren't that expensive relative to the rest of the system.
Reordering rarely or never comes from "cache coherence" on CPUs, but rather from core-local effects like store buffering, out-of-order execution, and coalescing and out-of-order commit in the store buffer.
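The store-buffering effect described above is the classic litmus test for this. Here's a small self-contained sketch (my own code, not from the article) of the two-thread test: with SeqCst the outcome where both threads read 0 is forbidden by the memory model, while with Relaxed it is allowed, and whether you ever observe it depends on the hardware (store buffers make it possible even on x86):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(0);

// One run of the store-buffer litmus test: each thread stores to one
// variable and then loads the other, using the given ordering.
fn run_once(order: Ordering) -> (u32, u32) {
    X.store(0, Ordering::SeqCst);
    Y.store(0, Ordering::SeqCst);
    let t1 = thread::spawn(move || {
        X.store(1, order);
        Y.load(order)
    });
    let t2 = thread::spawn(move || {
        Y.store(1, order);
        X.load(order)
    });
    (t1.join().unwrap(), t2.join().unwrap())
}

fn main() {
    // With SeqCst there is a single total order over all four accesses, so
    // at least one thread must observe the other's store: (0, 0) is impossible.
    for _ in 0..5_000 {
        let (r1, r2) = run_once(Ordering::SeqCst);
        assert!(!(r1 == 0 && r2 == 0), "SeqCst forbids (0, 0)");
    }
    // With Relaxed, (0, 0) is permitted by the model; how often (or whether)
    // it shows up is exactly the implementation-dependent behavior discussed above.
    let mut both_zero = 0;
    for _ in 0..5_000 {
        let (r1, r2) = run_once(Ordering::Relaxed);
        if r1 == 0 && r2 == 0 {
            both_zero += 1;
        }
    }
    println!("relaxed runs that observed (0, 0): {both_zero}");
}
```

Note the sketch can only demonstrate that the forbidden outcome never appears under SeqCst; the Relaxed count may well be zero on a given machine and run, which is the commenter's point.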
Isn't the following statement always true, as casting using `as` will silently ~~overflow~~ truncate the `u32` if `usize` is 64-bits?
assert!((samples as u32) <= u32::MAX);
EDIT: I know it's a contrived example, but I was just curious if my understanding is correct. I also found this page in the nomicon about casting: https://doc.rust-lang.org/nomicon/casts.html
EDIT2: As I thought, casting a `usize` (which is 64 bits here) to a `u32` causes it to be truncated, and hence the assertion is always true. Further, by using a number that's bigger than a `u32`, this example contains undefined behavior. This is due to the use of `slice::from_raw_parts`, where `self.samples` is left as a `usize` and hence takes a much bigger slice than what was allocated (the leftover of the truncate operation). I made a small playground which demonstrates the segfault. https://play.rust-lang.org/?version=stable&mode=debug&editio.... The assertion should rather be:
assert!(samples <= u32::MAX as usize);
Don't get me wrong, I think the blogpost is a great explanatory article about memory ordering and the example is rather contrived. I just wanted to reassure myself that my understanding was correct and further perhaps help someone not seeing this issue (as this is a very easy trap to fall into).
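For anyone who wants to poke at the truncation itself, a minimal sketch (my own stand-in value for `samples`, not the blogpost's struct, and assuming a 64-bit target):

```rust
fn main() {
    // Hypothetical value standing in for `self.samples`.
    let samples: usize = u32::MAX as usize + 1; // 4_294_967_296 on a 64-bit target

    // `as` silently truncates to the low 32 bits, so 2^32 becomes 0 ...
    assert_eq!(samples as u32, 0);

    // ... which means the original check always passes, even though
    // `samples` does not actually fit in a u32:
    assert!((samples as u32) <= u32::MAX);

    // Comparing in usize space, as the corrected assertion does,
    // actually catches the out-of-range value:
    assert!(samples > u32::MAX as usize);
}
```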
> The x86 processor was able to run the test successfully all 10,000 times, but the ARM processor failed on the 35th attempt.
I think this issue might prove a problem in the long tail of desktop and server software running on ARM.
A lot of desktop and server applications try to take advantage of all the cores. Many times, they are using libraries that were either implemented prior to C and C++ having defined memory models or else without that much care for memory model as long as it ran without issues on the developer computer (x86) and server (x86). Going to ARM is going to expose a lot of these bugs as developers recompile their code for ARM without making sure that their code actually adheres to the C/C++ memory models.
There are now two incentives to support ARM better - Apple’s move to ARM on the desktop and cheaper cloud bills if you’re willing to use ARM instances. Either one alone wouldn’t be enough of an incentive, but together they will cause a shift in the next 3-5 years.
Developers will become more aware of the differences between the architectures, tool chains will accommodate both better, people and software will stop assuming they are running on x86 as default. ARM won’t “win” the desktop or the server market, but it will become a viable alternative, squeezing the profits of companies who depend on x86.
Smartphones have used ARM for a while, and many of the major libraries have been used in one project or another. So I think this problem won't be as severe, because those bugs hopefully were already fixed.
Also, the Raspberry Pi has been a popular choice for tinkerers for years, which also helps with ARM penetration.
Yeah, feels similar to Make concurrency bugs: the makefile was developed with --jobs=1, and then years down the track someone says “let’s make it faster!” and tries --jobs=8 or similar, but that leads to compilation failures (or worse, success at compiling the wrong thing). Because the makefile was made years ago, it’s hard to track down exactly where the prerequisites are lacking; whereas if --jobs=8 had been used from the start, it would have been caught early.
So far I have not encountered any problems with e.g. desktop Firefox (no surprise, gecko has been running on Android for ages) and various server things (mostly the BEAM.. and even some Haskell, even though before 8.8 GHC did not put proper memory barriers everywhere, and I used 8.6 back when I compiled the app).
Most libraries and applications use stuff like mutexes, btw :) it's not like people who don't care about memory models try to make lockfree things often.
Old software will have been tested on CPUs with weaker memory models.
Also, multi-threaded programming is hard, so the really long tail probably already is buggy on x86. Surfacing their bugs more often may be a blessing in disguise.
It wouldn't surprise me if server-focused ARM chips ended up providing x86-style ordering to ensure compatibility with ported software.
If we initialize the contents of both pointers to 0
Is this a "Rust-ism"? I had a double-take while reading that, because in C that would mean a null pointer, and in the terminology I'm used to, the intent is to set the pointee to 0.
Note that x86 does allow some memory reordering:
https://preshing.com/20120515/memory-reordering-caught-in-th...
(I have experience debugging and fixing an extremely rare bug caused by the above subtle reordering, which occurred approximately once every 3-4 months.)
It doesn’t ring a bell to me as someone who’s spent a lot of time in the Rust community, so I’d say it’s probably just a difference in personal jargon rather than a Rust versus C thing.
(Rust usually uses “reference” instead of “pointer” anyway.)
C programmers use "contents" to refer to the address of the pointer? As a non-C programmer, asking what is "contained" in the pointer would certainly refer to the pointee.
I've never thought about that aspect of my writing, but I use the terms "value of the pointer" to mean the address, and "contents of the pointer" to mean what's stored at the address.
Ouch, that hurts. You should be proud of that fix... I guess you kinda are :D
It's a Rust-ism. In Rust there is no null.
One thing omitted from this article is that it's not only the physical hardware that may (effectively) re-order operations. The compiler may also perform these re-orderings.
The compiler's re-orderings will always be valid according to the abstract memory model rather than the hardware's, so even on x86 you must use the correct memory orderings, or risk subtle bugs due to compiler optimisations.
In particular, the “multi-threaded using volatile” example is technically incorrect because the compiler is allowed to move non-volatile accesses across volatile ones. [1] To make it correct (but still only on x86), you’d have to use std::sync::atomic::compiler_fence.
[1] https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html
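As a sketch of that distinction (my own publish/consume naming, not the article's code): `compiler_fence` compiles to zero machine instructions and only forbids the compiler from moving memory accesses across it, so the pattern below is only sound on hardware like x86 that already keeps stores in order. On ARM you'd need a real `std::sync::atomic::fence` or Release/Acquire orderings.

```rust
use std::sync::atomic::{compiler_fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn publish(value: u32) {
    DATA.store(value, Ordering::Relaxed);
    // Stop the compiler from sinking the DATA store below the READY store.
    compiler_fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}

fn consume() -> Option<u32> {
    if READY.load(Ordering::Relaxed) {
        // Stop the compiler from hoisting the DATA load above the READY load.
        compiler_fence(Ordering::Acquire);
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}

fn main() {
    // Single-threaded smoke test of the pair; the ordering subtleties only
    // matter when publish and consume race on separate cores.
    publish(42);
    assert_eq!(consume(), Some(42));
}
```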
The problem is that there's likely to be a fair amount of logically broken 3rd-party code out there, because x86's memory model doesn't reveal the issues. The compiler can't help here; it'll just do what the developer told it to do.
It's a whole new 'it works on my machine' issue (for some people).
It does, as long as the programmer uses it properly (i.e. you don't use atomics weaker than SeqCst, or unsafe code with shared + mutable accesses, unless you have proven that it's correct).
See how the functions are all unsafe? The primitives given by the Rust standard library actually do handle this; the author is simply going off the beaten path to illustrate one of the aspects those primitives need to paper over for you.