top | item 23659037

Examining ARM vs. x86 Memory Models with Rust

240 points | redbluemonkey | 5 years ago | nickwilcox.com | reply

65 comments

[+] ajross|5 years ago|reply
> Where the memory model of ARM differs from X86 is that ARM CPU’s will re-order writes relative to other writes, whereas X86 will not.

May. Not will. The difference here is important, because the actual memory ordering presented is an issue of hardware implementation choice (and of course the local vagaries like cache line alignment, interrupt order and the behavior of other CPUs on the bus). You can't just write some sample code to demonstrate it and expect it's going to work the same on "ARM".

In fact I'd be really curious how Apple handles this during the Mac transition. I wouldn't be at all surprised if, purely for the sake of compatibility, they implement a strongly ordered x86-style cache hierarchy. Bugs in this world can be extremely difficult to diagnose, and honestly cache coherence transistors aren't that expensive relative to the rest of the system.

[+] BeeOnRope|5 years ago|reply
Reordering rarely or never comes from "cache coherence" on CPUs, but rather core local effects like store buffering, out of order execution, coalescing and out of order commit in the store buffer, etc.
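A store-buffering litmus test makes this concrete. Below is a minimal Rust sketch (not from the article; names are illustrative) using scoped threads, with `SeqCst` chosen so that the reordered outcome is forbidden on every architecture; weaken the orderings to `Relaxed` and the "both loads see 0" result becomes possible on ARM:

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::thread;

// Store-buffering litmus test: each thread stores to one variable and
// then loads the other. With SeqCst on all four accesses, the outcome
// r1 == 0 && r2 == 0 is forbidden on every architecture.
fn run_once() -> (i32, i32) {
    let x = AtomicI32::new(0);
    let y = AtomicI32::new(0);
    thread::scope(|s| {
        let t1 = s.spawn(|| {
            x.store(1, Ordering::SeqCst);
            y.load(Ordering::SeqCst)
        });
        let t2 = s.spawn(|| {
            y.store(1, Ordering::SeqCst);
            x.load(Ordering::SeqCst)
        });
        (t1.join().unwrap(), t2.join().unwrap())
    })
}

fn main() {
    for _ in 0..1000 {
        let (r1, r2) = run_once();
        // At least one load must observe the other thread's store.
        assert!(r1 == 1 || r2 == 1);
    }
}
```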
[+] Retr0spectrum|5 years ago|reply
The cost in terms of transistors may be small, but what about performance? (This is not a rhetorical question, I'm curious to know)
[+] barskern|5 years ago|reply
Isn't the following statement always true, as casting using `as` will silently ~~overflow~~ truncate the `u32` if `usize` is 64-bits?

    assert!((samples as u32) <= u32::MAX);
EDIT: I know it's a contrived example, but I was just curious if my understanding is correct. I also found this page in the nomicon about casting: https://doc.rust-lang.org/nomicon/casts.html

EDIT2: As I thought, casting a `usize` that is 64 bits to a `u32` causes it to be truncated, and hence the assertion is always true. Further, by using a number bigger than a `u32`, this example contains undefined behavior. This is due to the use of `slice::from_raw_parts`, where `self.samples` is left as a `usize` and hence takes a much bigger slice than what was allocated (the leftover of the truncate operation). I made a small playground which demonstrates the segfault: https://play.rust-lang.org/?version=stable&mode=debug&editio.... The assertion should rather be:

    assert!(samples <= u32::MAX as usize);
Don't get me wrong, I think the blogpost is a great explanatory article about memory ordering and the example is rather contrived. I just wanted to reassure myself that my understanding was correct and further perhaps help someone not seeing this issue (as this is a very easy trap to fall into).
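For reference, a small standalone sketch of the truncation trap and the checked alternative (hypothetical function name, assuming a 64-bit target where `usize` is wider than `u32`):

```rust
use std::convert::TryFrom;

// `as` silently truncates; `u32::try_from` surfaces the overflow instead.
fn checked_samples(samples: usize) -> Result<u32, std::num::TryFromIntError> {
    u32::try_from(samples)
}

fn main() {
    let big: usize = (u32::MAX as usize) + 1; // assumes a 64-bit target
    assert_eq!(big as u32, 0);                // `as` wraps silently to 0
    assert!(checked_samples(big).is_err());   // try_from reports the error
    assert_eq!(checked_samples(123).unwrap(), 123);
}
```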
[+] redbluemonkey|5 years ago|reply
You are correct. Thanks for the pickup. Fixed the post and repo.
[+] estebank|5 years ago|reply
To be honest, this check deserves to be in clippy.
[+] RcouF1uZ4gsC|5 years ago|reply
> The x86 processor was able to run the test successfully all 10,000 times, but the ARM processor failed on the 35th attempt.

I think this issue might prove a problem in the long tail of desktop and server software running on ARM.

A lot of desktop and server applications try to take advantage of all the cores. Many times, they are using libraries that were either implemented prior to C and C++ having defined memory models or else without that much care for memory model as long as it ran without issues on the developer computer (x86) and server (x86). Going to ARM is going to expose a lot of these bugs as developers recompile their code for ARM without making sure that their code actually adheres to the C/C++ memory models.

[+] nindalf|5 years ago|reply
There are now two incentives to support ARM better: Apple’s move to ARM on the desktop, and cheaper cloud bills if you’re willing to use ARM instances. Either one alone wouldn’t be enough of an incentive, but together they will cause a shift in the next 3-5 years.

Developers will become more aware of the differences between the architectures, tool chains will accommodate both better, people and software will stop assuming they are running on x86 as default. ARM won’t “win” the desktop or the server market, but it will become a viable alternative, squeezing the profits of companies who depend on x86.

[+] vbezhenar|5 years ago|reply
Smartphones have used ARM for a while, and I think many of the major libraries have been used in one project or another. So I think this problem won't be as severe, because those bugs have hopefully already been fixed.

Also, the Raspberry Pi has been a popular choice for tinkerers for years, which also helps with ARM penetration.

[+] chrismorgan|5 years ago|reply
Yeah, it feels similar to Make concurrency bugs, where the makefile was developed with --jobs=1, and then years down the track someone says “let’s make it faster!” and tries --jobs=8 or similar, but that leads to compilation failures (or worse, successfully compiling the wrong thing), and because the makefile was written years ago, it’s hard to track down exactly where the prerequisites are lacking; whereas if --jobs=8 had been used from the start, it would have been caught early.
[+] floatboth|5 years ago|reply
So far I have not encountered any problems with e.g. desktop Firefox (no surprise, gecko has been running on Android for ages) and various server things (mostly the BEAM.. and even some Haskell, even though before 8.8 GHC did not put proper memory barriers everywhere, and I used 8.6 back when I compiled the app).

Most libraries and applications use stuff like mutexes, btw :) it's not like people who don't care about memory models try to make lockfree things often.

[+] Someone|5 years ago|reply
Old software will have been tested on CPUs with weaker memory models.

Also, multi-threaded programming is hard, so the really long tail is probably already buggy on x86. Surfacing those bugs more often may be a blessing in disguise.

[+] xenadu02|5 years ago|reply
Some of them are latent bugs on x86 too, just rarer due to the stronger guarantees x86 provides.
[+] reitzensteinm|5 years ago|reply
It's possible to provide stronger guarantees than the spec requires.

It wouldn't surprise me if server focused ARM chips ended up providing x86 style ordering to ensure compatibility with ported software.

[+] userbinator|5 years ago|reply
If we initialize the contents of both pointers to 0

Is this a "Rust-ism"? I had a double-take while reading that, because in C that would mean a null pointer, and in the terminology I'm used to, the intent is to set the pointee to 0.

Note that x86 does allow some memory reordering:

https://preshing.com/20120515/memory-reordering-caught-in-th...

(I have experience debugging and fixing an extremely rare bug caused by the above subtle reordering, which occurred approximately once every 3-4 months.)

[+] comex|5 years ago|reply
> Is this a "Rust-ism"?

It doesn’t ring a bell to me as someone who’s spent a lot of time in the Rust community, so I’d say it’s probably just a difference in personal jargon rather than a Rust versus C thing.

(Rust usually uses “reference” instead of “pointer” anyway.)

[+] kibwen|5 years ago|reply
C programmers use "contents" to refer to the address of the pointer? As a non-C programmer, asking what is "contained" in the pointer would certainly refer to the pointee.
[+] O_H_E|5 years ago|reply
> which occurred approximately once every 3-4 months

ouch that hurts. You should be proud of that fix....I guess you kinda are :D

[+] redbluemonkey|5 years ago|reply
I've never thought about that aspect of my writing, but I use the terms "value of the pointer" to mean the address, and "contents of the pointer" to mean what's stored at the address.
[+] zozbot234|5 years ago|reply
Yes, "targets" would be clearer than "contents" here.
[+] zamalek|5 years ago|reply
> in C that would mean a null pointer

It's a Rust-ism. In Rust there is no null.

[+] Diggsey|5 years ago|reply
One thing omitted from this article is that it's not only the physical hardware that may (effectively) re-order operations. The compiler may also perform these re-orderings.

The compiler's re-orderings will always be valid according to the abstract memory model rather than the hardware's, so even on x86 you must use the correct memory orderings, or risk subtle bugs due to compiler optimisations.
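A minimal sketch of the portable pattern this implies (illustrative names, not from the article): the `Release`/`Acquire` pair constrains both the compiler and the CPU, so the same code is correct on x86 and ARM alike.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// Writer publishes DATA, then raises the flag with Release; the reader
// spins on the flag with Acquire. Release forbids the compiler and the
// hardware from sinking the DATA store below the flag store; Acquire
// guarantees the reader sees DATA once it sees the flag.
fn publish_and_read() -> u32 {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        READY.store(true, Ordering::Release);
    });
    let reader = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {}
        DATA.load(Ordering::Relaxed)
    });
    writer.join().unwrap();
    reader.join().unwrap()
}

fn main() {
    assert_eq!(publish_and_read(), 42);
}
```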

[+] comex|5 years ago|reply
In particular, the “multi-threaded using volatile” example is technically incorrect because the compiler is allowed to move non-volatile accesses across volatile ones. [1] To make it correct (but still only on x86), you’d have to use std::sync::atomic::compiler_fence.

[1] https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html
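A rough sketch of that x86-only pattern (illustrative names; this is an assumption about the intent, not code from the article). `compiler_fence` only stops the *compiler* from reordering the `Relaxed` accesses; on x86 the hardware's strong ordering covers the rest, while on ARM the flag would need a real `Release` store:

```rust
use std::sync::atomic::{compiler_fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn publish(value: u32) {
    DATA.store(value, Ordering::Relaxed);
    compiler_fence(Ordering::Release); // DATA store may not sink below here
    READY.store(true, Ordering::Relaxed);
}

fn try_consume() -> Option<u32> {
    if READY.load(Ordering::Relaxed) {
        compiler_fence(Ordering::Acquire); // DATA load may not hoist above here
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}

fn main() {
    assert_eq!(try_consume(), None); // flag not raised yet
    publish(7);
    assert_eq!(try_consume(), Some(7));
}
```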

[+] jhoechtl|5 years ago|reply
Did this page just load for me in the blink of an eye? Way to do it!
[+] ramshanker|5 years ago|reply
And the good news here is that many libraries have accelerated their ARM port / investigations.
[+] truth_seeker|5 years ago|reply
Isn't this something Rust std lib api or perhaps LLVM backend should take care of ?
[+] secondcoming|5 years ago|reply
The problem is that there's likely a fair amount of logically broken third-party code out there, because x86's memory model semantics don't reveal the issues. The compiler can't help here; it'll just do what the developer told it to do.

It's a whole new 'it works on my machine' issue (for some people).

[+] devit|5 years ago|reply
It does as long as the programmer uses it properly (i.e. you don't use atomics weaker than SeqCst or unsafe code with shared + mutable accesses unless you have proven that it's correct).
[+] monocasa|5 years ago|reply
See how the functions are all unsafe? The primitives given by the Rust standard library actually do handle this; the author is simply going off the beaten path to illustrate one of the aspects those primitives need to paper over for you.
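To illustrate the safe path, a minimal sketch (not from the article) where the standard library's `Mutex` supplies all the ordering, so the result is the same on x86 and ARM:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Eight threads increment a shared counter through a Mutex; the lock,
// not the programmer, provides the memory ordering.
fn parallel_count() -> u32 {
    let counter = Arc::new(Mutex::new(0u32));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1000 {
                    *c.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    assert_eq!(parallel_count(), 8000);
}
```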