top | item 11288665

The Strict Aliasing Situation Is Pretty Bad

150 points| pascal_cuoq | 10 years ago |blog.regehr.org | reply

67 comments

order
[+] kbenson|10 years ago|reply
Although I have no evidence that it is being miscompiled, OpenSSL’s AES implementation uses chunking and is undefined.

Oh, that's nice. :/

[+] kazinator|10 years ago|reply
It's nonsense. The function is external, called from a separately compiled file. The pointer comes in as a char *. The code checks its alignment before assuming it can be cast to a block. There is no way in it could be "miscompiled".

The ivec argument could in fact have come from an object that is of type aes_block_t. The only thing which might reveal that it didn't is wrong alignment. In other regards, there is no way to tell.

Lastly, any cross-compilation-unit optimization which could break code of this type is forbidden, because ISO C says that semantic analysis ends in translation phase 7.

I'm looking at C99, not the latest, but I think it's the same.

In translation phase 7 (second last), "The resulting tokens are syntactically and semantically analyzed and translated as a translation unit." Note the "semantically analyzed": semantic analysis is where the compiler tries to break your code due to strict aliasing.

In translation phase 8 "All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment."

No mention of any more semantic analysis! So unless somehow the mere resolution of external symbols can somehow break OpenSSL's AES, I don't see how anything can go wrong.

One thing I woudl do in that code, though is to make sure that it doesn't use the original ivec pointer. In the case where "chunking" goes on, it should just cast it to the block type, and put the result of that cast in a local variable. All the ememcpy's, load/store macros would be gone, and the increments by AES_BLOCK_SIZE would just be + 1.

[+] eternalban|10 years ago|reply
This was an eye opener for me. Bell Labs tech is as usual a mountain of hidden complexity hiding under a "simple" facade.

No more excuses for me. Time to learn Rust.

[+] aidenn0|10 years ago|reply
C is actually quite simple, even the aliasing rules (I think the aliasing rules all fit on about a quarter page). Programming in C though is anything but.

The tension between weakly typed pointers and the desire to generate efficient code is where there is a problem. More or less anywhere you violate the aliasing rules you are doing something that Fortran doesn't allow at all, and by disallowing it semantically the hope was that C programmers could have their cake and eat it too. The reality is that all systems code should probably be compiled with alias analysis disabled.

[+] quotemstr|10 years ago|reply
Ahaha. Rust is language that assumes dynamic memory allocation never fails. The way current compilers treat undefined behavior is terrible, but Rust is worse.
[+] haberman|10 years ago|reply
I thought this article was unnecessarily dire.

One section claims "Physical Subtyping is Broken", where "physical subtyping" is defined as "the struct-based implementation of inheritance in C." I assume this means the typical pattern of:

    typedef struct {
       int base_member_1;
       int base_member_2;
    } Base;

    typedef struct {
       Base base;

       int derived_member 1;
    } Derived;
The article claims physical subtyping is broken because casting between pointer types results in undefined behavior. The article gives this example:

    #include <stdio.h>
     
    typedef struct { int i1; } s1;
    typedef struct { int i2; } s2;
     
    void f(s1 *s1p, s2 *s2p) {
      s1p->i1 = 2;
      s2p->i2 = 3;
      printf("%i\n", s1p->i1);
    }
     
    int main() {
      s1 s = {.i1 = 1};
      f(&s, (s2 *)&s);
    }
I agree this example is broken, but casting between pointer types in this way is totally unnecessary for C-based inheritance. You can do upcasts and downcasts that are totally legal:

    Derived d;

    // Legal upcast:
    Base* base = &d->base;

    // Legal downcast:
    Derived* derived = (Derived*)base;
So I don't think the article has proved that "Physical Subtyping is Broken."

The next section says that "Chunking Optimizations Are Broken," because code like this is illegal:

    void copy_8_bytes(char *dst, const char *src) {
      *(uint64_t*)dst = *(uint64_t*)src;
    }
While this is true, such optimizations are generally unnecessary. For example, write this instead as:

    void copy_8_bytes(char *dst, const char *src) {
      memcpy(dst, src, 8);
    }
If you compile this on an architecture like x86 that truly allows unaligned reads, you'll see that modern compilers do the "chunking optimization" for you:

    0000000000000000 <copy_8_bytes>:
       0:   48 8b 06                mov    rax,QWORD PTR [rsi]
       3:   48 89 07                mov    QWORD PTR [rdi],rax
       6:   c3                      ret
It says next that "int8_t and uint8_t Are Not Necessarily Character Types." That is indeed a good point and probably not well-known. So I agree this is something people should keep in mind. But most of this article is warning against practices that are generally unnecessary and known to be bad C in 2016.

It's true that a lot of legacy code-bases still break these rules. But many are cleaning up their act, fixing practices that were never correct but used to work. For example, here is an example of Python fixing its API to comply with strict aliasing, and this is from almost 10 years ago: https://www.python.org/dev/peps/pep-3123/

[+] DannyBee|10 years ago|reply
I'm with you. I wrote the code that breaks this stuff in gcc (and the implementation of struct-sensitive pointer analysis).

It explicitly and deliberately follows the first member rule, as it should :)

In C++, this is covered by 6.5/7, and allowed because it's a type compatible with the effective type of the object (in a standard layout class, a pointer to the a structure object points to the initial member)

[+] _yosefk|10 years ago|reply
I understand the upcast (which is certainly legal but it forces the casting code to know the depth of the inheritance hierarchy - as in &derived->base1.base2), but what's the argument making the downcast back to Derived legal C? (I honestly wonder; personally I either compile with -fno-strict-aliasing or trust my tests to validate the build...)
[+] xorblurb|10 years ago|reply
About the memcpy stuff, when you consider the whole picture, this is ridiculous though. There should be no reason for any sane implementation to ever do nasty stuff about

    void copy_8_bytes(char *dst, const char *src) {
      *(uint64_t*)dst = *(uint64_t*)src;
    }
Only maybe excuses. And poor ones.

Now I don't remember the article where I saw that, but technically given the current orientation of compiler writers there are some even more ridiculous situations. Like (a<<n) | (a>>(32-n)) having the obviously desired effect on all current architectures when you look at what would be an obvious direct translation (and quite efficient one already), and yet given the current orientation of compiler writers I would not like to see that code AT ALL unless it is proved that n is always strictly between 1 and 31. And now if they want to restore any kind of efficiency after all that madness, they would have to implement yet another case of convoluted peephole optim. Stupid. Give me my original intent of the langage back, because virtually everybody is using it like that consciously or not, and that will just not change.

[+] TorKlingberg|10 years ago|reply
If you do this

    Derived* derived = (Derived*)base;
and then use both base and derived, is that not violating aliasing rules?
[+] colanderman|10 years ago|reply
Actually, casts to/from char * are always defined in C (chars are always assumed to alias). The author was talking about "chunking" non-char units.
[+] Pxtl|10 years ago|reply
... I am so happy I don't code in C right now. That's icky.
[+] DannyBee|10 years ago|reply
C++ is just as bad :)