ISO C became unusable for operating systems development

[+] WalterBright|4 years ago|reply

I had an online discussion some years back where I suggested that C nail the size of char to 8 bits. He responded that there was a CPU that had chars be 32 bits, and wasn't that great that a C compiler for it would be Standard compliant?

I replied by pointing out that nearly every non-trivial C program would have to be recoded to work on that architecture. So what purpose did the Standard allowing that actually achieve?

I also see no problem for a vendor of a C compiler for that architecture making a reasonable dialect of C for it. After all, to accommodate the memory architecture of the x86, nearly all C compilers in the 80's adopted near/far pointers, and while not Standard compliant, it really didn't matter, and was tremendously successful.

D made some decisions early on that worked out very well:

1. 2's complement wraparound arithmetic

2. sizes of basic integer types are fixed at 1 byte for chars, 2 for shorts, 4 for integers, 8 for longs. This worked out very well

3. floating point is IEEE

4. char's are UTF-8 code units (*)

5. chars are unsigned

These 5 points make for tremendous simplicity gains for D programmers, and ironically increase portability of D code.

After reading the paper, I'm inclined to change the definition of UB in D to not mean it can be assumed to not happen and not be unintended.

(*) thanks for the correction

[+] layer8|4 years ago|reply

> So what purpose did the Standard allowing that actually achieve?

I believe the situation was that there were C implementations for DSPs (32-bit-addressable only) and IBM mainframes (36-bit addressable only), and when ANSI/ISO C was established, they naturally wanted their implementations to be able to conform to that new standard. So the standard was made flexible enough to accommodate such implementations.

Similarly why signed overflow is undefined behavior. There were existing implementations that trapped (CPU interrupt) on signed overflow.

I might have gotten the details wrong, but that's what I remember from reading comp.std.c (Usenet) in the 90s.

[+] WalterBright|4 years ago|reply

I had another such discussion where I suggested that C abandon support for EBCDIC. I was told it was great that C supported any character set! I said C certainly does not, and gave RADIX50 as an example.

How many C programs today would work with EBCDIC? Zero? There's no modern point in C not requiring ASCII, at a minimum.

[+] tialaramex|4 years ago|reply

> char's are UTF-8 code points

I'm guessing you mean that char is a UTF-8 code unit as you keep saying they're only one byte and a code point is far too large to fit in a byte / octet.

But that still seems very weird because a UTF-8 code unit is almost but not quite the same as a byte, so that users might be astounded when they can't put a byte into a char in this scheme (because it isn't a valid UTF-8 code unit) and yet almost no useful value is accrued by such a rule.

[+] inkyoto|4 years ago|reply

> I had an online discussion some years back where I suggested that C nail the size of char to 8 bits. He responded that there was a CPU that had chars be 32 bits, and wasn't that great that a C compiler for it would be Standard compliant?

Back in C infancy days, there had existed architectures where a byte could hold 9 bits that C compilers had to be written for. The 36-bit PDP-10 architecture springs to mind, and some Burroughs or Honeywell mainframes had those – I remember reading a little C reference book authored by Kernighan, Ritchie and somebody else explicitely calling out the fact that a C implementation could not rely on the fact of the byte always being 8 bits long and also stressing that the «sizeof» operator was reporting the number of bytes in a type irrespective of the bit width of the byte.

9 bit byte architectures have all but perished, however, C has carried the legacy of creative days of the computer architecture design along.

[+] arunc|4 years ago|reply

> After reading the paper, I'm inclined to change the definition of UB in D to not mean it can be assumed to not happen and not be unintended.

What's the current definition of UB in D?

[+] jeffbee|4 years ago|reply

I don't think it would be a very good outcome if people forked C such that everyone working on DSP platforms and new platforms that you just haven't heard of had to use a fork with flexible CHAR_BIT while the standard defined it to be 8. Who is served by this forking? Plenty of software works fine with different CHAR_BIT values, although some poorly-written programs do need to be fixed.

[+] aw1621107|4 years ago|reply

Ralf Jung has a blog post looking at some of the claims in this paper [0]. Some hopefully representative quotes:

> The paper makes many good points, but I think the author is throwing out the baby with the bathwater by concluding that we should entirely get rid of this kind of Undefined Behavior. The point of this blog post is to argue that we do need UB by showing that even some of the most basic optimizations that all compilers perform require this far-reaching notion of Undefined Behavior.

<snip>

> I honestly think trying to write a highly optimizing compiler based on a different interpretation of UB would be a worthwhile experiment. We sorely lack data on how big the performance gain of exploiting UB actually is. However, I strongly doubt that the result would even come close to the most widely used compilers today—and programmers that can accept such a big performance hit would probably not use C to begin with. Certainly, any proposal for requiring compilers to curtail their exploitation of UB must come with evidence that this would even be possible while keeping C a viable language for performance-sensitive code.

> To conclude, I fully agree with Yodaiken that C has a problem, and that reliably writing C has become incredibly hard since undefined behavior is so difficult to avoid. It is certainly worth reducing the amount of things that can cause UB in C, and developing practical tools to detect more advanced kinds of UB such as strict aliasing violations.

<snip>

> However, I do not think this problem can be solved with a platform-specific interpretation of UB. That would declare all but the most basic C compilers as non-compliant. We need to find some middle ground that actually permits compilers to meaningfully optimize the code, while also enabling programmers to actually write standards-compliant programs.

[0]: https://www.ralfj.de/blog/2021/11/24/ub-necessary.html

[+] h2odragon|4 years ago|reply

Torvalds was a strong advocate of GCC 2.95 (iirc), early on in Linux history, because he knew the kind of code it would emit and didn't trust the newer compilers to produce code that was correct in those circumstances.

The workarounds and effort required to tell a compiler today that no, you really did want to do the thing you said might well be insupportable. I figure they started going astray about the time self modifying code became frowned upon.

[+] mananaysiempre|4 years ago|reply

To be fair, the backend in the early GCC 3.x series was just kind of stupid sometimes. Even now I find strange if cheap and harmless heisengremlins in GCC-produced x86 code (like MOV R, S; MOV S, R; MOV R, S) from time to time, while the Clang output, even if not always good, is at least reasonable. This is not to diss the GCC team—the effort required to port a whole compiler to a completely new IR with a completely different organizing principle while keeping it working most of that time boggles the mind, frankly. But the result does occasionally behave in weird ways.

[+] phicoh|4 years ago|reply

One thing that is not mentioned in the article, is that next to undefined behavior, there is also implementation defined behavior.

For example, if signed integer overflow would be implementation defined behavior, then any weirdness would be limited to just the integer operation that overflows.

Lots of other stuff can be expressed as implementation defined behavior. That would probably kill some optimizations.

So the question is more, do we want a portable assembler? In that case as many C constructs as possible need have defined behavior. Either defined by the standard or as part of the compiler documentation.

Another possibily is to have standards for C on x86, amd64, arm, etc. Then we can strictly define signed integer overflow, etc. And say that on x86, pointers don't have alignment, so a pointer that points to storage of suitable size can be used to stored an object of different type, etc.

If the goal is to run SPEC as fast as possible, then making sure every program trigger undefined behavior is the way to go.

[+] wruza|4 years ago|reply

I have a dumb question. Why can “we” write pretty good apps in languages other than C, but can’t write operating systems? Is talking to hardware so much different than talking to APIs?

Another point of view on the same question: Looking at software and hardware, the latter evolved insanely, but the former didn’t get seemingly faster, at least in userlands. Why bother with UB-related optimizations at all for a wide spectrum of software? Is there even software which benefits from -O3 and doesn’t use vectorization intrinsics? Why can’t “we” just hardcode jpeg, etc for few platforms? Is that really easier to maintain opposed to maintaining never ending sources of UB?

Iow, why e.g. my serial port or ata or network driver has to be implemented in C, if data mostly ends up in stream.on(‘data’, callback) anyway?

[+] ErikCorry|4 years ago|reply

In theory the difference between undefined behaviour and implementation defined behaviour is that ID behaviour must be documented. In practice good luck finding that documentation for each CPU and compiler combination. In fact good luck just finding it for LLVM and x64.

[+] bluecalm|4 years ago|reply

I don't think making it defined would help much. Overflowing a signed integer is a bug in logic. It would be ideal to have a crash on that. Continuing is going to be bad one way or another unless you luck out with your buggy code so the way the implementation works saves you. It can't be relied upon in general case though.

Imo the way is to develop more tools that detect (either by analysis or at runtime) those bugs and run the code with those attached as often as you can afford it (to take the performance penalty).

[+] bigcheesegs|4 years ago|reply

You don't actually want implementation defined behavior. There is no restriction on implementation defined behavior, it just needs to be documented. Suitable documentation includes "the optimizer assumes this never happens and optimizes accordingly.", or "Look at the source code."

[+] foxfluff|4 years ago|reply

I haven't got the time to read the paper yet but I believe I'd emerge with the more or less the same opinion that I've had before: nobody's forcing you to pass -O2 or -O3. It's stupid to ask the compiler to optimize and then whine that it optimizes. I usually am OK with the compiler optimizing, hence I ask it to do so. I'm glad that others who disagree can selectively enable or disable only the optimizations they're concerned about. Most of the optimizations that people whine about seem quite sane to me. Of course, sometimes you find real bugs in the optimizer (yesterday someone on libera/#c posted a snippet where gcc with -O2 (-fdelete-null-pointer-checks) removes completely legit checks)

[+] bcrl|4 years ago|reply

At least in compilers like gcc, optimization needs to be enabled to get sane warnings emitted by the compiler, so some people are indeed being forced to pass in -O2 to get sane build warnings for projects.

I would really like it for the C standard to clean up Undefined Behaviour. Back in the 1980s when ANSI C was first specified, a lot of the optimizations that modern compiler writers try to justify via Undefined Behaviour simply weren't part of many compiler's repertoires, so most systems developers didn't need to worry about UD and there was no push for the standard to do so as a result.

If people really want the optimizations afforded by things like assuming an int can't overflow to a negative number in a for loop, my personal position is that the code should be annotated such that the optimization is enabled. At the very least, the compiler should warn that it is making assumptions about potentially UB when applying such optimizations.

There is this false belief that all legacy code should be able to compiled with a new compiler with no changes and expect improved performance. Anyone who works on real world large systems knows that you can't migrate to newer compilers or updated OSes with zero effort (especially if there's any C++ involved). I understand that compiler writers want to improve their performance on SPEC, but the real world suffers from the distortions caused by viewing optimizations through the narrow scope of benchmarks like SPEC.

[+] maxlybbert|4 years ago|reply

I’m amused by the people who ask for optimization and then complain about it. Or the “it did what I said, not what I meant” crowd.

But, officially, undefined behavior is always undefined, not just at higher optimization levels.

[+] phicoh|4 years ago|reply

It seems to me a flaw in a language and in a compiler if the average programmer has to avoid higher levels of optimizating because it cannot be predicted what they do.

[+] marcosdumay|4 years ago|reply

Optimizations aren't supposed to change the meaning of your code. And specifically for C, unsafe optimizations are supposed to only apply with -O3. Level 2 is supposed to be safe.

[+] kortex|4 years ago|reply

Is it even possible to have zero undefined behavior in languages that allow user-defined pointers? It seems like allowing even just one degree of memory indirection creates a singularity beyond which any kind of formal guarantees become impossible. Seems like you'd have to allow only structures which hide memory implementation details if you truly want to avoid all UB. Same goes for any arithmetic which could overflow.

That would require kernel devs to radically rethink how they interact with I/O, which would probably require specific architectures.

In other words, writing a kernel portable on any of the existing ISAs that is also performant is basically impossible, barring some humongous breakthrough in compiler technology.

Seems to me that when it comes to brass tacks, UB is kind of the "we are all adults here" engineering tradeoff that enables shipping fast and useful software, but is technically not strictly defined and thus usually does what you want, but can result in bugs.

[+] kps|4 years ago|reply

The original C committee wrote, as part of its guiding principle to “Keep the spirit of C”, that “To help ensure that no code explosion occurs for what appears to be a very simple operation, many operations are defined to be how the target machine’s hardware does it rather than by a general abstract rule.”¹ That is, if you write `a = b + c` you expect the compiler to generate an `add a, b, c` instruction, and if that happens to trap and burn your house down, well, that's not C's problem.

I'm convinced that the original UB rule was intended to capture this, and the wording was an error later seized by compiler developers. As evidence, consider Dennis Ritchie's rejection of `noalias` as “a license for the compiler to undertake aggressive optimizations that are completely legal by the committee's rules, but make hash of apparently safe programs”². If anyone at the time had realized that this is what the definition of UB implied, it would have been called out and rejected as well.

¹ https://www.lysator.liu.se/c/rat/a.html#1-1

² https://www.lysator.liu.se/c/dmr-on-noalias.html

[+] foxfluff|4 years ago|reply

> Is it even possible to have zero undefined behavior in languages that allow user-defined pointers?

It kinda boils down to what exactly you mean by defined behavior. A C programmer's take might be that you can run a conforming program in an emulated abstract machine and get defined results out of it. And then you can run the same thing on real hardware and expect to get the same result (modulo implementation defined behavior). This definition leaves some things out (e.g. performance, observable effects in the "real world") but it captures the computational semantics.

Another programmer's take might be more akin to a portable assembler. In that case, you certainly could define reads and writes for arbitrary pointers, in the sense that they must cause corresponding (attempted) loads and stores at the machine level. However, the definition wouldn't be complete since it inevitably leaves much to the underlying implementation. Thus you could have "defined" C programs that show completely different behaviors depending on which implementation and hardware you used. It would be impossible to say what the program's output must be "in the abstract." For someone who just wants to output assembly, maybe that's fine. I'm not sure other people would be too satisfied with it. An out of bounds write could still blow up your program and be remotely exploitable; practically the same thing as undefined behavior, except that now your compiler is also barred from optimizing.

There's quite a bit of tension between these two camps.

Alternatively, you could fully define it at a great runtime cost and potential exclusion of real hardware implementations.

[+] remram|4 years ago|reply

Undefined behavior means something specific in the standard, it's not just an operation that might do different things on different compilers and machines. It means that if it happens, the program is allowed to do anything and everything, even before the UB is reached. Undefined behavior is breaking an assumption that the compiler is allowed to make.

It is probably impossible to make a low-level language with no implementation-defined behavior, but it is certainly possible to make one with no undefined behavior. For example, you can put in your spec that overflowing an unsigned integer can give any value; that is different that putting in your spec that it doesn't happen and if you write it the variable might have no value, multiple values, or burn your socks off.

https://en.wikipedia.org/wiki/Undefined_behavior

[+] pornel|4 years ago|reply

If by user-defined pointers you mean arbitrary integer-to-pointer casts, then this is a kryptonite for static analysis, and I don't think you can have a language that is both fast and fully predictable (UB-free) in their presence. It breaks pointer provenance and aliasing analysis, and existing compilers already struggle with such casts in C.

But apart from that, you can have pointers, with many levels of indirection, as long as there are rules that prevent use-after-free, unsynchronized concurrent access, and other UB-worthy problems. Rust's borrow checker with rules for no mutable aliasing and Send/Sync markers for concurrent access comes close, but it has to give up on generality for safety (e.g. it can't reason about circular data structures).

[+] CRConrad|4 years ago|reply

> Is it even possible to have zero undefined behavior in languages that allow user-defined pointers? It seems like allowing even just one degree of memory indirection creates a singularity beyond which any kind of formal guarantees become impossible.

With untyped pointers, yes. But it seems to me that if you have strong typing for function pointers you could mostly avoid that.

> UB is kind of the "we are all adults here" engineering tradeoff that enables shipping fast and useful software, but is technically not strictly defined

Well, no, of course it isn't -- the clue is probably in the first half of the name, "Undefined Behaviour"... ;-)

[+] immibis|4 years ago|reply

Even kernels only interact with memory in reasonably predictable ways. I think they could all be hidden behind such abstractions, BUT it will make the language a lot more complex.

[+] ErikCorry|4 years ago|reply

Without concurrency you don't have to have UB to have user-defined pointers. x86 assembly language has no UB and has user-defined pointers.

[+] foxfluff|4 years ago|reply

This was discussed somewhat recently: https://news.ycombinator.com/item?id=28779036

[+] flykespice|4 years ago|reply

Wow, this was an eye-opening read on my trust with ISO C.

It makes much more understandable why linux codebase is riddled with compiler extensions, ISO C is simply not reliable anymore.

The issue is bigger than what trembles on the surface, just like Dennis Ritchie said, it is a timebomb, soon enough these nuances will burst into a big issue in linux kernel, or worse yet, some essential system like avionics.

[+] jcranmer|4 years ago|reply

I think undefined behavior (as a general concept) gets an unfair share of the blame here. It's notable that almost all criticism of undefined behavior in C tends to focus on just two sources of UB: signed integer overflow and strict aliasing; other sources of UB just don't generate anywhere near the same vitriol [1]. Furthermore, it's notable that people don't complain about UB in Rust... which arguably has a worse issue with UB in that a) there's not even a proper documentation of what is UB in Rust, and b) the requirement that &mut x be the sole reference to x (and therefore is UB if it is not) is far more stringent than anything in C (bar maybe restrict), and I'm sure that most Rust programmers, especially newbies starting out with unsafe, don't realize that that's actually a requirement.

There is a necessity for some form of UB in a C-like language, and that has to deal with pointer provenance. You see, in C, everything lives in memory, but on real hardware, you want as much to live in a register as possible. So a compiler needs to be able to have reasonable guarantees that, say, any value whose address is never taken can never be accessed with a pointer, and so can be promoted to a register. As a corollary, this requires that things like out-of-bound memory accesses, or worse, converting integers to pointers (implicating pointer provenance here) need to have UB in at least some cases, since these could in principle "accidentally" compute an address which is the same as a memory location whose address was never taken.

That suggests that the problem isn't UB per se. If we look at the two canonical examples, arithmetic overflow and strict aliasing, we can see that one of the features of these things is that they have a pretty obvious well-defined semantics [2] that can be given for them, and furthermore, there's no way to access these well-defined semantics even avoiding this feature altogether. And I think it's the lack of this ability to work around UB that is the real issue with C, not UB itself.

[1] For example, it is UB to pass in rand as the comparison function to qsort. I'm sure many people will not realize that before I wrote this, and even parsing the C specification to find out that this is UB is not trivial. For an interesting challenge, try giving a definition of what the behavior should be were it not UB--and no, you can't just say it's impl-defined, since that still requires you to document what the behavior is.

[2] I will point out that, for arithmetic overflow, this semantics is usually wrong. There are very few times where you want <large positive number> + <large positive number> = <negative number>, and so you're mostly just swapping out an unpredictably wrong program for a predictably wrong program, which isn't really any better. However, the most common time you do want the wrapping semantics is when you want to check if the overflow happened, and this is where C's lack of any overflow-checked arithmetic option is really, really painful.

[+] RcouF1uZ4gsC|4 years ago|reply

I think the undefined behavior in C and C++ are even less defensible today than when they started because of the convergence of architectures.

Pretty much every non-legacy architecture does IEEE floating point. Pretty much all of them do a flat address space. The word size is some power of 2(32 bit, 64 bit, maybe 128 bit in the future). They are almost always little endian. The memory models are converging towards the C++ memory model.

Given that, I think simplifying the language and getting rid of foot guns could be done without losing any significant performance or actual flexibility/portability.

[+] jandrese|4 years ago|reply

You could also add that they all do 2s complement math.

This is what the OpenBSD team did to OpenSSL. If the code has some complexity that is only necessary because it might have been run on a VAX or AIX or early Cray architecture then it is time to excise that complexity. They deleted thousands and thousands of lines of support for architectures that are only seen in museums and landfills today.

[+] pjmorris|4 years ago|reply

It could be called 'C--' unless that's been taken.

[+] zokier|4 years ago|reply

ISO C is mostly concerned about making sure that stuff is portable; operating systems on the other hand are intrinsically platform-specific to a degree. So it is not really surprising that pure ISO C is not enough for OS development

[+] WalterBright|4 years ago|reply

D got around the overflow-is-UB problem by declaring that 2s-complement arithmetic will be used with wraparound semantics.

Is there any reason for modern C to still support anything else?

[+] gpderetta|4 years ago|reply

Guaranteed 2 complement arithmetic for signed integers was recently added to C++. Wrap around is still UB though because apparently caused regressions in some significant code bases.

Guaranteed order of evaluation of arguments almost made it into the standard, but because of regressions, we didn't quite get the full benefits; for example: i = i++ + 2;

is now fully defined, while this:

   f(++i, ++i);

is no longer UB, but implementation defined.

Ideally for every UB taken away we would get one or more pragmas to get the optimization back like ivdep. The issue is that that doesn't help old code bases.

[+] zajio1am|4 years ago|reply

Integer overflow is usually logic error, so a reasonable default behavior would be a trap instead of silent overflow (regardless of how signed integrers are stored in memory). Some architectures support that (e.g. MIPS).

[+] uecker|4 years ago|reply

(first, I am not a D user but I really like what I have seen. I wish WG14 would take more inspiration from D than from C++)

C23 will require 2s-complement.

Signed overflow is still UB in C. I think this is a better choice than wrap around, because overflow is often a bug. With UB, you can use static analysis (to some degree) or run-time traps (if the compiler supports this) and then fix those bugs. If it were defined to wrap around, those bugs are much harder to find.

[+] mhh__|4 years ago|reply

There are still processors that don't use two's complement, although I'm not sure that should really stop them if they wanted to declare all C implementations must be t-c.

[+] unknown|4 years ago|reply

[deleted]

[+] maxlybbert|4 years ago|reply

It may be unusable for “modern” operating systems (you could write an early UNIX clone with just ISO C because early UNIX didn’t support much).

But that’s basically always been the case. I doubt you could stay within the first ISO C standard and write a modern operating system.

[+] speedcoder|4 years ago|reply

Can you still compile gcc code with -O0 (gcc option to turn optimization off) to get completely defined behavior? When doing so does it actually still turn off all optimizations? Also does -Os (optimization for size) still produce defined behavior?

[+] foxfluff|4 years ago|reply

> Can you still compile gcc code with -O0 (gcc option to turn optimization off) to get completely defined behavior?

No. The standard specifies what's undefined, optimization levels don't change it (though there are compiler flags such as -fwrapv which make undefined things defined).

However, turning off optimizations will make behavior easier to predict.

[+] RustyRussell|4 years ago|reply

I still retch when told memcpy and memset cannot take NULL and 0 length.

Try asserting that they're not NULL in glibc and try to boot your machine! Oops... bad compiler people, bad!

[+] ErikCorry|4 years ago|reply

Similarly, standard C++ is not usable to implement virtual machines like Hotspot, and yet all virtual machines are implemented using C++ compilers.

[+] abfan1127|4 years ago|reply

what's the history of having undefined behavior? why would a language designer not specify?

[+] throwawayvibes|4 years ago|reply

[deleted]

[+] qualudeheart|4 years ago|reply

So we can switch to rust now?

[+] yjftsjthsd-h|4 years ago|reply

Sure; we look forward to your patches (keep in mind that they must preserve compatibility and portability to all currently supported architectures).

475 comments