Conflating pointers with arrays: C's biggest mistake? (2009)

[+] xroche|7 years ago|reply

This is IMHO by far NOT C' biggest mistake. Not even close. A typical compiler will even warn you when you do something stupid with arrays in function definitions (-Wsizeof-array-argument is the default nowadays).

On the other hand, UBE (undefined, or unspecified behavior) are probably the nastiest stuff that can bite you in C.

I have been programming in C for a very, very long time, and I am still getting hit by UBE time to time, because, eh, you tend to forget "this case".

Last time, it took me a while to realize the bug in the following code snippet from a colleague (not the actual code, but the idea is there):

struct ip_weight { in_addr_t ip; uint64_t weight; };

const struct ip_weight ipw1 = {0x7F000001, 1}; const struct ip_weight ipw2 = {0x7F000001, 1};

const uint32_t hash1 = hash_function(&ipw1, sizeof(ipw1)); const uint32_t hash2 = hash_function(&ipw2, sizeof(ipw2));

The bug: hash1 and hash2 are not the same. For those who are fluent in C UBE, this is obvious, and you'll probably smile. But even for veterans, you tend to miss that after a long day of work.

This, my friends, is part of the real mistakes in C: leaving too many UBE. The result is coding in a minefield.

[You probably found the bug, right ? If not: the obvious issue is that 'struct ip_weight' needs padding for the second field. And while all omitted fields are by the standard initialized to 0 when you declare a structure on the stack, padding value is undefined; and gcc typically leave padding with stack dirty content.]

[+] jcelerier|7 years ago|reply

> [You probably found the bug, right ? If not: the obvious issue is that 'struct ip_weight' needs padding for the second field.

No, the bug is thinking that hashing random bytes in your memory is correct. Why wouldn't you make a correct hash function for your struct ?!

[+] WalterBright|7 years ago|reply

By 'mistake' I am considering the context of the times in which C was developed. Most of the UBE in C is there because of (1) the cost of mitigating it and (2) specifying it would impede portability.

Buffer overflows are UBE, too. But the way I proposed the fix is a pretty much cost-free solution, and it's optional.

Redefining C so that struct padding is always 0'd is an expensive solution, and rarely needed.

[+] blub|7 years ago|reply

Interestingly, the idea of hashing over the bytes of a data structure would be possible in C++, but it is non-idiomatic and instead one would use hash functions for each type and in the case of a struct the individual members would be accessed to build the hash value.

The idea of using bytes is error-prone, but now that you mentioned it, pretty typical of the C mindset.

Of course C++ also has some of these cultural biases. I think they're an important reason why unsafe code continues to be written.

[+] beefhash|7 years ago|reply

Bonus mess: If hash_function operates on bytes, but the input is cast to uint8_t* instead of (unsigned) char*, this is also a violation of aliasing rules and the compiler can technically just do whatever.

[+] wruza|7 years ago|reply

This wouldn’t be obvious if met in the wild, but once you pointed that bug exists, it is clear. The problem here is that we trust written code (its author) and rarely go into deep analysis at each line.

But like you said, you have to be aware of this effect at least, and it is not possible to eliminate that by simply clearing all variables to zero. See,

  p = (struct ip_weight *)some_used_mem;
  p->ip = ip;
  p->weight = weight;

At which point should imaginary p->garbage be set to zero? At cast? But that may again do unexpected thing, since casts usually do not modify data. The entire struct abstraction seems leaky as hell, but that’s the price of not dealing with asm directly.

These examples show that not C itself is hard, but low-level semantics are. You have to somehow deal with struct {x, y} and the fact that y has to be aligned at the same time. And have different platforms in mind. Maybe it is platforms that should be fixed? Maybe, but these are hardware with other issues that may be even harder to get right.

I think C is okay, really (apart from compilers that push ub to the limit). Type systems in successors try to hide its rough edges, but in the end of the day you end up with semantics of the compiler (c++, rust), that a regular guy has to understand anyway; it’s trading one complexity for another. C++ folks often seen treating it as magic, simply not doing what they’re not sure about. Good part is some languages force you to write correct code, no matter how much knowledge you have. But NewC could e.g. instead force one to create that ‘auto garbage’ to make it clear (why not, safety measures are inconvenient irl too).

I have no strong conclusion, but at least let’s think of all non-cs people who make their 64KB arduinos drive around and blink leds.

[+] sehugg|7 years ago|reply

Clang has -Wpadded to catch these kinds of bugs, FWIW. (and at least on my system it inits locals' padding with zero)

[+] hzhou321|7 years ago|reply

While other comments suggest the solution is to implement the hash function based on field values, it throws away the simple, efficient, and general implementations of original memory based hash function. But if we understand the true source of the problem, isn't the obvious solution is to redefine the structure into two 64-bit fields or add in explicit padding bytes so one can explicitly zero them when necessary?

The reason for undefined behaviors is to avoid over engineering. In a capable engineer's eye, it is beautiful.

[+] bluetomcat|7 years ago|reply

C requires you to understand its conceptual execution and memory layout model in order to write safe code. That is, how the call stack works, the different types of storage, that each type has alignment requirements, and I'm not even mentioning threading issues.

No amount of syntax sugar on top will prevent you from writing unsafe code, unless that basic model is thrown away.

[+] pornel|7 years ago|reply

Rust solved this particular class of problems nicely with `#[derive(Hash)]` without having to define what the padding bytes are.

[+] jibal|7 years ago|reply

I think Walter gave a very good argument for why it's the biggest mistake, and no warning will help with the consequences that he pointed out.

That C has UBE is not a mistake, it's fundamental to the language design, which allows for unrestricted access to the bare metal. If you want a different sort of language, use Java.

[+] kahlonel|7 years ago|reply

I don't think there is any UB case that applies to the given snippet, or am I missing something? Treating unpacked structs as byte buffers is asking for trouble.

[+] bluecalm|7 years ago|reply

To run into UB here you need to read struct's bytes as something else and then do calculations on them. UB or not it is asking for trouble. Why not just read field of the struct and use them to computer a hash?

[+] jasonkostempski|7 years ago|reply

I don't know much about C. Are there code analysis tools that would have caught the issue?

[+] senatorobama|7 years ago|reply

Why is struct ip_weight padded.. is it to make it word aligned on 64-bit platforms?

[+] drfuchs|7 years ago|reply

The proposal here is way too vague. And if you flesh it out, things start to fall apart: If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"? If not, how do you manage to get string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)? On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?

Similar questions apply to general arrays, as well. Also: Am I able to take the address of an element of an array? Will that be a fat pointer too? How about a pointer to a sequence of elements? Can I do arithmetic on these pointers? If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array? How would you write Quicksort? Heapsort? And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?

In short, the reason that this sort of thing hasn't appeared in C is not because nobody has thought about it, nor because the C folks are too hide-bound to accept a good idea, but rather because it's not clear that there's a real, workable, limited, concise, solution that doesn't warp the language far off into Java/C#-land. It would be great if there were, but this isn't it.

[+] WalterBright|7 years ago|reply

I happened to know the idea does work, and has been working in D for 18 years now.

> If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"?

No. You'll still have the same issues with how memory is allocated and resized. But, once the memory is allocated, you have a safe and reliable way to access the memory without buffer overflows.

> If not, how do you manage to get string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)?

As I mentioned, it does not address allocating memory. However, it does offer one performance advantage in not having to call strlen to determine the size of the data.

> On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?

In D, we call those slices. They look like this:

    T[] array = ...
    T[] slice = array[lower .. upper];

The compiler can insert checks that the slice[] lies within the bounds of array[].

> Am I able to take the address of an element of an array?

Yes: `T* p = &array[3];`

> Will that be a fat pointer too?

No, it'll be regular pointer. To get a fat pointer, i.e. a slice:

    slice = array[lower .. upper];

> How about a pointer to a sequence of elements?

Not sure what you mean. You can get a pointer or a slice of a dynamic array.

> Can I do arithmetic on these pointers?

Yes, via the slice method outlined above.

> If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array?

No, just the slice.

> How would you write Quicksort? Heapsort?

Show me your pointer version and I'll show you an array version.

> And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?

The arena will likely be an array, right? Then return slices of it.

[+] alphaglosined|7 years ago|reply

This article was not created based upon theory.

It was created based upon real world experience having designed and implemented it in D. Where all of your concerns have not been discussed in the years following that article (it was already in the language for about 8 years at that point aka the start and has been solidly proven to work in the exact same context as it would have done in C).

In D at least, you can grab the pointer by a simple .ptr and for length .length. To get a specific element, it is as you would expect &f[i] all nice and straight forward. But what if you want to create an array from malloc? In D that is easy, just slice it! malloc(len)[0 .. len]. And free is just as you would expect from above, free(array.ptr);

[+] GlitchMr|7 years ago|reply

> The proposal here is way too vague. And if you flesh it out, things start to fall apart

No, it's not, those ideas were implemented in practice in D and Rust, and there are no real issues with those. This feature could be easily implemented in C, there are no dependencies on features that C doesn't have.

> If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long

No need to store the capacity. This is a slice, not a buffer. Go conflates those two for user's convenience, but this is not necessary, and in fact is waste of RAM - not an issue for Go, but it is an issue for C. For instance, `&str` in Rust is a pair of pointer to a string and its length and it works really well.

> If not, how do you manage to get string variable on the stack if its length might change? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)?

Use your own slice buffer abstraction for that purpose. It can be implemented as a struct storing a slice and its capacity. Pass a pointer to slice buffer abstraction, if you want a function to be able to add elements to it. This is also how it works in Go, for that matter.

Slices don't define concatenation. This is C, not a high level programming language.

> Am I able to take the address of an element of an array?

Yes. `&a[3]`. It's still an array, it just knows its size.

> Will that be a fat pointer too?

No.

> How about a pointer to a sequence of elements?

Probably you could add some sort of a range access syntax. Say, something like `&a[1:3]`.

> Can I do arithmetic on these pointers?

I don't know whether pointer arithmetic should be allowed or not, but even if it shouldn't be, there is nothing to stop you from doing `&a[4]` as a replacement for `a + 4`.

> How would you write Quicksort? Heapsort?

The same way you would with a regular array. Think of it as a struct storing an array pointer and its length. If you prefer to working with pair of start/end pointers instead of pair of start and array size, then note that `end - start` is array length, so getting an end pointer is trivial.

[+] Too|7 years ago|reply

Can't see how this would affect string concatenation negatively compared to plain char? The problem with java string is that it's pre-allocated to the exact length, normally char-strings are also that. They don't magically make things faster just because they are missing a length-field, quite the contrary actually because now you need to iterate it twice to concatenate two strings without a StringBuffer equivalent, once to figure out the length of the result so you can allocate the correct size and once to do the actual copy.

Don't see why you shouldn't be able to make a fat pointer point into a range inside the original array either? Just point it to an element and make the length-field shorter than the original? This is usally called array_view, span or slice in other languages.

[+] _ph_|7 years ago|reply

It would be 2/3rds of a Go slice. You have a fat pointer with the capacity of the array it is pointing to. If you want to implement shorter strings, you would have to store the length independently, or use 0 termination. You still would have the length information as a safeguard of overflowing the array. You still could do everything, what you can do with current C strings, just safer. One could have fat pointers to array elements to, just with an accordingly shorter capacity.

In the end, I think the Go slices are the consequential implementation of safe fat pointers, having both the capacity and length and allowing efficient and still safe reslicing. The overhead of having 24 vs 8 bytes per pointer on a 64 bit machine should be worth it in modern times.

[+] jacinabox|7 years ago|reply

I tried implementing a scheme like this once. What you do for efficiency is allocate some extra header space with the array size, and access it with negative pointer offsets. You pass to fat pointers if asked to take slices of the array. This way the common use case has good locality. The way you get things onto the stack is with a macro, which preallocates and initializes the array with the (statically known) array length.

[+] fnord123|7 years ago|reply

>If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"?

You don't need a fat pointer. It can be part of the memory layout on the heap. How do you think `free` knows the length of the memory you are deallocating? Because the length is on the heap snuggled in right before the actual pointer malloc returned.

[+] Animats|7 years ago|reply

Yes, that's C's biggest mistake. (But remember, they had to cram the compiler into a 16-bit machine.) No, "fat pointers" are not a backwards-compatible solution. They've been tried. They were a feature of GCC at one time, used by almost nobody.

I once had a proposal on this. See [1]. Enough people looked it over to find errors; this is version 3. The consensus is that it would work technically but not politically.

The basic idea is that the programmer knows how big the array is; they just don't have a way to tell the compiler what expression defines the length of the array. Instead of

    int read(int fd, char buf[], size_t n);

you write

    int read(int n; int fd, char (&buf)[n], size_t n);

It generates the same calling sequence. Arrays are still passed as plain pointers. But the compiler now knows how big "buf" is, both on the caller and callee side, and can check.

I also proposed adding slice syntax to C, so, when you want to talk about part of an array, you do it as a slice, not via pointer arithmetic.

The key idea here is that you can call old code from new ("strict") code, and strict code from old code. When you get to all strict code, subscript errors should be all checkable.

[1] http://www.animats.com/papers/languages/safearraysforc43.pdf

[+] WalterBright|7 years ago|reply

I suspect that the reason your idea was not adopted was the syntax. It's not a phat pointer, it's two arguments with some rather complex syntax to connect the two.

The reason I'm fairly confident of that assessment is I've had similar experiences with D when the syntax for something was too complex. Early on, the syntax for lambdas was rather clunkly. Everyone either hated it, or insisted that D didn't even have lambdas. Greatly simplifying the syntax was a revelation, suddenly D had lambdas and they became used everywhere.

Syntax matters a great deal.

[+] raverbashing|7 years ago|reply

> Yes, that's C's biggest mistake. (But remember, they had to cram the compiler into a 16-bit machine.

Pascal's compiler was smaller and it worked in 16bit machines no problem.

Maybe C's base library was bigger? I'm not sure

[+] mFixman|7 years ago|reply

> I also proposed adding slice syntax to C, so, when you want to talk about part of an array, you do it as a slice, not via pointer arithmetic.

I highly disagree with this. One of the advantages of conflating pointers with arrays is an obvious and very consistent way of indexing and slicing on the entire language that has minimal syntactic baggage.

[+] pjmlp|7 years ago|reply

Burroughs and IBM also had to cram safer languages in more restrained environments.

[+] int_19h|7 years ago|reply

It's not just argument passing, though. You also want to be able to return slices, store them inside structs etc.

[+] MrBingley|7 years ago|reply

I absolutely agree. Adding an array type to C that knows its own length would solve so many headaches, fix so many bugs, and prevent so many security vulnerabilities it's not even funny. Null terminated strings? Gone! Checked array indexing? Now possible! More efficient free that gets passed the array length? Now we could do it! The possibilities are incredible. Sadly, C is so obstinately stuck in its old ways that adding such a radical change will likely never happen. But one can dream ...

[+] bluetomcat|7 years ago|reply

> Adding an array type to C that knows its own length would solve so many headaches

C arrays know their length, it's always `sizeof(arr) / sizeof(*arr)`. It's just that arrays become pointers when passed between functions, and dynamically-sized regions (what is an array in most other languages) are always accessed via a pointer.

[+] m_mueller|7 years ago|reply

I’ll add to this that C having committed to this mistake is one of thr main reasons some people (scientific programmers) are still using Fortran. Arrays with dimensions, especially multidimensional ones, allow for a lot of syntactic sugar that are very useful, such as slicing.

[+] ars|7 years ago|reply

> But one can dream ...

There's nothing stopping you from simply doing it. With a couple of macros the whole thing can just be a header file.

True, it doesn't take you all the way there (you'll still need to manually check array access to make sure they don't go over), but it's a start. And those manual checks can be a macro as well, to make it easy to add them where needed.

[+] kahlonel|7 years ago|reply

Its actually quite common for C programmers to create their own array type that knows its length, and use it in their projects. See this for example: https://github.com/antirez/sds

[+] WalterBright|7 years ago|reply

Just for fun, type in this program:

    int fred(int a[10]) {
        return a[11];
    }

It compiles without error with gcc and clang, even with -Wall. The code generated by clang is:

    mov EAX,02Ch[RDI]
    ret

i.e. buffer overflow, even though the array size is given. Compile the equivalent DasBetterC program:

    int fred(ref int[10] a) {
        return a[11];
    }

    fred.d(2): Error: array index 11 is out of bounds a[0 .. 10]

And the 32 bit code generated (when using 9 instead of 11 so it will compile):

    mov     EAX,024h[EAX]
    ret

[+] bluetomcat|7 years ago|reply

Quite surprised to see this not mentioned. C99 allows you to use the "static" keyword in array function parameters like this:

    void foo(int arr[static 10]);

It cannot check whether a passed pointer will point to enough space, but the compiler can warn you if you pass a fixed-size array of a smaller size.

[+] WalterBright|7 years ago|reply

Apparently someone posted this here because of my remark:https://www.reddit.com/r/programming/comments/90ov9i/a_respo...

Nice to see it get such a nice response!

[+] chmike|7 years ago|reply

From my experience Go's array (slice) is a far better solution. It does not only carry the size (number of elements), it also carries the array buffer capacity. To me it's the epitome of what arrays should be.

[+] speedplane|7 years ago|reply

Gimme a break, making stricter requirements on C arrays may theoretically make some things easier, but we’re talking 1% improvement. What makes C hard (and great) is requiring an understanding of not just memory, but memory allocation and deallocation schemes. For many beginners this is hard conceptually, but for everyone, keeping track of allocated and unallocated memory is extremely difficult.

[+] ufmace|7 years ago|reply

I haven't written much C, and I don't have a firm opinion on whether or not that particular issue is C's biggest mistake. I do think that just this one change sounds radical enough, as far as the effort it would take to convert existing C code that uses the high-risk pattern, that it seems better to just wholesale convert to a language that already mandates safety like Rust or Java. Particularly when you consider all of the other high-risk patterns in C that these other languages eliminate.

[+] User23|7 years ago|reply

This is a very good article that highlights the importance of semantics.

[+] hota_mazi|7 years ago|reply

Conflating pointers and arrays seem pretty minor and not the cause for many bugs.

The main source of bugs in C to me would be pointer arithmetics.

[+] nearmuse|7 years ago|reply

What's the mistake? You pass a pointer and the number of elements, it's just the C way. At any point in time you have to pay attention. What is the proposal here? Make all arrays structures? Or add some weird un-C syntactic sugar?

[+] bluecalm|7 years ago|reply

Why is this such a serious issue? I mean it is inconvenient to always pass length along with the pointer but it's not that inconvenient. It's a bit more typing but that's where problems end.

[+] unknown|7 years ago|reply

[deleted]

[+] altrego99|7 years ago|reply

Agree that this is a problem (if the programmer is not careful).

But serious question, why even bother with this one fix?

The only reason for the fix is so to make it more difficult to make errors.

Fix arrays, then you would fix null pointer, then you might add objects, templating/generics to support a good collections library, rtti, and before you know it you are creating another one of c++, D, go, java. And we already have those.

C paved the way. Why not let it be the end of it?

[+] toolslive|7 years ago|reply

Isn't the fact that core types don't have a fixed representation a bigger mistake ? a char can be 16 bits, for example, aso.

[+] nurettin|7 years ago|reply

Fat pointers manifested themselves in Pascal as strings and are still being used in modern Delphi.

[+] apz28|7 years ago|reply

I would love one day that programming should adhere to the discipline as in bridge/car safety. Simple malpractice will go to jail for it then there will be no argumment/discussion about this stupid mistake that can be verified by tool Cheers Pham

[+] xaduha|7 years ago|reply

That's why I hope Red/System and just Red in general takes off https://static.red-lang.org/red-system-specs.html

254 comments