Designing a better strcpy

[+] dataflow|4 years ago|reply

I hate to nitpick on something so mundane and superficial, but why in the world are people still writing code like this in 2020?

  while (--len && (*dst++ = *src++))
    ;

Dereferenced post-increments are already confusing enough as-is, and yet here we have 1 pre-increment and 2 dereferenced post-increments happening on top of an assignment in a conditional, all in a single expression. Even as someone who does put an assignment in a conditional once in a while, this still feels 100% unjustifiable to me. It's especially ironic given the premise is that C code has security bugs... if the goal is to avoid that, shouldn't there be even more care taken to avoid this kind of code?

EDIT: For those who don't think the readability can be improved... any thoughts on something more like this? Do we really need a compound assignment with three side effects in a conditional modifying the function arguments to make this readable?

  char *strxcpy(char *restrict dst, const char *restrict src, size_t len) {
    size_t i;
    for (i = 0; i < len; ++i) {
      if (src[i] == '\0') { break; }
      dst[i] = src[i];
    }
    int input_exhausted = i < len;
    if (input_exhausted) { ++i; }
    if (i > 0) { dst[i - 1] = '\0'; }
    return input_exhausted ? &dst[i] : NULL;
  }

[+] saagarjha|4 years ago|reply

I can’t answer why other people might do this, but I wrote this code so I can at least give my rationale for it. In short: it leans into C idioms to convey intent. “--len” in a loop condition is a way to say “do this len times”. “*dst++ = *src++” is quite literally the reference strcpy implementation. Taken together, it clearly and quite concisely represents the main operation of this function: do a strcpy, but only up to len characters. There is some extra bookkeeping that needs to be done for errors and null-termination, and that’s delineated from the core part of the code. In all, I read the function as “bail out early if there is nothing to do, otherwise do the copy operation and then fix up the result”.

Now, of course you might say that my code is prone to off-by-one errors and the like. And it totally is! This is C and you are going to get string handling wrong, guaranteed. When I was writing this I personally ran it through the typical edge cases and it caught an instance where I (IIRC) was writing an extra NUL when I had exactly filled the buffer previously. These things happen and you need to verify the code yourself to fix it beforehand. But, back to the point, I don’t trust your code any more than I would mine. I mean, you’re also breaking out of a loop on an additional condition and reusing the index variable for later logic. That’s basically a big red flag for “there might be bugs here”. There’s really no way to avoid it. I would argue that the most complex part of this code is actually verifying that the transition between the loop and the error handling that happens afterwards is correct, not the copying bit. Adding a few extra variables splits up the logic a bit but it also gives you more loop invariants to keep track of. Anyways, that’s my 2¢.

[+] acmj|4 years ago|reply

This looks straightforward to me and probably to many C programmers. It is natural to write this as well as other styles you can come up with. Nitpicking is a bigger problem which distracts the attention to the real issue.

[+] FabHK|4 years ago|reply

I don't know, it could be written more verbosely, but then, this is just so idiomatic. Not sure writing it out makes it much more legible.

ETA: yes, your version reads really well. Question though: does it require an extra multiplication and addition for every "[i]" in the code below, or are compilers smart enough to optimise that these days? It seems to me that for sufficiently dumb compilers, your "nice" version would be slower.

    for (i = 0; i < len; ++i) {
      if (src[i] == '\0') { break; }
      dst[i] = src[i];
    }

[+] blux|4 years ago|reply

I disagree, I think that is just the way idiomatic C code looks like and should look like. A post incremented dereferenced pointer for example is an idiom, it is so common that at some point you will immediately recognize it for what it is. These idioms keep the code concise, which has a lot of value in itself, since they make it easier to recognize what the code is all about instead of figuring out code that essentially expresses the same intent but does it in some non-idiomatic way. They are like chords in music, standard positions in chess, etc.

[+] Tempest1981|4 years ago|reply

Homage to K&R? Hard to unlearn that style.

How about reading the STL source code? I know it's not deliberately obfuscated, but sometimes feels that way.

[+] liendolucas|4 years ago|reply

Thank you for mentioning this! As idiomatic as the original loop might be it still forces someone unfamiliar with C to think about it carefully, on the other side this version is trivial to read and understand. Code should be always designed with readability in mind as we spend much of our time reading and understanding code (if an idiom makes hard to understand something I would probably consider it a bad idiom and avoid the style).

EDIT: And no, there is also people that write code like you did regardless the language ;)

[+] jmrm|4 years ago|reply

Personally, I try to make my code as easier to read as I can. Not only is faster to read to others, but also faster to analyse when you're trying to fix a bug.

At the end of the day, today's compilers are good enough to translate the code to the machine in the most efficient way.

[+] abhishekjha|4 years ago|reply

I guess this is so that something like this can be asked in interviews to test how lucky the candidate is on that particular day and hour. Otherwise everybody would pass the interviews.

[+] userbinator|4 years ago|reply

but why in the world are people still writing code like this in 2020?

To be frank, because some people still value intelligence and understanding.

I have been working with C for decades and that code is basically second-nature to me. It reads very naturally. Perhaps it should be you and all the anti-intellectuals who should instead consider why they find it "confusing", and take some personal responsibility to actually try to learn the language instead of complaining that it's "not readable".

It's especially ironic given the premise is that C code has security bugs... if the goal is to avoid that, shouldn't there be even more care taken to avoid this kind of code?

On the contrary, mental laziness is a bigger source of bugs than anything else. Making code more verbose does not help, and your version has so many more "moving pieces" that I'd say it's harder to understand.

[+] saurik|4 years ago|reply

> In the case where src fits in dst, it will return a pointer past the NUL byte it placed; otherwise it returns NULL to indicate a truncation.

It is amazing to me how personal these preferences are ;P... like, I'd be much happier with an API that always returns the location of the NUL byte on success; and, if the string gets truncated, then it instead returns dst+len (the address of the byte past the end of the buffer). This allows for chained constructions that provide efficient strcat-style semantics with easy error propagation, such as this example which concatenates three strings (which I honestly hope I got right... I'm giving reasonable odds to Saagar telling me I've coded a buffer overflow by accident somewhere ;P):

    char buf[X]; // for any X, even 0!
    char *cur = buf;
    char *end = buf + sizeof(buf);
    cur = strjcpy(cur, str1, end - cur);
    cur = strjcpy(cur, str2, end - cur);
    cur = strjcpy(cur, str3, end - cur);
    if (cur == end) goto fail;

[+] saagarjha|4 years ago|reply

Not seeing anything wrong, but it’s C so who knows ;) Thankfully, the primitive I (and memccpy) provide makes writing your wrapper easy and efficient, as opposed to all the other functions which don’t compose at all. (From my phone) I think this might work?

  char *strjcpy(char *dst, const char *src, size_t len) {
      char *result = strxcpy(dst, src, len);
      return result ? --result : result + len;
  }

[+] unknown|4 years ago|reply

[deleted]

[+] kolbusa|4 years ago|reply

Did you mean to check for (cur > end)?

[+] WalterBright|4 years ago|reply

An even better solution:

https://www.digitalmars.com/articles/C-biggest-mistake.html

With the lengths of strings known, copying becomes fast, safe, and trivial.

[+] dheera|4 years ago|reply

It always boggles me why people went with null-terminated strings in C instead of just putting the length of the string in the first 4 bytes and then the string after that.

And just generally doing that for all arrays. That way if you're passed ONLY a pointer to the array, you can always tell how long it is without requiring the caller to tell you how long it is, which seems to be a common theme among C functions. And that also allows your string to contain null characters, which is useful in many circumstances.

[+] Hackbraten|4 years ago|reply

Not everyone is going to be happy about 8 bytes overhead for every buffer.

Also, not every C-style API will migrate. If you consume one of those, you may still have to count the length, which costs cycles.

[+] mhh__|4 years ago|reply

If someone implemented that, you might even call it betterC...

[+] wruza|4 years ago|reply

Rant ahead.

(which value to return)

This always bugged me in C interfaces. All these metrics are accumulated in the scope of your qweasd(), why not at least take a pointer to a “struct strcpy_result” instead and return them all with it, dammit, stack space is almost always free. After two decades of C, I turned to jitted scripting and hopefully will NEVER look back (but thanks for the xp). Low-level bit-fiddle style programming outside of very restrictive embedded tech is a distilled schizophrenia of tiling sharp irregular shapes, trying, failing, vapid engineering, byte economy, and finally losing a larger part of the ephemeral benefits that you never needed (or were able to get) in the first place. (C++ being a drug that doesn’t fix the issue, but at least reduces voices in your head to “just” a heavy form of OCD).

discussion on losing 4/8 bytes per string in this thread

Insanity.

[+] notaplumber|4 years ago|reply

> strlcpy... it’s not standard so it doesn’t satisfy 5 either.

It's a de facto standard supported on *BSD/macOS/Android/Linux(w/ musl libc) and a number of other commercial systems, it's also on track for standardization in POSIX (Issue 8).

https://www.austingroupbugs.net/view.php?id=986#c5050

https://www.opengroup.org/austin/docs/austin_1110.pdf

While it might not meet your arbitrary set of requirements, it already has decent adoption.

[+] saagarjha|4 years ago|reply

Yikes, I hope they don’t add it, that would make people want to use it even more. Given its surprising performance characteristics it’s usually not what you want.

[+] tedunangst|4 years ago|reply

The concern with performance seems a little overwrought. strlcpy is only slow in the bad case where it truncates, which is ideally not the common case. I've never heard or seen of a performance bottleneck traced to a strlcpy in the hot path.

If you really cared about performance, you'd be using nothing but memcpy with careful length tracking. Regardless of algorithmic runtime, any function that examines bytes as it copies will be slower than a length based copy.

[+] saagarjha|4 years ago|reply

I think one use for this kind of thing is if you are doing some sort of logging you may have a fixed size buffer, both to keep overhead down (you don't want to allocate extra memory) and also prevent overly verbose logs from spamming output. In this case, waiting to calculate the length of a 10 MB string just so you can fit it in a 1 KB buffer is unacceptable. For your second point: not necessarily! memcpy would be slightly faster if you are keeping the length around, but if you're dealing with arbitrary strings you wouldn't have that. Calculating the full length beforehand is just a no-go as I mentioned above, and using memchr first to get the length and then memcpy is not going to be faster.

[+] vasama|4 years ago|reply

Migrating to a pointer-size-pair string representation would be a better use of one's time.

[+] compiler-guy|4 years ago|reply

And only works if you don't rely on any third-party libraries that take normal C strings.

Which is to say, is unrealistic for many programs.

[+] Gibbon1|4 years ago|reply

I swear part of the problem is with C there is a cargo cult prohibition against passing small structs by value.

[+] radicalcentrist|4 years ago|reply

Are there any recommended libraries for doing this, if I'd like to migrate my C codebase?

[+] zabzonk|4 years ago|reply

Referring to strcpy:

> we can only use it if we know our destination buffer is smaller than our source buffer

Should be the other way around, surely?

[+] saagarjha|4 years ago|reply

LOL yes, thanks for catching that. I’ll fix it when I’m at my computer :)

Edit: edited. I also had a messaging from past me that I had forgotten about:

  <!-- I made a mistake somewhere, didn't I? -->

[+] Randor|4 years ago|reply

Absolutely,

That's not the only discrepancy.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/m...

Note the following: "The memccpy() function does not check for the overflow of the receiving memory area." "If copying takes place between objects that overlap, the behavior is undefined."

The strxcpy he provides at the bottom doesn't look better at all. I'm not sure where the author got that function. I found some better variants of the proposed strxcpy function with bounds checks and that provides overflow detection.

[+] zbuf|4 years ago|reply

> While string truncation has its own issues, it is often a fairly reasonable fallback.

Seemingly swimming against the tide here; a strcpy() that _automatically_ truncates strings is a worse, and huge hidden risk.

Yes, buffer overflows from user data are very, very bad.

But a buffer overflow has a clear fingerprint as leaning on undefined behaviour; tools exist to detect these (eg. valgrind), and the fix is well understood.

Defining strcpy to include truncation is the wrong choice, as it's plain dangerous in most cases. It reclassifies the bug as valid program behaviour, where it's riskier and its fingerprint is harder to detect.

For example, a function which looks up credentials for a login name limited to 16 characters. During lookup, a buffer truncates and by default the code is now doing lookups against the wrong login name.

The better solution is a strcpy() which accepts a length and is _expressly not defined_ for cases where the output buffer would overflow; ie. asserts or aborts (which now has scope to be detected at compile time for some cases)

It's now clear where bug is, and a developer must preempt overflow for cases where it's possible and handle it.

Perhaps folks have different experiences, but mine is that copying a string and _wanting_ to truncate it is so incredibly rare and is the cases that should be explicit, not implicit.

[+] saagarjha|4 years ago|reply

Failing perceptibly composes better than failing loudly. It's easy to go one way:

  char *strxxxcpy(char *restrict dst, const char *restrict src, size_t len) {
      char *result = strxcpy(dst, src, len);
      assert(result);
      return result;
  }

but going in reverse is difficult.

[+] towergratis|4 years ago|reply

I would use ssize_t instead of size_t.

Mixing unsigned and signed is seriously broken in C, and hence better to stick with signed.

[+] minipci1321|4 years ago|reply

> Mixing unsigned and signed is seriously broken in C, and hence better to stick with signed.

Mmmm .. maybe in this specific case (didn't look). But if you meant this as general advice, then one should keep in mind that unsigned overflows are specified but signed overflow is UB (barring maybe the very latest version of C standard); because of that, unsigned division in many cases is trivially optimised to less complex ops, etc.

There are many advantages in using unsigned.

[+] rbanffy|4 years ago|reply

Would it make sense to extend the C standard to have a sensible string type?

/me ducks

[+] pjmlp|4 years ago|reply

It would, just like a sensible array type and proper enumerations as well.

Obviously WG14 doesn't care about it.

[+] stefan_|4 years ago|reply

I don't want to use any of the str*cpy functions, all of them are either braindead or missing in most libcs. At this point I'm all in on snprintf(%s, foo).

[+] saagarjha|4 years ago|reply

I hate to do this, but: read the post. I mention snprintf specifically: https://saagarjha.com/blog/2020/04/12/designing-a-better-str.... It is neither efficient nor general-purpose as a string copying routine.

[+] Tempest1981|4 years ago|reply

Something feels wrong about using memccpy to copy strings... ever since I saw bugs where people used memcpy incorrectly.

And is there a wchar_t version of memccpy?

[+] saagarjha|4 years ago|reply

memccpy≠memcpy, even though their names are similar. While the function is somewhat generic, I would bet 90+% of its use is going to be to copy null-terminated strings.

[+] okareaman|4 years ago|reply

I expected to read about SIMD and copying 8 bytes at a time in 64 bit registers, but I guess that is a compiler optimization

[+] unknown|4 years ago|reply

[deleted]

[+] unknown|4 years ago|reply

[deleted]

120 comments