C Strings and my slow descent to madness

[+] nathell|2 years ago|reply

> Our last function is strcmp. It looks at two strings and determines whether they are equal to each other or not. If they are it returns 0. If they aren’t it returns 1.

No it doesn’t.

    RETURN VALUES
         The strcmp() and strncmp() functions return an integer greater than, equal
         to, or less than 0, according as the string s1 is greater than, equal to,
         or less than the string s2.  The comparison is done using unsigned
         characters, so that ‘\200’ is greater than ‘\0’.
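
A minimal sketch of how that return value is usually consumed: only the sign is meaningful, so the test is against 0, never against 1. The helper names `str_equal` and `str_order` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when both strings have identical contents. */
static bool str_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Hypothetical helper: collapse strcmp's result to -1, 0 or 1.
   Only the sign of strcmp's result is specified, not its magnitude. */
static int str_order(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

Code written as `if (strcmp(a, b) == 1)` may appear to work on one libc and fail on another, which is why comparing against 0 is the safer habit.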

[+] stephc_int13|2 years ago|reply

If you are using C and do some non-trivial work with strings you should either use a good library to handle strings or build your own.

It is not that difficult in practice.

The old C std lib is, in my opinion, outdated, obsolete and a very bad fit for complex string handling, especially on the memory management side.

In my own framework, the string management module is using a dedicated memory allocator and a "high level" string API with full UTF8 support from the start.

As a general rule, I think that the C std lib is the weakest part of the C language and it should only be used as a fallback.

[+] nneonneo|2 years ago|reply

Pop quiz, which of these is safe, given "char buf[80]" and arbitrary user input in argv[1]?

    gets(buf);
    scanf("%s", buf);
    strcpy(buf, argv[1]);
    scanf("%80s", buf);
    strncpy(buf, argv[1], 80);
    snprintf(buf, 80, argv[1]);

----

The delightful answer is none of them. The first three have no bounds checking at all, meaning that they will happily overflow the buffer to an arbitrary extent (gets, at least, will usually trigger a warning on modern compilers). The next two have off-by-one errors: scanf will write a NUL byte out of bounds (and that's exploitable! https://googleprojectzero.blogspot.com/2014/08/the-poisoned-...) while strncpy will fail to NUL-terminate the string. The last one uses the right buffer length, but treats user input as a format string and can leak memory contents or produce arbitrary memory corruption with the %n format specifier.

C string handling practically invites off-by-one errors and horrible security practices out-of-the-box.
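
For contrast, here is a rough sketch of bounded versions of the same operations. The helper names `read_line` and `copy_arg` are hypothetical, and the sketch assumes buffer sizes that fit in an `int` (which `fgets` requires anyway).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buf, never writing more than
   size bytes including the terminating NUL. Returns 0 on success. */
static int read_line(char *buf, size_t size, FILE *in)
{
    assert(buf != NULL && size > 0 && in != NULL);
    if (fgets(buf, (int)size, in) == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing newline, if any */
    return 0;
}

/* Hypothetical helper: copy src into buf, always NUL-terminated.
   Returns 0 if everything fit, -1 if the input had to be truncated. */
static int copy_arg(char *buf, size_t size, const char *src)
{
    assert(buf != NULL && size > 0 && src != NULL);
    /* "%s" keeps user input out of the format-string position. */
    int n = snprintf(buf, size, "%s", src);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With `char buf[80]`, that would be `read_line(buf, sizeof buf, stdin)` and `copy_arg(buf, sizeof buf, argv[1])`. If scanf is unavoidable, the field width has to leave room for the terminator: `scanf("%79s", buf)`.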

[+] robomartin|2 years ago|reply

> The delightful answer is none of them.

No. Sorry. This is bad programming. C'mon.

I started programming back in the 8080, 8085, 6502, etc. days. I had to program some prototype computers using a hex keypad while entering raw machine code (not even assembler). I still own a couple of these:

https://i.imgur.com/ZsIJj1p.png

In a couple of cases I had to take this approach to bootstrap Forth on a 6502, then write a full Forth code editor and finally write the robotics application from there.

Do not confuse bad programming or lack of knowledge with something attributable to a language, any language. A knowledgeable software developer, among other things, stays clear of these issues. This is also the value of experience and exposure to a wide range of technologies.

It's like blaming MicroPython for a machine getting destroyed because garbage collection interrupted a critical real time process. There's nothing wrong with MicroPython in that regard; the programmer/designer of the embedded system simply lacked knowledge and understanding.

Part of the problem, as I see it, is that a good deal of modern university CS degrees don't even touch low level stuff. They start students on languages like Javascript and Python. These are fantastic, however, someone with deep-rooted experience in these languages who jumps into C is very likely to do some truly horrific things. The language isn't the problem, at all.

I mean, not to go too far, the Linux kernel is written in C. Right? It's about the person, not the language.

[+] myrmidon|2 years ago|reply

Are you sure that strncpy does an out-of-bounds write here? I believe it doesn't, but it would give you an unterminated string in buf, which is also... less than ideal (if the input is 80 non-null characters or longer).
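
The unterminated case is easy to reproduce. A small sketch, with hypothetical helper names, that checks for the terminator and shows the usual workaround of copying one byte less and terminating by hand:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does buf contain a NUL within its first size bytes? */
static bool has_terminator(const char *buf, size_t size)
{
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}

/* Hypothetical helper: strncpy-based copy that always terminates,
   silently truncating input that does not fit. */
static void copy_truncating(char *dst, size_t size, const char *src)
{
    assert(dst != NULL && src != NULL && size > 0);
    strncpy(dst, src, size - 1);
    dst[size - 1] = '\0';
}
```

After `strncpy(buf, src, sizeof buf)` with a source at least as long as the buffer, `has_terminator(buf, sizeof buf)` is false: nothing was written out of bounds, but any later `strlen` or `printf("%s")` on buf reads past the end.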

[+] marcodiego|2 years ago|reply

> If we try to print out some Japanese characters… [] The output isn’t what we expect.

Yes it is. And I bet on a modern windows version it is too. The terminal has been (probably intentionally) neglected by ms for a long time, but as far as I know this has mostly been fixed on modern windows versions.

EDIT: Author admits it later in the text "will be fixed in Windows 11 and Windows Server 2022"

Also it says "strlen("有り難う")); [...] and the output is… The length of the string is 12 characters". But according to "man strlen": "RETURN VALUE: The strlen() function returns the number of bytes in the string pointed to by s.". It says nothing about "number of characters".
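
To make the bytes-versus-characters distinction concrete, here is a small sketch that counts code points instead of bytes. It assumes the input is valid UTF-8, and `utf8_codepoints` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a string assumed to be valid UTF-8: every byte
   that is not a continuation byte (bit pattern 10xxxxxx) starts one. */
static size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For "有り難う" encoded as UTF-8, `strlen` reports 12 (four characters at three bytes each) while `utf8_codepoints` reports 4. Note that code points are still not the same thing as user-perceived characters once combining marks are involved.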

[+] userbinator|2 years ago|reply

Well-written C tends to minimise string usage in general, preferring to convert to another format as soon as possible. Allocating, copying, and passing around strings in large quantities is not a good idea for efficiency, but of course some people coming from other HLLs seem to try to do it anyway, which causes many other problems.

[+] bell-cot|2 years ago|reply

THIS.

And programming, engineering, and life in general have so, SO many other situations where "X is not very good at doing Y". Yet (my experience) guys seem extremely resistant to the common-sense strategy of "then try to minimize how much Y you do with X".

[+] agumonkey|2 years ago|reply

to the point I often wonder if strings should exist.. buffers -> symbols | structs.

[+] gaws|2 years ago|reply

> preferring to convert to another format as soon as possible.

Like what?

[+] zh3|2 years ago|reply

I once got called in to fix an SS7 stack suffering from poor performance. Pretty well written, and not obvious at first sight why it was going slow. Most of it was low-level bit fiddling, and some small strncpy's() - generally about 8 chars or so.

Didn't take that long to profile (well, printf's as no profiling available) and figure out it was the strncpy's causing the problem, but why? Well, there was a handy 8 megabyte buffer used for working memory that the strings were being copied into for modification.

From the strncpy() man page:-

>If the length of src is less than n, strncpy() pads the remainder of dest with null bytes.

Ah. So every little strncpy was essentially copying the string then zeroing out 7,999,992 bytes. And there were lots of little strncpy's...
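
The padding behaviour is simple to demonstrate on a small buffer. Below is a sketch of a bounded copy that touches only the bytes it needs; `copy_no_padding` is a hypothetical name, not what the SS7 stack actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical replacement: copy at most size-1 bytes plus one NUL,
   leaving the rest of dst untouched instead of zero-filling it. */
static size_t copy_no_padding(char *dst, size_t size, const char *src)
{
    assert(dst != NULL && src != NULL && size > 0);
    /* Look for the terminator without reading more than size-1 bytes. */
    const char *end = memchr(src, '\0', size - 1);
    size_t len = end ? (size_t)(end - src) : size - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;   /* bytes copied, excluding the terminator */
}
```

Copying an 8-character string with this writes 9 bytes regardless of how large the destination is, whereas `strncpy(dst, src, n)` always writes exactly n bytes.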

[+] simonblack|2 years ago|reply

"We're not in Kansas any more, Toto"

Or to paraphrase that "We're not in Python any more, and C is not Python".

You know what sends me insane? Indentation and lack of fixed types in Python. But I don't have problems with C strings. Because I have grown to love and know C's string foibles just like the author will certainly not be driven insane by 'Python's shortcomings according to me'.

The world is full of people who complain that something or other is different from what they know, so that 'other' is wrong. That's just being isolationist. Everything has its own advantages, its own disadvantages. Let's accept that and move on, instead of making mountains out of mole-hills.

[+] Sohcahtoa82|2 years ago|reply

> Indentation and lack of fixed types in Python.

Whenever I see someone complain about Python's indentation, my brain internally translates it to "I poorly format my code."

If your code is properly formatted, then Python's indentation is never a problem. I praise Python's indentation-as-syntax because it prevents issues like a dangling else or a forgotten brace while also making proper formatting a requirement for your program to run.

[+] lmm|2 years ago|reply

Indentation and lack of fixed types aren't responsible for over 50% of known security issues in software.

C's issues are not just harmless foibles. They cause real harm to the poor people actually using the software.

[+] eloff|2 years ago|reply

That’s a fair point, but of the over twenty programming languages I’ve used in my career, only C uses null terminated strings. All the others store the length. There’s good reasons for that. I think C strings are objectively bad and error prone.

[+] vkou|2 years ago|reply

You've grown to love the footguns and hundreds of thousands of security holes that null-terminated strings have introduced over the decades?

It's not so much a question of different is bad, it's that having one of the six positions for your car's stick shift be marked 'Self-destruct' is... Sub-optimal. I'm sure you're smart enough to operate that car safely, but the ditches seem to be filled with burnt-out husks.

Tab-based, versus curly-brace indentation, on the other hand, is a question of how you want the car painted. Purely personal taste.

[+] chlorion|2 years ago|reply

There is no practical advantage to null terminated strings though.

It's not that they are "different", it's that they are extremely error prone and have poor performance for certain operations such as getting the length.

>But I don't have problems with C strings

Everyone thinks they are clever enough to use them and other parts of C without problems, and those people are the most dangerous.

[+] gavinhoward|2 years ago|reply

Okay, I agree that by default, C strings are bad.

But it doesn't have to stay that way. Someone else in the comments mentioned antirez's sds library for dynamic strings. This works, but you could also easily roll your own. All you need is an init function, and perhaps an assert or other check at the end of it that the string has a nul terminator.

At that point, type checking will let you blindly pass those strings (or their char arrays) to any of those C functions without worry.

Edit: I'll also add that I think a string library should have a difference between static strings and string builders (dynamic strings). It makes everything easier.

[+] bluetomcat|2 years ago|reply

In well-written C, you don't work with strings the way you do in other HLLs. For example, extracting and copying substrings is something unnecessary, unless you want to modify the parent string. Otherwise, a substring is represented by a pointer and a size_t length, and can easily be printed that way via the "%.*s" printf specifier:

    const char *s = "Hello World!";
    const char *world = s + 6;
    size_t world_len = 5;
    printf("%.*s\n", world_len, world);

[+] tom_|2 years ago|reply

* consumes an int, not a size_t: https://port70.net/~nsz/c/c11/n1570.html#7.21.6.1p5
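
In practice that means converting the length before it reaches the variadic call. A sketch, with `format_slice` as a hypothetical helper that writes into a buffer so the result can be checked:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format the first len bytes of s into out.
   Returns what snprintf returns, or -1 if len cannot be passed as int. */
static int format_slice(char *out, size_t out_size, const char *s, size_t len)
{
    assert(out != NULL && s != NULL);
    if (len > INT_MAX)
        return -1;   /* the precision argument of %.*s must be an int */
    return snprintf(out, out_size, "%.*s", (int)len, s);
}
```

With `s = "Hello World!"`, calling `format_slice(out, sizeof out, s + 6, 5)` leaves "World" in out. Passing a `size_t` straight through for the `*` is undefined behaviour, even though it often happens to work on platforms where the low bits land in the right place.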

[+] gpderetta|2 years ago|reply

In other HLLs it is easy to have subviews on other strings. C makes it needlessly hard by requiring null termination in half the APIs.

[+] flohofwoe|2 years ago|reply

This is from a C fan: If you are going to do any string heavy work, please use anything else than C (Python is pretty nice for this sort of stuff for instance).

And if you need to use C anyway, then please use anything else than the string functions from the standard library. The C stdlib is (mostly) a leftover from the K&R era when opinions about what makes a good API were very different from today, and C was a much 'harsher' language.

C is pretty nice for a lot of things, but working with strings definitely isn't one of them.

[+] axilmar|2 years ago|reply

For string heavy workload, C is ideal, provided that you don't use C string functions.

You can always allocate a very large buffer and do your string operations there, using memcpy and the assorted functions which can be inlined on many architectures and be really fast.

Then you can dispose of the buffer really quickly with one call or reuse it for later operations by simply setting a few pointers to initial status...
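
A minimal sketch of that idea, assuming a caller-supplied buffer and a simple bump cursor. The `struct arena` type and its functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bump arena over a caller-supplied buffer. */
struct arena {
    char  *base;   /* start of the backing buffer */
    size_t cap;    /* total bytes available */
    size_t used;   /* bytes handed out so far */
};

/* Copy len bytes into the arena and NUL-terminate the copy.
   Returns NULL when there is not enough room left. */
static char *arena_push(struct arena *a, const char *src, size_t len)
{
    assert(a != NULL && src != NULL && a->used <= a->cap);
    if (len >= a->cap - a->used)
        return NULL;
    char *dst = a->base + a->used;
    memcpy(dst, src, len);
    dst[len] = '\0';
    a->used += len + 1;
    return dst;
}

/* Release everything at once by rewinding the cursor. */
static void arena_reset(struct arena *a)
{
    assert(a != NULL);
    a->used = 0;
}
```

Every string lives until the next `arena_reset`, so there is no per-string free and no way to leak an individual allocation.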

[+] BananaaRepublik|2 years ago|reply

As a newcomer to C, why is it that the C standard library doesn't get updated? Newer languages seem to place a lot of emphasis on getting their standard libraries as useful as possible. It's odd to be told not to use the standard library functions but to write my own instead. I'm really doubting I can just sit down and hammer out string functions superior to string.h.

[+] cozzyd|2 years ago|reply

I would say, unless there's a performance reason not to, always use asprintf for every string operation.

[+] _benj|2 years ago|reply

With the woes of string.h being known, why not just use an alternative like https://github.com/antirez/sds ?

I’ve also been having a blast with C because writing C feels like being a god! But the biggest thing that I like about C is that the world is sort of written on it!

Just yesterday I needed to parse a JSON… found a bunch of libraries that do that and just picked one that I liked the API.

[+] benmmurphy|2 years ago|reply

`strlcpy` is the function you probably want. but again it is not standard. https://lwn.net/Articles/507319/

I think the reason people don't want to standardise this kind of function is it often gives wrong behaviour. for example if you are trying to copy a string into a fixed buffer and it's too long then often it is an error or potentially even a security bug to truncate it. so these functions generally do the 'wrong' thing even though they are 'safer'. if you are dealing with static buffers then I think you should be explicitly checking the source fits in the target and then handling the error case. you could even have a function like `strlcpy` that does `strlen` then checks if it fits, then does the copy or returns an error code. alternatively, if the string should always fit and you don't want to handle the error case then the safe thing to do is check at runtime that it fits then abort the program if it doesn't fit.
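
A sketch of the check-then-copy function described above. The name `copy_or_fail` and the choice of `ERANGE` as the error code are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst only if it fits entirely.
   On failure dst is left as an empty string and ERANGE is returned,
   so truncation can never go unnoticed. */
static int copy_or_fail(char *dst, size_t size, const char *src)
{
    assert(dst != NULL && src != NULL && size > 0);
    size_t len = strlen(src);
    if (len >= size) {
        dst[0] = '\0';
        return ERANGE;
    }
    memcpy(dst, src, len + 1);   /* +1 copies the terminator as well */
    return 0;
}
```

The abort-on-overflow variant is the same function with the error branch replaced by a call to `abort()`.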

[+] kelnos|2 years ago|reply

On systems that aren't memory constrained, we just shouldn't be using static buffers at all. Just always use something like asprintf() and free() the result when you're done. No, it's not in the C or POSIX standards, and that's a shame, but it's at least available on Linux and the BSDs.

I end up working on a lot of code that uses Glib, so I tend to use g_strdup_printf() a lot, which works the same as asprintf().

Ultimately the cost of allocations is usually not a big deal, and you gain a lot of safety. Sure, you then have to remember to free(), but I'll take a memory leak over a segfault (and its possible security consequences) any day.

And if allocation cost is a problem, you can always go back and optimize with static buffers later. That shouldn't be the default that people reach for, though.
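
Where asprintf() is not available, the same allocate-to-fit pattern can be sketched in standard C with two vsnprintf calls. `format_alloc` is a hypothetical stand-in, not the real asprintf signature.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical portable stand-in for asprintf(): measure, allocate, format.
   Returns a malloc'd string the caller must free(), or NULL on failure. */
static char *format_alloc(const char *fmt, ...)
{
    assert(fmt != NULL);
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);                      /* the list is consumed twice */
    int n = vsnprintf(NULL, 0, fmt, ap);   /* length without the NUL */
    va_end(ap);

    char *out = NULL;
    if (n >= 0 && (out = malloc((size_t)n + 1)) != NULL)
        vsnprintf(out, (size_t)n + 1, fmt, ap2);
    va_end(ap2);
    return out;
}
```

Usage looks like `char *s = format_alloc("%s-%d", name, id);` followed by `free(s)` when done. The buffer is always exactly large enough, so there is no size to get wrong.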

[+] Night_Thastus|2 years ago|reply

strlcpy is not needed, strcpy_s (not strncpy_s) is safe and is part of the C11 standard.

In fact, strlcpy is worse:

* strlcpy truncates the source string to fit in the destination (which is a security risk)

* strlcpy does not perform all the runtime checks that strcpy_s does

* strlcpy does not make failures obvious by setting the destination to a null string or calling a handler if the call fails.

[+] saagarjha|2 years ago|reply

No, it’s not. The return value it provides is generally unwanted.

[+] tragomaskhalos|2 years ago|reply

K&R contains this beautiful koan-like string copy code:

    while (*t++ = *s++)
        ;

Honestly the elegance of this thing was one of the hooks that made me fall in love with C. But this was from a now-forgotten age of innocence, as there are so many "nopes" around this line-and-a-half that one would, rightly, be tarred and feathered for ever putting it in a program today.

[+] kovac|2 years ago|reply

Could you explain why this line should be discouraged? I'm a beginner in C, so I really don't know. That's why I'm asking.

[+] habibur|2 years ago|reply

I don't use null terminated strings. ptr+len struct everywhere. And when I need to call an API, like fopen, I make a temporary copy of that string + the null termination, do my work and then free it.

You can printf non-null terminated strings too. Check printf("%.*s", length, strptr).
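
Roughly what that approach might look like. The `struct str_view` type and both function names are hypothetical, chosen only to illustrate the pointer-plus-length pattern and the temporary copy at the API boundary.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pointer-plus-length string; not NUL-terminated. */
struct str_view {
    const char *ptr;
    size_t      len;
};

/* Make a temporary NUL-terminated copy for APIs that require one.
   The caller frees the result. Returns NULL if allocation fails. */
static char *view_to_cstr(struct str_view v)
{
    assert(v.ptr != NULL);
    char *copy = malloc(v.len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, v.ptr, v.len);
    copy[v.len] = '\0';
    return copy;
}

/* Open a file whose name is given as a view. */
static FILE *view_fopen(struct str_view path, const char *mode)
{
    char *tmp = view_to_cstr(path);
    if (tmp == NULL)
        return NULL;
    FILE *f = fopen(tmp, mode);
    free(tmp);   /* the copy is only needed for the duration of the call */
    return f;
}
```

The cost is one short-lived allocation per call into a NUL-terminated API, in exchange for substrings that never need copying anywhere else.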

[+] kevin_thibedeau|2 years ago|reply

wchar_t is a massive landmine that should never be used since its size varies by platform. The locale of the compiler has to match the end user for L prefixed strings to work correctly. Likewise char16_t and char32_t are just swimming against the easy path at this point. You're much better off sticking to UTF-8 and using the C11 u8 prefix on literals so you can use the regular string API and never have to worry about locale settings.

[+] Decabytes|2 years ago|reply

This is great advice! I wasn't aware of this and I will keep that in mind. When I first came across Unicode literals I was unsure when exactly you would use them over wchar_t

[+] Dwedit|2 years ago|reply

> "But for real if anyone knows how to get this to work on Windows 10 let me know!"

Since the May 2019 update, Windows 10 has supported declaring the code page in a manifest file.

In Visual Studio, you must add "/utf-8" to the compiler command line, this makes it parse the source code as a UTF-8 file, and makes it output UTF-8 string literals.

To make console output work, call the Win32 function "SetConsoleOutputCP(65001);"

To get support for opening files with names that aren't in your system codepage:

* Create a manifest file as shown in https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

* Add this as an "Additional Manifest File" in Visual Studio project settings for the manifest tool

Additionally, there is an undocumented NTDLL function "RtlInitNlsTables" that sets the code page for the process. It is difficult to use without a lot of example code, but some app locale type tools (used to change locale for a process) make use of this function.

[+] gpderetta|2 years ago|reply

The worst part of C strings is that they tend to show up in APIs (especially system calls). This makes interoperability with other languages harder than it should be.

[+] ziml77|2 years ago|reply

This is why I hate them too. You can use a custom length + pointer type for representing strings in your own code, but interfacing with other libraries and the OS almost always requires having a null-terminated string. It forces you to make copies just to tack on the null terminator.

[+] mahoho|2 years ago|reply

Just a pedantic comment, but 有り難う is arigatou or roughly "thanks", not "hello". Hello would usually be こんにちは or, more confusingly, 今日は

[+] commandlinefan|2 years ago|reply

Sort of unfortunate, because there's really no good translation for "hello" into Japanese - you'd say こんにちわ in the morning, in the afternoon こんばんわ and もしもし when answering the phone...

[+] teddyh|2 years ago|reply

もしもし

[+] jmclnx|2 years ago|reply

Yes, this is something to get used to. The BSDs created strlcpy(3) and wcslcpy(3)

https://man.openbsd.org/strlcpy.3

https://man.openbsd.org/wcslcpy.3

which to me will help with some of these issues. Too bad other operating systems do not have these. On Linux there is libbsd to get these, but I would like to see them added to the C standard.

Instead the c23 standard is messing with realloc(3) which could break some old programs. I have not looked at that in detail yet, so maybe it is a non-issue :)

[+] GabrielTFS|2 years ago|reply

These functions are in the current POSIX draft - though not published, it's quite unlikely to be removed (someone actually specifically filed an issue against POSIX to try and get it removed, basically on the basis of "it's not perfect so it should be removed", and the issue got rejected on the basis that there's no consensus for removal, and it seems unlikely this will change), and as a result, the functions are getting added to glibc: https://sourceware.org/pipermail/libc-alpha/2023-April/14696...

[+] torstenvl|2 years ago|reply

I don't know of a compiler that forces you to use the newest version of the standard, which is why I've always kind of thought "don't break old code" was treated too much like dogma. So from that perspective, a non issue.

However, there is a problem that has nothing to do with old code: they increased the number of situations that constitute undefined behavior, with no public discussion and no justification. It's frankly dangerous behavior.

[+] alecco|2 years ago|reply

Ushering out strlcpy() https://lwn.net/Articles/905777/

[+] anthomtb|2 years ago|reply

strlcpy is nice due to the guaranteed NUL termination.

strlcpy is not so nice due to the strange (IMO) return value of the number of characters in the source string. Which could be the number of characters copied or much, much larger than the number of characters copied. snprintf does the same thing.

So using strlcpy is safe (by C's low bar) but using the return value may be highly unsafe.
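
Since snprintf has the same return convention and is in standard C, it makes a convenient way to sketch the hazard. The wrapper name `copy_report_written` is hypothetical; it clamps the "wanted" length to what actually landed in the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical wrapper: returns the number of bytes actually stored in
   dst (excluding the NUL) and reports truncation through *truncated. */
static size_t copy_report_written(char *dst, size_t size, const char *src,
                                  int *truncated)
{
    assert(dst != NULL && src != NULL && truncated != NULL && size > 0);
    int wanted = snprintf(dst, size, "%s", src);   /* length of src, not of dst */
    if (wanted < 0) {
        dst[0] = '\0';
        *truncated = 1;
        return 0;
    }
    *truncated = (size_t)wanted >= size;
    return *truncated ? size - 1 : (size_t)wanted;
}
```

The classic bug is using the raw return value as an offset, as in `p += snprintf(p, left, ...)`, which walks p past the end of the buffer as soon as one call truncates.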

[+] hgs3|2 years ago|reply

Yup, there is also Linux's strscpy which doesn't require reading memory from the source string beyond the specified "count" bytes and the return value is idiot proof.

[+] russellbeattie|2 years ago|reply

Literally 25 years ago I was a beginner programmer and tried writing a .dll for Microsoft's Internet Information Server, which was relatively new at the time. (I hadn't so much as seen a Unix-based OS at the time, let alone understood CGI). C strings were mind boggling and frustrated me so much I simply gave up. Happily around the same time, MS introduced Active Server Pages and I was able to use that and never messed with C again. It's amazing the same issues still exist decades later.

[+] chihuahua|2 years ago|reply

That is the most mind-boggling part of this saga to me. People have been using C since the 1970s. It's now 2023, and there still isn't an obvious solution to this other than suggestions that every team should write their own string library from scratch.

And apparently it all started with some genius deciding that using a single 0-byte at the end is so deliciously efficient and therefore obviously the way to go. We can't waste 4 bytes for the string length, that's out of the question. I think only the Pascal solution of having a single byte for the string length is worse.

322 comments