C, what the fuck??!

What do you think the value of a will be here?

int a = 0;
// What will be the value of a????/
a++;

You have probably guessed that it won’t be 1, but chances are you only guessed that because of the way I asked the question.

a will actually not change, because a++; never runs, and the comment above it is to blame. There is something special about that line. Before we jump into that, let’s look at another example:

!didIMakeAMistake() ??!??! CIsWrongHere();

This actually compiles, which is already impressive on its own. The question is, however, what the fuck does this do?

To understand this, I have to admit one thing: I have to pass -trigraphs to a modern version of gcc before this actually works. Trigraphs are special three-character sequences that were invented to work around a problem with C: its syntax uses nine characters that are not in the ISO/IEC 646 invariant character set. These are the characters in that set:

Image: the ISO/IEC 646 invariant character set, with the national code points grayed out (source: Wikipedia)

Anyone who uses C on a regular basis should be able to figure out which nine characters are missing from it. They are the following:

 # \ ^ [ ] | { } ~

The table in the image above is meant to clarify things, but it can be confusing at first, because all nine missing characters do appear in it. Note, however, that they are grayed out. Those positions are national code points: each country could assign its own characters to them, so they are not part of the invariant set.

This could become very interesting. Let’s look at this simple line for example:

{ a[i] = '\n'; }

This would be written like this by a Swedish programmer:

ä aÄiÜ = 'Ön'; ü

This is because the Swedish character set assigns different characters to the national code points than the American one does.

The ANSI C committee recognized this problem, of course, and decided to introduce trigraphs: nine three-character sequences, each starting with ??, that stand in for the nine characters missing from the invariant set.

Image: table of the nine trigraphs and the characters they replace (source: Wikipedia)

Of course, this isn’t a beautiful solution, but it should do the trick.
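The nine mappings are easier to remember once you have typed them out, so here is a minimal sketch of them as a C function. The name trigraph_replacement is made up for this post; it is not part of any library. The function takes the character that follows the two question marks and returns the character the trigraph stands for, or 0 when the sequence is not a trigraph.

```c
/* Sketch only: trigraph_replacement is a hypothetical helper.
 * `third` is the character that follows the two question marks.
 * Returns the replacement character, or 0 if this is not a trigraph. */
char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;   /* anything else: leave the text alone */
    }
}
```

For example, trigraph_replacement('!') gives '|', while trigraph_replacement('?') gives 0, a detail that matters for the second example.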

Now, with this knowledge, let’s look at the lines that we started with. The latter is the easier one:

!didIMakeAMistake() ??!??! CIsWrongHere();

If we look at the table above, we see that ??! should be replaced with |. Therefore, this line actually says:

!didIMakeAMistake() || CIsWrongHere();

Because || short-circuits, CIsWrongHere() is only called when !didIMakeAMistake() is false, in other words when didIMakeAMistake() returns true. The line therefore behaves like this:

if (didIMakeAMistake()) 
  CIsWrongHere();
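If you want to convince yourself that the two forms really do the same thing, here is a small sketch. The bodies of didIMakeAMistake() and CIsWrongHere() are stand-ins I invented for illustration, since the original line never shows them, and mistake_made and blame_count are hypothetical helpers that make the behavior observable.

```c
#include <stdbool.h>

bool mistake_made;   /* hypothetical: what didIMakeAMistake() will report */
int  blame_count;    /* hypothetical: how often CIsWrongHere() has run */

bool didIMakeAMistake(void) { return mistake_made; }
bool CIsWrongHere(void)     { blame_count++; return true; }

/* The original line, with the trigraphs already replaced.
 * The (void) cast only silences "unused value" warnings. */
void with_logical_or(void)
{
    (void)(!didIMakeAMistake() || CIsWrongHere());
}

/* The equivalent if statement. */
void with_if(void)
{
    if (didIMakeAMistake())
        CIsWrongHere();
}
```

With mistake_made set to false, neither function touches blame_count; with it set to true, each call increments blame_count exactly once.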

The other example is actually more interesting, and a good reason to be cautious with trigraphs:

int a = 0;
// What will be the value of a????/
a++;

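As explained earlier, a stays 0 because a++; is never executed.

A trigraph is only a trigraph when the ?? is followed by one of the nine specific characters from the table. In ????/ the first two question marks are not part of a trigraph, because ??? is not in the table; only the final ??/ is replaced. Trigraph replacement happens in the very first translation phase, before comments are even recognized, so the compiler turns the code above into the following: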

int a = 0;
// What will be the value of a??\
a++;

This \ is the last character on its line, so it acts as a line continuation: the backslash and the newline after it are deleted, which joins the comment line with the next one:

int a = 0;
// What will be the value of a??a++;

And this is why a++; was never executed.
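The two steps can be imitated in plain C. This is only a sketch of translation phases 1 and 2 applied to a string, not how gcc implements them, and the names replace_trigraphs and splice_lines are my own.

```c
#include <stddef.h>
#include <string.h>

/* Phase 1 (sketch): copy src to dst, replacing every trigraph.
 * dst must have room for strlen(src) + 1 bytes. */
void replace_trigraphs(const char *src, char *dst)
{
    /* third[i] after two question marks stands for single[i] */
    static const char third[]  = "=/'()!<>-";
    static const char single[] = "#\\^[]|{}~";

    while (*src) {
        const char *hit = NULL;
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(third, src[2]);
        if (hit) {
            *dst++ = single[hit - third];   /* three characters become one */
            src += 3;
        } else {
            *dst++ = *src++;                /* not a trigraph: copy as-is */
        }
    }
    *dst = '\0';
}

/* Phase 2 (sketch): delete every backslash that is immediately
 * followed by a newline, joining the two lines. Works in place. */
void splice_lines(char *s)
{
    char *out = s;
    while (*s) {
        if (s[0] == '\\' && s[1] == '\n')
            s += 2;
        else
            *out++ = *s++;
    }
    *out = '\0';
}
```

If you feed replace_trigraphs the comment line followed by a++; on the next line, the comment ends in a??\ afterwards, and splice_lines then glues a++; onto the end of the comment. When you write the input as a C string literal, use the \? escape for the question marks, so that a compiler with trigraphs enabled doesn’t rewrite your test data.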

I would like to end with a note from the committee itself:

The Committee makes no claims that a program written using trigraphs looks attractive. As a matter of style, it may be wise to surround trigraphs with white space, so that they stand out better in program text. Some users may wish to define preprocessing macros for some or all of the trigraph sequences.

Rationale for International Standard Programming Languages C (5.2.1.1)

14 thoughts on “C, what the fuck??!”

  1. Your syntax highlighting is broken. More people are being misled by the syntax highlighting being wrong than by not knowing a well-known feature of C.

        1. Oh, I see what you mean. Yeah, my syntax highlighter doesn’t take trigraphs into consideration, that’s correct. I do wonder how many modern highlighters actually do though.

          1. Atom and vim also don’t see this as a comment. I think that is the most tricky thing about this: editors that don’t catch this.

      1. Everything, including \, after // should be ignored on any and all single line comments no matter where // is located in your code. If that’s not the case because of a particular compiler that’s a not so good situation.

    1. To understand this, I have to admit one thing: I have to pass -trigraphs to a modern version of gcc before this actually works.

  2. Your Swedish is slightly off. 🙂 The right brace is å, not ü. The full “Swedish translation table” looks like this:

    [ Ä
    \ Ö
    ] Å
    { ä
    | ö
    } å
    @ É
    ` é
    ~ ü
    ^ Ü

    though many “Swedish” character sets omitted the last four, which are mostly needed for names (and breakfast, if you want müsli!).

    A common mnemonic back in the day was “räksmörgås” (shrimp sandwich) which has the brackets in the “right” order — r{ksm|rg}s. Many Swedish hackers/programmers would set their terminals to display US ASCII, which meant that program code looked right, but natural l}ng{}ge ended {p looking like thi|. One get| the h}ng of it after a while! (Here, I used U, S, and A to simulate the effect.)

    I haven’t met a single person, Swedish or other, who actually used trigraphs.

    And, thank goodness, we’ve had mostly-working 8-bit character sets since the ’90s, and explosions from software not understanding Unicode are getting rarer and rarer! =)

    1. My sincerest apologies for my horrible Swedish, haha!

      Your example is actually great! I love it. Congratulations on your 8-bit character sets!
