Writing UTF-8 Programs in Plan 9

tialaramex | 4 years ago

The "An aside on whitespace" section is bafflingly wrong.

It notices that the Go code here simply lists every character it considers whitespace instead of asking its Unicode database whether each one has the White_Space property (perhaps as a speed-up). But then it forgets how UTF-8 works and treats U+0085 and U+00A0 as single UTF-8 bytes because their rune values are less than 256. Only code points below U+0080 fit in one UTF-8 byte; these two encode as two bytes each (C2 85 and C2 A0).
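To make the byte counts concrete, here is a minimal sketch in portable C of how many bytes UTF-8 needs for a given code point. The helper name `utf8_len` is made up for illustration; it is not a Plan 9 or Go function.

```c
#include <stdint.h>

/* Number of bytes UTF-8 needs for one code point, or 0 if the
 * value cannot be encoded. utf8_len is a hypothetical helper name. */
int
utf8_len(uint32_t cp)
{
	if(cp < 0x80)
		return 1;	/* ASCII is the only single-byte range */
	if(cp < 0x800)
		return 2;	/* includes U+0085 and U+00A0 */
	if(cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogates are not scalar values */
	if(cp < 0x10000)
		return 3;
	if(cp <= 0x10FFFF)
		return 4;
	return 0;	/* beyond the Unicode range */
}
```

A value below 256 is therefore not the same thing as a single byte: the one-byte range stops at U+007F.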

Beyond that, almost everything is confused about what's going on, because Plan 9's documentation keeps talking about "characters". That might mean a Unicode code point, or a Unicode scalar value, or it might be that Plan 9 doesn't know the difference because it is old; either way, it is unlikely to be a "character" in any sense you understand. The problem is that even Unicode's scalar values aren't necessarily "individual" or "legible", as this article claims.

This just isn't how human writing systems work. If you don't like it, don't blame Unicode or Plan 9; blame all your ancestors back to whoever first figured out tally marks.

lifthrasiir | 4 years ago

It is clear that the article simply translates a K&R exercise into runes without much consideration, but I should note that reversing a code point (or rune) sequence is not the same as reversing human-readable text. The hiragana example is outright wrong, because no Japanese reader would regard `ゃき` (the codepoint-wise reversal of `きゃ`, kya) as meaningful Japanese text.

ynfnehf | 4 years ago

How is that any different than reversing the English digraphs? "th", "ch", "wh", etc. Reversing English doesn't usually produce something meaningful either.

donatj | 4 years ago

Can you explain further? I’m not sure I understand the objection. Doesn’t any language become unintelligible to its readers when written backwards? That doesn’t seem specific to Japanese.

How would you wish `ゃき` to reverse if not `きゃ`?

https://emoji.boats/s/きゃ

Seeing as they are both normal runes with no modifiers attached, I don’t know what alternative there would be.

eqvinox | 4 years ago

If the goal was to properly reverse Unicode text, this doesn't do that: it completely fails to consider combining characters. Those need to stay in order; otherwise the combining accent jumps to the next/previous character.

Generally speaking, Unicode text can't be reversed without the UCD (Unicode Character Database) at hand.
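As a rough illustration of keeping combining characters attached to their base, the sketch below reverses an array of code points cluster by cluster. It is only a sketch under a loud assumption: `is_combining` checks just the range U+0300 to U+036F (Combining Diacritical Marks) as a stand-in for a real UCD lookup, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: crude stand-in for a Unicode Character Database lookup.
 * Only U+0300..U+036F is recognized; real code needs the full UCD. */
static int
is_combining(uint32_t cp)
{
	return cp >= 0x0300 && cp <= 0x036F;
}

/* Reverse in[0..len) into out[0..len), keeping each base code point
 * followed by its combining marks in their original order.
 * in and out must not overlap. */
void
reverse_clusters(const uint32_t *in, uint32_t *out, size_t len)
{
	size_t end = len, to = 0, start, i;

	while(end > 0){
		/* walk back from the end to the base of the last cluster */
		start = end - 1;
		while(start > 0 && is_combining(in[start]))
			start--;
		/* copy the cluster forward, base first */
		for(i = start; i < end; i++)
			out[to++] = in[i];
		end = start;
	}
}
```

With the input `a`, `e`, U+0301, `b`, this yields `b`, `e`, U+0301, `a`, so the accent stays on the `e`. It does nothing about context-dependent forms or other kinds of grapheme clusters.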

Also, as soon as you consider that arbitrary groups of characters (or "Rune"s) need to stay in-order, you might as well stay on UTF-8 since the variable length encoding no longer really matters.

(Also-also, some text just can't be reversed, or might change characters in reverse. For example, Greek Sigma [σ] changes to [ς] if it is at the end of a word. Do you readjust that after flipping a word around?)

lillywastaken | 4 years ago

You can find a lot of this in the Plan 9 programming guide: http://doc.cat-v.org/plan_9/programming/c_programming_in_pla...

e12e | 4 years ago

Interesting. Gives me a better intuition for strings in Zig as well.

I'm curious about the reverse() function: it requires the caller to allocate the "out" buffer and pass it in as mutable (mutable in the sense that the function writes to it), yet it returns a pointer at the end rather than void or an error code.

Is that a typical C/Plan 9 idiom?

I would probably prefer either the function allocating and returning the buffer, or the caller allocating and the function (procedure) just writing to the buffer it got as an argument.

    Rune*
    reverse(Rune *in, Rune *out, usize len)
    {
        int to = 0, from = len-1;

        /* copy in[len-1..0] into out[0..len-1]; out is not terminated here */
        while(from >= 0){
            out[to] = in[from];
            from--;
            to++;
        }

        return out;
    }

henesy | 4 years ago

Author here.

While the other comments are correct, the exact reason for passing in/out here is that the reversal is called on a subset of the incoming array.

  line = Brdstr(in, '\n', 1);

will give us a null-terminated string, but we don't want to flip the null and truncate the string, so to lazily avoid that we do:

  rlen = runestrlen(rstr);
  rev = calloc(rlen+1, sizeof (Rune));
  reverse(rstr, rev, rlen);

so we get the number of runes in the input, add 1 for the \0, then reverse the pre-\0 characters.

We could have the reverse() function allocate n+1 elements for the string and always return a null-terminated string, but then we would need to pass it a string without a \0, or make it assume that it always gets a \0 and handle that in some way.

Passing in both items and the number to iterate felt less noisy for a quick solution :)
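For comparison, the allocating alternative described above might look roughly like this in portable C. This is a sketch, not the article's code: `Rune` is typedef'd to `uint32_t` as a stand-in for the Plan 9 type, and `rune_len` and `reversed_copy` are hypothetical names (the former playing the role of `runestrlen`).

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Rune;	/* ASSUMPTION: stand-in for Plan 9's Rune */

/* Count the runes before the terminating 0. */
size_t
rune_len(const Rune *s)
{
	size_t n = 0;

	while(s[n] != 0)
		n++;
	return n;
}

/* Return a newly allocated, 0-terminated reversal of s,
 * or NULL if allocation fails. The caller frees the result. */
Rune*
reversed_copy(const Rune *s)
{
	size_t i, n = rune_len(s);
	Rune *rev = calloc(n + 1, sizeof(Rune));	/* +1 keeps rev[n] == 0 */

	if(rev == NULL)
		return NULL;
	for(i = 0; i < n; i++)
		rev[i] = s[n - 1 - i];
	return rev;
}
```

The trade-off is the one described in the comment: this version must assume its input is 0-terminated, whereas passing the length explicitly lets the caller reverse any sub-range.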

kevin_thibedeau | 4 years ago

Caller-provided objects are a standard idiom that offers greater flexibility: the caller can use static/global variables, objects with a flexible array member (FAM), or custom allocators.
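A small sketch of that flexibility, assuming a reverse() with the same shape as the one quoted earlier in the thread: because the caller supplies `out`, it can point at static storage and no allocator is involved. `Rune` is again a `uint32_t` stand-in, and `MAXLINE`, `linebuf`, and `reverse_line` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Rune;	/* ASSUMPTION: stand-in for Plan 9's Rune */

/* Caller supplies out, which must hold at least len runes. */
Rune*
reverse(const Rune *in, Rune *out, size_t len)
{
	size_t i;

	for(i = 0; i < len; i++)
		out[i] = in[len - 1 - i];
	return out;
}

enum { MAXLINE = 256 };
static Rune linebuf[MAXLINE + 1];	/* static storage, no malloc */

/* Reverse up to MAXLINE runes into the static buffer and 0-terminate.
 * Returns NULL if the input is too long. Not reentrant. */
Rune*
reverse_line(const Rune *in, size_t len)
{
	if(len > MAXLINE)
		return NULL;
	reverse(in, linebuf, len);
	linebuf[len] = 0;
	return linebuf;
}
```

Returning `out` costs nothing and lets a call be used directly as an expression, much as `strcpy` returns its destination.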

0x20cowboy | 4 years ago

I’ve been playing with UTF-8 in C99, and the examples here have helped me understand how it works a bit better.

22 comments