timbray | 7 months ago

Relevant: https://www.ietf.org/archive/id/draft-bray-unichars-15.html - IETF approved and will have an RFC number in a few weeks.

Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are the "characters" you shouldn't use. Includes a bunch of stuff the OP mentioned.

chrismorgan|7 months ago

The most important bit of that is the “Unicode Assignables” subset <https://www.ietf.org/archive/id/draft-bray-unichars-15.html#...>:

  unicode-assignable =
     %x9 / %xA / %xD /               ; useful controls
     %x20-7E /                       ; exclude C1 controls and DEL
     %xA0-D7FF /                     ; exclude surrogates
     %xE000-FDCF /                   ; exclude FDD0 nonchars
     %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
     %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
     %x30000-3FFFD / %x40000-4FFFD /
     %x50000-5FFFD / %x60000-6FFFD /
     %x70000-7FFFD / %x80000-8FFFD /
     %x90000-9FFFD / %xA0000-AFFFD /
     %xB0000-BFFFD / %xC0000-CFFFD /
     %xD0000-DFFFD / %xE0000-EFFFD /
     %xF0000-FFFFD / %x100000-10FFFD

josephg|7 months ago

This is really helpful - thanks. I write a CRDT library for text editing. I should probably restrict the characters that I transport to the "Unicode Assignables" subset. I can't think of any sensible reason to let people insert characters like U+0000 into a collaborative text document.
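
A rough sketch of what that restriction could look like in Python, assuming the ranges are transcribed directly from the unicode-assignable ABNF quoted above. The names `ASSIGNABLE_RANGES`, `is_unicode_assignable`, and `first_disallowed` are made up for illustration; they don't come from the draft or from any library:

```python
from typing import Optional

# Inclusive code point ranges for the BMP, transcribed from the ABNF above.
_BMP_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # useful controls
    (0x20, 0x7E),                        # printable ASCII (no DEL, no C1)
    (0xA0, 0xD7FF),                      # stops before the surrogates
    (0xE000, 0xFDCF),                    # stops before the FDD0 nonchars
    (0xFDF0, 0xFFFD),                    # stops before FFFE and FFFF
]

# Planes 1 through 16: everything except the last two code points of each plane.
_PLANE_RANGES = [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)]

ASSIGNABLE_RANGES = _BMP_RANGES + _PLANE_RANGES


def is_unicode_assignable(ch: str) -> bool:
    """Return True if the single code point `ch` falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ASSIGNABLE_RANGES)


def first_disallowed(text: str) -> Optional[int]:
    """Return the code point index of the first disallowed character, or None."""
    for i, ch in enumerate(text):
        if not is_unicode_assignable(ch):
            return i
    return None
```

An insert operation could then be rejected (or sanitized) whenever `first_disallowed(inserted_text)` is not `None`; for example, `first_disallowed("hello\x00world")` returns `5`. A linear scan over 23 ranges is plenty for a sketch; a real implementation would more likely use a few comparisons in the language the library is written in.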