jkwchui | 2 years ago
----
Conceptually it is simple: 1. assign a default (most likely) sound to each character; 2. loop through contexts, extracting words (char-combos) where the sound differs from the default ("alt-words"); 3. create SVGs + font-paths (fallback for incompatible systems) for every char and every alt-word; 4. assign a ligature that substitutes each char-sequence forming an alt-word (e.g., "when 乾 隆 appear adjacently, replace with `uniF1234`", the codepoint for the alt-word "乾隆").
It is not perfect, but I didn't expect this to work so well, and was stunned when the testers reported high accuracy. I had always believed that bespoke computation with word segmentation (with some 1M-entry library with frequencies attached) and a large data bank (100k+ words) was necessary.
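For anyone who wants the idea in code: here is a minimal Python sketch of steps 1, 2 and 4 above (step 3, drawing the glyphs, is left out). It is a string-level model, not the actual font build. The names `build_tables` and `annotate`, the toy corpus, and the `char:default` / `char:alt` reading labels are all made up for illustration, and the greedy longest-match rule is an assumption about how the substitution behaves rather than something taken from the font itself.

```python
from collections import Counter


def build_tables(corpus):
    """corpus: iterable of (word, readings), one reading per character."""
    corpus = list(corpus)  # we iterate twice, so materialise it
    counts = {}
    for word, readings in corpus:
        for ch, reading in zip(word, readings):
            counts.setdefault(ch, Counter())[reading] += 1
    # Step 1: the default sound is the most frequent reading of each character.
    defaults = {ch: c.most_common(1)[0][0] for ch, c in counts.items()}
    # Step 2: an "alt-word" is a multi-char word with a non-default reading.
    alt_words = {
        word: tuple(readings)
        for word, readings in corpus
        if len(word) > 1
        and any(defaults[ch] != r for ch, r in zip(word, readings))
    }
    return defaults, alt_words


def annotate(text, defaults, alt_words):
    """Step 4, modelled on strings: greedy longest-match substitution."""
    longest = max(map(len, alt_words), default=1)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in alt_words:
                out.extend(zip(chunk, alt_words[chunk]))
                i += n
                break
        else:
            # No alt-word starts here: fall back to the default sound.
            out.append((text[i], defaults.get(text[i])))
            i += 1
    return out


# Toy data with placeholder reading labels (not real romanisation).
CORPUS = [
    ("乾", ("乾:default",)),
    ("乾草", ("乾:default", "草:default")),
    ("乾隆", ("乾:alt", "隆:default")),
]
DEFAULTS, ALT_WORDS = build_tables(CORPUS)

if __name__ == "__main__":
    print(annotate("乾隆", DEFAULTS, ALT_WORDS))
    print(annotate("隆乾", DEFAULTS, ALT_WORDS))
```

In the font, the dictionary lookup in `annotate` is played by the ligature substitution, and the fallback branch by the ordinary per-character glyph.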
----
Practically, it was a horrific, tedious, mind-numbing, gawd-awful set of "why doesn't this work" moments: 1. SVG automation that works for 10^3 breaks with 10^5; 2. what worked for Latin breaks for unicode; 3. what worked for unicode breaks for PUA; 4. what worked for monochrome breaks for color; 5. what worked for single glyphs breaks for ligatures; 6. what?! The assignments in the database are wrong?? 7. [...]
As I was trying to coerce the system to do what it wasn't designed to do, many of these breaks were undocumented and pretty mysterious to solve, and some steps just got manually gritted through. (And each of the 15k+ glyphs got gritted through about five times.)
It does look pretty elegant at the end ;)
ackfoobar|2 years ago
> Unfortunately, without being able to do proper word segmentation, this will remain a limitation.
Can the user manually add a zero width space to help?
jkwchui|2 years ago
(For everyone else wondering what ackfoobar is proposing: let's take the phrase 香港地少人多 (if you don't read Chinese, just treat the characters as shapes). Properly segmented, it is 香港.地少.人多. The font treats this incorrectly: because "香港地" is a commonly used fragment in which 地 has a special sound, it parses the phrase as 香港地.少.人多 and gives a mistaken sound for 地.
Ackfoobar is absolutely correct that we can coerce the correct reading by going 香港[ ]地少人多 --- where the [ ] is an invisible spacer. My contention is that most users don't know how to do that in their favorite word processor.
Someone is probably thinking: could you add "香港地少" as a fragment? A purist would say it's not pretty, but I'm a pragmatist, so I did apply many of these patches. Whether to do this relies on some acumen as a native speaker, and there were hundreds of these decisions made. This language knowledge would be necessary if someone were to do Mandarin (or Thai, or ...))
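To make the two workarounds concrete, here is a small Python sketch that models the matching on plain strings. It is only an approximation under stated assumptions: a real text shaper may treat U+200B differently, the function name `readings` is hypothetical, and the `char:default` / `char:alt` labels are placeholders rather than real romanisation.

```python
ZWSP = "\u200b"  # zero width space, the invisible spacer


def readings(text, defaults, fragments):
    """Greedy longest-match over fragments; everything else gets its default."""
    longest = max(map(len, fragments), default=1)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in fragments:
                out.extend(zip(chunk, fragments[chunk]))
                i += n
                break
        else:
            if text[i] != ZWSP:  # the spacer itself carries no sound
                out.append((text[i], defaults.get(text[i])))
            i += 1
    return out


PHRASE = "香港地少人多"
DEFAULTS = {ch: ch + ":default" for ch in PHRASE}

# The commonly used fragment, where 地 takes its special sound.
FRAGMENTS = {"香港地": ("香:default", "港:default", "地:alt")}

# Workaround 1: the user inserts an invisible spacer after 香港.
SPACED = "香港" + ZWSP + "地少人多"

# Workaround 2: the font maker patches in a longer fragment that wins the match.
PATCHED = dict(FRAGMENTS)
PATCHED["香港地少"] = ("香:default", "港:default", "地:default", "少:default")

if __name__ == "__main__":
    print(dict(readings(PHRASE, DEFAULTS, FRAGMENTS))["地"])
    print(dict(readings(SPACED, DEFAULTS, FRAGMENTS))["地"])
    print(dict(readings(PHRASE, DEFAULTS, PATCHED))["地"])
```

The spacer puts the burden on the writer, while the patched fragment puts it on whoever maintains the fragment list, which is where the native-speaker judgment comes in.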
jfk13|2 years ago
I notice you're using OpenType-SVG here; have you investigated whether it would be possible to implement this using COLRv1 (which would potentially result in a lighter-weight font, I suspect, and eventually wider support)? Or are there technical limitations in COLRv1 that make it impossible?
jkwchui|2 years ago
But I did try to make it into COLRv1 (as well as COLR/CPAL). The only tools that build COLRv1 right now are the tools from the Google Fonts team; I remember them stalling for hours before reporting completion, yet the output was broken (I can't remember how it was broken).
I personally would love to see a COLR/CPAL version, and have some ideas on how that could happen. But I probably should be working on some revenue-generating product instead ;)
creamyhorror|2 years ago
jkwchui|2 years ago
The history of digital fonts added a great deal of complexity to font formats, and without him writing such a concise yet comprehensive guide, I would have been stuck for even longer.