(no title)
dryark | 3 months ago
I’ve been working on doing exactly that. Reconstructing clean vector glyphs from old metal-type Japanese books. The quality of those prints is surprisingly high, and they include thousands of kanji in consistent style. With some new technological innovations and a reasonable amount of hard work, you can produce a completely new, fully legal font family without touching any commercial IP.
The method I've devised is proprietary, but I’ll say this: it’s absolutely possible, and the output rivals modern JP fonts.
Given the sudden jump from ~$300/year to ~$20k/year for some devs, I expect more people to go down the “rebuild from PD artifacts” route instead of staying locked to a monopoly.
oliwarner | 3 months ago
A few hours later, you have a font you can use how you like. Is it as good? Probably not, but it's much cheaper.
Edit: oh look https://news.ycombinator.com/item?id=46127400
dryark | 3 months ago
This isn't like anything done before. It's an entirely different approach, and the quality is higher than any result you can get through AI or OCR.
That said, I agree that careful, detailed work is required to do it correctly and produce high-quality results. I'm not offhandedly saying "just do these simple things and bam, perfection."
afandian|3 months ago
How do you match up the scans with unicode entities? Human supervision and/or OCR? To what extent is the breadth and quality of OCR the limiting factor?
How do you define your target entity coverage?
dryark | 3 months ago
1. Latin vs. CJK differences

Latin glyphs are structurally simple: limited stroke vocabulary, mostly predictable modulation, and relatively low topological variation. Once you can recover outlines and stroke junctions accurately, mapping to Unicode is almost trivial.
That can be done with standard OCR methods for Latin.
CJK is the opposite. Each character is effectively a miniature blueprint with dozens of micro-decisions: stroke order, brush pressure artifacts, serif style, shape proportion, and even regional typographic conventions. Treating it like Latin “but bigger” doesn’t work. So the workflow for CJK has extra normalization steps and more constraints, especially when reconstructing consistent glyph families rather than one-offs.
Put simply, CJK has many characters made up of disconnected pieces that are still part of the same character.
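To make the disconnected-pieces problem concrete, here is a minimal sketch (not the author's proprietary method) of why naive per-component segmentation fails and how components can be regrouped into glyph cells. The function name, box format, and gap threshold are all hypothetical illustrations:

```python
# Characters like 川 or い print as several separate ink marks, so
# per-component segmentation would split one character into many.
# This toy grouper merges component bounding boxes separated by a
# small horizontal gap, approximating cell grouping in a line of type.

def merge_into_cells(boxes, max_gap=8):
    """boxes: list of (x0, y0, x1, y1) ink components, merged into glyph cells."""
    cells = []
    for box in sorted(boxes):
        if cells and box[0] - cells[-1][2] <= max_gap:
            # Close to the previous cell: same character, widen the cell.
            c = cells[-1]
            cells[-1] = (min(c[0], box[0]), min(c[1], box[1]),
                         max(c[2], box[2]), max(c[3], box[3]))
        else:
            cells.append(box)
    return cells

# Three ink blobs: two strokes of one character plus a second character.
components = [(10, 10, 18, 40), (22, 10, 30, 40), (50, 10, 80, 40)]
print(merge_into_cells(components))  # → [(10, 10, 30, 40), (50, 10, 80, 40)]
```

A real pipeline would also have to use vertical overlap and the known em size of the type, but the core point stands: the unit of reconstruction is the type cell, not the connected component.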
2. How we match scans to Unicode entities

We don't rely on conventional OCR at all. OCR engines are optimized for reading text, not recovering the underlying design intent. Our process is closer to forensic glyph analysis: reconstructing stable structural signatures, then mapping those signatures to references.
This ends up being a hybrid:
• deterministic structural matching
• limited supervised correction when ambiguity exists
• zero reliance on any off-the-shelf OCR models
It’s not “OCR first, match later.” It’s “reconstruct the letterpress structure, then Unicode becomes a lookup.” OCR quality literally doesn’t limit us because OCR isn’t part of the critical path.
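The actual signature scheme is proprietary, but the "reconstruct structure, then Unicode becomes a lookup" shape can be sketched with an invented toy signature (per-scanline ink-run counts) and an invented reference table; none of these names or signatures come from the real pipeline:

```python
# Toy version of deterministic structural matching: compute a crude
# structural fingerprint of a glyph bitmap, then resolve identity by
# dictionary lookup against references built from known glyphs.

def runs(row):
    """Count contiguous ink runs in one scanline."""
    count, prev = 0, 0
    for px in row:
        if px and not prev:
            count += 1
        prev = px
    return count

def signature(bitmap):
    """Tuple of per-row ink-run counts: a crude topological fingerprint."""
    return tuple(runs(row) for row in bitmap)

# Hypothetical reference table keyed by signature, built from glyphs
# whose identity is already known.
REFERENCES = {
    (1, 1, 1): "十",   # one run per row: a cross shape
    (3, 3, 3): "川",   # three vertical strokes
}

def identify(bitmap):
    return REFERENCES.get(signature(bitmap))

plus = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
print(identify(plus))  # → 十 — a lookup, with no OCR in the path
```

A signature this crude would collide constantly on real kanji; the point is only the control flow: structure recovery is the hard step, and identification afterward is deterministic.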
3. What determines coverage

Coverage is defined by what we can physically access and reconstruct cleanly. For Latin, coverage is straightforward. For CJK, coverage is shaped by:
• typeface completeness in the source material
• the consistency of impression depth
• survivability of fine strokes in early printings
• the practical question of how many thousand characters the original font designer actually cut
There's no need for the entire Unicode set per book; the historical font only ever covered a finite subset. It's unfortunate that no single book uses every glyph, but it's not catastrophic: we can source many public domain books from the same era and eventually find enough characters in a matching style.
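The accumulate-across-books idea reduces to simple set arithmetic. This sketch uses invented book names and a tiny invented target set purely for illustration:

```python
# Coverage grows by scanning more same-era books until the target
# character set is filled; no single book contains every glyph.

TARGET = set("日本語活字")  # stand-in for a real target set (e.g. a JIS level)

# Hypothetical per-book glyph inventories from same-era scans.
books = {
    "book_A_1903": set("日本"),
    "book_B_1905": set("語活"),
    "book_C_1904": set("活字日"),
}

covered = set()
for name, glyphs in books.items():
    gained = (glyphs & TARGET) - covered
    covered |= gained
    print(f"{name}: +{len(gained)} glyphs, {len(covered)}/{len(TARGET)} covered")

print("missing:", TARGET - covered)  # → empty once the era's books suffice
```

The same bookkeeping also tells you when to stop scanning and which rare characters still need another source.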
In short: Latin is an engineering challenge. CJK is an archaeological one. OCR is not a bottleneck because we don’t use it. Coverage follows the historical material, not Unicode completeness.