(no title)
dryark | 3 months ago
I’ve been working on doing exactly that. Reconstructing clean vector glyphs from old metal-type Japanese books. The quality of those prints is surprisingly high, and they include thousands of kanji in consistent style. With some new technological innovations and a reasonable amount of hard work, you can produce a completely new, fully legal font family without touching any commercial IP.
The method I've devised is proprietary, but I’ll say this: it’s absolutely possible, and the output rivals modern JP fonts.
Given the sudden jump from ~$300/year to ~$20k/year for some devs, I expect more people to go down the “rebuild from PD artifacts” route instead of staying locked to a monopoly.
oliwarner | 3 months ago
A few hours later, you have a font you can use how you like. Is it as good? Probably not, but it's much cheaper.
Edit: oh look https://news.ycombinator.com/item?id=46127400
dryark | 3 months ago
This isn't like anything done before. It's an entirely different approach, and the quality is higher than any result you can get through AI or OCR.
That said, I agree that careful, detailed work is required to do it correctly and produce high-quality results. I'm not offhandedly saying "just do these simple things and bam, perfection."
afandian|3 months ago
How do you match up the scans with unicode entities? Human supervision and/or OCR? To what extent is the breadth and quality of OCR the limiting factor?
How do you define your target entity coverage?
dryark | 3 months ago
1. Latin vs. CJK differences

Latin glyphs are structurally simple: limited stroke vocabulary, mostly predictable modulation, and relatively low topological variation. Once you can recover outlines and stroke junctions accurately, mapping to Unicode is almost trivial.
That can be done with standard OCR methods for Latin.
CJK is the opposite. Each character is effectively a miniature blueprint with dozens of micro-decisions: stroke order, brush pressure artifacts, serif style, shape proportion, and even regional typographic conventions. Treating it like Latin “but bigger” doesn’t work. So the workflow for CJK has extra normalization steps and more constraints, especially when reconstructing consistent glyph families rather than one-offs.
Put simply, CJK has many characters made up of disconnected pieces that are still part of the same character.
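To make the disconnected-pieces problem concrete, here is a minimal sketch (not the author's proprietary method) of why naive per-component segmentation fails and how components can be regrouped into glyph cells. The function name, box format, and gap threshold are all hypothetical illustrations:

```python
# Characters like 川 or い print as several separate ink marks, so
# per-component segmentation would split one character into many.
# This toy grouper merges component bounding boxes separated by a
# small horizontal gap, approximating cell grouping in a line of type.

def merge_into_cells(boxes, max_gap=8):
    """boxes: list of (x0, y0, x1, y1) ink components, merged into glyph cells."""
    cells = []
    for box in sorted(boxes):
        if cells and box[0] - cells[-1][2] <= max_gap:
            # Close to the previous cell: same character, widen the cell.
            c = cells[-1]
            cells[-1] = (min(c[0], box[0]), min(c[1], box[1]),
                         max(c[2], box[2]), max(c[3], box[3]))
        else:
            cells.append(box)
    return cells

# Three ink blobs: two strokes of one character plus a second character.
components = [(10, 10, 18, 40), (22, 10, 30, 40), (50, 10, 80, 40)]
print(merge_into_cells(components))  # → [(10, 10, 30, 40), (50, 10, 80, 40)]
```

A real pipeline would also have to use vertical overlap and the known em size of the type, but the core point stands: the unit of reconstruction is the type cell, not the connected component.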
2. How we match scans to Unicode entities

We don't rely on conventional OCR at all. OCR engines are optimized for reading text, not recovering the underlying design intent. Our process is closer to forensic glyph analysis: reconstructing stable structural signatures, then mapping those signatures to references.
This ends up being a hybrid:
• deterministic structural matching
• limited supervised correction when ambiguity exists
• zero reliance on any off-the-shelf OCR models
It’s not “OCR first, match later.” It’s “reconstruct the letterpress structure, then Unicode becomes a lookup.” OCR quality literally doesn’t limit us because OCR isn’t part of the critical path.
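The actual signature scheme is proprietary, but the "reconstruct structure, then Unicode becomes a lookup" shape can be sketched with an invented toy signature (per-scanline ink-run counts) and an invented reference table; none of these names or signatures come from the real pipeline:

```python
# Toy version of deterministic structural matching: compute a crude
# structural fingerprint of a glyph bitmap, then resolve identity by
# dictionary lookup against references built from known glyphs.

def runs(row):
    """Count contiguous ink runs in one scanline."""
    count, prev = 0, 0
    for px in row:
        if px and not prev:
            count += 1
        prev = px
    return count

def signature(bitmap):
    """Tuple of per-row ink-run counts: a crude topological fingerprint."""
    return tuple(runs(row) for row in bitmap)

# Hypothetical reference table keyed by signature, built from glyphs
# whose identity is already known.
REFERENCES = {
    (1, 1, 1): "十",   # one run per row: a cross shape
    (3, 3, 3): "川",   # three vertical strokes
}

def identify(bitmap):
    return REFERENCES.get(signature(bitmap))

plus = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
print(identify(plus))  # → 十 — a lookup, with no OCR in the path
```

A signature this crude would collide constantly on real kanji; the point is only the control flow: structure recovery is the hard step, and identification afterward is deterministic.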
3. What determines coverage

Coverage is defined by what we can physically access and reconstruct cleanly. For Latin, coverage is straightforward. For CJK, coverage is shaped by:
• typeface completeness in the source material
• the consistency of impression depth
• survivability of fine strokes in early printings
• the practical question of how many thousand characters the original font designer actually cut
There's no need for the entire Unicode set per book; the historical font only ever covered a finite subset. It's unfortunate that no single book uses every glyph, but it's not catastrophic: we can source many public domain books from the same era and eventually find enough characters in a matching style.
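The accumulate-across-books idea reduces to simple set arithmetic. This sketch uses invented book names and a tiny invented target set purely for illustration:

```python
# Coverage grows by scanning more same-era books until the target
# character set is filled; no single book contains every glyph.

TARGET = set("日本語活字")  # stand-in for a real target set (e.g. a JIS level)

# Hypothetical per-book glyph inventories from same-era scans.
books = {
    "book_A_1903": set("日本"),
    "book_B_1905": set("語活"),
    "book_C_1904": set("活字日"),
}

covered = set()
for name, glyphs in books.items():
    gained = (glyphs & TARGET) - covered
    covered |= gained
    print(f"{name}: +{len(gained)} glyphs, {len(covered)}/{len(TARGET)} covered")

print("missing:", TARGET - covered)  # → empty once the era's books suffice
```

The same bookkeeping also tells you when to stop scanning and which rare characters still need another source.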
In short: Latin is an engineering challenge. CJK is an archaeological one. OCR is not a bottleneck because we don’t use it. Coverage follows the historical material, not Unicode completeness.