Show HN: I modeled the Voynich Manuscript with SBERT to test for structure
381 points | brig90 | 9 months ago | github.com
The Voynich Manuscript is a 15th-century book written in an unknown script. No one’s been able to translate it, and many think it’s a hoax, a cipher, or a constructed language. I wasn’t trying to decode it — I just wanted to see: does it behave like a structured language?
I stripped a handful of common suffix-like endings (aiin, dy, etc.) to isolate what looked like root forms. I know that’s a strong assumption — I call it out directly in the repo — but it helped clarify the clustering. From there, I used SBERT embeddings and KMeans to group similar roots, inferred POS-like roles based on position and frequency, and built a Markov transition matrix to visualize cluster-to-cluster flow.
It’s not translation. It’s not decryption. It’s structural modeling — and it revealed some surprisingly consistent syntax across the manuscript, especially when broken out by section (Botanical, Biological, etc.).
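The suffix-stripping and transition-matrix steps described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the suffix list contains only the two endings named in the post, the `root_to_cluster` mapping is a made-up stand-in for the SBERT + KMeans step (which is not shown here), and the function names are hypothetical.

```python
# Hypothetical suffix list: the post names only "aiin" and "dy".
SUFFIXES = ("aiin", "dy")


def strip_suffix(word, suffixes=SUFFIXES):
    """Strip the longest matching suffix, keeping at least one root character."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf):
            return word[: -len(suf)]
    return word


def transition_matrix(cluster_ids, n_clusters):
    """Row-normalised first-order Markov matrix over a sequence of cluster IDs."""
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for current, nxt in zip(cluster_ids, cluster_ids[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no outgoing transitions stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy EVA-style words; the cluster assignment below is invented for illustration.
words = ["daiin", "chedy", "qokaiin", "shedy", "daiin", "chedy"]
roots = [strip_suffix(w) for w in words]
root_to_cluster = {"d": 0, "che": 1, "qok": 2, "she": 1}
ids = [root_to_cluster[r] for r in roots]
matrix = transition_matrix(ids, n_clusters=3)
```

Each row of `matrix` gives the estimated probability of moving from one cluster to each of the others, which is the quantity a cluster-to-cluster flow diagram would visualize.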
GitHub repo: https://github.com/brianmg/voynich-nlp-analysis
Write-up: https://brig90.substack.com/p/modeling-the-voynich-manuscrip...
I’m new to the NLP space, so I’m sure there are things I got wrong — but I’d love feedback from people who’ve worked with structured language modeling or weird edge cases like this.
patcon|9 months ago
I've been working on a project related to a sensemaking tool called Pol.is [1], but reprojecting its wiki survey data with these new algorithms instead of PCA, and it's amazing what new insight they uncover!
https://patcon.github.io/polislike-opinion-map-painting/
Painted groups: https://t.co/734qNlMdeh
(Sorry, only really works on desktop)
[1]: https://www.technologyreview.com/2025/04/15/1115125/a-small-...
brig90|9 months ago
loxias|9 months ago
This ain't your parents' "factor analysis".
khafra|9 months ago
staticautomatic|9 months ago
minimaxir|9 months ago
The traditional NLP techniques of stripping suffixes and POS identification may actually harm embedding quality rather than improve it, since they remove relevant contextual data from the global embedding.
brig90|9 months ago
Appreciate you calling that out — that’s a great push toward iteration.
thih9|9 months ago
Does it make sense to check the process with a control group?
E.g. if we ask a human to write something that resembles a language but isn’t, then conduct this process (remove suffixes, attempt grouping, etc), are we likely to get similar results?
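One cheap way to approximate a control like this, short of commissioning human-written pseudo-text, is to compare the real token sequence against a shuffled copy of itself. The sketch below is an assumption-laden stand-in rather than the control proposed above: it measures the conditional entropy of the next token given the current one, and the function names and toy sequence are hypothetical.

```python
import math
import random
from collections import Counter


def conditional_entropy(seq):
    """H(next | current) in bits, estimated from adjacent pairs in seq."""
    pair_counts = Counter(zip(seq, seq[1:]))
    first_counts = Counter(seq[:-1])
    total = sum(pair_counts.values())
    entropy = 0.0
    for (current, _nxt), n in pair_counts.items():
        entropy -= (n / total) * math.log2(n / first_counts[current])
    return entropy


def shuffled_control(seq, seed=0):
    """Same tokens, random order: keeps frequencies, destroys sequence structure."""
    rng = random.Random(seed)
    out = list(seq)
    rng.shuffle(out)
    return out


# Toy, perfectly regular sequence of cluster IDs (not real manuscript data).
structured = [0, 1, 2] * 50
control = shuffled_control(structured)
h_structured = conditional_entropy(structured)
h_control = conditional_entropy(control)
```

If the real text scores about the same as its shuffled control, the apparent "syntax" may be an artifact of token frequencies alone; a clearly lower entropy suggests genuine sequential structure, though not necessarily meaning.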
flir|9 months ago
awinter-py|9 months ago
cedws|9 months ago
I wanted to do an analysis of what letters occur just before/after a line break to see if there is a difference from the rest of the text, but couldn't find a transcribed version.
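A line-edge analysis of this kind could be sketched as below, assuming a plain-text transliteration with one manuscript line per text line. The sample lines are invented for illustration and the function names are hypothetical.

```python
from collections import Counter


def edge_vs_overall(lines):
    """Count characters overall, at line starts, and at line ends."""
    overall, first, last = Counter(), Counter(), Counter()
    for line in lines:
        chars = [c for c in line if not c.isspace()]
        if not chars:
            continue  # skip blank lines
        overall.update(chars)
        first[chars[0]] += 1
        last[chars[-1]] += 1
    return overall, first, last


def relative(counter):
    """Convert raw counts to relative frequencies."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


# Made-up EVA-like lines, not taken from any real transcription.
sample = ["daiin chedy qokam", "shedy daiin otam", "qokedy chol dam"]
overall, first, last = edge_vs_overall(sample)
overall_freq = relative(overall)
last_freq = relative(last)
```

Comparing `last_freq` (or the line-initial equivalent) against `overall_freq` shows which characters are over-represented at line edges relative to the text as a whole.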
My completely amateur take is that it's an elaborate piece of art or hoax.
IAmBroom|9 months ago
ting words at the end of lines.
tetris11|9 months ago
Reference mapping each cluster to all the others would be a nice way to indicate that there's no variability left in your analysis
brig90|9 months ago
And yes to the cross-cluster reference idea — I didn’t build a similarity matrix between clusters, but now that you’ve said it, it feels like an obvious next step to test how much signal is really being captured.
Might spin those up as a follow-up. Appreciate the thoughtful nudge.
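A cluster-to-cluster similarity matrix of the kind discussed here might look roughly like this: average the embeddings in each cluster into a centroid, then take pairwise cosine similarity. The 2-D vectors below are toy values standing in for SBERT embeddings, and all names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_similarity(embeddings, labels):
    """Return sorted cluster IDs and the centroid-to-centroid cosine matrix."""
    groups = {}
    for vec, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(vec)
    ids = sorted(groups)
    centroids = [centroid(groups[i]) for i in ids]
    return ids, [[cosine(a, b) for b in centroids] for a in centroids]


# Toy 2-D "embeddings"; real SBERT vectors would have hundreds of dimensions.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
ids, sim = cluster_similarity(embeddings, labels)
```

Off-diagonal values close to 1 would suggest that two clusters are barely distinguishable and that the chosen number of clusters is splitting one group artificially.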
lukeinator42|9 months ago
jszymborski|9 months ago
(Before I get yelled at, this isn't prescriptive, it's a personal preference.)
DonaldFisk|9 months ago
I'm not familiar with SBERT, or with modern statistical NLP in general, but SBERT works on sentences, and there are no obvious sentence delimiters in the Voynich Manuscript (only word and paragraph delimiters). One concern I have is "Strips common suffixes from Voynich words". Words in the Voynich Manuscript appear to be prefix + suffix, so as prefixes are quite short, you've lost roughly half the information before commencing your analysis.
You might want to verify that your method works for meaningful text in a natural language, and also for meaningless gibberish (encrypted text is somewhere in between, with simpler encryption methods closer to natural language and more complex ones closer to meaningless gibberish). Gordon Rugg, Torsten Timm, and I have produced text which closely resembles the Voynich Manuscript by different methods. Mine is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.h... and the equivalent EVA is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.t...
Avicebron|9 months ago
brig90|9 months ago
I didn’t re-map anything back to glyphs in this project — everything’s built off those EVA transliterations as a starting point. So if "okeeodair" exists in the dataset, that’s because someone much smarter than me saw a sequence of glyphs and agreed to call it that.
us-merul|9 months ago
The author made an assumption that Voynichese is a Germanic language, and it looks like he was able to make some progress with it.
I’ve also come across accounts that it might be an Uralic or Finno-Ugric language. I think your approach is great, and I wonder if tweaking it for specific language families could go even further.
veqq|9 months ago
philistine|9 months ago
It's not a mental issue, it's just a rare thing that happens. Voynich fits the whole bill for the work of a naive artist.
GolfPopper|9 months ago
1. https://en.wikipedia.org/wiki/Edward_Kelley
2. https://en.wikipedia.org/wiki/Cardan_grille
quantadev|9 months ago
Also there might be some characters that are in there just to confuse. For example that bizarre capital "P"-like thing that has multiple variations seems to appear sometimes far too often to represent real language, so it might be just an obfuscator that's removed prior to decryption. There may be other characters that are abnormally "frequent" and they're maybe also unused dummy characters. But the "too many Ps" problem is also consistent with just pure fiction too, I realize.
codesnik|9 months ago
Unless the author had written tens of books exactly like that before, which didn't survive, of course.
I don't think it's a very novel idea, but I wonder if there's any analysis of patterns like that. I haven't seen mentions of page-to-page consistency anywhere.
veqq|9 months ago
A lot of work's been done here. There are believed to have been 2 scribes (see Prescott Currier), although Lisa Fagin Davis posits 5. Here's a discussion of an experiment working off of Fagin Davis' position: https://www.voynich.ninja/thread-3783.html
empath75|9 months ago
bunderbunder|9 months ago
I'd argue that these are just the camps that non-traditional, amateur analysis efforts fall into. I've only briefly skimmed Voynich work, but my impression is that, traditionally, more academic analyses rely on a combination of linguistic and cryptological analysis. This does happen to be informed by some statistical analysis, but goes way beyond that.
For example, as I recall, the strongest argument that Voynichese probably isn't just an alternative alphabet for a well-known language relies on comparing Voynichese to the general patterns for how writing systems map symbols to sounds. That permits the development of more specific hypotheses about how it could possibly function, including how likely it is to be an alphabet or abjad, and hypotheses about which characters could plausibly represent more than one sound, possible digraphs, etc. All of that work casts severe doubt on the likelihood of it representing a language from the area, because it just can't plausibly represent a language with the kinds of phonological inventories we see in the language families that existed in that place and time.
There's also been some pretty interesting work on identifying individual scribes based on a confluence of factors including, but not limited to, analysis of the text itself. Some of the inferred scribes exclusively wrote in the A language (oh yeah, Voynichese seems to contain two distinct "languages"), some exclusively wrote in the B language, and I think they've even hypothesized that there's one who actually used both languages.
There isn't a lot of popular awareness of this work because it's not terribly sexy to anyone but a linguistics nerd. But I'd guess that any attempt to poke at the Voynich manuscript that isn't informed by it is operating at a severe disadvantage. You want to be standing on the shoulders of the tallest giants, not the ones with the best social media presence.
gwillen|9 months ago
brig90|9 months ago
tough|9 months ago
many english as second language speakers use LLMs as translators nowadays tho
pawanjswal|9 months ago
brig90|9 months ago
That second part wasn’t super important though — this was more about learning and experimenting than trying to break new ground. Really appreciate the kind words, and hopefully it sparks someone to take it even further.
user32489318|9 months ago
frozenseven|9 months ago
brig90|9 months ago
ck2|9 months ago
https://arstechnica.com/science/2024/09/new-multispectral-an...
but imagine if it was just a (wealthy) child's coloring book or practice book for learning to write lol
Avicebron|9 months ago
Even if it was "just" an (extraordinarily wealthy and precocious) child with a fondness for plants, cosmology, and female bodies carefully inscribing nonsense by repeatedly doodling the same few characters in blocks that look like the illuminated manuscripts this child would also need access to, that's still impressive and interesting.
bdbenton5255|9 months ago
brig90|9 months ago
Appreciate the nudge — always fascinating to see where people take this kind of thinking.
PaulDavisThe1st|9 months ago
andrewla|9 months ago
[1] https://www.voynich.nu/transcr.html
marcodiego|9 months ago
munchler|9 months ago
brig90|9 months ago
The challenge (as I understand it) is that the vocabulary size is pretty massive — thousands of unique words — and the structure might not be 1:1 with how real language maps. Like, is a “word” in Voynich really a word? Or is it a chunk, or a stem with affixes, or something else entirely? That makes brute-forcing a direct mapping tricky.
That said… using cluster IDs instead of individual words (tokens) and scoring the outputs with something like a language model seems like a pretty compelling idea. I hadn’t thought of doing it that way. Definitely some room there for optimization or even evolutionary techniques. If nothing else, it could tell us something about how “language-like” the structure really is.
Might be worth exploring. Thanks for tossing that out; hopefully someone with more awareness or knowledge in the space sees it!
mellow_observer|9 months ago
Peculiarities in Voynich also suggest that one-to-one word mappings are very unlikely to result in well-described languages. For instance, there are cases of repeated word sequences you don't really see in regular text, there's a lack of the extremely common words you would expect to be necessary for a word-based structured grammar, there are signs of at least two 'languages', character distributions within words don't match any known language, etc.
If there still is a real unencoded language in here, it's likely to be entirely different from any known language.
gthompson512|9 months ago
brig90|9 months ago
Clustering by sentence or page would be interesting too — I haven't gone that far yet, but it’d be fascinating to see if there’s consistency across visual/media sections. Appreciate the insight!
bpiroman|9 months ago
https://www.youtube.com/watch?v=p6keMgLmFEk&t=1s
bpiroman|9 months ago
https://youtu.be/p6keMgLmFEk?feature=shared&t=559
thearn4|9 months ago
theRealEros|9 months ago
GTP|9 months ago
brig90|9 months ago
rossant|9 months ago
adzm|9 months ago
mach5|9 months ago
Tade0|9 months ago
AStonesThrow|9 months ago
glimshe|9 months ago
lolinder|9 months ago
himinlomax|9 months ago
The age of the document can be estimated through various methods that all point to it being ~500 years old. The vellum parchment, the ink, the pictures (particularly clothes and architecture) are perfectly congruent with that.
The weirdest part is that the script has a very low number of different signs, fewer than any known language. That's about the only clue that could point to a hoax afaik.
poulpy123|9 months ago
As far as I know it's just gibberish since it doesn't follow the statistics of the known languages or cyphers of the time.
andyjohnson0|9 months ago
I have no background in NLP or linguistics, but I do have a question about this:
> I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants
This seems to imply stripping the right-hand edges of words, with the assumption that the text was written left to right? Or did you try both possibilities?
Once again, nice work.
brig90|9 months ago
fader|9 months ago
veqq|9 months ago
https://www.voynich.ninja/thread-4327-post-60796.html#pid607... is the main forum discussing precisely this. I quite liked this explanation of the apparent structure: https://www.voynich.ninja/thread-4286.html
> RU SSUK UKIA UK SSIAKRAINE IARAIN RA AINE RUK UKRU KRIA UKUSSIA IARUK RUSSUK RUSSAINE RUAINERU RUKIA
That is, there may be 2 "word types" with different statistical properties (as Feaster's video above describes), perhaps e.g. 2 different cyphers used "randomly" next to each other. Figuring out how to imitate the MS's statistical properties would let us determine the cypher system and make steps towards determining its language etc., so most credible work's gone in this direction over the last 10+ years.
This site is a great introduction/deep dive: https://www.voynich.nu/
brig90|9 months ago
akomtu|9 months ago
nine_k|9 months ago
<quote>
Key Findings
* Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group
* Cluster 3 has high diversity and flexible positioning — likely a root content class
* Transition matrix shows strong internal structure, far from random
* Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)
Hypothesis
The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.
</quote>
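The per-cluster properties quoted above (frequency, diversity, line-start behavior) could be computed along these lines. This is only a sketch: the definitions of "diversity" (distinct roots divided by tokens) and "line-start rate" are assumptions, not necessarily what the repo uses, and the toy roots and cluster numbers are invented.

```python
from collections import defaultdict


def cluster_profile(lines, root_to_cluster):
    """Per-cluster frequency, type/token diversity, and line-start rate.

    lines: list of lines, each a list of root strings.
    root_to_cluster: mapping from root to cluster ID.
    """
    tokens = defaultdict(int)
    types = defaultdict(set)
    starts = defaultdict(int)
    for line in lines:
        for pos, root in enumerate(line):
            cluster = root_to_cluster[root]
            tokens[cluster] += 1
            types[cluster].add(root)
            if pos == 0:
                starts[cluster] += 1
    total = sum(tokens.values())
    return {
        cluster: {
            "frequency": count / total,
            "diversity": len(types[cluster]) / count,  # assumed definition
            "line_start_rate": starts[cluster] / count,  # assumed definition
        }
        for cluster, count in tokens.items()
    }


# Invented data: one repetitive root in cluster 8, varied roots in cluster 3.
lines = [["d", "che", "qok"], ["d", "she", "d"], ["d", "ot"]]
root_to_cluster = {"d": 8, "che": 3, "qok": 3, "she": 3, "ot": 3}
profile = cluster_profile(lines, root_to_cluster)
```

In this toy data, cluster 8 is frequent, low-diversity, and often line-initial, which is the profile the quoted findings interpret as function-word-like; cluster 3 shows the opposite pattern.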
brig90|9 months ago
timonofathens|9 months ago
[deleted]
cookiengineer|9 months ago
[deleted]
brig90|9 months ago
My main goal was to learn and see if the manuscript behaved like a real language, not necessarily to translate it. Appreciate the link — I’ll check it out (once I get my German up to speed!).
Nursie|9 months ago
0points|9 months ago
So, sorry but you are not busting any bubbles today.
ablanton|9 months ago
https://www.researchgate.net/publication/368991190_The_Voyni...
Reubend|9 months ago
For more info, see https://www.voynich.ninja/thread-3940-post-53738.html#pid537...