Computational philology
A weighted edit distance for restoring damaged Byzantine Greek — and a statistically significant gain on the variants real manuscripts preserve.
Two Greek words that differ by one letter-pair. A Byzantine scribe taking dictation could not hear the difference between ταις and τοις — both were pronounced /i/. Plain edit distance scores that slip exactly like any unrelated one-letter change. The weighted scheme makes it cost 0.35 instead of 1.00.
Ancient Greek reaches us through centuries of hand-copying. Letters smudge, words drop out, scribes mishear from dictation or misread a neighbour’s hand. Princeton’s Logion, a BERT language model trained on premodern Greek, reads a passage, flags where the transmitted word looks wrong, and proposes what belongs there. It is blunt in one respect: every letter swap costs the same. Weight swaps by how easily letters are actually confused, and the suggestions match how scribes really err, recovering more of the variants real manuscripts preserve.
The mechanism
Logion scores spellings near a suspect word. Weighting the swaps changes how that shortlist is ranked.
For each word, Logion asks: how likely is what’s written, against the most likely near-spelling it can imagine in that spot? When some other spelling is far more probable in context, the word is surfaced for an editor.
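One plausible reading of that comparison is a simple probability ratio. The sketch below is illustrative only: the function name, the threshold, and the probabilities are all invented here, and Logion's actual scoring may differ in detail.

```python
def suspicion_ratio(written, candidate_probs):
    """Return the best alternative spelling and how many times more
    probable it is, in context, than the word actually written."""
    p_written = candidate_probs[written]
    best_alt, p_best = max(
        ((w, p) for w, p in candidate_probs.items() if w != written),
        key=lambda wp: wp[1],
    )
    return best_alt, p_best / p_written


# Hypothetical in-context probabilities, invented for illustration;
# in practice they would come from the language model.
probs = {"ταις": 0.02, "τοις": 0.60, "τας": 0.05}

best, ratio = suspicion_ratio("ταις", probs)

THRESHOLD = 10.0  # assumed cut-off, not a value from the project
flagged = ratio > THRESHOLD  # True here: τοις is far more probable than ταις
```

A ratio keeps the test relative: a rare word is not flagged merely for being rare, only when a near-spelling fits the context much better.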
Building that set of near-spellings needs a notion of close. Logion uses edit distance — the number of single-character changes that turn one word into another.
Plain edit distance is blunt. Changing one letter to any other costs exactly one — whether the two are constantly confused or unrelated. Logion’s original filter is blunter still: a yes/no gate that admits every word within distance 1 and treats them all alike.
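A minimal sketch of that baseline, assuming plain Levenshtein distance over unaccented strings. The helper names and the toy vocabulary are hypothetical, not Logion's own code.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: every insertion, deletion, or
    substitution costs exactly 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def within_gate(word, vocabulary, max_distance=1):
    """Yes/no gate: keep every other word within max_distance, unranked."""
    return [w for w in vocabulary
            if w != word and edit_distance(word, w) <= max_distance]


# Toy vocabulary, for illustration only.
vocab = ["τοις", "τας", "ταις", "λογος"]
shortlist = within_gate("ταις", vocab)  # τοις and τας pass; λογος does not
```

Both survivors come out of the gate with the same standing, even though one is a substitution and the other a dropped letter.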
So every candidate on the shortlist arrives as an equal. The model has discarded the one thing a philologist knows cold: which mistakes scribes actually make.
Give each substitution a cost from documented scribal confusions — itacistic vowels heard alike, look-alikes in the minuscule hand. Easily-confused letters cost little; unrelated ones cost the full amount. The hard gate becomes a smooth weight.
The shortlist re-sorts. The itacistic neighbour τοις — one /i/-swap away — climbs to the top, and the implausible candidates sink.
An itacism swap like ταις / τοις costs 0.35, about 2.9× cheaper than an unrelated same-distance edit like τις / οις, which costs the full 1.00.
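The weighted scheme can be sketched as the same dynamic programme with a substitution-cost lookup. Everything named here is an assumption: the table holds a single entry chosen only to reproduce the 0.35 figure quoted above, it is keyed on single letters for brevity, and the project's real table (which would likely need to handle digraphs such as αι and οι) is not shown in this article.

```python
DEFAULT_COST = 1.00  # full cost for unrelated letters, insertions, deletions

# Hypothetical confusion table: unordered letter pair -> substitution cost.
CONFUSION_COST = {frozenset("αο"): 0.35}


def sub_cost(a: str, b: str) -> float:
    """Cost of substituting letter a with letter b."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get(frozenset((a, b)), DEFAULT_COST)


def weighted_distance(a: str, b: str) -> float:
    """Levenshtein distance with per-pair substitution costs."""
    prev = [j * DEFAULT_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * DEFAULT_COST]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + DEFAULT_COST,          # delete ca
                curr[j - 1] + DEFAULT_COST,      # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # weighted substitution
            ))
        prev = curr
    return prev[-1]


# Toy candidates, each one plain edit away from ταις.
candidates = ["τας", "τοις", "παις"]
ranked = sorted(candidates, key=lambda w: weighted_distance("ταις", w))
# τοις now sorts first; the other two keep the full cost of 1.00.
```

Sorting by cost is the simplest way to show the re-ranking. How the cost is combined with the language model's probabilities is not specified in this article, so that step is left out.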
The result
Tested against real attested variants — 333 spots in the SBLGNT apparatus where the manuscripts genuinely disagree — the weighted filter recovers more of them. This is the fairest test: real scribal errors, not artificial ones.
Paired McNemar test, p < 10⁻³: the weighted filter recovered 28 loci the baseline missed and lost only 6. Top-1 trends positive but isn’t significant (16.5% vs. 14.4%, p = 0.30).
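The discordant counts are enough to check that p-value by hand. The sketch below assumes the exact (binomial) form of McNemar's test, two-sided; the article does not say which variant was used, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: loci only the weighted filter recovered
    c: loci only the baseline recovered
    Under the null hypothesis each discordant locus is a fair coin flip.
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p = mcnemar_exact(28, 6)  # counts quoted in the text above
```

Under this variant the p-value comes out at roughly 2 × 10⁻⁴, consistent with the reported bound. Loci where both filters succeed or both fail do not enter the test at all, which is why only the 28 and the 6 matter.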
One caveat worth naming: a variant isn’t always an error — many SBLGNT disagreements are between defensible readings — so this measures whether Logion finds a locus unusual, not whether it’s wrong. The weights are hand-tuned from textbooks; the full ablation, the controlled protocols, and a deployment on Photius are in the PDF.