Autoformalization · Lean 4

Can an AI rewrite Euclid into a language a computer can check?

And if it can — what decides whether it works: the wording of the math, or how you ask?

A study by Reza Ramji · GPT-4o · 2,653 runs

Euclid, Book I · human prose

On a given finite straight line AB, to construct an equilateral triangle.

Lean 4 · machine-checkable

theorem prop_I_1 (A B : Point) : ∃ C, equilateral A B C := …

prose → formal · if it compiles and checks, it is correct — no trust required

Scroll

The question

A textbook proof is written for people. Lean 4 is a language a computer can verify line by line: if a Lean proof compiles and type-checks, it is correct — no appeal to authority required. Getting a language model to do that translation for us is autoformalization.

So: can an AI put Euclid's Elements into that machine-checkable form — and what decides whether it works? Two suspects. The wording of the math (which English translation you feed in), or how you ask (the style of the examples in your prompt).

The experiment

Two knobs, one geometry

Wording (three translations) vs. how you ask (three prompts).

Compile rate by translation × prompt style (Book I, GPT-4o)
	Fitz. prompt	Heath prompt	Mod. prompt
Fitzpatrick	76.3%	60.0%	45.9%
Heath	67.9%	65.5%	52.6%
Modern	73.5%	63.2%	46.8%

Down a column (prompt fixed) the translations barely separate; across a row (prompt changing) the rate collapses from 72.6% to 48.4%.

01 · What we varied

Two knobs: the wording, and the prompt

The first knob is wording. The 48 propositions of Euclid's Book I survive in three sharply different English registers — Heath (1908, archaic), Fitzpatrick (2007, variables named explicitly), and a plain Modern paraphrase. Same math, three voices.

The second knob is the prompt: the style of the worked examples we show GPT-4o before asking it to translate. We built three of those too.

02 · The design

Cross every wording with every prompt

To tell the two knobs apart, we crossed them: three translation registers × three prompt styles, fully counterbalanced. Every translation faces every prompt.

That 3×3 grid shows GPT-4o's compile rate — how often the Lean it produces actually type-checks — over 2,653 runs.

03 · Read it down

Change the wording — almost nothing happens

Hold a prompt style constant — read down a column — and the three translations land within a few points of each other. The wording you feed in hardly matters.

No significant effect (p=0.603).

04 · Read it across

Change the prompt — the result collapses

Now read across a row. Swap the prompt style and the compile rate falls off a cliff — from 72.6% with the best style to 48.4% with the worst.

A 24-point swing (p<0.001). The lever is how you ask, not the prose.

How you prompt matters.
Which translation you feed in does not.

What this means

To get autoformalization working on your own mathematics, the lever isn't cleaning up your prose — it's styling your few-shot examples to match how you write, especially whether they introduce variables explicitly.

These numbers are GPT-4o, on compilation rather than full semantic correctness. A small local model (Qwen2.5-Coder-7B) had memorized the benchmark — it compiled best from empty input — so the study controls for that and reports the clean model only. The full design, statistics, and caveats are in the PDF.