Research · June 2026

Introducing PNRD#1

PNRD#1 wraps an open model and makes it prove its work. On miniF2F it nearly doubled the solve rate — and every proof is machine-checked, so you don’t have to take our word for it.

By Expanova Labs✓ Kernel-verified

24.6%→47.1%

kernel-verified · 115 / 244≈ 1.9× · +22.5 pts · best-of-16 + repair + few-shot

miniF2F-test · 244 Lean 4 theorems · gpt-oss-120b (open weights)

How it works

Sample. Check. Repair. Repeat.

The model writes many proofs. The Lean 4 kernel checks each one. Failures come back with the exact error to fix. The first that compiles wins — nothing else ships.

Why you can trust it

The grader is a kernel, not an LLM.

A proof passes only if the Lean kernel compiles it with no shortcuts — there is no model to talk into a yes. A wrong proof can't be bluffed past it; a false positive would take a mis-stated theorem or a kernel bug, not a convincing-looking proof.

The result

One open model, nearly twice the proofs.

Same open model on every row · ✓ each solve checked by Lean 4 kernel v4.31.0
	Configuration	Solved / 244	Pass rate
✓	One-shot baseline	60	24.6%
✓	PNRD#1 · best-of-8, no few-shot	93	38.1%
✓	PNRD#1 · best-of-16 + repair + few-shot	115	47.1%

The lift is real and uneven. We solved more of everything — and stayed honest about the hard tail.

The honest ceiling

What it can't do yet.

Olympiad math is mostly out of reach. 3 of 19 IMO problems — all classical. The rest stay unsolved no matter how much we sample.
Below specialized provers. Purpose-built provers hit ~85–90%. We wrap a general model and reach 47%.
A stronger model would score higher. One reached 53% on a slice, but couldn’t run at full scale here. That’s an infrastructure limit — named, not hidden.

Reproducible: official miniF2F-test · gpt-oss-120b open weights · Lean 4 kernel v4.31.0. View the raw per-problem data →