Research · June 2026
Introducing PNRD#1
PNRD#1 wraps an open model and makes it prove its work. On miniF2F it nearly doubled the solve rate — and every proof is machine-checked, so you don’t have to take our word for it.
How it works
Sample. Check. Repair. Repeat.
The model writes many proofs. The Lean 4 kernel checks each one. Failures come back with the exact error to fix. The first that compiles wins — nothing else ships.
Why you can trust it
The grader is a kernel, not an LLM.
A proof passes only if the Lean kernel compiles it with no shortcuts — there is no model to talk into a yes. A wrong proof can't be bluffed past it; a false positive would take a mis-stated theorem or a kernel bug, not a convincing-looking proof.
The result
One open model, nearly twice the proofs.
| Configuration | Solved / 244 | Pass rate | |
|---|---|---|---|
| One-shot baseline | 60 | 24.6% | |
| PNRD#1 · best-of-8, no few-shot | 93 | 38.1% | |
| PNRD#1 · best-of-16 + repair + few-shot | 115 | 47.1% |
The lift is real and uneven. We solved more of everything — and stayed honest about the hard tail.
The honest ceiling
What it can't do yet.
- Olympiad math is mostly out of reach. 3 of 19 IMO problems — all classical. The rest stay unsolved no matter how much we sample.
- Below specialized provers. Purpose-built provers hit ~85–90%. We wrap a general model and reach 47%.
- A stronger model would score higher. One reached 53% on a slice, but couldn’t run at full scale here. That’s an infrastructure limit — named, not hidden.