Research · June 2026

Introducing PNRD#1

PNRD#1 wraps an open model and makes it prove its work. On miniF2F it nearly doubled the solve rate — and every proof is machine-checked, so you don’t have to take our word for it.

✓ Kernel-verified
24.6%47.1%
kernel-verified · 115 / 244
Test-time compute against a sound oracleOne-shot 24.6% (60/244) versus the PNRD#1 harness 47.1% (115/244).24.6%One-shot60 / 24447.1%PNRD#1 harness115 / 244↑ 1.9× · +22.5 pts

How it works

Sample. Check. Repair. Repeat.

The model writes many proofs. The Lean 4 kernel checks each one. Failures come back with the exact error to fix. The first that compiles wins — nothing else ships.

How PNRD#1 worksOpen modelgpt-oss-120bN candidateproofsLean 4 kernelmechanical, not an LLM✓ Verifiedor nothingwrites Nchecks eachcompiles ✓✗ fails → repair: the kernel’s exact error goes back to the model

Why you can trust it

The grader is a kernel, not an LLM.

A proof passes only if the Lean kernel compiles it with no shortcuts — there is no model to talk into a yes. A wrong proof can't be bluffed past it; a false positive would take a mis-stated theorem or a kernel bug, not a convincing-looking proof.

A mechanical judge — no model to bluffKERNELcompiles · no sorry · no admit✓ PASSanything elseblockedthe check is mechanical — there is no model to talk into a yes

The result

One open model, nearly twice the proofs.

ConfigurationSolved / 244Pass rate
One-shot baseline6024.6%
PNRD#1 · best-of-8, no few-shot9338.1%
PNRD#1 · best-of-16 + repair + few-shot11547.1%
Same open model on every row · ✓ each solve checked by Lean 4 kernel v4.31.0

The lift is real and uneven. We solved more of everything — and stayed honest about the hard tail.

Honest spread by problem familymathdn=1305285 / 130amcn=45213 / 45imon=1903 / 19aimen=1534 / 15othern=35310 / 35

The honest ceiling

What it can't do yet.

  • Olympiad math is mostly out of reach. 3 of 19 IMO problems — all classical. The rest stay unsolved no matter how much we sample.
  • Below specialized provers. Purpose-built provers hit ~85–90%. We wrap a general model and reach 47%.
  • A stronger model would score higher. One reached 53% on a slice, but couldn’t run at full scale here. That’s an infrastructure limit — named, not hidden.