Math Olympiad Solver

The five things that change outcomes

Strip thinking before verifying

— a verifier that sees the reasoning is

biased toward agreement. Fresh context, cleaned proof only.

"Does this prove RH?"

— if your theorem's specialization to ζ is a famous

open problem, you have a gap. Most reliable red flag.

Short proof → extract the general lemma

— try 2×2 counterexamples. If

general form is false, find what's special about THIS instance.

Same gap twice → step back

— the case split may be obscuring a unified

argument. Three lines sometimes does what twelve pages couldn't.

Say "no confident solution"

— wrong-and-confident is worse than honest

abstain.

Tool policy

Solvers and verifiers use THINKING ONLY in the tight-budget

workflow. Competition math is reasoning. Computation is for deep mode (§6c), and

even then bounded — a recurrence that's doubly-exponential can't be computed

past n~30, work mod 2^m instead.

When to use which approach

Problem

Approach

Verification

AIME numeric answer

Best-of-N → majority vote

Answer check only

Olympiad proof (IMO/Putnam/USAMO)

Full workflow below

5-pass adversarial

"Is this proof correct?"

Skip to verification (step 4)

Adversarial + spec-gaming

Full problem set

(e.g. all 6 from a competition)

Sequential: one full workflow per problem, collect results, compile single PDF

Per-problem adversarial

Batch in one Workflow

Set

opts.label

on every

agent()

call to include

the problem ID (e.g.,

label: "P3:solver:2"

). Without labels, 36 results come

back with no problem association. Run problems in parallel — the label is what

matters, not ordering.

For a full problem set

Launch one solver workflow per problem (same VERBATIM prompt, different

statement). Run them in parallel. When all return, run adversarial verification

per problem. Problems that pass get their proof in the PDF; problems that

abstain get "No confident solution" with partial notes.

Don't try to solve all N problems in one agent's context — each problem needs

its own thinking budget and its own fresh-context verifier. The composition is

mechanical: collect the per-problem outputs, fill in LaTeX sections, compile

once. | "Simplify this proof" | Skip to presentation (step 8) | — |

The Workflow

1. Interpretation check (30 seconds, catches 50/63 of one class of errors)

Before solving anything, identify the interpretation.

Read the problem statement. List 2-3 ways it could be interpreted. For each:

is this reading TRIVIAL? If one reading makes the problem easy and another

makes it hard, the hard one is almost certainly intended. State which

interpretation you're solving and WHY you believe it's the intended one.

The Aletheia case study found 50 of 63 "technically correct" solutions were for

the wrong interpretation. Olympiad problems often have a trap easy reading.

2. Generate candidates with internal refinement (parallel, thinking only)

Launch 8-12 attempt agents in parallel.

Each agent internally iterates

—

solve → self-improve → self-verify → correct → repeat. This is the Yang-Huang

structure that achieves 85.7% on IMO: one-shot solving isn't enough; per-attempt

refinement matters.

The Agent tool cannot enforce tool restriction.

Subagents get the full tool

set. The only mechanism is the prompt. Use this prompt VERBATIM — do not

summarize, do not synthesize your own:

NO COMPUTATION. Do not use Bash, Python, WebSearch, Read, Write, or any tool that runs code or fetches data. Numerical verification is not a proof step. "I computed n=1..10 and the pattern holds" is not a proof.

(If your agent harness requires a StructuredOutput or similar return-mechanism tool call, that is NOT a computation tool — call it to return your answer. The restriction is on tools that DO work, not tools that REPORT work.)

Your internal process (iterate until done):

- Solve: Complete rigorous solution.

- Self-improve: Reread. Fix gaps before a grader sees it.

- Self-verify: Strict grader mode. Every step justified?

- Correct: Fix and re-verify. Up to 5 rounds.

- Stop: Self-verify passes twice clean, OR 5 rounds, OR approach fundamentally wrong.

A correct answer from flawed reasoning is a failure. If incomplete, say so honestly. Never hide gaps.

PROBLEM:

ANGLE:

The first two paragraphs are load-bearing. A session that writes its own prompt

and omits them will produce subagents that grind Python for 30 iterations and

confidently get wrong answers — a pattern that fits n≤10 but fails at n=100 is

not a proof.

Starting angles (vary across agents — see

references/solver_heuristics.md

):

Work out small cases (test past n=3)

Look for an invariant or monovariant

Consider the extremal case

Try induction

What symmetries?

Work backwards

Drop a condition — where does it become trivially false?

Generalize (inventor's paradox — more structure is sometimes easier)

Each returns its FINAL state (not intermediate rounds):

Verdict: complete solution | partial result | no progress

Rounds: [how many verify→correct cycles]

Method: [key idea, one paragraph]

Detailed Solution: [full step-by-step, every step justified]

Answer: [if applicable]

Self-verification notes: [what you caught and fixed; remaining concerns]

Retry policy

If an agent fails or times out, retry once. Transient failures
happen.
3. Clean the solution (context isolation — the #1 lever)
The thinking trace biases the verifier toward agreement — a long chain of
reasoning reads as supporting evidence even when the conclusion is wrong. Before
any verification, strip:
All thinking-block content
All "Let me try..." / "Actually wait..." / "Hmm" prose
All false starts and backtracking
What remains: problem statement + clean final argument only.
Extract only the
Method
+
Proof
+
Answer
sections from each solver's
output. The verifier never sees how the solver got there.
4. Adversarial verify (fresh context, pattern-armed)
For each cleaned solution, launch a fresh verifier agent.
Fresh context: it sees only (problem statement + cleaned solution). No tools. The verifier's job is to ATTACK, not grade. Load references/adversarial_prompts.md for the prompts. The key patterns it runs: Pattern The check

4

Does this theorem specialize to a famous object (ζ, quadratic reciprocity, etc.) and prove something open about it? → gap

18

Substitute the proof's own intermediate identities into any "remaining gap." Recover the original claim? → tautological

40

Is any step a "one-line lemma"? Extract the GENERAL form. Find a 2×2 counterexample. If the general form is false, find what special structure saves THIS instance

5

For each invoked theorem: re-check hypotheses FROM SCRATCH. "Continuous on [0,1]" ≠ "continuous on ℝ"

6

Any infinite sum "bounded" via a regularized value? Check the boundary — if there's a pole there, the sum diverges

Full pattern list:

references/verifier_patterns.md

Verifier returns:

Verdict: HOLDS | HOLE FOUND | UNCLEAR

If HOLE FOUND:

- Location: [quote the problematic step]

- Pattern: [which check fired, or "other"]

- Why it breaks: [specific]

- Fixable?: [yes with X / no, fundamental]

5. Rank and vote-verify (asymmetric + early exit)

Rank solutions by (verdict, verifier confidence). Take the top one. Run up to 5

fresh verifier agents.

Asymmetric thresholds

4 HOLDS to confirm, 2 HOLE FOUND to refute. Why

asymmetric: one flaky verifier shouldn't kill a correct proof; but two

independent dissents is a real signal.

Pigeonhole early exit

stop launching verifiers once the outcome is decided.

2 say HOLE FOUND → refuted, stop (save the remaining 3 calls)

4 say HOLDS → confirmed, stop (save the 5th)

After 3 verifiers: if 2 HOLDS + 1 HOLE, launch 2 more (outcome undecided). If

3 HOLDS + 0 HOLE, launch 1 more (could still hit 4-1).

Dual context-isolation

each verifier is blind to (a) the solver's thinking

trace — already stripped in step 3 — AND (b) other verifiers' verdicts. Each

verifier thinks it's the first. No "3 agents already confirmed this" social

proof.

A solver cannot verify its own solution.

Different agent, fresh context.

5b. When one case won't close — step back before grinding

If a proof splits into cases and one case proves easily but the other resists:

before grinding through the hard case, ask whether there's a route that makes

the split disappear.

The pattern that saves you: the hard case's very hypothesis often implies

something strong about an

intermediate object

you haven't looked at. Use that

implication directly instead of the original chain.

Concrete shape: proving f(n) ≤ cn for a constrained function f, with a case

split on a prime p dividing f(n). One branch closes by index arguments in

(ℤ/p^e)*. The other branch resists — same group structure, but the arithmetic

doesn't contradict. The fix: the hypothesis "p | f(n)" plugged back into the

governing equation implies

f(p) = p itself

. Once you have that, a

Fermat+Dirichlet argument kills both branches in three lines. The case split was

a detour — it was splitting on a variable that, under the hypothesis, takes a

known value.

Check when stuck on case B:

What does case B's hypothesis imply about f at

other

inputs?

Is there a different pair (a,b) to plug into the governing equation?

Are you proving too much? (A cleaner contradiction needs less machinery.)

This is also a presentation-pass win: the split-free proof is shorter AND more

general.

6. Revise (if needed)

If verification finds a hole: launch a reviser agent. It gets (cleaned

solution + verifier's hole report). STILL no access to the original thinking —

the reviser works from the hole, not by rereading how you got there.

A verifier found this issue in the proof:

[hole report]

Fix the proof. If the hole is fundamental (the approach doesn't work), say so and return Verdict: no confident solution with what partial progress remains.

For any step you cannot fully close, mark it inline: [GAP: specific description of what remains]. Gaps in the proof text, not in a separate list — they're greppable and the next reviser knows exactly where to look.

Up to 3 revise cycles. Then re-run the vote on the revised proof.

If pattern #40 fired

(one-line-proof-too-clean), the reviser gets a stronger

brief — the Adversarial Brief template from

references/adversarial_prompts.md

§7. It forces a binary: "the general lemma is obviously false (here's a 2×2

counterexample) — so either find what's special about THIS case, or find where

the proof breaks." Can't return "looks fine."

6c. Deep mode (when tight-budget abstains)

The standard workflow is tight-budget: 8 solvers, ~15 min, pure reasoning. When

it abstains, the problem may need more time, not more capability.

Deep mode

is a single focused agent with:

Unlimited time

— no wall-clock pressure

Targeted computation allowed

— modular arithmetic checks, small-case

enumeration, symbolic verification of identities. NOT exploratory brute force

or unbounded recursion.

The abstention reason as starting point

— if verifiers found a specific

gap, start there. If solvers never claimed complete, start from what they

partially proved.

The archetype: a focused agent that gets the proven-so-far state plus "one case

of Lemma 5 is open" — and finds a 3-line argument the case split was obscuring.

Often under 10 minutes with almost no computation. Deep mode is about giving the

problem sustained attention, not throwing compute at it.

What deep mode is NOT

open-ended exploration, literature search, looking up

solutions, multi-day investigation. That's a different workflow

(

math-research

). Deep mode is still "solve THIS problem yourself" — just

without the clock.

NO WEB. NO LOOKUP.

Deep mode may use Bash/Python for bounded computation,

but NEVER WebFetch, WebSearch, or any network access. Finding the solution on

AoPS or a blog is not solving the problem — it's cheating on an olympiad, and it

teaches us nothing about the skill's actual capability. Put this at the TOP of

the deep-mode prompt:

NO WEB ACCESS. Do not use WebFetch, WebSearch, or any tool that touches the internet. Do not look up this problem, its solution, or related problems. You are solving this yourself — the only allowed computation is local (Bash/Python for mod-k arithmetic, small-case enumeration n≤10, symbolic identity checks). If you invoke a web tool, the proof is void.

Computation bounds in deep mode

(bug #8 lesson): A6's b_{n+1}=2b_n²+b_n+1

is doubly-exponential; b_99 has ~10^{2^98} digits. Never compute such objects

exactly — work in ℤ/2^m, or track only v_p(·), or prove the recursion mod the

quantity you care about. If a computation is running longer than 60 seconds,

it's probably unbounded. Kill it and work symbolically.

Step 6d (not optional)

After any ABSTAIN at the verify stage, automatically

launch one deep-mode agent before writing the abstention into the output. Give

it:

The problem statement

The best partial proof from tight-budget solvers

The verifier gap descriptions (what specifically didn't close)

The instruction: "NO WEB ACCESS — do not look up this problem or its solution.

Bounded local computation allowed (mod 2^k, small cases n≤10, symbolic

identity checks via Bash/Python only). 60-second computation limit. If n≤10

brute force reveals a pattern the tight-budget solvers missed, that pattern IS

the proof structure."

The deep agent may find the construction the pure-reasoning solvers couldn't

see. If it also abstains, THEN write the abstention. Do not skip this step —

problems with √n or log n answers are often invisible to pure reasoning because

the optimal structure is the asymmetric one.

Orchestrator self-restraint

The orchestrator itself must not web-search the

problem "to help" the deep agent. If you're tempted to Fetch an AoPS thread

"just to check the answer," don't — that contaminates the skill's output and

misrepresents its capability.

7. Calibrated abstention

If 3 revise cycles all fail:

stop and admit it.

Verdict: no confident solution

What was tried: [approaches]

What WAS proven: [any lemma or partial result that survived verification]

Where it breaks: [the unfixed hole]

Do NOT guess. A wrong confident answer is worse than an honest "couldn't solve

it." The metric that matters is CONDITIONAL accuracy — when you say "solved,"

are you right?

8. Presentation pass (after correctness is established)

A VERIFIED-CORRECT proof is often not a BEAUTIFUL proof. The order you

discovered it is rarely the best order to present it. Launch a fresh

presentation agent with the verified proof.

Load

references/presentation_prompts.md

. The agent asks:

What's the simplest way to say this?

Which lemmas should be inlined? Which deserve to stand alone?

Is anything OVERKILL? (constructing a double exponential when linear suffices)

Now that we know the answer, is there a 3-line hindsight proof?

Output: LaTeX-formatted proof. If

pdflatex

is available

(

scripts/check_latex.sh

returns 0), also compile to PDF via

scripts/compile_pdf.sh

.

Model tier defaults

Read

references/model_tier_defaults.md

for full details. Summary:

Model

Solvers

Verify passes

Abstain after

Presentation

Haiku

8

3

2 revise fails

skip

Sonnet

4

5

3 revise fails

yes

Opus

3

5 + full pattern sweep

4 revise fails

2 drafts, pick cleaner

Weaker models: more parallel attempts, faster abstention. Stronger models:

deeper verification, more presentation effort.

For numeric-answer problems (AIME-style)

Skip the proof machinery. Run 5-7 solvers with varied approaches, take majority

vote on the numeric answer. If no majority: verify the top 2 candidates by

substitution.

Key references

references/verifier_patterns.md

— the 12 adversarial checks

references/adversarial_prompts.md

— ready-to-use verifier prompts

references/presentation_prompts.md

— beautification prompts + LaTeX template

references/model_tier_defaults.md

— per-model configuration

What makes this different from generic verify-and-refine

Dual context isolation

verifier is blind to (a) the solver's thinking
trace — which biases toward agreement — and (b) other verifiers' verdicts —
social proof also biases. Each verifier thinks it's first.
Pattern-specific attacks: not "is this correct?" but "does this make the

40 mistake? the #4 mistake?" Specific beats generic. The 7-category

refutation taxonomy gives the verifier a checklist.

Asymmetric vote + pigeonhole exit

4-to-confirm, 2-to-refute. One flaky

verifier doesn't kill a correct proof; two dissents does. Stop launching

verifiers once the outcome is decided — saves ~30% of verification cost on

clear cases.

Specification-gaming check first

explicitly asks "is this the intended

interpretation?" before solving. The #1 failure mode in prior work (50/63

"correct" answers solved the wrong reading).

Calibrated abstention

will say "no confident solution" with partial
results. Optimizes conditional accuracy, not coverage.
Presentation pass: correctness and elegance are separate steps. The presentation agent gets the VERIFIED proof and finds the cleanest way to say it.

math-olympiad

安装

4

18

40

5

6

40 mistake? the #4 mistake?" Specific beats generic. The 7-category