# sadd:judge-with-debate


## Install

```
npx skills add https://github.com/neolabhq/context-engineering-kit --skill sadd:judge-with-debate
```

# judge-with-debate

Pattern: Debate-Based Evaluation

This command implements an iterative multi-judge debate:

```
Phase 0: Setup
  mkdir -p .specs/reports
            │
Phase 1: Independent Analysis
           ┌─ Judge 1 → {name}.1.md ─┐
  Solution ┼─ Judge 2 → {name}.2.md ─┼─┐
           └─ Judge 3 → {name}.3.md ─┘ │
                                       │
Phase 2: Debate Round (iterative)      │
  Each judge reads others' reports     │
              ↓                        │
  Argue + Defend + Challenge           │
              ↓                        │
  Revise if convinced ─────────────────┤
              ↓                        │
  Check consensus                      │
  ├─ Yes → Final Report                │
  └─ No  → Next Round ─────────────────┘
```

## Process

### Setup: Create Reports Directory

Before starting the evaluation, ensure the reports directory exists: `mkdir -p .specs/reports`

Report naming convention: `.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md`

Where:

- `{solution-name}` - Derived from the solution filename (e.g., `users-api` from `src/api/users.ts`)
- `{YYYY-MM-DD}` - Current date
- `[1|2|3]` - Judge number

### Phase 1: Independent Analysis

Launch 3 independent judge agents in parallel (recommended: Opus for rigor).

Each judge receives:

- Path to the solution(s) being evaluated
- Evaluation criteria with weights
- A clear rubric for scoring

Each produces an independent assessment saved to `.specs/reports/{solution-name}-{date}.[1|2|3].md`.

Reports must include:

- Per-criterion scores with evidence
- Specific quotes/examples supporting ratings
- Overall weighted score
- Key strengths and weaknesses

Key principle: Independence in the initial analysis prevents groupthink.

Prompt template for initial judges:

```
You are Judge {N} evaluating a solution independently.

<solution_path>
{path to solution file(s)}
</solution_path>

<task_description>
{what the solution was supposed to accomplish}
</task_description>

<evaluation_criteria>
{criteria with descriptions and weights}
</evaluation_criteria>

<output_file>
.specs/reports/{solution-name}-{date}.{N}.md
</output_file>

Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md for evaluation methodology and execute it using the following criteria.

Instructions:
1. Read the solution thoroughly
2. For each criterion:
   - Find specific evidence (quote exact text)
   - Score on the defined scale
   - Justify with concrete examples
3. Calculate the weighted overall score
4. Write a comprehensive report to {output_file}
5. Generate 5 verification questions about your evaluation
6. Answer the verification questions:
   - Re-examine the solution for each question
   - Find counter-evidence if it exists
   - Check for systematic bias (length, confidence, etc.)
7. Revise your report file and update it accordingly

Add "Done by Judge {N}" at the beginning of the report.
```

### Phase 2: Debate Rounds (Iterative)

For each debate round (max 3 rounds), launch 3 debate agents in parallel.

Each judge agent receives:

- Path to their own previous report (`.specs/reports/{solution-name}-{date}.[1|2|3].md`)
- Paths to the other judges' reports (`.specs/reports/{solution-name}-{date}.[1|2|3].md`)
- The original solution

Each judge:

- Identifies disagreements with other judges (>1 point score gap on any criterion)
- Defends their own ratings with evidence
- Challenges other judges' ratings they disagree with
- Considers counter-arguments
- Revises their assessment if convinced
- Updates their report file with a new "Debate Round {R}" section (the format is shown in the debate prompt below)

After they reply, if they have reached agreement, move to Phase 3: Consensus Report.

Key principle: Judges communicate only through the filesystem. The orchestrator does not mediate and should not read the report files itself, since doing so can overflow its context.

Prompt template for debate judges:

```
You are Judge {N} in debate round {R}.

<your_previous_report>

{path to .specs/reports/{solution-name}-{date}.{N}.md}
</your_previous_report>

<other_judges_reports>
Judge 1: .specs/reports/{solution-name}-{date}.1.md
...
</other_judges_reports>

<task_description>
{what the solution was supposed to accomplish}
</task_description>

<solution_path>
{path to solution}
</solution_path>

<output_file>
.specs/reports/{solution-name}-{date}.{N}.md (append to existing file)
</output_file>

Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md for evaluation methodology principles.

Instructions:
1. Read your previous assessment from {your_previous_report}
2. Read all other judges' reports
3. Identify disagreements (where your scores differ by >1 point)
4. For each major disagreement:
   - State the disagreement clearly
   - Defend your position with evidence
   - Challenge the other judge's position with counter-evidence
   - Consider whether their evidence changes your view
5. Update your report file by APPENDING the section below
6. Reply stating whether you have reached agreement, and with which judges. Include your revised overall score and criterion scores.


## Debate Round {R}

### Disagreements Identified

**Disagreement with Judge {X} on Criterion "{Name}"**
- My score: {my_score}/5
- Their score: {their_score}/5
- My defense: [quote evidence supporting my score]
- My challenge: [what did they miss or misinterpret?]

[Repeat for each disagreement]

### Revised Assessment

After considering other judges' arguments:

- **Criterion "{Name}"**: [Maintained {X}/5 | Revised from {X} to {Y}/5]
  - Reason for change: [what convinced me] OR Reason maintained: [why I stand by my original score]

[Repeat for changed/maintained scores]

**New Weighted Score**: {updated_total}/5.0

### Evidence

[specific quotes]

CRITICAL:
- Only revise if you find their evidence compelling
- Defend your original scores if you still believe them
- Quote specific evidence from the solution
```
### Consensus Check

After each debate round, check for consensus.

Consensus achieved if:

- All judges' overall scores are within 0.5 points of each other
- No criterion has a >1 point disagreement between any two judges
- All judges explicitly state they accept the consensus

If there is no consensus after 3 rounds:

- Report the persistent disagreements
- Provide all judge reports for human review
- Flag that automated evaluation couldn't reach consensus
## Orchestration Instructions

### Step 1: Run Independent Analysis (Round 1)

- Launch 3 judge agents in parallel (Judge 1, 2, 3)
- Each writes their independent assessment to `.specs/reports/{solution-name}-{date}.[1|2|3].md`
- Wait for all 3 agents to complete
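The report paths above follow the naming convention from the Setup section. As an illustration only, here is a minimal sketch of how an orchestrator might derive them; the `reportPath` helper and its signature are hypothetical, and note that the Setup example maps `src/api/users.ts` to the richer name `users-api`, which this naive version does not attempt.

```typescript
// Hypothetical helper: derive a judge's report path from the solution path,
// the date, and the judge number, per the convention
// .specs/reports/{solution-name}-{YYYY-MM-DD}.{N}.md
import { basename } from "node:path";

function reportPath(solutionPath: string, judge: 1 | 2 | 3, date = new Date()): string {
  // "src/api/users.ts" -> "users"; a real orchestrator might choose "users-api"
  const solutionName = basename(solutionPath).replace(/\.[^.]+$/, "");
  const yyyyMmDd = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `.specs/reports/${solutionName}-${yyyyMmDd}.${judge}.md`;
}

// Example:
// reportPath("src/api/users.ts", 1, new Date("2025-01-15"))
//   -> ".specs/reports/users-2025-01-15.1.md"
```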
### Step 2: Check for Consensus

Let's work through this systematically to ensure accurate consensus detection.

Read all three reports and extract:

- Each judge's overall weighted score
- Each judge's score for every criterion

Check consensus step by step:

1. First, extract all overall scores from each report and list them explicitly
2. Calculate the difference between the highest and lowest overall scores
   - If the difference ≤ 0.5 points → overall consensus achieved
   - If the difference > 0.5 points → no consensus yet
3. Next, for each criterion, list all three judges' scores side by side
4. For each criterion, calculate the difference between the highest and lowest scores
   - If any criterion has a difference > 1.0 point → no consensus on that criterion
5. Finally, verify consensus is achieved only if BOTH conditions are met:
   - Overall scores within 0.5 points
   - All criterion scores within 1.0 point
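The two numeric conditions above can be summarized in a small sketch, assuming the scores have already been extracted from the three reports. The `JudgeScores` shape and `numericConsensus` name are illustrative; the third condition (judges explicitly accepting the consensus) is read from the report text and is not modeled here.

```typescript
// Illustrative shape for scores extracted from one judge's report
interface JudgeScores {
  overall: number;                  // weighted overall score, 0-5
  criteria: Record<string, number>; // per-criterion scores, 0-5
}

// Numeric part of the consensus check: overall scores within 0.5 points
// and every criterion within 1.0 point across all judges.
function numericConsensus(judges: JudgeScores[]): boolean {
  if (judges.length === 0) return false;
  const spread = (xs: number[]) => Math.max(...xs) - Math.min(...xs);

  const overallOk = spread(judges.map(j => j.overall)) <= 0.5;
  const criteriaOk = Object.keys(judges[0].criteria).every(
    name => spread(judges.map(j => j.criteria[name])) <= 1.0
  );

  return overallOk && criteriaOk;
}
```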
### Step 3: Decision Point

- If consensus achieved → go to Step 5 (Generate Consensus Report)
- If no consensus AND round < 3 → go to Step 4 (Run Debate Round)
- If no consensus AND round = 3 → go to Step 6 (Report No Consensus)

### Step 4: Run Debate Round

- Increment the round counter (round = round + 1)
- Launch 3 judge agents in parallel
- Each agent reads:
  - Their own previous report from the filesystem
  - The other judges' reports from the filesystem
  - The original solution
- Each agent appends a "Debate Round {R}" section to their own report file
- Wait for all 3 agents to complete
- Go back to Step 2 (Check for Consensus)

### Step 5: Reply with Report

Let's synthesize the evaluation results step by step. Read all final reports carefully. Before generating the report, analyze the following:

- What is the consensus status (achieved or not)?
- What were the key points of agreement across all judges?
- What were the main areas of disagreement, if any?
- How did the debate rounds change the evaluations?

Reply to the user with a report that contains:

If there is consensus:

- Consensus scores (average of all judges)
- Consensus strengths/weaknesses
- Number of rounds to reach consensus
- Final recommendation with clear justification

If there is no consensus:

- All judges' final scores showing the disagreements
- Specific criteria where consensus wasn't reached
- Analysis of why consensus couldn't be reached
- Flag for human review

Command complete.
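For orientation, here is a rough sketch of the control flow across Steps 1-5. The `OrchestratorHooks` interface and `judgeWithDebate` function are illustrative placeholders: the actual agent launches and report parsing happen through the orchestrator's parallel subagent tasks and are not modeled here.

```typescript
// Illustrative orchestration loop for Steps 1-5.
interface JudgeScores { overall: number; criteria: Record<string, number>; }

interface OrchestratorHooks {
  runInitialJudges(): Promise<void>;                   // Step 1: launch Judges 1-3 in parallel
  extractScores(): Promise<JudgeScores[]>;             // Step 2: parse the 3 report files
  isConsensus(judges: JudgeScores[]): boolean;         // Step 2: e.g. the numericConsensus sketch above
  runDebateRound(round: number): Promise<void>;        // Step 4: judges append "Debate Round {R}"
  replyWithReport(consensus: boolean): Promise<void>;  // Step 5: reply to the user
}

async function judgeWithDebate(hooks: OrchestratorHooks, maxRounds = 3): Promise<void> {
  await hooks.runInitialJudges();                      // Round 1: independent analysis
  let round = 1;
  while (true) {
    const consensus = hooks.isConsensus(await hooks.extractScores()); // Step 2
    if (consensus || round >= maxRounds) {             // Step 3: decision point
      await hooks.replyWithReport(consensus);          // Step 5 (consensus or no-consensus report)
      return;
    }
    round += 1;                                        // Step 4: run another debate round
    await hooks.runDebateRound(round);
  }
}
```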

### Phase 3: Consensus Report

If consensus is achieved, synthesize the final report by working through each section methodically:

#### Consensus Evaluation Report

Let's compile the final consensus by analyzing each component systematically.

#### Consensus Scores

First, let's consolidate all judges' final scores:

| Criterion | Judge 1 | Judge 2 | Judge 3 | Final |
|-----------|---------|---------|---------|-------|
| {Name}    | {X}/5   | {X}/5   | {X}/5   | {X}/5 |
| ...       |         |         |         |       |

**Consensus Overall Score**: {avg}/5.0

#### Consensus Strengths

[Review each judge's identified strengths and extract the common themes that all judges agreed upon]

#### Consensus Weaknesses

[Review each judge's identified weaknesses and extract the common themes that all judges agreed upon]

#### Debate Summary

Let's trace how consensus was reached:

- Rounds to consensus: {N}
- Initial disagreements: {list with specific criteria and score gaps}
- How resolved: {for each disagreement, explain what evidence or argument led to resolution}

#### Final Recommendation

Based on the consensus scores and the key strengths/weaknesses identified:

{Pass/Fail/Needs Revision with clear justification tied to the evidence}
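A minimal sketch of the averaging behind the Final column and the consensus overall score, assuming (per Step 5) that consensus scores are simple averages of the three judges' final values; the types and function name are illustrative.

```typescript
// Illustrative: average the judges' final values into the consensus scores.
interface JudgeScores { overall: number; criteria: Record<string, number>; }

function consensusScores(judges: JudgeScores[]): JudgeScores {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const criteria = Object.fromEntries(
    Object.keys(judges[0].criteria).map(
      name => [name, avg(judges.map(j => j.criteria[name]))]
    )
  );
  return { overall: avg(judges.map(j => j.overall)), criteria };
}
```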
### Outputs

- Reports directory: `.specs/reports/` (created if it does not exist)
- Initial reports: `.specs/reports/{solution-name}-{date}.1.md`, `.specs/reports/{solution-name}-{date}.2.md`, `.specs/reports/{solution-name}-{date}.3.md`
- Debate updates: appended sections in each report file per round
- Final synthesis: replied to the user (consensus or disagreement summary)

## Best Practices

### Evaluation Criteria

Choose 3-5 weighted criteria relevant to the solution type.

Code evaluation:

- Correctness (30%) - Does it work? Handles edge cases?
- Design Quality (25%) - Clean architecture? Maintainable?
- Efficiency (20%) - Performance considerations?
- Code Quality (15%) - Readable? Well-documented?
- Testing (10%) - Test coverage? Test quality?

Design/Architecture evaluation:

- Completeness (30%) - All requirements addressed?
- Feasibility (25%) - Can it actually be built?
- Scalability (20%) - Handles growth?
- Simplicity (15%) - Appropriately simple?
- Documentation (10%) - Clear and comprehensive?

Documentation evaluation:

- Accuracy (35%) - Technically correct?
- Completeness (30%) - Covers all necessary topics?
- Clarity (20%) - Easy to understand?
- Usability (15%) - Helpful examples? Good structure?

### Common Pitfalls

- ❌ Judges create new reports instead of appending - loses the debate history
- ❌ Orchestrator passes reports between judges - violates the filesystem-communication principle
- ❌ Weak initial assessments - garbage in, garbage out
- ❌ Too many debate rounds - diminishing returns after 3 rounds
- ❌ Sycophancy in debate - judges agree too easily without real evidence

- ✅ Judges append to their own report file
- ✅ Judges read other reports from the filesystem directly
- ✅ Strong evidence-based initial assessments
- ✅ Maximum 3 debate rounds
- ✅ Require evidence for changing positions

## Example Usage

### Evaluating an API Implementation

```
/judge-with-debate \
  --solution "src/api/users.ts" \
  --task "Implement REST API for user management" \
  --criteria "correctness:30,design:25,security:20,performance:15,docs:10"
```

Round 1 outputs (assuming date 2025-01-15):

- `.specs/reports/users-api-2025-01-15.1.md` - Judge 1 scores correctness 4/5, security 3/5
- `.specs/reports/users-api-2025-01-15.2.md` - Judge 2 scores correctness 4/5, security 5/5
- `.specs/reports/users-api-2025-01-15.3.md` - Judge 3 scores correctness 5/5, security 4/5

Disagreement detected: security scores range from 3 to 5.

Round 2 debate:

- Judge 1 defends 3/5: "Missing rate limiting, input validation incomplete"
- Judge 2 challenges: "Rate limiting exists in middleware (line 45)"
- Judge 1 revises to 4/5: "Missed middleware, but input validation still weak"
- Judge 3 defends 4/5: "Input validation adequate for requirements"

Round 2 outputs:

- All judges now 4-5/5 on security (within 1 point)
- Disagreement on input validation remains

Round 3 debate:

- Judges examine the specific validation code
- Judge 2 revises to 4/5: "Upon re-examination, email validation regex is weak"
- Consensus: Security = 4/5

Final consensus:

- Correctness: 4.3/5
- Design: 4.5/5
- Security: 4.0/5 (3 rounds to consensus)
- Performance: 4.7/5
- Documentation: 4.0/5
- Overall: 4.3/5 - PASS
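As a sanity check on the example's overall score, here is a minimal sketch of the weighted-score calculation; the parsing of the `--criteria` string is inferred from the example above and is an assumption, not a documented format.

```typescript
// Parse a "--criteria" string like "correctness:30,design:25,..." into weights,
// then combine per-criterion consensus scores into the weighted overall score.
function parseWeights(criteria: string): Record<string, number> {
  return Object.fromEntries(
    criteria.split(",").map(part => {
      const [name, pct] = part.split(":");
      return [name.trim(), Number(pct) / 100]; // e.g. "correctness:30" -> 0.30
    })
  );
}

function weightedOverall(
  scores: Record<string, number>,
  weights: Record<string, number>
): number {
  return Object.entries(weights).reduce(
    (total, [name, w]) => total + w * (scores[name] ?? 0),
    0
  );
}

// With the final consensus scores from the example:
const weights = parseWeights("correctness:30,design:25,security:20,performance:15,docs:10");
const scores: Record<string, number> = {
  correctness: 4.3, design: 4.5, security: 4.0, performance: 4.7, docs: 4.0,
};
console.log(weightedOverall(scores, weights).toFixed(1)); // "4.3"
```

With these numbers the weighted sum is 4.32, which matches the reported 4.3/5 overall.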