# Judge Command

The evaluation is report-only: findings are presented without automatic changes.
## Your Workflow

### Phase 1: Context Extraction

Before launching the judge, identify what needs evaluation:
1. **Identify the work to evaluate**:
   - Review conversation history for completed work
   - If arguments are provided: use them to focus on specific aspects
   - If unclear: ask the user "What work should I evaluate? (code changes, analysis, documentation, etc.)"
2. **Extract evaluation context**:
   - The original task or request that prompted the work
   - The actual output/result produced
   - Files created or modified (with brief descriptions)
   - Any constraints, requirements, or acceptance criteria mentioned
3. **Provide scope for the user**:

   Evaluation Scope:
   - Original request: [summary]
   - Work produced: [description]
   - Files involved: [list]
   - Evaluation focus: [from arguments or "general quality"]

   Launching judge sub-agent...
**IMPORTANT**: Pass only the extracted context to the judge, not the entire conversation. This prevents context pollution and enables focused assessment.

### Phase 2: Launch Judge Sub-Agent

Use the Task tool to spawn a single judge agent with the following prompt and context. Adjust the criteria rubric and weights to match the solution type and complexity, drawing on dimensions such as:

- Code Quality
- Documentation Quality
- Test Coverage
- Security
- Performance
- Usability
- Reliability
- Maintainability
- Scalability
- Cost-effectiveness
- Compliance
- Accessibility

**Judge Agent Prompt**:

You are an Expert Judge evaluating the quality of work produced in a development session.

Work Under Evaluation

[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]

[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]

[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]

[EVALUATION FOCUS]
{from arguments, or "General quality assessment"}
[/EVALUATION FOCUS]

Read ${CLAUDE_PLUGIN_ROOT}/tasks/judge.md and execute.
## Evaluation Criteria

### Criterion 1: Instruction Following (weight: 0.30)

Does the work follow all explicit instructions and requirements?

**Guiding Questions**:
- Does the output fulfill the original request?
- Were all explicit requirements addressed?
- Are there gaps or unexpected deviations?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All instructions followed precisely, no deviations |
| Good | 4 | Minor deviations that do not affect outcome |
| Adequate | 3 | Major instructions followed, minor ones missed |
| Poor | 2 | Significant instructions ignored |
| Failed | 1 | Fundamentally misunderstood the task |

### Criterion 2: Output Completeness (weight: 0.25)

Are all requested aspects thoroughly covered?

**Guiding Questions**:
- Are all components of the request addressed?
- Is there appropriate depth for each component?
- Are there obvious gaps or missing pieces?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | All aspects thoroughly covered with appropriate depth |
| Good | 4 | Most aspects covered with minor gaps |
| Adequate | 3 | Key aspects covered, some notable gaps |
| Poor | 2 | Major aspects missing |
| Failed | 1 | Fundamental aspects not addressed |

### Criterion 3: Solution Quality (weight: 0.25)

Is the approach appropriate and well-implemented?

**Guiding Questions**:
- Is the chosen approach sound and appropriate?
- Does the implementation follow best practices?
- Are there correctness issues or errors?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Optimal approach, clean implementation, best practices followed |
| Good | 4 | Good approach with minor issues |
| Adequate | 3 | Reasonable approach, some quality concerns |
| Poor | 2 | Problematic approach or significant quality issues |
| Failed | 1 | Fundamentally flawed approach |

### Criterion 4: Reasoning Quality (weight: 0.10)

Is the reasoning clear, logical, and well-documented?

**Guiding Questions**:
- Is the decision-making transparent?
- Were appropriate methods/tools used?
- Can someone understand why this approach was taken?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Clear, logical reasoning throughout |
| Good | 4 | Generally sound reasoning with minor gaps |
| Adequate | 3 | Basic reasoning present |
| Poor | 2 | Reasoning unclear or flawed |
| Failed | 1 | No apparent reasoning |

### Criterion 5: Response Coherence (weight: 0.10)

Is the output well-structured and easy to understand?

**Guiding Questions**:
- Is the output organized logically?
- Can someone unfamiliar with the task understand it?
- Is it professionally presented?

| Level | Score | Description |
|-------|-------|-------------|
| Excellent | 5 | Well-structured, clear, professional |
| Good | 4 | Generally coherent with minor issues |
| Adequate | 3 | Understandable but could be clearer |
| Poor | 2 | Difficult to follow |
| Failed | 1 | Incoherent or confusing |
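Combining the five criteria into one result is a weighted sum; the weights above sum to 1.0, so the total stays on the same 1-5 scale. A minimal Python sketch (the example scores are hypothetical, not from any real evaluation):

```python
# Weights for the five criteria defined above; they must sum to 1.0.
WEIGHTS = {
    "Instruction Following": 0.30,
    "Output Completeness": 0.25,
    "Solution Quality": 0.25,
    "Reasoning Quality": 0.10,
    "Response Coherence": 0.10,
}

def weighted_total(scores: dict[str, int]) -> float:
    """Combine per-criterion scores (1-5) into one weighted total."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    assert all(1 <= s <= 5 for s in scores.values()), "scores must be 1-5"
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 2)

# Hypothetical evaluation: strong on instructions, weaker on reasoning.
example = {
    "Instruction Following": 5,
    "Output Completeness": 4,
    "Solution Quality": 4,
    "Reasoning Quality": 3,
    "Response Coherence": 4,
}
print(weighted_total(example))  # 4.2
```

Worked by hand: 0.30×5 + 0.25×4 + 0.25×4 + 0.10×3 + 0.10×4 = 1.5 + 1.0 + 1.0 + 0.3 + 0.4 = 4.2.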
### Phase 3: Process and Present Results

After receiving the judge's evaluation:

1. **Validate the evaluation**:
   - Check that all criteria have scores in the valid range (1-5)
   - Verify each score has supporting justification with evidence
   - Confirm the weighted total calculation is correct
   - Check for contradictions between justification and score
   - Verify self-verification was completed with documented adjustments
2. **If validation fails**:
   - Note the specific issue
   - Request clarification or re-evaluation if needed
3. **Present results to the user**:
   - Display the full evaluation report
   - Highlight the verdict and key findings
   - Offer follow-up options:
     - Address specific improvements
     - Request clarification on any judgment
     - Proceed with the work as-is
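The mechanical checks in the validation step can be sketched as a function over the judge's report. This is illustrative only: the `criteria`, `weight`, `score`, `justification`, and `weighted_total` field names are assumptions about how a report might be structured, not a format this command defines.

```python
def validate_evaluation(report: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of validation problems; an empty list means the report passes."""
    problems = []
    for name, entry in report["criteria"].items():
        # Scores must sit in the 1-5 rubric range.
        if not 1 <= entry["score"] <= 5:
            problems.append(f"{name}: score {entry['score']} outside the 1-5 range")
        # Every score needs a non-empty supporting justification.
        if not entry.get("justification"):
            problems.append(f"{name}: score lacks a supporting justification")
    # Recompute the weighted total and compare against the reported value.
    expected = sum(e["weight"] * e["score"] for e in report["criteria"].values())
    if abs(expected - report["weighted_total"]) > tolerance:
        problems.append(
            f"weighted total {report['weighted_total']} != recomputed {expected:.2f}"
        )
    return problems
```

A non-empty result feeds straight into the "if validation fails" step: note the issue and request clarification or re-evaluation.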
## Scoring Interpretation

| Score Range | Verdict | Interpretation | Recommendation |
|-------------|---------|----------------|----------------|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |
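Because the bands are contiguous, the mapping reduces to a first-match lookup over descending floors. A sketch (the `verdict` function name is illustrative):

```python
def verdict(weighted_total: float) -> tuple[str, str]:
    """Map a weighted total onto the verdict bands in the table above."""
    bands = [  # (band floor, verdict, recommendation), highest first
        (4.50, "EXCELLENT", "Ready as-is"),
        (4.00, "GOOD", "Minor improvements optional"),
        (3.50, "ACCEPTABLE", "Improvements recommended"),
        (3.00, "NEEDS IMPROVEMENT", "Address issues before use"),
        (1.00, "INSUFFICIENT", "Significant rework needed"),
    ]
    for floor, name, recommendation in bands:
        if weighted_total >= floor:
            return name, recommendation
    raise ValueError(f"weighted total {weighted_total} below the 1.00 minimum")

print(verdict(4.20))  # ('GOOD', 'Minor improvements optional')
```

Band floors are inclusive, matching the table: 4.49 is GOOD, 4.50 is EXCELLENT.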
## Important Guidelines

- **Context Isolation**: Pass only relevant context to the judge, not the entire conversation
- **Justification First**: Always require evidence and reasoning BEFORE the score
- **Evidence-Based**: Every score must cite specific evidence (file paths, line numbers, quotes)
- **Bias Mitigation**: Explicitly warn against length bias, verbosity bias, and authority bias
- **Be Objective**: Base assessments on evidence and rubric definitions, not preferences
- **Be Specific**: Cite exact locations, not vague observations
- **Be Constructive**: Frame criticism as opportunities for improvement with impact context
- **Consider Context**: Account for stated constraints, complexity, and requirements
- **Report Confidence**: Lower confidence when evidence is ambiguous or criteria unclear
- **Single Judge**: This command uses one focused judge for context isolation

## Notes

- This is a report-only command: it evaluates but does not modify work
- The judge operates with fresh context for unbiased assessment
- Scores are calibrated to professional development standards
- Low scores indicate improvement opportunities, not failures
- Use the evaluation to inform next steps and iterations
- The pass threshold (3.5/5.0) represents acceptable quality for general use
- Adjust the threshold based on criticality (4.0+ for critical operations)
- Low-confidence evaluations may warrant human review