Customization
Before executing, check for user customizations at: ~/.claude/skills/CORE/USER/SKILLCUSTOMIZATIONS/Evals/
If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)
You MUST send this notification BEFORE doing anything else when this skill is invoked.
Send voice notification:
curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &
Output text notification:
Running the WorkflowName workflow in the Evals skill to ACTION...
This is not optional. Execute this curl command immediately upon skill invocation.
Evals - AI Agent Evaluation Framework
Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).
Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.
When to Activate

- "run evals", "test this agent", "evaluate", "check quality", "benchmark"
- "regression test", "capability test"
- Compare agent behaviors across changes
- Validate agent workflows before deployment
- Verify ALGORITHM ISC rows
- Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |

Evaluation Types

| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

- pass@k: Probability of at least 1 success in k trials (measures capability)
- pass^k: Probability all k trials succeed (measures consistency/reliability)

Workflow Routing

| Trigger | Workflow |
|---|---|
| "run evals", "evaluate suite" | Run suite via Tools/AlgorithmBridge.ts |
| "log failure" | Log failure via Tools/FailureToTask.ts log |
| "convert failures" | Convert to tasks via Tools/FailureToTask.ts convert-all |
| "create suite" | Create suite via Tools/SuiteManager.ts create |
| "check saturation" | Check via Tools/SuiteManager.ts check-saturation |

Quick Reference

CLI Commands
Run an eval suite
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s
Log a failure for later conversion
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts log "description" -c category -s severity
Convert failures to test tasks
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts convert-all
Manage suites
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts create
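
The pass@k and pass^k definitions under Key Metrics can be made concrete with a small calculation. The sketch below assumes trials are independent and share a single success probability p, estimated as the observed pass fraction. The function names are illustrative and are not taken from Tools/TrialRunner.ts.

```typescript
// Sketch only: names are hypothetical, not the TrialRunner.ts API.
// Assumes independent trials with a shared per-trial success probability p.

/** Fraction of successful trials, used as the estimate of p. */
function successRate(trials: boolean[]): number {
  if (trials.length === 0) throw new Error("need at least one trial");
  return trials.filter(Boolean).length / trials.length;
}

/** pass@k: probability that at least one of k trials succeeds. */
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

/** pass^k: probability that all k trials succeed. */
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

const p = successRate([true, true, false, true]); // 3 of 4 trials passed, p = 0.75
console.log(passAtK(p, 3).toFixed(3));  // prints "0.984"
console.log(passHatK(p, 3).toFixed(3)); // prints "0.422"
```

The same agent looks strong on pass@3 and weak on pass^3, which is why capability suites tend to track the first metric and regression suites the second.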
ALGORITHM Integration
Evals is a verification method for THE ALGORITHM ISC rows:
Run eval and update ISC row
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u
ISC rows can specify eval verification:
| # | What Ideal Looks Like | Verify |
|---|---|---|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
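
As an illustration of how a Verify cell such as eval:auth-security could be mapped to a suite name, here is a minimal sketch. The parsing rule (an `eval:` prefix followed by a non-empty suite name) is an assumption read off the table above, and `parseEvalVerify` is a hypothetical helper, not a function known to exist in Tools/AlgorithmBridge.ts.

```typescript
// Sketch only: parseEvalVerify is hypothetical, and the "eval:<suite>" rule
// is inferred from the example rows rather than from AlgorithmBridge.ts.

/** Returns the suite name for an "eval:<suite>" Verify cell, or null otherwise. */
function parseEvalVerify(verifyCell: string): string | null {
  const prefix = "eval:";
  const value = verifyCell.trim();
  if (!value.startsWith(prefix)) return null;
  const suite = value.slice(prefix.length).trim();
  return suite.length > 0 ? suite : null;
}

console.log(parseEvalVerify("eval:auth-security")); // prints "auth-security"
console.log(parseEvalVerify("manual review"));      // prints null
```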
Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|---|---|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |

Model-Based (Nuanced)

| Grader | Use Case |
|---|---|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| pairwise_comparison | Compare to reference with position swap |

Domain Patterns
Pre-configured grader stacks for common agent types:
| Domain | Primary Graders |
|---|---|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |
See Data/DomainPatterns.yaml for full configurations.
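
The pairwise_comparison grader is described as comparing to a reference "with position swap". One common reading of that technique is to ask the judge twice with the two answers in opposite order, and to count a win only when both rounds agree. The sketch below shows just the step that combines the two verdicts. The type and function names are hypothetical, and the tie-on-disagreement rule is an assumption, not the documented behavior of the grader in Graders/ModelBased/.

```typescript
// Sketch only: names and the tie-on-disagreement rule are assumptions.

type PositionVerdict = "first" | "second" | "tie"; // which displayed slot the judge preferred
type Outcome = "candidate" | "reference" | "tie";

/**
 * Round 1 shows the candidate first and the reference second.
 * Round 2 swaps them: reference first, candidate second.
 * A winner is reported only if both rounds pick the same answer.
 */
function resolvePairwise(round1: PositionVerdict, round2: PositionVerdict): Outcome {
  const r1: Outcome =
    round1 === "first" ? "candidate" : round1 === "second" ? "reference" : "tie";
  const r2: Outcome =
    round2 === "first" ? "reference" : round2 === "second" ? "candidate" : "tie";
  return r1 === r2 ? r1 : "tie";
}

console.log(resolvePairwise("first", "second")); // prints "candidate"
console.log(resolvePairwise("first", "first"));  // prints "tie" (judge followed position, not content)
```

Treating disagreement as a tie keeps a judge that simply favors whichever answer appears first from producing spurious wins.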
Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30
    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]
    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md

  trials: 3
  pass_threshold: 0.75
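
The schema assigns each grader a weight and sets a pass_threshold, but it does not state how the two interact. A plausible reading is a weighted average of per-grader scores in the range 0 to 1, compared against the threshold. The sketch below implements that reading. `GraderResult`, `weightedScore`, and `trialPasses` are hypothetical names, and the actual combination rule in Tools/TrialRunner.ts may differ.

```typescript
// Sketch only: assumes each grader returns a score in [0, 1] and that a trial
// passes when the weight-normalized average meets pass_threshold.

interface GraderResult {
  type: string;   // grader type, e.g. "binary_tests"
  weight: number; // weight from the task schema
  score: number;  // assumed to lie in [0, 1]
}

/** Weighted average of grader scores, normalized by the total weight. */
function weightedScore(results: GraderResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight <= 0) throw new Error("total grader weight must be positive");
  const weighted = results.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weighted / totalWeight;
}

/** True when the combined score reaches the task's pass_threshold. */
function trialPasses(results: GraderResult[], passThreshold: number): boolean {
  return weightedScore(results) >= passThreshold;
}

// Weights taken from the example task above; the scores are made up.
const exampleResults: GraderResult[] = [
  { type: "binary_tests", weight: 0.3, score: 1 },
  { type: "tool_calls", weight: 0.2, score: 1 },
  { type: "llm_rubric", weight: 0.5, score: 0 },
];
console.log(trialPasses(exampleResults, 0.75)); // prints false
```

Under this reading, the 0.50 weight on llm_rubric means a failed security review cannot be offset by passing tests and a correct tool sequence.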
Resource Index

| Resource | Purpose |
|---|---|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |

Key Principles (from Anthropic)

1. Start with 20-50 real failures - Don't overthink, capture what actually broke
2. Unambiguous tasks - Two experts should reach identical verdicts
3. Balanced problem sets - Test both "should do" AND "should NOT do"
4. Grade outputs, not paths - Don't penalize valid creative solutions
5. Calibrate LLM judges - Against human expert judgment
6. Check transcripts regularly - Verify graders work correctly
7. Monitor saturation - Graduate to regression when hitting 95%+
8. Build infrastructure early - Evals shape how quickly you can adopt new models

Related

- ALGORITHM: Evals is a verification method
- Science: Evals implements scientific method
- Browser: For visual verification graders