# Agent Evaluation (AI Agent Evals)

Based on Anthropic's "Demystifying evals for AI agents".
## When to use this skill
- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time
## Core Concepts

### Eval Evolution: Single-turn → Multi-turn → Agentic

| Type | Turns | State | Grading | Complexity |
|------|-------|-------|---------|------------|
| Single-turn | 1 | None | Simple | Low |
| Multi-turn | N | Conversation | Per-turn | Medium |
| Agentic | N | World + History | Outcome | High |
### 7 Key Terms

| Term | Definition |
|------|------------|
| Task | Single test case (prompt + expected outcome) |
| Trial | One agent run on a task |
| Grader | Scoring function (code/model/human) |
| Transcript | Full record of agent actions |
| Outcome | Final state for grading |
| Harness | Infrastructure running evals |
| Suite | Collection of related tasks |
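To make these terms concrete, here is one way a harness might represent them. This is a minimal sketch; the class and field names are illustrative, not taken from the article.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Single test case: a prompt plus an expected outcome."""
    id: str
    prompt: str
    expected_outcome: dict
    tags: list[str] = field(default_factory=list)

@dataclass
class Trial:
    """One agent run on a task, kept for grading and transcript review."""
    task_id: str
    transcript: list[dict]      # full record of agent actions
    outcome: dict               # final state handed to the grader
    score: float | None = None  # filled in by a grader
```

A suite is then just a list of `Task`s, and the harness is whatever runs trials and applies graders.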
## Instructions

### Step 1: Understand Grader Types

#### Code-based Graders (Recommended for Coding Agents)
- **Pros**: Fast, objective, reproducible
- **Cons**: Requires clear success criteria
- **Best for**: Coding agents, structured outputs
```python
# Example: Code-based grader
def grade_task(outcome: dict) -> float:
    """Grade coding task by test passage."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 1)
    return tests_passed / total_tests
```
```python
# SWE-bench style grader
import subprocess

def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run tests and check if the patch resolves the issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True,
    )
    return result.returncode == 0
```
#### Model-based Graders (LLM-as-Judge)

- **Pros**: Flexible, handles nuance
- **Cons**: Requires calibration, can be inconsistent
- **Best for**: Conversational agents, open-ended tasks
Example: LLM rubric for a customer support agent:

```yaml
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive
    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution
    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
```
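Once a judge model returns per-dimension scores, the weights above combine them into a single number. A minimal sketch, assuming the judge call happens elsewhere and `scores` maps dimension names to 1-5 integers:

```python
def combine_rubric_scores(scores: dict[str, int], rubric: dict) -> float:
    """Weighted average of per-dimension judge scores, normalized to 0-1."""
    total = 0.0
    for dim in rubric["dimensions"]:
        raw = scores[dim["name"]]    # judge's 1-5 score for this dimension
        normalized = (raw - 1) / 4   # map the 1-5 scale onto 0-1
        total += dim["weight"] * normalized
    return total

# e.g. {"empathy": 4, "resolution": 5, "efficiency": 3} -> 0.825
```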
#### Human Graders

- **Pros**: Highest accuracy, catches edge cases
- **Cons**: Expensive, slow, not scalable
- **Best for**: Final validation, ambiguous cases (see the routing sketch below)
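Where automated and human grading meet, a common setup is to route only ambiguous automated scores to humans. A minimal sketch, assuming trials carry a numeric `score`; the band boundaries are illustrative, not from the article:

```python
def needs_human_review(trial: dict, low: float = 0.4, high: float = 0.8) -> bool:
    """Route trials with ambiguous automated scores to the human grading queue.

    Bounds are illustrative; tune them per suite.
    """
    return low <= trial["score"] <= high
```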
### Step 2: Choose Strategy by Agent Type

#### 2.1 Coding Agents

**Benchmarks**:
- SWE-bench Verified: Real GitHub issues (40% → 80%+ achievable)
- Terminal-Bench: Complex terminal tasks
- Custom test suites with your codebase

**Grading Strategy**:
```python
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"]),
    }
```
**Key Metrics**:
- Test passage rate
- Build success
- Lint/style compliance
- Diff size (smaller is better; see the sketch below)
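Diff size is easy to measure from the original and patched sources. This sketch uses the standard difflib module and counts changed lines, which is one reasonable definition among several:

```python
import difflib

def diff_size(original: str, patched: str) -> int:
    """Count added plus removed lines between original and patched code."""
    diff = difflib.unified_diff(
        original.splitlines(), patched.splitlines(), lineterm=""
    )
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
```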
#### 2.2 Conversational Agents

**Benchmarks**:
- τ2-Bench: Multi-domain conversation
- Custom domain-specific suites

**Grading Strategy** (multi-dimensional):
```yaml
success_criteria:
  - empathy_score: ">= 4.0"
  - resolution_rate: ">= 0.9"
  - avg_turns: "<= 5"
  - escalation_rate: "<= 0.1"
```
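A small checker can enforce these thresholds mechanically. The sketch below hardcodes them as Python pairs mirroring the YAML; `metrics` is assumed to be a dict of measured values with the same keys:

```python
import operator

# Thresholds mirroring the success_criteria above
SUCCESS_CRITERIA = {
    "empathy_score": (operator.ge, 4.0),
    "resolution_rate": (operator.ge, 0.9),
    "avg_turns": (operator.le, 5),
    "escalation_rate": (operator.le, 0.1),
}

def meets_criteria(metrics: dict) -> bool:
    """True only if every measured metric clears its threshold."""
    return all(
        op(metrics[name], bound) for name, (op, bound) in SUCCESS_CRITERIA.items()
    )
```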
**Key Metrics**:
- Task resolution rate
- Customer satisfaction proxy
- Turn efficiency
- Escalation rate
#### 2.3 Research Agents

**Grading Dimensions**:
- **Grounding**: Claims backed by sources
- **Coverage**: All aspects addressed
- **Source Quality**: Authoritative sources used
```python
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"]),
    }
```

#### 2.4 Computer Use Agents

**Benchmarks**:
- WebArena: Web navigation tasks
- OSWorld: Desktop environment tasks

**Grading Strategy**:

```python
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome),
    }
```

### Step 3: Follow the 8-Step Roadmap

#### Step 0: Start Early (20-50 Tasks)
```bash
# Create initial eval suite structure
mkdir -p evals/{tasks,results,graders}
```

Start with representative tasks:
- Common use cases (60%)
- Edge cases (20%)
- Failure modes (20%)
#### Step 1: Convert Manual Tests

Transform existing QA tests into eval tasks:

```python
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", []),
    }
```

#### Step 2: Ensure Clarity + Reference Solutions
```yaml
# Good task definition
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
```

#### Step 3: Balance Positive/Negative Cases
```python
# Ensure eval suite balance
suite_composition = {
    "positive_cases": 0.5,  # Should succeed
    "negative_cases": 0.3,  # Should fail gracefully
    "edge_cases": 0.2,      # Boundary conditions
}
```

#### Step 4: Isolate Environments
```yaml
# Docker-based isolation for coding evals
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
```
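A rough sketch of what the harness might execute to enforce that isolation. The docker flags are standard CLI options matching the config above; the `run_trial.py` entrypoint is hypothetical:

```python
import subprocess

def run_isolated_trial(task_id: str) -> subprocess.CompletedProcess:
    """Run one trial in a locked-down container per the config above."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",   # network: isolated
            "--memory", "4g",      # resources.memory
            "--cpus", "2",         # resources.cpu
            "eval-sandbox:latest",
            "python", "run_trial.py", task_id,  # hypothetical entrypoint
        ],
        capture_output=True,
        timeout=300,               # timeout: 300s
    )
```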
#### Step 5: Focus on Outcomes, Not Paths

```python
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
```

#### Step 6: Always Read Transcripts
```python
# Transcript analysis for debugging
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript),
    }
```

#### Step 7: Monitor Eval Saturation
```python
# Detect when evals are no longer useful
def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    is_saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(r["passed"] for r in recent) / len(recent),
        "variance": calculate_variance(recent),
        "is_saturated": is_saturated,
        "recommendation": "Add harder tasks" if is_saturated else "Continue",
    }
```

#### Step 8: Long-term Maintenance
```yaml
# Eval suite maintenance checklist
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
```

### Step 4: Integrate with Production

#### CI/CD Integration
```yaml
# GitHub Actions example
name: Agent Evals
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

#### Production Monitoring
```python
import random

# Real-time eval sampling
class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1):
        self.sample_rate = sample_rate

    async def monitor(self, request, response):
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")
```

#### A/B Testing
```python
# Compare agent versions
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results),
    }
```
## Best Practices

### Do's ✅
- Start with 20-50 representative tasks
- Use code-based graders when possible
- Focus on outcomes, not paths
- Read transcripts for debugging
- Monitor for eval saturation
- Balance positive/negative cases
- Isolate eval environments
- Version your eval suites
### Don'ts ❌
- Don't over-rely on model-based graders without calibration
- Don't ignore failed evals (false negatives exist)
- Don't grade on intermediate steps
- Don't skip transcript analysis
- Don't use production data without sanitization
- Don't let eval suites become stale
## Success Patterns

### Pattern 1: Graduated Eval Complexity
- Level 1: Unit evals (single capability)
- Level 2: Integration evals (combined capabilities)
- Level 3: End-to-end evals (full workflows)
- Level 4: Adversarial evals (edge cases); see the runner sketch below
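One way to encode the levels in a harness, assuming each task carries a `level` tag and that you only advance once a level mostly passes; the 0.8 gate is an illustrative choice, not prescribed by the article:

```python
LEVELS = ["unit", "integration", "end_to_end", "adversarial"]

def run_graduated(tasks: list[dict], run_task) -> dict[str, float]:
    """Run tasks level by level; stop early if a level's pass rate drops below 0.8."""
    pass_rates: dict[str, float] = {}
    for level in LEVELS:
        batch = [t for t in tasks if t["level"] == level]
        if not batch:
            continue
        passed = sum(run_task(t) for t in batch)  # run_task returns True/False
        pass_rates[level] = passed / len(batch)
        if pass_rates[level] < 0.8:               # gate: fix this level first
            break
    return pass_rates
```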
### Pattern 2: Eval-Driven Development
1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite
### Pattern 3: Continuous Calibration
- Weekly: Review grader accuracy
- Monthly: Update rubrics based on feedback
- Quarterly: Full grader audit with human baseline (see the agreement sketch below)
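The audit reduces to comparing grader labels against human labels on the same trials. A minimal sketch using plain percent agreement; Cohen's kappa would additionally correct for chance agreement:

```python
def grader_agreement(grader_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of trials where the automated grader matches the human baseline."""
    assert len(grader_labels) == len(human_labels)
    matches = sum(g == h for g, h in zip(grader_labels, human_labels))
    return matches / len(grader_labels)

# e.g. grader_agreement([True, True, False], [True, False, False]) -> 2/3
```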
## Troubleshooting

**Problem**: Eval scores at 100%

**Solution**: Add harder tasks, check for eval saturation (Step 7)
**Problem**: Inconsistent model-based grader scores

**Solution**: Add more examples to rubric, use structured output, ensemble graders (a sketch follows)
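Ensembling damps judge variance by sampling the same rubric several times and taking the median. This sketch assumes a `judge_once` callable returning a numeric score, since no judge API is specified here:

```python
import statistics

def ensemble_grade(transcript: str, judge_once, n: int = 5) -> float:
    """Median of n independent judge samples; the median resists outlier scores."""
    scores = [judge_once(transcript) for _ in range(n)]
    return statistics.median(scores)
```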
**Problem**: Evals too slow for CI

**Solution**: Use toon mode, parallelize, sample subset for PR checks
**Problem**: Agent passes evals but fails in production

**Solution**: Add production failure cases to eval suite, increase diversity

## References

- Anthropic: "Demystifying evals for AI agents"
- SWE-bench
- WebArena
- τ2-Bench

## Examples

### Example 1: Simple Coding Agent Eval
```python
# Task definition
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"},
    ],
}
```
```python
# Grader (exec on untrusted code: run inside a sandbox)
def grade(task, outcome):
    namespace = {}
    exec(outcome["code"], namespace)   # defines fizzbuzz in namespace
    fizzbuzz = namespace["fizzbuzz"]
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0
```

### Example 2: Conversational Agent Eval with LLM Rubric

```yaml
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5
grader:
  type: model
  model: claude-3-5-sonnet-20241022
  rubric: |
    Score 1-5 on each dimension:
    - Empathy: Did agent acknowledge customer frustration?
    - Resolution: Was a clear solution offered?
    - Efficiency: Was issue resolved in reasonable turns?
```