Error Coordinator

Purpose
Provides expertise in building resilient multi-agent systems with robust error handling, failure detection, and recovery mechanisms. Covers loop detection, hallucination mitigation, and self-healing agent workflows.
When to Use
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Mitigating hallucinations in agent outputs
- Implementing circuit breakers for agents
- Coordinating failure recovery across agents

Quick Start
Invoke this skill when:
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Coordinating failure recovery across agents
Do NOT invoke when:
- Organizing agent teams (use agent-organizer)
- Debugging application errors (use debugger)
- Handling production incidents (use incident-responder)
- Detecting code error patterns (use error-detective)

Decision Framework

Error Type Handling:
├── Transient failure → Retry with backoff
├── Rate limiting → Backoff + queue
├── Invalid output → Validation + retry with feedback
├── Loop detected → Break + escalate
├── Hallucination → Ground with context, retry
├── Agent timeout → Cancel + fallback
└── Cascading failure → Circuit breaker

Recovery Strategy:
├── Idempotent operation → Simple retry
├── Stateful operation → Checkpoint + resume
├── Critical path → Fallback agent
└── Best effort → Log + continue
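The "Transient failure → Retry with backoff" branch, combined with a hard retry limit, might be sketched as follows. This is a minimal illustration, not a reference implementation: `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the delay values are arbitrary defaults. It assumes the wrapped operation is idempotent, so repeating it is safe.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one delay per allowed retry: exponential growth, capped, with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point in [0, ceiling) so that many
        # callers failing at once do not all retry at the same instant.
        yield rng() * ceiling


def call_with_retry(operation, max_retries=3, base=0.5, cap=30.0, sleep=time.sleep):
    """Run `operation`, retrying transient failures at most `max_retries` times."""
    delays = backoff_delays(max_retries, base, cap)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the failure to the caller
            sleep(delay)
```

Only `TransientError` is retried here; any other exception propagates immediately, which keeps non-retryable failures (for example invalid output) from consuming the retry budget. The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting.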
Core Workflows

1. Loop Detection System
- Track agent invocation history
- Detect repeated state patterns
- Set maximum iteration limits
- Implement escape hatch triggers
- Log loop occurrences for analysis
- Escalate to supervisor or human

2. Hallucination Mitigation
- Ground responses with source data
- Implement output validation
- Cross-check with retrieval
- Add confidence scoring
- Flag low-confidence outputs
- Provide feedback for retry

3. Circuit Breaker Implementation
- Track failure rates per agent
- Define failure threshold
- Open circuit on threshold breach
- Provide fallback behavior
- Implement half-open state for testing
- Close circuit on recovery
- Monitor and alert on breaker state

Best Practices
- Implement timeouts for all agent calls
- Use exponential backoff with jitter
- Log all failures with full context
- Design for graceful degradation
- Test failure scenarios explicitly
- Monitor error rates and patterns

Anti-Patterns

| Anti-Pattern | Problem | Correct Approach |
|---|---|---|
| Infinite retries | Resource exhaustion | Max retry limits |
| Silent failures | Hidden problems | Log and alert |
| No timeouts | Hung processes | Always set timeouts |
| Same retry interval | Thundering herd | Exponential backoff |
| No fallbacks | Complete failure | Graceful degradation |
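The Loop Detection System workflow above could be approximated with a small tracker like the one below. It is a sketch under assumptions: `LoopDetector` and `LoopDetected` are hypothetical names, the limits are placeholder values, and the agent state is assumed to be JSON-serializable (anything else is stringified). Logging and escalation are left to the caller that catches the exception.

```python
import hashlib
import json
from collections import Counter


class LoopDetected(Exception):
    """Raised when an agent appears stuck; the caller should break and escalate."""


class LoopDetector:
    def __init__(self, max_iterations=25, max_repeats=3):
        self.max_iterations = max_iterations  # total steps allowed per run
        self.max_repeats = max_repeats        # identical states that count as a loop
        self.history = []                     # ordered fingerprints of recorded steps
        self.seen = Counter()                 # occurrences of each fingerprint

    @staticmethod
    def fingerprint(agent, state):
        # Sorting keys makes the fingerprint independent of dict ordering.
        payload = json.dumps({"agent": agent, "state": state},
                             sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, agent, state):
        """Record one agent step; raise LoopDetected if a limit is hit."""
        key = self.fingerprint(agent, state)
        self.history.append(key)
        self.seen[key] += 1
        if self.seen[key] >= self.max_repeats:
            raise LoopDetected(
                f"{agent} reached the same state {self.seen[key]} times")
        if len(self.history) > self.max_iterations:
            raise LoopDetected(
                f"iteration limit of {self.max_iterations} exceeded")
```

Calling `record` before each agent invocation gives a single place to catch `LoopDetected`, log the occurrence, and hand control to a supervisor or human. Exact-match fingerprints only catch identical states; detecting near-duplicates would need a different comparison.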
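The Circuit Breaker Implementation workflow above can be illustrated with a three-state breaker (closed, open, half-open). This is a simplified sketch: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the thresholds are placeholders, and it counts consecutive failures rather than a true failure rate over a time window. One instance per agent is assumed; monitoring and alerting are omitted.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, operation, fallback=None):
        """Run `operation` through the breaker, using `fallback` when it cannot succeed."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"  # let one trial call through
            elif fallback is not None:
                return fallback()
            else:
                raise CircuitOpenError("circuit is open")
        try:
            result = operation()
        except Exception:
            self._on_failure()
            if fallback is not None:
                return fallback()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call in half-open state reopens the circuit immediately.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller might wrap each agent invocation as `breaker.call(lambda: agent.run(task), fallback=lambda: cached_answer)`, where `agent`, `task`, and `cached_answer` are placeholders for whatever the surrounding system provides.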