Chaos Engineering & Resilience Testing
1. DEFINE steady state (normal metrics: error rate, latency, throughput)
2. HYPOTHESIZE that the system stays in steady state during failure
3. INJECT real-world failures (network, instance, disk, CPU)
4. OBSERVE and measure deviation from steady state
5. FIX weaknesses discovered, document runbooks, repeat

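The loop above can be sketched as a small harness. This is a minimal sketch under stated assumptions, not a definitive implementation: `runExperiment` and its callbacks (`measure`, `isSteady`, `inject`, `revert`) are hypothetical names standing in for whatever monitoring and fault-injection tooling is actually in use.

```javascript
// Minimal sketch of the define, hypothesize, inject, observe loop.
// All names are hypothetical; plug in real monitoring and injection tooling.
async function runExperiment({ name, measure, isSteady, inject, revert }) {
  // DEFINE: capture a baseline and refuse to start from an unhealthy system.
  const baseline = await measure();
  if (!isSteady(baseline)) {
    return { name, status: 'skipped', reason: 'not in steady state before injection', baseline };
  }

  let observed;
  try {
    // INJECT: introduce the failure.
    await inject();
    // OBSERVE: measure while the failure is active.
    observed = await measure();
  } finally {
    // Always undo the failure, even if injection or measurement throws.
    await revert();
  }

  // HYPOTHESIZE: the hypothesis holds if the system stayed in steady state.
  const hypothesisHeld = isSteady(observed);
  return {
    name,
    status: hypothesisHeld ? 'passed' : 'weakness-found',
    baseline,
    observed,
  };
}
```

A `weakness-found` result is the input to the FIX step: it is the cue to repair the weakness and write or update a runbook before repeating the experiment.
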
Quick Chaos Steps:
- Start small: Dev → Staging → 1% prod → gradual rollout
- Define clear rollback triggers (error_rate > 5%)
- Measure blast radius, never exceed planned scope
- Document findings → runbooks → improved resilience

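A rollback trigger such as `error_rate > 5%` needs a window to be computed over. The following is a minimal sketch of one way to do that with a sliding window of recent requests. `RollbackTrigger`, `windowSize`, and `minSamples` are hypothetical names and defaults chosen for illustration, not part of any tool listed here.

```javascript
// Sliding-window rollback trigger (hypothetical helper, not a library API).
// Fires when the error rate over the last `windowSize` requests exceeds `maxErrorRate`.
class RollbackTrigger {
  constructor({ maxErrorRate = 0.05, windowSize = 100, minSamples = 20 } = {}) {
    this.maxErrorRate = maxErrorRate;
    this.windowSize = windowSize;
    this.minSamples = minSamples; // avoid firing on one or two early failures
    this.window = [];             // true = request failed
  }

  // Record one request outcome; returns true when rollback should start.
  record(failed) {
    this.window.push(Boolean(failed));
    if (this.window.length > this.windowSize) this.window.shift();
    return this.shouldRollback();
  }

  // Fraction of failed requests in the current window (0 when empty).
  errorRate() {
    if (this.window.length === 0) return 0;
    const failures = this.window.filter(Boolean).length;
    return failures / this.window.length;
  }

  shouldRollback() {
    return this.window.length >= this.minSamples && this.errorRate() > this.maxErrorRate;
  }
}
```

The `minSamples` guard is a design choice: without it, a single failed request at the start of an experiment would read as a 100% error rate and abort the run immediately.
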
Critical Success Factors:
- Controlled experiments with automatic rollback
- Steady state must be measurable
- Start in non-production, graduate to production

Quick Reference Card

When to Use

- Distributed systems validation
- Disaster recovery testing
- Building confidence in fault tolerance
- Pre-production resilience verification

Failure Types to Inject

| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |

Blast Radius Progression

    Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
        ↓           ↓         ↓                   ↓
      Learn     Validate   Careful         Full confidence

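The progression above can be encoded as an ordered list of stages that an experiment runner walks through, stopping at the first stage where steady state does not hold. This is a sketch only: `STAGES`, `progress`, and the `runStage` callback are hypothetical names, and treating Dev and Staging as 0% of production traffic is an assumption.

```javascript
// Blast radius stages from the progression above, as a fraction of production traffic.
// Dev and staging carry no production traffic, hence 0 (an assumption).
const STAGES = [
  { name: 'dev', prodTraffic: 0 },
  { name: 'staging', prodTraffic: 0 },
  { name: 'prod-1%', prodTraffic: 0.01 },
  { name: 'prod-10%', prodTraffic: 0.10 },
  { name: 'prod-50%', prodTraffic: 0.50 },
  { name: 'prod-100%', prodTraffic: 1.0 },
];

// Walks the stages in order and stops at the first one whose experiment fails.
// `runStage` is a hypothetical callback returning true when steady state held.
function progress(runStage, stages = STAGES) {
  const completed = [];
  for (const stage of stages) {
    if (!runStage(stage)) {
      return { completed, haltedAt: stage.name };
    }
    completed.push(stage.name);
  }
  return { completed, haltedAt: null };
}
```

Halting rather than skipping ahead keeps the blast radius from growing past a stage that has already exposed a weakness.
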
Steady State Metrics

| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | -20% |

Chaos Experiment Structure

    // Chaos experiment definition
    const experiment = {
      name: 'Database latency injection',
      hypothesis: 'System handles 500ms DB latency gracefully',
      steadyState: {
        errorRate: '< 0.1%',
        p99Latency: '< 300ms'
      },
      method: {
        type: 'network-latency',
        target: 'database',
        delay: '500ms',
        duration: '5m'
      },
      rollback: {
        automatic: true,
        trigger: 'errorRate > 5%'
      }
    };

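The `steadyState` entries in the definition above are strings, so something has to turn them into comparisons. The sketch below shows one possible approach. The string grammar (an operator, a number, and an optional `%`, `ms`, or `s` unit) is inferred from the examples and is an assumption, as are the helper names `parseThreshold` and `steadyStateViolations`.

```javascript
// Parses threshold strings such as '< 0.1%', '< 300ms' or '> 5%'.
// The string format is inferred from the experiment definition; it is an assumption.
function parseThreshold(text) {
  const match = /^\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(%|ms|s)?\s*$/.exec(text);
  if (!match) throw new Error(`Unrecognized threshold: ${text}`);
  const [, op, rawValue, unit] = match;
  let value = Number(rawValue);
  if (unit === '%') value /= 100;   // percentages become fractions
  if (unit === 's') value *= 1000;  // durations are normalized to milliseconds
  return { op, value };
}

const COMPARE = {
  '<': (a, b) => a < b,
  '<=': (a, b) => a <= b,
  '>': (a, b) => a > b,
  '>=': (a, b) => a >= b,
};

// Returns the names of metrics that violate the steady-state definition.
// A metric missing from `observed` counts as a violation.
function steadyStateViolations(steadyState, observed) {
  return Object.entries(steadyState)
    .filter(([metric, text]) => {
      const { op, value } = parseThreshold(text);
      return !COMPARE[op](observed[metric], value);
    })
    .map(([metric]) => metric);
}
```

For example, `steadyStateViolations({ errorRate: '< 0.1%', p99Latency: '< 300ms' }, { errorRate: 0.0004, p99Latency: 420 })` returns `['p99Latency']`, given error rates expressed as fractions and latencies in milliseconds.
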
Agent-Driven Chaos

    // qe-chaos-engineer runs controlled experiments
    await Task("Chaos Experiment", {
      target: 'payment-service',
      failure: 'terminate-random-instance',
      blastRadius: '10%',
      duration: '5m',
      steadyStateHypothesis: {
        metric: 'success-rate',
        threshold: 0.99
      },
      autoRollback: true
    }, "qe-chaos-engineer");

    // Validates:
    // - System recovers automatically
    // - Error rate stays within threshold
    // - No data loss
    // - Alerts triggered appropriately

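To make the `steadyStateHypothesis` and `blastRadius` fields above concrete, here is a hedged sketch of how they might be evaluated. It does not describe how `qe-chaos-engineer` works internally: `evaluateHypothesis`, `maxAffectedInstances`, and the `counts` shape are hypothetical, and reading `blastRadius` as a share of instances is an assumption.

```javascript
// Evaluates a hypothesis shaped like the `steadyStateHypothesis` field above.
// `counts` is a hypothetical summary of requests observed during the experiment.
function evaluateHypothesis(hypothesis, counts) {
  const total = counts.succeeded + counts.failed;
  if (total === 0) {
    // No traffic means no evidence; treat as inconclusive rather than passing.
    return { held: false, successRate: null, reason: 'no requests observed' };
  }
  const successRate = counts.succeeded / total;
  return {
    held: successRate >= hypothesis.threshold,
    successRate,
    reason: null,
  };
}

// Caps the number of affected instances at the planned blast radius.
// Rounds down so the experiment never exceeds its scope.
function maxAffectedInstances(instanceCount, blastRadius) {
  const percent = Number.parseFloat(blastRadius); // '10%' parses to 10
  if (!Number.isFinite(percent) || percent < 0 || percent > 100) {
    throw new Error(`Invalid blast radius: ${blastRadius}`);
  }
  // Multiply before dividing to avoid floating-point undershoot.
  return Math.floor((instanceCount * percent) / 100);
}
```

Rounding down means a 10% blast radius on fewer than ten instances affects none of them, which errs on the side of staying inside the planned scope.
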
Agent Coordination Hints

Memory Namespace

    aqe/chaos-engineering/
    ├── experiments/    - Experiment definitions & results
    ├── steady-states/  - Baseline measurements
    ├── runbooks/       - Generated recovery procedures
    └── blast-radius/   - Impact analysis

Fleet Coordination

    const chaosFleet = await FleetManager.coordinate({
      strategy: 'chaos-engineering',
      agents: [
        'qe-chaos-engineer',           // Experiment execution
        'qe-performance-tester',       // Baseline metrics
        'qe-production-intelligence'   // Production monitoring
      ],
      topology: 'sequential'
    });

Related Skills

- shift-right-testing - Production testing
- performance-testing - Load testing
- test-environment-management - Environment stability

Remember

Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation. Generates runbooks from experiment results.