# Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
## Runbook Templates

### Template 1: Service Outage Runbook

*[Service Name] Outage Runbook*
#### Overview

- **Service:** Payment Processing Service
- **Owner:** Platform Team
- **Slack:** #payments-incidents
- **PagerDuty:** payments-oncall
#### Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted? (see the query sketch after this checklist)
- [ ] Are there financial implications?
- [ ] What's the blast radius?
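To put a number on the traffic question quickly, compare the failing request rate against the total. A minimal sketch, assuming the Prometheus endpoint and `http_requests_total` metric used in the triage steps below:

```bash
# Rough percentage of requests failing over the last 5 minutes
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=100 * sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))" \
  | jq -r '.data.result[0].value[1]'
```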
#### Detection

##### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
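When paged, it can help to evaluate an alert's condition by hand rather than waiting for the next alerting cycle. A sketch against the Prometheus endpoint used elsewhere in this runbook; the metric names are assumptions, so substitute whatever your alert rules actually query:

```bash
# Manually evaluate the payment_error_rate > 5% condition
# (payment_requests_total is a hypothetical metric name)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(payment_requests_total{status='error'}[5m])) / sum(rate(payment_requests_total[5m])) > 0.05" \
  | jq '.data.result'   # a non-empty result means the condition is firing
```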
##### Dashboards
#### Initial Triage (First 5 Minutes)

**1. Assess Scope**

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
**2. Quick Health Checks** (bundled into a single script in the sketch below)

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history
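A minimal sketch that runs these checks in one pass, so the on-call engineer isn't retyping commands at 3 AM. The URLs and kubectl targets match this runbook; the script itself is a hypothetical helper, not an existing tool:

```bash
#!/usr/bin/env bash
# triage-sweep.sh: run the quick health checks in order (hypothetical helper)
set -uo pipefail

echo "== Service reachability =="
curl -sf -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  https://api.company.com/payments/health || echo "health check FAILED"

echo "== Pod status =="
kubectl get pods -n payments -l app=payment-service

echo "== DB connection pool metrics =="
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool || echo "no db_pool metrics found"

echo "== Recent deploys =="
kubectl rollout history deployment/payment-service -n payments | tail -5
```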
**3. Initial Classification**

| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

#### Mitigation Procedures

##### 4.1 Service Completely Down
**Step 1: Check pod status**

```bash
kubectl get pods -n payments
```

**Step 2: If pods are crash-looping, check logs**

```bash
kubectl logs -n payments -l app=payment-service --tail=100
```

**Step 3: Check recent deployments**

```bash
kubectl rollout history deployment/payment-service -n payments
```

**Step 4: ROLLBACK if recent deploy is suspect**

```bash
kubectl rollout undo deployment/payment-service -n payments
```

**Step 5: Scale up if resource constrained**

```bash
kubectl scale deployment/payment-service -n payments --replicas=10
```

**Step 6: Verify recovery**

```bash
kubectl rollout status deployment/payment-service -n payments
```
##### 4.2 High Latency

**Step 1: Check database connections**

```bash
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool
```
**Step 2: Check slow queries (if DB issue)**

```bash
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"
```

**Step 3: Kill long-running queries if needed**

```bash
# Substitute a pid reported by Step 2
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"
```
**Step 4: Check external dependency latency**

```bash
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
```

**Step 5: Enable circuit breaker if dependency is slow**

```bash
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
##### 4.3 Partial Failures (Specific Errors)

**Step 1: Identify error pattern**

```bash
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20
```
**Step 2: Check error tracking**

Go to Sentry: https://sentry.io/payments
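If the Sentry UI is slow or unreachable, the same issue list can be pulled over Sentry's REST API. A sketch only: the org/project slugs and the `$SENTRY_TOKEN` variable are assumptions, not values from this runbook:

```bash
# Top unresolved issues by frequency (org/project slugs are hypothetical)
curl -s -H "Authorization: Bearer $SENTRY_TOKEN" \
  "https://sentry.io/api/0/projects/company/payments/issues/?query=is:unresolved&sort=freq" \
  | jq -r '.[] | "\(.count)x \(.title)"'
```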
**Step 3: If a specific endpoint is failing, disable it behind a feature flag**

```bash
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
```
**Step 4: If data issue, check recent data changes**

```bash
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
    AND created_at > now() - interval '1 hour';"
```
##### 4.4 Traffic Surge

**Step 1: Check pod resource usage**

```bash
kubectl top pods -n payments
```
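`kubectl top` shows resource pressure, not traffic. To confirm an actual surge, compare the current request rate against an earlier baseline; a sketch using the Prometheus endpoint and metric assumed elsewhere in this runbook:

```bash
# Current req/s vs. the same window one hour ago
for offset in "" " offset 1h"; do
  curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=sum(rate(http_requests_total[5m]${offset}))" \
    | jq -r '.data.result[0].value[1]'
done
```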
**Step 2: Scale horizontally**

```bash
kubectl scale deployment/payment-service -n payments --replicas=20
```

**Step 3: Enable rate limiting**

```bash
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments
```
**Step 4: If attack, block suspicious IPs**

```bash
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
#### Verification Steps

**Verify service is healthy**

```bash
curl -s https://api.company.com/payments/health | jq
```

**Verify error rate is back to normal**

```bash
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
```

**Verify latency is acceptable**

```bash
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq
```

**Smoke test critical flows**

```bash
./scripts/smoke-test-payments.sh
```
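If `./scripts/smoke-test-payments.sh` does not exist yet in your repo, a minimal sketch of what it might cover; the health endpoint and Prometheus query come from this runbook, while the 1% threshold is an assumption to tune:

```bash
#!/usr/bin/env bash
# smoke-test-payments.sh: minimal post-incident smoke test (sketch)
set -euo pipefail

BASE_URL="${BASE_URL:-https://api.company.com/payments}"

# 1. Health endpoint must answer 2xx
curl -sf "$BASE_URL/health" > /dev/null && echo "health: OK"

# 2. 5xx ratio must be under 1% (threshold is an assumption)
ratio=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))" \
  | jq -r '.data.result[0].value[1] // "0"')
awk -v r="$ratio" 'BEGIN { exit !(r < 0.01) }' && echo "error ratio: OK ($ratio)"
```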
#### Rollback Procedures

**Rollback Kubernetes deployment**

```bash
kubectl rollout undo deployment/payment-service -n payments
```

**Rollback database migration (if applicable)**

```bash
./scripts/db-rollback.sh $MIGRATION_VERSION
```

**Rollback feature flag**

```bash
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
#### Escalation Matrix

| Condition | Escalate To | Contact |
|---|---|---|
| SEV1 unresolved for 15 min | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

#### Communication Templates

##### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```
##### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```
##### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
### Template 2: Database Incident Runbook
#### Quick Reference

| Issue | Command |
|---|---|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
#### Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';
```
#### Replication Lag

```sql
-- Check lag on replica
SELECT CASE
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
         ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
       END AS lag_seconds;
```

If lag exceeds 60 seconds, consider:

1. Check network between primary/replica
2. Check replica disk I/O
3. Consider failover if unrecoverable
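While investigating, a small poll loop shows whether lag is recovering or still growing. A sketch; `$REPLICA_HOST` and `$DB_USER` are assumed connection details:

```bash
# Poll replica lag every 10s (connection details are assumptions)
while true; do
  lag=$(psql -h "$REPLICA_HOST" -U "$DB_USER" -Atc \
    "SELECT COALESCE(extract(epoch from now() - pg_last_xact_replay_timestamp()), 0);")
  echo "$(date +%T) lag: ${lag}s"
  sleep 10
done
```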
#### Disk Space Critical

**Check disk usage**

```bash
df -h /var/lib/postgresql/data
```

**Find large tables**

```bash
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
         FROM pg_catalog.pg_statio_user_tables
         ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"
```

**VACUUM to reclaim space**

```bash
# Caution: VACUUM FULL takes an ACCESS EXCLUSIVE lock on the table;
# avoid running it against a hot table mid-incident.
psql -c "VACUUM FULL large_table;"
```

**If emergency, delete old data or expand disk**
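Before choosing between deleting data and expanding the disk, check how much headroom remains. A small sketch against the data directory used above:

```bash
# Warn when the Postgres data volume crosses 90% usage
usage=$(df -P /var/lib/postgresql/data | awk 'NR==2 { gsub("%", "", $5); print $5 }')
if [ "$usage" -ge 90 ]; then
  echo "CRITICAL: ${usage}% used: expand the disk or purge data now"
else
  echo "${usage}% used: some headroom remains"
fi
```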
## Best Practices

### Do's

- **Keep runbooks updated:** review after every incident
- **Test runbooks regularly:** game days, chaos engineering
- **Include rollback steps:** always have an escape hatch
- **Document assumptions:** what must be true for steps to work
- **Link to dashboards:** quick access during stress
### Don'ts

- **Don't assume knowledge:** write for 3 AM brain
- **Don't skip verification:** confirm each step worked
- **Don't forget communication:** keep stakeholders informed
- **Don't work alone:** escalate early
- **Don't skip postmortems:** learn from every incident