# Runbook Creation Overview
Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.
## When to Use

- Incident response procedures
- Standard operating procedures (SOPs)
- On-call playbooks
- System maintenance guides
- Disaster recovery procedures
- Deployment runbooks
- Escalation procedures
- Service restoration guides

## Incident Response Runbook Template
# Incident Response Runbook
## Quick Reference
**Severity Levels:**
- P0 (Critical): Complete outage, data loss, security breach
- P1 (High): Major feature down, significant user impact
- P2 (Medium): Minor feature degradation, limited user impact
- P3 (Low): Cosmetic issues, minimal user impact

**Response Times:**
- P0: Immediate (24/7)
- P1: 15 minutes (business hours), 1 hour (after hours)
- P2: 4 hours (business hours)
- P3: Next business day

**Escalation Contacts:**
- On-call Engineer: PagerDuty rotation
- Engineering Manager: +1-555-0100
- VP Engineering: +1-555-0101
- CTO: +1-555-0102
## Table of Contents
- Service Down
- Database Issues
- High CPU/Memory Usage
- API Performance Degradation
- Security Incidents
- Data Loss Recovery
- Rollback Procedures
## Service Down

### Symptoms
- Health check endpoint returning 500 errors
- Users unable to access application
- Load balancer showing all instances unhealthy
- Alerts: `service_down`, `health_check_failed`

**Severity:** P0 (Critical)
### Initial Response (5 minutes)

1. **Acknowledge the incident**
   - Acknowledge the page in PagerDuty
   - Post in the #incidents Slack channel
2. **Create an incident channel**
   - Create a Slack channel: #incident-YYYY-MM-DD-service-down
   - Post incident details and status updates there (this step can be scripted; see the sketch below)
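If your team automates this step, the Slack Web API can create the channel and post the initial update. A minimal sketch, assuming a bot token in `SLACK_TOKEN` with the `channels:manage` and `chat:write` scopes (the token variable and message text are illustrative):

```bash
#!/usr/bin/env bash
# Sketch: create the incident channel and post the initial update via the Slack Web API.
set -euo pipefail

CHANNEL="incident-$(date +%Y-%m-%d)-service-down"

# Create the channel (Slack channel names must be lowercase)
curl -s -X POST https://slack.com/api/conversations.create \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"${CHANNEL}\"}"

# Post the initial incident details
curl -s -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"channel\": \"#${CHANNEL}\", \"text\": \"🚨 P0: Service Down. Status: investigating.\"}"
```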
3. **Assess impact**

```bash
# Check service status
kubectl get pods -n production

# Check recent deployments
kubectl rollout history deployment/api -n production

# Check logs
kubectl logs -f deployment/api -n production --tail=100
```
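During a P0 it helps to capture this evidence once and share it in the incident channel rather than re-running commands. A minimal sketch (the output path is illustrative):

```bash
#!/usr/bin/env bash
# Sketch: capture a triage snapshot to a file for sharing in the incident channel.
OUT="/tmp/triage-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "=== Pod status ==="
  kubectl get pods -n production
  echo "=== Recent deployments ==="
  kubectl rollout history deployment/api -n production
  echo "=== Last 100 log lines ==="
  kubectl logs deployment/api -n production --tail=100
} > "$OUT" 2>&1
echo "Triage snapshot written to $OUT"
```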
### Investigation Steps

#### Check Application Health
1. Check pod status

```bash
kubectl get pods -n production -l app=api
```

Expected output: all pods `Running`

```
NAME                    READY   STATUS    RESTARTS   AGE
api-7d8c9f5b6d-4xk2p    1/1     Running   0          2h
api-7d8c9f5b6d-7nm8r    1/1     Running   0          2h
```
2. Check pod logs for errors

```bash
kubectl logs -f deployment/api -n production --tail=100 | grep -i error
```
3. Check application endpoints

```bash
curl -v https://api.example.com/health
curl -v https://api.example.com/api/v1/status
```
4. Check database connectivity

```bash
kubectl exec -it deployment/api -n production -- sh
psql $DATABASE_URL -c "SELECT 1"
```
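For a single non-interactive check that can be scripted, the two steps above collapse into one `kubectl exec`, assuming the `psql` binary is present in the api image:

```bash
# One-shot connectivity check; assumes psql is installed in the api image
kubectl exec deployment/api -n production -- \
  psql "$DATABASE_URL" -c "SELECT 1"
```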
#### Check Infrastructure
1. Check load balancer target health

```bash
# describe-target-health is an elbv2 (ALB/NLB) command, not classic elb
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table
```
2. Check DNS resolution

```bash
dig api.example.com
nslookup api.example.com
```
3. Check SSL certificates

```bash
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates
```
4. Check network connectivity

```bash
# curl cannot speak the Postgres protocol, so test raw TCP reachability instead
# (assumes nc is available in the image)
kubectl exec -it deployment/api -n production -- \
  nc -zv database.example.com 5432
```
#### Check Database
1. Check database connections

```bash
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity"
```
2. Check for locks

```bash
psql $DATABASE_URL -c "
  SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
"
```
3. Check database size

```bash
psql $DATABASE_URL -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))
"
```
4. Check long-running queries

```bash
psql $DATABASE_URL -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10
"
```
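For queries that were slow in the past rather than right now, `pg_stat_statements` keeps cumulative statistics. A sketch, assuming the extension is installed (the `mean_exec_time` column is named `mean_time` before PostgreSQL 13):

```bash
# Top queries by average execution time; requires the pg_stat_statements extension
psql $DATABASE_URL -c "
  SELECT substring(query, 1, 80) AS query, calls, mean_exec_time
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10
"
```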
### Resolution Steps

#### Option 1: Restart Pods (Quick Fix)

```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```
#### Option 2: Scale Up (If Overloaded)

```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```
#### Option 3: Rollback (If Bad Deploy)

```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to previous version
kubectl rollout undo deployment/api -n production

# Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```
#### Option 4: Database Connection Reset

```bash
# If the database connection pool is exhausted
kubectl exec -it deployment/api -n production -- sh
kill -HUP 1  # Reload the process, resetting its connections
```
```bash
# Or terminate the application's idle connections from the database side
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'api' AND state = 'idle'
"
```
### Verification

1. Check health endpoint

```bash
curl https://api.example.com/health
```

Expected: HTTP 200 with a healthy status response
2. Check API endpoints

```bash
curl https://api.example.com/api/v1/users
```

Expected: valid JSON response
3. Check metrics

Visit https://grafana.example.com and verify:

- Error rate < 1%
- Response time < 500ms
- All pods healthy
4. Check logs for errors

```bash
kubectl logs deployment/api -n production --tail=100 | grep -i error
```

Expected: no new errors
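These checks can be wrapped into one script that polls until the service is healthy, which is handy while a rollout completes. A minimal sketch (the retry count and interval are illustrative):

```bash
#!/usr/bin/env bash
# Sketch: poll the health endpoint until it returns 200, then smoke-check the API.

for i in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
  if [ "$code" = "200" ]; then
    echo "Health check passed after $i attempt(s)"
    break
  fi
  echo "Attempt $i: got HTTP $code, retrying in 10s..."
  sleep 10
done

# Smoke-check a real endpoint and scan recent logs for errors
curl -sf https://api.example.com/api/v1/users > /dev/null && echo "API smoke check passed"
kubectl logs deployment/api -n production --tail=100 | grep -i error || echo "No recent errors in logs"
```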
### Communication

**Initial Update (within 5 minutes):**

```
🚨 INCIDENT: Service Down

Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC

We are investigating reports of users unable to access the application. Our team is working to identify the root cause.

Next update in 15 minutes.
```
**Progress Update (every 15 minutes):**

```
🔍 UPDATE: Service Down

Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes

We have identified the issue and are implementing a fix.
```
**Resolution Update:**

```
✅ RESOLVED: Service Down

Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes

The service is now fully operational. We are monitoring closely and will conduct a post-mortem to prevent future occurrences.
```
### Post-Incident

1. Create a post-mortem document
   - Timeline of events
   - Root cause analysis
   - Action items to prevent recurrence

2. Update monitoring
   - Add alerts for this scenario
   - Improve detection time

3. Update the runbook
   - Document any new findings
   - Add shortcuts for faster resolution

## Database Issues

### High Connection Count
**Symptoms:**
- Database rejecting new connections
- Error: "too many connections"
- Alert: `db_connections_high`
**Quick Fix:**

1. Check connection count

```bash
psql $DATABASE_URL -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name
"
```

2. Kill idle connections

```bash
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes'
"
```
3. Restart connection pools

```bash
kubectl rollout restart deployment/api -n production
```
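To keep one service from exhausting the pool again, a per-role connection cap can help. A sketch, assuming the application connects as a dedicated `api` role (the role name and limit are illustrative; pick a limit comfortably below `max_connections`):

```bash
# Cap the api role at 50 concurrent connections
psql $DATABASE_URL -c "ALTER ROLE api CONNECTION LIMIT 50"
```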
### Slow Queries

**Symptoms:**
- API response times > 5 seconds
- Database CPU at 100%
- Alert: `slow_query_detected`
**Investigation:**

```sql
-- Find slow queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for missing indexes (tables with many sequential scans)
SELECT schemaname, tablename, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill long-running query (if needed)
SELECT pg_terminate_backend(12345);  -- Replace with actual PID
```
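Once a suspect query is identified, `EXPLAIN (ANALYZE, BUFFERS)` shows where the time goes (sequential scans, sorts, and so on). A sketch with a hypothetical `users`/`email` query; prefer running it against a replica, since `ANALYZE` actually executes the statement:

```bash
# Run on a replica if possible: ANALYZE executes the query for real
psql $DATABASE_URL -c "
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM users WHERE email = 'user@example.com'
"
```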
## High CPU/Memory Usage

### Symptoms

- Pods being OOMKilled
- Response times increasing
- Alerts: `high_memory_usage`, `high_cpu_usage`

### Investigation
1. Check pod resources

```bash
kubectl top pods -n production
```

2. Check resource limits

```bash
kubectl describe pod <pod-name> -n production
```
3. Check for memory leaks

```bash
kubectl logs deployment/api -n production | grep -i "out of memory"
```
4. Profile the application (if needed)

```bash
kubectl exec -it <pod-name> -n production -- sh
# Then run a profiler appropriate to the runtime: node --inspect, py-spy, etc.
```
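How you attach a profiler depends on the runtime. Two hedged examples, assuming the relevant tooling is available in the image (targeting PID 1 assumes the app is the container's main process):

```bash
# Node.js: SIGUSR1 tells the process to open its debug inspector (port 9229)
kubectl exec deployment/api -n production -- kill -USR1 1

# Python: dump thread stacks with py-spy (must be installed in the image;
# may require the SYS_PTRACE capability on the container)
kubectl exec deployment/api -n production -- py-spy dump --pid 1
```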
### Resolution

#### Option 1: Increase resources

```bash
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi
```
#### Option 2: Scale horizontally

```bash
kubectl scale deployment/api -n production --replicas=6
```
#### Option 3: Restart problematic pods

```bash
kubectl delete pod <pod-name> -n production
```
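To decide which pod to restart, sort by resource usage first. A sketch (`kubectl top` requires metrics-server in the cluster):

```bash
# Show the heaviest memory consumers, then delete the worst offender by name
kubectl top pods -n production -l app=api --sort-by=memory | head -n 5
```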
## Rollback Procedures

### Application Rollback
1. List deployment history

```bash
kubectl rollout history deployment/api -n production
```

2. Check a specific revision

```bash
kubectl rollout history deployment/api -n production --revision=5
```

3. Rollback to previous

```bash
kubectl rollout undo deployment/api -n production
```

4. Rollback to a specific revision

```bash
kubectl rollout undo deployment/api -n production --to-revision=5
```

5. Verify rollback

```bash
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```
### Database Rollback
1. Check migration status

```bash
npm run db:migrate:status
```

2. Rollback the last migration

```bash
npm run db:migrate:undo
```

3. Rollback to a specific migration

```bash
# npm requires -- before arguments meant for the script itself
npm run db:migrate:undo -- --to 20250115120000-migration-name
```

4. Verify database state

```bash
psql $DATABASE_URL -c "\dt"
```
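Schema rollbacks can destroy data (dropped columns, reverted transforms), so it is worth snapshotting first. A minimal sketch using `pg_dump` (the dump path is illustrative; a managed-database snapshot works just as well):

```bash
# Take a logical backup before undoing migrations
pg_dump "$DATABASE_URL" > "/tmp/pre-rollback-$(date +%Y%m%d-%H%M%S).sql"
```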
## Escalation Path

**Level 1 - On-call Engineer (You)**
- Initial response and investigation
- Attempt standard fixes from this runbook

**Level 2 - Senior Engineers**
- Escalate if not resolved in 30 minutes
- Escalate if the issue is complex or unclear
- Contact via PagerDuty or Slack

**Level 3 - Engineering Manager**
- Escalate if not resolved in 1 hour
- Escalate if cross-team coordination is needed

**Level 4 - VP Engineering / CTO**
- Escalate for P0 incidents lasting > 2 hours
- Escalate for security breaches
- Escalate for data loss

## Useful Commands
### Kubernetes

```bash
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
```
### Database

```bash
psql $DATABASE_URL -c "SELECT version()"
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity"
```
### AWS

```bash
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...
```
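As one concrete form of the truncated CloudWatch command, here is a sketch pulling RDS CPU over the last hour (the instance identifier is illustrative; `date -d` is GNU date syntax):

```bash
# Average RDS CPU utilization over the last hour, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=production-db \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```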
## Monitoring URLs

- Grafana: https://grafana.example.com
- Datadog: https://app.datadoghq.com
- PagerDuty: https://example.pagerduty.com
- Status Page: https://status.example.com
## Best Practices

### ✅ DO
- Include quick reference section at top
- Provide exact commands to run
- Document expected outputs
- Include verification steps
- Add communication templates
- Define severity levels clearly
- Document escalation paths
- Include useful links and contacts
- Keep runbooks up-to-date
- Test runbooks regularly
- Include screenshots/diagrams
- Document common gotchas
### ❌ DON'T
- Use vague instructions
- Skip verification steps
- Forget to document prerequisites
- Assume knowledge of tools
- Skip communication guidelines
- Forget to update after incidents