# Runbook Creation Overview
Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.
## When to Use

- Incident response procedures
- Standard operating procedures (SOPs)
- On-call playbooks
- System maintenance guides
- Disaster recovery procedures
- Deployment runbooks
- Escalation procedures
- Service restoration guides

## Incident Response Runbook Template
# Incident Response Runbook
## Quick Reference
**Severity Levels:**
- P0 (Critical): Complete outage, data loss, security breach
- P1 (High): Major feature down, significant user impact
- P2 (Medium): Minor feature degradation, limited user impact
- P3 (Low): Cosmetic issues, minimal user impact

**Response Times:**
- P0: Immediate (24/7)
- P1: 15 minutes (business hours), 1 hour (after hours)
- P2: 4 hours (business hours)
- P3: Next business day

**Escalation Contacts:**
- On-call Engineer: PagerDuty rotation
- Engineering Manager: +1-555-0100
- VP Engineering: +1-555-0101
- CTO: +1-555-0102
## Table of Contents
- Service Down
- Database Issues
- High CPU/Memory Usage
- API Performance Degradation
- Security Incidents
- Data Loss Recovery
- Rollback Procedures
## Service Down

### Symptoms
- Health check endpoint returning 500 errors
- Users unable to access application
- Load balancer showing all instances unhealthy
- Alerts: `service_down`, `health_check_failed`

**Severity:** P0 (Critical)
### Initial Response (5 minutes)

1. **Acknowledge the incident**
   - Acknowledge the page in PagerDuty
   - Post in the #incidents Slack channel
2. **Create an incident channel**
   - Create a Slack channel: #incident-YYYY-MM-DD-service-down
   - Post incident details and status updates there (this step can be scripted; see the sketch below)
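If your team automates this step, the Slack Web API can create the channel and post the initial update. A minimal sketch, assuming a bot token in `SLACK_TOKEN` with the `channels:manage` and `chat:write` scopes (the token variable and message text are illustrative):

```bash
#!/usr/bin/env bash
# Sketch: create the incident channel and post the initial update via the Slack Web API.
set -euo pipefail

CHANNEL="incident-$(date +%Y-%m-%d)-service-down"

# Create the channel (Slack channel names must be lowercase)
curl -s -X POST https://slack.com/api/conversations.create \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"${CHANNEL}\"}"

# Post the initial incident details
curl -s -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer ${SLACK_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"channel\": \"#${CHANNEL}\", \"text\": \"🚨 P0: Service Down. Status: investigating.\"}"
```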
3. **Assess impact**

```bash
# Check service status
kubectl get pods -n production

# Check recent deployments
kubectl rollout history deployment/api -n production

# Check logs
kubectl logs -f deployment/api -n production --tail=100
```
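During a P0 it helps to capture this evidence once and share it in the incident channel rather than re-running commands. A minimal sketch (the output path is illustrative):

```bash
#!/usr/bin/env bash
# Sketch: capture a triage snapshot to a file for sharing in the incident channel.
OUT="/tmp/triage-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "=== Pod status ==="
  kubectl get pods -n production
  echo "=== Recent deployments ==="
  kubectl rollout history deployment/api -n production
  echo "=== Last 100 log lines ==="
  kubectl logs deployment/api -n production --tail=100
} > "$OUT" 2>&1
echo "Triage snapshot written to $OUT"
```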
### Investigation Steps

#### Check Application Health
1. Check pod status

```bash
kubectl get pods -n production -l app=api
```

Expected output: all pods `Running`

```
NAME                    READY   STATUS    RESTARTS   AGE
api-7d8c9f5b6d-4xk2p    1/1     Running   0          2h
api-7d8c9f5b6d-7nm8r    1/1     Running   0          2h
```
2. Check pod logs for errors

```bash
kubectl logs -f deployment/api -n production --tail=100 | grep -i error
```
3. Check application endpoints

```bash
curl -v https://api.example.com/health
curl -v https://api.example.com/api/v1/status
```
4. Check database connectivity

```bash
kubectl exec -it deployment/api -n production -- sh
psql $DATABASE_URL -c "SELECT 1"
```
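For a single non-interactive check that can be scripted, the two steps above collapse into one `kubectl exec`, assuming the `psql` binary is present in the api image:

```bash
# One-shot connectivity check; assumes psql is installed in the api image
kubectl exec deployment/api -n production -- \
  psql "$DATABASE_URL" -c "SELECT 1"
```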
#### Check Infrastructure
1. Check load balancer target health

```bash
# describe-target-health is an elbv2 (ALB/NLB) command, not classic elb
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table
```
2. Check DNS resolution

```bash
dig api.example.com
nslookup api.example.com
```
3. Check SSL certificates

```bash
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates
```
4. Check network connectivity

```bash
# curl cannot speak the Postgres protocol, so test raw TCP reachability instead
# (assumes nc is available in the image)
kubectl exec -it deployment/api -n production -- \
  nc -zv database.example.com 5432
```
#### Check Database
1. Check database connections

```bash
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity"
```
2. Check for locks

```bash
psql $DATABASE_URL -c "
  SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
"
```
3. Check database size

```bash
psql $DATABASE_URL -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))
"
```
4. Check long-running queries

```bash
psql $DATABASE_URL -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10
"
```
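For queries that were slow in the past rather than right now, `pg_stat_statements` keeps cumulative statistics. A sketch, assuming the extension is installed (the `mean_exec_time` column is named `mean_time` before PostgreSQL 13):

```bash
# Top queries by average execution time; requires the pg_stat_statements extension
psql $DATABASE_URL -c "
  SELECT substring(query, 1, 80) AS query, calls, mean_exec_time
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10
"
```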
### Resolution Steps

#### Option 1: Restart Pods (Quick Fix)

```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```
#### Option 2: Scale Up (If Overloaded)

```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```
#### Option 3: Rollback (If Bad Deploy)

```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to previous version
kubectl rollout undo deployment/api -n production

# Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```
#### Option 4: Database Connection Reset

```bash
# If the database connection pool is exhausted
kubectl exec -it deployment/api -n production -- sh
kill -HUP 1  # Reload the process, resetting its connections
```
```bash
# Or terminate the application's idle connections from the database side
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'api' AND state = 'idle'
"
```
### Verification

1. Check health endpoint

```bash
curl https://api.example.com/health
```

Expected: HTTP 200 with a healthy status response
2. Check API endpoints

```bash
curl https://api.example.com/api/v1/users
```

Expected: valid JSON response
3. Check metrics

Visit https://grafana.example.com and verify:

- Error rate < 1%
- Response time < 500ms
- All pods healthy
4. Check logs for errors

```bash
kubectl logs deployment/api -n production --tail=100 | grep -i error
```

Expected: no new errors
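These checks can be wrapped into one script that polls until the service is healthy, which is handy while a rollout completes. A minimal sketch (the retry count and interval are illustrative):

```bash
#!/usr/bin/env bash
# Sketch: poll the health endpoint until it returns 200, then smoke-check the API.

for i in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
  if [ "$code" = "200" ]; then
    echo "Health check passed after $i attempt(s)"
    break
  fi
  echo "Attempt $i: got HTTP $code, retrying in 10s..."
  sleep 10
done

# Smoke-check a real endpoint and scan recent logs for errors
curl -sf https://api.example.com/api/v1/users > /dev/null && echo "API smoke check passed"
kubectl logs deployment/api -n production --tail=100 | grep -i error || echo "No recent errors in logs"
```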
### Communication

**Initial Update (within 5 minutes):**

```
🚨 INCIDENT: Service Down

Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC

We are investigating reports of users unable to access the application. Our team is working to identify the root cause.

Next update in 15 minutes.
```
**Progress Update (every 15 minutes):**

```
🔍 UPDATE: Service Down

Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes

We have identified the issue and are implementing a fix.
```
**Resolution Update:**

```
✅ RESOLVED: Service Down

Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes

The service is now fully operational. We are monitoring closely and will conduct a post-mortem to prevent future occurrences.
```
### Post-Incident

1. Create a post-mortem document
   - Timeline of events
   - Root cause analysis
   - Action items to prevent recurrence

2. Update monitoring
   - Add alerts for this scenario
   - Improve detection time

3. Update the runbook
   - Document any new findings
   - Add shortcuts for faster resolution

## Database Issues

### High Connection Count
**Symptoms:**
- Database rejecting new connections
- Error: "too many connections"
- Alert: `db_connections_high`
**Quick Fix:**

1. Check connection count

```bash
psql $DATABASE_URL -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name
"
```

2. Kill idle connections

```bash
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes'
"
```
3. Restart connection pools

```bash
kubectl rollout restart deployment/api -n production
```
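To keep one service from exhausting the pool again, a per-role connection cap can help. A sketch, assuming the application connects as a dedicated `api` role (the role name and limit are illustrative; pick a limit comfortably below `max_connections`):

```bash
# Cap the api role at 50 concurrent connections
psql $DATABASE_URL -c "ALTER ROLE api CONNECTION LIMIT 50"
```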
### Slow Queries

**Symptoms:**
- API response times > 5 seconds
- Database CPU at 100%
- Alert: `slow_query_detected`
**Investigation:**

```sql
-- Find slow queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for missing indexes (tables with many sequential scans)
SELECT schemaname, tablename, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill long-running query (if needed)
SELECT pg_terminate_backend(12345);  -- Replace with actual PID
```
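Once a suspect query is identified, `EXPLAIN (ANALYZE, BUFFERS)` shows where the time goes (sequential scans, sorts, and so on). A sketch with a hypothetical `users`/`email` query; prefer running it against a replica, since `ANALYZE` actually executes the statement:

```bash
# Run on a replica if possible: ANALYZE executes the query for real
psql $DATABASE_URL -c "
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM users WHERE email = 'user@example.com'
"
```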
## High CPU/Memory Usage

### Symptoms

- Pods being OOMKilled
- Response times increasing
- Alerts: `high_memory_usage`, `high_cpu_usage`

### Investigation
1. Check pod resources

```bash
kubectl top pods -n production
```

2. Check resource limits

```bash
kubectl describe pod <pod-name> -n production
```
3. Check for memory leaks

```bash
kubectl logs deployment/api -n production | grep -i "out of memory"
```
4. Profile the application (if needed)

```bash
kubectl exec -it <pod-name> -n production -- sh
# Then run a profiler appropriate to the runtime: node --inspect, py-spy, etc.
```
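How you attach a profiler depends on the runtime. Two hedged examples, assuming the relevant tooling is available in the image (targeting PID 1 assumes the app is the container's main process):

```bash
# Node.js: SIGUSR1 tells the process to open its debug inspector (port 9229)
kubectl exec deployment/api -n production -- kill -USR1 1

# Python: dump thread stacks with py-spy (must be installed in the image;
# may require the SYS_PTRACE capability on the container)
kubectl exec deployment/api -n production -- py-spy dump --pid 1
```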
### Resolution

#### Option 1: Increase resources

```bash
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi
```
#### Option 2: Scale horizontally

```bash
kubectl scale deployment/api -n production --replicas=6
```
#### Option 3: Restart problematic pods

```bash
kubectl delete pod <pod-name> -n production
```
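To decide which pod to restart, sort by resource usage first. A sketch (`kubectl top` requires metrics-server in the cluster):

```bash
# Show the heaviest memory consumers, then delete the worst offender by name
kubectl top pods -n production -l app=api --sort-by=memory | head -n 5
```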
## Rollback Procedures

### Application Rollback
1. List deployment history

```bash
kubectl rollout history deployment/api -n production
```

2. Check a specific revision

```bash
kubectl rollout history deployment/api -n production --revision=5
```

3. Rollback to previous

```bash
kubectl rollout undo deployment/api -n production
```

4. Rollback to a specific revision

```bash
kubectl rollout undo deployment/api -n production --to-revision=5
```

5. Verify rollback

```bash
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```
### Database Rollback
1. Check migration status

```bash
npm run db:migrate:status
```

2. Rollback the last migration

```bash
npm run db:migrate:undo
```

3. Rollback to a specific migration

```bash
# npm requires -- before arguments meant for the script itself
npm run db:migrate:undo -- --to 20250115120000-migration-name
```

4. Verify database state

```bash
psql $DATABASE_URL -c "\dt"
```
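Schema rollbacks can destroy data (dropped columns, reverted transforms), so it is worth snapshotting first. A minimal sketch using `pg_dump` (the dump path is illustrative; a managed-database snapshot works just as well):

```bash
# Take a logical backup before undoing migrations
pg_dump "$DATABASE_URL" > "/tmp/pre-rollback-$(date +%Y%m%d-%H%M%S).sql"
```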
## Escalation Path

**Level 1 - On-call Engineer (You)**
- Initial response and investigation
- Attempt standard fixes from this runbook

**Level 2 - Senior Engineers**
- Escalate if not resolved in 30 minutes
- Escalate if the issue is complex or unclear
- Contact via PagerDuty or Slack

**Level 3 - Engineering Manager**
- Escalate if not resolved in 1 hour
- Escalate if cross-team coordination is needed

**Level 4 - VP Engineering / CTO**
- Escalate for P0 incidents lasting > 2 hours
- Escalate for security breaches
- Escalate for data loss

## Useful Commands
### Kubernetes

```bash
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
```
### Database

```bash
psql $DATABASE_URL -c "SELECT version()"
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity"
```
### AWS

```bash
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...
```
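As one concrete form of the truncated CloudWatch command, here is a sketch pulling RDS CPU over the last hour (the instance identifier is illustrative; `date -d` is GNU date syntax):

```bash
# Average RDS CPU utilization over the last hour, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=production-db \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```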
## Monitoring URLs

- Grafana: https://grafana.example.com
- Datadog: https://app.datadoghq.com
- PagerDuty: https://example.pagerduty.com
- Status Page: https://status.example.com
## Best Practices

### ✅ DO
- Include quick reference section at top
- Provide exact commands to run
- Document expected outputs
- Include verification steps
- Add communication templates
- Define severity levels clearly
- Document escalation paths
- Include useful links and contacts
- Keep runbooks up-to-date
- Test runbooks regularly
- Include screenshots/diagrams
- Document common gotchas
### ❌ DON'T
- Use vague instructions
- Skip verification steps
- Forget to document prerequisites
- Assume knowledge of tools
- Skip communication guidelines
- Forget to update after incidents