Runbook Generator Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks. Runbook Structure runbook_template : metadata : title : "Runbook title" version : "1.0" last_updated : "2024-01-15" owner : "Team/Person" reviewers : [ "Name 1" , "Name 2" ] overview : purpose : "What this runbook accomplishes" scope : "Systems/services affected" audience : "Who should use this" prerequisites : access : - "AWS Console access" - "SSH key for production servers" - "Database credentials" tools : - "kubectl configured" - "AWS CLI installed" - "jq for JSON parsing" knowledge : - "Basic Kubernetes concepts" - "Understanding of service architecture" execution : estimated_time : "15-30 minutes" risk_level : "Medium" requires_change_ticket : true requires_approval : true can_be_automated : true steps : [ ]
Detailed steps below
verification : [ ]
How to confirm success
rollback : [ ]
How to undo changes
troubleshooting : [ ]
Common issues
contacts : primary_oncall : "PagerDuty" escalation : "Engineering Manager" subject_experts : [ "DBA Team" , "Platform Team" ] Standard Runbook Template
[Runbook Title] ** Version: ** 1.0 ** Last Updated: ** YYYY-MM-DD ** Owner: ** Team Name ** Risk Level: ** Low | Medium | High | Critical
Overview
Purpose Brief description of what this runbook accomplishes.
When to Use
Trigger condition 1
Trigger condition 2
Alert: "Alert Name" fires
Scope Systems and services affected: - Service A - Database B - External dependency C
Prerequisites
Required Access
[ ] Production AWS Console
[ ] Kubernetes cluster access
[ ] Database read/write permissions
Required Tools ```bash
Verify kubectl kubectl version --client
Verify AWS CLI aws sts get-caller-identity
Verify database connectivity psql -h $DB_HOST -U $DB_USER -c "SELECT 1" Required Knowledge Kubernetes pod management Service architecture overview Incident response process Pre-Execution Checklist Change ticket created: CHG-XXXXX Approval obtained from: [Name] Backup verified (if applicable) Stakeholders notified Maintenance window scheduled (if applicable) Execution Steps Step 1: [Action Name] Purpose: Why this step is necessary Command: kubectl get pods -n production -l app = myservice Expected Output: NAME READY STATUS RESTARTS AGE myservice-abc123-xyz 1/1 Running 0 2d myservice-def456-uvw 1/1 Running 0 2d Verification: Confirm all pods show STATUS=Running If unexpected: See Troubleshooting section Step 2: [Next Action] Purpose: Description Command:
Command with explanation
kubectl scale deployment myservice --replicas = 3 -n production Expected Output: deployment.apps/myservice scaled Verification:
Verify new replicas are running
kubectl get pods -n production -l app = myservice -w Wait for: All 3 pods to show Running status (typically 2-5 minutes) Post-Execution Verification Verify Service Health
Check deployment status
kubectl rollout status deployment/myservice -n production
Check service endpoints
kubectl get endpoints myservice -n production
Verify application health
curl -s https://api.example.com/health | jq . Expected: { "status" : "healthy" , "version" : "1.2.3" , "uptime" : "2h30m" } Verify Metrics Error rate returned to normal (<0.1%) Latency within SLA (<200ms p99) No new alerts firing Rollback Procedure When to Rollback Error rate exceeds 1% Latency exceeds 500ms p99 Critical functionality broken Rollback Steps
Rollback to previous deployment
kubectl rollout undo deployment/myservice -n production
Verify rollback
kubectl rollout status deployment/myservice -n production
Confirm previous version
kubectl get deployment myservice
-n
production
-o
jsonpath
=
'{.spec.template.spec.containers[0].image}'
Troubleshooting
Symptom
Likely Cause
Resolution
Pods stuck in Pending
Resource constraints
Check node capacity:
kubectl describe nodes
CrashLoopBackOff
Application error
Check logs:
kubectl logs -f
Check pod events
kubectl describe pod < pod-name
-n production
Check resource limits
kubectl top pods -n production Issue: Database connection failures
Verify database connectivity
kubectl exec -it < pod
-n production -- psql -h $DB_HOST -c "SELECT 1"
Check connection pool
kubectl logs < pod
-n production | grep -i "connection" Emergency Contacts Role Contact When to Engage On-call Engineer PagerDuty Any issue Database Team
dba-oncall
Database issues Platform Team
platform-oncall
Infrastructure issues Engineering Manager [Name] Escalation Change Log Version Date Author Changes 1.0 2024-01-15 Author Initial version Related Documentation Service Architecture Incident Response Process Monitoring Dashboard
Runbook Types
Incident Response Runbook
```yaml incident_runbook: sections: detection: alert_name: "High Error Rate - Payment Service" threshold: "Error rate > 5% for 5 minutes" severity: "P1" immediate_actions: - step: "Acknowledge alert" command: "In PagerDuty, acknowledge incident" time: "< 5 min" - step: "Assess impact" command: |
Check error rate
curl -s "https://metrics.example.com/api/v1/query?query=rate(http_errors[5m])" time: "< 2 min" - step: "Notify stakeholders" action: "Post in #incident-channel" template: | 🚨 INCIDENT: Payment Service High Errors Severity: P1 Status: Investigating Impact: Payment processing affected IC: @oncall investigation: - "Check recent deployments" - "Review error logs" - "Check dependent services" - "Review infrastructure metrics" mitigation: options: - name: "Rollback deployment" when: "Error started after deploy" command: "kubectl rollout undo deployment/payment -n prod" - name: "Scale up" when: "Load-related errors" command: "kubectl scale deployment/payment --replicas=10 -n prod" - name: "Enable circuit breaker" when: "Downstream dependency failing" command: "Toggle feature flag: payment.circuit_breaker=true" resolution: checklist: - "[ ] Error rate < 0.1%" - "[ ] No P1 alerts" - "[ ] Stakeholders notified" - "[ ] Incident documented" Deployment Runbook deployment_runbook : pre_deployment : checklist : - "[ ] Code review approved" - "[ ] CI/CD pipeline passed" - "[ ] Staging tested" - "[ ] Change ticket approved" - "[ ] Rollback plan documented" verification : - step : "Verify staging health" command : | curl -s https://staging.example.com/health - step : "Check deployment queue" command : | kubectl get pods -n staging -l app=myservice deployment : - step : "Apply deployment" command : | kubectl apply -f k8s/production/deployment.yaml - step : "Monitor rollout" command : | kubectl rollout status deployment/myservice -n production --timeout=10m - step : "Verify new version" command : | kubectl get deployment myservice -n production \ -o jsonpath='{.spec.template.spec.containers[0].image}' post_deployment : - step : "Smoke test" command : | ./scripts/smoke-test.sh production - step : "Monitor metrics" duration : "15 minutes" watch : - "Error rate" - "Latency p99" - "Request rate" - step : "Update ticket" action : "Mark CHG ticket as completed" Maintenance Runbook maintenance_runbook : log_rotation : schedule : "Weekly, Sunday 02:00 UTC" steps : - step : "Connect to server" command : | ssh admin@logs.example.com - step : "Rotate logs" command : | sudo logrotate -f /etc/logrotate.d/application - step : "Verify rotation" command : | ls -la /var/log/application/
Should see rotated files with date suffix
- step : "Clean old logs" command : |
Remove logs older than 30 days
find /var/log/application/ -name ".log." -mtime +30 -delete
step : "Verify disk space" command : | df -h /var/log
Should show > 20% free
database_maintenance
:
schedule
:
"Monthly, first Sunday 03:00 UTC"
steps
:
-
step
:
"Check table sizes"
command
:
|
psql -c "
SELECT tablename,
pg_size_pretty(pg_total_relation_size(tablename::text))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(tablename::text) DESC
LIMIT 10;
"
-
step
:
"Run VACUUM ANALYZE"
command
:
|
psql -c "VACUUM ANALYZE;"
-
step
:
"Reindex if needed"
command
:
|
psql -c "REINDEX DATABASE mydb;"
Writing Guidelines
principles
:
clarity
:
-
"Use active voice"
-
"Be explicit, never assume"
-
"One action per step"
completeness
:
-
"Include all commands"
-
"Show expected output"
-
"Document verification"
safety
:
-
"Test in non-prod first"
-
"Include rollback steps"
-
"Document risks"
formatting
:
commands
:
-
"Use code blocks with language"
-
"Include full paths"
-
"Add comments for complex commands"
steps
:
-
"Number sequentially"
-
"Include purpose"
-
"Show expected result"
-
"Note time estimate"
variables
:
format
:
"$VARIABLE_NAME or