Runbook Generator

Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks.

Runbook Structure

```yaml
runbook_template:
  metadata:
    title: "Runbook title"
    version: "1.0"
    last_updated: "2024-01-15"
    owner: "Team/Person"
    reviewers: ["Name 1", "Name 2"]

  overview:
    purpose: "What this runbook accomplishes"
    scope: "Systems/services affected"
    audience: "Who should use this"

  prerequisites:
    access:
      - "AWS Console access"
      - "SSH key for production servers"
      - "Database credentials"
    tools:
      - "kubectl configured"
      - "AWS CLI installed"
      - "jq for JSON parsing"
    knowledge:
      - "Basic Kubernetes concepts"
      - "Understanding of service architecture"

  execution:
    estimated_time: "15-30 minutes"
    risk_level: "Medium"
    requires_change_ticket: true
    requires_approval: true
    can_be_automated: true

  steps: []            # Detailed steps below
  verification: []     # How to confirm success
  rollback: []         # How to undo changes
  troubleshooting: []  # Common issues

  contacts:
    primary_oncall: "PagerDuty"
    escalation: "Engineering Manager"
    subject_experts: ["DBA Team", "Platform Team"]
```

Standard Runbook Template

[Runbook Title]

**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** Team Name
**Risk Level:** Low | Medium | High | Critical

Overview

Purpose

Brief description of what this runbook accomplishes.

When to Use

- Trigger condition 1
- Trigger condition 2
- Alert: "Alert Name" fires

Scope

Systems and services affected:

- Service A
- Database B
- External dependency C

Prerequisites

Required Access

- [ ] Production AWS Console
- [ ] Kubernetes cluster access
- [ ] Database read/write permissions
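The access items above can often be spot-checked from a shell before execution starts. The following is a minimal sketch rather than part of the template: `run_preflight` is a hypothetical helper that takes `label::command` pairs, runs each command, and reports which checks passed or failed. Which commands actually prove access in a given environment is an assumption left to the runbook author.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not from the original template).
# Each argument is "label::command". The command is run quietly and the
# label is printed with PASS or FAIL. Returns non-zero if any check fails.
run_preflight() {
  local failed=0 check label cmd
  for check in "$@"; do
    label="${check%%::*}"   # text before the first "::"
    cmd="${check#*::}"      # text after the first "::"
    if bash -c "$cmd" >/dev/null 2>&1; then
      printf 'PASS  %s\n' "$label"
    else
      printf 'FAIL  %s\n' "$label"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `run_preflight 'Kubernetes access::kubectl auth can-i get pods -n production' 'AWS identity::aws sts get-caller-identity'` would cover two of the checklist items, assuming `production` is the namespace in question.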

Required Tools

```bash
# Verify kubectl
kubectl version --client

# Verify AWS CLI
aws sts get-caller-identity

# Verify database connectivity
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT 1"
```

Required Knowledge

- Kubernetes pod management
- Service architecture overview
- Incident response process

Pre-Execution Checklist

- Change ticket created: CHG-XXXXX
- Approval obtained from: [Name]
- Backup verified (if applicable)
- Stakeholders notified
- Maintenance window scheduled (if applicable)

Execution Steps

Step 1: [Action Name]

Purpose: Why this step is necessary

Command:

```bash
kubectl get pods -n production -l app=myservice
```

Expected Output:

```
NAME                   READY   STATUS    RESTARTS   AGE
myservice-abc123-xyz   1/1     Running   0          2d
myservice-def456-uvw   1/1     Running   0          2d
```

Verification: Confirm all pods show STATUS=Running

If unexpected: See Troubleshooting section

Step 2: [Next Action]

Purpose: Description

Command:

```bash
# Command with explanation
kubectl scale deployment myservice --replicas=3 -n production
```

Expected Output:

```
deployment.apps/myservice scaled
```

Verification:

```bash
# Verify new replicas are running
kubectl get pods -n production -l app=myservice -w
```

Wait for: All 3 pods to show Running status (typically 2-5 minutes)

Post-Execution Verification

Verify Service Health

```bash
# Check deployment status
kubectl rollout status deployment/myservice -n production

# Check service endpoints
kubectl get endpoints myservice -n production

# Verify application health
curl -s https://api.example.com/health | jq .
```

Expected:

```json
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime": "2h30m"
}
```

Verify Metrics

- Error rate returned to normal (<0.1%)
- Latency within SLA (<200ms p99)
- No new alerts firing

Rollback Procedure

When to Rollback

- Error rate exceeds 1%
- Latency exceeds 500ms p99
- Critical functionality broken

Rollback Steps

```bash
# Rollback to previous deployment
kubectl rollout undo deployment/myservice -n production

# Verify rollback
kubectl rollout status deployment/myservice -n production

# Confirm previous version
kubectl get deployment myservice -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
```

Troubleshooting

| Symptom | Likely Cause | Resolution |
|---------|--------------|------------|
| Pods stuck in Pending | Resource constraints | Check node capacity: `kubectl describe nodes` |
| CrashLoopBackOff | Application error | Check logs: `kubectl logs -f <pod>` |
| ImagePullBackOff | Registry auth issue | Verify secret: `kubectl get secret regcred` |
| Connection refused | Service not ready | Wait for readiness probe, check endpoints |

Common Issues

Issue: Deployment times out

```bash
# Check pod events
kubectl describe pod <pod-name> -n production

# Check resource limits
kubectl top pods -n production
```

Issue: Database connection failures

```bash
# Verify database connectivity
kubectl exec -it <pod> -n production -- psql -h "$DB_HOST" -c "SELECT 1"

# Check connection pool
kubectl logs <pod> -n production | grep -i "connection"
```

Emergency Contacts

| Role | Contact | When to Engage |
|------|---------|----------------|
| On-call Engineer | PagerDuty | Any issue |
| Database Team | dba-oncall | Database issues |
| Platform Team | platform-oncall | Infrastructure issues |
| Engineering Manager | [Name] | Escalation |

Change Log

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2024-01-15 | Author | Initial version |

Related Documentation

- Service Architecture
- Incident Response Process
- Monitoring Dashboard
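The numeric "When to Rollback" criteria in the template above (error rate over 1%, p99 latency over 500ms) can be encoded as a small check so the decision does not depend on reading a dashboard under pressure. This is a sketch under stated assumptions: `should_rollback` is a hypothetical helper, and the caller is assumed to have already fetched the two values from whatever metrics system is in use. The third criterion, broken critical functionality, still needs human judgment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original template).
# Arguments: error rate in percent, p99 latency in milliseconds.
# Prints "rollback" and returns 0 when either threshold is exceeded,
# otherwise prints "hold" and returns 1.
should_rollback() {
  local error_rate_pct="$1" latency_p99_ms="$2"
  # awk does the floating-point comparison, which [ ] cannot do.
  if awk -v e="$error_rate_pct" -v l="$latency_p99_ms" \
       'BEGIN { exit !(e > 1 || l > 500) }'; then
    echo "rollback"
    return 0
  fi
  echo "hold"
  return 1
}
```

A caller might write `should_rollback 1.5 120 && kubectl rollout undo deployment/myservice -n production`, which only triggers the undo when a threshold is crossed.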

Runbook Types

Incident Response Runbook

```yaml
incident_runbook:
  sections:
    detection:
      alert_name: "High Error Rate - Payment Service"
      threshold: "Error rate > 5% for 5 minutes"
      severity: "P1"

    immediate_actions:
      - step: "Acknowledge alert"
        command: "In PagerDuty, acknowledge incident"
        time: "< 5 min"

      - step: "Assess impact"
        command: |
          # Check error rate
          # --globoff stops curl from treating [5m] as a URL range
          curl -s --globoff "https://metrics.example.com/api/v1/query?query=rate(http_errors[5m])"
        time: "< 2 min"

      - step: "Notify stakeholders"
        action: "Post in #incident-channel"
        template: |
          🚨 INCIDENT: Payment Service High Errors
          Severity: P1
          Status: Investigating
          Impact: Payment processing affected
          IC: @oncall

    investigation:
      - "Check recent deployments"
      - "Review error logs"
      - "Check dependent services"
      - "Review infrastructure metrics"

    mitigation:
      options:
        - name: "Rollback deployment"
          when: "Error started after deploy"
          command: "kubectl rollout undo deployment/payment -n prod"

        - name: "Scale up"
          when: "Load-related errors"
          command: "kubectl scale deployment/payment --replicas=10 -n prod"

        - name: "Enable circuit breaker"
          when: "Downstream dependency failing"
          command: "Toggle feature flag: payment.circuit_breaker=true"

    resolution:
      checklist:
        - "[ ] Error rate < 0.1%"
        - "[ ] No P1 alerts"
        - "[ ] Stakeholders notified"
        - "[ ] Incident documented"
```

Deployment Runbook

```yaml
deployment_runbook:
  pre_deployment:
    checklist:
      - "[ ] Code review approved"
      - "[ ] CI/CD pipeline passed"
      - "[ ] Staging tested"
      - "[ ] Change ticket approved"
      - "[ ] Rollback plan documented"

    verification:
      - step: "Verify staging health"
        command: |
          curl -s https://staging.example.com/health

      - step: "Check deployment queue"
        command: |
          kubectl get pods -n staging -l app=myservice

  deployment:
    - step: "Apply deployment"
      command: |
        kubectl apply -f k8s/production/deployment.yaml

    - step: "Monitor rollout"
      command: |
        kubectl rollout status deployment/myservice -n production --timeout=10m

    - step: "Verify new version"
      command: |
        kubectl get deployment myservice -n production \
          -o jsonpath='{.spec.template.spec.containers[0].image}'

  post_deployment:
    - step: "Smoke test"
      command: |
        ./scripts/smoke-test.sh production

    - step: "Monitor metrics"
      duration: "15 minutes"
      watch:
        - "Error rate"
        - "Latency p99"
        - "Request rate"

    - step: "Update ticket"
      action: "Mark CHG ticket as completed"
```

Maintenance Runbook
```yaml
maintenance_runbook:
  log_rotation:
    schedule: "Weekly, Sunday 02:00 UTC"
    steps:
      - step: "Connect to server"
        command: |
          ssh admin@logs.example.com

      - step: "Rotate logs"
        command: |
          sudo logrotate -f /etc/logrotate.d/application

      - step: "Verify rotation"
        command: |
          ls -la /var/log/application/
          # Should see rotated files with date suffix

      - step: "Clean old logs"
        command: |
          # Remove logs older than 30 days
          find /var/log/application/ -name "*.log.*" -mtime +30 -delete

      - step: "Verify disk space"
        command: |
          df -h /var/log
          # Should show > 20% free

  database_maintenance:
    schedule: "Monthly, first Sunday 03:00 UTC"
    steps:
      - step: "Check table sizes"
        command: |
          psql -c "
            SELECT tablename,
                   pg_size_pretty(pg_total_relation_size(tablename::text))
            FROM pg_tables
            WHERE schemaname = 'public'
            ORDER BY pg_total_relation_size(tablename::text) DESC
            LIMIT 10;
          "

      - step: "Run VACUUM ANALYZE"
        command: |
          psql -c "VACUUM ANALYZE;"

      - step: "Reindex if needed"
        command: |
          psql -c "REINDEX DATABASE mydb;"
```

Writing Guidelines

```yaml
principles:
  clarity:
    - "Use active voice"
    - "Be explicit, never assume"
    - "One action per step"

  completeness:
    - "Include all commands"
    - "Show expected output"
    - "Document verification"

  safety:
    - "Test in non-prod first"
    - "Include rollback steps"
    - "Document risks"

formatting:
  commands:
    - "Use code blocks with language"
    - "Include full paths"
    - "Add comments for complex commands"

  steps:
    - "Number sequentially"
    - "Include purpose"
    - "Show expected result"
    - "Note time estimate"

  variables:
    format: "$VARIABLE_NAME or "
    document: "List all variables at start"
```

Quality Checklist

```yaml
validation:
  structure:
    - "[ ] Clear title and metadata"
    - "[ ] Prerequisites listed"
    - "[ ] Steps numbered and clear"
    - "[ ] Expected outputs included"
    - "[ ] Verification steps present"
    - "[ ] Rollback documented"
    - "[ ] Troubleshooting section"
    - "[ ] Contacts listed"

  testing:
    - "[ ] All commands tested"
    - "[ ] Outputs verified"
    - "[ ] Rollback tested"
    - "[ ] Time estimates accurate"

  maintenance:
    - "[ ] Version number updated"
    - "[ ] Change log maintained"
    - "[ ] Quarterly review scheduled"
    - "[ ] Owner assigned"
```

Best Practices

- Test everything: every command must be verified before it goes into a runbook
- Show expected output: the user should know what they will see
- Include rollback: always provide a rollback plan
- Keep updated: review every quarter
- Version control: keep runbooks in git
- Automate when possible: automate repetitive procedures
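Since runbooks are meant to live in git, the "structure" items of the Quality Checklist can be partly enforced by a script, for instance in a pre-commit hook or CI job. The sketch below is one possible approach, not a prescribed tool: `lint_runbook` is a hypothetical function, and the section names it looks for are assumptions drawn from the template in this document. It only does a case-insensitive substring search, so it confirms that a section is mentioned, not that it is complete.

```shell
#!/usr/bin/env bash
# Hypothetical linter (not from the original document).
# Prints one "missing: <name>" line for each required section name that
# does not appear anywhere in the given file. Returns non-zero if any
# section is missing.
lint_runbook() {
  local file="$1" missing=0 section
  # Assumed section names; adjust to match your own template.
  for section in "Prerequisites" "Verification" "Rollback" "Troubleshooting" "Contacts"; do
    if ! grep -qiF -- "$section" "$file"; then
      printf 'missing: %s\n' "$section"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `lint_runbook path/to/runbook.md` from a hook makes a missing rollback or troubleshooting section visible at review time instead of during an incident.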
