Backup/Restore Runbook Generator
Create reliable disaster recovery procedures for your databases.
Backup Strategy
Database Backup Strategy
Backup Types
1. Full Backup (Daily)
- When: 2:00 AM UTC
- Retention: 30 days
- Storage: S3
s3://backups/full/ - Size: ~50 GB
- Duration: ~45 minutes
2. Incremental Backup (Hourly)
- When: Every hour
- Retention: 7 days
- Storage: S3
s3://backups/incremental/ - Size: ~500 MB
- Duration: ~5 minutes
3. Transaction Log Backup (Every 15 min)
- When: Every 15 minutes
- Retention: 3 days
- Storage: S3
s3://backups/wal/ - Point-in-time recovery capability
Backup Automation
PostgreSQL
#!/bin/bash
# scripts/backup-postgres.sh
set -e
# Configuration
DB_NAME="production"
DB_USER="postgres"
DB_HOST="postgres.example.com"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://my-backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"
# Create backup directory
mkdir -p $BACKUP_DIR
echo "🔄 Starting backup: $FILENAME"
# Full backup with pg_dump
pg_dump \
--host=$DB_HOST \
--username=$DB_USER \
--dbname=$DB_NAME \
--format=custom \
--compress=9 \
--file=$BACKUP_DIR/$FILENAME \
--verbose
# Verify backup
if [ -f "$BACKUP_DIR/$FILENAME" ]; then
SIZE=$(du -h "$BACKUP_DIR/$FILENAME" | cut -f1)
echo "✅ Backup created: $SIZE"
else
echo "❌ Backup failed"
exit 1
fi
# Upload to S3
echo "📤 Uploading to S3..."
aws s3 cp $BACKUP_DIR/$FILENAME $S3_BUCKET/ \
--storage-class STANDARD_IA
# Verify upload
if aws s3 ls $S3_BUCKET/$FILENAME; then
echo "✅ Uploaded to S3"
else
echo "❌ S3 upload failed"
exit 1
fi
# Cleanup old local backups (keep last 7 days)
find $BACKUP_DIR -type f -name "*.sql.gz" -mtime +7 -delete
echo "🗑️ Cleaned up old local backups"
# Send notification
curl -X POST $SLACK_WEBHOOK \
-H 'Content-Type: application/json' \
-d "{\"text\": \"✅ Database backup complete: $FILENAME ($SIZE)\"}"
echo "✅ Backup complete!"
MySQL
!/bin/bash
scripts/backup-mysql.sh
set -e
DB_NAME="production" DB_USER="root" DB_PASSWORD=$MYSQL_PASSWORD DATE=$(date +%Y%m%d_%H%M%S) FILENAME="${DB_NAME}_${DATE}.sql.gz"
echo "🔄 Starting MySQL backup..."
Backup with mysqldump
mysqldump \ --user=$DB_USER \ --password=$DB_PASSWORD \ --single-transaction \ --quick \ --lock-tables=false \ --databases $DB_NAME \ | gzip > /var/backups/mysql/$FILENAME
Upload to S3
aws s3 cp /var/backups/mysql/$FILENAME s3://my-backups/mysql/
echo "✅ Backup complete!"
Restore Procedures Full Restore
!/bin/bash
scripts/restore-postgres.sh
set -e
BACKUP_FILE=$1 RESTORE_DB="production_restored"
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: ./restore-postgres.sh
echo "🔄 Starting restore from: $BACKUP_FILE"
1. Download from S3
echo "📥 Downloading backup..." aws s3 cp s3://my-backups/postgres/$BACKUP_FILE /tmp/
2. Create new database
echo "🗄️ Creating database..." psql -h $DB_HOST -U postgres -c "CREATE DATABASE $RESTORE_DB;"
3. Restore backup
echo "🔄 Restoring data..." pg_restore \ --host=$DB_HOST \ --username=postgres \ --dbname=$RESTORE_DB \ --verbose \ /tmp/$BACKUP_FILE
4. Verify restore
echo "✅ Verifying restore..." TABLE_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';") echo " Tables restored: $TABLE_COUNT"
ROW_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM users;") echo " User rows: $ROW_COUNT"
echo "✅ Restore complete!" echo " Database: $RESTORE_DB" echo " To use: UPDATE application config to point to $RESTORE_DB"
Point-in-Time Recovery (PITR)
!/bin/bash
scripts/pitr-restore.sh
TARGET_TIME=$1 # Format: 2024-01-15 14:30:00
echo "🔄 Point-in-Time Restore to: $TARGET_TIME"
1. Restore base backup
echo "📦 Restoring base backup..." pg_basebackup -D /var/lib/postgresql/data -X stream
2. Configure recovery
cat > /var/lib/postgresql/data/recovery.conf << EOF restore_command = 'aws s3 cp s3://my-backups/wal/%f %p' recovery_target_time = '$TARGET_TIME' recovery_target_action = 'promote' EOF
3. Start PostgreSQL
echo "🚀 Starting PostgreSQL in recovery mode..." systemctl start postgresql
4. Wait for recovery
while ! pg_isready; do echo " Waiting for recovery..." sleep 5 done
echo "✅ PITR complete!"
Validation Checks
!/bin/bash
scripts/validate-restore.sh
DB=$1
echo "🔍 Validating restore..."
1. Check table count
TABLES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';") echo "Tables: $TABLES"
if [ "$TABLES" -lt 10 ]; then echo "❌ Too few tables restored" exit 1 fi
2. Check row counts
for table in users products orders; do ROWS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM $table;") echo " $table: $ROWS rows"
if [ "$ROWS" -lt 1 ]; then echo "❌ Table $table is empty" exit 1 fi done
3. Check constraints
CONSTRAINTS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type='FOREIGN KEY';") echo "Foreign keys: $CONSTRAINTS"
4. Check indexes
INDEXES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM pg_indexes WHERE schemaname='public';") echo "Indexes: $INDEXES"
5. Test query performance
START=$(date +%s%N) psql -d $DB -c "SELECT COUNT(*) FROM users WHERE email LIKE '%@example.com%';" > /dev/null END=$(date +%s%N) DURATION=$(( (END - START) / 1000000 )) echo "Query performance: ${DURATION}ms"
if [ "$DURATION" -gt 1000 ]; then echo "⚠️ Slow query - missing indexes?" fi
echo "✅ Validation complete!"
Disaster Recovery Runbook
Disaster Recovery Runbook
Incident Response
1. Assess Situation (5 minutes)
- [ ] Identify incident severity (P0/P1/P2)
- [ ] Determine data loss window
- [ ] Notify stakeholders
Contacts:
- DBA On-Call: [phone]
- Engineering Lead: [phone]
- CTO: [phone]
2. Stop the Bleeding (10 minutes)
- [ ] Enable maintenance mode
- [ ] Stop writes to corrupted database
- [ ] Preserve evidence (logs, backups)
# Enable maintenance mode
kubectl scale deployment/api --replicas=0
- Identify Recovery Point (15 minutes) Determine last good backup Check backup integrity Calculate data loss
List available backups
aws s3 ls s3://my-backups/postgres/ | tail -20
Check backup size
aws s3 ls s3://my-backups/postgres/production_20240115_020000.sql.gz --human-readable
- Prepare Recovery Environment (30 minutes) Spin up new database instance Configure networking Test connectivity
Create RDS instance
aws rds create-db-instance \ --db-instance-identifier production-recovery \ --db-instance-class db.r6g.xlarge \ --engine postgres \ --master-username postgres \ --master-user-password [secure-password]
- Execute Restore (1-2 hours) Download backup from S3 Run restore script Apply transaction logs (if PITR) Verify data integrity
Run restore
./scripts/restore-postgres.sh production_20240115_020000.sql.gz
Validate
./scripts/validate-restore.sh production_restored
- Validate and Test (30 minutes) Run validation scripts Test critical queries Verify row counts Check data consistency
- Cutover (15 minutes) Update application config Point DNS to new database Disable maintenance mode Monitor for errors
Update connection string
kubectl set env deployment/api DATABASE_URL=postgresql://...
Scale up
kubectl scale deployment/api --replicas=3
- Post-Recovery (1 hour) Monitor system health Verify user reports Document incident Schedule postmortem Recovery Time Objective (RTO) Scenario Target Actual Full restore 2 hours [measured] PITR restore 3 hours [measured] Region failover 15 minutes [measured] Recovery Point Objective (RPO) Backup Type Data Loss Window Full backup 24 hours Incremental 1 hour Transaction logs 15 minutes
Automated Backup Monitoring
```typescript // scripts/monitor-backups.ts import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';
const s3 = new S3Client({ region: 'us-east-1' });
async function checkBackupHealth() { const bucket = 'my-backups'; const prefix = 'postgres/';
// List recent backups const command = new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix, MaxKeys: 10, });
const response = await s3.send(command); const backups = response.Contents || [];
// Check last backup age const latestBackup = backups[0]; const age = Date.now() - new Date(latestBackup.LastModified!).getTime(); const ageHours = age / (1000 * 60 * 60);
if (ageHours > 25) { console.error('❌ No backup in last 24 hours!'); // Send alert await sendSlackAlert('No recent database backup!'); process.exit(1); }
// Check backup size const size = latestBackup.Size! / (1024 * 1024 * 1024); // GB if (size < 10) { console.error('⚠️ Backup size suspiciously small'); }
console.log('✅ Backup health check passed');
console.log(Latest: ${latestBackup.Key});
console.log(Age: ${ageHours.toFixed(1)} hours);
console.log(Size: ${size.toFixed(2)} GB);
}
checkBackupHealth();
Role Assignments
DR Team Roles
Database Administrator (Primary)
- Execute restore procedures
- Verify data integrity
- Monitor recovery progress
Engineering Lead
- Coordinate response
- Communicate with stakeholders
- Make cutover decisions
DevOps Engineer
- Provision infrastructure
- Update application configs
- Monitor system health
Product Manager
- Assess business impact
- Prioritize recovery
- Customer communication
Escalation Path
- DBA on-call →
- Engineering Lead →
- CTO →
- CEO (P0 incidents only)
Best Practices Test restores regularly: Quarterly DR drills Automate backups: Never rely on manual processes Multiple locations: Cross-region backup storage Monitor backup health: Alert on failures Document procedures: Keep runbook updated Encrypt backups: Protect sensitive data Version control: Track backup script changes Output Checklist Backup automation scripts Restore procedures documented Validation checks defined PITR procedure (if applicable) DR runbook created Role assignments documented RTO/RPO defined Backup monitoring configured