Scalability Playbook
Systematic approach to identifying and resolving scalability bottlenecks.
Bottleneck Analysis
Current System Profile
- Traffic: 1,000 req/min
- Users: 10,000 active
- Data: 100 GB database
- Response time: p95 = 500ms
Identified Bottlenecks
1. Database Queries
   - Symptom: Slow page loads (2-3s)
   - Measurement: Query time p95 = 800ms
   - Impact: HIGH - affects all reads
   - Trigger: When p95 > 500ms
2. Single Server
   - Symptom: High CPU (>80%)
   - Measurement: Load average > 4
   - Impact: MEDIUM - intermittent slowdowns
   - Trigger: When CPU > 70%
3. No Caching
   - Symptom: Repeated DB queries
   - Measurement: Cache hit rate = 0%
   - Impact: MEDIUM - unnecessary load
   - Trigger: When query volume > 10k/min
Scaling Strategies (Ordered)
Level 1: Quick Wins (Days)
1.1 Add Database Indexes
Problem: Slow queries
Solution:
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
Expected Impact: 80% faster queries
Cost: $0
Effort: 1 day
1.2 Enable Query Caching
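Problem: Repeated queries
Solution: Redis cache layer
async function getUser(userId) {
  const key = `user:${userId}`;
  const cached = await redis.get(key); // cache hit: skip the database
  if (cached) return JSON.parse(cached);
  const user = await db.users.findById(userId); // cache miss: read from the database
  await redis.setex(key, 3600, JSON.stringify(user)); // keep for 1 hour
  return user;
}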
Expected Impact: 60% reduction in DB load
Cost: $50/month
Effort: 2 days
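To check that the cache layer is actually absorbing reads, the hit rate listed under the bottlenecks (currently 0%) can be tracked next to the lookup itself. The following is a minimal sketch under stated assumptions: an in-memory `Map` stands in for Redis, and `fetchUser`, `getUserCached`, and `hitRate` are hypothetical names used only for illustration.

```javascript
// In-memory stand-in for Redis (assumption); production code would use a Redis client.
const cache = new Map();
const stats = { hits: 0, misses: 0 };

// Hypothetical database lookup, used only for illustration.
async function fetchUser(userId) {
  return { id: userId, name: `user-${userId}` };
}

// Cache-aside read that also records hits and misses.
async function getUserCached(userId, ttlMs = 3600 * 1000, now = Date.now()) {
  const key = `user:${userId}`;
  const entry = cache.get(key);
  if (entry && entry.expiresAt > now) {
    stats.hits += 1;
    return JSON.parse(entry.value);
  }
  stats.misses += 1;
  const user = await fetchUser(userId);
  cache.set(key, { value: JSON.stringify(user), expiresAt: now + ttlMs });
  return user;
}

// Fraction of lookups served from the cache (0 when nothing was looked up yet).
function hitRate() {
  const total = stats.hits + stats.misses;
  return total === 0 ? 0 : stats.hits / total;
}
```

A hit rate that stays near zero after rollout would suggest that keys are too specific or the TTL is too short for the access pattern.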
Level 2: Horizontal Scaling (Weeks)
2.1 Add Read Replicas
Problem: Read-heavy workload
Solution: Route reads to replicas
Write Load: Primary DB
Read Load: 3x Read Replicas
Expected Impact: 3x read capacity
Cost: $300/month
Effort: 1 week
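Read/write routing usually lives in the data-access layer. The sketch below is one possible shape, assuming hypothetical connection names and a plain round-robin over the three replicas; classifying statements by their leading keyword is a deliberate simplification.

```javascript
// Hypothetical connection identifiers; real code would hold connection pools.
const primary = "primary-db";
const replicas = ["replica-1", "replica-2", "replica-3"];
let nextReplica = 0;

// Treat statements that start with SELECT as reads; everything else is a write.
function isRead(sql) {
  return /^\s*select\b/i.test(sql);
}

// Writes go to the primary; reads rotate round-robin across the replicas.
function routeQuery(sql) {
  if (!isRead(sql)) return primary;
  const target = replicas[nextReplica];
  nextReplica = (nextReplica + 1) % replicas.length;
  return target;
}
```

Replicas can lag behind the primary, so reads that must observe a just-committed write (and locking reads such as `SELECT ... FOR UPDATE`) are typically sent to the primary as well.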
2.2 Load Balancer + Multiple Servers
Problem: Single point of failure
Solution:
ALB
├── Server 1
├── Server 2
└── Server 3
Expected Impact: 3x throughput
Cost: $400/month
Effort: 1 week
Level 3: Architecture Changes (Months)
3.1 CDN for Static Assets
Problem: Slow asset delivery
Solution: CloudFront CDN
Expected Impact: 90% faster asset loads
Cost: $100/month
Effort: 1 week
3.2 Async Processing
Problem: Slow sync operations
Solution: Background job queues
// Before: Sync
await sendEmail(user);
await processPayment(order);
await updateAnalytics(event);
return response; // Waits 5+ seconds
// After: Async
await queue.add("send-email", { userId });
await queue.add("process-payment", { orderId });
await queue.add("update-analytics", { event });
return response; // Returns immediately
Expected Impact: 80% faster responses
Cost: $50/month (SQS)
Effort: 2 weeks
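The enqueue-now, process-later split can be illustrated without any infrastructure. This is a minimal sketch, assuming an in-memory array stands in for a managed queue such as SQS; the `JobQueue` class and its `process` and `drain` methods are hypothetical names, not the API of any particular queue library.

```javascript
// Minimal in-memory job queue; a stand-in for a managed queue (assumption).
class JobQueue {
  constructor() {
    this.jobs = [];
    this.handlers = new Map();
  }

  // Register the worker function for one job type.
  process(type, handler) {
    this.handlers.set(type, handler);
  }

  // Enqueue and return at once; the work itself happens later in drain().
  async add(type, payload) {
    this.jobs.push({ type, payload });
    return this.jobs.length;
  }

  // Worker loop: run queued jobs in order and report how many completed.
  async drain() {
    let done = 0;
    while (this.jobs.length > 0) {
      const job = this.jobs.shift();
      const handler = this.handlers.get(job.type);
      if (!handler) throw new Error(`No handler for ${job.type}`);
      await handler(job.payload);
      done += 1;
    }
    return done;
  }
}
```

One consequence worth weighing: once payment processing leaves the request path, the response can only confirm that the payment was queued, not that it succeeded, so the client needs another way to learn the outcome.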
Level 4: Data Layer Optimization (Months)
4.1 Database Sharding
Problem: Single DB too large
Solution: Shard by user_id
Shard 1: user_id 0-24999
Shard 2: user_id 25000-49999
Shard 3: user_id 50000-74999
Shard 4: user_id 75000-99999
Expected Impact: 4x capacity
Cost: $1,200/month
Effort: 2 months
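With fixed-width ranges like these, the shard for a given user can be computed rather than looked up. A minimal sketch, assuming integer `user_id` values in the 0-99999 range shown; `shardFor` is a hypothetical helper name.

```javascript
// Range-based sharding that mirrors the four 25,000-wide ranges.
const SHARD_SIZE = 25000;
const SHARD_COUNT = 4;

// Return the 1-based shard number for a user_id in 0..99999.
function shardFor(userId) {
  const limit = SHARD_SIZE * SHARD_COUNT;
  if (!Number.isInteger(userId) || userId < 0 || userId >= limit) {
    throw new RangeError(`user_id ${userId} is outside the sharded range`);
  }
  return Math.floor(userId / SHARD_SIZE) + 1;
}
```

For example, `shardFor(24999)` returns 1 and `shardFor(25000)` returns 2. Range sharding keeps neighboring IDs together, which can concentrate load on the newest shard if recent users are the most active; hashing the ID is a common alternative when that becomes a problem.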
4.2 Event-Driven Architecture
Problem: Tight coupling, cascading failures
Solution: Message broker (Kafka)
Service A → Kafka → Service B
              ↘ ↗
           Service C
Expected Impact: Better isolation, resilience
Cost: $500/month
Effort: 3 months
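The isolation benefit comes from producers publishing events without knowing who consumes them. The sketch below illustrates only that idea, assuming a tiny in-memory `Topic` class (a hypothetical name); it is not a Kafka client, and real Kafka consumers pull from the broker rather than being called directly.

```javascript
// Tiny in-memory topic; a stand-in for a broker topic, not a Kafka client.
class Topic {
  constructor() {
    this.subscribers = [];
  }

  // Register a named consumer for this topic.
  subscribe(name, handler) {
    this.subscribers.push({ name, handler });
  }

  // Deliver to every subscriber; one failing consumer does not block the rest.
  publish(event) {
    const failures = [];
    for (const { name, handler } of this.subscribers) {
      try {
        handler(event);
      } catch (err) {
        failures.push({ name, message: err.message });
      }
    }
    return failures;
  }
}
```

In this model Service A publishes once, and a failure in Service B neither reaches Service A nor prevents Service C from receiving the event.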
Scaling Triggers
| Metric           | Current | Warning | Critical | Action                  |
| ---------------- | ------- | ------- | -------- | ----------------------- |
| CPU | 40% | 70% | 85% | Add servers |
| Memory | 50% | 75% | 90% | Upgrade instances |
| DB Connections | 20 | 40 | 50 | Add read replicas |
| Query Time (p95) | 200ms | 500ms | 1000ms | Add indexes |
| Queue Depth | 100 | 1000 | 5000 | Add workers |
| Error Rate | 0.1% | 1% | 5% | Investigate immediately |
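These thresholds can be encoded directly so that alerts and runbooks share one definition. A minimal sketch, assuming that a reading at or above a threshold trips it (the table does not say whether the bounds are inclusive); the metric keys and the `evaluate` function are hypothetical names.

```javascript
// Warning and critical thresholds copied from the Scaling Triggers table.
const triggers = {
  cpuPercent: { warning: 70, critical: 85, action: "Add servers" },
  memoryPercent: { warning: 75, critical: 90, action: "Upgrade instances" },
  dbConnections: { warning: 40, critical: 50, action: "Add read replicas" },
  queryTimeP95Ms: { warning: 500, critical: 1000, action: "Add indexes" },
  queueDepth: { warning: 1000, critical: 5000, action: "Add workers" },
  errorRatePercent: { warning: 1, critical: 5, action: "Investigate immediately" },
};

// Classify one metric reading; inclusive bounds are an assumption.
function evaluate(metric, value) {
  const t = triggers[metric];
  if (!t) throw new Error(`Unknown metric: ${metric}`);
  if (value >= t.critical) return { level: "critical", action: t.action };
  if (value >= t.warning) return { level: "warning", action: t.action };
  return { level: "ok", action: null };
}
```

With the current values from the table, every metric evaluates to `ok`; a CPU reading of 72 would return a warning with the action "Add servers".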
Phased Scaling Plan
Phase 1: Current → 10x (0-3 months)
Target: 10,000 req/min, 100K users
Actions:
- Add database indexes (Week 1)
- Implement Redis caching (Week 2)
- Add 3x read replicas (Week 4)
- Horizontal scale app servers (Week 6)
- CDN for static assets (Week 8)
Cost: $500 → $1,000/month
Phase 2: 10x → 100x (3-12 months)
Target: 100,000 req/min, 1M users
Actions:
- Database sharding (Month 4-6)
- Multi-region deployment (Month 6-8)
- Microservices extraction (Month 8-12)
- Event-driven architecture (Month 10-12)
Cost: $1,000 → $10,000/month
Phase 3: 100x → 1000x (12-24 months)
Target: 1M req/min, 10M users
Actions:
- Global CDN (Month 13)
- Advanced caching (L1/L2) (Month 14-15)
- Custom DB solutions (Month 16-18)
- Edge computing (Month 18-20)
Cost: $10,000 → $100,000/month
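A quick sanity check on the three phases is to normalize cost by traffic. The sketch below uses only the targets and end-of-phase costs stated in the plan, with the starting point taken as 1,000 req/min at $500/month; the `costPer1kReqPerMin` helper is a hypothetical name.

```javascript
// Traffic targets and monthly costs taken from the phased plan.
const phases = [
  { name: "Current", reqPerMin: 1000, costPerMonth: 500 },
  { name: "Phase 1", reqPerMin: 10000, costPerMonth: 1000 },
  { name: "Phase 2", reqPerMin: 100000, costPerMonth: 10000 },
  { name: "Phase 3", reqPerMin: 1000000, costPerMonth: 100000 },
];

// Monthly cost in dollars per 1,000 requests per minute of capacity.
function costPer1kReqPerMin(phase) {
  return phase.costPerMonth / (phase.reqPerMin / 1000);
}

const unitCosts = phases.map((p) => ({ name: p.name, unitCost: costPer1kReqPerMin(p) }));
```

The unit cost drops from $500 to $100 per 1,000 req/min in Phase 1 and then stays at $100, so the plan implicitly assumes that cost grows linearly with traffic after the quick wins.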
Load Testing Plan
Current baseline
hey -n 10000 -c 100 https://api.example.com/users
Target 10x
hey -n 100000 -c 1000 https://api.example.com/users
Measure:
- Requests/sec
- p50, p95, p99 latency
- Error rate
- Resource utilization
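If per-request results are exported from a load test run, the measurements listed here can be computed in a few lines. This is a minimal sketch, assuming each result is an object with hypothetical `latencyMs` and `ok` fields and that percentiles use the nearest-rank method; it does not parse the output of `hey` itself.

```javascript
// Nearest-rank percentile over a list of latency samples (p in 1..100).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize one run: results is [{ latencyMs, ok }], durationSec is wall-clock time.
function summarize(results, durationSec) {
  const latencies = results.map((r) => r.latencyMs);
  const errors = results.filter((r) => !r.ok).length;
  return {
    requestsPerSec: results.length / durationSec,
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99),
    errorRate: errors / results.length,
  };
}
```

Running the same summary on the baseline and the 10x run makes regressions easy to spot, for example a p95 that crosses the 500ms warning threshold only under the heavier load.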
Cost-Benefit Analysis
| Strategy      | Cost/Month | Expected Impact    | ROI | Priority |
| ------------- | ---------- | ------------------ | --- | -------- |
| DB Indexes | $0 | 80% faster queries | ∞ | HIGH |
| Redis Cache | $50 | 60% less DB load | 12x | HIGH |
| Read Replicas | $300 | 3x capacity | 10x | MEDIUM |
| Load Balancer | $400 | 3x throughput | 7x | MEDIUM |
| DB Sharding | $1,200 | 4x capacity | 3x | LOW |
Best Practices
- Measure first: Don't optimize blindly
- Low-hanging fruit: Start with easy wins
- Load test: Validate before production
- Monitor continuously: Set up alerts
- Plan ahead: Scale before hitting limits
- Cost-conscious: ROI-driven decisions
- Incremental: Small, safe changes
Output Checklist
- Current system profile
- Bottlenecks identified and measured
- Scaling strategies ordered by effort
- Triggers defined for each action
- Phased plan (1x → 10x → 100x)
- Cost estimates per phase
- Load testing plan
- Monitoring dashboard
- Rollback procedures