Postmortem Writer

Document incidents for learning and improvement.

Postmortem Template

Postmortem: API Outage - Database Connection Pool Exhausted

Date: 2024-01-15 Authors: Jane Doe (On-call), John Smith (DBA) Status: Complete Severity: P1 (Critical)

Summary

On January 15, 2024, our API experienced a complete outage for 25 minutes (14:32 - 14:57 UTC) affecting 100% of users. The root cause was database connection pool exhaustion triggered by a connection leak introduced in deployment v2.3.4.

Impact:

Duration: 25 minutes
Users affected: ~50,000
Requests failed: ~125,000
Revenue impact: ~$15,000

Timeline (All times UTC)

| Time | Event |

| ----- | ------------------------------------------------ |

| 14:15 | v2.3.4 deployed to production |

| 14:32 | First CloudWatch alarm: HighErrorRate |

| 14:33 | PagerDuty alert sent to on-call (Jane) |

| 14:35 | Jane acknowledges, begins investigation |

| 14:38 | Identified: Database connection pool at 100% |

| 14:40 | Attempted: Kill long-running queries (no effect) |

| 14:43 | Decision: Rollback to v2.3.3 |

| 14:45 | Rollback initiated |

| 14:47 | Rollback complete, connections dropping |

| 14:50 | Error rate returning to normal |

| 14:57 | All systems recovered, incident closed |

| 15:30 | Postmortem meeting scheduled |

Root Cause

A code change in v2.3.4 introduced a connection leak in the user profile endpoint. The new caching layer was not properly releasing database connections after queries completed.

Code diff: ```diff

await prisma.user.findUnique({ where: { id } });
const client = await pool.connect();
const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
// Missing: client.release() ❌ ```

Contributing Factors

Insufficient testing: Load tests didn't catch the leak
Tests only ran for 5 minutes
Not enough concurrent connections to exhaust pool
Missing monitoring: No alerts on connection pool metrics
Had alarms for query latency
No alarms for active connections count
Inadequate code review: Reviewer didn't spot missing release()
PR approved without running locally
No checklist for connection management
Deployment process: No gradual rollout
Deployed to 100% of production immediately
No canary deployment

What Went Well

✅ Fast detection: Alert fired within 3 minutes
✅ Clear runbook: DBA runbook had exact steps to follow
✅ Quick decision: Made rollback decision in 8 minutes
✅ Communication: Status page updated every 5 minutes
✅ Rollback capability: Automated rollback took <2 minutes

What Went Wrong

❌ Code review missed bug: Connection leak not caught
❌ Testing gaps: Load tests insufficient duration
❌ No canary: Deployed to all instances at once
❌ Late detection: 17 minutes between deploy and alert

Action Items

| --------------------------------------------- | ------- | ---------- | -------- | -------------- |

Lessons Learned

Code review checklists work: Need specific items for common issues
Load tests need realistic duration: 5min insufficient for leaks
Gradual rollouts catch issues: 10% canary would have limited impact
Monitoring gaps are dangerous: Add metrics before you need them
Runbooks save time: Clear procedures enabled fast response

[2023-11-20] Database CPU spike (similar connection pool issue)
[2023-08-15] Memory leak in cache layer

Prevention

To prevent similar incidents:

✅ Add connection management to code review checklist
✅ Monitor connection pool utilization
✅ Extend load test duration
✅ Implement canary deployments
✅ Add automated connection leak detection

Appendix

Monitoring Graphs

[Insert graphs of connection pool, error rates, latency during incident]

Communication Log

[Insert status page updates and customer communication]

Code Fix

PR #1235: Fix connection leak in user profile endpoint ```typescript const client = await pool.connect(); try { const user = await client.query('SELECT * FROM users WHERE id = $1', [id]); return user; } finally { client.release(); // ✅ Always release } ```

Postmortem Best Practices

Blameless Postmortem Guidelines

Do ✅

Focus on systems and processes, not people
Use timeline with exact timestamps
Identify contributing factors, not just root cause
Create actionable items with owners and dates
Document what went well (positive reinforcement)
Share widely for organizational learning

Don't ❌

Blame individuals or teams
Hide or minimize the incident
Skip the postmortem (even for small incidents)
Create action items without owners
Forget to follow up on action items
Make it a blame session

Template Sections

Summary (2-3 sentences)
Impact (numbers: users, revenue, duration)
Timeline (chronological events)
Root Cause (technical explanation)
Contributing Factors (broader context)
What Went Well (positive reinforcement)
What Went Wrong (improvement areas)
Action Items (concrete, owned, dated)
Lessons Learned (key takeaways)

Output Checklist Timeline created Root cause identified Contributing factors documented Action items with owners Lessons learned captured Postmortem meeting held Document shared widely Follow-up scheduled ENDFILE

postmortem-writer

安装

Postmortem: API Outage - Database Connection Pool Exhausted

Summary

Timeline (All times UTC)

Root Cause

Contributing Factors

What Went Well

What Went Wrong

Action Items

Lessons Learned

Prevention

Appendix

Monitoring Graphs

Communication Log

Code Fix

Blameless Postmortem Guidelines

Do ✅

Don't ❌

Template Sections

安装

Postmortem: API Outage - Database Connection Pool Exhausted

Summary

Timeline (All times UTC)

Root Cause

Contributing Factors

What Went Well

What Went Wrong

Action Items

Lessons Learned

Related Incidents

Prevention

Appendix

Monitoring Graphs

Communication Log

Code Fix

Blameless Postmortem Guidelines

Do ✅

Don't ❌

Template Sections