Root Cause Analysis

Overview

Root cause analysis (RCA) identifies underlying reasons for failures, enabling permanent solutions rather than temporary fixes.

When to Use

- Production incidents
- Customer-impacting issues
- Repeated problems
- Unexpected failures
- Performance degradation

Instructions

1. The 5 Whys Technique

Example: Website Down

Symptom: Website returned 503 Service Unavailable

Why 1: Why was the website down? Answer: Database connection pool exhausted

Why 2: Why was the connection pool exhausted? Answer: Queries were taking too long, so connections were not released

Why 3: Why were the queries slow? Answer: Missing index on a frequently queried column

Why 4: Why was the index missing? Answer: Performance testing didn't use production-like data volume

Why 5: Why wasn't production-like data used? Answer: Load testing environment doesn't mirror production

Root Cause: Load testing environment under-provisioned

Solution: Update load testing environment with production-like data

Prevention: Establish environment parity requirements
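
The chain above can also be kept as structured data, so it can be pasted into a report or reviewed later. This is a minimal sketch, not a prescribed tool: the `Why` class and the `format_chain` and `deepest_answer` helpers are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Why:
    """One question/answer pair in a 5 Whys chain (hypothetical helper)."""
    question: str
    answer: str


def format_chain(symptom: str, whys: list[Why]) -> str:
    """Render the chain as indented text, one level deeper per question."""
    lines = [f"Symptom: {symptom}"]
    for depth, why in enumerate(whys, start=1):
        indent = "  " * depth
        lines.append(f"{indent}Why {depth}: {why.question}")
        lines.append(f"{indent}Answer: {why.answer}")
    return "\n".join(lines)


def deepest_answer(whys: list[Why]) -> str:
    """The last answer in the chain is the candidate root cause."""
    if not whys:
        raise ValueError("a 5 Whys chain needs at least one question")
    return whys[-1].answer


chain = [
    Why("Why was the website down?",
        "Database connection pool exhausted"),
    Why("Why was the connection pool exhausted?",
        "Queries taking too long, connections not released"),
    Why("Why were the queries slow?",
        "Missing index on frequently queried column"),
    Why("Why was the index missing?",
        "Performance testing didn't use production-like data volume"),
    Why("Why wasn't production-like data used?",
        "Load testing environment doesn't mirror production"),
]

print(format_chain("Website returned 503 Service Unavailable", chain))
print("Candidate root cause:", deepest_answer(chain))
```

The deepest answer is only a candidate: it still has to be checked against the facts before it is recorded as the root cause.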

Systematic RCA Process

Step 1: Gather Facts
- When did the issue occur?
- Who detected it?
- How many users were affected?
- What error messages appeared?
- What system changes were deployed?
- Check logs, metrics, alerts
- Determine impact scope
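
One way to answer "when did the issue occur?" from logs is to bucket requests by minute and look for the first minute where the error rate crosses a threshold. The sketch below assumes log records have already been parsed into `(timestamp, HTTP status)` pairs. The sample records are made up for illustration, and `error_rate_by_minute` and `first_minute_above` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample records: (ISO timestamp, HTTP status code).
SAMPLE = [
    ("2024-01-15T14:29:10", 200),
    ("2024-01-15T14:29:40", 200),
    ("2024-01-15T14:30:05", 503),
    ("2024-01-15T14:30:20", 200),
    ("2024-01-15T14:30:45", 503),
    ("2024-01-15T14:31:15", 503),
]


def error_rate_by_minute(records):
    """Return {minute: share of requests with a 5xx status}, in time order."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for stamp, status in records:
        minute = datetime.fromisoformat(stamp).replace(second=0, microsecond=0)
        totals[minute] += 1
        if status >= 500:
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in sorted(totals)}


def first_minute_above(rates, threshold):
    """Return the first minute whose error rate reaches the threshold, or None."""
    for minute, rate in rates.items():
        if rate >= threshold:
            return minute
    return None


rates = error_rate_by_minute(SAMPLE)
for minute, rate in rates.items():
    print(minute.strftime("%H:%M"), f"{rate:.0%}")
print("Errors first reached 20% at:", first_minute_above(rates, 0.20))
```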

Step 2: Reproduce
- Can we reproduce it consistently?
- What are the exact steps?
- Which environment (prod, staging)?
- Can we isolate it to a component?
- Set up a test case

Step 3: Identify Contributing Factors
- Direct cause
- Indirect/enabling factors
- System vulnerabilities
- Procedural gaps
- Knowledge gaps

Step 4: Determine Root Cause
- Use the 5 Whys technique
- Ask "why did this control fail?"
- Look for systemic issues
- Separate root cause from symptoms

Step 5: Develop Solutions
- Immediate: Fix the symptom
- Short-term: Prevent recurrence
- Long-term: Systemic fix
- Prioritize by impact/effort
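
"Prioritize by impact/effort" can be made concrete with a simple ratio. The sketch below is one possible scoring scheme, not a standard: the 1 to 5 scales, the `Solution` class, the `prioritize` function, and the example scores are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    name: str
    horizon: str  # "immediate", "short-term", or "long-term"
    impact: int   # assumed scale: 1 (low) to 5 (high)
    effort: int   # assumed scale: 1 (low) to 5 (high)

    @property
    def score(self) -> float:
        """Impact delivered per unit of effort."""
        return self.impact / self.effort


def prioritize(solutions):
    """Highest impact-per-effort first; ties go to the higher impact."""
    return sorted(solutions, key=lambda s: (-s.score, -s.impact))


# Example scores are invented; a real team would agree on them together.
candidates = [
    Solution("Query optimization training", "long-term", impact=2, effort=3),
    Solution("Production-like load test data", "long-term", impact=5, effort=4),
    Solution("Add index for slow query", "short-term", impact=4, effort=1),
]

for s in prioritize(candidates):
    print(f"{s.score:.2f}  {s.horizon:<10}  {s.name}")
```

A ratio like this is only a starting point for discussion; an immediate fix that stops customer impact usually goes first regardless of its score.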

Step 6: Implement & Verify
- Implement solutions
- Test in staging
- Deploy carefully
- Verify improvement
- Monitor metrics

Step 7: Document & Share
- Write RCA report
- Document lessons learned
- Share with team
- Update procedures
- Training if needed

RCA Report Template

RCA Report:

Incident: Database connection failure (2024-01-15, 14:30-15:15)

Impact:
- Duration: 45 minutes
- Users affected: 5,000 (10% of user base)
- Revenue lost: ~$2,000
- Severity: P1 (Critical)

Timeline:
- 14:30: Automated monitoring alert: High error rate (20%)
- 14:32: On-call engineer notified
- 14:35: Identified database connection error in logs
- 14:40: Restarted database connection pool
- 14:42: Service recovered, error rate returned to 0.1%
- 14:50: Incident declared resolved
- 15:15: Full recovery verified

Root Cause: A poorly optimized query introduced in release 2.5.0 caused queries to take 10x longer. The connection pool was exhausted because connections were not released quickly enough.

Contributing Factors:
1. No query performance testing pre-deployment
2. Load testing environment doesn't match production volume
3. No alerting on query duration
4. Connection pool timeout set too high

Solutions:

Immediate (Done):
- Rolled back problematic query optimization

Short-term (1 week):
- Add query performance alerts (>1s)
- Add index for slow query
- Set query timeout to 5 seconds

Long-term (1 month):
- Update load testing with production-like data
- Implement performance benchmarks in CI/CD
- Improve monitoring for connection pool health
- Training on query optimization

Prevention:
- Query performance regression tests
- Load testing with production data
- Connection pool metrics monitoring
- Code review of database changes
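
The figures in the sample report can be cross-checked with a little arithmetic before the report is shared. The sketch below uses only the timestamps and impact numbers from the report above. The dictionary keys and the `at` and `minutes_between` helpers are hypothetical names introduced here.

```python
from datetime import datetime

DAY = "2024-01-15"  # incident date from the sample report


def at(hhmm: str) -> datetime:
    """Build a datetime on the incident day from an HH:MM string."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


timeline = {
    "alert": at("14:30"),
    "engineer_notified": at("14:32"),
    "cause_identified": at("14:35"),
    "pool_restarted": at("14:40"),
    "service_recovered": at("14:42"),
    "declared_resolved": at("14:50"),
    "recovery_verified": at("15:15"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


time_to_notify = minutes_between("alert", "engineer_notified")
time_to_recover = minutes_between("alert", "service_recovered")
total_duration = minutes_between("alert", "recovery_verified")

# 5,000 affected users are stated to be 10% of the user base.
implied_user_base = 5_000 * 100 // 10

print(f"Alert to notification: {time_to_notify} min")
print(f"Alert to recovery:     {time_to_recover} min")
print(f"Total duration:        {total_duration} min")
print(f"Implied user base:     {implied_user_base:,}")
```

Here the total of 45 minutes matches the "Duration" line in the report, which is measured up to the point where full recovery was verified, not to the moment the service came back.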

Root Cause Analysis Techniques

Fishbone Diagram:

Main problem: Slow API Response

Branches:

Code:
- Inefficient algorithm
- Missing cache
- Unnecessary queries

Data:
- Large dataset
- Missing index
- Slow database

Infrastructure:
- Low CPU capacity
- Slow network
- Disk I/O bottleneck

Process:
- No monitoring
- No load testing
- Manual deployments

People:
- Lack of knowledge
- Lack of tools
- No peer review
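
The fishbone above maps naturally onto a dictionary of branches, which makes it easy to render as text or to search during a review. This is a sketch under that assumption; `render` and `find_branches` are hypothetical helper names.

```python
fishbone = {
    "Code": ["Inefficient algorithm", "Missing cache", "Unnecessary queries"],
    "Data": ["Large dataset", "Missing index", "Slow database"],
    "Infrastructure": ["Low CPU capacity", "Slow network", "Disk I/O bottleneck"],
    "Process": ["No monitoring", "No load testing", "Manual deployments"],
    "People": ["Lack of knowledge", "Lack of tools", "No peer review"],
}


def render(problem: str, branches: dict[str, list[str]]) -> str:
    """Render the diagram as an indented text outline."""
    lines = [f"Problem: {problem}"]
    for category, causes in branches.items():
        lines.append(f"  {category}")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)


def find_branches(branches: dict[str, list[str]], keyword: str) -> list[str]:
    """Return the categories with a cause mentioning the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        category
        for category, causes in branches.items()
        if any(needle in cause.lower() for cause in causes)
    ]


print(render("Slow API Response", fishbone))
print("Branches mentioning 'missing':", find_branches(fishbone, "missing"))
```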

Systemic vs. Individual Causes:

Individual: "Developer used inefficient code"
- Fix: Training
- Risk: Happens again with a different person

Systemic: "No code review process"
- Fix: Implement mandatory code review
- Benefit: Prevents similar issues

Prefer systemic solutions for prevention.

Follow-Up & Prevention

After RCA:

1. Track Action Items
   - Assign owner
   - Set deadline
   - Follow up in retrospective
2. Prevent Recurrence
   - Automated tests
   - Monitoring/alerts
   - Procedural changes
   - Training
3. Monitor Metrics
   - Track similar incidents
   - Verify fix effectiveness
   - Monitor preventive measures
   - Catch early warnings
4. Share Learnings
   - Document incident
   - Share with team
   - Industry sharing if relevant
   - Update procedures
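
Tracking action items comes down to checking that each one has an owner and a deadline, and flagging those that slip. The sketch below shows one way to do that. The `ActionItem` class, the owner names, and the dates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str      # empty string means nobody has been assigned yet
    deadline: date
    done: bool = False


def overdue(items, today):
    """Open items whose deadline has already passed."""
    return [i for i in items if not i.done and i.deadline < today]


def unowned(items):
    """Items that still need an owner assigned."""
    return [i for i in items if not i.owner.strip()]


# Hypothetical owners and dates for three items from an RCA report.
items = [
    ActionItem("Add query performance alerts", "alice", date(2024, 1, 22), done=True),
    ActionItem("Set query timeout to 5 seconds", "bob", date(2024, 1, 22)),
    ActionItem("Performance benchmarks in CI/CD", "", date(2024, 2, 15)),
]

today = date(2024, 1, 29)  # fixed date so the example is reproducible
print("Overdue:", [i.title for i in overdue(items, today)])
print("Needs owner:", [i.title for i in unowned(items)])
```

A list like this is a natural input for the retrospective follow-up: overdue and unowned items are the ones to raise first.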

Checklist:

[ ] Incident details documented
[ ] Timeline established
[ ] Logs reviewed
[ ] Metrics analyzed
[ ] Root cause identified (via 5 Whys)
[ ] Contributing factors listed
[ ] Immediate actions completed
[ ] Short-term solutions planned
[ ] Long-term solutions identified
[ ] Solutions prioritized
[ ] RCA report written
[ ] Team debriefing scheduled
[ ] Action items assigned
[ ] Prevention measures planned
[ ] Follow-up scheduled

Key Points

- Distinguish symptom from root cause
- Use 5 Whys technique systematically
- Look for systemic issues, not individual blame
- Focus on prevention, not just fixing
- Document thoroughly for team learning
- Assign clear ownership for solutions
- Follow up to verify effectiveness
- Use RCA to drive improvements
