Alerting & Dashboard Builder
Build effective alerts and dashboards based on SLOs.
SLO Definition

```yaml
slos:
  - name: api_availability
    objective: 99.9%
    window: 30d
    sli: |
      sum(rate(http_requests_total{status_code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
  - name: api_latency
    objective: 95%  # 95% of requests under 500ms
    window: 30d
    sli: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket[5m])
      ) < 0.5
```
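The objective and window together fix the error budget. A quick sketch of that arithmetic (the helper name is mine, not part of the template):

```python
# Sketch: error budget implied by an SLO objective (assumed helper, not from the doc).

def error_budget_minutes(objective_pct: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed per window for a given objective."""
    allowed_error = 1 - objective_pct / 100.0
    return allowed_error * window_days * 24 * 60

# A 99.9% objective over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))
```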
Alert Rules

```yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 1% error rate = 10x burn rate (~1.4% of the 30d budget per hour)
      - alert: AvailabilitySLOFastBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[1h])) /
                sum(rate(http_requests_total[1h])))) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning ~1% of the error budget per hour"
          runbook: "https://runbooks.example.com/availability-fast-burn"
      # Slow burn: 0.1% error rate = 1x burn rate (budget exhausted over the full 30d window)
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[24h])) /
                sum(rate(http_requests_total[24h])))) > 0.001
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Burning error budget slowly; exhausted in ~30 days at this rate"
```
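The fast/slow thresholds above map to burn-rate multiples of the 99.9% objective. A sketch of that arithmetic (function names are assumptions, not part of the template):

```python
def burn_rate(error_rate: float, objective: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_rate / (1 - objective)

def hours_to_exhaustion(error_rate: float, objective: float = 0.999,
                        window_hours: float = 30 * 24) -> float:
    """Hours until the whole window's error budget is gone at this error rate."""
    return window_hours / burn_rate(error_rate, objective)

# Fast-burn threshold (1% error rate) is a 10x burn: budget gone in ~3 days.
print(burn_rate(0.01))            # ~10.0
print(hours_to_exhaustion(0.01))  # ~72.0
```

This is why the fast-burn alert pages (critical) while the slow-burn alert only warns: at 1x burn you merely spend the budget you planned for.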
Dashboard Template

```json
{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Golden Signals",
      "panels": [
        {
          "title": "Request Rate",
          "query": "sum(rate(http_requests_total[5m]))",
          "type": "graph"
        },
        {
          "title": "Error Rate",
          "query": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m]))",
          "type": "graph"
        },
        {
          "title": "Latency (p50, p95, p99)",
          "queries": [
            "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
            "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
          ]
        },
        {
          "title": "Saturation (CPU, Memory)",
          "queries": [
            "rate(process_cpu_seconds_total[5m])",
            "process_resident_memory_bytes"
          ]
        }
      ]
    },
    {
      "title": "SLO Tracking",
      "panels": [
        {
          "title": "Error Budget Remaining",
          "query": "1 - ((1 - slo_availability) / (1 - 0.999))"
        }
      ]
    }
  ]
}
```
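The "Error Budget Remaining" panel is the fraction of allowed errors not yet spent. The same arithmetic in plain code (a sketch; `measured_availability` stands in for the `slo_availability` series):

```python
def budget_remaining(measured_availability: float, objective: float = 0.999) -> float:
    """Fraction of the error budget still unspent (goes negative once blown)."""
    spent = (1 - measured_availability) / (1 - objective)
    return 1 - spent

# 99.95% measured against a 99.9% objective leaves half the budget.
print(budget_remaining(0.9995))  # ~0.5
```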
What to Do When an Alert Fires
Alert Response Guide
HighErrorRate
What it means: More than 5% of requests are failing
First steps:
- Check recent deployments (rollback if needed)
- Review error logs for patterns
- Check dependent services health
- Verify database connectivity
Escalation: If not resolved within 15 minutes, page the on-call lead
HighLatency
What it means: p95 latency above 2 seconds
First steps:
- Check database query performance
- Review recent code changes
- Check cache hit rates
- Look for slow external API calls
Temporary mitigation:
- Scale up instances
- Enable aggressive caching
LowAvailability
What it means: Availability below 99.5%
First steps:
- Check infrastructure (AWS status page)
- Review load balancer health checks
- Check for DDoS activity
- Verify auto-scaling is functioning
Output Checklist
- SLOs defined
- Alert rules configured
- Dashboards created
- Runbooks linked
- Response guides documented