Alerting & Dashboard Builder
Build effective alerts and dashboards based on SLOs.
SLO Definition

```yaml
slos:
  - name: api_availability
    objective: 99.9%
    window: 30d
    sli: |
      sum(rate(http_requests_total{status_code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
  - name: api_latency
    objective: 95%  # 95% of requests under 500ms
    window: 30d
    sli: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket[5m])
      ) < 0.5
```
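The objective and window together fix the error budget. A quick sketch of that arithmetic (the helper name is mine, not part of the template):

```python
# Sketch: error budget implied by an SLO objective (assumed helper, not from the doc).

def error_budget_minutes(objective_pct: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed per window for a given objective."""
    allowed_error = 1 - objective_pct / 100.0
    return allowed_error * window_days * 24 * 60

# A 99.9% objective over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))
```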
Alert Rules

```yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 1% error rate = 10x burn rate (~1.4% of the 30d budget per hour)
      - alert: AvailabilitySLOFastBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[1h])) /
                sum(rate(http_requests_total[1h])))) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning ~1% of the error budget per hour"
          runbook: "https://runbooks.example.com/availability-fast-burn"
      # Slow burn: 0.1% error rate = 1x burn rate (budget exhausted over the full 30d window)
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[24h])) /
                sum(rate(http_requests_total[24h])))) > 0.001
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Burning error budget slowly; exhausted in ~30 days at this rate"
```
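The fast/slow thresholds above map to burn-rate multiples of the 99.9% objective. A sketch of that arithmetic (function names are assumptions, not part of the template):

```python
def burn_rate(error_rate: float, objective: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_rate / (1 - objective)

def hours_to_exhaustion(error_rate: float, objective: float = 0.999,
                        window_hours: float = 30 * 24) -> float:
    """Hours until the whole window's error budget is gone at this error rate."""
    return window_hours / burn_rate(error_rate, objective)

# Fast-burn threshold (1% error rate) is a 10x burn: budget gone in ~3 days.
print(burn_rate(0.01))            # ~10.0
print(hours_to_exhaustion(0.01))  # ~72.0
```

This is why the fast-burn alert pages (critical) while the slow-burn alert only warns: at 1x burn you merely spend the budget you planned for.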
Dashboard Template

```json
{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Golden Signals",
      "panels": [
        {
          "title": "Request Rate",
          "query": "sum(rate(http_requests_total[5m]))",
          "type": "graph"
        },
        {
          "title": "Error Rate",
          "query": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m]))",
          "type": "graph"
        },
        {
          "title": "Latency (p50, p95, p99)",
          "queries": [
            "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
            "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
          ]
        },
        {
          "title": "Saturation (CPU, Memory)",
          "queries": [
            "rate(process_cpu_seconds_total[5m])",
            "process_resident_memory_bytes"
          ]
        }
      ]
    },
    {
      "title": "SLO Tracking",
      "panels": [
        {
          "title": "Error Budget Remaining",
          "query": "1 - ((1 - slo_availability) / (1 - 0.999))"
        }
      ]
    }
  ]
}
```
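The "Error Budget Remaining" panel is the fraction of allowed errors not yet spent. The same arithmetic in plain code (a sketch; `measured_availability` stands in for the `slo_availability` series):

```python
def budget_remaining(measured_availability: float, objective: float = 0.999) -> float:
    """Fraction of the error budget still unspent (goes negative once blown)."""
    spent = (1 - measured_availability) / (1 - objective)
    return 1 - spent

# 99.95% measured against a 99.9% objective leaves half the budget.
print(budget_remaining(0.9995))  # ~0.5
```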
What to Do When an Alert Fires
Alert Response Guide
HighErrorRate
What it means: More than 5% of requests are failing
First steps:
- Check recent deployments (rollback if needed)
- Review error logs for patterns
- Check dependent services health
- Verify database connectivity
Escalation: If not resolved within 15 minutes, page the on-call lead
HighLatency
What it means: p95 latency above 2 seconds
First steps:
- Check database query performance
- Review recent code changes
- Check cache hit rates
- Look for slow external API calls
Temporary mitigation:
- Scale up instances
- Enable aggressive caching
LowAvailability
What it means: Availability below 99.5%
First steps:
- Check infrastructure (AWS status page)
- Review load balancer health checks
- Check for DDoS activity
- Verify auto-scaling is functioning
Output Checklist
- SLOs defined
- Alert rules configured
- Dashboards created
- Runbooks linked
- Response guides documented