SLO Implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
When to Use Define service reliability targets Measure user-perceived reliability Implement error budgets Create SLO-based alerts Track reliability goals SLI/SLO/SLA Hierarchy SLA (Service Level Agreement) ↓ Contract with customers SLO (Service Level Objective) ↓ Internal reliability target SLI (Service Level Indicator) ↓ Actual measurement
Defining SLIs Common SLI Types 1. Availability SLI
Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))
- Latency SLI
Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
- Durability SLI
Successful writes / Total writes
sum(storage_writes_successful_total) / sum(storage_writes_total)
Reference: See references/slo-definitions.md
Setting SLO Targets Availability SLO Examples SLO % Downtime/Month Downtime/Year 99% 7.2 hours 3.65 days 99.9% 43.2 minutes 8.76 hours 99.95% 21.6 minutes 4.38 hours 99.99% 4.32 minutes 52.56 minutes Choose Appropriate SLOs
Consider:
User expectations Business requirements Current performance Cost of reliability Competitor benchmarks
Example SLOs:
slos: - name: api_availability target: 99.9 window: 28d sli: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))
- name: api_latency_p95 target: 99 window: 28d sli: | sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
Error Budget Calculation Error Budget Formula Error Budget = 1 - SLO Target
Example:
SLO: 99.9% availability Error Budget: 0.1% = 43.2 minutes/month Current Error: 0.05% = 21.6 minutes/month Remaining Budget: 50% Error Budget Policy error_budget_policy: - remaining_budget: 100% action: Normal development velocity - remaining_budget: 50% action: Consider postponing risky changes - remaining_budget: 10% action: Freeze non-critical changes - remaining_budget: 0% action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
SLO Implementation Prometheus Recording Rules
SLI Recording Rules
groups: - name: sli_rules interval: 30s rules: # Availability SLI - record: sli:http_availability:ratio expr: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))
# Latency SLI (requests < 500ms)
- record: sli:http_latency:ratio
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
-
name: slo_rules interval: 5m rules: # SLO compliance (1 = meeting SLO, 0 = violating)
-
record: slo:http_availability:compliance expr: sli:http_availability:ratio >= bool 0.999
-
record: slo:http_latency:compliance expr: sli:http_latency:ratio >= bool 0.99
# Error budget remaining (percentage) - record: slo:http_availability:error_budget_remaining expr: | (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
# Error budget burn rate - record: slo:http_availability:burn_rate_5m expr: | (1 - ( sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) )) / (1 - 0.999)
-
SLO Alerting Rules groups: - name: slo_alerts interval: 1m rules: # Fast burn: 14.4x rate, 1 hour window # Consumes 2% error budget in 1 hour - alert: SLOErrorBudgetBurnFast expr: | slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 for: 2m labels: severity: critical annotations: summary: "Fast error budget burn detected" description: "Error budget burning at {{ $value }}x rate"
# Slow burn: 6x rate, 6 hour window
# Consumes 5% error budget in 6 hours
- alert: SLOErrorBudgetBurnSlow
expr: |
slo:http_availability:burn_rate_6h > 6
and
slo:http_availability:burn_rate_30m > 6
for: 15m
labels:
severity: warning
annotations:
summary: "Slow error budget burn detected"
description: "Error budget burning at {{ $value }}x rate"
# Error budget exhausted
- alert: SLOErrorBudgetExhausted
expr: slo:http_availability:error_budget_remaining < 0
for: 5m
labels:
severity: critical
annotations:
summary: "SLO error budget exhausted"
description: "Error budget remaining: {{ $value }}%"
SLO Dashboard
Grafana Dashboard Structure:
┌────────────────────────────────────┐ │ SLO Compliance (Current) │ │ ✓ 99.95% (Target: 99.9%) │ ├────────────────────────────────────┤ │ Error Budget Remaining: 65% │ │ ████████░░ 65% │ ├────────────────────────────────────┤ │ SLI Trend (28 days) │ │ [Time series graph] │ ├────────────────────────────────────┤ │ Burn Rate Analysis │ │ [Burn rate by time window] │ └────────────────────────────────────┘
Example Queries:
Current SLO compliance
sli:http_availability:ratio * 100
Error budget remaining
slo:http_availability:error_budget_remaining
Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100) * 28 / (1 - sli:http_availability:ratio) * (1 - 0.999)
Multi-Window Burn Rate Alerts
Combination of short and long windows reduces false positives
rules: - alert: SLOBurnRateHigh expr: | ( slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 ) or ( slo:http_availability:burn_rate_6h > 6 and slo:http_availability:burn_rate_30m > 6 ) labels: severity: critical
SLO Review Process Weekly Review Current SLO compliance Error budget status Trend analysis Incident impact Monthly Review SLO achievement Error budget usage Incident postmortems SLO adjustments Quarterly Review SLO relevance Target adjustments Process improvements Tooling enhancements Best Practices Start with user-facing services Use multiple SLIs (availability, latency, etc.) Set achievable SLOs (don't aim for 100%) Implement multi-window alerts to reduce noise Track error budget consistently Review SLOs regularly Document SLO decisions Align with business goals Automate SLO reporting Use SLOs for prioritization Reference Files assets/slo-template.md - SLO definition template references/slo-definitions.md - SLO definition patterns references/error-budget.md - Error budget calculations Related Skills prometheus-configuration - For metric collection grafana-dashboards - For SLO visualization