slo-implementation

Installs: 2.9K
Rank: #756

Install

npx skills add https://github.com/wshobson/agents --skill slo-implementation

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

Successful requests / Total requests

sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

2. Latency SLI

Requests below latency threshold / Total requests

sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

3. Durability SLI

Successful writes / Total writes

sum(increase(storage_writes_successful_total[28d])) / sum(increase(storage_writes_total[28d]))

Reference: See references/slo-definitions.md
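All three SLI types above share one shape: good events divided by total events. A minimal Python sketch of that ratio follows. The function name `ratio_sli` and the sample counts are hypothetical, and treating an empty window as fully good is an assumption, not something the skill specifies.

```python
def ratio_sli(good_events: float, total_events: float) -> float:
    """Return good/total as a fraction between 0 and 1.

    Assumption: a window with no events counts as fully good (1.0),
    so an idle service does not register as an outage.
    """
    if good_events < 0 or total_events < 0:
        raise ValueError("event counts must be non-negative")
    if good_events > total_events:
        raise ValueError("good events cannot exceed total events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


# Hypothetical 28-day counts for illustration only.
availability = ratio_sli(good_events=999_500, total_events=1_000_000)
latency = ratio_sli(good_events=992_000, total_events=1_000_000)
```

Whether an empty window should count as good, bad, or "no data" depends on the service; the choice here only keeps the sketch total.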

Setting SLO Targets

Availability SLO Examples

SLO %     Downtime/Month    Downtime/Year
99%       7.2 hours         3.65 days
99.9%     43.2 minutes      8.76 hours
99.95%    21.6 minutes      4.38 hours
99.99%    4.32 minutes      52.56 minutes

Choose Appropriate SLOs

Consider:

- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

Example SLOs:

slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
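The downtime figures in the availability table follow directly from the SLO percentage and the length of the period. The sketch below reproduces them, assuming a 30-day month and a 365-day year (which is what the table's numbers imply). The function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of full downtime an availability SLO tolerates over a period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - slo_percent / 100)


# Monthly column assumes 30 days; yearly column assumes 365 days.
monthly_999 = allowed_downtime_minutes(99.9, 30)     # about 43.2 minutes
yearly_9999 = allowed_downtime_minutes(99.99, 365)   # about 52.56 minutes
```

With the 28-day window used in the example SLOs, the same 99.9% target allows about 40.3 minutes rather than 43.2.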

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes/month
Current Error: 0.05% = 21.6 minutes/month
Remaining Budget: 50%

Error Budget Policy

error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability

Reference: See references/error-budget.md
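The worked example and policy above can be sketched in Python as follows. The function names are hypothetical. The policy lists four points rather than ranges, so reading each value as the upper bound of a band is an assumption made for this sketch.

```python
def budget_remaining_percent(slo_target: float, observed_error_ratio: float) -> float:
    """Share of the error budget still unspent, as a percentage.

    slo_target and observed_error_ratio are fractions (0.999, 0.0005).
    The result is rounded to 6 decimals to hide floating-point noise.
    """
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return round((budget - observed_error_ratio) / budget * 100, 6)


def policy_action(remaining_percent: float) -> str:
    """Map remaining budget to an action (band boundaries are an assumption)."""
    if remaining_percent <= 0:
        return "Feature freeze, focus on reliability"
    if remaining_percent <= 10:
        return "Freeze non-critical changes"
    if remaining_percent <= 50:
        return "Consider postponing risky changes"
    return "Normal development velocity"


# The example above: 99.9% SLO with 0.05% observed errors.
remaining = budget_remaining_percent(0.999, 0.0005)
action = policy_action(remaining)
```

A negative result means the budget is overspent, which matches the sign convention of the error-budget-remaining formula used elsewhere in this skill.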

SLO Implementation

Prometheus Recording Rules

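# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # Burn rates over the longer windows referenced by the alerting rules
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - sum(rate(http_requests_total{status!~"5.."}[30m])) / sum(rate(http_requests_total[30m]))) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - sum(rate(http_requests_total{status!~"5.."}[6h])) / sum(rate(http_requests_total[6h]))) / (1 - 0.999)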
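To make the burn-rate rule concrete, the sketch below mirrors its arithmetic in Python: the observed error ratio divided by the error ratio the SLO allows. The function name `burn_rate` and the request counts are hypothetical, and returning 0.0 for a window with no traffic is an assumption.

```python
def burn_rate(good_requests: float, total_requests: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget is being spent exactly on schedule;
    values above 1.0 exhaust it before the SLO window ends.
    """
    allowed_error_ratio = 1 - slo_target
    if allowed_error_ratio <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    if total_requests == 0:
        return 0.0  # assumption: no traffic burns no budget
    observed_error_ratio = 1 - good_requests / total_requests
    return observed_error_ratio / allowed_error_ratio


# Hypothetical 5-minute window: 144 failures out of 10,000 requests.
rate_5m = burn_rate(good_requests=9_856, total_requests=10_000, slo_target=0.999)
```

Here a 1.44% error ratio against a 0.1% allowance gives a burn rate of about 14.4.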

SLO Alerting Rules

groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
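The 14.4x and 6x thresholds can be derived from the share of budget each alert is meant to catch. The sketch below shows that arithmetic. The function name `burn_rate_threshold` is hypothetical. The exact figures 14.4 and 6 come out for a 30-day SLO window; with the 28-day window used in the recording rules, the same budget shares give slightly lower thresholds.

```python
def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_window_days: float) -> float:
    """Burn rate that spends budget_fraction of the budget in alert_window_hours."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


fast_30d = burn_rate_threshold(0.02, alert_window_hours=1, slo_window_days=30)  # about 14.4
slow_30d = burn_rate_threshold(0.05, alert_window_hours=6, slo_window_days=30)  # about 6.0
fast_28d = burn_rate_threshold(0.02, alert_window_hours=1, slo_window_days=28)  # about 13.44
slow_28d = burn_rate_threshold(0.05, alert_window_hours=6, slo_window_days=28)  # about 5.6
```

Keeping 14.4 and 6 with a 28-day window is a reasonable simplification: the alerts then fire at slightly more than 2% and 5% of the budget.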

SLO Dashboard

Grafana Dashboard Structure:

┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

Example Queries:

Current SLO compliance

sli:http_availability:ratio * 100

Error budget remaining

slo:http_availability:error_budget_remaining

Days until error budget exhausted (at current burn rate)

(slo:http_availability:error_budget_remaining / 100) * 28 / (1 - sli:http_availability:ratio) * (1 - 0.999)
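The last query divides the remaining share of the budget by the burn rate, since `(1 - sli) / (1 - 0.999)` is the burn rate averaged over the window. A Python sketch of the same estimate follows. The function name `days_until_exhausted` is hypothetical, and the handling of an already spent budget or a zero burn rate is an assumption, because the query itself does not define those cases.

```python
import math


def days_until_exhausted(remaining_percent: float,
                         burn_rate: float,
                         window_days: float = 28) -> float:
    """Days until the error budget runs out if the burn rate stays constant."""
    if remaining_percent <= 0:
        return 0.0       # assumption: budget is already spent
    if burn_rate <= 0:
        return math.inf  # assumption: no errors, so the budget never runs out
    return remaining_percent / 100 * window_days / burn_rate


# 65% of the budget left, burning at twice the sustainable rate.
estimate = days_until_exhausted(remaining_percent=65, burn_rate=2.0)
```

With 65% of a 28-day budget left and a burn rate of 2, the estimate is about 9.1 days.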

Multi-Window Burn Rate Alerts

Combination of short and long windows reduces false positives

rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
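The boolean structure of this rule can be sketched in Python: a window pair fires only when both its long and short burn rates exceed the threshold, and the alert fires if any pair does. The names `WINDOW_PAIRS` and `should_alert` are hypothetical, and the input dictionary stands in for values that would come from the recording rules.

```python
from typing import Mapping

# (long window, short window, burn-rate threshold), taken from the rule above.
WINDOW_PAIRS = (
    ("1h", "5m", 14.4),
    ("6h", "30m", 6.0),
)


def should_alert(burn_rates: Mapping[str, float]) -> bool:
    """True if any window pair has both long and short burn rates over threshold."""
    return any(
        burn_rates[long_window] > threshold and burn_rates[short_window] > threshold
        for long_window, short_window, threshold in WINDOW_PAIRS
    )


# A brief spike: the 5m window is hot, but the 1h window has not caught up.
spike = {"5m": 20.0, "30m": 4.0, "1h": 3.0, "6h": 1.0}
# A sustained incident: both the 5m and 1h windows exceed 14.4.
incident = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 4.0}
```

`should_alert(spike)` is false and `should_alert(incident)` is true. The long window suppresses the short spike, while the short window lets the alert clear soon after the errors stop.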

SLO Review Process

Weekly Review

- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

Monthly Review

- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

Quarterly Review

- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

Best Practices

- Start with user-facing services
- Use multiple SLIs (availability, latency, etc.)
- Set achievable SLOs (don't aim for 100%)
- Implement multi-window alerts to reduce noise
- Track error budget consistently
- Review SLOs regularly
- Document SLO decisions
- Align with business goals
- Automate SLO reporting
- Use SLOs for prioritization

Reference Files

- assets/slo-template.md - SLO definition template
- references/slo-definitions.md - SLO definition patterns
- references/error-budget.md - Error budget calculations

Related Skills

- prometheus-configuration - For metric collection
- grafana-dashboards - For SLO visualization
