Service Mesh Observability

A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘

Golden Signals for Mesh

Signal      Description                 Alert Threshold
----------  --------------------------  -----------------
Latency     Request duration P50, P99   P99 > 500ms
Traffic     Requests per second         Anomaly detection
Errors      5xx error rate              > 1%
Saturation  Resource utilization        > 80%

Templates

Template 1: Istio with Prometheus & Grafana

Install Prometheus

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          # istio-telemetry is the Mixer-era telemetry service. On Istio 1.5+
          # (Telemetry v2) match istiod or scrape the sidecar proxies instead.
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istio-telemetry

ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
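Once Prometheus is scraping the mesh, one way to confirm that data is arriving is to run an instant query against its HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses a response body, so it runs without a cluster. The base URL is the Prometheus address used in the Kiali template later in this guide; the helper names `build_query_url` and `parse_vector` are illustrative, not part of any library.

```python
import json
from urllib.parse import urlencode

# Address taken from the Kiali template in this guide; adjust for your cluster.
PROMETHEUS_URL = "http://prometheus.istio-system:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict:
    """Map each series' label set (as a sorted tuple of pairs) to its float value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        tuple(sorted(item["metric"].items())): float(item["value"][1])
        for item in payload["data"]["result"]
    }


if __name__ == "__main__":
    # Prints the URL to fetch; pass the response body to parse_vector().
    print(build_query_url(PROMETHEUS_URL, "sum(rate(istio_requests_total[5m]))"))
```

An empty result for `istio_requests_total` usually means the scrape targets or relabel rules do not match anything, which is worth checking before building dashboards on top.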

Template 2: Key Istio Metrics Queries

Request rate by service

sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

Error rate (5xx)

sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

P99 latency

histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service_name))

TCP connections opened per second

sum(rate(istio_tcp_connections_opened_total{reporter="destination"}[5m])) by (destination_service_name)

Request size

histogram_quantile(0.99, sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (le, destination_service_name))
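The P99 queries above rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative `le` buckets. The following is a simplified sketch of that calculation, not Prometheus' actual implementation, and the bucket counts are made-up example data. It shows why the result can never be more precise than the bucket boundaries allow.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The bucket with upper bound +Inf must be present, as in Prometheus.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite boundary: report that boundary.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0.0
    if count == previous:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Hypothetical request-duration buckets in milliseconds.
example_buckets = [(100, 50), (250, 80), (500, 95), (1000, 99), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.5, example_buckets))   # 100.0
    print(histogram_quantile(0.99, example_buckets))  # 1000.0
```

With these buckets, any quantile that falls into the `+Inf` bucket is reported as 1000, so a latency alert threshold placed above the highest finite bucket boundary can never trigger.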

Template 3: Jaeger Distributed Tracing

Jaeger installation for Istio

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411

Jaeger deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775   # UDP
              protocol: UDP
            - containerPort: 6831   # Thrift
              protocol: UDP
            - containerPort: 6832   # Thrift
              protocol: UDP
            - containerPort: 5778   # Config
            - containerPort: 16686  # UI
            - containerPort: 14268  # HTTP
            - containerPort: 14250  # gRPC
            - containerPort: 9411   # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
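The sidecars emit spans automatically, but they cannot link an inbound request to the outbound calls it triggers unless the application copies the trace headers from one to the other. A minimal sketch of that copy step follows. The header list is the B3 and W3C Trace Context set commonly documented for Envoy-based meshes and should be checked against your mesh version; `propagation_headers` is an illustrative helper name.

```python
# Headers that carry trace context between hops (assumed set, verify for your mesh).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def propagation_headers(incoming: dict) -> dict:
    """Return the trace headers of an inbound request, ready to attach to outbound calls."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "Content-Type": "application/json",
        "X-Request-Id": "req-1",
        "X-B3-TraceId": "463ac35c9f6413ad48485a3953bb6124",
        "X-B3-SpanId": "a2fb4a1d1a96d312",
        "X-B3-Sampled": "1",
    }
    print(propagation_headers(inbound))
```

Merging the returned dictionary into the headers of every downstream HTTP call keeps all spans of a request in a single trace.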

Template 4: Linkerd Viz Dashboard

Install Linkerd viz extension

linkerd viz install | kubectl apply -f -

Access dashboard

linkerd viz dashboard

CLI commands for observability

Top requests

linkerd viz top deploy/my-app

Per-route metrics

linkerd viz routes deploy/my-app --to deploy/backend

Live traffic inspection

linkerd viz tap deploy/my-app --to deploy/backend

Service edges (dependencies)

linkerd viz edges deployment -n my-namespace

Template 5: Grafana Dashboard JSON

{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
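When several dashboards share the same panel shape, generating the JSON from code avoids hand-editing escaped PromQL strings. The sketch below reproduces only the fields used in the dashboard above, not Grafana's full schema, and the helper names `panel` and `dashboard` are illustrative.

```python
import json


def panel(title, expr, panel_type="graph", legend=None):
    """Build one panel with a single PromQL target."""
    target = {"expr": expr}
    if legend is not None:
        target["legendFormat"] = legend
    return {"title": title, "type": panel_type, "targets": [target]}


def dashboard(title, panels):
    """Wrap panels in the top-level structure used by the template above."""
    return {"dashboard": {"title": title, "panels": list(panels)}}


BY_SERVICE = "{{destination_service_name}}"

overview = dashboard(
    "Service Mesh Overview",
    [
        panel(
            "Request Rate",
            'sum(rate(istio_requests_total{reporter="destination"}[5m])) '
            "by (destination_service_name)",
            legend=BY_SERVICE,
        ),
        panel(
            "P99 Latency",
            "histogram_quantile(0.99, sum(rate("
            'istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]'
            ")) by (le, destination_service_name))",
            legend=BY_SERVICE,
        ),
    ],
)

if __name__ == "__main__":
    # json.dumps takes care of escaping the quotes inside the PromQL strings.
    print(json.dumps(overview, indent=2))
```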

Template 6: Kiali Service Mesh Visualization

Kiali installation

apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous  # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000

Template 7: OpenTelemetry Integration

OpenTelemetry Collector for mesh

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      # Recent Collector releases no longer ship the jaeger exporter;
      # on those versions, send OTLP to Jaeger with the otlp exporter instead.
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
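Every component named in a pipeline must also be declared in the matching top-level `receivers`, `processors`, or `exporters` section, otherwise the Collector refuses to start. The check below is a small sketch of that rule applied to the configuration above, written as a Python dictionary so it runs without a YAML parser; `undefined_components` is an illustrative helper, not a Collector API.

```python
def undefined_components(config: dict) -> list:
    """List pipeline entries that reference components missing from the config."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, spec in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            declared = config.get(section) or {}
            for component in spec.get(section, []):
                if component not in declared:
                    problems.append(
                        f"pipeline '{pipeline_name}': {section[:-1]} "
                        f"'{component}' is not declared"
                    )
    return problems


# The Collector config above, reduced to the keys this check needs.
collector_config = {
    "receivers": {"otlp": {}, "zipkin": {}},
    "processors": {"batch": {"timeout": "10s"}},
    "exporters": {"jaeger": {}, "prometheus": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp", "zipkin"],
                "processors": ["batch"],
                "exporters": ["jaeger"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["prometheus"],
            },
        }
    },
}

if __name__ == "__main__":
    print(undefined_components(collector_config))  # []
```

Running a check like this in CI catches renamed or removed exporters before the ConfigMap is applied.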

Istio Telemetry v2 with OTel

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel  # must match an extensionProviders entry in meshConfig
      randomSamplingPercentage: 10
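The sampling percentage directly scales how many spans reach the tracing backend. A rough estimate, assuming sampling is decided once per request and every sampled request keeps all of its spans, can be sketched as follows. The traffic figures are hypothetical, chosen only to show the arithmetic.

```python
SECONDS_PER_DAY = 86_400


def sampled_spans_per_day(requests_per_second, spans_per_request, sampling_percent):
    """Estimate how many spans per day are stored at a given sampling percentage."""
    if not 0 <= sampling_percent <= 100:
        raise ValueError("sampling_percent must be between 0 and 100")
    return (
        requests_per_second * spans_per_request * SECONDS_PER_DAY
        * sampling_percent / 100
    )


if __name__ == "__main__":
    # Hypothetical mesh: 500 requests/s, 8 spans per request.
    print(sampled_spans_per_day(500, 8, 100))  # 345600000.0
    print(sampled_spans_per_day(500, 8, 10))   # 34560000.0
```

Multiplying the result by an average span size measured in your own backend gives a first-order storage figure to compare sampling rates against.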

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
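The `for: 5m` clause in the rules above means an alert stays pending until its expression has been true continuously for five minutes, and any evaluation where it is false resets the timer. The sketch below imitates that state machine for a "value above threshold" rule; it is a simplified model, not Prometheus' rule engine, and the sample values are invented.

```python
def alert_states(samples, threshold, for_seconds):
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, value) sample.

    Samples must be in time order; timestamps are in seconds.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_long_enough = timestamp - active_since >= for_seconds
            states.append("firing" if held_long_enough else "pending")
        else:
            active_since = None  # a single healthy evaluation resets the timer
            states.append("inactive")
    return states


# Hypothetical error ratios, evaluated once a minute.
error_ratios = [0.02, 0.06, 0.07, 0.08, 0.09, 0.06, 0.07, 0.01]
samples = [(i * 60, ratio) for i, ratio in enumerate(error_ratios)]

if __name__ == "__main__":
    # HighErrorRate: threshold 0.05, for: 5m
    print(alert_states(samples, threshold=0.05, for_seconds=300))
```

With `for_seconds=0`, which corresponds to a rule without a `for:` clause such as MeshCertExpiring, the first evaluation above the threshold is reported as firing immediately.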

Best Practices

Do's

- Sample appropriately: 100% in dev, 1-10% in prod
- Use trace context: propagate headers consistently
- Set up alerts: cover the golden signals
- Correlate metrics and traces: use exemplars
- Retain strategically: use hot/cold storage tiers

Don'ts

- Don't over-sample: storage costs add up
- Don't ignore cardinality: limit label values
- Don't skip dashboards: visualize dependencies
- Don't forget costs: monitor observability costs

Resources

- Istio Observability
- Linkerd Observability
- OpenTelemetry
- Kiali
