Canary Deployment Overview

Deploy new versions gradually to a small percentage of users, monitor metrics for issues, and automatically rollback or proceed based on predefined thresholds.

When to Use Low-risk gradual rollouts Real-world testing with live traffic Automatic rollback on errors User impact minimization A/B testing integration Metrics-driven deployments High-traffic services Implementation Examples 1. Istio-based Canary Deployment

canary-deployment-istio.yaml

apiVersion: apps/v1 kind: Deployment metadata: name: myapp-v1 namespace: production spec: replicas: 3 selector: matchLabels: app: myapp version: v1 template: metadata: labels: app: myapp version: v1 spec: containers: - name: myapp image: myrepo/myapp:1.0.0 ports: - containerPort: 8080

apiVersion: apps/v1 kind: Deployment metadata: name: myapp-v2 namespace: production spec: replicas: 1 # Start with minimal replicas for canary selector: matchLabels: app: myapp version: v2 template: metadata: labels: app: myapp version: v2 spec: containers: - name: myapp image: myrepo/myapp:2.0.0 ports: - containerPort: 8080

apiVersion: networking.istio.io/v1beta1 kind: VirtualService metadata: name: myapp namespace: production spec: hosts: - myapp http: # Canary: 5% to v2, 95% to v1 - match: - headers: user-agent: regex: ".Chrome." # Test with Chrome route: - destination: host: myapp subset: v2 weight: 100 timeout: 10s

# Default route with traffic split
- route:
    - destination:
        host: myapp
        subset: v1
      weight: 95
    - destination:
        host: myapp
        subset: v2
      weight: 5
  timeout: 10s

apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: myapp namespace: production spec: host: myapp trafficPolicy: connectionPool: http: http1MaxPendingRequests: 100 maxRequestsPerConnection: 2

subsets: - name: v1 labels: version: v1

- name: v2
  labels:
    version: v2
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s

apiVersion: flagger.app/v1beta1 kind: Canary metadata: name: myapp namespace: production spec: targetRef: apiVersion: apps/v1 kind: Deployment name: myapp progressDeadlineSeconds: 300 service: port: 80

analysis: interval: 1m threshold: 5 maxWeight: 50 stepWeight: 5 stepWeightPromotion: 10

metrics: - name: request-success-rate thresholdRange: min: 99 interval: 1m

- name: request-duration
  thresholdRange:
    max: 500
  interval: 30s

webhooks: - name: acceptance-test url: http://flagger-loadtester/ timeout: 30s metadata: type: smoke cmd: "curl -sd 'test' http://myapp-canary/api/test"

- name: load-test
  url: http://flagger-loadtester/
  timeout: 5s
  metadata:
    cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
    logCmdOutput: "true"

# Automatic rollback on failure skipAnalysis: false

Kubernetes Native Canary Script

!/bin/bash

canary-rollout.sh - Canary deployment with k8s native tools

set -euo pipefail

NAMESPACE="${1:-production}" DEPLOYMENT="${2:-myapp}" NEW_VERSION="${3:-latest}" CANARY_WEIGHT=10 MAX_WEIGHT=100 STEP_WEIGHT=10 CHECK_INTERVAL=60 MAX_ERROR_RATE=0.05

echo "Starting canary deployment for $DEPLOYMENT with version $NEW_VERSION"

Get current replicas

CURRENT_REPLICAS=$(kubectl get deployment $DEPLOYMENT -n "$NAMESPACE" \ -o jsonpath='{.spec.replicas}') CANARY_REPLICAS=$((CURRENT_REPLICAS / 10 + 1))

echo "Current replicas: $CURRENT_REPLICAS, Canary replicas: $CANARY_REPLICAS"

Create canary deployment

kubectl set image deployment/${DEPLOYMENT}-canary \ ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \ -n "$NAMESPACE"

Scale up canary gradually

CURRENT_WEIGHT=$CANARY_WEIGHT while [ $CURRENT_WEIGHT -le $MAX_WEIGHT ]; do echo "Setting traffic to canary: ${CURRENT_WEIGHT}%"

# Update ingress or service to split traffic kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge \ -p '{"spec":{"http":[{"route":[{"destination":{"host":"'${DEPLOYMENT}-stable'"},"weight":'$((100-CURRENT_WEIGHT))'},{"destination":{"host":"'${DEPLOYMENT}-canary'"},"weight":'${CURRENT_WEIGHT}'}]}]}}'

# Wait and check metrics echo "Monitoring metrics for ${CHECK_INTERVAL}s..." sleep $CHECK_INTERVAL

# Check error rate ERROR_RATE=$(kubectl exec -it deployment/${DEPLOYMENT}-canary -n "$NAMESPACE" -- \ curl -s http://localhost:8080/metrics | grep http_requests_total | \ awk '{print $2}' || echo "0")

if (( $(echo "$ERROR_RATE > $MAX_ERROR_RATE" | bc -l) )); then echo "ERROR: Error rate exceeded threshold: $ERROR_RATE" echo "Rolling back canary deployment..." kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge \ -p '{"spec":{"http":[{"route":[{"destination":{"host":"'${DEPLOYMENT}-stable'"},"weight":100}]}]}}' exit 1 fi

CURRENT_WEIGHT=$((CURRENT_WEIGHT + STEP_WEIGHT)) done

Promote canary to stable

echo "Canary deployment successful! Promoting to stable..." kubectl set image deployment/${DEPLOYMENT} \ ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \ -n "$NAMESPACE"

kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m

echo "Canary deployment complete!"

Metrics-Based Canary Analysis

canary-monitoring.yaml

apiVersion: v1 kind: ConfigMap metadata: name: canary-analysis namespace: production data: analyze.sh: | #!/bin/bash set -euo pipefail

CANARY_DEPLOYMENT="${1:-myapp-canary}"
STABLE_DEPLOYMENT="${2:-myapp-stable}"
THRESHOLD="${3:-0.05}"  # 5% error rate threshold
NAMESPACE="production"

echo "Analyzing canary metrics..."

# Query Prometheus for metrics
CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${CANARY_DEPLOYMENT}'"}[5m])' | \
  jq -r '.data.result[0].value[1]' || echo "0")

STABLE_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${STABLE_DEPLOYMENT}'"}[5m])' | \
  jq -r '.data.result[0].value[1]' || echo "0")

CANARY_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \
  jq -r '.data.result[0].value[1]' || echo "0")

STABLE_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \
  jq -r '.data.result[0].value[1]' || echo "0")

echo "Canary Error Rate: $CANARY_ERROR_RATE"
echo "Stable Error Rate: $STABLE_ERROR_RATE"
echo "Canary P95 Latency: ${CANARY_LATENCY}s"
echo "Stable P95 Latency: ${STABLE_LATENCY}s"

# Check if canary is within acceptable range
if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
  echo "FAIL: Canary error rate exceeds threshold"
  exit 1
fi

if (( $(echo "$CANARY_LATENCY > $STABLE_LATENCY * 1.2" | bc -l) )); then
  echo "FAIL: Canary latency is 20% higher than stable"
  exit 1
fi

echo "PASS: Canary meets quality criteria"
exit 0

apiVersion: batch/v1 kind: Job metadata: name: canary-analysis namespace: production spec: template: spec: containers: - name: analyzer image: curlimages/curl:latest command: - sh - -c - | apk add --no-cache bc jq bash /scripts/analyze.sh volumeMounts: - name: scripts mountPath: /scripts volumes: - name: scripts configMap: name: canary-analysis restartPolicy: Never

Automated Canary Promotion

!/bin/bash

promote-canary.sh - Automatically promote successful canary

set -euo pipefail

NAMESPACE="${1:-production}" DEPLOYMENT="${2:-myapp}" MAX_DURATION="${3:-600}" # Max 10 minutes for canary

start_time=$(date +%s)

echo "Starting automated canary promotion for $DEPLOYMENT"

while true; do current_time=$(date +%s) elapsed=$((current_time - start_time))

if [ $elapsed -gt $MAX_DURATION ]; then echo "ERROR: Canary exceeded max duration" exit 1 fi

# Check canary health CANARY_REPLICAS=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \ -o jsonpath='{.status.readyReplicas}') CANARY_DESIRED=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \ -o jsonpath='{.spec.replicas}')

if [ "$CANARY_REPLICAS" -ne "$CANARY_DESIRED" ]; then echo "Waiting for canary pods to be ready..." sleep 10 continue fi

# Run analysis if bash /scripts/analyze.sh "$DEPLOYMENT-canary" "$DEPLOYMENT-stable"; then echo "Canary analysis passed! Promoting to stable..."

# Merge canary into stable
kubectl set image deployment/${DEPLOYMENT} \
  ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
    -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2) \
  -n "$NAMESPACE"

kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m

echo "Canary promoted successfully!"
exit 0

else echo "Canary analysis failed. Rolling back..." exit 1 fi done

Canary Best Practices ✅ DO Start with small traffic percentage (5-10%) Monitor key metrics continuously Increase gradually based on metrics Implement automatic rollback Run load tests on canary Test with real user traffic Set appropriate thresholds Document rollback procedures ❌ DON'T Rush through canary phases Ignore metrics Mix canary and stable versions Deploy to all users at once Skip rollback testing Use artificial load only Set unrealistic thresholds Deploy unvalidated changes Metrics to Monitor Error Rate: 5xx errors increase Latency: P95/P99 response time Throughput: Requests per second Resource Usage: CPU, memory Business Metrics: Conversion rate, revenue User Experience: Session duration, bounce rate Resources Flagger Canary Deployments Istio Traffic Management Kubernetes Canary Pattern

安装

canary-deployment-istio.yaml

!/bin/bash

canary-rollout.sh - Canary deployment with k8s native tools

Get current replicas

Create canary deployment

Scale up canary gradually

Promote canary to stable

canary-monitoring.yaml

!/bin/bash

promote-canary.sh - Automatically promote successful canary