Canary Deployment Overview
Deploy new versions gradually to a small percentage of users, monitor metrics for issues, and automatically rollback or proceed based on predefined thresholds.
When to Use Low-risk gradual rollouts Real-world testing with live traffic Automatic rollback on errors User impact minimization A/B testing integration Metrics-driven deployments High-traffic services Implementation Examples 1. Istio-based Canary Deployment
canary-deployment-istio.yaml
apiVersion: apps/v1 kind: Deployment metadata: name: myapp-v1 namespace: production spec: replicas: 3 selector: matchLabels: app: myapp version: v1 template: metadata: labels: app: myapp version: v1 spec: containers: - name: myapp image: myrepo/myapp:1.0.0 ports: - containerPort: 8080
apiVersion: apps/v1 kind: Deployment metadata: name: myapp-v2 namespace: production spec: replicas: 1 # Start with minimal replicas for canary selector: matchLabels: app: myapp version: v2 template: metadata: labels: app: myapp version: v2 spec: containers: - name: myapp image: myrepo/myapp:2.0.0 ports: - containerPort: 8080
apiVersion: networking.istio.io/v1beta1 kind: VirtualService metadata: name: myapp namespace: production spec: hosts: - myapp http: # Canary: 5% to v2, 95% to v1 - match: - headers: user-agent: regex: ".Chrome." # Test with Chrome route: - destination: host: myapp subset: v2 weight: 100 timeout: 10s
# Default route with traffic split
- route:
- destination:
host: myapp
subset: v1
weight: 95
- destination:
host: myapp
subset: v2
weight: 5
timeout: 10s
apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: myapp namespace: production spec: host: myapp trafficPolicy: connectionPool: http: http1MaxPendingRequests: 100 maxRequestsPerConnection: 2
subsets: - name: v1 labels: version: v1
- name: v2
labels:
version: v2
trafficPolicy:
outlierDetection:
consecutive5xxErrors: 3
interval: 30s
baseEjectionTime: 30s
apiVersion: flagger.app/v1beta1 kind: Canary metadata: name: myapp namespace: production spec: targetRef: apiVersion: apps/v1 kind: Deployment name: myapp progressDeadlineSeconds: 300 service: port: 80
analysis: interval: 1m threshold: 5 maxWeight: 50 stepWeight: 5 stepWeightPromotion: 10
metrics: - name: request-success-rate thresholdRange: min: 99 interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 30s
webhooks: - name: acceptance-test url: http://flagger-loadtester/ timeout: 30s metadata: type: smoke cmd: "curl -sd 'test' http://myapp-canary/api/test"
- name: load-test
url: http://flagger-loadtester/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
logCmdOutput: "true"
# Automatic rollback on failure skipAnalysis: false
- Kubernetes Native Canary Script
!/bin/bash
canary-rollout.sh - Canary deployment with k8s native tools
set -euo pipefail
NAMESPACE="${1:-production}" DEPLOYMENT="${2:-myapp}" NEW_VERSION="${3:-latest}" CANARY_WEIGHT=10 MAX_WEIGHT=100 STEP_WEIGHT=10 CHECK_INTERVAL=60 MAX_ERROR_RATE=0.05
echo "Starting canary deployment for $DEPLOYMENT with version $NEW_VERSION"
Get current replicas
CURRENT_REPLICAS=$(kubectl get deployment $DEPLOYMENT -n "$NAMESPACE" \ -o jsonpath='{.spec.replicas}') CANARY_REPLICAS=$((CURRENT_REPLICAS / 10 + 1))
echo "Current replicas: $CURRENT_REPLICAS, Canary replicas: $CANARY_REPLICAS"
Create canary deployment
kubectl set image deployment/${DEPLOYMENT}-canary \ ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \ -n "$NAMESPACE"
Scale up canary gradually
CURRENT_WEIGHT=$CANARY_WEIGHT while [ $CURRENT_WEIGHT -le $MAX_WEIGHT ]; do echo "Setting traffic to canary: ${CURRENT_WEIGHT}%"
# Update ingress or service to split traffic kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge \ -p '{"spec":{"http":[{"route":[{"destination":{"host":"'${DEPLOYMENT}-stable'"},"weight":'$((100-CURRENT_WEIGHT))'},{"destination":{"host":"'${DEPLOYMENT}-canary'"},"weight":'${CURRENT_WEIGHT}'}]}]}}'
# Wait and check metrics echo "Monitoring metrics for ${CHECK_INTERVAL}s..." sleep $CHECK_INTERVAL
# Check error rate ERROR_RATE=$(kubectl exec -it deployment/${DEPLOYMENT}-canary -n "$NAMESPACE" -- \ curl -s http://localhost:8080/metrics | grep http_requests_total | \ awk '{print $2}' || echo "0")
if (( $(echo "$ERROR_RATE > $MAX_ERROR_RATE" | bc -l) )); then echo "ERROR: Error rate exceeded threshold: $ERROR_RATE" echo "Rolling back canary deployment..." kubectl patch virtualservice ${DEPLOYMENT} -n "$NAMESPACE" --type merge \ -p '{"spec":{"http":[{"route":[{"destination":{"host":"'${DEPLOYMENT}-stable'"},"weight":100}]}]}}' exit 1 fi
CURRENT_WEIGHT=$((CURRENT_WEIGHT + STEP_WEIGHT)) done
Promote canary to stable
echo "Canary deployment successful! Promoting to stable..." kubectl set image deployment/${DEPLOYMENT} \ ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \ -n "$NAMESPACE"
kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m
echo "Canary deployment complete!"
- Metrics-Based Canary Analysis
canary-monitoring.yaml
apiVersion: v1 kind: ConfigMap metadata: name: canary-analysis namespace: production data: analyze.sh: | #!/bin/bash set -euo pipefail
CANARY_DEPLOYMENT="${1:-myapp-canary}"
STABLE_DEPLOYMENT="${2:-myapp-stable}"
THRESHOLD="${3:-0.05}" # 5% error rate threshold
NAMESPACE="production"
echo "Analyzing canary metrics..."
# Query Prometheus for metrics
CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${CANARY_DEPLOYMENT}'"}[5m])' | \
jq -r '.data.result[0].value[1]' || echo "0")
STABLE_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${STABLE_DEPLOYMENT}'"}[5m])' | \
jq -r '.data.result[0].value[1]' || echo "0")
CANARY_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \
jq -r '.data.result[0].value[1]' || echo "0")
STABLE_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \
jq -r '.data.result[0].value[1]' || echo "0")
echo "Canary Error Rate: $CANARY_ERROR_RATE"
echo "Stable Error Rate: $STABLE_ERROR_RATE"
echo "Canary P95 Latency: ${CANARY_LATENCY}s"
echo "Stable P95 Latency: ${STABLE_LATENCY}s"
# Check if canary is within acceptable range
if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
echo "FAIL: Canary error rate exceeds threshold"
exit 1
fi
if (( $(echo "$CANARY_LATENCY > $STABLE_LATENCY * 1.2" | bc -l) )); then
echo "FAIL: Canary latency is 20% higher than stable"
exit 1
fi
echo "PASS: Canary meets quality criteria"
exit 0
apiVersion: batch/v1 kind: Job metadata: name: canary-analysis namespace: production spec: template: spec: containers: - name: analyzer image: curlimages/curl:latest command: - sh - -c - | apk add --no-cache bc jq bash /scripts/analyze.sh volumeMounts: - name: scripts mountPath: /scripts volumes: - name: scripts configMap: name: canary-analysis restartPolicy: Never
- Automated Canary Promotion
!/bin/bash
promote-canary.sh - Automatically promote successful canary
set -euo pipefail
NAMESPACE="${1:-production}" DEPLOYMENT="${2:-myapp}" MAX_DURATION="${3:-600}" # Max 10 minutes for canary
start_time=$(date +%s)
echo "Starting automated canary promotion for $DEPLOYMENT"
while true; do current_time=$(date +%s) elapsed=$((current_time - start_time))
if [ $elapsed -gt $MAX_DURATION ]; then echo "ERROR: Canary exceeded max duration" exit 1 fi
# Check canary health CANARY_REPLICAS=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \ -o jsonpath='{.status.readyReplicas}') CANARY_DESIRED=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \ -o jsonpath='{.spec.replicas}')
if [ "$CANARY_REPLICAS" -ne "$CANARY_DESIRED" ]; then echo "Waiting for canary pods to be ready..." sleep 10 continue fi
# Run analysis if bash /scripts/analyze.sh "$DEPLOYMENT-canary" "$DEPLOYMENT-stable"; then echo "Canary analysis passed! Promoting to stable..."
# Merge canary into stable
kubectl set image deployment/${DEPLOYMENT} \
${DEPLOYMENT}=myrepo/${DEPLOYMENT}:$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
-o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2) \
-n "$NAMESPACE"
kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m
echo "Canary promoted successfully!"
exit 0
else echo "Canary analysis failed. Rolling back..." exit 1 fi done
Canary Best Practices ✅ DO Start with small traffic percentage (5-10%) Monitor key metrics continuously Increase gradually based on metrics Implement automatic rollback Run load tests on canary Test with real user traffic Set appropriate thresholds Document rollback procedures ❌ DON'T Rush through canary phases Ignore metrics Mix canary and stable versions Deploy to all users at once Skip rollback testing Use artificial load only Set unrealistic thresholds Deploy unvalidated changes Metrics to Monitor Error Rate: 5xx errors increase Latency: P95/P99 response time Throughput: Requests per second Resource Usage: CPU, memory Business Metrics: Conversion rate, revenue User Experience: Session duration, bounce rate Resources Flagger Canary Deployments Istio Traffic Management Kubernetes Canary Pattern