devops-flow

Description: Infrastructure as Code, CI/CD pipeline automation, and deployment management

Category: DevOps & Deployment

Complexity: High (multi-cloud + orchestration + automation)

Purpose

Automate infrastructure provisioning, CI/CD pipelines, and deployment processes based on SPEC documents and ADR decisions. Ensures consistent, repeatable, and secure deployments across environments.

Capabilities

1. Infrastructure as Code (IaC)
   - Terraform: Cloud-agnostic infrastructure provisioning
   - CloudFormation: AWS-native infrastructure
   - Ansible: Configuration management and provisioning
   - Pulumi: Modern IaC with standard programming languages
   - Kubernetes manifests: Container orchestration

2. CI/CD Pipeline Generation
   - GitHub Actions: Workflow automation
   - GitLab CI: Pipeline configuration
   - Jenkins: Pipeline as code
   - CircleCI: Cloud-native CI/CD
   - Azure DevOps: Microsoft ecosystem integration

3. Container Configuration
   - Dockerfile: Container image definition
   - Docker Compose: Multi-container applications
   - Kubernetes: Production orchestration
   - Helm charts: Kubernetes package management
   - Container registry: Image storage and versioning

4. Deployment Strategies
   - Blue-Green: Zero-downtime deployments
   - Canary: Gradual rollout with monitoring
   - Rolling: Sequential instance updates
   - Feature flags: Progressive feature enablement
   - Rollback procedures: Automated failure recovery

5. Environment Management
   - Environment separation: dev, staging, production
   - Configuration management: Environment-specific configs
   - Secret management: Vault, AWS Secrets Manager, etc.
   - Infrastructure versioning: State management
   - Cost optimization: Resource tagging and monitoring

6. Monitoring & Observability
   - Logging: Centralized log aggregation
   - Metrics: Performance and health monitoring
   - Alerting: Incident response automation
   - Tracing: Distributed request tracking
   - Dashboards: Real-time visualization

7. Security & Compliance
   - Security scanning: Container and infrastructure
   - Compliance checks: Policy enforcement
   - Access control: IAM and RBAC
   - Network security: Firewall rules, VPC configuration
   - Audit logging: Change tracking

8. Disaster Recovery
   - Backup automation: Data and configuration backups
   - Recovery procedures: Automated restoration
   - Failover: Multi-region redundancy
   - Data replication: Cross-region sync
   - RTO/RPO: Recovery objectives implementation

DevOps Workflow

graph TD
A[SPEC Document] --> B[Extract Requirements]
B --> C{Infrastructure Needed?}
C -->|Yes| D[Generate IaC Templates]
C -->|No| E[Generate CI/CD Pipeline]

D --> F[Terraform/CloudFormation]
F --> G[Validate Infrastructure Code]
G --> H{Validation Pass?}
H -->|No| I[Report Issues]
H -->|Yes| J[Generate Deployment Pipeline]

E --> J
J --> K[CI/CD Configuration]
K --> L[Add Build Stage]
L --> M[Add Test Stage]
M --> N[Add Security Scan]
N --> O[Add Deploy Stage]

O --> P[Environment Strategy]
P --> Q{Deployment Type}
Q -->|Blue-Green| R[Generate Blue-Green Config]
Q -->|Canary| S[Generate Canary Config]
Q -->|Rolling| T[Generate Rolling Config]

R --> U[Add Monitoring]
S --> U
T --> U

U --> V[Add Rollback Procedure]
V --> W[Generate Documentation]
W --> X[Review & Deploy]

I --> X
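
The branching in the workflow graph can be read as a small planning function. The following is a minimal Python sketch, not part of devops-flow itself: the function name `plan_workflow` and the step strings are illustrative assumptions that mirror the node labels in the graph.

```python
from typing import List

STRATEGIES = {"blue-green", "canary", "rolling"}


def plan_workflow(infra_needed: bool, validation_passed: bool, strategy: str) -> List[str]:
    """Return the ordered steps the workflow graph would visit (illustrative only)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown deployment strategy: {strategy}")

    steps = ["Extract Requirements"]
    if infra_needed:
        steps += ["Generate IaC Templates", "Validate Infrastructure Code"]
        if not validation_passed:
            # A failed validation goes straight to review (I --> X in the graph).
            return steps + ["Report Issues", "Review & Deploy"]
    else:
        steps.append("Generate CI/CD Pipeline")

    steps += [
        "Generate Deployment Pipeline",
        "Add Build Stage",
        "Add Test Stage",
        "Add Security Scan",
        "Add Deploy Stage",
        f"Generate {strategy.title()} Config",
        "Add Monitoring",
        "Add Rollback Procedure",
        "Generate Documentation",
        "Review & Deploy",
    ]
    return steps
```

Every path ends at "Review & Deploy", which matches the graph: both the failure branch and the success branch converge on node X.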

Usage Instructions

Generate Infrastructure from SPEC

devops-flow generate-infra \
  --spec specs/SPEC-API-V1.md \
  --cloud aws \
  --output infrastructure/

Generated Terraform structure:

infrastructure/
├── main.tf              # Main configuration
├── variables.tf         # Input variables
├── outputs.tf           # Output values
├── providers.tf         # Cloud provider config
├── modules/
│   ├── vpc/             # Network infrastructure
│   ├── compute/         # EC2, Lambda, etc.
│   ├── database/        # RDS, DynamoDB
│   └── storage/         # S3, EBS
└── environments/
    ├── dev.tfvars       # Development config
    ├── staging.tfvars   # Staging config
    └── prod.tfvars      # Production config
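
The per-environment `.tfvars` files are typically selected with Terraform's `-var-file` option. As a hedged sketch of how a wrapper might assemble that command for the layout shown here (the helper name `terraform_plan_command` is hypothetical, and whether devops-flow invokes Terraform this way is an assumption):

```python
from pathlib import PurePosixPath
from typing import List

# Environment names taken from the file names under environments/.
ENVIRONMENTS = ("dev", "staging", "prod")


def terraform_plan_command(environment: str, root: str = "infrastructure") -> List[str]:
    """Build the argv for `terraform plan` against one environment's variables."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    var_file = PurePosixPath("environments") / f"{environment}.tfvars"
    # -chdir is a global option and must precede the subcommand; the var file
    # path is then resolved relative to the infrastructure directory.
    return ["terraform", f"-chdir={root}", "plan", f"-var-file={var_file}"]
```

For example, `terraform_plan_command("staging")` yields the arguments for `terraform -chdir=infrastructure plan -var-file=environments/staging.tfvars`, which could be passed to `subprocess.run`.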

Generate CI/CD Pipeline

devops-flow generate-pipeline \
  --type github-actions \
  --language python \
  --deploy-strategy blue-green \
  --output .github/workflows/

Generated GitHub Actions workflow:

name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  PYTHON_VERSION: '3.11'
  AWS_REGION: us-east-1

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Install dependencies
        run: |
          pip install ruff mypy
      - name: Run linters
        run: |
          ruff check .
          mypy .

  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Install dependencies
        run: |
          pip install -r requirements.txt pytest pytest-cov
      - name: Run tests
        run: |
          pytest --cov=. --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  security:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Security scan
        run: |
          pip install bandit
          bandit -r . -f json -o security-report.json
      - name: Upload security report
        # Upload the report even when bandit exits non-zero
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: security-report
          path: security-report.json

  build:
    needs: [lint, test, security]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        # Tag with the registry name so the push step finds the image
        run: |
          docker build -t registry.example.com/app:${{ github.sha }} .
      - name: Push to registry
        # Assumes the runner is already authenticated to the registry
        run: |
          docker push registry.example.com/app:${{ github.sha }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Deploy to staging
        # Assumes AWS credentials are configured for the staging environment
        run: |
          aws ecs update-service \
            --cluster staging-cluster \
            --service app-service \
            --force-new-deployment

  deploy-production:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      # The deployment scripts live in the repository, so check it out first
      - uses: actions/checkout@v4
      - name: Deploy blue-green
        run: |
          # Deploy to green environment
          ./scripts/deploy-green.sh
          # Run smoke tests
          ./scripts/smoke-tests.sh
          # Switch traffic to green
          ./scripts/switch-traffic.sh
          # Keep blue for rollback
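
The `needs:` keys in the workflow define a dependency graph, and the `if:` conditions pick one deploy job per branch. The sketch below, written against Python's standard `graphlib` module, shows one way to reason about the resulting execution order. The `NEEDS` mapping is transcribed from the workflow; the helper names are illustrative assumptions and say nothing about how GitHub Actions schedules jobs internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional

# Each job maps to the jobs listed under its `needs:` key.
NEEDS: Dict[str, List[str]] = {
    "lint": [],
    "test": ["lint"],
    "security": ["test"],
    "build": ["lint", "test", "security"],
    "deploy-staging": ["build"],
    "deploy-production": ["build"],
}


def execution_order(needs: Dict[str, List[str]]) -> List[str]:
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(needs).static_order())


def deploy_job_for(ref: str) -> Optional[str]:
    """Mirror the workflow's `if: github.ref == ...` conditions."""
    return {
        "refs/heads/develop": "deploy-staging",
        "refs/heads/main": "deploy-production",
    }.get(ref)
```

Because the first four jobs form a chain, any valid order starts with lint, test, security, build. A pull request ref matches neither condition, so `deploy_job_for` returns `None` and no deploy job runs.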

Generate Kubernetes Configuration

devops-flow generate-k8s \
  --spec specs/SPEC-API-V1.md \
  --replicas 3 \
  --output k8s/

Generated Kubernetes manifests:

# k8s/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  labels:
    app: api
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        version: v1
    spec:
      containers:
        - name: api
          image: registry.example.com/api:latest
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: database-url
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5

# k8s/service.yaml

apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer

# k8s/hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
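
To make the HPA settings concrete, the following Python sketch applies the replica formula documented for the Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. It is a simplification under stated assumptions: the 0.1 tolerance is the controller's documented default, and stabilization windows, pod readiness, and missing metrics are ignored. The function name is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 70.0,  # averageUtilization in hpa.yaml
    min_replicas: int = 3,             # minReplicas in hpa.yaml
    max_replicas: int = 10,            # maxReplicas in hpa.yaml
    tolerance: float = 0.1,            # assumed controller default
) -> int:
    """Estimate the replica count the autoscaler would request."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With three replicas averaging 90% CPU, the ratio is about 1.29, so the sketch asks for four replicas; at 72% the ratio falls inside the tolerance band and the count stays at three.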

Generate Deployment Scripts from REQ

devops-flow generate-deployment-scripts \
  --req docs/07_REQ/REQ-NN.md \
  --spec docs/09_SPEC/SPEC-NN.yaml \
  --output scripts/

Generated shell scripts structure:

scripts/
├── setup.sh          # Initial environment setup
├── install.sh        # Application installation
├── deploy.sh         # Main deployment orchestration
├── rollback.sh       # Rollback to previous version
├── health-check.sh   # Health verification
└── cleanup.sh        # Cleanup old versions

Script Generation Logic:

1. Parse REQ Section 9.5.3 for script requirements
2. Parse the SPEC deployment section for technical details
3. Apply script standards (Bash 4.0+, error handling, logging)
4. Reference the cloud provider from REQ @adr tags
5. Use environment-specific configurations from REQ Section 9.5.2

Example generated script (setup.sh):

#!/bin/bash

set -euo pipefail

# Setup environment for deployment

mkdir -p logs
LOG_FILE="logs/deployment_$(date +%Y%m%d_%H%M%S).log"

log() {
    # The format string is quoted so date receives it as a single argument
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

log "Starting environment setup..."

# Install dependencies

if [ ! -f .tool-versions ]; then
    log "Installing Python dependencies..."
    pip install -r requirements.txt
fi

# Configure environment variables

if [ -f .env.deployment ]; then
    log "Loading deployment environment variables..."
    # Export every KEY=VALUE assignment in the file, including quoted values
    set -a
    source .env.deployment
    set +a
fi

log "Environment setup complete"
exit 0

Example generated script (deploy.sh):

#!/bin/bash

set -euo pipefail

# Main deployment orchestration script

mkdir -p logs
LOG_FILE="logs/deployment_$(date +%Y%m%d_%H%M%S).log"
ENVIRONMENT="${1:-staging}"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

# Step 1: Setup

log "Running setup..."
./scripts/setup.sh

# Step 2: Install

log "Installing application..."
./scripts/install.sh --env "$ENVIRONMENT"

# Step 3: Deploy

log "Deploying application..."
if [ "$ENVIRONMENT" = "production" ]; then
    ./scripts/deploy-prod.sh
else
    ./scripts/deploy-staging.sh
fi

# Step 4: Health check

log "Running health check..."
# Test the command directly: under `set -e`, a separate `$?` check would never
# reach the rollback branch. health-check.sh takes the URL as its first
# argument; HEALTH_URL is an assumed optional per-environment override.
if ./scripts/health-check.sh ${HEALTH_URL:+"$HEALTH_URL"}; then
    log "Deployment successful"
else
    log "Deployment failed, initiating rollback..."
    ./scripts/rollback.sh --env "$ENVIRONMENT"
    exit 1
fi

Example generated script (health-check.sh):

#!/bin/bash

set -euo pipefail

# Health verification script

HEALTH_URL="${1:-http://localhost:8000/health/live}"
TIMEOUT=60
RETRIES=3

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

log "Starting health check..."

for i in $(seq 1 "$RETRIES"); do
    log "Attempt $i of $RETRIES..."
    # `|| true` keeps `set -e` from aborting when curl cannot connect;
    # curl still reports 000 as the status code in that case.
    RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$TIMEOUT" "$HEALTH_URL" || true)

    if [ "$RESPONSE" = "200" ]; then
        log "Health check passed"
        exit 0
    fi

    log "Health check failed, sleeping before retry..."
    sleep 5
done

log "Health check failed after $RETRIES attempts"
exit 1
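
The retry loop in health-check.sh can also be expressed with the Python standard library, which makes the logic easier to unit test. This is a sketch under assumptions, not generated output: `http_status` and `wait_until_healthy` are hypothetical names, the defaults copy the script's values, and unlike the script it does not sleep after the final failed attempt.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 60.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code.
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_healthy(
    url: str,
    probe: Callable[[str], int] = http_status,
    retries: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe url up to `retries` times; succeed on the first 200 response."""
    for attempt in range(1, retries + 1):
        if probe(url) == 200:
            return True
        if attempt < retries:
            sleep(delay)
    return False
```

Injecting `probe` and `sleep` keeps the function testable without a running service, for example `wait_until_healthy(url, probe=lambda _: 200)`.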

Generate Ansible Playbooks from REQ

devops-flow generate-ansible-playbooks \
  --req docs/07_REQ/REQ-NN.md \
  --spec docs/09_SPEC/SPEC-NN.yaml \
  --output ansible/

Generated Ansible playbooks structure:

ansible/
├── provision_infra.yml        # Infrastructure provisioning
├── configure_instances.yml    # Instance configuration
├── deploy_app.yml             # Application deployment
├── configure_monitoring.yml   # Monitoring setup
├── configure_security.yml     # Security hardening
└── backup_restore.yml         # Backup/restore procedures

Playbook Generation Logic:

1. Parse REQ Section 9.5.4 for playbook requirements
2. Parse REQ Section 9.5.1 for infrastructure configuration
3. Apply Ansible standards (2.9+, modular roles, idempotency)
4. Reference the cloud provider from REQ @adr tags
5. Use environment-specific variables from REQ Section 9.5.2

Example generated playbook (provision_infra.yml):

---
# Requires the amazon.aws collection.
- name: Provision Infrastructure
  hosts: localhost
  gather_facts: false
  vars_files:
    - "environments/{{ target_env }}.yml"

  tasks:
    - name: Create VPC
      amazon.aws.ec2_vpc_net:
        name: "{{ vpc_name }}"
        cidr_block: "{{ vpc_cidr }}"
        region: "{{ aws_region }}"
        tags:
          Project: "{{ project_name }}"
          Environment: "{{ target_env }}"
          ManagedBy: "Ansible"
      # Registered so later tasks can reference the VPC ID
      register: vpc_result

    - name: Create security groups
      amazon.aws.ec2_security_group:
        name: "{{ security_group_name }}"
        description: "Security group for {{ application_name }}"
        vpc_id: "{{ vpc_result.vpc.id }}"
        rules:
          - proto: tcp
            from_port: 80
            to_port: 80
            cidr_ip: 0.0.0.0/0
          - proto: tcp
            from_port: 443
            to_port: 443
            cidr_ip: 0.0.0.0/0
        region: "{{ aws_region }}"
        tags:
          Project: "{{ project_name }}"
          Environment: "{{ target_env }}"
      # Registered so the RDS task can reference the group ID
      register: security_group

    - name: Create RDS instance
      amazon.aws.rds_instance:
        # db_instance_identifier is required by the module; the variable name is an assumption
        db_instance_identifier: "{{ db_instance_identifier }}"
        db_name: "{{ db_name }}"
        engine: postgres
        engine_version: "{{ db_version }}"
        db_instance_class: "{{ db_instance_class }}"
        allocated_storage: "{{ db_storage_gb }}"
        master_username: "{{ db_username }}"
        master_user_password: "{{ db_password }}"
        vpc_security_group_ids:
          - "{{ security_group.group_id }}"
        db_subnet_group_name: "{{ db_subnet_group }}"
        backup_retention_period: "{{ backup_retention_days }}"
        multi_az: true
        region: "{{ aws_region }}"
        tags:
          Project: "{{ project_name }}"
          Environment: "{{ target_env }}"
          ManagedBy: "Ansible"

Example generated playbook (deploy_app.yml):

---
- name: Deploy Application
  hosts: app_servers
  gather_facts: yes
  become: yes
  vars_files:
    - "environments/{{ target_env }}.yml"

  tasks:
    - name: Ensure application directory exists
      file:
        path: "{{ app_directory }}"
        state: directory
        mode: '0755'
        owner: "{{ app_user }}"
        group: "{{ app_group }}"

    - name: Copy application code
      synchronize:
        src: "{{ app_source_directory }}/"
        dest: "{{ app_directory }}/"
        delete: yes
        recursive: yes

    - name: Install Python dependencies
      pip:
        requirements: "{{ app_directory }}/requirements.txt"
        virtualenv: "{{ app_venv }}"
        state: present

    - name: Configure application
      template:
        src: "templates/{{ target_env }}_config.yml"
        dest: "{{ app_directory }}/config.yml"
        owner: "{{ app_user }}"
        group: "{{ app_group }}"
        mode: '0640'

    - name: Restart application service
      systemd:
        name: "{{ app_service_name }}"
        state: restarted
        daemon_reload: yes
      notify: Run Health Check

    - name: Wait for application to be ready
      wait_for:
        port: 8000
        host: "{{ inventory_hostname }}"
        timeout: 300

  handlers:
    - name: Run Health Check
      uri:
        url: "http://localhost:8000/health/ready"
        method: GET
        status_code: 200
      register: health_check

Infrastructure Templates

AWS Infrastructure (Terraform)

# main.tf
# random_password.db_password and aws_db_subnet_group.main are referenced below
# and are assumed to be declared elsewhere in the configuration.

terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "terraform-state-bucket"
    key    = "infrastructure/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = var.project_name
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}

# VPC Module
module "vpc" {
  source = "./modules/vpc"

  vpc_cidr             = var.vpc_cidr
  availability_zones   = var.availability_zones
  public_subnet_cidrs  = var.public_subnet_cidrs
  private_subnet_cidrs = var.private_subnet_cidrs
}

# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-${var.environment}-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

# Application Load Balancer
resource "aws_lb" "main" {
  name               = "${var.project_name}-${var.environment}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = module.vpc.public_subnet_ids

  enable_deletion_protection = var.environment == "production"
}

# RDS Database
resource "aws_db_instance" "main" {
  identifier     = "${var.project_name}-${var.environment}-db"
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = var.db_instance_class

  allocated_storage     = var.db_allocated_storage
  max_allocated_storage = var.db_max_allocated_storage
  storage_encrypted     = true

  db_name  = var.db_name
  username = var.db_username
  password = random_password.db_password.result

  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = var.environment == "production" ? 30 : 7
  skip_final_snapshot     = var.environment != "production"

  tags = {
    Name = "${var.project_name}-${var.environment}-db"
  }
}
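
Several settings in main.tf are derived from `var.environment` with conditional expressions. The short Python sketch below restates those three conditions so the production and non-production values can be compared side by side. The function name and dictionary keys are illustrative; Terraform evaluates the original expressions itself.

```python
from typing import Dict, Union


def environment_settings(environment: str) -> Dict[str, Union[int, bool]]:
    """Mirror the environment-dependent expressions used in main.tf."""
    is_production = environment == "production"
    return {
        # var.environment == "production" ? 30 : 7
        "backup_retention_period": 30 if is_production else 7,
        # var.environment != "production"
        "skip_final_snapshot": not is_production,
        # var.environment == "production" (set on the load balancer)
        "enable_deletion_protection": is_production,
    }
```

Only the exact string "production" enables the stricter settings, so a value such as "prod" would silently fall back to the non-production defaults.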

Docker Configuration

# Dockerfile

FROM python:3.11-slim AS base

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create non-root user
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
USER appuser

# Expose port
EXPOSE 8000

# Health check (standard library only; urlopen raises on connection errors and on 4xx/5xx responses)
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)"

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# Production stage, built on top of base
FROM base AS production
ENV ENVIRONMENT=production
# Install as root so gunicorn lands on the system PATH, then drop privileges again
USER root
RUN pip install --no-cache-dir gunicorn
USER appuser
CMD ["gunicorn", "main:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000"]

Docker Compose (Local Development)

# docker-compose.yml

version: '3.8'

services:
  api:
    build:
      context: .
      dockerfile: Dockerfile
      target: base
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://user:password@db:5432/appdb
      - REDIS_URL=redis://redis:6379/0
      - ENVIRONMENT=development
    volumes:
      - .:/app
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload

  db:
    image: postgres:15
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: appdb
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - api

volumes:
  postgres_data:
  redis_data:

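Deployment Strategies

Blue-Green Deployment

#!/bin/bash

# deploy-blue-green.sh
# Assumes get_active_environment, deploy_to_environment, run_smoke_tests,
# switch_load_balancer, and monitor_environment are defined elsewhere
# (for example, sourced from a shared helper library).

set -e

BLUE_ENV="production-blue"
GREEN_ENV="production-green"
CURRENT_ENV=$(get_active_environment)

if [ "$CURRENT_ENV" = "$BLUE_ENV" ]; then
    TARGET_ENV="$GREEN_ENV"
    OLD_ENV="$BLUE_ENV"
else
    TARGET_ENV="$BLUE_ENV"
    OLD_ENV="$GREEN_ENV"
fi

echo "Deploying to $TARGET_ENV (current: $OLD_ENV)"

# Deploy to target environment
deploy_to_environment "$TARGET_ENV"

# Run smoke tests
if ! run_smoke_tests "$TARGET_ENV"; then
    echo "Smoke tests failed, aborting before any traffic is switched"
    exit 1
fi

# Switch traffic
switch_load_balancer "$TARGET_ENV"

# Monitor for 5 minutes; switch back if monitoring reports a failure
# (assumes monitor_environment exits non-zero on failure)
if ! monitor_environment "$TARGET_ENV" 300; then
    echo "Monitoring failed, switching traffic back to $OLD_ENV"
    switch_load_balancer "$OLD_ENV"
    exit 1
fi

# If all good, keep old environment for quick rollback
echo "Deployment successful. Old environment $OLD_ENV kept for rollback."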
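
The control flow of the blue-green script is small enough to model and test in isolation. The Python sketch below assumes the same five operations as the script but takes them as injected callables. All names are hypothetical, and switching traffic back when monitoring fails is an assumption about the intended rollback behavior.

```python
from typing import Callable, Tuple

BLUE = "production-blue"
GREEN = "production-green"


def pick_target(current: str) -> Tuple[str, str]:
    """Return (target, old): deploy to whichever environment is idle."""
    return (GREEN, BLUE) if current == BLUE else (BLUE, GREEN)


def blue_green_deploy(
    current: str,
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
    switch_traffic: Callable[[str], None],
    monitor: Callable[[str], bool],
) -> str:
    """Run one blue-green cycle and return the environment serving traffic."""
    target, old = pick_target(current)
    deploy(target)
    if not smoke_test(target):
        # Traffic was never moved, so the old environment keeps serving.
        return old
    switch_traffic(target)
    if not monitor(target):
        # Roll back by pointing traffic at the untouched old environment.
        switch_traffic(old)
        return old
    return target
```

Because the old environment is never modified during the cycle, rollback is a single traffic switch rather than a redeploy.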

Canary Deployment

# k8s/canary-deployment.yaml

apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8000
---
# Stable version (90% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: api
      version: stable
  template:
    metadata:
      labels:
        app: api
        version: stable
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.0.0
---
# Canary version (10% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
      version: canary
  template:
    metadata:
      labels:
        app: api
        version: canary
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.1.0
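
The 90/10 split in this manifest comes from replica counts: the Service selects every pod labelled `app: api`, so with nine stable pods and one canary pod roughly one request in ten reaches the canary. The Python sketch below works through that arithmetic under the assumption that traffic is spread evenly across pods, which is only approximately true in practice. The function names are hypothetical.

```python
from typing import Tuple


def canary_split(total_replicas: int, canary_percent: float) -> Tuple[int, int]:
    """Return (stable, canary) replica counts approximating canary_percent."""
    if total_replicas < 2:
        raise ValueError("need at least two replicas to run a canary")
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 0 and 100")
    canary = round(total_replicas * canary_percent / 100)
    # Keep at least one pod on each side of the split.
    canary = max(1, min(canary, total_replicas - 1))
    return total_replicas - canary, canary


def canary_traffic_share(stable: int, canary: int) -> float:
    """Approximate fraction of requests served by canary pods."""
    return canary / (stable + canary)
```

For the manifest above, `canary_split(10, 10)` gives nine stable replicas and one canary replica. Finer-grained splits than the replica ratio allows would need weighted routing at an ingress or service mesh layer.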

Monitoring & Observability

Prometheus Configuration

# prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/alerts/*.yml

Alert Rules

# alerts/api-alerts.yml

groups:
  - name: api-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        # Fraction of requests answered with a 5xx status, per instance
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "{{ $labels.instance }} has error rate {{ $value }}"

      - alert: HighLatency
        # histogram_quantile needs per-second bucket rates aggregated by le
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"

      - alert: PodDown
        expr: up{job="api-service"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod is down"
          description: "{{ $labels.instance }} has been down for 2 minutes"
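
The HighLatency rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below is a simplified model of that estimate for a single histogram, written to show why the result depends on the bucket boundaries. It is an illustration and not the Prometheus implementation, and the sample data is made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls past the last finite bucket: report its bound.
                return lower_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request-duration buckets (seconds, cumulative observations).
SAMPLE = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
```

With `SAMPLE`, the 95th percentile falls in the 0.5 to 1.0 second bucket and is interpolated to about 0.81 seconds, which would stay below the rule's 1 second threshold.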

Security Configuration

Network Security

# security-groups.tf

# ALB Security Group
resource "aws_security_group" "alb" {
  name_prefix = "${var.project_name}-alb-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS from internet"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Application Security Group
resource "aws_security_group" "app" {
  name_prefix = "${var.project_name}-app-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port       = 8000
    to_port         = 8000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "From ALB"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Database Security Group
resource "aws_security_group" "db" {
  name_prefix = "${var.project_name}-db-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
    description     = "From application"
  }
}
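
The three security groups form a chain: the internet can reach only the ALB on 443, the ALB can reach only the application on 8000, and the application can reach only the database on 5432. The Python sketch below models those ingress rules as plain data so the layering can be checked. The tier names and helper functions are illustrative and do not query AWS.

```python
from typing import Dict, List, Set, Tuple

# Target tier -> list of (allowed source, port), transcribed from the ingress blocks.
INGRESS: Dict[str, List[Tuple[str, int]]] = {
    "alb": [("internet", 443)],
    "app": [("alb", 8000)],
    "db": [("app", 5432)],
}


def is_allowed(source: str, target: str, port: int) -> bool:
    """Return True if the ingress rules admit source -> target on port."""
    return (source, port) in INGRESS.get(target, [])


def directly_exposed_to(source: str) -> Set[str]:
    """Return the tiers that accept traffic straight from source."""
    return {
        target
        for target, rules in INGRESS.items()
        if any(rule_source == source for rule_source, _ in rules)
    }
```

In this model only the ALB is directly exposed to the internet, and the database accepts connections from the application tier alone.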

Tool Access

Required tools:

- Read: Read SPEC documents and ADRs
- Write: Generate infrastructure and pipeline files
- Bash: Execute Terraform, Docker, kubectl commands
- Grep: Search for configuration patterns

Required software:

- Terraform / OpenTofu
- Docker / Podman
- kubectl / helm
- aws-cli / gcloud / az-cli
- Ansible (optional)

Integration Points

With doc-flow
- Extract infrastructure requirements from SPEC documents
- Validate ADR compliance in infrastructure code
- Generate deployment documentation

With security-audit
- Security scanning of infrastructure code
- Vulnerability assessment of containers
- Compliance validation

With test-automation
- Integration with CI/CD for automated testing
- Deployment smoke tests
- Infrastructure validation tests

With analytics-flow
- Deployment metrics and trends
- Infrastructure cost tracking
- Performance monitoring integration

Best Practices

1. Infrastructure as Code: All infrastructure versioned in Git
2. Immutable infrastructure: Replace, don't modify
3. Environment parity: Dev/staging/prod consistency
4. Secret management: Never commit secrets
5. Monitoring from day one: Observability built-in
6. Automated rollbacks: Fast failure recovery
7. Cost optimization: Tag resources, monitor spending
8. Security by default: Least privilege, encryption
9. Documentation: Runbooks for common operations
10. Disaster recovery: Regular backup testing

Success Criteria

- Zero manual infrastructure provisioning
- Deployment time < 15 minutes
- Rollback time < 5 minutes
- Zero-downtime deployments
- Infrastructure drift detection automated
- Security compliance 100%
- Cost variance < 10% from budget

Notes

- Generated configurations require review before production use
- Cloud provider credentials must be configured separately
- State management (Terraform) requires backend configuration
- Multi-region deployments require additional configuration
- Cost estimation requires an external tool or service that reads the terraform plan output; terraform plan itself does not report costs
