Cloud DevOps Expert

AWS Patterns

Core Services:

- Compute: EC2, Lambda (serverless), ECS/EKS (containers), Fargate
- Storage: S3 (object), EBS (block), EFS (file system)
- Database: RDS (relational), DynamoDB (NoSQL), Aurora (MySQL/PostgreSQL)
- Networking: VPC, ALB/NLB, CloudFront (CDN), Route 53 (DNS)
- Monitoring: CloudWatch (metrics, logs, alarms)

Best Practices:

- Use AWS Organizations for multi-account management
- Implement least privilege with IAM roles and policies
- Enable CloudTrail for audit logging
- Use AWS Config for compliance and resource tracking
- Tag all resources for cost allocation and management

GCP (Google Cloud Platform) Patterns

Core Services:

- Compute: Compute Engine (VMs), Cloud Functions (serverless), GKE (Kubernetes)
- Storage: Cloud Storage (object), Persistent Disk (block)
- Database: Cloud SQL, Cloud Spanner, Firestore
- Networking: VPC, Cloud Load Balancing, Cloud CDN
- Monitoring: Cloud Monitoring, Cloud Logging

Best Practices:

- Use Google Cloud Identity for centralized identity management
- Implement VPC Service Controls for security perimeters
- Enable Cloud Audit Logs for compliance
- Use labels for resource organization and billing

Azure Patterns

Core Services:

- Compute: Virtual Machines, Azure Functions, AKS (Kubernetes), Container Instances
- Storage: Blob Storage, Azure Files, Managed Disks
- Database: Azure SQL, Cosmos DB (NoSQL), Azure Database for PostgreSQL/MySQL
- Networking: Virtual Network, Application Gateway, Front Door (CDN)
- Monitoring: Azure Monitor, Log Analytics

Best Practices:

- Use Microsoft Entra ID (formerly Azure AD) for identity and access management
- Implement Azure Policy for governance
- Enable Microsoft Defender for Cloud (formerly Azure Security Center) for threat protection
- Use resource groups for logical organization

Terraform Best Practices

Project Structure:

    terraform/
    ├── environments/
    │   ├── dev/
    │   │   ├── main.tf
    │   │   ├── variables.tf
    │   │   └── terraform.tfvars
    │   ├── staging/
    │   └── prod/
    ├── modules/
    │   ├── vpc/
    │   ├── eks/
    │   └── rds/
    └── global/
        └── backend.tf

Code Organization:

- Use modules for reusable infrastructure components
- Separate environments with workspaces or directories
- Store state remotely (S3 + DynamoDB for AWS, GCS for GCP, Azure Blob for Azure)
- Use variables for environment-specific values
- Never commit secrets (use AWS Secrets Manager, HashiCorp Vault, etc.)

Terraform Workflow:

    # Initialize
    terraform init

    # Plan (review changes)
    terraform plan -out=tfplan

    # Apply (execute changes)
    terraform apply tfplan

    # Destroy (when needed)
    terraform destroy
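When the plan and apply steps above run unattended in CI, it helps to know whether the plan contains changes at all. The following is a minimal sketch, assuming Terraform is on the PATH and relying on the `-detailed-exitcode` flag of `terraform plan` (exit code 0 means no changes, 1 means an error, 2 means changes are pending). The helper names are hypothetical and not part of Terraform.

```python
import subprocess

# Meaning of the exit codes returned by `terraform plan -detailed-exitcode`.
PLAN_EXIT_CODES = {0: "no-changes", 1: "error", 2: "changes-pending"}


def plan_command(plan_file: str = "tfplan") -> list[str]:
    """Build the argv for a non-interactive plan that saves its result."""
    return ["terraform", "plan", "-input=false", "-detailed-exitcode", f"-out={plan_file}"]


def apply_command(plan_file: str = "tfplan") -> list[str]:
    """Build the argv that applies exactly the saved plan file."""
    return ["terraform", "apply", "-input=false", plan_file]


def interpret_plan_exit(code: int) -> str:
    """Map a plan exit code to a label; unknown codes are treated as errors."""
    return PLAN_EXIT_CODES.get(code, "error")


def run_plan(plan_file: str = "tfplan") -> str:
    """Run the plan (needs Terraform installed) and report the outcome."""
    result = subprocess.run(plan_command(plan_file))
    return interpret_plan_exit(result.returncode)
```

A pipeline could call `run_plan()` and only proceed to `apply_command()` when the result is `"changes-pending"`, skipping the apply stage entirely when nothing changed.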

Best Practices:

- Use terraform fmt for consistent formatting
- Use terraform validate to check syntax and internal consistency
- Implement state locking to prevent concurrent modifications
- Use terraform import for existing resources
- Pin versions: required_version = "~> 1.5" constrains the Terraform CLI itself; pin each provider with a version constraint in required_providers
- Use data sources for referencing existing resources
- Use depends_on only for dependencies that Terraform cannot infer from references

Kubernetes Deployment Patterns

Deployment Strategies:

- Rolling Update: Gradual replacement of pods (default)
- Blue/Green: Run two identical environments, switch traffic
- Canary: Gradual traffic shift to new version
- Recreate: Terminate old pods before creating new ones (downtime)
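The canary strategy hinges on sending a stable, adjustable share of traffic to the new version. In practice this is usually done by a load balancer or service mesh; the sketch below only illustrates the idea with hash-based bucketing, so the same key (for example a user ID) always lands on the same version while the percentage is raised. The function names and the 0 to 99 bucket scheme are assumptions made for illustration.

```python
import hashlib


def canary_bucket(request_key: str) -> int:
    """Map a key (for example a user ID) to a stable bucket in 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(request_key: str, canary_percent: int) -> str:
    """Route a request to 'canary' or 'stable' for the given rollout percentage."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Keys whose bucket falls below the threshold go to the new version.
    return "canary" if canary_bucket(request_key) < canary_percent else "stable"
```

Because the bucket depends only on the key, raising the percentage from 10 to 50 keeps every user who was already on the canary there and only adds new ones.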

Resource Management:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
            - name: myapp
              image: myapp:v1.0.0
              resources:
                requests:
                  memory: '256Mi'
                  cpu: '250m'
                limits:
                  memory: '512Mi'
                  cpu: '500m'
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8080
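To see what the Deployment above asks of the cluster, the per-container requests can be multiplied by the replica count. This is a rough sketch that handles only a subset of the Kubernetes quantity syntax (the `m` suffix for CPU and `Ki`/`Mi`/`Gi` for memory); the helper names are hypothetical.

```python
# Binary memory suffixes supported by this sketch.
MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores: '250m' -> 0.25, '2' -> 2.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity to bytes: '256Mi' -> 268435456."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # a bare number is already in bytes


def total_resources(replicas: int, cpu: str, memory: str) -> tuple[float, int]:
    """Return (total cores, total bytes) for all replicas combined."""
    return replicas * parse_cpu(cpu), replicas * parse_memory(memory)
```

With the values from the manifest, `total_resources(3, "250m", "256Mi")` gives 0.75 cores and 768 MiB of requested capacity, which is what the scheduler has to find across the nodes.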

Best Practices:

- Use namespaces for environment/team isolation
- Implement RBAC for access control
- Define resource requests and limits
- Use liveness and readiness probes
- Use ConfigMaps and Secrets for configuration
- Enforce Pod Security Standards (PSS) with Pod Security Admission; Pod Security Policies (PSP) were removed in Kubernetes 1.25
- Use Horizontal Pod Autoscaler (HPA) for auto-scaling

CI/CD Pipeline Patterns

GitHub Actions Example:

    name: CI/CD Pipeline

    on:
      push:
        branches: [main, develop]
      pull_request:
        branches: [main]

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          # Install dependencies before running the test suite
          - name: Install dependencies
            run: npm ci
          - name: Run tests
            run: npm test

      build:
        needs: test
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Build Docker image
            run: docker build -t myapp:${{ github.sha }} .
          # Assumes an earlier step has logged in to the target registry
          - name: Push to registry
            run: docker push myapp:${{ github.sha }}

      deploy:
        needs: build
        runs-on: ubuntu-latest
        # Deploy only for pushes to main
        if: github.ref == 'refs/heads/main'
        steps:
          # Assumes kubectl is already configured with cluster credentials
          - name: Deploy to Kubernetes
            run: kubectl set image deployment/myapp myapp=myapp:${{ github.sha }}

Best Practices:

- Implement automated testing (unit, integration, e2e)
- Use matrix builds for multi-platform testing
- Cache dependencies to speed up builds
- Use secrets management for sensitive data
- Implement deployment gates and approvals for production
- Use semantic versioning for releases
- Implement rollback strategies

Infrastructure as Code (IaC) Principles

Version Control:

- Store all infrastructure code in Git
- Use pull requests for code review
- Implement branch protection rules
- Tag releases for production deployments
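Release tags are easier to automate when they follow semantic versioning, as recommended for releases earlier. The sketch below computes the next tag from the current one; it assumes plain `MAJOR.MINOR.PATCH` tags with an optional `v` prefix and does not handle pre-release or build metadata. The function name is hypothetical.

```python
import re

# Plain MAJOR.MINOR.PATCH with an optional leading "v".
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def bump(version: str, part: str) -> str:
    """Return the next version after bumping 'major', 'minor', or 'patch'."""
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a plain semantic version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    prefix = "v" if version.startswith("v") else ""
    if part == "major":
        return f"{prefix}{major + 1}.0.0"  # breaking change resets minor and patch
    if part == "minor":
        return f"{prefix}{major}.{minor + 1}.0"  # new feature resets patch
    if part == "patch":
        return f"{prefix}{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

The result can then be passed to `git tag` by whatever release job creates the production tag.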

Testing:

- Use terraform plan to preview changes
- Implement policy-as-code with Sentinel, OPA, or Checkov
- Use tflint for Terraform linting
- Test modules in isolation
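To make the policy-as-code idea concrete, here is a small stand-in written in plain Python rather than Sentinel, OPA, or Checkov. It inspects the JSON produced by `terraform show -json tfplan` and flags every resource whose planned actions include a delete. The `resource_changes`, `address`, and `change.actions` fields come from Terraform's JSON plan representation; the function names and the "no deletes" rule are illustrative assumptions.

```python
import json


def load_plan(path: str) -> dict:
    """Load a plan exported with `terraform show -json tfplan > plan.json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def destructive_changes(plan: dict) -> list[str]:
    """Return the addresses of resources that the plan would delete or replace."""
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create".
        if "delete" in actions:
            flagged.append(resource["address"])
    return flagged
```

A CI job could fail the build when `destructive_changes(load_plan("plan.json"))` is non-empty, forcing a manual review before anything is destroyed.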

Documentation:

- Document module inputs and outputs
- Maintain README files for each module
- Use terraform-docs to auto-generate documentation

Monitoring and Observability

The Three Pillars:

Metrics (Prometheus + Grafana)

- Use Prometheus for metrics collection
- Define SLIs (Service Level Indicators)
- Set up alerting rules
- Create Grafana dashboards for visualization
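An SLI is ultimately a ratio, and an alert is often tied to how much of the error budget is left. The sketch below works through that arithmetic for a request-based availability SLI. The function names and the sample numbers are assumptions for illustration; in a Prometheus setup the request counts would come from counters rather than function arguments.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; an idle service counts as healthy."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(slo: float, good_requests: int, total_requests: int) -> float:
    """Share of the error budget still unspent (negative when overspent).

    slo is a target such as 0.999, meaning at most 0.1% of requests may fail.
    """
    allowed_failures = (1 - slo) * total_requests
    failed = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if failed == 0 else 0.0
    return 1 - failed / allowed_failures
```

For example, with a 99.9% target and 1,000,000 requests, about 1,000 failures are allowed; 250 failures leave roughly 75% of the budget, while 2,000 failures push the result below zero.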

Logs (ELK Stack, CloudWatch, Cloud Logging)

- Centralize logs from all services
- Implement structured logging (JSON format)
- Use log aggregation and parsing
- Set up log-based alerts
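Structured logging means each log line is a machine-parseable record rather than free text. A minimal sketch using only the Python standard library follows; the field names (`timestamp`, `level`, `logger`, `message`) and the optional context fields `request_id` and `service` are assumptions, and an aggregation pipeline may expect a different schema.

```python
import json
import logging

# Optional context fields copied from the `extra=` argument when present.
CONTEXT_FIELDS = ("request_id", "service")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False  # avoid duplicate plain-text output from the root logger
    return logger
```

A call such as `build_logger("checkout").info("order placed", extra={"request_id": "abc123"})` then emits one JSON line that a log aggregator can index by `request_id`.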

Traces (Jaeger, Zipkin, X-Ray)

- Implement distributed tracing
- Track request flow across microservices
- Identify performance bottlenecks
- Correlate traces with logs and metrics
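Tracking a request across services depends on every hop forwarding the same trace identifier. The sketch below shows that mechanism using the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`). Choosing this header is an assumption; tracers such as Jaeger, Zipkin, or X-Ray may use their own propagation formats, and a real service would normally rely on a tracing SDK instead of hand-written helpers like these.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its parts, rejecting malformed values."""
    match = TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"invalid traceparent header: {header!r}")
    trace_id, span_id, flags = match.groups()
    return {"trace_id": trace_id, "span_id": span_id, "flags": flags}


def child_traceparent(header: str) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    parent = parse_traceparent(header)
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"
```

Including the `trace_id` in each structured log entry is one way to correlate traces with logs.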

Observability Best Practices:

- Define SLOs (Service Level Objectives) and SLAs
- Implement health check endpoints
- Use APM (Application Performance Monitoring) tools
- Set up on-call rotations and runbooks
- Practice incident response procedures

Container Orchestration (Kubernetes)

Helm Charts:

- Use Helm for package management
- Create reusable chart templates
- Use values files for environment-specific configuration
- Version and publish charts to a chart repository
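Environment-specific values files work by layering: a later file overrides matching keys from an earlier one while untouched keys are kept. The sketch below approximates that layering with a recursive merge so the effect is easy to reason about. It is not Helm's implementation; for instance, Helm also treats a `null` value as a request to remove a key, which is not handled here. The function name and sample values are hypothetical.

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Return base with override layered on top; neither input is modified."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {"replicaCount": 1, "image": {"repository": "myapp", "tag": "v1.0.0"}}
production = {"replicaCount": 3, "image": {"tag": "v1.2.0"}}
effective = merge_values(defaults, production)
```

Here `effective` keeps the default image repository but takes the replica count and tag from the production layer, similar in spirit to passing a default and a production values file to Helm with repeated `-f` flags.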

Kubernetes Operators:

- Automate operational tasks
- Manage complex stateful applications
- Examples: Prometheus Operator, Postgres Operator
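Operators automate these tasks with a reconciliation loop: compare the desired state with the observed state and compute the actions that close the gap. The sketch below shows only that comparison step for a replica count. Real operators are built on controller frameworks and act through the Kubernetes API; the `Plan` type, the function name, and the pod names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Actions needed to converge on the desired state."""
    create: int = 0
    delete: list[str] = field(default_factory=list)


def reconcile(desired_replicas: int, running_pods: list[str]) -> Plan:
    """Diff desired and observed state; an empty Plan means nothing to do."""
    difference = desired_replicas - len(running_pods)
    if difference > 0:
        return Plan(create=difference)
    if difference < 0:
        # Remove the highest-sorted names first so the result is deterministic.
        return Plan(delete=sorted(running_pods)[desired_replicas:])
    return Plan()
```

Calling `reconcile` repeatedly is safe: once the observed state matches the desired state it returns an empty plan, which is the idempotence a control loop relies on.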

Service Mesh (Istio, Linkerd):

- Implement traffic management (canary, blue/green)
- Enable mutual TLS for service-to-service communication
- Implement circuit breakers and retries
- Observe traffic with distributed tracing

Cost Optimization

AWS Cost Optimization:

- Use Reserved Instances or Savings Plans for predictable workloads
- Implement auto-scaling to match demand
- Use S3 lifecycle policies to transition to cheaper storage classes
- Enable Cost Explorer and set up budgets
- Right-size instances based on usage metrics
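Whether a commitment pays off for a "predictable workload" comes down to utilization: a commitment is billed for every hour, while on-demand is billed only for hours used. The sketch below works out the break-even point. The hourly rates are made-up illustration values, not AWS prices, and the function names are hypothetical.

```python
def break_even_utilization(on_demand_hourly: float, committed_hourly: float) -> float:
    """Fraction of hours an instance must run for the commitment to break even."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return committed_hourly / on_demand_hourly


def cheaper_option(utilization: float, on_demand_hourly: float, committed_hourly: float) -> str:
    """Compare average hourly cost at a given utilization (0.0 to 1.0)."""
    on_demand_cost = utilization * on_demand_hourly  # pay only for hours used
    return "committed" if committed_hourly < on_demand_cost else "on-demand"
```

With a hypothetical on-demand rate of 0.10 per hour and an effective committed rate of 0.062 per hour, the break-even point is about 62% utilization: a workload running half the time is cheaper on demand, while one running 80% of the time is cheaper with the commitment.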

Multi-Cloud Cost Management:

- Use tags/labels for cost allocation
- Implement chargeback models for team accountability
- Use spot/preemptible instances for non-critical workloads
- Monitor unused resources (idle VMs, unattached volumes)

Cloudflare Developer Platform

Cloudflare Workers & Pages:

- Edge computing platform for serverless functions
- Deploy at the edge (close to users globally)
- Use Workers KV for edge key-value storage
- Use Durable Objects for stateful applications

Cloudflare Primitives:

- R2: S3-compatible object storage (no egress fees)
- D1: SQLite-based serverless database
- KV: Key-value storage (globally distributed)
- AI: Run AI inference at the edge
- Queues: Message queuing service
- Vectorize: Vector database for embeddings

Configuration (wrangler.toml):

    name = "my-worker"
    main = "src/index.ts"
    compatibility_date = "2024-01-01"

    [[kv_namespaces]]
    binding = "MY_KV"
    id = "xxx"

    [[r2_buckets]]
    binding = "MY_BUCKET"
    bucket_name = "my-bucket"

    [[d1_databases]]
    binding = "DB"
    database_name = "my-db"
    database_id = "xxx"

Consolidated Skills

This expert skill consolidates 1 individual skill:

- cloudflare-developer-tools-rule

Related Skills

- docker-compose: Container orchestration and multi-container application management

Memory Protocol (MANDATORY)

Before starting:

cat .claude/context/memory/learnings.md

After completing: Record any new patterns or exceptions discovered.

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.
