Senior Cloud Architect
Expert-level cloud architecture and infrastructure design.
Core Competencies Multi-cloud architecture AWS, GCP, Azure platforms Cloud-native design patterns Cost optimization Security and compliance Migration strategies Disaster recovery Infrastructure automation Cloud Platform Comparison Service AWS GCP Azure Compute EC2, ECS, EKS GCE, GKE VMs, AKS Serverless Lambda Cloud Functions Azure Functions Storage S3 Cloud Storage Blob Storage Database RDS, DynamoDB Cloud SQL, Spanner SQL DB, CosmosDB ML SageMaker Vertex AI Azure ML CDN CloudFront Cloud CDN Azure CDN AWS Architecture Well-Architected Framework
Pillars:
Operational Excellence
Infrastructure as Code Monitoring and observability Incident response Continuous improvement
Security
Identity and access management Data protection Infrastructure protection Incident response
Reliability
Fault tolerance Disaster recovery Change management Failure testing
Performance Efficiency
Right-sizing resources Monitoring performance Trade-off decisions Keeping current
Cost Optimization
Cost awareness Right-sizing Reserved capacity Efficient resources
Sustainability
Region selection Efficient algorithms Hardware utilization Data management Reference Architecture ┌─────────────────────────────────────────────────────────────┐ │ Route 53 (DNS) │ └─────────────────────────────┬───────────────────────────────┘ │ ┌─────────────────────────────▼───────────────────────────────┐ │ CloudFront (CDN) │ │ WAF (Web Application Firewall) │ └─────────────────────────────┬───────────────────────────────┘ │ ┌─────────────────────────────▼───────────────────────────────┐ │ Application Load Balancer │ └──────────┬───────────────────────────────────┬──────────────┘ │ │ ┌──────────▼──────────┐ ┌──────────▼──────────┐ │ ECS/EKS Cluster │ │ ECS/EKS Cluster │ │ (AZ-a) │ │ (AZ-b) │ └──────────┬──────────┘ └──────────┬──────────┘ │ │ ┌──────────▼───────────────────────────────────▼──────────┐ │ ElastiCache (Redis) │ └─────────────────────────────┬───────────────────────────┘ │ ┌─────────────────────────────▼───────────────────────────┐ │ RDS Multi-AZ │ │ (Primary + Standby) │ └─────────────────────────────────────────────────────────┘
Terraform AWS Module module "vpc" { source = "terraform-aws-modules/vpc/aws" version = "~> 5.0"
name = "${var.project}-${var.environment}" cidr = var.vpc_cidr
azs = ["${var.region}a", "${var.region}b", "${var.region}c"] private_subnets = var.private_subnets public_subnets = var.public_subnets
enable_nat_gateway = true single_nat_gateway = var.environment != "production" enable_dns_hostnames = true enable_dns_support = true
tags = local.common_tags }
module "eks" { source = "terraform-aws-modules/eks/aws" version = "~> 19.0"
cluster_name = "${var.project}-${var.environment}" cluster_version = "1.28"
vpc_id = module.vpc.vpc_id subnet_ids = module.vpc.private_subnets
cluster_endpoint_public_access = true cluster_endpoint_private_access = true
eks_managed_node_groups = { main = { instance_types = var.node_instance_types min_size = var.node_min_size max_size = var.node_max_size desired_size = var.node_desired_size } }
tags = local.common_tags }
module "rds" { source = "terraform-aws-modules/rds/aws" version = "~> 6.0"
identifier = "${var.project}-${var.environment}"
engine = "postgres" engine_version = "15" family = "postgres15" major_engine_version = "15" instance_class = var.db_instance_class
allocated_storage = var.db_allocated_storage max_allocated_storage = var.db_max_allocated_storage
db_name = var.db_name username = var.db_username port = 5432
multi_az = var.environment == "production" db_subnet_group_name = module.vpc.database_subnet_group vpc_security_group_ids = [module.security_group.security_group_id]
backup_retention_period = var.environment == "production" ? 30 : 7 skip_final_snapshot = var.environment != "production"
tags = local.common_tags }
Cost Optimization Reserved vs On-Demand vs Spot Type Discount Commitment Use Case On-Demand 0% None Variable workloads Reserved 30-72% 1-3 years Steady-state Savings Plans 30-72% 1-3 years Flexible compute Spot 60-90% None Fault-tolerant Cost Optimization Strategies
Right-sizing:
def analyze_utilization(instance_id: str, days: int = 14): """Analyze CPU/memory utilization for right-sizing recommendations.""" cloudwatch = boto3.client('cloudwatch')
metrics = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=datetime.now() - timedelta(days=days),
EndTime=datetime.now(),
Period=3600,
Statistics=['Average', 'Maximum']
)
avg_cpu = sum(p['Average'] for p in metrics['Datapoints']) / len(metrics['Datapoints'])
max_cpu = max(p['Maximum'] for p in metrics['Datapoints'])
if avg_cpu < 10 and max_cpu < 30:
return 'downsize'
elif avg_cpu > 80:
return 'upsize'
else:
return 'optimal'
Cost Allocation Tags:
required_tags: - Environment: production|staging|development - Project: project-name - Owner: team-name - CostCenter: cost-center-id
automation: - Untagged resources alert after 24 hours - Auto-terminate development resources after 7 days - Weekly cost reports by tag
Cost Dashboard ┌─────────────────────────────────────────────────────────────┐ │ Monthly Cost Summary │ ├─────────────────────────────────────────────────────────────┤ │ Total: $45,231 vs Last Month: +5% │ │ │ │ By Service: By Environment: │ │ ├── EC2: $18,500 (41%) ├── Production: $38,000 │ │ ├── RDS: $12,000 (27%) ├── Staging: $4,500 │ │ ├── S3: $3,200 (7%) └── Development: $2,731 │ │ ├── Lambda: $1,800 (4%) │ │ └── Other: $9,731 (21%) Savings Opportunity: $8,200 │ │ │ │ Recommendations: │ │ • Convert 12 instances to Reserved (save $4,200/mo) │ │ • Delete 5 unused EBS volumes (save $180/mo) │ │ • Resize 8 over-provisioned instances (save $1,800/mo) │ └─────────────────────────────────────────────────────────────┘
Disaster Recovery DR Strategies Strategy RTO RPO Cost Backup & Restore Hours Hours $ Pilot Light Minutes Minutes $$ Warm Standby Minutes Seconds $$$ Multi-Site Active Seconds Near-zero $$$$ Multi-Region Architecture ┌────────────────────────────────────────────────────────────┐ │ Global Load Balancer │ │ (Route 53 / Cloud DNS) │ └──────────────┬─────────────────────────────┬───────────────┘ │ │ ┌──────────────▼──────────────┐ ┌────────────▼──────────────┐ │ Primary Region │ │ Secondary Region │ │ (us-east-1) │ │ (us-west-2) │ │ │ │ │ │ ┌──────────────────────┐ │ │ ┌──────────────────────┐ │ │ │ Application Layer │ │ │ │ Application Layer │ │ │ │ (Active) │ │ │ │ (Standby/Active) │ │ │ └──────────┬───────────┘ │ │ └──────────┬───────────┘ │ │ │ │ │ │ │ │ ┌──────────▼───────────┐ │ │ ┌──────────▼───────────┐ │ │ │ Database │──┼─┼──│ Database │ │ │ │ (Primary) │ │ │ │ (Read Replica) │ │ │ └──────────────────────┘ │ │ └──────────────────────┘ │ └────────────────────────────┘ └────────────────────────────┘ │ │ Cross-Region Replication ▼ ┌──────────────────────┐ │ S3 Backup │ │ (Multi-Region) │ └──────────────────────┘
Backup Strategy backup_policy: database: frequency: continuous retention: 35 days cross_region: true encryption: aws/rds
application_data: frequency: daily retention: 90 days versioning: enabled lifecycle: - transition_to_ia: 30 days - transition_to_glacier: 90 days - expiration: 365 days
configuration: frequency: on_change retention: unlimited storage: git + s3
Security Architecture Network Security ┌─────────────────────────────────────────────────────────────┐ │ VPC │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ Public Subnet │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │ │ │ │ │ NAT GW │ │ ALB │ │ Bastion │ │ │ │ │ └─────────────┘ └─────────────┘ └───────────────┘ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ │ │ ┌───────────────────────────▼───────────────────────────┐ │ │ │ Private Subnet │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │ │ │ │ │ App Tier │ │ App Tier │ │ App Tier │ │ │ │ │ └─────────────┘ └─────────────┘ └───────────────┘ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ │ │ ┌───────────────────────────▼───────────────────────────┐ │ │ │ Data Subnet │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │ │ │ │ │ RDS │ │ Redis │ │ Elasticsearch│ │ │ │ │ └─────────────┘ └─────────────┘ └───────────────┘ │ │ │ └───────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘
IAM Best Practices { "Version": "2012-10-17", "Statement": [ { "Sid": "LeastPrivilegeExample", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": "arn:aws:s3:::my-bucket/uploads/*", "Condition": { "StringEquals": { "aws:PrincipalTag/Team": "engineering" }, "IpAddress": { "aws:SourceIp": ["10.0.0.0/8"] } } } ] }
Reference Materials references/aws_patterns.md - AWS architecture patterns references/gcp_patterns.md - GCP architecture patterns references/multi_cloud.md - Multi-cloud strategies references/cost_optimization.md - Cost optimization guide Scripts
Infrastructure cost analyzer
python scripts/cost_analyzer.py --account production --period monthly
DR validation
python scripts/dr_test.py --region us-west-2 --type failover
Security audit
python scripts/security_audit.py --framework cis --output report.html
Resource inventory
python scripts/inventory.py --accounts all --format csv