AWS ECS
Amazon Elastic Container Service (ECS) is a fully managed container orchestration service. Run containers on AWS Fargate (serverless) or EC2 instances.
Table of Contents
- Core Concepts
- Common Patterns
- CLI Reference
- Best Practices
- Troubleshooting
- References

Core Concepts

Cluster
Logical grouping of tasks or services. Can contain Fargate tasks, EC2 instances, or both.
Task Definition
Blueprint for your application. Defines containers, resources, networking, and IAM roles.
Task
Running instance of a task definition. Can run standalone or as part of a service.
Service
Maintains desired count of tasks. Handles deployments, load balancing, and auto scaling.
Launch Types

| Type | Description | Use Case |
|------|-------------|----------|
| Fargate | Serverless, pay per task | Most workloads |
| EC2 | Self-managed instances | GPU, Windows, specific requirements |

Common Patterns

Create a Fargate Cluster
AWS CLI:
    # Create cluster
    aws ecs create-cluster --cluster-name my-cluster

    # With capacity providers
    aws ecs create-cluster \
      --cluster-name my-cluster \
      --capacity-providers FARGATE FARGATE_SPOT \
      --default-capacity-provider-strategy \
        capacityProvider=FARGATE,weight=1 \
        capacityProvider=FARGATE_SPOT,weight=1
Register Task Definition

    cat > task-definition.json << 'EOF'
    {
      "family": "web-app",
      "networkMode": "awsvpc",
      "requiresCompatibilities": ["FARGATE"],
      "cpu": "256",
      "memory": "512",
      "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
      "taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
      "containerDefinitions": [
        {
          "name": "web",
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
          "portMappings": [
            { "containerPort": 8080, "protocol": "tcp" }
          ],
          "environment": [
            {"name": "NODE_ENV", "value": "production"}
          ],
          "secrets": [
            {
              "name": "DB_PASSWORD",
              "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password"
            }
          ],
          "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
              "awslogs-group": "/ecs/web-app",
              "awslogs-region": "us-east-1",
              "awslogs-stream-prefix": "ecs"
            }
          },
          "healthCheck": {
            "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
            "interval": 30,
            "timeout": 5,
            "retries": 3,
            "startPeriod": 60
          }
        }
      ]
    }
    EOF

    aws ecs register-task-definition --cli-input-json file://task-definition.json
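Fargate accepts only certain cpu/memory pairs, and registration is rejected for anything else. The sketch below checks a pair before registering; the function name `fargate_size_ok` is hypothetical, and the size table reflects the published Fargate task sizes at the time of writing, so treat it as an assumption to verify against current AWS documentation.

```shell
# Return 0 if the cpu (CPU units) / memory (MiB) pair is a valid Fargate task size.
# fargate_size_ok is a hypothetical helper, not part of the AWS CLI.
fargate_size_ok() {
  local cpu="$1" mem="$2" min max step
  case "$cpu" in
    # 0.25 vCPU allows exactly three memory values
    256)   [ "$mem" -eq 512 ] || [ "$mem" -eq 1024 ] || [ "$mem" -eq 2048 ]; return ;;
    512)   min=1024;  max=4096;   step=1024 ;;
    1024)  min=2048;  max=8192;   step=1024 ;;
    2048)  min=4096;  max=16384;  step=1024 ;;
    4096)  min=8192;  max=30720;  step=1024 ;;
    8192)  min=16384; max=61440;  step=4096 ;;
    16384) min=32768; max=122880; step=8192 ;;
    *)     return 1 ;;
  esac
  # Memory must fall inside the range and sit on an allowed increment
  [ "$mem" -ge "$min" ] && [ "$mem" -le "$max" ] && [ $(( (mem - min) % step )) -eq 0 ]
}
```

One way to use it is as a guard: `fargate_size_ok 256 512 && aws ecs register-task-definition --cli-input-json file://task-definition.json`.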
Create Service with Load Balancer

    aws ecs create-service \
      --cluster my-cluster \
      --service-name web-service \
      --task-definition web-app:1 \
      --desired-count 2 \
      --launch-type FARGATE \
      --network-configuration "awsvpcConfiguration={subnets=[subnet-12345678,subnet-87654321],securityGroups=[sg-12345678],assignPublicIp=DISABLED}" \
      --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-tg/1234567890123456,containerName=web,containerPort=8080" \
      --health-check-grace-period-seconds 60
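The `--network-configuration` value is easy to mistype because it nests brackets inside braces. As a minimal sketch, a small helper can assemble the shorthand string from its parts; `awsvpc_config` is a hypothetical name, and the IDs are the placeholder values used in this document.

```shell
# Build the shorthand value for --network-configuration.
# Arguments: comma-separated subnet IDs, comma-separated security group IDs,
# and ENABLED or DISABLED for assignPublicIp (defaults to DISABLED).
# awsvpc_config is a hypothetical helper, not part of the AWS CLI.
awsvpc_config() {
  local subnets="$1" sgs="$2" public_ip="${3:-DISABLED}"
  case "$public_ip" in
    ENABLED|DISABLED) ;;
    *) echo "assignPublicIp must be ENABLED or DISABLED" >&2; return 1 ;;
  esac
  printf 'awsvpcConfiguration={subnets=[%s],securityGroups=[%s],assignPublicIp=%s}\n' \
    "$subnets" "$sgs" "$public_ip"
}
```

It can then be passed inline, for example `--network-configuration "$(awsvpc_config subnet-12345678,subnet-87654321 sg-12345678)"`.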
Run Standalone Task

    aws ecs run-task \
      --cluster my-cluster \
      --task-definition my-batch-job:1 \
      --launch-type FARGATE \
      --network-configuration "awsvpcConfiguration={subnets=[subnet-12345678],securityGroups=[sg-12345678],assignPublicIp=ENABLED}"
Update Service (Deploy New Image)
    # Register new task definition with updated image
    aws ecs register-task-definition --cli-input-json file://task-definition.json

    # Update service to use new version
    aws ecs update-service \
      --cluster my-cluster \
      --service web-service \
      --task-definition web-app:2 \
      --force-new-deployment
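`update-service` returns as soon as the deployment is accepted, not when it finishes. A deploy script may therefore want to block until the service settles, which the CLI's `aws ecs wait services-stable` waiter does. The sketch below wraps both calls; `deploy_and_wait` and the `AWS_CMD` override (which lets the function be dry-run with `echo`) are hypothetical names introduced for illustration.

```shell
# AWS_CMD is a hypothetical override: set it to "echo" to print the commands
# instead of calling AWS.
AWS_CMD="${AWS_CMD:-aws}"

# Point a service at a new task definition revision, then block until the
# service reports a steady state (the waiter exits non-zero if it times out).
deploy_and_wait() {
  local cluster="$1" service="$2" task_def="$3"
  "$AWS_CMD" ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --task-definition "$task_def" || return 1
  "$AWS_CMD" ecs wait services-stable \
    --cluster "$cluster" \
    --services "$service"
}
```

For example, `deploy_and_wait my-cluster web-service web-app:2` would run both steps in order and return the waiter's exit status.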
Auto Scaling
    # Register scalable target
    aws application-autoscaling register-scalable-target \
      --service-namespace ecs \
      --resource-id service/my-cluster/web-service \
      --scalable-dimension ecs:service:DesiredCount \
      --min-capacity 2 \
      --max-capacity 10

    # Target tracking policy
    aws application-autoscaling put-scaling-policy \
      --service-namespace ecs \
      --resource-id service/my-cluster/web-service \
      --scalable-dimension ecs:service:DesiredCount \
      --policy-name cpu-target-tracking \
      --policy-type TargetTrackingScaling \
      --target-tracking-scaling-policy-configuration '{
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
          "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120
      }'
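Memory-bound services can track `ECSServiceAverageMemoryUtilization` instead of CPU using the same policy shape. As a rough sketch, the helper below prints the configuration JSON for either metric; `target_tracking_config` is a hypothetical name, and the cooldown values are simply the ones used in this document.

```shell
# Print a target tracking configuration for an ECS service.
# Arguments: metric (cpu or memory) and target utilization percentage.
# target_tracking_config is a hypothetical helper, not part of the AWS CLI.
target_tracking_config() {
  local metric="$1" target="$2" type
  case "$metric" in
    cpu)    type=ECSServiceAverageCPUUtilization ;;
    memory) type=ECSServiceAverageMemoryUtilization ;;
    *) echo "metric must be cpu or memory" >&2; return 1 ;;
  esac
  printf '{"TargetValue":%s,"PredefinedMetricSpecification":{"PredefinedMetricType":"%s"},"ScaleOutCooldown":60,"ScaleInCooldown":120}\n' \
    "$target" "$type"
}
```

The output can be passed directly, for example `--target-tracking-scaling-policy-configuration "$(target_tracking_config memory 75)"`.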
CLI Reference

Cluster Management

| Command | Description |
|---------|-------------|
| aws ecs create-cluster | Create cluster |
| aws ecs describe-clusters | Get cluster details |
| aws ecs list-clusters | List clusters |
| aws ecs delete-cluster | Delete cluster |

Task Definitions

| Command | Description |
|---------|-------------|
| aws ecs register-task-definition | Create task definition |
| aws ecs describe-task-definition | Get task definition |
| aws ecs list-task-definitions | List task definitions |
| aws ecs deregister-task-definition | Deregister version |

Services

| Command | Description |
|---------|-------------|
| aws ecs create-service | Create service |
| aws ecs update-service | Update service |
| aws ecs describe-services | Get service details |
| aws ecs delete-service | Delete service |

Tasks

| Command | Description |
|---------|-------------|
| aws ecs run-task | Run standalone task |
| aws ecs stop-task | Stop running task |
| aws ecs describe-tasks | Get task details |
| aws ecs list-tasks | List tasks |

Best Practices

Security
- Use task roles for AWS API access (not access keys)
- Use execution roles for ECR/Secrets access
- Store secrets in Secrets Manager or Parameter Store
- Use private subnets with NAT gateway
- Enable CloudTrail for API auditing

Performance
- Right-size CPU/memory: monitor and adjust
- Use Fargate Spot for fault-tolerant workloads (up to 70% savings)
- Enable Container Insights for monitoring
- Use service discovery for internal communication

Reliability
- Deploy across multiple AZs
- Configure health checks properly
- Set appropriate deregistration delay
- Use circuit breaker for deployments

    aws ecs update-service \
      --cluster my-cluster \
      --service web-service \
      --deployment-configuration '{
        "deploymentCircuitBreaker": {
          "enable": true,
          "rollback": true
        }
      }'
Cost Optimization
- Use Fargate Spot for batch workloads
- Right-size task resources
- Scale to zero when not needed
- Use capacity providers for mixed Fargate/Spot

Troubleshooting

Task Fails to Start
Check:
    # View stopped tasks
    aws ecs describe-tasks \
      --cluster my-cluster \
      --tasks "$(aws ecs list-tasks --cluster my-cluster --desired-status STOPPED --query 'taskArns[0]' --output text)"
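The `stoppedReason` field in the `describe-tasks` output usually names the failure class. As a sketch, the helper below maps a few well-known error prefixes to likely causes; `classify_stopped_reason` is a hypothetical name, and the list of prefixes is deliberately incomplete.

```shell
# Map a stoppedReason string to a likely cause.
# classify_stopped_reason is a hypothetical helper; it recognizes only a few
# well-known ECS error prefixes and falls through for everything else.
classify_stopped_reason() {
  case "$1" in
    *CannotPullContainerError*)    echo "image pull failed: check the image URI and ECR permissions" ;;
    *ResourceInitializationError*) echo "setup failed: check secrets access and network configuration" ;;
    *OutOfMemoryError*)            echo "container exceeded its memory limit" ;;
    *)                             echo "unclassified: read the full stoppedReason" ;;
  esac
}
```

The reason string itself can be fetched by adding `--query 'tasks[0].stoppedReason' --output text` to the `describe-tasks` call and passing the result to the function.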
Common causes:
- Image not found (ECR permissions)
- Secrets access denied
- Network configuration (subnets, security groups)
- Resource limits exceeded

Container Keeps Restarting
Debug:
    # Check CloudWatch logs
    aws logs get-log-events \
      --log-group-name /ecs/web-app \
      --log-stream-name "ecs/web/abc123"

    # Check task details
    aws ecs describe-tasks \
      --cluster my-cluster \
      --tasks task-arn \
      --query 'tasks[0].containers[0].{reason:reason,exitCode:exitCode}'
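The `exitCode` value often narrows down the cause on its own, because codes above 128 conventionally mean the process was terminated by signal number `code - 128`. A minimal sketch of that interpretation follows; `explain_exit_code` is a hypothetical helper, and the hints are rules of thumb rather than guarantees.

```shell
# Translate a container exit code into a rough diagnosis.
# explain_exit_code is a hypothetical helper; the hints are heuristics.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -eq 137 ]; then
    echo "SIGKILL (128+9): often an out-of-memory kill or a stop timeout"
  elif [ "$code" -eq 143 ]; then
    echo "SIGTERM (128+15): the container was asked to stop"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "application error: check the container logs"
  fi
}
```

An exit code of 137 combined with a `reason` mentioning memory usually points at the task's memory setting rather than at the application logic.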
Causes:
- Health check failing
- Application crashing
- Out of memory

Service Stuck Deploying
    # Check deployment status
    aws ecs describe-services \
      --cluster my-cluster \
      --services web-service \
      --query 'services[0].deployments'

    # Check events
    aws ecs describe-services \
      --cluster my-cluster \
      --services web-service \
      --query 'services[0].events[:5]'
Causes:
- Health check failing on new tasks
- Not enough capacity
- Target group health checks failing

Cannot Pull Image from ECR
Check execution role has:
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
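One way to confirm the role actually grants these actions is the IAM policy simulator, available through `aws iam simulate-principal-policy`. The sketch below assumes the caller has permission to run the simulator and uses this document's placeholder role ARN; `check_ecr_pull_permissions` and the `AWS_CMD` dry-run override are hypothetical names. The simulation covers the role's identity policies only, so repository policies and network reachability still need a separate check.

```shell
# AWS_CMD is a hypothetical override: set it to "echo" to print the command
# instead of calling AWS.
AWS_CMD="${AWS_CMD:-aws}"

# Simulate the four ECR pull actions against a role and print one
# "action decision" pair per line (for example "allowed" or "implicitDeny").
check_ecr_pull_permissions() {
  local role_arn="$1"
  "$AWS_CMD" iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names \
      ecr:GetAuthorizationToken \
      ecr:BatchCheckLayerAvailability \
      ecr:GetDownloadUrlForLayer \
      ecr:BatchGetImage \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}
```

For example: `check_ecr_pull_permissions arn:aws:iam::123456789012:role/ecsTaskExecutionRole`.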
Also check:
- Tasks in private subnets have a route to ECR: either a NAT gateway, or VPC endpoints for ECR (ecr.api and ecr.dkr) plus an S3 gateway endpoint for image layers
- Security group allows HTTPS outbound

References
- ECS Developer Guide
- ECS API Reference
- ECS CLI Reference
- boto3 ECS