kubernetes-specialist

安装量: 85
排名: #9361

安装

npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill kubernetes-specialist
Kubernetes Specialist
Purpose
Provides expert Kubernetes orchestration and cloud-native application expertise with deep knowledge of container orchestration, cluster management, and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises deployments.
When to Use
Designing Kubernetes cluster architecture for production workloads
Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
Troubleshooting cluster issues (networking, storage, performance)
Planning Kubernetes upgrades or multi-cluster strategies
Optimizing resource utilization and cost in Kubernetes environments
Setting up service mesh (Istio, Linkerd) and observability
Implementing Kubernetes security and RBAC policies
Quick Start
Invoke this skill when:
Designing Kubernetes cluster architecture for production workloads
Implementing Helm charts, operators, or GitOps workflows
Troubleshooting cluster issues (networking, storage, performance)
Planning Kubernetes upgrades or multi-cluster strategies
Optimizing resource utilization and cost in Kubernetes environments
Do NOT invoke when:
Simple Docker container needs (use docker commands directly)
Cloud infrastructure provisioning (use cloud-architect instead)
Application code debugging (use backend-developer/frontend-developer)
Database-specific issues (use database-administrator instead)
Decision Framework
Deployment Strategy Selection
├─ Zero downtime required?
│ ├─ Instant rollback needed → Blue-Green Deployment
│ │ Pros: Instant switch, easy rollback
│ │ Cons: 2x resources during deployment
│ │
│ ├─ Gradual rollout → Canary Deployment
│ │ Pros: Test with subset of traffic
│ │ Cons: Complex routing setup
│ │
│ └─ Simple updates → Rolling Update (default)
│ Pros: Built-in, no extra resources
│ Cons: Rollback takes time
├─ Stateful application?
│ ├─ Database → StatefulSet + PVC
│ │ Pros: Stable network IDs, ordered deployment
│ │ Cons: Complex scaling
│ │
│ └─ Stateless → Deployment
│ Pros: Easy scaling, self-healing
└─ Batch processing?
├─ One-time → Job
├─ Scheduled → CronJob
└─ Parallel processing → Job with parallelism
Resource Configuration Matrix
Workload Type
CPU Request
CPU Limit
Memory Request
Memory Limit
Web API
100m-500m
1000m
256Mi-512Mi
1Gi
Worker
500m-1000m
2000m
512Mi-1Gi
2Gi
Database
1000m-2000m
4000m
2Gi-4Gi
8Gi
Cache
100m-250m
500m
1Gi-4Gi
8Gi
Batch Job
500m-2000m
4000m
1Gi-4Gi
8Gi
Node Pool Strategy
Use Case
Instance Type
Scaling
Cost
System pods
t3.large (3 nodes)
Fixed
Low
Applications
m5.xlarge
Auto 3-20
Medium
Batch/Spot
m5.large-2xlarge
Auto 0-50
Very Low
GPU workloads
p3.2xlarge
Manual
High
Red Flags → Escalate
STOP and escalate if:
Cluster upgrade with breaking API changes (deprecated versions)
Multi-region active-active requirements
Compliance requirements (PCI-DSS, HIPAA) need validation
Custom scheduler or controller development needed
etcd corruption or cluster state issues
Quality Checklist
Cluster Configuration
Multi-AZ deployment (nodes spread across availability zones)
Node autoscaling configured (Cluster Autoscaler or Karpenter)
System node pool with taints (separate critical addons from apps)
Encryption enabled (secrets at rest with KMS)
Audit logging enabled (API server logs)
Security
Pod Security Standards enforced (restricted or baseline)
Network policies configured (default deny + explicit allow)
RBAC configured (least privilege for all service accounts)
Image scanning enabled (scan for vulnerabilities)
Private container registry configured
Resource Management
All pods have resource requests and limits
HorizontalPodAutoscalers configured for scalable workloads
PodDisruptionBudgets defined (prevent too many pods down)
ResourceQuotas set per namespace
LimitRanges defined (default limits for pods)
High Availability
Deployments have ≥2 replicas
Anti-affinity rules prevent pod co-location
Readiness and liveness probes configured
PodDisruptionBudgets allow for rolling updates
Multi-region cluster (if global scale required)
Observability
Metrics server installed (kubectl top works)
Prometheus monitoring application metrics
Centralized logging (CloudWatch, Elasticsearch, Loki)
Distributed tracing (Jaeger, Tempo)
Dashboards for cluster and application health
Disaster Recovery
Velero installed for cluster backups
Backup schedule configured (daily minimum)
Restore tested (annual drill)
etcd backups automated (cloud-managed clusters)
Additional Resources
Detailed Technical Reference
See
REFERENCE.md
Code Examples & Patterns
See EXAMPLES.md
返回排行榜