Azure AKS Agent CLI Skill Overview
The Agentic CLI for Azure Kubernetes Service (AKS) is an AI-powered troubleshooting and insights tool (currently in preview) that brings advanced diagnostics directly to your terminal. It allows you to ask natural language questions about your cluster's health, configuration, and issues without requiring deep Kubernetes expertise or knowledge of complex command syntax.
Primary Command: az aks agent
Quick Reference Installation
Prerequisites: Azure CLI version 2.76 or higher
az version
Install the extension (takes 5-10 minutes)
az extension add --name aks-agent --debug
Verify installation
az extension list az aks agent --help
Initialize LLM configuration (interactive wizard)
az aks agent-init
Remove extension if needed
az extension remove --name aks-agent --debug
Basic Usage
Get cluster credentials first
az aks get-credentials --resource-group
Start interactive troubleshooting
az aks agent -g
Ask a specific question
az aks agent -g
Non-interactive mode (batch processing)
az aks agent -g
Workflow Decision Tree What do you need to do? ├── Cluster Health Check? │ └── Use: az aks agent --query "What's the health status of my cluster?" ├── Troubleshoot Pod Issues? │ └── Use: az aks agent --query "Why are my pods failing?" ├── Networking Problems? │ └── Use: az aks agent --query "Diagnose networking issues" ├── Storage Issues? │ └── Use: az aks agent --query "Check storage configuration" ├── Security/RBAC Issues? │ └── Use: az aks agent --query "Review RBAC configuration" ├── Node Pool Problems? │ └── Use: az aks agent --query "Check node pool health" └── Configuration Review? └── Use: az aks agent --query "Review cluster configuration"
Command Reference Core Commands Command Description az aks agent Start interactive AI-powered troubleshooting az aks agent-init Initialize LLM provider configuration az aks agent --help Show help and available options Command Parameters Parameter Description Default -g, --resource-group Resource group name Required -n, --name AKS cluster name Required --api-key LLM API key From env or config --config-file Config file path ~/.azure/aksAgent.config --max-steps Max investigation steps 10 --model LLM model specification From config --no-interactive Run in batch mode false --show-tool-output Display tool call outputs false --refresh-toolsets Refresh toolsets status false LLM Model Specifications
Azure OpenAI
--model "azure/gpt-4o" --model "azure/gpt-4o-mini"
OpenAI
--model "gpt-4o" --model "gpt-4o-mini"
Anthropic
--model "anthropic/claude-sonnet-4" --model "anthropic/claude-3-5-sonnet"
Gemini
--model "gemini/gemini-pro"
Configuration Environment Variables
Azure OpenAI API Key
export AZURE_API_KEY="your-azure-openai-key"
OpenAI API Key
export OPENAI_API_KEY="your-openai-key"
Anthropic API Key
export ANTHROPIC_API_KEY="your-anthropic-key"
Config File Structure (~/.azure/aksAgent.config)
Azure OpenAI Configuration
llm_provider: azure
azure_api_base: https://
OR OpenAI Configuration
llm_provider: openai model: gpt-4o
OR Anthropic Configuration
llm_provider: anthropic model: claude-sonnet-4
Azure OpenAI Requirements Deployment name: Must match model name Minimum TPM: 1,000,000+ (Tokens Per Minute) Minimum context size: 128,000+ tokens API Base Format: https://{endpoint}.openai.azure.com/ (NOT AI Foundry URI) Common Use Cases Cluster Health Analysis
General health check
az aks agent -g myRG -n myCluster --query "What's the overall health of my cluster?"
Node status
az aks agent -g myRG -n myCluster --query "Are all nodes healthy and ready?"
Resource utilization
az aks agent -g myRG -n myCluster --query "Show me resource utilization across nodes"
Pod Troubleshooting
Failed pods analysis
az aks agent -g myRG -n myCluster --query "Why are pods in CrashLoopBackOff?"
Pending pods
az aks agent -g myRG -n myCluster --query "Why are some pods stuck in Pending state?"
OOMKilled pods
az aks agent -g myRG -n myCluster --query "Investigate OOMKilled containers"
Networking Issues
Network policy review
az aks agent -g myRG -n myCluster --query "Are there network policies blocking traffic?"
DNS troubleshooting
az aks agent -g myRG -n myCluster --query "Diagnose DNS resolution issues"
Service connectivity
az aks agent -g myRG -n myCluster --query "Why can't pods reach external services?"
Storage Troubleshooting
PVC issues
az aks agent -g myRG -n myCluster --query "Why are PersistentVolumeClaims pending?"
Storage class review
az aks agent -g myRG -n myCluster --query "Review storage class configuration"
Security Analysis
RBAC review
az aks agent -g myRG -n myCluster --query "Are RBAC permissions configured correctly?"
Security best practices
az aks agent -g myRG -n myCluster --query "What security improvements do you recommend?"
AKS Events Reference Viewing Cluster Events
Get cluster credentials first
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER
List all events
kubectl get events
Filter by namespace
kubectl get events --namespace default
Watch auto-repair events
kubectl get events --field-selector=source=aks-auto-repair --watch
Detailed pod events
kubectl describe pod $POD_NAME
Event Types Type Description Normal Routine operations and expected activities Warning Potentially problematic situations requiring attention Common Event Reasons Reason Description FailedScheduling Pod failed to be scheduled on a node CrashLoopBackOff Container is in a restart loop Scheduled Pod successfully assigned to a node Pulled Container image successfully pulled Created Container created Started Container started OOMKilled Container killed due to out of memory Event Fields Field Description type Warning or Normal reason Short reason code message Human-readable description namespace Kubernetes namespace firstSeen First observation timestamp lastSeen Most recent observation object Associated Kubernetes object Best Practices Effective Query Strategies
Start broad, then narrow
Start with general health
"What's wrong with my cluster?"
Then focus on specific issues
"Why are pods in namespace X failing?"
Provide context about symptoms
"Pods are restarting frequently in the production namespace" "Services are experiencing intermittent timeouts"
Ask for specific recommendations
"What changes do you recommend to improve cluster performance?" "How can I fix the networking issues you identified?"
Request historical analysis
"What patterns do you see in recent pod failures?" "Have there been any unusual events in the last 24 hours?"
Security Considerations Ensure proper RBAC permissions are configured Use Azure AD integration for authentication Follow principle of least privilege Audit command usage through Azure activity logs Service account tokens for automation Integration Tips Combine with traditional monitoring: Use alongside Azure Monitor and Container Insights Proactive monitoring: Run health checks regularly Document findings: Save important diagnostic outputs Enable Container Insights: For events beyond 1-hour retention Troubleshooting the Agent Installation Issues
Verify Azure CLI version
az version
Upgrade Azure CLI if needed
az upgrade
Force reinstall extension
az extension remove --name aks-agent az extension add --name aks-agent --debug
Authentication Issues
Verify Azure login
az account show
Re-authenticate
az login
Check subscription
az account set --subscription
LLM Connection Issues
Reinitialize LLM configuration
az aks agent-init
Check API key environment variable
echo $AZURE_API_KEY
Test with explicit API key
az aks agent -g myRG -n myCluster --api-key "your-key"
Rate Limiting Symptom: Slow responses or errors Solution: Increase TPM quota in Azure OpenAI deployment Minimum recommended: 1,000,000 TPM Important Notes Preview Feature: This is currently in preview with limited warranty coverage Not for Production Critical: Not recommended for production-critical decision making Event Retention: Kubernetes events only persist for 1 hour by default Context Window: Requires 128,000+ token context for optimal performance Authentication: Always authenticate with az login before using Resources References Core References references/cli-commands.md - Complete CLI command reference references/troubleshooting.md - Extended troubleshooting guide references/examples.md - Practical usage examples Diagnostics & Monitoring references/diagnostics.md - AKS Diagnose and Solve Problems guide references/monitoring.md - Comprehensive AKS monitoring guide references/control-plane-metrics.md - Control plane metrics (API Server, etcd) Troubleshooting Guides references/kubelet-logs.md - Kubelet logs access and analysis references/memory-saturation.md - Memory saturation identification and resolution references/node-auto-repair.md - Node auto-repair process and monitoring references/api-server-etcd.md - API server and etcd troubleshooting External Documentation AKS Agent Overview AKS Agent Installation AKS Events AKS Diagnostics Monitor AKS Control Plane Metrics Kubelet Logs Memory Saturation Node Auto-Repair API Server/etcd Troubleshooting AKS Monitoring Reference AKS Agent Troubleshooting AKS Agent FAQ GitHub Repository Example Config