# Kubernetes Debugging Skill

## Overview

Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.

## Trigger Phrases

Use this skill when requests resemble:

- "My pod is in `CrashLoopBackOff`; help me find the root cause."
- "Service DNS works in one pod but not another."
- "Deployment rollout is stuck."
- "Pods are Pending and not scheduling."
- "Cluster health looks degraded after a change."
- "PVC is pending and pods cannot mount storage."

## Prerequisites

Run from the skill directory (`devops-skills-plugin/skills/k8s-debug`) so relative script paths work as written.

### Required

- `kubectl` installed and configured.
- An active cluster context.
- Read access to namespaces, pods, events, services, and nodes.

Quick preflight:

    kubectl config current-context
    kubectl auth can-i get pods -A
    kubectl auth can-i get events -A
    kubectl get ns

### Optional but Recommended

- `jq` for more precise filtering in `./scripts/cluster_health.sh`.
- Metrics API (`metrics-server`) for `kubectl top`.
- In-container debug tools (`nslookup`, `getent`, `curl`, `wget`, `ip`) for deep network tests.

Fallback behavior:

- If optional tools are missing, scripts continue and print warnings with reduced output.
- If `kubectl top` is unavailable, continue with `kubectl describe` and events.

## When to Use This Skill

Use this skill for:

- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)

## Safety Rules for Disruptive Commands

Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.

Commands requiring explicit confirmation:

- `kubectl delete pod ... --force --grace-period=0`
- `kubectl drain ...`
- `kubectl rollout restart ...`
- `kubectl rollout undo ...`
- `kubectl debug ... --copy-to=...`

Before disruptive actions:

    # Snapshot current state for rollback and incident notes
    kubectl get deploy,rs,pod,svc -n <namespace> -o wide
    kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
    kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt

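When the snapshot has to be repeated for several pods, it can help to generate the command lines instead of retyping them. The following is a minimal sketch, not part of the skill's scripts: `snapshot_plan` and `render_shell` are hypothetical helper names, and the output file names simply follow the `before-...` pattern used in the snapshot commands.

```python
import shlex
from typing import List, Optional, Tuple

# (argv, file that should receive stdout, or None to print to the terminal)
SnapshotStep = Tuple[List[str], Optional[str]]


def snapshot_plan(namespace: str, pod_name: str) -> List[SnapshotStep]:
    """Build the three read-only snapshot commands for one pod.

    Hypothetical helper: it only assembles argument lists, it runs nothing.
    """
    return [
        (["kubectl", "get", "deploy,rs,pod,svc", "-n", namespace, "-o", "wide"], None),
        (["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "yaml"],
         f"before-{pod_name}.yaml"),
        (["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
         "before-events.txt"),
    ]


def render_shell(plan: List[SnapshotStep]) -> List[str]:
    """Render each step as a shell line, quoting arguments where needed."""
    lines = []
    for argv, outfile in plan:
        line = shlex.join(argv)
        if outfile is not None:
            line += " > " + shlex.quote(outfile)
        lines.append(line)
    return lines


if __name__ == "__main__":
    for line in render_shell(snapshot_plan("payments", "payments-api-7c97f95dfb-q9l7k")):
        print(line)
```

Keeping the plan as argument lists rather than strings means the same data can be passed to `subprocess.run` later without re-parsing shell syntax.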
## Reference Navigation Map

Load only the section needed for the observed symptom.

| Symptom / Need | Open | Start section |
| --- | --- | --- |
| You need an end-to-end diagnosis path | `./references/troubleshooting_workflow.md` | General Debugging Workflow |
| Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff` | `./references/troubleshooting_workflow.md` | Pod Lifecycle Troubleshooting |
| Service reachability or DNS failure | `./references/troubleshooting_workflow.md` | Network Troubleshooting Workflow |
| Node pressure or performance regression | `./references/troubleshooting_workflow.md` | Resource and Performance Workflow |
| PVC / PV / storage class issues | `./references/troubleshooting_workflow.md` | Storage Troubleshooting Workflow |
| Quick symptom-to-fix lookup | `./references/common_issues.md` | matching issue heading |
| Post-mortem fix options for known issues | `./references/common_issues.md` | `Solutions` sections |

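The workflow rows of the navigation map can be read as a small routing table. The sketch below illustrates that idea only: the keyword sets and the `start_section` helper are assumptions introduced here, while the file path and section names are copied from the map. Routes are checked in order, so a storage keyword wins over a generic word such as "pending".

```python
import re
from typing import List, Set, Tuple

WORKFLOW = "./references/troubleshooting_workflow.md"

# Ordered routes: (keywords that trigger the route, start section).
# The keyword sets are illustrative assumptions, not part of the skill.
ROUTES: List[Tuple[Set[str], str]] = [
    ({"pvc", "pv", "storage", "volume"}, "Storage Troubleshooting Workflow"),
    ({"dns", "service", "endpoints"}, "Network Troubleshooting Workflow"),
    ({"node", "nodes", "cpu", "memory"}, "Resource and Performance Workflow"),
    ({"pending", "crashloopbackoff", "imagepullbackoff"}, "Pod Lifecycle Troubleshooting"),
]


def start_section(symptom: str) -> Tuple[str, str]:
    """Return (file, section) for a free-text symptom description."""
    words = set(re.findall(r"[a-z0-9]+", symptom.lower()))
    for keywords, section in ROUTES:
        if words & keywords:
            return WORKFLOW, section
    # Nothing matched: fall back to the end-to-end path.
    return WORKFLOW, "General Debugging Workflow"


if __name__ == "__main__":
    print(start_section("PVC is pending and pods cannot mount storage."))
```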
## Scripts Overview

| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
| --- | --- | --- | --- | --- | --- |
| `./scripts/cluster_health.sh` | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | `--strict`, `K8S_REQUEST_TIMEOUT` env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| `./scripts/network_debug.sh` | Pod-centric network and DNS diagnostics | `<pod-name>` (`<namespace>` defaults to `default`) | `--strict`, `--insecure`, `K8S_REQUEST_TIMEOUT` env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit `--insecure` |
| `./scripts/pod_diagnostics.py` | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | `<pod-name>` | `-n/--namespace`, `-o/--output` | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |

### Script Exit Codes

`./scripts/cluster_health.sh` and `./scripts/network_debug.sh` share the same contract:

- `0`: checks completed with no check failures (warnings allowed unless `--strict` is set).
- `1`: one or more checks failed, or warnings occurred in `--strict` mode.
- `2`: blocked preconditions (for example: missing `kubectl`, no active context, inaccessible namespace/pod).

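A wrapper that calls these scripts may want to branch on the exit-code contract. The sketch below is one possible reading of it: the `ScriptExit` enum and `next_step` helper are hypothetical names, and the suggested follow-up actions are assumptions derived from the three bullet points, not behavior of the scripts themselves.

```python
from enum import IntEnum


class ScriptExit(IntEnum):
    """Exit codes shared by cluster_health.sh and network_debug.sh."""
    OK = 0             # no check failures (warnings allowed unless --strict)
    CHECKS_FAILED = 1  # failed checks, or warnings in --strict mode
    BLOCKED = 2        # preconditions not met (kubectl, context, access)


def next_step(returncode: int) -> str:
    """Suggest a follow-up action for a script's exit code (illustrative only)."""
    try:
        status = ScriptExit(returncode)
    except ValueError:
        return "unexpected exit code; inspect the script output"
    if status is ScriptExit.OK:
        return "no failed checks; continue with verification"
    if status is ScriptExit.CHECKS_FAILED:
        return "review the failed checks in the report summary"
    return "fix kubectl, context, or access before rerunning"


if __name__ == "__main__":
    for code in (0, 1, 2, 7):
        print(code, "->", next_step(code))
```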
## Deterministic Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

### 1. Preflight and Scope

    kubectl config current-context
    kubectl get ns
    kubectl auth can-i get pods -n <namespace>

If preflight fails, stop and fix access/context first.

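The stop-on-first-failure rule for preflight can be sketched as a short loop. This is an illustration under stated assumptions: `run_quiet`, `preflight_commands`, and `first_failure` are hypothetical helpers, and a non-zero exit code from any of the three commands is treated as a failed preflight.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

Runner = Callable[[Sequence[str]], int]


def run_quiet(argv: Sequence[str]) -> int:
    """Run a command, discard its output, and return the exit code."""
    try:
        return subprocess.run(
            list(argv),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        ).returncode
    except FileNotFoundError:
        return 127  # the binary (for example kubectl) is not on PATH


def preflight_commands(namespace: str) -> List[List[str]]:
    """The three preflight commands from step 1, as argument lists."""
    return [
        ["kubectl", "config", "current-context"],
        ["kubectl", "get", "ns"],
        ["kubectl", "auth", "can-i", "get", "pods", "-n", namespace],
    ]


def first_failure(namespace: str, runner: Runner = run_quiet) -> Optional[List[str]]:
    """Return the first failing preflight command, or None if all pass."""
    for argv in preflight_commands(namespace):
        if runner(argv) != 0:
            return argv  # stop here: fix access/context before going on
    return None
```

Passing the runner as a parameter keeps the loop testable without a cluster; calling `first_failure("payments")` with the default runner executes the real commands.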
### 2. Identify the Problem Layer

Categorize the issue:

- **Application Layer**: Application crashes, errors, bugs
- **Pod Layer**: Pod not starting, restarting, or pending
- **Service Layer**: Network connectivity, DNS issues
- **Node Layer**: Node not ready, resource exhaustion
- **Cluster Layer**: Control plane issues, API problems
- **Storage Layer**: Volume mount failures, PVC issues
- **Configuration Layer**: ConfigMap, Secret, RBAC issues

### 3. Gather Diagnostics with the Right Script

Use the appropriate diagnostic script based on scope:

#### Pod-Level Diagnostics

Use `./scripts/pod_diagnostics.py` for comprehensive pod analysis:

    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>

This script gathers:

- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration

Output can be saved for analysis:

    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt

#### Cluster-Level Health Check

Use `./scripts/cluster_health.sh` for overall cluster diagnostics:

    ./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

This script checks:

- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)

#### Network Diagnostics

Use `./scripts/network_debug.sh` for connectivity issues:

    ./scripts/network_debug.sh <namespace> <pod-name>

Or force warning sensitivity / insecure TLS only when explicitly needed:

    ./scripts/network_debug.sh --strict <namespace> <pod-name>
    ./scripts/network_debug.sh --insecure <namespace> <pod-name>

This script analyzes:

- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs

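Choosing a script by scope can be expressed as a small dispatcher that returns the command to run. The sketch below mirrors the invocations shown in this step; `diagnostic_command` and its `scope` values are hypothetical, and combining `--strict` with `--insecure` in one call is an assumption, since the examples only show the flags separately.

```python
from typing import List, Optional


def diagnostic_command(
    scope: str,
    namespace: Optional[str] = None,
    pod_name: Optional[str] = None,
    strict: bool = False,
    insecure: bool = False,
    output: Optional[str] = None,
) -> List[str]:
    """Return the argv for the diagnostic script that matches the scope."""
    if scope == "cluster":
        argv = ["./scripts/cluster_health.sh"]
        if strict:
            argv.append("--strict")
        return argv
    if namespace is None or pod_name is None:
        raise ValueError("pod and network scopes need namespace and pod_name")
    if scope == "pod":
        argv = ["python3", "./scripts/pod_diagnostics.py", pod_name, "-n", namespace]
        if output is not None:
            argv += ["-o", output]
        return argv
    if scope == "network":
        argv = ["./scripts/network_debug.sh"]
        if strict:
            argv.append("--strict")
        if insecure:
            argv.append("--insecure")  # only when explicitly needed
        return argv + [namespace, pod_name]
    raise ValueError(f"unknown scope: {scope!r}")


if __name__ == "__main__":
    print(diagnostic_command("network", "checkout", "checkout-api-75f49c9d8f-z6qtm"))
```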
### 4. Follow Issue-Specific Reference Workflow

Based on the identified issue, consult `./references/troubleshooting_workflow.md`:

- **Pod Pending**: Resource/scheduling workflow
- **CrashLoopBackOff**: Application crash workflow
- **ImagePullBackOff**: Image pull workflow
- **Service issues**: Network connectivity workflow
- **DNS failures**: DNS troubleshooting workflow
- **Resource exhaustion**: Performance investigation workflow
- **Storage issues**: PVC binding workflow
- **Deployment stuck**: Rollout workflow

### 5. Apply Targeted Fixes

Refer to `./references/common_issues.md` for symptom-specific fixes.

### 6. Verify and Close

Run final verification:

    kubectl get pods -n <namespace> -o wide
    kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
    kubectl rollout status deployment/<name> -n <namespace>

Issue is done when user-visible behavior is healthy and no new critical warning events appear.

## Example Flows

### Example 1: CrashLoopBackOff in `payments` Namespace

    python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
    kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
    kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe

Then open `./references/common_issues.md` and apply the CrashLoopBackOff solutions.

### Example 2: Service DNS/Connectivity Failure

    ./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
    kubectl get svc checkout-api -n checkout
    kubectl get endpoints checkout-api -n checkout
    kubectl get networkpolicies -n checkout

Then follow Service Connectivity Workflow in `./references/troubleshooting_workflow.md`.

## Essential Manual Commands

### Pod Debugging

    # View pod status
    kubectl get pods -n <namespace> -o wide

    # Detailed pod information
    kubectl describe pod <pod-name> -n <namespace>

    # View logs
    kubectl logs <pod-name> -n <namespace>
    kubectl logs <pod-name> -n <namespace> --previous  # Previous container
    kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

    # Execute commands in pod
    kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

    # Get pod YAML
    kubectl get pod <pod-name> -n <namespace> -o yaml

### Service and Network Debugging

    # Check services
    kubectl get svc -n <namespace>
    kubectl describe svc <service-name> -n <namespace>

    # Check endpoints
    kubectl get endpoints -n <namespace>

    # Test DNS
    kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

    # View events
    kubectl get events -n <namespace> --sort-by='.lastTimestamp'

### Resource Monitoring

    # Node resources
    kubectl top nodes
    kubectl describe nodes

    # Pod resources
    kubectl top pods -n <namespace>
    kubectl top pod <pod-name> -n <namespace> --containers

### Emergency Operations

    # Restart deployment
    kubectl rollout restart deployment/<name> -n <namespace>

    # Rollback deployment
    kubectl rollout undo deployment/<name> -n <namespace>

    # Force delete stuck pod
    kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

    # Drain node (maintenance)
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

    # Cordon node (prevent scheduling)
    kubectl cordon <node-name>

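Most of the emergency operations above are on the list of commands that the safety rules gate behind explicit confirmation. A wrapper could check for them before executing anything. The sketch below is one way to do that, with assumptions stated: `needs_confirmation` is a hypothetical helper, it expects the verb right after `kubectl` (global flags placed before the verb are not handled), and it treats any `kubectl delete ... --force` as disruptive.

```python
from typing import Sequence


def needs_confirmation(argv: Sequence[str]) -> bool:
    """Return True if argv matches a command gated by the safety rules."""
    if len(argv) < 2 or argv[0] != "kubectl":
        return False
    verb, rest = argv[1], list(argv[2:])
    if verb == "drain":
        return True
    if verb == "rollout":
        return bool(rest) and rest[0] in ("restart", "undo")
    if verb == "delete":
        # Assumption: any forced delete counts, not only --grace-period=0.
        return "--force" in rest
    if verb == "debug":
        return any(a == "--copy-to" or a.startswith("--copy-to=") for a in rest)
    return False


if __name__ == "__main__":
    cmd = ["kubectl", "drain", "node-1", "--ignore-daemonsets"]
    if needs_confirmation(cmd):
        print("Confirm blast radius and rollback before running:", " ".join(cmd))
```

Note that `kubectl cordon` is not on the confirmation list, so this check lets it through.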
## Completion Criteria

Troubleshooting session is complete when all are true:

- Cluster context and namespace are confirmed.
- Relevant diagnostic script output is captured.
- Root cause is identified and tied to evidence (events/logs/config/state).
- Any disruptive action was preceded by snapshot and rollback plan.
- Fix verification commands show healthy state.
- Reference path used (`./references/troubleshooting_workflow.md` or `./references/common_issues.md`) is documented in notes.

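The check that no new warning events appeared after a fix can be automated against `kubectl get events -n <namespace> -o json`. The following sketch makes several assumptions: `new_warnings` and `parse_k8s_time` are hypothetical helpers, every event of type `Warning` is treated as relevant (the skill does not define which warnings are critical), and events without `lastTimestamp` fall back to `eventTime`.

```python
import json
from datetime import datetime, timezone
from typing import List


def parse_k8s_time(value: str) -> datetime:
    """Parse the UTC timestamp formats used in Event objects."""
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S.%fZ"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value!r}")


def new_warnings(events_json: str, since: datetime) -> List[str]:
    """Return 'reason: message' for Warning events newer than `since`."""
    found = []
    for item in json.loads(events_json).get("items", []):
        if item.get("type") != "Warning":
            continue
        stamp = item.get("lastTimestamp") or item.get("eventTime")
        if stamp is None:
            continue  # no usable timestamp; cannot tell if it is new
        if parse_k8s_time(stamp) > since:
            found.append(f'{item.get("reason", "?")}: {item.get("message", "")}')
    return found
```

A possible usage is to record the time the fix was applied, pipe the events JSON into `new_warnings`, and treat an empty list as one input to the "healthy state" criterion alongside the rollout and pod status checks.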
## Related Tools

Useful additional tools for Kubernetes debugging:

- **kubectl-debug**: Advanced debugging plugin
- **stern**: Multi-pod log tailing
- **kubectx/kubens**: Context and namespace switching
- **k9s**: Terminal UI for Kubernetes
- **lens**: Desktop IDE for Kubernetes
- **Prometheus/Grafana**: Monitoring and alerting
- **Jaeger/Zipkin**: Distributed tracing