k8s-debug

安装量: 42
排名: #17208

安装

npx skills add https://github.com/akin-ozer/cc-devops-skills --skill k8s-debug

Kubernetes Debugging Skill Overview Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow. Trigger Phrases Use this skill when requests resemble: "My pod is in CrashLoopBackOff ; help me find the root cause." "Service DNS works in one pod but not another." "Deployment rollout is stuck." "Pods are Pending and not scheduling." "Cluster health looks degraded after a change." "PVC is pending and pods cannot mount storage." Prerequisites Run from the skill directory ( devops-skills-plugin/skills/k8s-debug ) so relative script paths work as written. Required kubectl installed and configured. An active cluster context. Read access to namespaces, pods, events, services, and nodes. Quick preflight: kubectl config current-context kubectl auth can-i get pods -A kubectl auth can-i get events -A kubectl get ns Optional but Recommended jq for more precise filtering in ./scripts/cluster_health.sh . Metrics API ( metrics-server ) for kubectl top . In-container debug tools ( nslookup , getent , curl , wget , ip ) for deep network tests. Fallback behavior: If optional tools are missing, scripts continue and print warnings with reduced output. If kubectl top is unavailable, continue with kubectl describe and events. When to Use This Skill Use this skill for: Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled) Service connectivity or DNS resolution issues Network policy or ingress problems Volume and storage mount failures Deployment rollout issues Cluster health or performance degradation Resource exhaustion (CPU/memory) Configuration problems (ConfigMaps, Secrets, RBAC) Safety Rules for Disruptive Commands Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback. Commands requiring explicit confirmation: kubectl delete pod ... --force --grace-period=0 kubectl drain ... kubectl rollout restart ... kubectl rollout undo ... kubectl debug ... --copy-to=... Before disruptive actions:

Snapshot current state for rollback and incident notes

kubectl get deploy,rs,pod,svc
-n
<
namespace
>
-o
wide
kubectl get pod
<
pod-name
>
-n
<
namespace
>
-o
yaml
>
before-
<
pod-name
>
.yaml
kubectl get events
-n
<
namespace
>
--sort-by
=
'.lastTimestamp'
>
before-events.txt
Reference Navigation Map
Load only the section needed for the observed symptom.
Symptom / Need
Open
Start section
You need an end-to-end diagnosis path
./references/troubleshooting_workflow.md
General Debugging Workflow
Pod state is
Pending
,
CrashLoopBackOff
, or
ImagePullBackOff
./references/troubleshooting_workflow.md
Pod Lifecycle Troubleshooting
Service reachability or DNS failure
./references/troubleshooting_workflow.md
Network Troubleshooting Workflow
Node pressure or performance regression
./references/troubleshooting_workflow.md
Resource and Performance Workflow
PVC / PV / storage class issues
./references/troubleshooting_workflow.md
Storage Troubleshooting Workflow
Quick symptom-to-fix lookup
./references/common_issues.md
matching issue heading
Post-mortem fix options for known issues
./references/common_issues.md
Solutions
sections
Scripts Overview
Script
Purpose
Required args
Optional args
Output
Fallback behavior
./scripts/cluster_health.sh
Cluster-wide health snapshot (nodes, workloads, events, common failure states)
None
--strict
,
K8S_REQUEST_TIMEOUT
env var
Sectioned report to stdout
Continues on check failures, tracks them in summary and exit code
./scripts/network_debug.sh
Pod-centric network and DNS diagnostics
(
defaults to
default
)
--strict
,
--insecure
,
K8S_REQUEST_TIMEOUT
env var
Sectioned report to stdout
Uses secure API probe by default; insecure TLS requires explicit
--insecure
./scripts/pod_diagnostics.py
Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context)
-n/--namespace
,
-o/--output
Sectioned report to stdout or file
Fails fast on missing access; skips optional metrics/log blocks with clear messages
Script Exit Codes
./scripts/cluster_health.sh
and
./scripts/network_debug.sh
share the same contract:
0
checks completed with no check failures (warnings allowed unless
--strict
is set).
1
one or more checks failed, or warnings occurred in
--strict
mode.
2
blocked preconditions (for example: missing
kubectl
, no active context, inaccessible namespace/pod).
Deterministic Debugging Workflow
Follow this systematic approach for any Kubernetes issue:
1. Preflight and Scope
kubectl config current-context
kubectl get ns
kubectl auth can-i get pods
-n
<
namespace
>
If preflight fails, stop and fix access/context first.
2. Identify the Problem Layer
Categorize the issue:
Application Layer
Application crashes, errors, bugs
Pod Layer
Pod not starting, restarting, or pending
Service Layer
Network connectivity, DNS issues
Node Layer
Node not ready, resource exhaustion
Cluster Layer
Control plane issues, API problems
Storage Layer
Volume mount failures, PVC issues
Configuration Layer
ConfigMap, Secret, RBAC issues 3. Gather Diagnostics with the Right Script Use the appropriate diagnostic script based on scope: Pod-Level Diagnostics Use ./scripts/pod_diagnostics.py for comprehensive pod analysis: python3 ./scripts/pod_diagnostics.py < pod-name

-n < namespace

This script gathers: Pod status and description Pod events Container logs (current and previous) Resource usage Node information YAML configuration Output can be saved for analysis: python3 ./scripts/pod_diagnostics.py < pod-name

-n < namespace

-o diagnostics.txt Cluster-Level Health Check Use ./scripts/cluster_health.sh for overall cluster diagnostics: ./scripts/cluster_health.sh

cluster-health- $( date +%Y%m%d-%H%M%S ) .txt This script checks: Cluster info and version Node status and resources Pods across all namespaces Failed/pending pods Recent events Deployments, services, statefulsets, daemonsets PVCs and PVs Component health Common error states (CrashLoopBackOff, ImagePullBackOff) Network Diagnostics Use ./scripts/network_debug.sh for connectivity issues: ./scripts/network_debug.sh < namespace

< pod-name

or force warning sensitivity / insecure TLS only when explicitly needed:

./scripts/network_debug.sh
--strict
<
namespace
>
<
pod-name
>
./scripts/network_debug.sh
--insecure
<
namespace
>
<
pod-name
>
This script analyzes:
Pod network configuration
DNS setup and resolution
Service endpoints
Network policies
Connectivity tests
CoreDNS logs
4. Follow Issue-Specific Reference Workflow
Based on the identified issue, consult
./references/troubleshooting_workflow.md
:
Pod Pending
Resource/scheduling workflow
CrashLoopBackOff
Application crash workflow
ImagePullBackOff
Image pull workflow
Service issues
Network connectivity workflow
DNS failures
DNS troubleshooting workflow
Resource exhaustion
Performance investigation workflow
Storage issues
PVC binding workflow
Deployment stuck
Rollout workflow 5. Apply Targeted Fixes Refer to ./references/common_issues.md for symptom-specific fixes. 6. Verify and Close Run final verification: kubectl get pods -n < namespace

-o wide kubectl get events -n < namespace

--sort-by

'.lastTimestamp' | tail -20 kubectl rollout status deployment/ < name

-n < namespace

Issue is done when user-visible behavior is healthy and no new critical warning events appear. Example Flows Example 1: CrashLoopBackOff in payments Namespace python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail = 100 kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions. Example 2: Service DNS/Connectivity Failure ./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm kubectl get svc checkout-api -n checkout kubectl get endpoints checkout-api -n checkout kubectl get networkpolicies -n checkout Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md . Essential Manual Commands Pod Debugging

View pod status

kubectl get pods -n < namespace

-o wide

Detailed pod information

kubectl describe pod < pod-name

-n < namespace

View logs

kubectl logs < pod-name

-n < namespace

kubectl logs < pod-name

-n < namespace

--previous

Previous container

kubectl logs < pod-name

-n < namespace

-c < container

Specific container

Execute commands in pod

kubectl exec < pod-name

-n < namespace

-it -- /bin/sh

Get pod YAML

kubectl get pod < pod-name

-n < namespace

-o yaml Service and Network Debugging

Check services

kubectl get svc -n < namespace

kubectl describe svc < service-name

-n < namespace

Check endpoints

kubectl get endpoints -n < namespace

Test DNS

kubectl exec < pod-name

-n < namespace

-- nslookup kubernetes.default

View events

kubectl get events -n < namespace

--sort-by

'.lastTimestamp' Resource Monitoring

Node resources

kubectl top nodes kubectl describe nodes

Pod resources

kubectl top pods -n < namespace

kubectl top pod < pod-name

-n < namespace

--containers Emergency Operations

Restart deployment

kubectl rollout restart deployment/ < name

-n < namespace

Rollback deployment

kubectl rollout undo deployment/ < name

-n < namespace

Force delete stuck pod

kubectl delete pod < pod-name

-n < namespace

--force --grace-period = 0

Drain node (maintenance)

kubectl drain < node-name

--ignore-daemonsets --delete-emptydir-data

Cordon node (prevent scheduling)

kubectl cordon
<
node-name
>
Completion Criteria
Troubleshooting session is complete when all are true:
Cluster context and namespace are confirmed.
Relevant diagnostic script output is captured.
Root cause is identified and tied to evidence (events/logs/config/state).
Any disruptive action was preceded by snapshot and rollback plan.
Fix verification commands show healthy state.
Reference path used (
./references/troubleshooting_workflow.md
or
./references/common_issues.md
) is documented in notes.
Related Tools
Useful additional tools for Kubernetes debugging:
kubectl-debug
Advanced debugging plugin
stern
Multi-pod log tailing
kubectx/kubens
Context and namespace switching
k9s
Terminal UI for Kubernetes
lens
Desktop IDE for Kubernetes
Prometheus/Grafana
Monitoring and alerting
Jaeger/Zipkin
Distributed tracing
返回排行榜