Installation
npx skills add https://github.com/dynatrace/dynatrace-for-ai --skill dt-obs-kubernetes
Infrastructure Kubernetes
Monitor and analyze Kubernetes infrastructure using Dynatrace DQL. Query cluster resources, monitor workload health, analyze pod placement, optimize costs, and assess security posture.
When to Use This Skill
- Monitoring Kubernetes cluster health and capacity
- Analyzing pod and container resource utilization
- Investigating pod failures, OOMKills, evictions, or crash loops
- Debugging degraded deployments, stuck rollouts, or node pressure
- Optimizing Kubernetes resource costs
- Assessing security posture and compliance
- Troubleshooting workload scheduling and placement
- Auditing ingress routing and network policies
Reference Files (file: contents)
- references/cluster-inventory.md: Clusters, namespaces, resource distribution
- references/labels-annotations.md: Labels, annotations, k8s.object parsing patterns
- references/pod-node-placement.md: Node selectors, affinity, taints, HA scheduling
- references/pod-debugging.md: Exit codes, pod conditions, init containers, image pull errors, logs, service→pod drill-down
- references/workload-health.md: Degraded deployments, stuck rollouts, node conditions, CPU throttling, HPA, StatefulSet ordering
- references/pv-pvc.md: PVC/PV lifecycle, phase reference, orphaned volumes, StorageClass
- references/ingress.md: Routing rule parsing, TLS audit
- references/network-policies.md: Policy listing, namespace isolation audit
Key Concepts
Entity Types
- Workloads: K8S_DEPLOYMENT, K8S_STATEFULSET, K8S_DAEMONSET, K8S_JOB, K8S_CRONJOB, K8S_HORIZONTALPODAUTOSCALER
- Infrastructure: K8S_CLUSTER, K8S_NAMESPACE, K8S_NODE, K8S_POD
- Configuration: K8S_SERVICE, K8S_CONFIGMAP, K8S_SECRET, K8S_PERSISTENTVOLUMECLAIM, K8S_PERSISTENTVOLUME, K8S_INGRESS, K8S_NETWORKPOLICY
Query Types
- smartscapeNodes (query K8s entities):
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production"
| fields k8s.cluster.name, k8s.pod.name
- timeseries (monitor metrics over time):
timeseries cpu = sum(dt.kubernetes.container.cpu_usage),
by: {k8s.pod.name} // example grouping dimension
| fieldsAdd avg_cpu = arrayAvg(cpu)
- fetch logs (analyze log events):
fetch logs
| filter k8s.namespace.name == "production" and loglevel == "ERROR"
Core Fields
- k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
- k8s.workload.name, k8s.workload.kind, k8s.container.name
- k8s.object: Full JSON configuration for deep inspection
- tags[label]: Access labels and annotations
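As a rough sketch of how these fields combine, the query below counts pods per workload and surfaces one label next to the result. It assumes the pods carry an `app` label (a hypothetical key chosen for illustration) and that the label is reachable through the tags[label] pattern listed above; the "production" namespace is only an example value.

```dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production"
| fieldsAdd app_label = tags[app] // "app" is an assumed label key
| summarize pod_count = count(),
    by: {k8s.cluster.name, k8s.workload.kind, k8s.workload.name, app_label}
| sort pod_count desc
```

Grouping by k8s.workload.kind as well as k8s.workload.name keeps workloads of different kinds that share a name in separate rows.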
Available Metrics
- CPU: dt.kubernetes.container.cpu_usage, cpu_throttled, limits_cpu, requests_cpu
- Memory: dt.kubernetes.container.memory_working_set, limits_memory, requests_memory
- Operations: dt.kubernetes.container.restarts, oom_kills
- Node: dt.kubernetes.node.pods_allocatable, cpu_allocatable, memory_allocatable, dt.kubernetes.pods
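The memory metrics can be combined in the same way as the CPU metrics. The following is a sketch only: it assumes the short name limits_memory expands to dt.kubernetes.container.limits_memory (sharing the prefix of the first metric in its group), and the 90% threshold is an arbitrary illustration rather than a recommendation from this skill.

```dql
timeseries {
  memory_used = avg(dt.kubernetes.container.memory_working_set),
  memory_limit = avg(dt.kubernetes.container.limits_memory) // assumed full metric key
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd memory_pct = (arrayAvg(memory_used) / arrayAvg(memory_limit)) * 100
| filter memory_pct > 90 // illustrative threshold
| sort memory_pct desc
```

Pods that run close to their memory limit are likely candidates for the OOMKill investigation described under Troubleshooting Pod Issues.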
Entity Disambiguation
K8S_POD vs CONTAINER: these are different entity types in Dynatrace.
- K8S_POD: K8s-native entities with k8s.object JSON, scheduling state, conditions, and K8s metrics. Use this skill.
- CONTAINER: Host-level container inventory (image, lifetime, host assignment). Use the dt-obs-hosts skill instead.
The smartscape edge is CONTAINER --(is_part_of)--> K8S_POD. To reach containers from a pod, traverse backward:
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production" // example namespace; substitute your own
| traverse edgeTypes: {is_part_of}, targetTypes: {CONTAINER}, direction: backward, fieldsKeep: {id}
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container.id=id
Service → K8S_POD Correlation
No direct smartscape edge exists between SERVICE and K8S_POD. The correlation key is the shared dimension k8s.workload.name. See "Service → Pod Drill-Down" in references/pod-debugging.md for the full two-step pattern.
Common Workflows
1. Cluster Health Check
List all clusters:
smartscapeNodes K8S_CLUSTER
| fields k8s.cluster.name, k8s.cluster.version, k8s.cluster.distribution
Check node capacity:
timeseries {
current_pods = avg(dt.kubernetes.pods),
max_pods = avg(dt.kubernetes.node.pods_allocatable)
}, by: {k8s.node.name, k8s.cluster.name}
| fieldsAdd pod_capacity_pct = (arrayAvg(current_pods) / arrayAvg(max_pods)) * 100
| filter pod_capacity_pct > 80
Identify pods in non-Running state:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| filter phase != "Running"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, phase
2. Resource Optimization
Find over-provisioned pods (average CPU usage below 30% of CPU requests):
timeseries {
cpu_usage = sum(dt.kubernetes.container.cpu_usage),
cpu_requests = avg(dt.kubernetes.container.requests_cpu)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd usage_pct = (arrayAvg(cpu_usage) / arrayAvg(cpu_requests)) * 100
| filter usage_pct < 30 and arrayAvg(cpu_requests) > 0
Identify containers without limits:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
container_name = container[name],
cpu_limit = container[resources][limits][cpu],
memory_limit = container[resources][limits][memory]
| filter isNull(cpu_limit) or isNull(memory_limit)
3. Troubleshooting Pod Issues
Find pods with OOMKills:
timeseries oom_kills = sum(dt.kubernetes.container.oom_kills),
by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| filter arraySum(oom_kills) > 0
| fieldsAdd total_oom_kills = arraySum(oom_kills)
| sort total_oom_kills desc
Analyze pod restart patterns:
timeseries restarts = sum(dt.kubernetes.container.restarts),
by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts > 5
4. Security Assessment
Identify privileged containers:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
container_name = container[name],
privileged = container[securityContext][privileged]
| filter privileged == true
Find containers running as root:
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
container_name = container[name],
run_as_user = container[securityContext][runAsUser],
run_as_non_root = container[securityContext][runAsNonRoot]
| filter (isNull(run_as_user) or run_as_user == 0) and (isNull(run_as_non_root) or run_as_non_root != true)
5. Scheduling Analysis
Verify pod distribution (HA compliance):
smartscapeNodes K8S_POD
| filter k8s.workload.kind == "deployment"
| summarize pod_count = count(),
node_count = countDistinct(k8s.node.name),
by: {k8s.cluster.name, k8s.namespace.name, k8s.workload.name}
| fieldsAdd ha_compliant = node_count > 1
| filter pod_count >= 2 and not ha_compliant
6. DAVIS Problems Affecting K8s Entities
Find active DAVIS problems affecting K8s entities:
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| fields display_id, event.name, event.category, smartscape.affected_entity.ids
Use the entries in smartscape.affected_entity.ids (an array of Smartscape IDs) to look up each affected entity by its Smartscape ID.
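One way to prepare that lookup is to flatten the ID array so each affected entity gets its own row. This is a sketch under the assumption that expand can be applied to smartscape.affected_entity.ids in the same way it is applied to config[spec][containers] elsewhere in this skill; entity_id is a name introduced here for illustration.

```dql
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| expand entity_id = smartscape.affected_entity.ids // one row per affected entity
| fields display_id, event.name, event.category, entity_id
```

Each entity_id value can then be used as the key when querying the matching entity type with smartscapeNodes.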
Best Practices
Query Performance
- Filter early: Apply cluster/namespace filters immediately
- Use specific entity types: Avoid wildcards
- Limit result sets: Use limit for exploration
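The three points above can be sketched in a single exploratory query. The cluster name "my-cluster" and the row count of 20 are placeholder values, not taken from this skill.

```dql
smartscapeNodes K8S_POD // specific entity type instead of a wildcard
| filter k8s.cluster.name == "my-cluster" and k8s.namespace.name == "production" // filter first
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name
| limit 20 // cap the result set while exploring
```

Placing the filter directly after the data source means later steps only process the rows that matter.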
Monitoring Recommendations
- Set resource limits on all containers
- Monitor OOMKills and adjust memory limits
- Track CPU throttling and adjust CPU limits
- Review resource efficiency regularly (target 70-80% utilization)
- Implement security best practices (non-root, read-only filesystem)
- Use specific image tags (avoid :latest)
Configuration Standards
- Use labels for organization (app, environment, team)
- Set resource requests and limits
- Configure health checks (liveness/readiness probes)
- Use TLS for all ingress resources
- Document with annotations
Limitations
Unavailable Metrics:
- Pod network metrics (rx_bytes, tx_bytes) are NOT available in Grail
- Workaround: Use service mesh metrics or host-level network metrics
Query Considerations:
- Minimize result set size: Do not include the k8s.object field unless it is needed
- Keep the result set as simple as possible: Parsing k8s.object increases query complexity
- Large clusters may require pagination or time-range limits
- Some K8s status fields update asynchronously
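When k8s.object does have to be parsed, one option for keeping the result small is to extract only the value of interest and aggregate, so the raw JSON never reaches the output. A minimal sketch, using "production" as an example namespace:

```dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production" // narrow the set before parsing
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| summarize pod_count = count(), by: {k8s.cluster.name, phase} // k8s.object is not carried forward
```

The output is one row per cluster and phase instead of one row per pod.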