dt-obs-problems

安装量: 225
排名: #9503

安装

npx skills add https://github.com/dynatrace/dynatrace-for-ai --skill dt-obs-problems
Problem Analysis Skill
Analyze Dynatrace AI-detected problems including root cause identification, impact assessment, and correlation with logs and metrics.
Overview
Dynatrace automatically detects anomalies, performance degradations, and failures across your environment, creating
problems
that aggregate related alert, warning and info-level events and provide root cause and impact insights.
What are Problems?
Problems are automatically detected, software and infrastructure health and resilience issues that:
Automatically correlate
related alert, warning, and info-level events across services, infrastructure, frontend applications, and user sessions
Identify root causes
using causal analysis of Smartscape dependencies
Assess business impact
by tracking affected users and services
Reduce alert noise
by grouping related symptoms into single problems that share the same root cause and impact
Track problem lifecycle
from early detection through resolution
Event Kinds
The
event.kind
field (stable, permission) identifies the high-level event type:
event.kind
value
Description
DAVIS_EVENT
Davis-detected infrastructure/application events
BIZ_EVENT
Business events (ingested via API or captured from spans)
RUM_EVENT
Real User Monitoring events
AUDIT_EVENT
Administrative/security audit events
event.provider
(stable, permission) identifies the event source.
Problem Categories
Common
event.category
values:
Category
Description
Example
AVAILABILITY
Infrastructure or service unavailable
Web service returns no data, synthetic test actively fails, database connection lost
ERROR
Increased error rates beyond baseline
API error rate jumped from 0.1% to 15%
SLOWDOWN
Performance degradation
Response time increased from 200ms to 5000ms
RESOURCE
Resource saturation
Container memory at 95%, causing OOM kills
CUSTOM
Custom anomaly detections
Business KPI (orders/minute) dropped below threshold
Problem Lifecycle
Detection → ACTIVE → Under Investigation → CLOSED
ACTIVE
Currently occurring issues requiring attention
CLOSED
Resolved issues used for historical analysis
Essential Fields
Common Field Name Mistakes
❌ WRONG
✅ CORRECT
Description
title
event.name
Problem title/description
status
event.status
Problem lifecycle status
severity
event.category
Problem type/category
start
event.start
Problem start time
Correct Status Values
// ✅ CORRECT: Use these status values
fetch dt.davis.problems
| filter event.status == "ACTIVE" // Currently occurring problems
// or event.status == "CLOSED" // Resolved problems
// ❌ INCORRECT: event.status == "OPEN" does not exist!
| limit 1
Key Fields Reference
fetch dt.davis.problems, from:now() - 1h
| filter not(dt.davis.is_duplicate)
| fields
event.start, // Problem start timestamp
event.end, // Problem end timestamp (if closed)
display_id, // Human-readable problem ID (P-XXXXX)
event.name, // Problem title
event.description, // Detailed description
event.category, // Problem type
event.status, // ACTIVE or CLOSED
dt.smartscape_source.id, // The smartscape ID for the affected resource
dt.davis.affected_users_count, // Number of affected users
smartscape.affected_entity.ids, // Array of affected entity IDs
dt.smartscape.service, // Affected services (may be array)
dt.davis.root_cause_entity, // Entity identified as root cause
root_cause_entity_id, // Root cause entity ID
root_cause_entity_name, // Human-readable root cause name
dt.davis.is_duplicate, // Whether duplicate detection
dt.davis.is_rootcause // Root cause vs. symptom
| limit 10
Standard Query Pattern
Always start problem queries with this foundation:
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20
Key components:
fetch dt.davis.problems
- The problems data source
not(dt.davis.is_duplicate)
- Filter out duplicate detections
event.status == "ACTIVE"
- Show only active problems
Time range - Always specify a reasonable window
Common Query Patterns
Active Problems by Category
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| summarize problem_count = count(), by:
| sort problem_count desc
High-Impact Active Problems (affecting many users)
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter dt.davis.affected_users_count > 100
| fields event.start, display_id, event.name, dt.davis.affected_users_count, event.category
| sort dt.davis.affected_users_count desc
High-Impact Active Problems (affecting many smartscape entities)
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter arraySize(affected_entity_ids) > 5
| fields event.start, display_id, event.name, affected_entity_ids, event.category, impacted_entity_count = arraySize(affected_entity_ids)
| sort impacted_entity_count desc
Specific Problem Details
fetch dt.davis.problems
| filter display_id == "P-XXXXXXXXXX"
| fields event.start, event.end, event.name, event.description, affected_entity_ids, dt.davis.affected_users_count, root_cause_entity_id, root_cause_entity_name
Service-Specific Problem History
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter in(dt.entity.service, "SERVICE-XXXXXXXXX") or in(dt.smartscape.service, toSmartscapeId("SERVICE-XXXXXXXXX"))
| summarize problems = count(), by:
Important: Entity Filter DO and DON'T
DO
use array-safe filters and include both deprecated and Smartscape service fields when filtering by service ID:
| filter in(dt.entity.service, "SERVICE-00E66996F1555897") or in(dt.smartscape.service, toSmartscapeId("SERVICE-00E66996F1555897"))
DON'T
use scalar equality on service fields or only one field variant:
// Wrong: not array-safe and misses Smartscape-only matches
| filter dt.entity.service == "SERVICE-00E66996F1555897"
Root Cause Analysis Patterns
Basic Root Cause Query
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields
display_id,
event.name,
event.description,
root_cause_entity_id,
root_cause_entity_name,
smartscape.affected_entity.ids
Root Cause by Entity Type
Identify which entity types most frequently cause problems:
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize problem_count = count(), by:{root_cause_entity_name}
| sort problem_count desc
| limit 20
Affected entity is an AWS resource
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(arrayToString(smartscape.affected_entity.types, delimiter:","), "AWS_")
Infrastructure Root Cause with Service Impact
fetch dt.davis.problems, from:now() - 30m
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(root_cause_entity_id, "HOST-")
| filter isNotNull(dt.smartscape.service)
| fields display_id, event.name, root_cause_entity_name, dt.smartscape.service
Problem Blast Radius
Calculate entity impact per root cause:
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| fieldsAdd affected_count = arraySize(smartscape.affected_entity.ids)
| summarize
avg_affected = avg(affected_count),
max_affected = max(affected_count),
problem_count = count(),
by:{root_cause_entity_name}
| sort avg_affected desc
Recurring Root Causes
Identify entities repeatedly causing problems:
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize
problem_count = count(),
first_occurrence = min(event.start),
last_occurrence = max(event.start),
by:{root_cause_entity_id, root_cause_entity_name}
| filter problem_count > 3
| sort problem_count desc
Problem Trending and Pattern Analysis
Track problem trends over time, identify recurring issues, and analyze resolution performance.
Primary Files:
references/problem-trending.md
- Timeseries analysis and pattern detection
Common Use Cases:
Active problems over time with
makeTimeseries
Problem creation rate by category
Recurring problem detection by schedule
Resolution time trends and P95 duration analysis
Key Techniques:
makeTimeseries
vs
bin()
Choose the right approach for lifecycle spans vs discrete events
NULL handling
Use
coalesce(event.end, now())
for active problems
Peak hours analysis
Identify when problems occur most frequently
Impact trending
Track user impact changes over time
See
references/problem-trending.md
for complete query patterns and best practices.
Best Practices
Essential Rules
Always filter duplicates
Use
not(dt.davis.is_duplicate)
to avoid counting the same problem multiple times
Use correct status values
:
"ACTIVE"
or
"CLOSED"
, never
"OPEN"
Specify time ranges
Always include time bounds to optimize performance
Include display_id
Essential for problem identification and linking
Test incrementally
Add one filter or field at a time when building queries
Filter early
Apply
not(dt.davis.is_duplicate)
immediately after fetch
Query Development
Start simple
Begin with basic filtering, then add complexity
Test fields first
Run with
| limit 1
to verify field names exist
Use meaningful time ranges
Too broad wastes resources, too narrow misses data
Document problem IDs
Always capture and store
display_id
for reference
Root Cause Verification
Always filter
isNotNull(root_cause_entity_id)
when required
Cross-reference events using
dt.davis.event_ids
Consider time delays: root cause may appear in logs minutes before problem
Time Range Guidelines
// ✅ GOOD - Specific time range
fetch dt.davis.problems, from:now() - 4h
// ❌ BAD - Scans all historical data
fetch dt.davis.problems
Related Documentation
references/problem-trending.md
Problem trending and timeseries analysis patterns
references/problem-correlation.md
Correlating problems with logs and other telemetry
references/impact-analysis.md
Business and technical impact assessment
references/problem-merging.md
When and why DAVIS merges events into problems
返回排行榜