Installation
npx skills add https://github.com/dynatrace/dynatrace-for-ai --skill dt-obs-problems
# Problem Analysis Skill

Analyze Dynatrace AI-detected problems, including root cause identification, impact assessment, and correlation with logs and metrics.

## Overview

Dynatrace automatically detects anomalies, performance degradations, and failures across your environment, creating **problems** that aggregate related alert, warning, and info-level events and provide root cause and impact insights.

### What are Problems?

Problems are automatically detected software and infrastructure health and resilience issues that:

- **Automatically correlate** related alert, warning, and info-level events across services, infrastructure, frontend applications, and user sessions
- **Identify root causes** using causal analysis of Smartscape dependencies
- **Assess business impact** by tracking affected users and services
- **Reduce alert noise** by grouping related symptoms into single problems that share the same root cause and impact
- **Track problem lifecycle** from early detection through resolution

### Event Kinds

The `event.kind` field (stable, permission) identifies the high-level event type:

| `event.kind` value | Description |
|---|---|
| `DAVIS_EVENT` | Davis-detected infrastructure/application events |
| `BIZ_EVENT` | Business events (ingested via API or captured from spans) |
| `RUM_EVENT` | Real User Monitoring events |
| `AUDIT_EVENT` | Administrative/security audit events |

`event.provider` (stable, permission) identifies the event source.

### Problem Categories

Common `event.category` values:

| Category | Description | Example |
|---|---|---|
| `AVAILABILITY` | Infrastructure or service unavailable | Web service returns no data, synthetic test actively fails, database connection lost |
| `ERROR` | Increased error rates beyond baseline | API error rate jumped from 0.1% to 15% |
| `SLOWDOWN` | Performance degradation | Response time increased from 200ms to 5000ms |
| `RESOURCE` | Resource saturation | Container memory at 95%, causing OOM kills |
| `CUSTOM` | Custom anomaly detections | Business KPI (orders/minute) dropped below threshold |

### Problem Lifecycle

Detection → ACTIVE → Under Investigation → CLOSED

- **ACTIVE** - Currently occurring issues requiring attention
- **CLOSED** - Resolved issues used for historical analysis

## Essential Fields

### Common Field Name Mistakes

| ❌ WRONG | ✅ CORRECT | Description |
|---|---|---|
| `title` | `event.name` | Problem title/description |
| `status` | `event.status` | Problem lifecycle status |
| `severity` | `event.category` | Problem type/category |
| `start` | `event.start` | Problem start time |

### Correct Status Values

```dql
// ✅ CORRECT: Use these status values
fetch dt.davis.problems
| filter event.status == "ACTIVE" // Currently occurring problems
// or event.status == "CLOSED" // Resolved problems
// ❌ INCORRECT: event.status == "OPEN" does not exist!
| limit 1
```

### Key Fields Reference

```dql
fetch dt.davis.problems, from:now() - 1h
| filter not(dt.davis.is_duplicate)
| fields
    event.start,                    // Problem start timestamp
    event.end,                      // Problem end timestamp (if closed)
    display_id,                     // Human-readable problem ID (P-XXXXX)
    event.name,                     // Problem title
    event.description,              // Detailed description
    event.category,                 // Problem type
    event.status,                   // ACTIVE or CLOSED
    dt.smartscape_source.id,        // Smartscape ID of the affected resource
    dt.davis.affected_users_count,  // Number of affected users
    smartscape.affected_entity.ids, // Array of affected entity IDs
    dt.smartscape.service,          // Affected services (may be an array)
    dt.davis.root_cause_entity,     // Entity identified as root cause
    root_cause_entity_id,           // Root cause entity ID
    root_cause_entity_name,         // Human-readable root cause name
    dt.davis.is_duplicate,          // Whether this record is a duplicate detection
    dt.davis.is_rootcause           // Root cause vs. symptom
| limit 10
```

## Standard Query Pattern

Always start problem queries with this foundation:

```dql
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20
```

Key components:

- `fetch dt.davis.problems` - The problems data source
- `not(dt.davis.is_duplicate)` - Filter out duplicate detections
- `event.status == "ACTIVE"` - Show only active problems
- Time range - Always specify a reasonable window
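
When queries are assembled by a script or agent rather than typed by hand, the foundation above can be enforced in one place. The following is a minimal Python sketch, not part of the skill itself: `base_problem_query` and its parameters are hypothetical names, and the function only builds the DQL text without sending it anywhere.

```python
from typing import Optional

VALID_STATUSES = ("ACTIVE", "CLOSED")  # "OPEN" is not a valid event.status


def base_problem_query(lookback: str = "2h",
                       status: Optional[str] = "ACTIVE",
                       limit: int = 20) -> str:
    """Return the standard foundation query as DQL text (hypothetical helper)."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unsupported event.status value: {status!r}")

    # The duplicate filter is always present; the status filter is optional.
    condition = "not(dt.davis.is_duplicate)"
    if status is not None:
        condition += f' and event.status == "{status}"'

    lines = [
        f"fetch dt.davis.problems, from:now() - {lookback}",  # explicit time range
        f"| filter {condition}",
        "| fields event.start, display_id, event.name, event.category",
        "| sort event.start desc",
        f"| limit {limit}",
    ]
    return "\n".join(lines)


print(base_problem_query())
```

Called with the defaults, the helper reproduces the foundation query shown above; passing `status=None` keeps only the duplicate filter, which suits historical analysis across both active and closed problems.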

## Common Query Patterns

### Active Problems by Category

```dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| summarize problem_count = count(), by:{event.category}
| sort problem_count desc
```

### High-Impact Active Problems (affecting many users)

```dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter dt.davis.affected_users_count > 100
| fields event.start, display_id, event.name, dt.davis.affected_users_count, event.category
| sort dt.davis.affected_users_count desc
```

### High-Impact Active Problems (affecting many Smartscape entities)

```dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter arraySize(affected_entity_ids) > 5
| fields event.start, display_id, event.name, affected_entity_ids, event.category, impacted_entity_count = arraySize(affected_entity_ids)
| sort impacted_entity_count desc
```

### Specific Problem Details

```dql
fetch dt.davis.problems
| filter display_id == "P-XXXXXXXXXX"
| fields event.start, event.end, event.name, event.description, affected_entity_ids, dt.davis.affected_users_count, root_cause_entity_id, root_cause_entity_name
```

### Service-Specific Problem History

```dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter in(dt.entity.service, "SERVICE-XXXXXXXXX") or in(dt.smartscape.service, toSmartscapeId("SERVICE-XXXXXXXXX"))
// Grouping by event.category is an assumption; the source leaves the by: clause empty
| summarize problems = count(), by:{event.category}
```

### Important: Entity Filter DO and DON'T

**DO** use array-safe filters and include both the deprecated and the Smartscape service fields when filtering by service ID:

```dql
| filter in(dt.entity.service, "SERVICE-00E66996F1555897") or in(dt.smartscape.service, toSmartscapeId("SERVICE-00E66996F1555897"))
```

**DON'T** use scalar equality on service fields or rely on only one field variant:

```dql
// Wrong: not array-safe and misses Smartscape-only matches
| filter dt.entity.service == "SERVICE-00E66996F1555897"
```
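
Because the DO form repeats the service ID in two places, it is easy to update one occurrence and forget the other. A small Python sketch like the one below can generate the filter line from a single ID. The helper name `service_filter` is hypothetical, and the ID pattern (`SERVICE-` followed by 16 uppercase hex characters) is an assumption inferred from the example ID above, not a documented format.

```python
import re

# Assumed shape of a service entity ID, inferred from SERVICE-00E66996F1555897.
SERVICE_ID_PATTERN = re.compile(r"^SERVICE-[0-9A-F]{16}$")


def service_filter(service_id: str) -> str:
    """Build the array-safe DQL filter covering both service fields (hypothetical helper)."""
    if not SERVICE_ID_PATTERN.match(service_id):
        raise ValueError(f"unexpected service ID format: {service_id!r}")
    return (
        f'| filter in(dt.entity.service, "{service_id}") '
        f'or in(dt.smartscape.service, toSmartscapeId("{service_id}"))'
    )


print(service_filter("SERVICE-00E66996F1555897"))
```

The validation step also keeps arbitrary text out of the generated query, which matters when the ID comes from user input.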

## Root Cause Analysis Patterns

### Basic Root Cause Query

```dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields
    display_id,
    event.name,
    event.description,
    root_cause_entity_id,
    root_cause_entity_name,
    smartscape.affected_entity.ids
```

### Root Cause by Entity Type

Identify which entity types most frequently cause problems:

```dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize problem_count = count(), by:{root_cause_entity_name}
| sort problem_count desc
| limit 20
```

### Affected Entity Is an AWS Resource

```dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(arrayToString(smartscape.affected_entity.types, delimiter:","), "AWS_")
```

### Infrastructure Root Cause with Service Impact

```dql
fetch dt.davis.problems, from:now() - 30m
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(root_cause_entity_id, "HOST-")
| filter isNotNull(dt.smartscape.service)
| fields display_id, event.name, root_cause_entity_name, dt.smartscape.service
```

### Problem Blast Radius

Calculate entity impact per root cause:

```dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| fieldsAdd affected_count = arraySize(smartscape.affected_entity.ids)
| summarize
    avg_affected = avg(affected_count),
    max_affected = max(affected_count),
    problem_count = count(),
    by:{root_cause_entity_name}
| sort avg_affected desc
```

### Recurring Root Causes

Identify entities repeatedly causing problems:

```dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize
    problem_count = count(),
    first_occurrence = min(event.start),
    last_occurrence = max(event.start),
    by:{root_cause_entity_id, root_cause_entity_name}
| filter problem_count > 3
| sort problem_count desc
```

## Problem Trending and Pattern Analysis

Track problem trends over time, identify recurring issues, and analyze resolution performance.

**Primary Files:**

- `references/problem-trending.md` - Timeseries analysis and pattern detection

**Common Use Cases:**

- Active problems over time with `makeTimeseries`
- Problem creation rate by category
- Recurring problem detection by schedule
- Resolution time trends and P95 duration analysis

**Key Techniques:**

- **`makeTimeseries` vs `bin()`** - Choose the right approach for lifecycle spans vs discrete events
- **NULL handling** - Use `coalesce(event.end, now())` for active problems
- **Peak hours analysis** - Identify when problems occur most frequently
- **Impact trending** - Track user impact changes over time

See `references/problem-trending.md` for complete query patterns and best practices.
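
The NULL-handling rule matters because an active problem has no `event.end` yet, so a naive `event.end - event.start` yields no duration for exactly the problems that are still running. The Python sketch below mirrors the `coalesce(event.end, now())` idea on already-fetched records; the function name and the sample timestamps are illustrative assumptions, not data from a real environment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def problem_duration(start: datetime,
                     end: Optional[datetime],
                     now: datetime) -> timedelta:
    """Duration so far: a missing end falls back to `now`, like coalesce(event.end, now())."""
    effective_end = end if end is not None else now
    return effective_end - start


utc = timezone.utc
now = datetime(2024, 1, 1, 12, 0, tzinfo=utc)

# A closed problem: the recorded end is used.
closed = problem_duration(datetime(2024, 1, 1, 10, 0, tzinfo=utc),
                          datetime(2024, 1, 1, 10, 30, tzinfo=utc), now)

# An active problem: no end yet, so the duration runs up to `now`.
active = problem_duration(datetime(2024, 1, 1, 11, 15, tzinfo=utc), None, now)

print(closed, active)  # 0:30:00 0:45:00
```

Passing `now` explicitly keeps the calculation reproducible, which helps when durations for many problems should be measured against the same reference instant.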

## Best Practices

### Essential Rules

- **Always filter duplicates** - Use `not(dt.davis.is_duplicate)` to avoid counting the same problem multiple times
- **Use correct status values**: `"ACTIVE"` or `"CLOSED"`, never `"OPEN"`
- **Specify time ranges** - Always include time bounds to optimize performance
- **Include display_id** - Essential for problem identification and linking
- **Test incrementally** - Add one filter or field at a time when building queries
- **Filter early** - Apply `not(dt.davis.is_duplicate)` immediately after fetch

### Query Development

- **Start simple** - Begin with basic filtering, then add complexity
- **Test fields first** - Run with `| limit 1` to verify field names exist
- **Use meaningful time ranges** - Too broad wastes resources, too narrow misses data
- **Document problem IDs** - Always capture and store `display_id` for reference

### Root Cause Verification

- Filter with `isNotNull(root_cause_entity_id)` whenever a query depends on a known root cause
- Cross-reference events using `dt.davis.event_ids`
- Consider time delays: the root cause may appear in logs minutes before the problem is raised
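
To account for that delay when correlating with logs, the search window can start somewhat before `event.start`. The Python sketch below shows one way to derive such a window. The function name is hypothetical and the 15-minute lead is an arbitrary assumption; tune it to how early symptoms typically show up in your environment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def log_search_window(problem_start: datetime,
                      problem_end: Optional[datetime] = None,
                      lead: timedelta = timedelta(minutes=15),  # assumed default
                      now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (from, to) for a log search that begins `lead` before the problem start."""
    if now is None:
        now = datetime.now(timezone.utc)
    window_start = problem_start - lead
    # Active problems have no end yet, so search up to the current time.
    window_end = problem_end if problem_end is not None else now
    return window_start, window_end


utc = timezone.utc
start = datetime(2024, 1, 1, 10, 0, tzinfo=utc)
window = log_search_window(start, now=datetime(2024, 1, 1, 11, 0, tzinfo=utc))
print(window[0].isoformat(), window[1].isoformat())
```

The two timestamps can then be used as the explicit time bounds of the follow-up log query.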

### Time Range Guidelines

```dql
// ✅ GOOD - Specific time range
fetch dt.davis.problems, from:now() - 4h

// ❌ BAD - No explicit time range; results depend on the default query timeframe
fetch dt.davis.problems
```

## Related Documentation

- `references/problem-trending.md` - Problem trending and timeseries analysis patterns
- `references/problem-correlation.md` - Correlating problems with logs and other telemetry
- `references/impact-analysis.md` - Business and technical impact assessment
- `references/problem-merging.md` - When and why Davis merges events into problems