Application Services Skill
Monitor application service performance, health, and runtime-specific metrics using DQL.
Core Capabilities
1. Service Performance (RED Metrics)
Monitor service Rate, Errors, and Duration (RED) using metrics-based timeseries queries.
Key Metrics:
dt.service.request.response_time - Response time (microseconds)
dt.service.request.count - Request count
dt.service.request.failure_count - Failed request count
Common Use Cases:
Response time monitoring (avg, p50, p95, p99)
Error rate tracking and spike detection
Traffic analysis (throughput, peaks, growth)
Performance degradation detection
Multi-cluster comparison
Quick Example:
timeseries {
    p95 = percentile(dt.service.request.response_time, 95),
    total_requests = sum(dt.service.request.count),
    failures = sum(dt.service.request.failure_count)
  }, by: {dt.service.name}
| fieldsAdd p95_ms = p95[] / 1000, error_rate_pct = (failures[] * 100.0) / total_requests[]
→ For detailed queries, see references/service-metrics.md
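The quick example produces one error-rate value per time interval. For a ranked overview it can help to collapse each series into a single number. The following is a minimal sketch under that assumption, using only the request metrics listed above; the field name `overall_error_rate_pct` and the limit of 10 are illustrative choices, not part of the skill.

```dql
timeseries {
    total_requests = sum(dt.service.request.count),
    failures = sum(dt.service.request.failure_count)
  }, by: {dt.service.name}
// Collapse each series into one value for the whole timeframe
| fieldsAdd overall_error_rate_pct = (arraySum(failures) * 100.0) / arraySum(total_requests)
// List the services with the highest error rate first
| sort overall_error_rate_pct desc
| limit 10
```

Services with no requests in the timeframe will not yield a meaningful rate, so a filter on the request total may be worth adding before ranking.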
2. Advanced Service Analysis
Span-based queries for complex scenarios requiring flexible filtering and custom aggregations.
Use Cases:
SLA compliance tracking with custom thresholds
Service health scoring (multi-dimensional)
Operation/endpoint-level performance analysis
Custom error classification
Failure pattern detection with error details
Quick Example:
fetch spans, from: now() - 1h
| filter request.is_root_span == true
| fieldsAdd meets_sla = if(request.is_failed == false AND duration < 3s, 1, else: 0)
| summarize total = count(), sla_compliant = sum(meets_sla), by: {dt.service.name}
| fieldsAdd sla_compliance_pct = (sla_compliant * 100.0) / total
→ For detailed queries, see references/service-metrics.md
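The operation/endpoint-level use case could be approached with the same span filter, grouped one level deeper. This is a sketch only: it assumes `span.name` is an acceptable stand-in for the operation or endpoint, which may not hold for every technology, and the output field names and the limit of 20 are illustrative.

```dql
fetch spans, from: now() - 1h
| filter request.is_root_span == true
// Assumption: span.name identifies the operation/endpoint of the request
| summarize
    total = count(),
    failed = countIf(request.is_failed == true),
    p95_duration = percentile(duration, 95),
    by: {dt.service.name, span.name}
| fieldsAdd failure_pct = (failed * 100.0) / total
// Slowest operations first
| sort p95_duration desc
| limit 20
```

Grouping on spans rather than metrics keeps the filter flexible, at the cost of scanning more data than a timeseries query over the same timeframe.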
3. Service Messaging Metrics
Monitor message-based service communication (queues, topics).
Key Metrics:
dt.service.messaging.publish.count - Messages sent to queues or topics
dt.service.messaging.receive.count - Messages received from queues or topics
dt.service.messaging.process.count - Messages successfully processed
dt.service.messaging.process.failure_count - Messages that failed processing
Use Cases:
Message throughput monitoring (publish/receive rates)
Message processing failure tracking
Queue/topic health analysis
Consumer lag detection (publish vs receive rate comparison)
Quick Example:
timeseries {
    published = sum(dt.service.messaging.publish.count),
    received = sum(dt.service.messaging.receive.count),
    processed = sum(dt.service.messaging.process.count),
    failed = sum(dt.service.messaging.process.failure_count)
  }, by: {dt.service.name}
→ For detailed queries, see references/service-metrics.md
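The consumer lag use case (publish vs receive rate comparison) might be sketched as below. Because publishers and consumers are usually different services, this version deliberately drops the split by `dt.service.name` and compares totals across the queried scope; splitting by a queue or topic dimension would be more precise, but that dimension name is not given here, so it is left out. The field names `backlog_delta` and `net_backlog` are illustrative.

```dql
timeseries {
    published = sum(dt.service.messaging.publish.count),
    received = sum(dt.service.messaging.receive.count)
  }
// Per-interval gap between messages published and messages received
| fieldsAdd backlog_delta = published[] - received[]
// Net gap over the whole timeframe
| fieldsAdd net_backlog = arraySum(published) - arraySum(received)
```

A persistently positive `backlog_delta` suggests consumers are falling behind. It is an indirect signal, not a measured queue depth.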
4. Service Mesh Monitoring
Monitor service mesh ingress performance and overhead.
Key Metrics:
dt.service.request.service_mesh.response_time - Mesh response time (microseconds)
dt.service.request.service_mesh.count - Mesh request count
dt.service.request.service_mesh.failure_count - Mesh failure count
Use Cases:
Mesh vs direct performance comparison
Mesh overhead calculation
Mesh failure analysis
gRPC traffic monitoring
Multi-cluster mesh performance
Quick Example:
timeseries {
    direct_p95 = percentile(dt.service.request.response_time, 95),
    mesh_p95 = percentile(dt.service.request.service_mesh.response_time, 95)
  }, by: {dt.service.name}
| fieldsAdd mesh_overhead_ms = (mesh_p95[] - direct_p95[]) / 1000
→ For detailed queries, see references/service-metrics.md
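For the mesh failure analysis use case, the two mesh count metrics listed above can be combined the same way as the RED error rate. A minimal sketch, with `mesh_failure_rate_pct` as an illustrative field name:

```dql
timeseries {
    mesh_requests = sum(dt.service.request.service_mesh.count),
    mesh_failures = sum(dt.service.request.service_mesh.failure_count)
  }, by: {dt.service.name}
// Per-interval share of mesh requests that failed
| fieldsAdd mesh_failure_rate_pct = (mesh_failures[] * 100.0) / mesh_requests[]
```

Comparing this against the direct error rate for the same service may help indicate whether failures originate in the mesh layer or in the service itself.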
5. Runtime-Specific Monitoring
Technology-specific runtime performance and resource usage metrics.
Java/JVM - references/java.md
Memory: heap, pools, metaspace
GC: impact, suspension, frequency, pause time
Threads: count monitoring, leak detection
Classes: loading, unloading, growth
Node.js - references/nodejs.md
Event loop: utilization, active handles
V8 heap: memory used, total
GC: collection time, suspension
Process: RSS memory
.NET CLR - references/dotnet.md
Memory: consumption by generation
GC: collection count, suspension time
Thread pool: threads, queued work
JIT: compilation time
Python - references/python.md
Threads: active thread count
Heap: allocated blocks
GC: collection by generation, pause time
Objects: collected, uncollectable
PHP - references/php.md
OPcache: hit ratio, memory, restarts
GC: effectiveness, duration
JIT: buffer usage
Interned strings: usage, buffer
Go - references/go.md
Goroutines: count, leak detection
GC: suspension, collection time
Memory: heap by state, committed
Scheduler: worker threads, queue size
CGo: call frequency
When to Use This Skill
✅ Use for:
Monitoring service performance (response time, errors, traffic)
Calculating SLA compliance
Analyzing service mesh performance
Monitoring messaging throughput and processing failures
Troubleshooting runtime-specific issues (GC, memory, threads)
Multi-cluster service comparison
Operation/endpoint-level analysis
❌ Don't use for:
Infrastructure metrics (use infrastructure skills)
Log analysis (use logs skills)
Distributed tracing workflows (use traces/spans skills)
Database performance (use database skills)
Agent Instructions
Understanding User Intent
Map user questions to capabilities:
| User Request | Use Capability | Key Files |
|---|---|---|
| "service performance", "response time", "error rate" | Service Performance (RED) | service-metrics.md |
| "SLA tracking", "health scoring" | Advanced Service Analysis | service-metrics.md |
| "service mesh", "Istio", "Linkerd", "mesh overhead" | Service Mesh Monitoring | service-metrics.md |
| "messaging", "queue", "topic", "publish", "consumer" | Service Messaging Metrics | service-metrics.md |
| "JVM GC", "Java memory", "heap" | Runtime-Specific (Java) | java.md |
| "Node.js event loop", "V8 heap" | Runtime-Specific (Node.js) | nodejs.md |
| ".NET CLR", "GC generation" | Runtime-Specific (.NET) | dotnet.md |
| "Python GC", "thread count" | Runtime-Specific (Python) | python.md |
| "OPcache", "PHP GC" | Runtime-Specific (PHP) | php.md |
| "goroutines", "Go GC", "scheduler" | Runtime-Specific (Go) | go.md |
Query Construction Patterns
1. Metrics-based (timeseries)
Use for:
Standard monitoring, dashboards, alerting
Pattern:
timeseries
Install dt-obs-services:
npx skills add https://github.com/dynatrace/dynatrace-for-ai --skill dt-obs-services