# Datadog Monitors

Create, manage, and maintain monitors for alerting.

## Prerequisites

The `pup` binary must be in your `$PATH`. Install it with Go:

```bash
go install github.com/datadog-labs/pup@latest
```

Ensure `~/go/bin` is in `$PATH`.

## Quick Start

```bash
pup auth login
```

## Common Operations

### List Monitors

```bash
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"
```

### Get Monitor

```bash
pup monitors get <id> --json
```

### Create Monitor

```bash
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"
```

### Mute/Unmute
```bash
# Mute with a duration
pup monitors mute --id 12345 --duration 1h

# Or mute with a specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345
```

## ⚠️ Monitor Creation Best Practices

### 1. Avoid Alert Fatigue

| Rule | Why |
|------|-----|
| No flapping alerts | Use `last_Xm`, not `last_1m` |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action is needed, don't alert |
| Include a runbook | `@runbook-url` in the message |
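The `--end` flag takes a UTC timestamp in RFC 3339 format. One way to generate such a timestamp (e.g. "two hours from now") is a short Python snippet — a sketch, not part of `pup` itself:

```python
from datetime import datetime, timedelta, timezone

# RFC 3339 / ISO-8601 UTC timestamp two hours from now,
# suitable for `pup monitors mute --end ...` as shown above
end = (datetime.now(timezone.utc) + timedelta(hours=2)).strftime("%Y-%m-%dT%H:%M:%SZ")
print(end)
```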
```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
### 2. Use Proper Scoping

```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
### 3. Set Recovery Thresholds

```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50,
        },
    },
}
```

### 4. Include Context in Messages

```python
message = """
## High CPU Alert

**Host:** {{host.name}}
**Current Value:** {{value}}
**Threshold:** {{threshold}}

### Runbook
- Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
- Check recent deploys
- Scale if needed

@slack-ops @pagerduty-oncall
"""
```
## ⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (same as dashboards):

```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
```

## Monitor Types

| Type | Use Case |
|------|----------|
| metric alert | CPU, memory, custom metrics |
| query alert | Complex metric queries |
| service check | Agent check status |
| event alert | Event stream patterns |
| log alert | Log pattern matching |
| composite | Combine multiple monitors |
| apm | APM metrics |

## Audit Monitors
```bash
# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'
```
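The same ownership audit can be done in Python over the `pup monitors list --json` output, if `jq` is not available. The sample data below is hypothetical:

```python
import json

# Hypothetical `pup monitors list --json` output
raw = """[
  {"id": 1, "name": "api latency", "tags": ["team:platform", "env:prod"]},
  {"id": 2, "name": "orphan disk alert", "tags": ["env:prod"]}
]"""

monitors = json.loads(raw)
# Keep monitors with no team:* tag, mirroring the jq filter above
unowned = [
    {"id": m["id"], "name": m["name"]}
    for m in monitors
    if not any(t.startswith("team:") for t in m.get("tags", []))
]
print(unowned)  # → [{'id': 2, 'name': 'orphan disk alert'}]
```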
```bash
# Find noisy monitors (high alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | .[:10] | .[] | {id, name, status: .overall_state}'
```

## Downtime vs Muting

| Use | When |
|-----|------|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |
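The rule of thumb in the table above can be expressed as a small decision helper — a hypothetical sketch using the one-hour cutoff from the table, not a `pup` feature:

```python
from datetime import timedelta

def silencing_strategy(window: timedelta, recurring: bool = False) -> str:
    """Pick mute vs downtime per the table: short one-offs get a mute,
    longer or recurring windows get a scheduled downtime."""
    if recurring or window >= timedelta(hours=1):
        return "downtime"
    return "mute"

print(silencing_strategy(timedelta(minutes=20)))                  # → mute
print(silencing_strategy(timedelta(hours=4)))                     # → downtime
print(silencing_strategy(timedelta(minutes=10), recurring=True))  # → downtime
```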
```bash
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
```

## Failure Handling

| Problem | Fix |
|---------|-----|
| Alert not firing | Check that the query returns data; verify thresholds |
| Too many alerts | Increase the window; add a recovery threshold |
| No data alerts | Check agent connectivity; verify the metric exists |
| Auth error | `pup auth refresh` |

## References

- Monitor Types
- Alerting Best Practices
- SLO Monitors