# Datadog Monitors

Create, manage, and maintain monitors for alerting.

## Prerequisites

The `pup` binary must be in your `$PATH`. Install it with Go:

```bash
go install github.com/datadog-labs/pup@latest
```

Ensure `~/go/bin` is in `$PATH`.

## Quick Start

```bash
pup auth login
```

## Common Operations

### List Monitors

```bash
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"
```

### Get Monitor

```bash
pup monitors get <id> --json
```

### Create Monitor

```bash
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"
```

### Mute/Unmute
```bash
# Mute with a duration
pup monitors mute --id 12345 --duration 1h

# Or mute with a specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345
```

## ⚠️ Monitor Creation Best Practices

### 1. Avoid Alert Fatigue

| Rule | Why |
|------|-----|
| No flapping alerts | Use `last_Xm`, not `last_1m` |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action is needed, don't alert |
| Include a runbook | `@runbook-url` in the message |
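The `--end` flag takes a UTC timestamp in RFC 3339 format. One way to generate such a timestamp (e.g. "two hours from now") is a short Python snippet — a sketch, not part of `pup` itself:

```python
from datetime import datetime, timedelta, timezone

# RFC 3339 / ISO-8601 UTC timestamp two hours from now,
# suitable for `pup monitors mute --end ...` as shown above
end = (datetime.now(timezone.utc) + timedelta(hours=2)).strftime("%Y-%m-%dT%H:%M:%SZ")
print(end)
```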
```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
### 2. Use Proper Scoping

```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
### 3. Set Recovery Thresholds

```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50,
        },
    },
}
```

### 4. Include Context in Messages

```python
message = """
## High CPU Alert

**Host:** {{host.name}}
**Current Value:** {{value}}
**Threshold:** {{threshold}}

### Runbook
- Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
- Check recent deploys
- Scale if needed

@slack-ops @pagerduty-oncall
"""
```
## ⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (same as dashboards):

```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
```

## Monitor Types

| Type | Use Case |
|------|----------|
| metric alert | CPU, memory, custom metrics |
| query alert | Complex metric queries |
| service check | Agent check status |
| event alert | Event stream patterns |
| log alert | Log pattern matching |
| composite | Combine multiple monitors |
| apm | APM metrics |

## Audit Monitors
```bash
# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'
```
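The same ownership audit can be done in Python over the `pup monitors list --json` output, if `jq` is not available. The sample data below is hypothetical:

```python
import json

# Hypothetical `pup monitors list --json` output
raw = """[
  {"id": 1, "name": "api latency", "tags": ["team:platform", "env:prod"]},
  {"id": 2, "name": "orphan disk alert", "tags": ["env:prod"]}
]"""

monitors = json.loads(raw)
# Keep monitors with no team:* tag, mirroring the jq filter above
unowned = [
    {"id": m["id"], "name": m["name"]}
    for m in monitors
    if not any(t.startswith("team:") for t in m.get("tags", []))
]
print(unowned)  # → [{'id': 2, 'name': 'orphan disk alert'}]
```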
```bash
# Find noisy monitors (high alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | .[:10] | .[] | {id, name, status: .overall_state}'
```

## Downtime vs Muting

| Use | When |
|-----|------|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |
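The rule of thumb in the table above can be expressed as a small decision helper — a hypothetical sketch using the one-hour cutoff from the table, not a `pup` feature:

```python
from datetime import timedelta

def silencing_strategy(window: timedelta, recurring: bool = False) -> str:
    """Pick mute vs downtime per the table: short one-offs get a mute,
    longer or recurring windows get a scheduled downtime."""
    if recurring or window >= timedelta(hours=1):
        return "downtime"
    return "mute"

print(silencing_strategy(timedelta(minutes=20)))                  # → mute
print(silencing_strategy(timedelta(hours=4)))                     # → downtime
print(silencing_strategy(timedelta(minutes=10), recurring=True))  # → downtime
```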
```bash
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
```

## Failure Handling

| Problem | Fix |
|---------|-----|
| Alert not firing | Check that the query returns data; verify thresholds |
| Too many alerts | Increase the window; add a recovery threshold |
| No data alerts | Check agent connectivity; verify the metric exists |
| Auth error | `pup auth refresh` |

## References

- Monitor Types
- Alerting Best Practices
- SLO Monitors