# Prometheus Monitoring and Alerting

## Overview

Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. It is built around multi-dimensional time-series data with flexible querying via PromQL.

## Architecture Components

- **Prometheus Server**: Core component that scrapes and stores time-series data in a local TSDB
- **Alertmanager**: Handles alert deduplication, grouping, routing, and notifications to receivers
- **Pushgateway**: Allows ephemeral jobs to push metrics (use sparingly; prefer the pull model)
- **Exporters**: Convert metrics from third-party systems to the Prometheus format (node, blackbox, etc.)
- **Client Libraries**: Instrument application code (Go, Java, Python, Rust, etc.)
- **Prometheus Operator**: Kubernetes-native deployment and management via CRDs
- **Remote Storage**: Long-term storage via Thanos, Cortex, or Mimir for multi-cluster federation
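All of these components ultimately exchange samples in Prometheus's text exposition format. As a dependency-free illustration (the metric name and labels are hypothetical), rendering one sample line in Python:

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = format_sample("http_requests_total", {"method": "GET", "status": "200"}, 1027)
print(line)  # http_requests_total{method="GET",status="200"} 1027
```

This is exactly the shape described in the Data Model section below: metric name, label matchers, sample value.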
## Data Model

Metrics are time-series data identified by a metric name and key-value labels.

Format:

```
metric_name{label1="value1", label2="value2"} sample_value timestamp
```

Metric types:

- **Counter**: Monotonically increasing value (requests, errors); query with `rate()` or `increase()`
- **Gauge**: Value that can go up and down (temperature, memory usage, queue length)
- **Histogram**: Observations in configurable buckets (latency, request size); exposes `_bucket`, `_sum`, and `_count` series
- **Summary**: Similar to a histogram but calculates quantiles client-side; prefer histograms when you need aggregation

## Setup and Configuration

### Basic Prometheus Server Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-east-1"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - "alerts/*.yml"
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application services
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-1:8080"
          - "app-2:8080"
        labels:
          env: "production"
          team: "backend"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use a custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use a custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # Add service name label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

  # Node Exporter for host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-exporter:9100"
```

### Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Template files for custom notifications
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Route alerts to appropriate receivers
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true

    # Database alerts to DBA team
    - match:
        team: database
      receiver: "dba-team"
      group_by: ["alertname", "instance"]

    # Development environment alerts
    - match:
        env: development
      receiver: "slack-dev"
      group_wait: 5m
      repeat_interval: 4h

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning alerts if a critical alert is firing
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

  # Suppress instance alerts if the entire service is down
  - source_match:
      alertname: "ServiceDown"
    target_match_re:
      alertname: ".*"
    equal: ["service"]

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .GroupLabels.alertname }}"

  - name: "dba-team"
    slack_configs:
      - channel: "#database-alerts"
    email_configs:
      - to: "dba-team@example.com"
        headers:
          Subject: "Database Alert: {{ .GroupLabels.alertname }}"

  - name: "slack-dev"
    slack_configs:
      - channel: "#dev-alerts"
        send_resolved: true
```

## Best Practices

### Metric Naming Conventions

Follow these naming patterns for consistency:
Format: snake_case names built from underscore-separated components ending in the unit (see the examples below).

```
# Counters (always use the _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total

# Gauges
memory_usage_bytes
active_connections
queue_size

# Histograms (the _bucket, _sum, _count suffixes are added automatically)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds
```

Use consistent base units:

- seconds for duration (not milliseconds)
- bytes for size (not kilobytes)
- ratio for percentages (0.0-1.0, not 0-100)

### Label Cardinality Management

DO:

```promql
# Good: Bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}

# Good: Reasonable number of label values
db_queries_total{table="users", operation="select"}
```

DON'T:

```promql
# Bad: Unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="user@example.com"}
http_requests_total{timestamp="1234567890"}

# Bad: High cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}
```

Guidelines:

- Keep label values to < 10 per label (ideally)
- Total unique time-series per metric should be < 10,000
- Use recording rules to pre-aggregate high-cardinality metrics
- Avoid labels with unbounded values (IDs, timestamps, user input)

### Recording Rules for Performance

Use recording rules to pre-compute expensive queries:
```yaml
# rules/recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate request rates
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Pre-calculate error rates
      - record: job:http_request_errors:rate5m
        expr: sum(rate(http_request_errors_total[5m])) by (job)

      # Pre-calculate error ratio
      - record: job:http_request_error_ratio:rate5m
        expr: |
          job:http_request_errors:rate5m
          /
          job:http_requests:rate5m

      # Pre-aggregate latency percentiles
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

  - name: aggregation_rules
    interval: 1m
    rules:
      # Multi-level aggregation for dashboards
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      - record: cluster:node_cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)

      # Memory aggregation
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )
```

### Alert Design (Symptoms vs Causes)

Alert on symptoms (user-facing impact), not causes:
```yaml
# alerts/symptom_based.yml
groups:
  - name: symptom_alerts
    rules:
      # GOOD: Alert on user-facing symptoms
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"
          impact: "Users experiencing slow page loads"

      # GOOD: SLO-based alerting
      - alert: SLOBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999))  # 14.4x burn rate for a 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO budget burning too fast"
          description: "Error ratio is {{ $value | humanizePercentage }}; at this burn rate the monthly error budget will be exhausted early"
```

Cause-based alerts (use for debugging, not paging):
```yaml
# alerts/cause_based.yml
groups:
  - name: infrastructure_alerts
    rules:
      # Lower severity for infrastructure issues
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning  # Not critical unless symptoms appear
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          action: "Clean up logs or expand disk"
```
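The 14.4× burn-rate factor used in the SLO alert earlier is not arbitrary: in the common multi-window setup it corresponds to consuming 2% of a 30-day error budget within one hour. A quick sanity check of that arithmetic:

```python
# Error budget math behind the SLOBudgetBurnRate alert above.
SLO = 0.999               # 99.9% availability target
BUDGET = 1 - SLO          # allowed error ratio over the SLO window
WINDOW_HOURS = 30 * 24    # 30-day SLO window
BURN_RATE = 14.4

# Alert threshold: the 1h error ratio corresponding to a 14.4x burn rate
threshold = BURN_RATE * BUDGET
print(f"alert when 1h error ratio > {threshold:.4f}")

# Fraction of the monthly budget consumed per hour at that burn rate
budget_per_hour = BURN_RATE / WINDOW_HOURS
print(f"budget consumed per hour: {budget_per_hour:.2%}")
```

At 14.4× burn, the threshold works out to a 1.44% error ratio, and each hour at that rate spends 2% of the month's budget.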
### Alert Best Practices

- **For duration**: Use the `for` clause to avoid flapping
- **Meaningful annotations**: Include summary, description, runbook URL, and impact
- **Proper severity levels**: critical (page immediately), warning (ticket), info (log)
- **Actionable alerts**: Every alert should require human action
- **Include context**: Add labels for team ownership, service, and environment

## PromQL Query Patterns

PromQL is the query language for Prometheus. Key concepts: instant vectors, range vectors, scalars, string literals, selectors, operators, functions, and aggregation.

### Selectors and Matchers
```promql
# Instant vector selector (latest sample for each time-series)
http_requests_total

# Filter by label values
http_requests_total{method="GET", status="200"}

# Regex matching (=~) and negative regex (!~)
http_requests_total{status=~"5.."}          # 5xx errors
http_requests_total{endpoint!~"/admin.*"}   # exclude admin endpoints

# Label absence/presence
http_requests_total{job="api", status=""}   # empty label
http_requests_total{job="api", status!=""}  # non-empty label

# Range vector selector (samples over time)
http_requests_total[5m]                     # last 5 minutes of samples
```

### Rate Calculations
```promql
# Request rate (requests per second) - ALWAYS use rate() for counters
rate(http_requests_total[5m])

# Sum by service
sum(rate(http_requests_total[5m])) by (service)

# Increase over a time window (total count) - for alerts/dashboards showing totals
increase(http_requests_total[1h])

# irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])
```

### Error Ratios

```promql
# Error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

### Histogram Queries

```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
```

### Aggregation Operations
```promql
# Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)

# Average CPU usage
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Maximum value
max(http_request_duration_seconds) by (service)

# Minimum value
min(node_filesystem_avail_bytes) by (instance)

# Count number of instances
count(up == 1) by (job)

# Standard deviation
stddev(http_request_duration_seconds) by (service)
```

### Advanced Queries

```promql
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# Bottom 3 instances by available memory
bottomk(3, node_memory_MemAvailable_bytes)

# Predict disk full time (linear regression over the next 4 hours)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

# Compare with 1 day ago
http_requests_total - http_requests_total offset 1d

# Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])

# Absent metric detection
absent(up{job="critical-service"})
```

### Complex Aggregations

```promql
# Calculate Apdex score (Application Performance Index)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) * 0.5
)
/
sum(rate(http_request_duration_seconds_count[5m]))

# Multi-window multi-burn-rate SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
  > 0.001 * 14.4
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.001 * 14.4
)
```

### Binary Operators and Vector Matching
```promql
# Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1

# Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100

# One-to-one matching (default)
method:http_requests:rate5m / method:http_requests:total

# Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
/ on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)

# One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
/ on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)
```

### Time Functions and Offsets

```promql
# Compare with the previous time period
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
http_requests_total - http_requests_total offset 1d

# Time-based filtering
http_requests_total and hour() >= 9 and hour() < 17   # business hours
day_of_week() == 0 or day_of_week() == 6              # weekends

# Timestamp functions
time() - process_start_time_seconds   # uptime in seconds
```

## Service Discovery

Prometheus supports multiple service discovery mechanisms for dynamic environments where targets appear and disappear.

### Static Configuration

```yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets:
          - "host1:9100"
          - "host2:9100"
        labels:
          env: production
          region: us-east-1
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s
```
```json
[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": {
      "job": "web",
      "env": "prod"
    }
  }
]
```

The file above would live at, e.g., `targets/webservers.json`.

### Kubernetes Service Discovery

```yaml
scrape_configs:
  # Pod-based discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Keep only pods with the prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Extract custom scrape path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Extract custom port from annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add standard Kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

  # Service-based discovery
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Node-based discovery (for node exporters)
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Endpoints discovery (for service endpoints)
  - job_name: "kubernetes-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
```

### Consul Service Discovery

```yaml
scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        datacenter: "dc1"
        services: ["web", "api", "cache"]
        tags: ["production"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags
```

### EC2 Service Discovery

```yaml
scrape_configs:
  - job_name: "ec2-instances"
    ec2_sd_configs:
      - region: us-east-1
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
          - name: instance-state-name
            values: [running]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
```

### DNS Service Discovery

```yaml
scrape_configs:
  - job_name: "dns-srv-records"
    dns_sd_configs:
      - names:
          - "_prometheus._tcp.example.com"
        type: "SRV"
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance
```

### Relabeling Actions Reference

| Action | Description | Use Case |
|--------|-------------|----------|
| `keep` | Keep targets where regex matches source labels | Filter targets by annotation/label |
| `drop` | Drop targets where regex matches source labels | Exclude specific targets |
| `replace` | Replace target label with value from source labels | Extract custom labels/paths/ports |
| `labelmap` | Map source label names to target labels via regex | Copy all Kubernetes labels |
| `labeldrop` | Drop labels matching regex | Remove internal metadata labels |
| `labelkeep` | Keep only labels matching regex | Reduce cardinality |
| `hashmod` | Set target label to hash of source labels modulo N | Sharding/routing |

## High Availability and Scalability

### Prometheus High Availability Setup
```yaml
# Deploy multiple identical Prometheus instances scraping the same targets.
# Use external labels to distinguish instances.
global:
  external_labels:
    replica: prometheus-1   # Change to prometheus-2, etc.
    cluster: production

# Alertmanager will deduplicate alerts from multiple Prometheus instances
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093
```

### Alertmanager Clustering
```yaml
# alertmanager.yml - HA cluster configuration
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "cluster"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: "default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"
        channel: "#alerts"
```

```bash
# Start Alertmanager cluster members
# alertmanager-1:
--cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-2:
--cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-3:
--cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094
```
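Deduplication in the HA setup above hinges on alert identity: two replicas fire the "same" alert differing only in the per-replica external label, which is why that label is typically stripped before sending (commonly via `alert_relabel_configs`). A sketch of the idea, not Alertmanager's actual implementation:

```python
# Two HA replicas fire the same alert; once the per-replica label is dropped,
# the label sets are identical, so only one notification results.
def dedup_key(labels: dict[str, str], ignore: tuple[str, ...] = ("replica",)) -> tuple:
    """Identity of an alert after ignoring per-replica labels."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignore))

alert_a = {"alertname": "HighErrorRate", "cluster": "production", "replica": "prometheus-1"}
alert_b = {"alertname": "HighErrorRate", "cluster": "production", "replica": "prometheus-2"}

assert dedup_key(alert_a) == dedup_key(alert_b)
print("both replicas collapse to one notification")
```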
### Federation for Hierarchical Monitoring

```yaml
# Global Prometheus federating from regional instances
scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        # Pull aggregated metrics only
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'   # Recording rules
        - "up"
    static_configs:
      - targets:
          - "prometheus-us-east-1:9090"
          - "prometheus-us-west-2:9090"
          - "prometheus-eu-west-1:9090"
        labels:
          region: "us-east-1"
```

### Remote Storage for Long-term Retention

```yaml
# Prometheus remote write to Thanos/Cortex/Mimir
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms
    write_relabel_configs:
      # Drop high-cardinality metrics before remote write
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop

# Prometheus remote read from long-term storage
remote_read:
  - url: "http://thanos-query:9090/api/v1/read"
    read_recent: true
```

### Thanos Architecture for Global View
```bash
# Thanos Sidecar - runs alongside Prometheus
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Store - queries object storage
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Query - global query interface
thanos query \
  --http-address=0.0.0.0:9090 \
  --grpc-address=0.0.0.0:10901 \
  --store=prometheus-1-sidecar:10901 \
  --store=prometheus-2-sidecar:10901 \
  --store=thanos-store:10901

# Thanos Compactor - downsample and compact blocks
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d
```

### Horizontal Sharding with Hashmod

```yaml
# Split scrape targets across multiple Prometheus instances using hashmod
scrape_configs:
  - job_name: "kubernetes-pods-shard-0"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash the pod name and keep only shard 0 (mod 3)
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

  - job_name: "kubernetes-pods-shard-1"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "1"
        action: keep

  # shard-2 follows the same pattern...
```
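The sharding above works because a deterministic hash assigns every target to exactly one shard, so the three jobs partition the pod set with no overlap and no gaps. A sketch of that property (Prometheus uses its own internal hash, so the specific shard assignments here are illustrative, not what Prometheus would compute):

```python
import hashlib

def shard_of(target: str, modulus: int = 3) -> int:
    """Deterministically map a target name to one of `modulus` shards."""
    digest = hashlib.md5(target.encode()).digest()
    return int.from_bytes(digest[8:], "big") % modulus

pods = [f"app-{i}" for i in range(9)]
shards = {s: [p for p in pods if shard_of(p) == s] for s in range(3)}
for s, members in shards.items():
    print(s, members)

# Every pod lands in exactly one shard: the shards partition the pod set
assert sorted(p for ms in shards.values() for p in ms) == sorted(pods)
```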
## Kubernetes Integration

### ServiceMonitor for Prometheus Operator

```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: myapp
    release: prometheus
spec:
  # Select services to monitor
  selector:
    matchLabels:
      app: myapp
  # Define namespaces to search
  namespaceSelector:
    matchNames:
      - production
      - staging
  # Endpoint configuration
  endpoints:
    - port: metrics        # Service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
      # Metric relabeling (filter/modify metrics)
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop     # Drop Go runtime metrics
        - sourceLabels: [status]
          regex: "[45].."
          targetLabel: error
          replacement: "true"
      # Optional: TLS configuration
      # tlsConfig:
      #   insecureSkipVerify: true
      #   ca:
      #     secret:
      #       name: prometheus-tls
      #       key: ca.crt
```
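The ServiceMonitor above only discovers targets through a Service whose labels match its `selector` and which exposes a port named `metrics`. For reference, a hypothetical Service that it would match:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp              # hypothetical Service matching the ServiceMonitor above
  namespace: production
  labels:
    app: myapp             # must match spec.selector.matchLabels
spec:
  selector:
    app: myapp
  ports:
    - name: metrics        # must match endpoints[].port by name, not number
      port: 8080
      targetPort: 8080
```

A missing or misnamed port name here is a common reason targets silently never appear.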
### PodMonitor for Direct Pod Scraping

```yaml
# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
  labels:
    release: prometheus
spec:
  # Select pods to monitor
  selector:
    matchLabels:
      app: myapp
  # Namespace selection
  namespaceSelector:
    matchNames:
      - production
  # Pod metrics endpoints
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```

### PrometheusRule for Alerts and Recording Rules
```yaml
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus
    role: alert-rules
spec:
  groups:
    - name: app_alerts
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
              /
              sum(rate(http_requests_total{app="myapp"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            dashboard: "https://grafana.example.com/d/app-overview"
            runbook: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} has restarted {{ $value }} times in 15m"

    - name: app_recording_rules
      interval: 30s
      rules:
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)
        - record: app:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
            )
```

### Prometheus Custom Resource
```yaml
# prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  version: v2.45.0

  # Service account for Kubernetes API access
  serviceAccountName: prometheus

  # Select ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PodMonitors
  podMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PrometheusRules
  ruleSelector:
    matchLabels:
      release: prometheus
      role: alert-rules

  # Resource limits
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m

  # Storage
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd

  # Retention
  retention: 30d
  retentionSize: 45GB

  # Alertmanager configuration
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: web

  # External labels
  externalLabels:
    cluster: production
    region: us-east-1

  # Security context
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000

  # Enable admin API for management operations
  enableAdminAPI: false

  # Additional scrape configs (from a Secret)
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
```
## Application Instrumentation Examples

### Go Application
```go
// main.go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter for total requests
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	// Histogram for request duration
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)

	// Gauge for active connections
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)

	// Summary for response sizes
	responseSizeBytes = promauto.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "http_response_size_bytes",
			Help:       "HTTP response size in bytes",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"endpoint"},
	)
)

// Middleware to instrument HTTP handlers
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		activeConnections.Inc()
		defer activeConnections.Dec()

		// Wrap the response writer to capture the status code
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
		handler(wrapped, r)

		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
		httpRequestsTotal.WithLabelValues(r.Method, endpoint, http.StatusText(wrapped.statusCode)).Inc()
	}
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"users": []}`))
}

func main() {
	// Register handlers
	http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
	http.Handle("/metrics", promhttp.Handler())

	// Start the server
	http.ListenAndServe(":8080", nil)
}
```
### Python Application (Flask)

```python
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
active_requests = Gauge('active_requests', 'Number of active requests')

# Middleware for instrumentation
@app.before_request
def before_request():
    active_requests.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    active_requests.dec()
    duration = time.time() - request.start_time
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)
    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    return response

@app.route('/metrics')
def metrics():
    return generate_latest()

@app.route('/api/users')
def users():
    return {'users': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```

## Production Deployment Checklist

- Set an appropriate retention period (balance storage vs history needs)
- Configure persistent storage with adequate size
- Enable high availability (multiple Prometheus replicas or federation)
- Set up remote storage for long-term retention (Thanos, Cortex, Mimir)
- Configure service discovery for dynamic environments
- Implement recording rules for frequently-used queries
- Create symptom-based alerts with proper annotations
- Set up Alertmanager with appropriate routing and receivers
- Configure inhibition rules to reduce alert noise
- Add runbook URLs to all critical alerts
- Implement proper label hygiene (avoid high cardinality)
- Monitor Prometheus itself (meta-monitoring)
- Set up authentication and authorization
- Enable TLS for scrape targets and remote storage
- Configure rate limiting for queries
- Test alert and recording rule validity (`promtool check rules`)
- Implement backup and disaster recovery procedures
- Document metric naming conventions for the team
- Create dashboards in Grafana for common queries
- Set up log aggregation alongside metrics (Loki)

## Troubleshooting Commands
```bash
# Check Prometheus configuration syntax
promtool check config prometheus.yml

# Check rules file syntax
promtool check rules alerts/*.yml

# Test PromQL queries
promtool query instant http://localhost:9090 'up'

# Check which targets are up
curl http://localhost:9090/api/v1/targets

# Query current metric values
curl 'http://localhost:9090/api/v1/query?query=up'

# Check service discovery
curl http://localhost:9090/api/v1/targets/metadata

# View TSDB stats
curl http://localhost:9090/api/v1/status/tsdb

# Check runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo
```

## Quick Reference

### Common PromQL Patterns
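`rate()` in the patterns below computes the per-second increase of a counter over the window, compensating for counter resets. A simplified stdlib sketch of that idea (real `rate()` also extrapolates to the window boundaries; `samples` here is hypothetical data as (timestamp, value) pairs):

```python
def simple_rate(samples):
    """Per-second increase over (timestamp, value) counter samples,
    treating any decrease as a counter reset."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset: count from zero
            increase += value
        else:
            increase += value - prev
        prev = value
    span = samples[-1][0] - samples[0][0]
    return increase / span if span > 0 else 0.0

# 5 samples over 60s; counter resets between 130 and 10
samples = [(0, 100), (15, 115), (30, 130), (45, 10), (60, 25)]
print(simple_rate(samples))  # (130-100) + 10 + (25-10) = 55 over 60s
```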
```promql
# Request rate per second
rate(http_requests_total[5m])

# Error ratio percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))

# P95 latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency from histogram
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))
```
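The P95 pattern above relies on `histogram_quantile`, which estimates the quantile by linear interpolation inside the bucket where the target rank falls. A stdlib sketch over cumulative bucket counts (the bucket data is made up for illustration):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    sorted by bound and ending with (inf, total)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # quantile falls in the open-ended bucket
            # linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 100 observations; 95 of them completed within 0.5s
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (float('inf'), 100)]
print(histogram_quantile(0.95, buckets))  # rank 95 lands exactly on the 0.5s bound
```

This also shows why bucket layout matters: the estimate can never be more precise than the bucket boundaries.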
```promql
# Memory utilization percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU utilization (non-idle)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Disk space remaining percentage
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

# Service uptime in days
(time() - process_start_time_seconds) / 86400

# Request rate growth compared to 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
```

### Alert Rule Patterns
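The `for:` clause in the rules below keeps an alert in the pending state until the expression has been true continuously for the given duration; only then does it fire. A small stdlib sketch of that state machine (the evaluation data is hypothetical, and real evaluation is more involved):

```python
def alert_state(evaluations, for_seconds):
    """Walk (timestamp, expr_is_true) evaluations and return the final
    alert state: 'inactive', 'pending', or 'firing'."""
    active_since = None
    state = 'inactive'
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None   # any false evaluation resets the clock
            state = 'inactive'
        else:
            if active_since is None:
                active_since = ts
            state = 'firing' if ts - active_since >= for_seconds else 'pending'
    return state

# Evaluated every 15s; condition true from t=30 onward, for: 5m (300s)
evals = [(t, t >= 30) for t in range(0, 360, 15)]
print(alert_state(evals, 300))  # condition held 300s by t=330, so it fires
```

This is why a brief dip in the expression resets the pending timer, a useful property against flapping alerts.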
```yaml
# High error rate (symptom-based)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate is {{ $value | humanizePercentage }}"
    runbook: "https://runbooks.example.com/high-error-rate"

# High latency P95
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 1
  for: 5m
  labels:
    severity: warning

# Service down
- alert: ServiceDown
  expr: up{job="critical-service"} == 0
  for: 2m
  labels:
    severity: critical

# Disk space low (cause-based, warning only)
- alert: DiskSpaceLow
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.1
  for: 10m
  labels:
    severity: warning

# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: warning
```

### Recording Rule Naming Convention
Format: `level:metric:operations`

- `level`: aggregation level (`job`, `instance`, `cluster`)
- `metric`: base metric name
- `operations`: transformations applied (`rate5m`, `sum`, `ratio`)
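A quick way to enforce this convention in CI is a name check. Here is a stdlib sketch; the regex is an illustrative approximation of the `level:metric:operations` pattern, not an official grammar:

```python
import re

# level:metric:operations, e.g. job:http_requests:rate5m
RULE_NAME = re.compile(
    r'^(?P<level>[a-z_]+):'
    r'(?P<metric>[a-zA-Z_][a-zA-Z0-9_]*):'
    r'(?P<operations>[a-zA-Z0-9_]+)$'
)

def check_rule_name(name):
    """True if a recording rule name follows level:metric:operations."""
    return bool(RULE_NAME.match(name))

for name in ['job:http_requests:rate5m',
             'instance:node_cpu_utilization:ratio',
             'http_requests_total']:
    print(name, check_rule_name(name))  # the plain metric name fails the check
```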
```yaml
groups:
  - name: aggregation_rules
    rules:
      # Instance-level aggregation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # Job-level aggregation
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Job-level error ratio
      - record: job:http_request_errors:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)

      # Cluster-level aggregation
      - record: cluster:cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)
```

### Metric Naming Best Practices

| Pattern | Good Example | Bad Example |
| --- | --- | --- |
| Counter suffix | `http_requests_total` | `http_requests` |
| Base units | `http_request_duration_seconds` | `http_request_duration_ms` |
| Ratio range | `cache_hit_ratio` (0.0-1.0) | `cache_hit_percentage` (0-100) |
| Byte units | `response_size_bytes` | `response_size_kb` |
| Namespace prefix | `myapp_http_requests_total` | `http_requests_total` |
| Label naming | `{method="GET", status="200"}` | `{httpMethod="GET", statusCode="200"}` |

### Label Cardinality Guidelines

| Cardinality | Examples | Recommendation |
| --- | --- | --- |
| Low (<10) | HTTP method, status code, environment | Safe for all labels |
| Medium (10-100) | API endpoint, service name, pod name | Safe with aggregation |
| High (100-1000) | Container ID, hostname | Use only when necessary |
| Unbounded | User ID, IP address, timestamp, URL path | Never use as label |

### Kubernetes Annotation-based Scraping
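Annotation-driven discovery works through `relabel_configs`: Prometheus keeps only pods whose `prometheus.io/scrape` annotation is `"true"`, then rewrites the scrape port and metrics path from the other annotations. A simplified stdlib sketch of that keep/replace logic (the `select_targets` helper and sample data are illustrative; real relabeling operates on `__meta_kubernetes_pod_annotation_*` labels):

```python
def select_targets(pods):
    """Keep pods annotated for scraping and build (address, path) targets."""
    targets = []
    for pod in pods:
        ann = pod.get('annotations', {})
        if ann.get('prometheus.io/scrape') != 'true':
            continue  # analogous to a relabel 'keep' action failing
        port = ann.get('prometheus.io/port', '80')
        path = ann.get('prometheus.io/path', '/metrics')
        targets.append((f"{pod['ip']}:{port}", path))
    return targets

pods = [
    {'ip': '10.0.0.5',
     'annotations': {'prometheus.io/scrape': 'true',
                     'prometheus.io/port': '8080'}},
    {'ip': '10.0.0.6', 'annotations': {}},
]
print(select_targets(pods))  # only the annotated pod becomes a target
```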
```yaml
# Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
    prometheus.io/scheme: "http"
spec:
  containers:
    - name: app
      image: myapp:latest
      ports:
        - containerPort: 8080
          name: metrics
```

### Alertmanager Routing Patterns

```yaml
route:
  receiver: default
  group_by: ["alertname", "cluster"]
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true  # also send to default

    # Team-based routing
    - match:
        team: database
      receiver: dba-team
      group_by: ["alertname", "instance"]

    # Environment-based routing
    - match:
        env: development
      receiver: slack-dev
      repeat_interval: 4h

    # Time-based routing (office hours only)
    - match:
        severity: warning
      receiver: email
      active_time_intervals:
        - business-hours

time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: "09:00"
            end_time: "17:00"
        weekdays: ["monday:friday"]
```

## Additional Resources

- Prometheus Documentation
- PromQL Basics
- Best Practices
- Alerting Rules
- Recording Rules
- Prometheus Operator
- Thanos Documentation
- Google SRE Book - Monitoring