# Grafana and LGTM Stack Skill Overview
The LGTM stack provides a complete observability solution spanning logs, metrics, traces, and visualization:
- **Loki**: Log aggregation and querying (LogQL)
- **Grafana**: Visualization, dashboarding, alerting, and exploration
- **Tempo**: Distributed tracing (TraceQL)
- **Mimir**: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.
## When to Use This Skill

### Primary Use Cases

- Creating or modifying Grafana dashboards
- Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
- Writing queries (PromQL, LogQL, TraceQL)
- Configuring data sources (Prometheus, Loki, Tempo, Mimir)
- Setting up alerting rules and notification policies
- Implementing dashboard variables and templates
- Dashboard provisioning and GitOps workflows
- Troubleshooting observability queries
- Analyzing application performance, errors, or system behavior

### Who Uses This Skill

- **senior-infrastructure-engineer** (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture
- **software-engineer**: Application dashboards, service metrics visualization
- **devops-engineer**: Infrastructure monitoring, deployment dashboards

## LGTM Stack Components

### Loki - Log Aggregation

#### Architecture
- Horizontally scalable log aggregation inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage with object stores (S3, GCS, etc.)
- LogQL query language similar to PromQL

#### Key Concepts

- Labels for indexing (keep cardinality low)
- Log streams identified by unique label sets
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters

### Grafana - Visualization

#### Features

- Multi-datasource dashboarding
- Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
- Templating and variables for dynamic dashboards
- Alerting (unified alerting with contact points and notification policies)
- Dashboard provisioning and GitOps integration
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
- Annotations for event markers
- Dashboard folders and organization

### Tempo - Distributed Tracing

#### Architecture
- Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace querying
- Integration with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry compatible

### Mimir - Metrics Storage

#### Architecture
- Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible

## Dashboard Design and Best Practices

### Dashboard Organization Principles

- **Hierarchy**: Overview -> Service -> Component -> Deep Dive
- **Golden Signals**: Latency, Traffic, Errors, Saturation (RED/USE method)
- **Variable-driven**: Use templates for flexibility across environments
- **Consistent layouts**: Grid alignment (24-column grid), logical top-to-bottom flow
- **Performance**: Limit queries, use query caching, appropriate time intervals

### Panel Types and When to Use Them

| Panel Type | Use Case | Best For |
|---|---|---|
| Time Series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rates, current values, percentages |
| Gauge | Progress toward a limit | CPU usage, memory, disk space |
| Bar Gauge | Comparative values | Top N items, distributions |
| Table | Structured data | Service lists, error details, resource inventory |
| Pie Chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distribution over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed tracing | Performance analysis, dependency mapping |

### Panel Configuration Best Practices

#### Titles and Descriptions

- Clear, descriptive titles: include units and metric context
- Tooltips: add description fields for panel documentation
- Examples:
  - Good: "P95 Latency (seconds) by Endpoint"
  - Bad: "Latency"

#### Legends and Labels

- Show legends only when needed (multiple series)
- Use {{label}} format for dynamic legend names
- Place legends appropriately (bottom, right, or hidden)
- Sort by value when showing Top N

#### Axes and Units

- Always label axes with units
- Use appropriate unit formats (seconds, bytes, percent, requests/sec)
- Set reasonable min/max ranges to avoid misleading scales
- Use logarithmic scales for wide value ranges

#### Thresholds and Colors

- Use thresholds for visual cues (green/yellow/red); a JSON sketch follows at the end of this section
- Standard threshold pattern:
  - Green: normal operation
  - Yellow: warning (action may be needed)
  - Red: critical (immediate attention required)
- Examples:
  - Error rate: 0% (green), 1% (yellow), 5% (red)
  - P95 latency: <1s (green), 1-3s (yellow), >3s (red)

#### Links and Drilldowns

- Link panels to related dashboards
- Use data links for context (logs, traces, related services)
- Create drill-down paths: Overview -> Service -> Component -> Details
- Link to runbooks for alert panels
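As a concrete illustration of the error-rate threshold pattern above, here is a minimal `fieldConfig` sketch for a panel; the field names follow Grafana's dashboard JSON schema, while the specific values mirror the example thresholds and are illustrative:

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.01 },
          { "color": "red", "value": 0.05 }
        ]
      }
    }
  }
}
```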
## Dashboard Variables and Templating

Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.
### Variable Types

| Type | Purpose | Example |
|---|---|---|
| Query | Populate from a data source | Namespaces, services, pods |
| Custom | Static list of options | Environments (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Datasource | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden values for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |

### Common Variable Patterns

```json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}
```
### Variable Usage in Queries
Variables are referenced with `$variable_name` or `${variable_name}` syntax:
```promql
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in a query, grouped so the legend can use the label
sum by (method) (rate(http_requests_total{app="$app"}[5m]))
# Legend format: "{{method}}"

# Using the interval variable for adaptive queries
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
```
### Advanced Variable Techniques
**Regex filtering:**
```json
{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}
```
**All option with custom value:**
```json
{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}
```
**Dependent variables (variable chain):**
1. `$datasource` (datasource type)
2. `$cluster` (query; depends on datasource)
3. `$namespace` (query; depends on cluster)
4. `$app` (query; depends on namespace)
5. `$pod` (query; depends on app)
## Annotations

Annotations display events as vertical markers on time series panels:
```json
{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}
```
## Dashboard Performance Optimization

### Query Optimization

- Limit the number of panels (< 15 per dashboard)
- Use appropriate time ranges (avoid queries over months)
- Leverage $__interval for adaptive sampling
- Avoid high-cardinality grouping (too many series)
- Use query caching when available

### Panel Performance

- Set max data points to reasonable values (see the sketch after this list)
- Use instant queries for current-state panels
- Combine related metrics into single queries when possible
- Disable auto-refresh on heavy dashboards
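For example, capping resolution on a heavy panel might look like the following sketch; `maxDataPoints` and `interval` are standard panel JSON fields, while the panel itself and its values are illustrative:

```json
{
  "title": "Request Rate (downsampled)",
  "type": "timeseries",
  "maxDataPoints": 500,
  "interval": "30s",
  "targets": [
    { "expr": "sum(rate(http_requests_total[$__rate_interval]))" }
  ]
}
```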
## Dashboard as Code and Provisioning

### Dashboard Provisioning

Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.
#### Provisioning Provider Configuration
File: `/etc/grafana/provisioning/dashboards/dashboards.yaml`
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: 'application'
    orgId: 1
    folder: 'Applications'
    type: file
    disableDeletion: true
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: 'infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure
```
#### Dashboard JSON Structure
Complete dashboard JSON with metadata and provisioning fields:
```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": { "from": "now-1h", "to": "now" },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}
```
#### Kubernetes ConfigMap Provisioning

The `grafana_dashboard: "1"` label is the convention commonly used by the Grafana Helm chart's dashboard sidecar to discover dashboard ConfigMaps:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }
```
#### Grafana Operator (CRD)

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "title": "Application Observability",
      "panels": []
    }
```
## Data Source Provisioning

### Loki Data Source
File: `/etc/grafana/provisioning/datasources/loki.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: 'trace_id=(\w+)'
          name: TraceID
          url: '$${__value.raw}'
    editable: false
```
### Tempo Data Source
File: `/etc/grafana/provisioning/datasources/tempo.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ['job', 'instance', 'pod', 'namespace']
        mappedTags:
          - key: service.name
            value: service
        spanStartTimeShift: '1h'
        spanEndTimeShift: '1h'
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags:
          - key: service.name
            value: service
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false
```
### Mimir/Prometheus Data Source
File: `/etc/grafana/provisioning/datasources/mimir.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: 'High'
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false
```
## Alerting

### Alert Rule Configuration
Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.
#### Prometheus/Mimir Alert Rule
File: `/etc/grafana/provisioning/alerting/rules.yaml`
```yaml
apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m])) > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error ratio is {{ printf "%.2f" $values.A.Value }} (threshold: 0.05, i.e. 5%)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning
```
#### Loki Alert Rule

```yaml
apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m])) > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: "true"
```
### Contact Points and Notification Policies
File: `/etc/grafana/provisioning/alerting/contactpoints.yaml`
```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            Alert: {{ .Labels.alertname }}
            Summary: {{ .Annotations.summary }}
            Description: {{ .Annotations.description }}
            Severity: {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true

      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h
```
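Reusable message templates can also be provisioned alongside contact points. A minimal sketch, assuming the `templates:` key of Grafana's alerting provisioning format; the template name and body are illustrative:

```yaml
apiVersion: 1

templates:
  - orgId: 1
    name: slack_alert_text
    template: |
      {{ define "slack_alert_text" }}
      {{ range .Alerts }}
      *{{ .Labels.alertname }}* ({{ .Labels.severity }})
      {{ .Annotations.summary }}
      {{ end }}
      {{ end }}
```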
## LogQL Query Patterns

### Basic Log Queries

#### Stream Selection
```logql
# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}
```
#### Line Filters
```logql
# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"
```
### Parsing and Extraction

#### JSON Parsing
```logql
# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on an extracted field
{app="api"} | json | level="error"

# Nested JSON (the json parser flattens nested keys with underscores)
{app="api"} | json | line_format "{{.response_status}}"
```
#### Logfmt Parsing
```logql
# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"
```
#### Pattern Parsing
```logql
# Extract with a pattern (pattern expressions are backtick-quoted strings)
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
```
### Aggregations and Metrics

#### Count Queries
```logql
# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m])) / sum(rate({app="api"}[5m]))
```
#### Extracted Metrics
```logql
# Average duration
avg_over_time({app="api"} | logfmt | unwrap duration [5m]) by (endpoint)

# P95 latency (regexp expressions are backtick-quoted strings)
quantile_over_time(0.95,
  {app="api"}
    | regexp `duration=(?P<duration>[0-9.]+)ms`
    | unwrap duration [5m]) by (method)

# Top 10 error messages
topk(10, sum by (msg) (
  count_over_time({app="api"} | json | level="error" [1h])
))
```
## TraceQL Query Patterns

### Basic Trace Queries
```traceql
# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }
```
### Advanced TraceQL
```traceql
# Parent-child relationship (frontend span with a failing backend child)
{ .service.name = "frontend" } > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans (slow postgres anywhere below the api service)
{ .service.name = "api" } >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries (status is an unquoted intrinsic value)
{ .service.name = "api" } >> { .db.system = "postgresql" && status = error }
```
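TraceQL can also aggregate across the spans matched within each trace. A brief sketch; aggregate functions such as `count()` and `avg()` are part of TraceQL, though availability depends on your Tempo version, and the thresholds here are illustrative:

```traceql
# Traces where the api service emitted more than 10 matching spans
{ .service.name = "api" } | count() > 10

# Traces whose matching spans average over 500ms
{ .service.name = "api" } | avg(duration) > 500ms
```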
## Complete Dashboard Examples

### Application Observability Dashboard

```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": { "from": "now-1h", "to": "now" },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": { "selected": false, "text": "api", "value": "api" }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "yaxes": [{ "format": "reqps", "label": "Requests/sec" }]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
        "yaxes": [{ "format": "s", "label": "Duration" }],
        "thresholds": [
          { "value": 1, "colorMode": "critical", "fill": true, "line": true, "op": "gt" }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
        "yaxes": [{ "format": "percentunit", "max": 1, "min": 0 }],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.01], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}
```
## LGTM Stack Configuration

### Loki Configuration
File: `loki.yaml`
```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h  # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
```
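Loki only stores what something ships to it. A minimal Promtail client config sketch; `/loki/api/v1/push` is Loki's standard ingestion endpoint, while the scrape job and labels are illustrative:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  # Push logs to the Loki instance configured above
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
```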
### Tempo Configuration
File: `tempo.yaml`
```yaml
server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h  # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
```
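For local experimentation, the four components can run side by side. A minimal single-node Docker Compose sketch; the image tags, mounted config paths, and port mappings are assumptions to keep the example short, not a production layout:

```yaml
services:
  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/loki.yaml"]
    volumes:
      - ./loki.yaml:/etc/loki/loki.yaml
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
    ports:
      - "3200:3200"
      - "4318:4318"   # OTLP HTTP ingest

  mimir:
    image: grafana/mimir:latest
    command: ["-config.file=/etc/mimir/mimir.yaml"]
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml
    ports:
      - "8080:8080"

  grafana:
    image: grafana/grafana:latest
    volumes:
      # Data sources and dashboards from the provisioning files above
      - ./provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
```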
## Production Best Practices

### Performance Optimization

#### Query Optimization

- Use label filters before line filters
- Limit time ranges for expensive queries
- Use unwrap instead of parsing when possible
- Cache query results with the query frontend

#### Dashboard Performance

- Limit the number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality grouping
- Use $__interval for adaptive sampling

#### Storage Optimization

- Configure retention policies
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth

### Security Best Practices

#### Authentication

- Enable auth (auth_enabled: true in Loki/Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with org isolation

#### Authorization

- Configure RBAC in Grafana
- Limit datasource access by team
- Use folder permissions for dashboards

#### Network Security

- TLS for all components
- Network policies in Kubernetes (see the sketch after this list)
- Rate limiting at ingress
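As one example of the network-policy point above, a minimal sketch restricting Loki ingress to Grafana and Promtail pods; the namespace and pod labels are assumptions for illustration:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: loki
  policyTypes: ["Ingress"]
  ingress:
    - from:
        # Only Grafana (queries) and Promtail (ingestion) may reach Loki
        - podSelector:
            matchLabels:
              app: grafana
        - podSelector:
            matchLabels:
              app: promtail
      ports:
        - protocol: TCP
          port: 3100
```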
## Troubleshooting

### Common Issues

- **High cardinality**: too many unique label combinations.
  Solution: reduce label dimensions; use log parsing instead.
- **Query timeouts**: complex queries on large datasets.
  Solution: reduce the time range, use aggregations, add query limits.
- **Storage growth**: unbounded retention.
  Solution: configure retention policies, enable compaction.
- **Missing traces**: incomplete trace data.
  Solution: check sampling rates and verify instrumentation.

## Resources

- Loki Documentation
- Tempo Documentation
- Grafana Documentation
- LogQL Cheat Sheet
- TraceQL Guide
- Grafana Operator