# Grafana and LGTM Stack Skill Overview
The LGTM stack provides a complete observability solution spanning logs, metrics, traces, and visualization:
- **Loki**: Log aggregation and querying (LogQL)
- **Grafana**: Visualization, dashboarding, alerting, and exploration
- **Tempo**: Distributed tracing (TraceQL)
- **Mimir**: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.
## When to Use This Skill

### Primary Use Cases

- Creating or modifying Grafana dashboards
- Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
- Writing queries (PromQL, LogQL, TraceQL)
- Configuring data sources (Prometheus, Loki, Tempo, Mimir)
- Setting up alerting rules and notification policies
- Implementing dashboard variables and templates
- Dashboard provisioning and GitOps workflows
- Troubleshooting observability queries
- Analyzing application performance, errors, or system behavior

### Who Uses This Skill

- **senior-infrastructure-engineer** (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture
- **software-engineer**: Application dashboards, service metrics visualization
- **devops-engineer**: Infrastructure monitoring, deployment dashboards

## LGTM Stack Components

### Loki - Log Aggregation

#### Architecture
- Horizontally scalable log aggregation inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage with object stores (S3, GCS, etc.)
- LogQL query language similar to PromQL

#### Key Concepts

- Labels for indexing (keep cardinality low)
- Log streams identified by unique label sets
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters

### Grafana - Visualization

#### Features

- Multi-datasource dashboarding
- Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
- Templating and variables for dynamic dashboards
- Alerting (unified alerting with contact points and notification policies)
- Dashboard provisioning and GitOps integration
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
- Annotations for event markers
- Dashboard folders and organization

### Tempo - Distributed Tracing

#### Architecture
- Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace querying
- Integration with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry compatible

### Mimir - Metrics Storage

#### Architecture
- Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible

## Dashboard Design and Best Practices

### Dashboard Organization Principles

- **Hierarchy**: Overview -> Service -> Component -> Deep Dive
- **Golden Signals**: Latency, Traffic, Errors, Saturation (RED/USE method)
- **Variable-driven**: Use templates for flexibility across environments
- **Consistent layouts**: Grid alignment (24-column grid), logical top-to-bottom flow
- **Performance**: Limit queries, use query caching, appropriate time intervals

### Panel Types and When to Use Them

| Panel Type | Use Case | Best For |
|---|---|---|
| Time Series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rates, current values, percentages |
| Gauge | Progress toward a limit | CPU usage, memory, disk space |
| Bar Gauge | Comparative values | Top N items, distributions |
| Table | Structured data | Service lists, error details, resource inventory |
| Pie Chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distribution over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed tracing | Performance analysis, dependency mapping |

### Panel Configuration Best Practices

#### Titles and Descriptions

- Clear, descriptive titles: include units and metric context
- Tooltips: add description fields for panel documentation
- Examples:
  - Good: "P95 Latency (seconds) by Endpoint"
  - Bad: "Latency"

#### Legends and Labels

- Show legends only when needed (multiple series)
- Use {{label}} format for dynamic legend names
- Place legends appropriately (bottom, right, or hidden)
- Sort by value when showing Top N

#### Axes and Units

- Always label axes with units
- Use appropriate unit formats (seconds, bytes, percent, requests/sec)
- Set reasonable min/max ranges to avoid misleading scales
- Use logarithmic scales for wide value ranges

#### Thresholds and Colors

- Use thresholds for visual cues (green/yellow/red); a JSON sketch follows at the end of this section
- Standard threshold pattern:
  - Green: normal operation
  - Yellow: warning (action may be needed)
  - Red: critical (immediate attention required)
- Examples:
  - Error rate: 0% (green), 1% (yellow), 5% (red)
  - P95 latency: <1s (green), 1-3s (yellow), >3s (red)

#### Links and Drilldowns

- Link panels to related dashboards
- Use data links for context (logs, traces, related services)
- Create drill-down paths: Overview -> Service -> Component -> Details
- Link to runbooks for alert panels
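As a concrete illustration of the error-rate threshold pattern above, here is a minimal `fieldConfig` sketch for a panel; the field names follow Grafana's dashboard JSON schema, while the specific values mirror the example thresholds and are illustrative:

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.01 },
          { "color": "red", "value": 0.05 }
        ]
      }
    }
  }
}
```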
## Dashboard Variables and Templating

Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.
### Variable Types

| Type | Purpose | Example |
|---|---|---|
| Query | Populate from a data source | Namespaces, services, pods |
| Custom | Static list of options | Environments (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Datasource | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden values for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |

### Common Variable Patterns

```json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}
```
### Variable Usage in Queries
Variables are referenced with `$variable_name` or `${variable_name}` syntax:
```promql
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in a query, grouped so the legend can use the label
sum by (method) (rate(http_requests_total{app="$app"}[5m]))
# Legend format: "{{method}}"

# Using the interval variable for adaptive queries
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
```
### Advanced Variable Techniques
**Regex filtering:**
```json
{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}
```
**All option with custom value:**
```json
{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}
```
**Dependent variables (variable chain):**
1. `$datasource` (datasource type)
2. `$cluster` (query; depends on datasource)
3. `$namespace` (query; depends on cluster)
4. `$app` (query; depends on namespace)
5. `$pod` (query; depends on app)
## Annotations

Annotations display events as vertical markers on time series panels:
```json
{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}
```
## Dashboard Performance Optimization

### Query Optimization

- Limit the number of panels (< 15 per dashboard)
- Use appropriate time ranges (avoid queries over months)
- Leverage $__interval for adaptive sampling
- Avoid high-cardinality grouping (too many series)
- Use query caching when available

### Panel Performance

- Set max data points to reasonable values (see the sketch after this list)
- Use instant queries for current-state panels
- Combine related metrics into single queries when possible
- Disable auto-refresh on heavy dashboards
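For example, capping resolution on a heavy panel might look like the following sketch; `maxDataPoints` and `interval` are standard panel JSON fields, while the panel itself and its values are illustrative:

```json
{
  "title": "Request Rate (downsampled)",
  "type": "timeseries",
  "maxDataPoints": 500,
  "interval": "30s",
  "targets": [
    { "expr": "sum(rate(http_requests_total[$__rate_interval]))" }
  ]
}
```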
## Dashboard as Code and Provisioning

### Dashboard Provisioning

Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.
#### Provisioning Provider Configuration
File: `/etc/grafana/provisioning/dashboards/dashboards.yaml`
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: 'application'
    orgId: 1
    folder: 'Applications'
    type: file
    disableDeletion: true
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: 'infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure
```
#### Dashboard JSON Structure
Complete dashboard JSON with metadata and provisioning fields:
```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": { "from": "now-1h", "to": "now" },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}
```
#### Kubernetes ConfigMap Provisioning

The `grafana_dashboard: "1"` label is the convention commonly used by the Grafana Helm chart's dashboard sidecar to discover dashboard ConfigMaps:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }
```
#### Grafana Operator (CRD)

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "title": "Application Observability",
      "panels": []
    }
```
## Data Source Provisioning

### Loki Data Source
File: `/etc/grafana/provisioning/datasources/loki.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: 'trace_id=(\w+)'
          name: TraceID
          url: '$${__value.raw}'
    editable: false
```
### Tempo Data Source
File: `/etc/grafana/provisioning/datasources/tempo.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ['job', 'instance', 'pod', 'namespace']
        mappedTags:
          - key: service.name
            value: service
        spanStartTimeShift: '1h'
        spanEndTimeShift: '1h'
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags:
          - key: service.name
            value: service
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false
```
### Mimir/Prometheus Data Source
File: `/etc/grafana/provisioning/datasources/mimir.yaml`
```yaml
apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: 'High'
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false
```
## Alerting

### Alert Rule Configuration
Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.
#### Prometheus/Mimir Alert Rule
File: `/etc/grafana/provisioning/alerting/rules.yaml`
```yaml
apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m])) > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error ratio is {{ printf "%.2f" $values.A.Value }} (threshold: 0.05, i.e. 5%)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning
```
#### Loki Alert Rule

```yaml
apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m])) > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: "true"
```
### Contact Points and Notification Policies
File: `/etc/grafana/provisioning/alerting/contactpoints.yaml`
```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            Alert: {{ .Labels.alertname }}
            Summary: {{ .Annotations.summary }}
            Description: {{ .Annotations.description }}
            Severity: {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true

      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h
```
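Reusable message templates can also be provisioned alongside contact points. A minimal sketch, assuming the `templates:` key of Grafana's alerting provisioning format; the template name and body are illustrative:

```yaml
apiVersion: 1

templates:
  - orgId: 1
    name: slack_alert_text
    template: |
      {{ define "slack_alert_text" }}
      {{ range .Alerts }}
      *{{ .Labels.alertname }}* ({{ .Labels.severity }})
      {{ .Annotations.summary }}
      {{ end }}
      {{ end }}
```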
## LogQL Query Patterns

### Basic Log Queries

#### Stream Selection
```logql
# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}
```
#### Line Filters
```logql
# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"
```
### Parsing and Extraction

#### JSON Parsing
```logql
# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on an extracted field
{app="api"} | json | level="error"

# Nested JSON (the json parser flattens nested keys with underscores)
{app="api"} | json | line_format "{{.response_status}}"
```
#### Logfmt Parsing
```logql
# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"
```
#### Pattern Parsing
```logql
# Extract with a pattern (pattern expressions are backtick-quoted strings)
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
```
### Aggregations and Metrics

#### Count Queries
```logql
# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m])) / sum(rate({app="api"}[5m]))
```
#### Extracted Metrics
```logql
# Average duration
avg_over_time({app="api"} | logfmt | unwrap duration [5m]) by (endpoint)

# P95 latency (regexp expressions are backtick-quoted strings)
quantile_over_time(0.95,
  {app="api"}
    | regexp `duration=(?P<duration>[0-9.]+)ms`
    | unwrap duration [5m]) by (method)

# Top 10 error messages
topk(10, sum by (msg) (
  count_over_time({app="api"} | json | level="error" [1h])
))
```
## TraceQL Query Patterns

### Basic Trace Queries
```traceql
# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }
```
### Advanced TraceQL
```traceql
# Parent-child relationship (frontend span with a failing backend child)
{ .service.name = "frontend" } > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans (slow postgres anywhere below the api service)
{ .service.name = "api" } >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries (status is an unquoted intrinsic value)
{ .service.name = "api" } >> { .db.system = "postgresql" && status = error }
```
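TraceQL can also aggregate across the spans matched within each trace. A brief sketch; aggregate functions such as `count()` and `avg()` are part of TraceQL, though availability depends on your Tempo version, and the thresholds here are illustrative:

```traceql
# Traces where the api service emitted more than 10 matching spans
{ .service.name = "api" } | count() > 10

# Traces whose matching spans average over 500ms
{ .service.name = "api" } | avg(duration) > 500ms
```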
## Complete Dashboard Examples

### Application Observability Dashboard

```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": { "from": "now-1h", "to": "now" },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": { "selected": false, "text": "api", "value": "api" }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "yaxes": [{ "format": "reqps", "label": "Requests/sec" }]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
        "yaxes": [{ "format": "s", "label": "Duration" }],
        "thresholds": [
          { "value": 1, "colorMode": "critical", "fill": true, "line": true, "op": "gt" }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
        "yaxes": [{ "format": "percentunit", "max": 1, "min": 0 }],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.01], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}
```
## LGTM Stack Configuration

### Loki Configuration
File: `loki.yaml`
```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h  # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
```
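Loki only stores what something ships to it. A minimal Promtail client config sketch; `/loki/api/v1/push` is Loki's standard ingestion endpoint, while the scrape job and labels are illustrative:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  # Push logs to the Loki instance configured above
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
```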
### Tempo Configuration
File: `tempo.yaml`
```yaml
server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h  # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
```
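For local experimentation, the four components can run side by side. A minimal single-node Docker Compose sketch; the image tags, mounted config paths, and port mappings are assumptions to keep the example short, not a production layout:

```yaml
services:
  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/loki.yaml"]
    volumes:
      - ./loki.yaml:/etc/loki/loki.yaml
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
    ports:
      - "3200:3200"
      - "4318:4318"   # OTLP HTTP ingest

  mimir:
    image: grafana/mimir:latest
    command: ["-config.file=/etc/mimir/mimir.yaml"]
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml
    ports:
      - "8080:8080"

  grafana:
    image: grafana/grafana:latest
    volumes:
      # Data sources and dashboards from the provisioning files above
      - ./provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
```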
## Production Best Practices

### Performance Optimization

#### Query Optimization

- Use label filters before line filters
- Limit time ranges for expensive queries
- Use unwrap instead of parsing when possible
- Cache query results with the query frontend

#### Dashboard Performance

- Limit the number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality grouping
- Use $__interval for adaptive sampling

#### Storage Optimization

- Configure retention policies
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth

### Security Best Practices

#### Authentication

- Enable auth (auth_enabled: true in Loki/Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with org isolation

#### Authorization

- Configure RBAC in Grafana
- Limit datasource access by team
- Use folder permissions for dashboards

#### Network Security

- TLS for all components
- Network policies in Kubernetes (see the sketch after this list)
- Rate limiting at ingress
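As one example of the network-policy point above, a minimal sketch restricting Loki ingress to Grafana and Promtail pods; the namespace and pod labels are assumptions for illustration:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: loki
  policyTypes: ["Ingress"]
  ingress:
    - from:
        # Only Grafana (queries) and Promtail (ingestion) may reach Loki
        - podSelector:
            matchLabels:
              app: grafana
        - podSelector:
            matchLabels:
              app: promtail
      ports:
        - protocol: TCP
          port: 3100
```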
## Troubleshooting

### Common Issues

- **High cardinality**: too many unique label combinations.
  Solution: reduce label dimensions; use log parsing instead.
- **Query timeouts**: complex queries on large datasets.
  Solution: reduce the time range, use aggregations, add query limits.
- **Storage growth**: unbounded retention.
  Solution: configure retention policies, enable compaction.
- **Missing traces**: incomplete trace data.
  Solution: check sampling rates and verify instrumentation.

## Resources

- Loki Documentation
- Tempo Documentation
- Grafana Documentation
- LogQL Cheat Sheet
- TraceQL Guide
- Grafana Operator