# Building Dashboards
You design dashboards that help humans make decisions quickly. Dashboards are products: audience, questions, and actions matter more than chart count.
## Philosophy

- **Decisions first.** Every panel answers a question that leads to an action.
- **Overview → drilldown → evidence.** Start broad, narrow on click/filter, end with raw logs.
- **Rates and percentiles over averages.** Averages hide problems; p95/p99 expose them.
- **Simple beats dense.** One question per panel. No chart junk.
- **Validate with data.** Never guess fields—discover schema first.

## Entry Points
Choose your starting point:
| Starting from | Workflow |
|---|---|
| Vague description | Intake → design blueprint → APL per panel → deploy |
| Template | Pick template → customize dataset/service/env → deploy |
| Splunk dashboard | Extract SPL → translate via spl-to-apl → map to chart types → deploy |
| Exploration | Use axiom-sre to discover schema/signals → productize into panels |

## Intake: What to Ask First
Before designing, clarify:
**Audience & decision**

- Oncall triage? (fast refresh, error-focused)
- Team health? (daily trends, SLO tracking)
- Exec reporting? (weekly summaries, high-level)
**Scope**

- Service, environment, region, cluster, endpoint?
- Single service or cross-service view?
**Datasets**

- Which Axiom datasets contain the data?
- Run `getschema` to discover fields—never guess:

```
['dataset']
| where _time between (ago(1h) .. now())
| getschema
```
**Golden signals**

- Traffic: requests/sec, events/min
- Errors: error rate, 5xx count
- Latency: p50, p95, p99 duration
- Saturation: CPU, memory, queue depth, connections
**Drilldown dimensions**

- What do users filter/group by? (service, route, status, pod, customer_id)

## Dashboard Blueprint
Use this 4-section structure as the default:
### 1. At-a-Glance (Statistic panels)
Single numbers that answer "is it broken right now?"
- Error rate (last 5m)
- p95 latency (last 5m)
- Request rate (last 5m)
- Active alerts (if applicable)

### 2. Trends (TimeSeries panels)
Time-based patterns that answer "what changed?"
- Traffic over time
- Error rate over time
- Latency percentiles over time
- Stacked by status/service for comparison

### 3. Breakdowns (Table/Pie panels)
Top-N analysis that answers "where should I look?"
- Top 10 failing routes
- Top 10 error messages
- Worst pods by error rate
- Request distribution by status

### 4. Evidence (LogStream + SmartFilter)
Raw events that answer "what exactly happened?"
- LogStream filtered to errors
- SmartFilter for service/env/route
- Key fields projected for readability

## Chart Types
Note: Dashboard queries inherit time from the UI picker—no explicit _time filter needed.
Validation: TimeSeries, Statistic, Table, Pie, LogStream, Note, MonitorList are fully validated by dashboard-validate. Heatmap, ScatterPlot, SmartFilter work but may trigger warnings.
### Statistic
When: Single KPI, current value, threshold comparison.
```
['logs']
| where service == "api"
| summarize total = count(), errors = countif(status >= 500)
| extend error_rate = round(100.0 * errors / total, 2)
| project error_rate
```
Pitfalls: Don't use for time series; ensure query returns single row.
### TimeSeries
When: Trends over time, before/after comparison, rate changes.
```
// Single metric - use bin_auto for automatic sizing
['logs']
| summarize ['req/min'] = count() by bin_auto(_time)

// Latency percentiles - use percentiles_array for proper overlay
['logs']
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
```
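To compare categories in one panel (e.g., stacked by status, as listed under Trends), group by a second dimension. A minimal sketch, assuming a `status` field exists in your schema:

```
// One series per status value
['logs']
| summarize count() by bin_auto(_time), status
```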
Best practices:
- Use `bin_auto(_time)` instead of fixed `bin(_time, 1m)` — auto-adjusts to time window
- Use `percentiles_array()` instead of multiple `percentile()` calls — renders as one chart
- Too many series = unreadable; use top N or filter

### Table
When: Top-N lists, detailed breakdowns, exportable data.
```
['logs']
| where status >= 500
| summarize errors = count() by route, error_message
| top 10 by errors
| project route, error_message, errors
```
Pitfalls:
- Always use `top N` to prevent unbounded results
- Use `project` to control column order and names

### Pie
When: Share-of-total for LOW cardinality dimensions (≤6 slices).
```
['logs']
| summarize count() by status_class = case(
    status < 300, "2xx",
    status < 400, "3xx",
    status < 500, "4xx",
    "5xx")
```
Pitfalls:
- Never use for high cardinality (routes, user IDs)
- Prefer tables for >6 categories
- Always aggregate to reduce slices

### LogStream
When: Raw event inspection, debugging, evidence gathering.
```
['logs']
| where service == "api" and status >= 500
| project-keep _time, trace_id, route, status, error_message, duration_ms
| take 100
```
Pitfalls:
- Always include `take N` (100-500 max)
- Use `project-keep` to show relevant fields only
- Filter aggressively—raw logs are expensive

### Heatmap
When: Distribution visualization, latency patterns, density analysis.
```
['logs']
| summarize histogram(duration_ms, 15) by bin_auto(_time)
```
Best for: Latency distributions, response time patterns, identifying outliers.
### Scatter Plot
When: Correlation between two metrics, identifying patterns.
```
['logs']
| summarize avg(duration_ms), avg(resp_size_bytes) by route
```
Best for: Response size vs latency correlation, resource usage patterns.
### SmartFilter (Filter Bar)
When: Interactive filtering for the entire dashboard.
SmartFilter is a chart type that creates dropdown/search filters. Requires:
- A SmartFilter chart with filter definitions
- A `declare query_parameters` statement in each panel query
Filter types:
- `selectType: "apl"` — Dynamic dropdown from APL query
- `selectType: "list"` — Static dropdown with predefined options
- `type: "search"` — Free-text input
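For `selectType: "apl"` above, the dropdown options come from a query result. A minimal sketch, assuming the filter offers the distinct values of a `service` field in `['logs']` (the exact JSON wiring is in reference/smartfilter.md):

```
['logs']
| distinct service
| sort by service asc
```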
Panel query pattern:
```
declare query_parameters (country_filter:string = "");
['logs']
| where isempty(country_filter) or ['geo.country'] == country_filter
```
See reference/smartfilter.md for full JSON structure and cascading filter examples.
### Monitor List
When: Display monitor status on operational dashboards.
No APL needed—select monitors from the UI. Shows:
- Monitor status (normal/triggered/off)
- Run history (green/red squares)
- Dataset, type, notifiers

### Note
When: Context, instructions, section headers.
Use GitHub Flavored Markdown for:
- Dashboard purpose and audience
- Runbook links
- Section dividers
- On-call instructions

## Chart Configuration
Charts support JSON configuration options beyond the query. See reference/chart-config.md for full details.
Quick reference:
| Chart Type | Key Options |
|---|---|
| Statistic | colorScheme, customUnits, unit, showChart (sparkline), errorThreshold/warningThreshold |
| TimeSeries | aggChartOpts: variant (line/area/bars), scaleDistr (linear/log), displayNull |
| LogStream/Table | tableSettings: columns, fontSize, highlightSeverity, wrapLines |
| Pie | hideHeader |
| Note | text (markdown), variant |
Common options (all charts):
- `overrideDashboardTimeRange`: boolean
- `overrideDashboardCompareAgainst`: boolean
- `hideHeader`: boolean

## APL Patterns

### Time Filtering in Dashboards vs Ad-hoc Queries
Dashboard panel queries do NOT need explicit time filters. The dashboard UI time picker automatically scopes all queries to the selected time window.
```
// DASHBOARD QUERY — no time filter needed
['logs']
| where service == "api"
| summarize count() by bin_auto(_time)
```
Ad-hoc queries (Axiom Query tab, axiom-sre exploration) MUST have explicit time filters:
```
// AD-HOC QUERY — always include time filter
['logs']
| where _time between (ago(1h) .. now())
| where service == "api"
| summarize count() by bin_auto(_time)
```
### Bin Size Selection
Prefer bin_auto(_time) — it automatically adjusts to the dashboard time window.
Manual bin sizes (only when auto doesn't fit your needs):
| Time window | Bin size |
|---|---|
| 15m | 10s–30s |
| 1h | 1m |
| 6h | 5m |
| 24h | 15m–1h |
| 7d | 1h–6h |
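For example, to pin a 24h dashboard to fixed 15-minute buckets instead of letting `bin_auto` choose (a sketch using the `['logs']` dataset from the examples above):

```
['logs']
| summarize count() by bin(_time, 15m)
```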
### Cardinality Guardrails

Prevent query explosion:
```
// GOOD: bounded
| summarize count() by route
| top 10 by count_

// BAD: unbounded high-cardinality grouping
| summarize count() by user_id   // millions of rows
```
### Field Escaping
Fields with dots need bracket notation:
```
| where ['kubernetes.pod.name'] == "frontend"
```
Fields with dots IN the name (not hierarchy) need escaping:
```
| where ['kubernetes.labels.app\.kubernetes\.io/name'] == "frontend"
```
### Golden Signal Queries
Traffic:
```
| summarize requests = count() by bin_auto(_time)
```
Errors (as rate %):
```
| summarize total = count(), errors = countif(status >= 500) by bin_auto(_time)
| extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0)
| project _time, error_rate
```
Latency (use percentiles_array for proper chart overlay):
```
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
```
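Saturation (a sketch only — `cpu_percent` is an assumed field name; substitute whatever resource metric your dataset actually carries, discovered via `getschema`):

```
| summarize avg_cpu = avg(cpu_percent), max_cpu = max(cpu_percent) by bin_auto(_time)
```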
## Layout Composition

### Grid Principles

- Dashboard width = 12 units
- Typical panel: w=3 (quarter), w=4 (third), w=6 (half), w=12 (full)
- Stats row: 4 panels × w=3, h=2
- TimeSeries row: 2 panels × w=6, h=4
- Tables: w=6 or w=12, h=4–6
- LogStream: w=12, h=6–8

### Section Layout Pattern

```
Row 0-1:  [Stat w=3] [Stat w=3] [Stat w=3] [Stat w=3]
Row 2-5:  [TimeSeries w=6, h=4] [TimeSeries w=6, h=4]
Row 6-9:  [Table w=6, h=4] [Pie w=6, h=4]
Row 10+:  [LogStream w=12, h=6]
```
## Naming Conventions

- Use titles that answer a question: "Error rate by route" not "Errors"
- Prefix with context if multi-service: "[API] Error rate"
- Include units: "Latency (ms)", "Traffic (req/s)"

## Dashboard Settings

### Refresh Rate
Dashboards auto-refresh at the configured interval. Options: 15s, 30s, 1m, 5m, etc.
⚠️ Query cost warning: Short refresh (15s) + long time range (90d) = expensive queries running constantly.
Recommendations:
| Use case | Refresh rate |
|---|---|
| Oncall/real-time | 15s–30s |
| Team health | 1m–5m |
| Executive/weekly | 5m–15m |

### Sharing

- **Just Me**: Private, only you can access
- **Group**: Specific team/group in your org
- **Everyone**: All users in your Axiom org
Data visibility is still governed by dataset permissions—users only see data from datasets they can access.
### URL Time Range Parameters
- `?t_qr=24h` (quick range)
- `?t_ts=...&t_te=...` (custom start/end)
- `?t_against=-1d` (comparison)
## Setup
Run scripts/setup to check requirements (curl, jq, ~/.axiom.toml).
Config in ~/.axiom.toml (shared with axiom-sre):
```
[deployments.prod]
url = "https://api.axiom.co"
token = "xaat-your-token"
org_id = "your-org-id"
```
## Deployment
### Scripts
| Script | Usage |
|---|---|
| scripts/get-user-id | Get your user UUID (owner ID) for a configured deployment |
| scripts/dashboard-from-template | Build dashboard JSON from a template |
| scripts/dashboard-validate | Check dashboard JSON structure before deploying |
| scripts/dashboard-create / dashboard-update | Deploy a new or updated dashboard |
| scripts/dashboard-link | Get the dashboard URL for sharing |
⚠️ CRITICAL: Always validate queries BEFORE deploying.
1. Design dashboard (sections + panels)
2. Write APL for each panel
3. Build JSON (from template or manually)
4. Validate queries using axiom-sre with explicit time filter
5. `dashboard-validate` to check structure
6. `dashboard-create` or `dashboard-update` to deploy
7. `dashboard-link` to get URL — NEVER construct Axiom URLs manually (org IDs and base URLs vary per deployment)
8. Share link with user

## Sibling Skill Integration
spl-to-apl: Translate Splunk SPL → APL. Map timechart → TimeSeries, stats → Statistic/Table. See reference/splunk-migration.md.
axiom-sre: Discover schema with getschema, explore baselines, identify dimensions, then productize into panels.
## Templates
Pre-built templates in reference/templates/:
| Template | Use case |
|---|---|
| service-overview.json | Single service oncall dashboard with Heatmap |
| service-overview-with-filters.json | Same with SmartFilter (route/status dropdowns) |
| api-health.json | HTTP API with traffic/errors/latency |
| blank.json | Minimal skeleton |
Placeholders: {{owner_id}}, {{service}}, {{dataset}}
Usage:
```
USER_ID=$(scripts/get-user-id prod)
scripts/dashboard-from-template service-overview "my-service" "$USER_ID" "my-dataset" ./dashboard.json
scripts/dashboard-validate ./dashboard.json
scripts/dashboard-create prod ./dashboard.json
```
⚠️ Templates assume field names (service, status, route, duration_ms). Discover your schema first and use sed to fix mismatches.
## Common Pitfalls
| Problem | Cause | Solution |
|---|---|---|
| "unable to find dataset" errors | Dataset name doesn't exist in your org | Check available datasets in the Axiom UI |
| "creating dashboards for other users" 403 | Owner ID doesn't match your token | Use `scripts/get-user-id prod` to get your UUID |
| All panels show errors | Field names don't match your schema | Discover schema first, use sed to fix field names |
| Dashboard shows no data | Service filter too restrictive | Remove or adjust `where service == 'x'` filters |
| Queries time out | Missing time filter or too broad | Dashboard inherits time from picker; ad-hoc queries need explicit time filter |
| Wrong org in dashboard URL | Manually constructed URL | Always use `dashboard-link` |
For APL syntax: https://axiom.co/docs/apl/introduction