) to Elastic. Metrics typically include token usage, request counts, latency, and—where the
integration supports it—cost-related fields. Logs may include prompt/response or guardrail events. Exact field names and
data streams are defined by each integration package; discover them from the integration docs or from the target data
stream mapping.
Determine what data is available
List data streams: Run GET _data_stream and filter for traces*, metrics-apm* (or metrics*), and metrics-*/logs-* streams that match known LLM integration datasets (e.g. from Elastic LLM observability).
Inspect trace indices: For traces*, run a small search or use the mapping to see whether spans contain gen_ai.* or llm.* (or similar) attributes. Confirm the presence of token, model, and duration fields.
Inspect integration indices: For metrics/logs data streams, check the mapping or one document to see token, cost, latency, and model dimensions.
Use one source per use case: If both APM and integration data exist, prefer one consistent source for a given question (e.g. use traces for per-request chain analysis, integration metrics for aggregate token/cost).
Check alerts and SLOs: Use the SLOs API and Alerting API to list SLOs and alerting rules that target LLM-related services or integration metrics, and to get open or recently fired alerts. Firing alerts or SLOs in degrading/violated status point to potential degraded performance.
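The discovery steps above can be sketched in Python. This is a minimal illustration, not part of any Elastic client: it assumes you already hold the JSON body of a GET _data_stream response and the properties object of a mapping, and the sample payloads and helper names (classify_streams, llm_fields) are hypothetical.

```python
from fnmatch import fnmatch

# Patterns taken from the steps above; order matters (first match wins).
STREAM_PATTERNS = {
    "traces": ["traces*"],
    "apm_metrics": ["metrics-apm*"],
    "integration": ["metrics-*", "logs-*"],
}
LLM_MARKERS = ("gen_ai.", "llm.")


def classify_streams(data_stream_response):
    """Group data stream names from a GET _data_stream body by source kind."""
    groups = {kind: [] for kind in STREAM_PATTERNS}
    for stream in data_stream_response.get("data_streams", []):
        name = stream["name"]
        for kind, patterns in STREAM_PATTERNS.items():
            if any(fnmatch(name, pattern) for pattern in patterns):
                groups[kind].append(name)
                break
    return groups


def flatten_mapping(properties, prefix=""):
    """Yield dotted field paths from a mapping 'properties' object."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            yield from flatten_mapping(spec["properties"], path + ".")
        else:
            yield path


def llm_fields(properties):
    """Return sorted field paths that contain an LLM-related marker."""
    return sorted(
        path for path in flatten_mapping(properties)
        if any(marker in path for marker in LLM_MARKERS)
    )


# Hypothetical sample payloads, for illustration only.
sample_streams = {"data_streams": [
    {"name": "traces-apm-default"},
    {"name": "metrics-apm.internal-default"},
    {"name": "metrics-aws_bedrock_agentcore.metrics-default"},
]}
sample_properties = {
    "span": {"properties": {
        "duration": {"properties": {"us": {"type": "long"}}},
        "attributes": {"properties": {"gen_ai": {"properties": {
            "usage": {"properties": {"input_tokens": {"type": "long"}}},
            "request": {"properties": {"model": {"type": "keyword"}}},
        }}}},
    }},
}
```

The integration group is only a list of candidates; each stream still has to be checked against the known LLM integration datasets. Mappings that store attributes as flattened or passthrough objects may need a sample document instead of the mapping walk.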
Use cases and query patterns
LLM performance (latency, throughput, errors)
Traces: Run ES|QL on traces* filtered by span attributes (e.g. gen_ai.operation.name or gen_ai.provider.name when present). Compute throughput (count per time bucket), latency (e.g. duration.us or span duration), and error rate (event.outcome == "failure") by model, service, or time.
Integrations: Query integration metrics for request rate, latency, and error metrics by model/dimension as exposed by the integration.
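As a sketch of the trace-based pattern, the helper below assembles an ES|QL query for throughput, average latency, and error rate per time bucket and model. The function name and defaults are hypothetical, and the field names passed in are assumed to come from your own mapping discovery. Only interpolate field names you have verified; this is not safe for untrusted input.

```python
def build_performance_query(model_field, duration_field,
                            index="traces*", window="24 hours", bucket="1 hour"):
    """Return an ES|QL string with requests, failures, latency, and error rate."""
    return "\n".join([
        f"FROM {index}",
        f"| WHERE @timestamp >= NOW() - {window} AND {model_field} IS NOT NULL",
        "| STATS",
        "    requests = COUNT(*),",
        '    failures = COUNT(*) WHERE event.outcome == "failure",',
        f"    avg_duration_us = AVG({duration_field})",
        # Name the bucket so later commands can refer to it.
        f"  BY bucket = BUCKET(@timestamp, {bucket}), model = {model_field}",
        # Cast to double so the ratio is not truncated by integer division.
        "| EVAL error_rate = TO_DOUBLE(failures) / requests",
        "| SORT bucket",
        "| LIMIT 500",
    ])


query = build_performance_query(
    "span.attributes.gen_ai.request.model", "span.duration.us"
)
```

The per-bucket request count doubles as throughput; dividing it by the bucket length gives requests per second if that unit is needed.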
Cost and token utilization
Traces: Aggregate from spans in traces*: sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (or equivalent attribute names) by time, model, or service. If a cost attribute exists (e.g. custom llm.response.cost.), sum it for cost views.
Integrations: Use integration metrics that expose token counts and/or cost; aggregate by time and model.
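Because token attribute names vary between instrumentations, one option is to resolve them against the discovered field list before building the aggregation. The sketch below is illustrative only: the candidate names are assumptions to be replaced with what your mapping shows, and span.attributes.example_cost is a hypothetical cost field.

```python
# Candidate names are assumptions; confirm them against your own mapping.
INPUT_CANDIDATES = [
    "span.attributes.gen_ai.usage.input_tokens",
    "gen_ai.usage.input_tokens",
    "span.attributes.gen_ai.usage.prompt_tokens",
]
OUTPUT_CANDIDATES = [
    "span.attributes.gen_ai.usage.output_tokens",
    "gen_ai.usage.output_tokens",
    "span.attributes.gen_ai.usage.completion_tokens",
]


def first_present(candidates, available):
    """Return the first candidate that exists in the available field names."""
    available = set(available)
    for name in candidates:
        if name in available:
            return name
    return None


def token_stats_clause(available_fields, cost_field=None):
    """Build an ES|QL STATS clause summing tokens and, if present, cost."""
    input_field = first_present(INPUT_CANDIDATES, available_fields)
    output_field = first_present(OUTPUT_CANDIDATES, available_fields)
    if input_field is None or output_field is None:
        raise LookupError("no token usage fields found; check the mapping")
    parts = [
        f"input_tokens = SUM({input_field})",
        f"output_tokens = SUM({output_field})",
    ]
    if cost_field and cost_field in set(available_fields):
        parts.append(f"total_cost = SUM({cost_field})")
    return "| STATS " + ", ".join(parts)


fields = [
    "span.attributes.gen_ai.usage.input_tokens",
    "span.attributes.gen_ai.usage.output_tokens",
    "span.attributes.example_cost",  # hypothetical custom cost attribute
]
clause = token_stats_clause(fields, cost_field="span.attributes.example_cost")
```

Append a BY clause (time bucket, model, or service) to the returned string to get the breakdown you need.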
Response quality and safety
Traces: Use event.outcome, error.type, and span attributes (e.g. gen_ai.response.finish_reasons) in traces* to identify failures, timeouts, or content filters. Correlate with prompts/responses if they are captured in attributes (e.g. gen_ai.input.messages, gen_ai.output.messages) and not redacted.
Integrations: Query integration logs for guardrail blocks, content filter events, or policy violations (e.g. Bedrock Guardrails) using the fields defined by that integration.
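Once span documents have been fetched, a small tally makes failures and filtered responses easy to spot. The sketch below assumes each span has already been flattened into a dict with dotted keys; the sample spans and the finish-reason values in them are hypothetical and depend on the provider and instrumentation.

```python
from collections import Counter


def summarize_outcomes(spans):
    """Count failures, error types, and finish reasons across span dicts."""
    failures = 0
    error_types = Counter()
    finish_reasons = Counter()
    for span in spans:
        if span.get("event.outcome") == "failure":
            failures += 1
            error_types[span.get("error.type", "unknown")] += 1
        reasons = span.get("gen_ai.response.finish_reasons") or []
        if isinstance(reasons, str):  # tolerate a single value instead of a list
            reasons = [reasons]
        finish_reasons.update(reasons)
    return {
        "failures": failures,
        "error_types": dict(error_types),
        "finish_reasons": dict(finish_reasons),
    }


# Hypothetical flattened spans, for illustration only.
sample_spans = [
    {"event.outcome": "success", "gen_ai.response.finish_reasons": ["stop"]},
    {"event.outcome": "success", "gen_ai.response.finish_reasons": ["content_filter"]},
    {"event.outcome": "failure", "error.type": "timeout"},
    {"event.outcome": "success", "gen_ai.response.finish_reasons": "stop"},
]
summary = summarize_outcomes(sample_spans)
```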
Call chaining and agentic workflow orchestration
Traces only: Use the trace hierarchy in traces*. Filter by root service or trace attributes; group by trace.id and use parent/child span relationships (e.g. parent.id, span.id) to reconstruct chains (e.g. orchestration span → multiple LLM or tool-call spans). Aggregate by span name or gen_ai.operation.name to see the distribution of steps (e.g. retrieval, LLM, tool use). Duration per span and per trace reveals bottlenecks and end-to-end latency.
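The chain reconstruction described above can be sketched as a parent/child index over fetched spans. This is a simplified illustration: it assumes flattened span dicts with the keys trace.id, span.id, parent.id, name, and a hypothetical duration_us key, and it treats a span without a parent as the root. In real APM data the root is often a transaction with its own ID field, so adjust the keys accordingly.

```python
from collections import defaultdict


def build_chains(spans):
    """Group spans by trace.id and index children by their parent.id."""
    traces = defaultdict(lambda: {"roots": [], "children": defaultdict(list)})
    for span in spans:
        trace = traces[span["trace.id"]]
        parent = span.get("parent.id")
        if parent is None:
            trace["roots"].append(span)
        else:
            trace["children"][parent].append(span)
    return traces


def walk(trace, span, depth=0):
    """Yield (depth, span) pairs depth-first, starting at the given span."""
    yield depth, span
    for child in trace["children"].get(span["span.id"], []):
        yield from walk(trace, child, depth + 1)


def slowest_step(trace, root):
    """Return the longest-running descendant of root, or None if it has none."""
    descendants = [span for depth, span in walk(trace, root) if depth > 0]
    return max(descendants, key=lambda s: s["duration_us"], default=None)


# Hypothetical spans for one agentic request.
sample_spans = [
    {"trace.id": "t1", "span.id": "a", "parent.id": None, "name": "orchestrate", "duration_us": 900},
    {"trace.id": "t1", "span.id": "b", "parent.id": "a", "name": "retrieval", "duration_us": 100},
    {"trace.id": "t1", "span.id": "c", "parent.id": "a", "name": "chat", "duration_us": 700},
    {"trace.id": "t1", "span.id": "d", "parent.id": "c", "name": "execute_tool", "duration_us": 50},
]
chains = build_chains(sample_spans)
trace = chains["t1"]
root = trace["roots"][0]
steps = [(depth, span["name"]) for depth, span in walk(trace, root)]
```

Spans whose parent is missing from the result set (for example because of a LIMIT) are not reachable from the root in this sketch, so fetch all spans for a trace before walking it.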
Using ES|QL for LLM data
Availability: ES|QL is available in Elasticsearch 8.11+ (GA in 8.14) and in Elastic Observability Serverless.
Scoping: Always restrict by time range (@timestamp). When present, add service.name and optionally service.environment. For LLM-specific spans, filter by span attributes once you know the field names (e.g. a keyword field for gen_ai.provider.name or gen_ai.operation.name).
Performance: Use LIMIT and coarse time buckets when only trends are needed, and avoid full scans over large windows.
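One way to enforce the scoping rule is to attach the time range as a Query DSL filter in the ES|QL request body, so that every query is bounded even if its WHERE clause is not. The sketch below only builds the JSON body for POST /_query and reshapes the default columns/values response; the helper names and sample response are hypothetical, and sending the request is left to whichever HTTP client you use.

```python
def esql_request(query, start, end, service_name=None):
    """Build a POST /_query body scoped by a time range and optional service."""
    filters = [{"range": {"@timestamp": {"gte": start, "lte": end}}}]
    if service_name:
        filters.append({"term": {"service.name": service_name}})
    return {"query": query, "filter": {"bool": {"filter": filters}}}


def rows_as_dicts(response):
    """Convert an ES|QL JSON response (columns + values) into row dicts."""
    names = [column["name"] for column in response["columns"]]
    return [dict(zip(names, row)) for row in response["values"]]


body = esql_request(
    "FROM traces* | STATS requests = COUNT(*) BY model = span.attributes.gen_ai.request.model | LIMIT 100",
    "2025-03-01T00:00:00Z",
    "2025-03-01T23:59:59Z",
    service_name="chat-api",  # hypothetical service name
)

# Hypothetical response, for illustration only.
sample_response = {
    "columns": [{"name": "requests", "type": "long"}, {"name": "model", "type": "keyword"}],
    "values": [[12, "model-a"], [3, "model-b"]],
}
rows = rows_as_dicts(sample_response)
```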
Workflow
LLM observability progress:
- [ ] Step 1: Determine available data (traces, metrics-apm or metrics, or integration data streams)
- [ ] Step 2: Discover LLM-related field names (mapping or sample doc)
- [ ] Step 3: Run ES|QL or Elasticsearch queries for the user's question (performance, cost, quality, orchestration)
- [ ] Step 4: Check for active alerts or SLOs defined on LLM-related data (Alerting API, SLOs API); field names from
Step 2 help identify related rules; firing alerts or violated/degrading SLOs indicate potential degraded performance
- [ ] Step 5: Summarize findings from ingested data only; include alert/SLO status when relevant
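For Step 4, a short filter can pick out the SLOs that matter once the SLO list has been fetched. This sketch assumes a response shaped like a results array whose entries carry name, tags, and summary.status; treat that shape, the status values, and the sample payload as assumptions to verify against your Kibana version.

```python
UNHEALTHY_STATUSES = {"DEGRADING", "VIOLATED"}


def unhealthy_slos(slo_response, keywords):
    """Return (name, status) for unhealthy SLOs whose name or tags match a keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for slo in slo_response.get("results", []):
        text = " ".join([slo.get("name", "")] + list(slo.get("tags", []))).lower()
        status = slo.get("summary", {}).get("status")
        if status in UNHEALTHY_STATUSES and any(k in text for k in lowered):
            hits.append((slo["name"], status))
    return hits


# Hypothetical SLO list, for illustration only.
sample_slos = {"results": [
    {"name": "LLM gateway latency", "tags": ["genai"], "summary": {"status": "VIOLATED"}},
    {"name": "Checkout availability", "tags": [], "summary": {"status": "DEGRADING"}},
    {"name": "Bedrock agent errors", "tags": ["llm"], "summary": {"status": "HEALTHY"}},
]}
flagged = unhealthy_slos(sample_slos, ["llm", "genai", "bedrock"])
```

Keyword matching on names and tags is a heuristic; matching on the field names found in Step 2 inside each SLO's indicator definition is more reliable when those are available.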
Examples
Example: Token usage over time from traces
Assume span attributes are available as span.attributes.gen_ai.usage.input_tokens and span.attributes.gen_ai.usage.output_tokens (adjust to actual field names from mapping):
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.provider.name IS NOT NULL
| STATS
    input_tokens = SUM(span.attributes.gen_ai.usage.input_tokens),
    output_tokens = SUM(span.attributes.gen_ai.usage.output_tokens)
  BY bucket = BUCKET(@timestamp, 1 hour), span.attributes.gen_ai.request.model
// @timestamp no longer exists after STATS, so sort on the named bucket column
| SORT bucket
| LIMIT 500
Example: Latency and error rate by model
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.request.model IS NOT NULL
| STATS
    request_count = COUNT(*),
    failures = COUNT(*) WHERE event.outcome == "failure",
    avg_duration_us = AVG(span.duration.us)
  BY span.attributes.gen_ai.request.model
// Cast before dividing: both counts are integers, and integer division would truncate the rate
| EVAL error_rate = TO_DOUBLE(failures) / request_count
| LIMIT 100
Example: Agentic workflow (trace-level view)
Get trace IDs that contain at least one LLM span and count spans per trace to see chain length:
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.operation.name IS NOT NULL
| STATS span_count = COUNT(*), total_duration_us = SUM(span.duration.us) BY trace.id
| WHERE span_count > 1
| SORT total_duration_us DESC
| LIMIT 50
Example: Integration metrics (Amazon Bedrock AgentCore)
The Amazon Bedrock AgentCore integration ships metrics to the metrics-aws_bedrock_agentcore.metrics-* data stream (time series index). Use TS for aggregations on time series data streams (Elasticsearch 9.2+); restrict the time range with TRANGE (9.3+). The integration also ships dashboards and alerting rule templates.
Example: token usage (counter), invocations (counter), and average latency (gauge) by hour and agent:
TS metrics-aws_bedrock_agentcore.metrics-*
| WHERE TRANGE(7 days)
    AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(RATE(aws.bedrock_agentcore.metrics.TokenCount.sum)),
    total_invocations = SUM(RATE(aws.bedrock_agentcore.metrics.Invocations.sum)),
    avg_latency_ms = AVG(AVG_OVER_TIME(aws.bedrock_agentcore.metrics.Latency.avg))
  BY time_bucket = TBUCKET(1 hour), aws.bedrock_agentcore.agent_name
// Sort on the named bucket column produced by STATS
| SORT time_bucket DESC
For Elasticsearch 8.x or when TS is not available, use FROM with BUCKET(@timestamp, 1 hour) and SUM/AVG over the metric fields (as in the integration's alert rule templates). For other LLM integrations (OpenAI, Azure OpenAI, Vertex AI, etc.), use that integration’s data stream index pattern and field names from its package (see Elastic LLM observability).
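The version rules above (TS from 9.2, TRANGE from 9.3, FROM otherwise) can be captured in a small selector. This is a sketch under those stated thresholds only; the function name, the 7-day window, and the treatment of Serverless as equivalent to 9.3+ are assumptions.

```python
def source_command(version, serverless=False):
    """Pick the ES|QL source command and time filter for a (major, minor) version."""
    if serverless or version >= (9, 3):
        return {"source": "TS", "time_filter": "TRANGE(7 days)"}
    if version >= (9, 2):
        # TS is available, but TRANGE is not: fall back to an @timestamp comparison.
        return {"source": "TS", "time_filter": "@timestamp >= NOW() - 7 days"}
    # Older versions: plain FROM with BUCKET(@timestamp, ...) and SUM/AVG.
    return {"source": "FROM", "time_filter": "@timestamp >= NOW() - 7 days"}


choice = source_command((9, 2))
```

Tuples compare element by element in Python, so (9, 10) correctly sorts after (9, 3); parse the cluster version into integers before calling the function.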
Guidelines
Data only in Elastic: Use only data collected and stored in Elastic (traces in traces*, metrics, or integration metrics/logs). Do not describe or rely on other vendors’ UIs or products.
One technology per customer: Assume a single ingestion path per deployment when answering; discover which one (traces vs integration) exists and use it consistently for the question.
Discover field names: Before writing ES|QL or Query DSL, confirm LLM-related attribute or metric names from _mapping or a sample document; naming may differ (e.g. gen_ai.* vs llm.* or integration-specific fields).
No Kibana UI dependency: Prefer ES|QL and Elasticsearch APIs; use Kibana APIs only when needed (e.g. SLO, alerting). Do not instruct the user to open the Kibana UI.
References: LLM and agentic AI observability, Observability Labs – LLM Observability, OpenTelemetry GenAI spans. For ES|QL syntax and query patterns, use the elasticsearch-esql skill, or consult the ES|QL TS command reference for Elastic v9.3 or higher and for Serverless, and the ES|QL FROM command reference for other Elastic versions.