Arize Prompt Optimization Skill

Concepts

Where Prompts Live in Trace Data

LLM applications emit spans following OpenInference semantic conventions. Prompts are stored in different span attributes depending on the span kind and instrumentation:
| Column | What it contains | When to use |
| --- | --- | --- |
| `attributes.llm.input_messages` | Structured chat messages (system, user, assistant, tool) in role-based format | Primary source for chat-based LLM prompts |
| `attributes.llm.input_messages.roles` | Array of roles: `system`, `user`, `assistant`, `tool` | Extract individual message roles |
| `attributes.llm.input_messages.contents` | Array of message content strings | Extract message text |
| `attributes.input.value` | Serialized prompt or user question (generic, all span kinds) | Fallback when structured messages are not available |
| `attributes.llm.prompt_template.template` | Template with `{variable}` placeholders (e.g., `"Answer {question} using {context}"`) | When the app uses prompt templates |
| `attributes.llm.prompt_template.variables` | Template variable values (JSON object) | See what values were substituted into the template |
| `attributes.output.value` | Model response text | See what the LLM produced |
| `attributes.llm.output_messages` | Structured model output (including tool calls) | Inspect tool-calling responses |
Finding Prompts by Span Kind

- LLM span (`attributes.openinference.span.kind = 'LLM'`): Check `attributes.llm.input_messages` for structured chat messages, or `attributes.input.value` for a serialized prompt. Check `attributes.llm.prompt_template.template` for the template.
- Chain/Agent span: `attributes.input.value` contains the user's question. The actual LLM prompt lives on child LLM spans, so navigate down the trace tree.
- Tool span: `attributes.input.value` has the tool input and `attributes.output.value` has the tool result. Prompts do not typically live here.
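The lookup order above can be sketched in Python. This is a minimal sketch, assuming each exported span is a dict whose `attributes` object is nested the way the jq paths in this document suggest; real exports may instead use flat dotted keys, in which case `dig` would need adjusting. The helper names `dig` and `find_prompt` and the sample spans are hypothetical.

```python
from typing import Any, Optional


def dig(obj: Any, *keys: str) -> Optional[Any]:
    """Walk nested dicts, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def find_prompt(span: dict) -> dict:
    """Pick the best available prompt source on one span (hypothetical helper)."""
    kind = dig(span, "attributes", "openinference", "span", "kind")
    messages = dig(span, "attributes", "llm", "input_messages")
    template = dig(span, "attributes", "llm", "prompt_template", "template")
    fallback = dig(span, "attributes", "input", "value")

    if kind != "LLM":
        # Chain/Agent/Tool spans: the actual prompt lives on child LLM spans.
        return {"source": "not_an_llm_span", "prompt": None, "template": template}
    if messages:
        # Structured chat messages are the primary source.
        return {"source": "llm.input_messages", "prompt": messages, "template": template}
    # Fall back to the serialized prompt.
    return {"source": "input.value", "prompt": fallback, "template": template}


# Illustrative spans (made-up data, assumed nested layout).
llm_span = {
    "attributes": {
        "openinference": {"span": {"kind": "LLM"}},
        "llm": {"input_messages": [{"role": "system", "content": "You are a helpful assistant."}]},
        "input": {"value": "serialized prompt"},
    }
}
plain_llm_span = {
    "attributes": {
        "openinference": {"span": {"kind": "LLM"}},
        "input": {"value": "serialized prompt"},
    }
}
chain_span = {
    "attributes": {
        "openinference": {"span": {"kind": "CHAIN"}},
        "input": {"value": "What is our refund policy?"},
    }
}

for span in (llm_span, plain_llm_span, chain_span):
    print(find_prompt(span)["source"])
```

When the result is `not_an_llm_span`, the next step is to look at the child LLM spans of the same trace rather than at the span itself.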
Performance Signal Columns

These columns carry the feedback data used for optimization:

| Column pattern | Source | What it tells you |
| --- | --- | --- |
| `annotation.<name>.label` | Human reviewers | Categorical grade (e.g., `correct`, `incorrect`, `partial`) |
| `annotation.<name>.score` | Human reviewers | Numeric quality score (e.g., 0.0 - 1.0) |
| `annotation.<name>.text` | Human reviewers | Freeform explanation of the grade |
| `eval.<name>.label` | LLM-as-judge evals | Automated categorical assessment |
| `eval.<name>.score` | LLM-as-judge evals | Automated numeric score |
| `eval.<name>.explanation` | LLM-as-judge evals | Why the eval gave that score; most valuable for optimization |
| `attributes.input.value` | Trace data | What went into the LLM |
| `attributes.output.value` | Trace data | What the LLM produced |
| `{experiment_name}.output` | Experiment runs | Output from a specific experiment |
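As an illustration of how these column patterns can be used, the sketch below groups feedback columns by source and name. It assumes a record is a flat dict keyed by dotted column names such as `eval.correctness.label`; that layout, the helper name `collect_signals`, and the sample record are assumptions for illustration, not a documented export format.

```python
import re
from collections import defaultdict

# Matches annotation.<name>.<field> and eval.<name>.<field> column names.
SIGNAL_PATTERN = re.compile(
    r"^(annotation|eval)\.(.+)\.(label|score|text|explanation)$"
)


def collect_signals(record: dict) -> dict:
    """Group feedback columns as {"eval.correctness": {"label": ..., ...}}."""
    signals = defaultdict(dict)
    for column, value in record.items():
        match = SIGNAL_PATTERN.match(column)
        if match:
            source, name, field = match.groups()
            signals[f"{source}.{name}"][field] = value
    return dict(signals)


# Illustrative record (made-up values).
record = {
    "attributes.input.value": "When was the company founded?",
    "attributes.output.value": "It was founded a long time ago.",
    "eval.correctness.label": "incorrect",
    "eval.correctness.explanation": "The answer omits the founding year.",
    "annotation.tone.score": 0.8,
}

signals = collect_signals(record)
print(signals)
```

Grouping this way keeps each explanation next to the label and score it justifies, which is the pairing the meta-prompt in Phase 3 relies on.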
Prerequisites

Three things are needed: the `ax` CLI, an API key (env var or profile), and a project. A space ID is also needed when using project names.

Install ax

Verify `ax` is installed and working before proceeding:

1. Check if `ax` is on PATH: `command -v ax` (Unix) or `where ax` (Windows)
2. If not found, check common install locations:
   - macOS/Linux: `test -x ~/.local/bin/ax && export PATH="$HOME/.local/bin:$PATH"`
   - Windows: check `%APPDATA%\Python\Scripts\ax.exe` or `%LOCALAPPDATA%\Programs\Python\Scripts\ax.exe`
3. If still not found, install it (requires shell access to install packages):
   - Preferred: `uv tool install arize-ax-cli`
   - Alternative: `pipx install arize-ax-cli`
   - Fallback: `pip install arize-ax-cli`
4. After install, if `ax` is not on PATH:
   - macOS/Linux: `export PATH="$HOME/.local/bin:$PATH"`
   - Windows (PowerShell): `$env:PATH = "$env:APPDATA\Python\Scripts;$env:PATH"`
5. If `ax --version` fails with an SSL/certificate error:
   - macOS: `export SSL_CERT_FILE=/etc/ssl/cert.pem`
   - Linux: `export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt`
   - Windows (PowerShell): `$env:SSL_CERT_FILE = "C:\Program Files\Common Files\SSL\cert.pem"` (or use `python -c "import certifi; print(certifi.where())"` to find the cert bundle)
6. `ax --version` must succeed before proceeding. If it doesn't, stop and ask the user for help.
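The PATH and install-location checks above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names `default_dirs` and `locate_ax` are hypothetical, and the directories are the ones listed in the steps above.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, Optional


def default_dirs() -> list:
    """Common install locations named in the steps above."""
    dirs = [Path.home() / ".local" / "bin"]
    windows_locations = (
        ("APPDATA", ("Python", "Scripts")),
        ("LOCALAPPDATA", ("Programs", "Python", "Scripts")),
    )
    for env_var, tail in windows_locations:
        base = os.environ.get(env_var)
        if base:
            dirs.append(Path(base).joinpath(*tail))
    return dirs


def locate_ax(extra_dirs: Optional[Iterable] = None) -> Optional[str]:
    """Return the path to the ax executable, or None if it cannot be found."""
    found = shutil.which("ax")  # step 1: is it already on PATH?
    if found:
        return found
    candidates = default_dirs() if extra_dirs is None else extra_dirs
    for directory in candidates:  # step 2: common install locations
        found = shutil.which("ax", path=str(directory))
        if found:
            return found
    return None


location = locate_ax()
print(location if location else "ax not found; install it as described above")
```

A `None` result corresponds to step 3: install the CLI, then run the check again.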
Verify environment

Run a quick check for credentials:

macOS/Linux (bash):

    ax --version && echo "--- env ---" && if [ -n "$ARIZE_API_KEY" ]; then echo "ARIZE_API_KEY: (set)"; else echo "ARIZE_API_KEY: (not set)"; fi && echo "ARIZE_SPACE_ID: ${ARIZE_SPACE_ID:-(not set)}" && echo "ARIZE_DEFAULT_PROJECT: ${ARIZE_DEFAULT_PROJECT:-(not set)}" && echo "--- profiles ---" && ax profiles show 2>&1

Windows (PowerShell):

    ax --version; Write-Host "--- env ---"; Write-Host "ARIZE_API_KEY: $(if ($env:ARIZE_API_KEY) { '(set)' } else { '(not set)' })"; Write-Host "ARIZE_SPACE_ID: $env:ARIZE_SPACE_ID"; Write-Host "ARIZE_DEFAULT_PROJECT: $env:ARIZE_DEFAULT_PROJECT"; Write-Host "--- profiles ---"; ax profiles show 2>&1
Read the output and proceed immediately if either the env var or the profile has an API key. Only ask the user if both are missing. Resolve failures:

- No API key in env and no profile → AskQuestion: "Arize API key (https://app.arize.com/admin > API Keys)"
- Space ID unknown → AskQuestion, or run `ax projects list -o json --limit 100` and search for a match
- Project unclear → ask, or run `ax projects list -o json --limit 100` and present the results as selectable options

Default Project

If `ARIZE_DEFAULT_PROJECT` is set (visible in the output above), use its value as the project for all commands in this session. Do NOT ask the user for a project ID; just use it. Continue using this default until the user explicitly provides a different project. If `ARIZE_DEFAULT_PROJECT` is not set and no project is provided, ask the user for one.

Phase 1: Extract the Current Prompt

Find LLM spans containing prompts
    # List LLM spans (where prompts live)
    ax spans list PROJECT_ID --filter "attributes.openinference.span.kind = 'LLM'" --limit 10

    # Filter by model
    ax spans list PROJECT_ID --filter "attributes.llm.model_name = 'gpt-4o'" --limit 10

    # Filter by span name (e.g., a specific LLM call)
    ax spans list PROJECT_ID --filter "name = 'ChatCompletion'" --limit 10

Export a trace to inspect prompt structure

    # Export all spans in a trace
    ax spans export --trace-id TRACE_ID --project PROJECT_ID

    # Export a single span
    ax spans export --span-id SPAN_ID --project PROJECT_ID

Extract prompts from exported JSON

    # Extract structured chat messages (system + user + assistant)
    jq '.[0] | { messages: .attributes.llm.input_messages, model: .attributes.llm.model_name }' trace_*/spans.json

    # Extract the system prompt specifically
    jq '[.[] | select(.attributes.llm.input_messages.roles[]? == "system")] | .[0].attributes.llm.input_messages' trace_*/spans.json

    # Extract prompt template and variables
    jq '.[0].attributes.llm.prompt_template' trace_*/spans.json

    # Extract from input.value (fallback for non-structured prompts)
    jq '.[0].attributes.input.value' trace_*/spans.json

Reconstruct the prompt as messages

Once you have the span data, reconstruct the prompt as a messages array:

    [
      { "role": "system", "content": "You are a helpful assistant that..." },
      { "role": "user", "content": "Given {input}, answer the question: {question}" }
    ]

If the span has `attributes.llm.prompt_template.template`, the prompt uses variables. Preserve these placeholders (`{variable}` or `{{variable}}`); they are substituted at runtime.

Phase 2: Gather Performance Data

From traces (production feedback)
    # Find error spans -- these indicate prompt failures
    ax spans list PROJECT_ID \
      --filter "status_code = 'ERROR' AND attributes.openinference.span.kind = 'LLM'" \
      --limit 20

    # Find spans that human reviewers labeled incorrect
    ax spans list PROJECT_ID \
      --filter "annotation.correctness.label = 'incorrect'" \
      --limit 20

    # Find spans with high latency (may indicate overly complex prompts)
    ax spans list PROJECT_ID \
      --filter "attributes.openinference.span.kind = 'LLM' AND latency_ms > 10000" \
      --limit 20

    # Export error traces for detailed inspection
    ax spans export --trace-id TRACE_ID --project PROJECT_ID

From datasets and experiments
    # Export a dataset (ground truth examples)
    ax datasets export DATASET_ID
    # -> dataset_*/examples.json

    # Export experiment results (what the LLM produced)
    ax experiments export EXPERIMENT_ID
    # -> experiment_*/runs.json

Merge dataset + experiment for analysis

Join the two files by `example_id` to see inputs alongside outputs and evaluations:

    # Count examples and runs
    jq 'length' dataset_*/examples.json
    jq 'length' experiment_*/runs.json

    # View a single joined record
    jq -s '
      .[0] as $dataset
      | .[1][0] as $run
      | ($dataset[] | select(.id == $run.example_id)) as $example
      | { input: $example, output: $run.output, evaluations: $run.evaluations }
    ' dataset_*/examples.json experiment_*/runs.json

    # Find failed examples (where eval score < threshold)
    jq '[.[] | select(.evaluations.correctness.score < 0.5)]' experiment_*/runs.json
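The same join and failure filter can be sketched in Python for cases where more processing is needed than jq comfortably allows. This is a minimal sketch; the field names (`id`, `example_id`, `output`, `evaluations.correctness.score`) are taken from the jq commands above, while the function names and the sample data are hypothetical.

```python
def join_runs(examples: list, runs: list) -> list:
    """Attach each run to its dataset example, matching on example_id."""
    by_id = {example["id"]: example for example in examples}
    joined = []
    for run in runs:
        example = by_id.get(run.get("example_id"))
        if example is None:
            continue  # run refers to an example that is not in the dataset
        joined.append({
            "example_id": example["id"],
            "input": example,
            "output": run.get("output"),
            "evaluations": run.get("evaluations") or {},
        })
    return joined


def failing(joined: list, eval_name: str = "correctness", threshold: float = 0.5) -> list:
    """Keep records whose eval score exists and is below the threshold."""
    failures = []
    for record in joined:
        score = (record["evaluations"].get(eval_name) or {}).get("score")
        if score is not None and score < threshold:
            failures.append(record)
    return failures


# Illustrative data (made up); in practice, load the exported JSON files.
examples = [
    {"id": "ex1", "input": "2 + 2?", "expected_output": "4"},
    {"id": "ex2", "input": "Capital of France?", "expected_output": "Paris"},
    {"id": "ex3", "input": "Largest planet?", "expected_output": "Jupiter"},
]
runs = [
    {"example_id": "ex1", "output": "5", "evaluations": {"correctness": {"score": 0.2}}},
    {"example_id": "ex2", "output": "Paris", "evaluations": {"correctness": {"score": 0.9}}},
    {"example_id": "ex3", "output": "Jupiter"},  # not evaluated yet
    {"example_id": "unknown", "output": "?"},
]

joined = join_runs(examples, runs)
failed_ids = [record["example_id"] for record in failing(joined)]
print(failed_ids)
```

One design choice to note: this sketch skips runs that have no score. In jq, `null` sorts below every number, so the `select(... < 0.5)` filter above would also return unevaluated runs; which behavior is preferable depends on whether missing evaluations should count as failures.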
Identify what to optimize

Look for patterns across failures:

- Compare outputs to ground truth: Where does the LLM output differ from expected?
- Read eval explanations: `eval.*.explanation` tells you WHY something failed
- Check annotation text: Human feedback describes specific issues
- Look for verbosity mismatches: Are outputs too long or too short compared to ground truth?
- Check format compliance: Are outputs in the expected format?

Phase 3: Optimize the Prompt

The Optimization Meta-Prompt

Use this template to generate an improved version of the prompt. Fill in the two placeholders (`{PASTE_ORIGINAL_PROMPT_HERE}` and `{PASTE_RECORDS_HERE}`) and send it to your LLM (GPT-4o, Claude, etc.):

    You are an expert in prompt optimization. Given the original baseline prompt
    and the associated performance data (inputs, outputs, evaluation labels, and
    explanations), generate a revised version that improves results.

    ORIGINAL BASELINE PROMPT
    ========================

    {PASTE_ORIGINAL_PROMPT_HERE}

    ========================

    PERFORMANCE DATA
    ================

    The following records show how the current prompt performed. Each record
    includes the input, the LLM output, and evaluation feedback:

    {PASTE_RECORDS_HERE}

    ================

    HOW TO USE THIS DATA

    1. Compare outputs: Look at what the LLM generated vs what was expected
    2. Review eval scores: Check which examples scored poorly and why
    3. Examine annotations: Human feedback shows what worked and what didn't
    4. Identify patterns: Look for common issues across multiple examples
    5. Focus on failures: The rows where the output DIFFERS from the expected
       value are the ones that need fixing

    ALIGNMENT STRATEGY

    - If outputs have extra text or reasoning not present in the ground truth,
      remove instructions that encourage explanation or verbose reasoning
    - If outputs are missing information, add instructions to include it
    - If outputs are in the wrong format, add explicit format instructions
    - Focus on the rows where the output differs from the target -- these are
      the failures to fix

    RULES

    Maintain Structure:
    - Use the same template variables as the current prompt ({var} or {{var}})
    - Don't change sections that are already working
    - Preserve the exact return format instructions from the original prompt

    Avoid Overfitting:
    - DO NOT copy examples verbatim into the prompt
    - DO NOT quote specific test data outputs exactly
    - INSTEAD: Extract the ESSENCE of what makes good vs bad outputs
    - INSTEAD: Add general guidelines and principles
    - INSTEAD: If adding few-shot examples, create SYNTHETIC examples that
      demonstrate the principle, not real data from above

    Goal: Create a prompt that generalizes well to new inputs, not one that
    memorizes the test data.

    OUTPUT FORMAT

    Return the revised prompt as a JSON array of messages:

    [
      {"role": "system", "content": "..."},
      {"role": "user", "content": "..."}
    ]

    Also provide a brief reasoning section (bulleted list) explaining:
    - What problems you found
    - How the revised prompt addresses each one

Preparing the performance data

Format the records as a JSON array before pasting into the template:
    # From dataset + experiment: join and select relevant columns
    jq -s '
      .[0] as $ds
      | [.[1][] | . as $run
          | ($ds[] | select(.id == $run.example_id)) as $ex
          | {
              input: $ex.input,
              expected: $ex.expected_output,
              actual_output: $run.output,
              eval_score: $run.evaluations.correctness.score,
              eval_label: $run.evaluations.correctness.label,
              eval_explanation: $run.evaluations.correctness.explanation
            }
        ]
    ' dataset_*/examples.json experiment_*/runs.json

    # From exported spans: extract input/output pairs with status and model
    jq '[.[] | select(.attributes.openinference.span.kind == "LLM") | {
      input: .attributes.input.value,
      output: .attributes.output.value,
      status: .status_code,
      model: .attributes.llm.model_name
    }]' trace_*/spans.json

Applying the revised prompt

After the LLM returns the revised messages array:

1. Compare the original and revised prompts side by side
2. Verify all template variables are preserved
3. Check that format instructions are intact
4. Test on a few examples before full deployment

Phase 4: Iterate

The optimization loop

    1. Extract prompt    -> Phase 1 (once)
    2. Run experiment    -> ax experiments create ...
    3. Export results    -> ax experiments export EXPERIMENT_ID
    4. Analyze failures  -> jq to find low scores
    5. Run meta-prompt   -> Phase 3 with new failure data
    6. Apply revised prompt
    7. Repeat from step 2

Measure improvement
    # Compare scores across experiments

    # Experiment A (baseline)
    jq '[.[] | .evaluations.correctness.score] | add / length' experiment_a/runs.json

    # Experiment B (optimized)
    jq '[.[] | .evaluations.correctness.score] | add / length' experiment_b/runs.json

    # Find examples that flipped from fail to pass
    jq -s '
      [.[0][] | select(.evaluations.correctness.label == "incorrect")] as $fails
      | [.[1][]
          | select(.evaluations.correctness.label == "correct")
          | select(.example_id as $id | $fails | any(.example_id == $id))
        ]
      | length
    ' experiment_a/runs.json experiment_b/runs.json

A/B compare two prompts

1. Create two experiments against the same dataset, each using a different prompt version
2. Export both: `ax experiments export EXP_A` and `ax experiments export EXP_B`
3. Compare average scores, failure rates, and specific example flips
4. Check for regressions: examples that passed with prompt A but fail with prompt B

Prompt Engineering Best Practices

Apply these when writing or revising prompts:

| Technique | When to apply | Example |
| --- | --- | --- |
| Clear, detailed instructions | Output is vague or off-topic | "Classify the sentiment as exactly one of: positive, negative, neutral" |
| Instructions at the beginning | Model ignores later instructions | Put the task description before examples |
| Step-by-step breakdowns | Complex multi-step processes | "First extract entities, then classify each, then summarize" |
| Specific personas | Need consistent style/tone | "You are a senior financial analyst writing for institutional investors" |
| Delimiter tokens | Sections blend together | Use delimiter tokens or XML tags to separate input from instructions |
| Few-shot examples | Output format needs clarification | Show 2-3 synthetic input/output pairs |
| Output length specifications | Responses are too long or short | "Respond in exactly 2-3 sentences" |
| Reasoning instructions | Accuracy is critical | "Think step by step before answering" |
| "I don't know" guidelines | Hallucination is a risk | "If the answer is not in the provided context, say 'I don't have enough information'" |

Variable preservation

When optimizing prompts that use template variables:

- Single braces (`{variable}`): Python f-string / `str.format` style. Most common in Arize.
- Double braces (`{{variable}}`): Mustache / Jinja style. Used when the framework requires it.
- Never add or remove variable placeholders during optimization
- Never rename variables; the runtime substitution depends on exact names
- If adding few-shot examples, use literal values, not variable placeholders

Workflows

Optimize a prompt from a failing trace

1. Find failing traces:

        ax traces list PROJECT_ID --filter "status_code = 'ERROR'" --limit 5

2. Export the trace:

        ax spans export --trace-id TRACE_ID --project PROJECT_ID

3. Extract the prompt from the LLM span:

        jq '[.[] | select(.attributes.openinference.span.kind == "LLM")][0] | {
          messages: .attributes.llm.input_messages,
          template: .attributes.llm.prompt_template,
          output: .attributes.output.value,
          error: .attributes.exception.message
        }' trace_*/spans.json

4. Identify what failed from the error message or output
5. Fill in the optimization meta-prompt (Phase 3) with the prompt and error context
6. Apply the revised prompt

Optimize using a dataset and experiment

1. Find the dataset and experiment:

        ax datasets list
        ax experiments list --dataset-id DATASET_ID

2. Export both:

        ax datasets export DATASET_ID
        ax experiments export EXPERIMENT_ID

3. Prepare the joined data for the meta-prompt
4. Run the optimization meta-prompt
5. Create a new experiment with the revised prompt to measure improvement

Debug a prompt that produces wrong format

1. Export spans where the output format is wrong:

        ax spans list PROJECT_ID \
          --filter "attributes.openinference.span.kind = 'LLM' AND annotation.format.label = 'incorrect'" \
          --limit 10 -o json > bad_format.json

2. Look at what the LLM is producing vs what was expected
3. Add explicit format instructions to the prompt (JSON schema, examples, delimiters)
4. Common fix: add a few-shot example showing the exact desired output format

Reduce hallucination in a RAG prompt

1. Find traces where the model hallucinated:

        ax spans list PROJECT_ID \
          --filter "annotation.faithfulness.label = 'unfaithful'" \
          --limit 20

2. Export and inspect the retriever + LLM spans together:

        ax spans export --trace-id TRACE_ID --project PROJECT_ID
        jq '[.[] | {kind: .attributes.openinference.span.kind, name, input: .attributes.input.value, output: .attributes.output.value}]' trace_*/spans.json

3. Check if the retrieved context actually contained the answer
4. Add grounding instructions to the system prompt: "Only use information from the provided context. If the answer is not in the context, say so."

Troubleshooting

| Problem | Solution |
| --- | --- |
| `ax: command not found` | macOS/Linux: check `~/.local/bin/ax`. Windows: check if `ax` is on PATH. If missing: `uv tool install arize-ax-cli` (requires shell access to install packages) |
| `No profile found` | Create `~/.arize/config.toml` with `api_key = "${ARIZE_API_KEY}"` (see Prerequisites) |
| No `input_messages` on span | Check span kind. Chain/Agent spans store prompts on child LLM spans, not on themselves |
| Prompt template is `null` | Not all instrumentations emit `prompt_template`. Use `input_messages` or `input.value` instead |
| Variables lost after optimization | Verify the revised prompt preserves all `{var}` placeholders from the original |
| Optimization makes things worse | Check for overfitting; the meta-prompt may have memorized test data. Ensure few-shot examples are synthetic |
| No eval/annotation columns | Run evaluations first (via Arize UI or SDK), then re-export |
| Experiment output column not found | The column name is `{experiment_name}.output`. Check the exact experiment name via `ax experiments get` |
| `jq` errors on span JSON | Ensure you're targeting the correct file path (e.g., `trace_*/spans.json`) |
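
The "Variables lost after optimization" check can be automated before a revised prompt is applied. The sketch below compares the placeholder names in the original and revised messages arrays. It is a minimal sketch that assumes placeholders are simple word-like names in `{var}` or `{{var}}` form; the function names and the sample prompts are hypothetical.

```python
import re

# Double-brace form is tried first so "{{var}}" is not read as "{var}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}|\{(\w+)\}")


def placeholders(messages: list) -> set:
    """Collect placeholder names from every message's content."""
    names = set()
    for message in messages:
        content = message.get("content") or ""
        for double, single in PLACEHOLDER.findall(content):
            names.add(double or single)
    return names


def check_preserved(original: list, revised: list) -> dict:
    """Report placeholders that were dropped or introduced by the revision."""
    before, after = placeholders(original), placeholders(revised)
    return {"missing": sorted(before - after), "added": sorted(after - before)}


# Illustrative prompts (made up).
original = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Given {input}, answer the question: {question}"},
]
revised = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Answer {question} using {context}"},
]

report = check_preserved(original, revised)
print(report)
```

An empty `missing` and `added` list means the revision kept the same variables. Anything else should be fixed before the revised prompt is run in an experiment.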