# phoenix-observability


## Install

```bash
npx skills add https://github.com/davila7/claude-code-templates --skill phoenix-observability
```

## Phoenix - AI Observability Platform

Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.

## When to use Phoenix

Use Phoenix when:

- Debugging LLM application issues with detailed traces
- Running systematic evaluations on datasets
- Monitoring production LLM systems in real time
- Building experiment pipelines for prompt/model comparison
- You need self-hosted observability without vendor lock-in

Key features:

- **Tracing**: OpenTelemetry-based trace collection for any LLM framework
- **Evaluation**: LLM-as-judge evaluators for quality assessment
- **Datasets**: Versioned test sets for regression testing
- **Experiments**: Compare prompts, models, and configurations
- **Playground**: Interactive prompt testing with multiple models
- **Open-source**: Self-hosted with PostgreSQL or SQLite

Use alternatives instead:

- **LangSmith**: Managed platform with LangChain-first integration
- **Weights & Biases**: Deep learning experiment tracking focus
- **Arize Cloud**: Managed Phoenix with enterprise features
- **MLflow**: General ML lifecycle and model registry focus

## Quick start

### Installation

```bash
pip install arize-phoenix
```

```bash
# With specific backends
pip install arize-phoenix[embeddings]  # Embedding analysis
pip install arize-phoenix-otel         # OpenTelemetry config
pip install arize-phoenix-evals        # Evaluation framework
pip install arize-phoenix-client       # Lightweight REST client
```

### Launch Phoenix server

```python
import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()      # Embedded iframe
print(session.url)  # http://localhost:6006
```

### Command-line server (production)

```bash
# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
```

### Basic tracing

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

## Core concepts

### Traces and spans

A trace represents a complete execution flow, while spans are individual operations within that trace.

```python
from phoenix.otel import register
from opentelemetry import trace

# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)
```

### Projects

Projects organize related traces:

```python
import os

os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per-trace
from phoenix.otel import register

tracer_provider = register(project_name="experiment-v2")
```
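The best-practices section below recommends separating traces by environment; a minimal sketch of that pattern, assuming a hypothetical `APP_ENV` variable set by your deployment platform:

```python
# Sketch: derive the Phoenix project name from the deployment environment.
# APP_ENV is a hypothetical variable name; substitute whatever your platform sets.
import os

from phoenix.otel import register

env = os.environ.get("APP_ENV", "dev")  # e.g. dev / staging / prod
tracer_provider = register(project_name=f"chatbot-{env}")
```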

## Framework instrumentation

### OpenAI

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

### LangChain

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations traced
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")
```

### LlamaIndex

```python
from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```

### Anthropic

```python
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
```

## Evaluation framework

### Built-in evaluators

```python
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify,
)

# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France.",
)
```

### Custom evaluators

```python
from phoenix.evals import llm_classify

# Define custom evaluation
def evaluate_helpfulness(input_text, output_text):
    template = """
    Evaluate if the response is helpful for the given question.

    Question: {input}
    Response: {output}

    Is this response helpful? Answer 'helpful' or 'not_helpful'.
    """

    result = llm_classify(
        model=eval_model,
        template=template,
        input=input_text,
        output=output_text,
        rails=["helpful", "not_helpful"],
    )
    return result
```

### Run evaluations on a dataset

```python
from phoenix import Client
from phoenix.evals import run_evals

client = Client()

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model),
    ],
    provide_explanation=True,
)

# Log results back to Phoenix
client.log_evaluations(eval_results)
```

## Datasets and experiments

### Create a dataset

```python
from phoenix import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset",
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"},
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"},
        },
    ],
)
```

### Run an experiment

```python
from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect",
    }

# Run experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1",
)

print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")
```

## Client API

### Query traces and spans

```python
from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000,
)

# Get specific span
span = client.get_span(span_id="abc123")

# Get trace
trace = client.get_trace(trace_id="xyz789")
```

### Log feedback

```python
from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"},
)
```

### Export data

```python
# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")
```
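The exported spans are an ordinary pandas DataFrame, so standard pandas I/O applies if you want to snapshot them for offline analysis; a minimal sketch (the filename is arbitrary):

```python
# Sketch: persist exported spans with standard pandas I/O.
df = client.get_spans_dataframe(project_name="my-app")
df.to_parquet("spans.parquet")  # requires pyarrow or fastparquet
```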

## Production deployment

### Docker

```bash
docker run -p 6006:6006 arizephoenix/phoenix:latest
```

### With PostgreSQL

```bash
# Set database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start server
phoenix serve --host 0.0.0.0 --port 6006
```

### Environment variables

| Variable | Description | Default |
|---|---|---|
| `PHOENIX_PORT` | HTTP server port | `6006` |
| `PHOENIX_HOST` | Server bind address | `127.0.0.1` |
| `PHOENIX_GRPC_PORT` | gRPC/OTLP port | `4317` |
| `PHOENIX_SQL_DATABASE_URL` | Database connection | Temporary SQLite |
| `PHOENIX_WORKING_DIR` | Data storage directory | OS temp directory |
| `PHOENIX_ENABLE_AUTH` | Enable authentication | `false` |
| `PHOENIX_SECRET` | JWT signing secret | Required if auth enabled |

### With authentication

```bash
export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve
```
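The same variables can be passed to the Docker image shown above; a sketch, with placeholder credentials:

```bash
# Sketch: configure the Phoenix container via the environment variables above.
docker run -p 6006:6006 -p 4317:4317 \
  -e PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix" \
  -e PHOENIX_ENABLE_AUTH=true \
  -e PHOENIX_SECRET="your-secret-key-min-32-chars" \
  arizephoenix/phoenix:latest
```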

## Best practices

- **Use projects**: Separate traces by environment (dev/staging/prod)
- **Add metadata**: Include user IDs and session IDs for debugging
- **Evaluate regularly**: Run automated evaluations in CI/CD (see the sketch after this list)
- **Version datasets**: Track test set changes over time
- **Monitor costs**: Track token usage via Phoenix dashboards
- **Self-host**: Use PostgreSQL for production deployments
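As a sketch of the CI/CD point above: a gate script could re-run the experiment from the datasets section and fail the build when accuracy regresses. The script name and threshold are illustrative choices, not Phoenix conventions, and `my_model`/`accuracy_evaluator` are the functions defined in the experiment example:

```python
# ci_eval_gate.py - illustrative CI gate built on the experiment example above.
import sys

from phoenix.experiments import run_experiment

results = run_experiment(
    dataset_name="qa-test-set",       # versioned dataset from earlier
    task=my_model,                    # model function under test
    evaluators=[accuracy_evaluator],  # evaluator from the experiment example
    experiment_name="ci-check",
)

MIN_ACCURACY = 0.9  # hypothetical regression threshold
if results.aggregate_metrics["accuracy"] < MIN_ACCURACY:
    sys.exit(1)  # non-zero exit fails the CI job
```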

## Common issues

**Traces not appearing:**

```python
from phoenix.otel import register

# Verify endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces",  # Correct endpoint
)

# Force flush
from opentelemetry import trace

trace.get_tracer_provider().force_flush()
```

**High memory in notebook:**

```python
# Close session when done
session = px.launch_app()

# ... do work ...

session.close()
px.close_app()
```

**Database connection issues:**

```bash
# Verify PostgreSQL connection
psql $PHOENIX_SQL_DATABASE_URL -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug
```

## References

- Advanced Usage - Custom evaluators, experiments, production setup
- Troubleshooting - Common issues, debugging, performance

## Resources

- Documentation: https://docs.arize.com/phoenix
- Repository: https://github.com/Arize-ai/phoenix
- Docker Hub: https://hub.docker.com/r/arizephoenix/phoenix
- Version: 12.0.0+
- License: Apache 2.0
