# langsmith — LLM Observability, Evaluation & Prompt Management

**Keywords:** `langsmith` · `llm tracing` · `llm evaluation` · `@traceable` · `langsmith evaluate`
LangSmith is a framework-agnostic platform for developing, debugging, and deploying LLM applications. It provides end-to-end tracing, quality evaluation, prompt versioning, and production monitoring.
## When to use this skill

- Add tracing to any LLM pipeline (OpenAI, Anthropic, LangChain, custom models)
- Run offline evaluations with `evaluate()` against a curated dataset
- Set up production monitoring and online evaluation
- Manage and version prompts in the Prompt Hub
- Create datasets for regression testing and benchmarking
- Attach human or automated feedback to traces
- Use LLM-as-judge scoring with `openevals`
- Debug agent failures with end-to-end trace inspection
## Instructions

1. Install the SDK: `pip install -U langsmith` (Python) or `npm install langsmith` (TypeScript)
2. Set environment variables: `LANGSMITH_TRACING=true`, `LANGSMITH_API_KEY=lsv2_...`
3. Instrument with the `@traceable` decorator or the `wrap_openai()` wrapper
4. View traces at smith.langchain.com
5. For evaluation setup, see `references/python-sdk.md`
6. For CLI commands, see `references/cli.md`
7. Run `bash scripts/setup.sh` to auto-configure the environment
**API Key:** Get from smith.langchain.com → Settings → API Keys

**Docs:** https://docs.langchain.com/langsmith
## Quick Start

### Python

```bash
pip install -U langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
export OPENAI_API_KEY="sk-..."
```

```python
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

@traceable
def rag_pipeline(question: str) -> str:
    """Automatically traced in LangSmith"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

result = rag_pipeline("What is LangSmith?")
```

### TypeScript

```bash
npm install langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
```

```typescript
import { traceable } from "langsmith/traceable";
import { wrapOpenAI } from "langsmith/wrappers";
import { OpenAI } from "openai";

const client = wrapOpenAI(new OpenAI());

const pipeline = traceable(
  async (question: string): Promise<string> => {
    const res = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: question }],
    });
    return res.choices[0].message.content ?? "";
  },
  { name: "RAG Pipeline" }
);

await pipeline("What is LangSmith?");
```

## Core Concepts

| Concept | Description |
|---|---|
| Run | Individual operation (LLM call, tool call, retrieval). The fundamental unit. |
| Trace | All runs from a single user request, linked by `trace_id`. |
| Thread | Multiple traces in a conversation, linked by `session_id` or `thread_id`. |
| Project | Container grouping related traces (set via `LANGSMITH_PROJECT`). |
| Dataset | Collection of `{inputs, outputs}` examples for offline evaluation. |
| Experiment | Result set from running `evaluate()` against a dataset. |
| Feedback | Score/label attached to a run — numeric, categorical, or freeform. |

## Tracing

### `@traceable` decorator (Python)

```python
from langsmith import traceable

@traceable(
    run_type="chain",  # llm | chain | tool | retriever | embedding
    name="My Pipeline",
    tags=["production", "v2"],
    metadata={"version": "2.1", "env": "prod"},
    project_name="my-project"
)
def pipeline(question: str) -> str:
    return generate_answer(question)
```

### Selective tracing context

```python
import langsmith as ls

# Enable tracing for this block only
with ls.tracing_context(enabled=True, project_name="debug"):
    result = chain.invoke({"input": "..."})

# Disable tracing despite LANGSMITH_TRACING=true
with ls.tracing_context(enabled=False):
    result = chain.invoke({"input": "..."})
```

### Wrap provider clients

```python
from langsmith.wrappers import wrap_openai, wrap_anthropic
from openai import OpenAI
import anthropic

openai_client = wrap_openai(OpenAI())  # All calls auto-traced
anthropic_client = wrap_anthropic(anthropic.Anthropic())
```

### Distributed tracing (microservices)

```python
from langsmith.run_helpers import get_current_run_tree
import langsmith

@langsmith.traceable
def service_a(inputs):
    rt = get_current_run_tree()
    headers = rt.to_headers()
    # Pass to child service
    return call_service_b(headers=headers)

@langsmith.traceable
def service_b(x, headers):
    with langsmith.tracing_context(parent=headers):
        return process(x)
```

## Evaluation

### Basic evaluation with `evaluate()`

```python
from langsmith import Client
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = Client()
oai = wrap_openai(OpenAI())
```
```python
# 1. Create dataset
dataset = client.create_dataset("Geography QA")
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {"inputs": {"q": "Capital of France?"}, "outputs": {"a": "Paris"}},
        {"inputs": {"q": "Capital of Germany?"}, "outputs": {"a": "Berlin"}},
    ]
)

# 2. Target function
def target(inputs: dict) -> dict:
    res = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": inputs["q"]}]
    )
    return {"a": res.choices[0].message.content}

# 3. Evaluator
def exact_match(inputs, outputs, reference_outputs):
    return outputs["a"].strip().lower() == reference_outputs["a"].strip().lower()

# 4. Run experiment
results = client.evaluate(
    target,
    data="Geography QA",
    evaluators=[exact_match],
    experiment_prefix="gpt-4o-mini-v1",
    max_concurrency=4
)
```

### LLM-as-judge with openevals

```bash
pip install -U openevals
```

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",
    feedback_key="correctness",
)

results = client.evaluate(target, data="my-dataset", evaluators=[judge])
```

### Evaluation types

| Type | When to use |
|---|---|
| Code/Heuristic | Exact match, format checks, rule-based |
| LLM-as-judge | Subjective quality, safety, reference-free |
| Human | Annotation queues, pairwise comparison |
| Pairwise | Compare two app versions |
| Online | Production traces, real traffic |

## Prompt Hub

```python
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()
```
```python
# Push a prompt
prompt = ChatPromptTemplate([
    ("system", "You are a helpful assistant."),
    ("user", "{question}"),
])
client.push_prompt("my-assistant-prompt", object=prompt)

# Pull and use
prompt = client.pull_prompt("my-assistant-prompt")

# Pull specific version:
prompt = client.pull_prompt("my-assistant-prompt:abc123")
```

## Feedback

```python
from langsmith import Client
import uuid

client = Client()
```
```python
# Custom run ID for later feedback linking
my_run_id = str(uuid.uuid4())
result = chain.invoke({"input": "..."}, {"run_id": my_run_id})

# Attach feedback
client.create_feedback(
    key="correctness",
    score=1,  # 0-1 numeric or categorical
    run_id=my_run_id,
    comment="Accurate and concise"
)
```

## References

- Python SDK Reference — full Client API, `@traceable` signature, `evaluate()`
- TypeScript SDK Reference — Client, traceable, wrappers, evaluate
- CLI Reference — `langsmith` CLI commands
- Official Docs — langchain.com/langsmith
- SDK GitHub — MIT License, v0.7.17
- openevals — Prebuilt LLM evaluators