# langsmith — LLM Observability, Evaluation & Prompt Management

**Keywords:** `langsmith` · `llm tracing` · `llm evaluation` · `@traceable` · `langsmith evaluate`
LangSmith is a framework-agnostic platform for developing, debugging, and deploying LLM applications. It provides end-to-end tracing, quality evaluation, prompt versioning, and production monitoring.
## When to use this skill

- Add tracing to any LLM pipeline (OpenAI, Anthropic, LangChain, custom models)
- Run offline evaluations with `evaluate()` against a curated dataset
- Set up production monitoring and online evaluation
- Manage and version prompts in the Prompt Hub
- Create datasets for regression testing and benchmarking
- Attach human or automated feedback to traces
- Use LLM-as-judge scoring with `openevals`
- Debug agent failures with end-to-end trace inspection
## Instructions

1. Install the SDK: `pip install -U langsmith` (Python) or `npm install langsmith` (TypeScript)
2. Set environment variables: `LANGSMITH_TRACING=true`, `LANGSMITH_API_KEY=lsv2_...`
3. Instrument with the `@traceable` decorator or the `wrap_openai()` wrapper
4. View traces at smith.langchain.com
5. For evaluation setup, see `references/python-sdk.md`
6. For CLI commands, see `references/cli.md`
7. Run `bash scripts/setup.sh` to auto-configure the environment
**API Key:** Get from smith.langchain.com → Settings → API Keys

**Docs:** https://docs.langchain.com/langsmith
## Quick Start

### Python

```bash
pip install -U langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
export OPENAI_API_KEY="sk-..."
```

```python
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

@traceable
def rag_pipeline(question: str) -> str:
    """Automatically traced in LangSmith"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

result = rag_pipeline("What is LangSmith?")
```

### TypeScript

```bash
npm install langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
```

```typescript
import { traceable } from "langsmith/traceable";
import { wrapOpenAI } from "langsmith/wrappers";
import { OpenAI } from "openai";

const client = wrapOpenAI(new OpenAI());

const pipeline = traceable(
  async (question: string): Promise<string> => {
    const res = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: question }],
    });
    return res.choices[0].message.content ?? "";
  },
  { name: "RAG Pipeline" }
);

await pipeline("What is LangSmith?");
```

## Core Concepts

| Concept | Description |
|---|---|
| Run | Individual operation (LLM call, tool call, retrieval). The fundamental unit. |
| Trace | All runs from a single user request, linked by `trace_id`. |
| Thread | Multiple traces in a conversation, linked by `session_id` or `thread_id`. |
| Project | Container grouping related traces (set via `LANGSMITH_PROJECT`). |
| Dataset | Collection of `{inputs, outputs}` examples for offline evaluation. |
| Experiment | Result set from running `evaluate()` against a dataset. |
| Feedback | Score/label attached to a run — numeric, categorical, or freeform. |

## Tracing

### `@traceable` decorator (Python)

```python
from langsmith import traceable

@traceable(
    run_type="chain",  # llm | chain | tool | retriever | embedding
    name="My Pipeline",
    tags=["production", "v2"],
    metadata={"version": "2.1", "env": "prod"},
    project_name="my-project"
)
def pipeline(question: str) -> str:
    return generate_answer(question)
```

### Selective tracing context

```python
import langsmith as ls

# Enable tracing for this block only
with ls.tracing_context(enabled=True, project_name="debug"):
    result = chain.invoke({"input": "..."})

# Disable tracing despite LANGSMITH_TRACING=true
with ls.tracing_context(enabled=False):
    result = chain.invoke({"input": "..."})
```

### Wrap provider clients

```python
from langsmith.wrappers import wrap_openai, wrap_anthropic
from openai import OpenAI
import anthropic

openai_client = wrap_openai(OpenAI())  # All calls auto-traced
anthropic_client = wrap_anthropic(anthropic.Anthropic())
```

### Distributed tracing (microservices)

```python
from langsmith.run_helpers import get_current_run_tree
import langsmith

@langsmith.traceable
def service_a(inputs):
    rt = get_current_run_tree()
    headers = rt.to_headers()
    # Pass to child service
    return call_service_b(headers=headers)

@langsmith.traceable
def service_b(x, headers):
    with langsmith.tracing_context(parent=headers):
        return process(x)
```

## Evaluation

### Basic evaluation with `evaluate()`

```python
from langsmith import Client
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = Client()
oai = wrap_openai(OpenAI())
```
```python
# 1. Create dataset
dataset = client.create_dataset("Geography QA")
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {"inputs": {"q": "Capital of France?"}, "outputs": {"a": "Paris"}},
        {"inputs": {"q": "Capital of Germany?"}, "outputs": {"a": "Berlin"}},
    ]
)

# 2. Target function
def target(inputs: dict) -> dict:
    res = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": inputs["q"]}]
    )
    return {"a": res.choices[0].message.content}

# 3. Evaluator
def exact_match(inputs, outputs, reference_outputs):
    return outputs["a"].strip().lower() == reference_outputs["a"].strip().lower()

# 4. Run experiment
results = client.evaluate(
    target,
    data="Geography QA",
    evaluators=[exact_match],
    experiment_prefix="gpt-4o-mini-v1",
    max_concurrency=4
)
```

### LLM-as-judge with openevals

```bash
pip install -U openevals
```

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",
    feedback_key="correctness",
)

results = client.evaluate(target, data="my-dataset", evaluators=[judge])
```

### Evaluation types

| Type | When to use |
|---|---|
| Code/Heuristic | Exact match, format checks, rule-based |
| LLM-as-judge | Subjective quality, safety, reference-free |
| Human | Annotation queues, pairwise comparison |
| Pairwise | Compare two app versions |
| Online | Production traces, real traffic |

## Prompt Hub

```python
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()
```
```python
# Push a prompt
prompt = ChatPromptTemplate([
    ("system", "You are a helpful assistant."),
    ("user", "{question}"),
])
client.push_prompt("my-assistant-prompt", object=prompt)

# Pull and use
prompt = client.pull_prompt("my-assistant-prompt")

# Pull specific version:
prompt = client.pull_prompt("my-assistant-prompt:abc123")
```

## Feedback

```python
from langsmith import Client
import uuid

client = Client()
```
```python
# Custom run ID for later feedback linking
my_run_id = str(uuid.uuid4())
result = chain.invoke({"input": "..."}, {"run_id": my_run_id})

# Attach feedback
client.create_feedback(
    key="correctness",
    score=1,  # 0-1 numeric or categorical
    run_id=my_run_id,
    comment="Accurate and concise"
)
```

## References

- Python SDK Reference — full Client API, `@traceable` signature, `evaluate()`
- TypeScript SDK Reference — Client, traceable, wrappers, evaluate
- CLI Reference — `langsmith` CLI commands
- Official Docs — langchain.com/langsmith
- SDK GitHub — MIT License, v0.7.17
- openevals — Prebuilt LLM evaluators