LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:

- Debugging LLM application issues (prompts, chains, agents)
- Evaluating model outputs systematically against datasets
- Monitoring production LLM systems
- Building regression tests for AI features
- Analyzing latency, token usage, and costs
- Collaborating on prompt engineering

Key features:

- Tracing: Capture inputs, outputs, and latency for all LLM calls
- Evaluation: Systematic testing with built-in and custom evaluators
- Datasets: Create test sets manually or from production traces
- Monitoring: Track metrics, errors, and costs in production
- Integrations: Works with OpenAI, Anthropic, LangChain, and LlamaIndex
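To make the Tracing feature above concrete, the sketch below shows in plain Python what capturing inputs, outputs, and latency for a call amounts to. It is a conceptual illustration only, not LangSmith's implementation: `simple_trace`, `TRACE_LOG`, and `fake_llm` are hypothetical names, and a real tracer would send records to a backend rather than append them to a list.

```python
import functools
import time

# Hypothetical in-memory sink; a real tracer would ship records to a backend.
TRACE_LOG = []

def simple_trace(func):
    """Record inputs, output or error, and latency for each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {"name": func.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        try:
            record["output"] = func(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call leaves a record.
            record["latency_s"] = time.perf_counter() - start
            TRACE_LOG.append(record)
    return wrapper

@simple_trace
def fake_llm(prompt: str) -> str:
    # Stand-in for a model call so the sketch runs offline.
    return prompt.upper()

print(fake_llm("hello"))  # HELLO
print(TRACE_LOG[0]["name"], TRACE_LOG[0]["output"])  # fake_llm HELLO
```

The `@traceable` decorator shown later in this document plays the same role as `simple_trace` here, with the records sent to LangSmith instead of a local list.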

Use alternatives instead:

- Weights & Biases: Deep learning experiment tracking, model training
- MLflow: General ML lifecycle, model registry focus
- Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

```bash
pip install langsmith
```

Set environment variables:

```bash
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
```

Basic tracing with @traceable

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")
```

OpenAI wrapper (automatic tracing)

```python
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

```python
from langsmith import traceable

# `vector_store` and `llm` stand in for your own retriever and model objects.

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")
```

Projects

Projects organize related runs. Set the project through the environment or in code:

```python
import os

os.environ["LANGSMITH_PROJECT"] = "my-project"
```

Or per function:

```python
from langsmith import traceable

@traceable(project_name="my-project")
def my_function():
    pass
```

Client API

```python
from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)
```

Datasets and evaluation

Create dataset

```python
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)
```

Run evaluation

```python
from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    # Score 1.0 when the reference answer appears in the prediction (case-insensitive)
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")
```

Built-in evaluators

```python
from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)
```

Advanced tracing

Tracing context

```python
from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()
```

Manual runs

```python
from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})
```

Process inputs/outputs

```python
from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:
    # Mask the password in a copy of the inputs before they are logged
    sanitized = dict(inputs)
    if "password" in sanitized:
        sanitized["password"] = "***"
    return sanitized

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)
```

Sampling

```python
import os

os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
```
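As a rough illustration of what a 0.1 sampling rate means, the sketch below keeps each trace independently with probability 0.1. This reflects the general idea of rate-based sampling and is an assumption, not a description of the SDK's internal logic; `should_sample` is a hypothetical helper.

```python
import random

def should_sample(rate: float, rng: random.Random) -> bool:
    """Keep a trace with probability `rate` (0.0 keeps none, 1.0 keeps all)."""
    return rng.random() < rate

rng = random.Random(42)  # Seeded so repeated runs give the same count
kept = sum(should_sample(0.1, rng) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")  # Roughly 10% of calls
```

Because the decision is random per trace, low-traffic services may see long stretches with no sampled traces at all, which is worth keeping in mind when picking a rate.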

LangChain integration

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
```

Production monitoring

Hub prompts

```python
from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})
```

Async client

```python
from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs
```

Feedback collection

```python
from typing import Optional

from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: Optional[str] = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )
```
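The `user_rating / 5.0` normalization assumes ratings on a 1-5 scale. A minimal sketch of that arithmetic with a range check added follows; `normalize_rating` is a hypothetical helper, not part of the LangSmith SDK, and the 1-5 scale is an assumption.

```python
def normalize_rating(user_rating: int, max_rating: int = 5) -> float:
    """Map a 1..max_rating star rating onto the 0-1 score range."""
    if not 1 <= user_rating <= max_rating:
        raise ValueError(f"rating must be between 1 and {max_rating}, got {user_rating}")
    return user_rating / max_rating

print(normalize_rating(4))  # 0.8
print(normalize_rating(5))  # 1.0
```

Rejecting out-of-range ratings early keeps scores above 1.0 or below 0 from being stored as feedback.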

In your application:

```python
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
```

Testing integration

Pytest integration

```python
from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()
```

Evaluation in CI/CD

```python
from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"
```

Best practices

- Structured naming: Use consistent project/run naming conventions
- Add metadata: Include version, environment, user info
- Sample in production: Use the sampling rate to control volume
- Create datasets: Build test sets from interesting production cases
- Automate evaluation: Run evaluations in CI/CD pipelines
- Monitor costs: Track token usage and latency trends

Common issues

Traces not appearing:

```python
import os

# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client

client = Client()
print(list(client.list_projects()))  # Should complete without errors
```
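Before contacting the API, it can help to confirm that the environment variables are actually set in the running process. The sketch below is one way to do that with the standard library only; `check_tracing_env` is a hypothetical helper, and it checks just the two variables named above.

```python
import os

def check_tracing_env(env=None) -> list:
    """Return a list of problems found; an empty list means the config looks fine."""
    env = os.environ if env is None else env
    problems = []
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not set to 'true'")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing or empty")
    return problems

for problem in check_tracing_env():
    print("config issue:", problem)
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real environment.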

High latency from tracing:

```python
import os

from langsmith import Client

# Enable background batching (default)
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
```

Large payloads:

```python
from langsmith import traceable

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass
```
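To see what such a `process_inputs` filter produces, the filtering function can be run on a plain dict without any tracing. The sketch below also truncates long strings, which is an addition not in the original; `drop_large_fields`, `MAX_CHARS`, and the field names are hypothetical.

```python
MAX_CHARS = 200  # Hypothetical cap on logged string length

def drop_large_fields(inputs: dict, blocked=("large_field",)) -> dict:
    """Return a copy of `inputs` without blocked keys and with long strings truncated."""
    cleaned = {}
    for key, value in inputs.items():
        if key in blocked:
            continue  # Omit the field from the logged payload entirely
        if isinstance(value, str) and len(value) > MAX_CHARS:
            value = value[:MAX_CHARS] + "...[truncated]"
        cleaned[key] = value
    return cleaned

sample = {"query": "hi", "large_field": "x" * 10_000, "notes": "y" * 500}
logged = drop_large_fields(sample)
print(sorted(logged))        # ['notes', 'query']
print(len(logged["notes"]))  # 214
```

A named function like this could be passed as `process_inputs=drop_large_fields` in place of the lambda, which makes the filter testable on its own. The original dict is left untouched, so the traced function still receives the full data.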

References

- Advanced Usage - Custom evaluators, distributed tracing, hub prompts
- Troubleshooting - Common issues, debugging, performance

Resources

- Documentation: https://docs.smith.langchain.com
- Python SDK: https://github.com/langchain-ai/langsmith-sdk
- Web App: https://smith.langchain.com
- Version: 0.2.0+
- License: MIT
