- LangSmith - LLM Observability Platform
- Development platform for debugging, evaluating, and monitoring language models and AI applications.
- When to use LangSmith
- Use LangSmith when:
- Debugging LLM application issues (prompts, chains, agents)
- Evaluating model outputs systematically against datasets
- Monitoring production LLM systems
- Building regression testing for AI features
- Analyzing latency, token usage, and costs
- Collaborating on prompt engineering
- Key features:
- Tracing: Capture inputs, outputs, and latency for all LLM calls
- Evaluation: Systematic testing with built-in and custom evaluators
- Datasets: Create test sets from production traces or manually
- Monitoring: Track metrics, errors, and costs in production
- Integrations: Works with OpenAI, Anthropic, LangChain, and LlamaIndex
- Use alternatives instead:
- Weights & Biases: Deep learning experiment tracking, model training
- MLflow: General ML lifecycle, model registry focus
- Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

    pip install langsmith

    # Set environment variables
    export LANGSMITH_API_KEY="your-api-key"
    export LANGSMITH_TRACING=true

Basic tracing with @traceable

    from langsmith import traceable
    from openai import OpenAI

    client = OpenAI()

    @traceable
    def generate_response(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    # Automatically traced to LangSmith
    result = generate_response("What is machine learning?")

OpenAI wrapper (automatic tracing)

    from langsmith.wrappers import wrap_openai
    from openai import OpenAI

    # Wrap client for automatic tracing
    client = wrap_openai(OpenAI())

    # All calls automatically traced
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}]
    )

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

    from langsmith import traceable

    @traceable(run_type="chain")
    def process_query(query: str) -> str:
        # Parent run
        context = retrieve_context(query)  # Child run
        response = generate_answer(query, context)  # Child run
        return response

    @traceable(run_type="retriever")
    def retrieve_context(query: str) -> list:
        return vector_store.search(query)

    @traceable(run_type="llm")
    def generate_answer(query: str, context: list) -> str:
        return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Projects

Projects organize related runs. Set via environment or code:

    import os
    os.environ["LANGSMITH_PROJECT"] = "my-project"

    # Or per-function
    @traceable(project_name="my-project")
    def my_function():
        pass

Client API

    from langsmith import Client

    client = Client()

    # List runs
    runs = list(client.list_runs(
        project_name="my-project",
        filter='eq(status, "success")',
        limit=100
    ))

    # Get run details
    run = client.read_run(run_id="...")

    # Create feedback
    client.create_feedback(
        run_id="...",
        key="correctness",
        score=0.9,
        comment="Good answer"
    )

Datasets and evaluation

Create dataset

    from langsmith import Client

    client = Client()

    # Create dataset
    dataset = client.create_dataset("qa-test-set", description="QA evaluation")

    # Add examples
    client.create_examples(
        inputs=[
            {"question": "What is Python?"},
            {"question": "What is ML?"}
        ],
        outputs=[
            {"answer": "A programming language"},
            {"answer": "Machine learning"}
        ],
        dataset_id=dataset.id
    )

Run evaluation

    from langsmith import evaluate

    def my_model(inputs: dict) -> dict:
        # Your model logic
        return {"answer": generate_answer(inputs["question"])}

    def correctness_evaluator(run, example):
        # Score 1.0 when the reference answer appears in the prediction (case-insensitive)
        prediction = run.outputs["answer"]
        reference = example.outputs["answer"]
        score = 1.0 if reference.lower() in prediction.lower() else 0.0
        return {"key": "correctness", "score": score}

    results = evaluate(
        my_model,
        data="qa-test-set",
        evaluators=[correctness_evaluator],
        experiment_prefix="v1"
    )

    # Assumes the results object exposes aggregate_metrics in your SDK version
    print(f"Average score: {results.aggregate_metrics['correctness']}")

Built-in evaluators

    from langsmith.evaluation import LangChainStringEvaluator

    # Use LangChain evaluators
    results = evaluate(
        my_model,
        data="qa-test-set",
        evaluators=[
            LangChainStringEvaluator("qa"),
            LangChainStringEvaluator("cot_qa")
        ]
    )

Advanced tracing

Tracing context

    from langsmith import tracing_context

    with tracing_context(
        project_name="experiment-1",
        tags=["production", "v2"],
        metadata={"version": "2.0"}
    ):
        # All traceable calls inherit context
        result = my_function()

Manual runs

    from langsmith import trace

    with trace(
        name="custom_operation",
        run_type="tool",
        inputs={"query": "test"}
    ) as run:
        result = do_something()
        run.end(outputs={"result": result})

Process inputs/outputs

    def sanitize_inputs(inputs: dict) -> dict:
        if "password" in inputs:
            inputs["password"] = "***"
        return inputs

    @traceable(process_inputs=sanitize_inputs)
    def login(username: str, password: str):
        return authenticate(username, password)

Sampling

    import os
    os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
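
The `sanitize_inputs` hook under Process inputs/outputs masks only a top-level `password` key and edits the dict it receives. A minimal sketch of a stricter variant follows. It walks nested dicts and lists and returns a copy instead of mutating. The names `SENSITIVE_KEYS` and `redact` are hypothetical and not part of the LangSmith SDK, and the key list is an assumption to adapt to your own payloads.

```python
from typing import Any

# Hypothetical set of key names to mask; adjust to your own payloads.
SENSITIVE_KEYS = {"password", "api_key", "token"}


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive dict entries masked."""
    if isinstance(value, dict):
        return {
            key: "***" if key in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


original = {
    "username": "ada",
    "password": "hunter2",
    "profile": {"token": "abc", "age": 36},
}
cleaned = redact(original)
```

Because `redact` takes a dict and returns a dict, it should be usable in the same position as `sanitize_inputs`, for example `@traceable(process_inputs=redact)`.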

LangChain integration

    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate

    # Tracing enabled automatically with LANGSMITH_TRACING=true
    llm = ChatOpenAI(model="gpt-4o")
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("user", "{input}")
    ])

    chain = prompt | llm

    # All chain runs traced automatically
    response = chain.invoke({"input": "Hello!"})

Production monitoring

Hub prompts

    from langsmith import Client

    client = Client()

    # Pull prompt from hub
    prompt = client.pull_prompt("my-org/qa-prompt")

    # Use in application
    result = prompt.invoke({"question": "What is AI?"})

Async client

    from langsmith import AsyncClient

    async def main():
        client = AsyncClient()

        runs = []
        async for run in client.list_runs(project_name="my-project"):
            runs.append(run)

        return runs

Feedback collection

    from langsmith import Client

    client = Client()

    # Collect user feedback
    def record_feedback(run_id: str, user_rating: int, comment: str = None):
        client.create_feedback(
            run_id=run_id,
            key="user_rating",
            score=user_rating / 5.0,  # Normalize to 0-1
            comment=comment
        )

    # In your application
    record_feedback(run_id="...", user_rating=4, comment="Helpful response")

Testing integration

Pytest integration

    from langsmith import test

    @test
    def test_qa_accuracy():
        result = my_qa_function("What is Python?")
        assert "programming" in result.lower()

Evaluation in CI/CD

    from langsmith import evaluate

    def run_evaluation():
        results = evaluate(
            my_model,
            data="regression-test-set",
            evaluators=[accuracy_evaluator]
        )

        # Fail CI if accuracy drops
        assert results.aggregate_metrics["accuracy"] >= 0.9, \
            f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"

Best practices

- Structured naming: Use consistent project/run naming conventions
- Add metadata: Include version, environment, user info
- Sample in production: Use sampling rate to control volume
- Create datasets: Build test sets from interesting production cases
- Automate evaluation: Run evaluations in CI/CD pipelines
- Monitor costs: Track token usage and latency trends

Common issues

Traces not appearing:

    import os

    # Ensure tracing is enabled
    os.environ["LANGSMITH_TRACING"] = "true"
    os.environ["LANGSMITH_API_KEY"] = "your-key"

    # Verify connection
    from langsmith import Client

    client = Client()
    print(list(client.list_projects()))  # Should work
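
Before calling the API, it can help to confirm that the two variables set above are present in the running process. The sketch below is one way to do that. `tracing_env_problems` and `REQUIRED_VARS` are hypothetical helper names and not part of the SDK, and the check covers only the variable names used in this document.

```python
import os
from typing import Mapping

# Variable names taken from the setup steps in this document.
REQUIRED_VARS = ("LANGSMITH_API_KEY", "LANGSMITH_TRACING")


def tracing_env_problems(env: Mapping[str, str] = os.environ) -> list[str]:
    """List likely configuration problems based on the variables this document sets."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    tracing = env.get("LANGSMITH_TRACING", "")
    if tracing and tracing.lower() != "true":
        problems.append(f"LANGSMITH_TRACING is {tracing!r}, expected 'true'")
    return problems


# Check an explicit mapping; call with no argument to inspect os.environ.
print(tracing_env_problems({"LANGSMITH_TRACING": "false"}))
```

An empty list means both variables look correct, which points the search toward network access or the API key itself.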

High latency from tracing:

    import os

    # Enable background batching (default)
    from langsmith import Client

    client = Client(auto_batch_tracing=True)

    # Or use sampling
    os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"

Large payloads:

    # Hide sensitive/large fields
    @traceable(
        process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
    )
    def my_function(data):
        pass
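
Dropping a field entirely removes it from the trace. When a preview of the value is still useful, an alternative is to shorten long strings instead. The following is a minimal sketch under that assumption. `truncate_large_values` and its `max_chars` parameter are hypothetical names, not SDK functions.

```python
def truncate_large_values(inputs: dict, max_chars: int = 200) -> dict:
    """Return a copy of `inputs` with long string values shortened."""
    trimmed = {}
    for key, value in inputs.items():
        if isinstance(value, str) and len(value) > max_chars:
            omitted = len(value) - max_chars
            trimmed[key] = f"{value[:max_chars]}... [{omitted} chars omitted]"
        else:
            trimmed[key] = value
    return trimmed


payload = {"query": "short", "document": "x" * 1000}
preview = truncate_large_values(payload, max_chars=10)
```

The helper has the same dict-in, dict-out shape as the lambda above, so it could be passed as `process_inputs=truncate_large_values`. Only top-level string values are shortened; nested structures pass through unchanged.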
- References
- Advanced Usage: Custom evaluators, distributed tracing, hub prompts
- Troubleshooting: Common issues, debugging, performance
- Resources
- Documentation: https://docs.smith.langchain.com
- Python SDK: https://github.com/langchain-ai/langsmith-sdk
- Web App: https://smith.langchain.com
- Version: 0.2.0+
- License: MIT