# SAELens: Sparse Autoencoders for Mechanistic Interpretability
SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs), a technique for decomposing polysemantic neural network activations into sparse, interpretable features. It builds on Anthropic's research on monosemanticity.
GitHub: jbloomAus/SAELens (1,100+ stars)
## The Problem: Polysemanticity & Superposition
Individual neurons in neural networks are polysemantic - they activate in multiple, semantically distinct contexts. This happens because models use superposition to represent more features than they have neurons, making interpretability difficult.
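Why superposition is possible at all: in high-dimensional spaces there are far more nearly-orthogonal directions than dimensions, so many features can share neurons with little interference. A quick toy demonstration of this (illustrative only, not SAELens code):

```python
import torch

# Toy demonstration: 4096 random "feature" directions in a 512-dim space
# are all nearly orthogonal, so a model can superpose far more features
# than it has neurons.
d_model, n_features = 512, 4096
features = torch.randn(n_features, d_model)
features = features / features.norm(dim=-1, keepdim=True)

cos = features @ features.T
cos.fill_diagonal_(0)
print(f"Max |cosine| between distinct features: {cos.abs().max():.3f}")  # ~0.25, far from 1
```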
SAEs solve this by decomposing dense activations into sparse, monosemantic features - typically only a small number of features activate for any given input, and each feature corresponds to an interpretable concept.
## When to Use SAELens
Use SAELens when you need to:
- Discover interpretable features in model activations
- Understand what concepts a model has learned
- Study superposition and feature geometry
- Perform feature-based steering or ablation
- Analyze safety-relevant features (deception, bias, harmful content)
Consider alternatives when:
- You need basic activation analysis → use TransformerLens directly
- You want causal intervention experiments → use pyvene or TransformerLens
- You need production steering → consider direct activation engineering

## Installation

```bash
pip install sae-lens
```
Requirements: Python 3.10+, `transformer-lens>=2.0.0`
## Core Concepts

### What SAEs Learn
SAEs are trained to reconstruct model activations through a sparse bottleneck:
```
Input Activation → Encoder → Sparse Features → Decoder → Reconstructed Activation
   (d_model)                (d_sae >> d_model)                (d_model)
                                    ↓                             ↓
                             sparsity penalty            reconstruction loss
```
**Loss function:** `MSE(original, reconstructed) + l1_coefficient × L1(features)`
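To make this concrete, here is a minimal sketch of a standard (ReLU + L1) SAE written directly from the definitions above. It is illustrative, not SAELens's internal implementation; subtracting the decoder bias before encoding follows common SAE practice:

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Minimal standard SAE: ReLU encoder, linear decoder, MSE + L1 loss."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor, l1_coefficient: float = 8e-5):
        # Encode: project up to d_sae and rectify -> sparse, non-negative features
        feats = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: linear reconstruction of the original activation
        recon = feats @ self.W_dec + self.b_dec
        # Loss = reconstruction MSE + L1 sparsity penalty on the features
        loss = (recon - x).pow(2).mean() + l1_coefficient * feats.abs().sum(-1).mean()
        return recon, feats, loss
```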
### Key Validation (Anthropic Research)
In "Towards Monosemanticity", human evaluators found 70% of SAE features genuinely interpretable. Features discovered include:
- DNA sequences, legal language, HTTP requests
- Hebrew text, nutrition statements, code syntax
- Sentiment, named entities, grammatical structures

## Workflow 1: Loading and Analyzing Pre-trained SAEs

### Step-by-Step

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

# 1. Load model and pre-trained SAE
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda",
)

# 2. Get model activations
tokens = model.to_tokens("The capital of France is Paris")
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]  # [batch, pos, d_model]

# 3. Encode to SAE features
sae_features = sae.encode(activations)  # [batch, pos, d_sae]
print(f"Active features: {(sae_features > 0).sum()}")

# 4. Find top features for each position
for pos in range(tokens.shape[1]):
    top_features = sae_features[0, pos].topk(5)
    token = model.to_str_tokens(tokens[0, pos:pos+1])[0]
    print(f"Token '{token}': features {top_features.indices.tolist()}")

# 5. Reconstruct activations
reconstructed = sae.decode(sae_features)
reconstruction_error = (activations - reconstructed).norm()
```
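Reconstruction quality can be quantified directly from the tensors above. A sketch computing L0 and explained variance by hand (these are the metrics defined in Workflow 2, not a SAELens API call):

```python
# Sketch: quality metrics computed by hand from the tensors above.
l0 = (sae_features > 0).float().sum(-1).mean()  # avg active features per token
resid_var = (activations - reconstructed).pow(2).sum()
total_var = (activations - activations.mean(dim=(0, 1))).pow(2).sum()
print(f"L0: {l0:.1f}, explained variance: {1 - resid_var / total_var:.3f}")
```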
### Available Pre-trained SAEs

| Release | Model | Layers |
|---------|-------|--------|
| `gpt2-small-res-jb` | GPT-2 Small | Multiple residual streams |
| `gemma-2b-res` | Gemma 2B | Residual streams |
| Various on HuggingFace | Search tag `saelens` | Various |

### Checklist

- Load model with TransformerLens
- Load matching SAE for target layer
- Encode activations to sparse features
- Identify top-activating features per token
- Validate reconstruction quality

## Workflow 2: Training a Custom SAE

### Step-by-Step

```python
from sae_lens import SAE, LanguageModelSAERunnerConfig, SAETrainingRunner

# 1. Configure training
cfg = LanguageModelSAERunnerConfig(
    # Model
    model_name="gpt2-small",
    hook_name="blocks.8.hook_resid_pre",
    hook_layer=8,
    d_in=768,  # model dimension

    # SAE architecture
    architecture="standard",  # or "gated", "topk"
    d_sae=768 * 8,            # expansion factor of 8
    activation_fn="relu",

    # Training
    lr=4e-4,
    l1_coefficient=8e-5,  # sparsity penalty
    l1_warm_up_steps=1000,
    train_batch_size_tokens=4096,
    training_tokens=100_000_000,

    # Data
    dataset_path="monology/pile-uncopyrighted",
    context_size=128,

    # Logging
    log_to_wandb=True,
    wandb_project="sae-training",

    # Checkpointing
    checkpoint_path="checkpoints",
    n_checkpoints=5,
)

# 2. Train
trainer = SAETrainingRunner(cfg)
sae = trainer.run()

# 3. Evaluate
print(f"L0 (avg active features): {trainer.metrics['l0']}")
print(f"CE Loss Recovered: {trainer.metrics['ce_loss_score']}")
```
### Key Hyperparameters

| Parameter | Typical Value | Effect |
|-----------|---------------|--------|
| `d_sae` | 4-16× d_model | More features, higher capacity |
| `l1_coefficient` | 5e-5 to 1e-4 | Higher = sparser, less accurate |
| `lr` | 1e-4 to 1e-3 | Standard optimizer LR |
| `l1_warm_up_steps` | 500-2000 | Prevents early feature death |

### Evaluation Metrics

| Metric | Target | Meaning |
|--------|--------|---------|
| L0 | 50-200 | Average active features per token |
| CE Loss Score | 80-95% | Cross-entropy recovered vs. original |
| Dead Features | <5% | Features that never activate |
| Explained Variance | >90% | Reconstruction quality |

### Checklist

- Choose target layer and hook point
- Set expansion factor (d_sae = 4-16× d_model)
- Tune L1 coefficient for desired sparsity
- Enable L1 warm-up to prevent dead features
- Monitor metrics during training (W&B)
- Validate L0 and CE loss recovery
- Check dead feature ratio

## Workflow 3: Feature Analysis and Steering

### Analyzing Individual Features

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE
import torch

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda",
)

# Find what activates a specific feature
feature_idx = 1234
test_texts = [
    "The scientist conducted an experiment",
    "I love chocolate cake",
    "The code compiles successfully",
    "Paris is beautiful in spring",
]

for text in test_texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    features = sae.encode(cache["resid_pre", 8])
    activation = features[0, :, feature_idx].max().item()
    print(f"{activation:.3f}: {text}")
```
### Feature Steering

```python
def steer_with_feature(model, sae, prompt, feature_idx, strength=5.0):
    """Add an SAE feature direction to the residual stream during generation."""
    tokens = model.to_tokens(prompt)

    # Get the feature direction from the decoder
    feature_direction = sae.W_dec[feature_idx]  # [d_model]

    def steering_hook(activation, hook):
        # Add the scaled feature direction at all positions
        return activation + strength * feature_direction

    # Generate with the steering hook active. HookedTransformer.generate does
    # not accept fwd_hooks directly, so wrap generation in the hooks context.
    with model.hooks(fwd_hooks=[("blocks.8.hook_resid_pre", steering_hook)]):
        output = model.generate(tokens, max_new_tokens=50)

    return model.to_string(output[0])
```
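A usage sketch; `1234` is a placeholder feature index, and `strength` usually needs per-feature tuning:

```python
# Hypothetical usage: feature 1234 is a placeholder, not a known feature.
baseline = model.generate(model.to_tokens("I think that"), max_new_tokens=50)
steered = steer_with_feature(model, sae, "I think that", feature_idx=1234, strength=8.0)
print(model.to_string(baseline[0]))
print(steered)
```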
### Feature Attribution

Which features most affect a specific output?
tokens = model.to_tokens("The capital of France is") _, cache = model.run_with_cache(tokens)
Get features at final position
features = sae.encode(cache["resid_pre", 8])[0, -1] # [d_sae]
Get logit attribution per feature
Feature contribution = feature_activation × decoder_weight × unembedding
W_dec = sae.W_dec # [d_sae, d_model] W_U = model.W_U # [d_model, vocab]
Contribution to "Paris" logit
paris_token = model.to_single_token(" Paris") feature_contributions = features * (W_dec @ W_U[:, paris_token])
top_features = feature_contributions.topk(10) print("Top features for 'Paris' prediction:") for idx, val in zip(top_features.indices, top_features.values): print(f" Feature {idx.item()}: {val.item():.3f}")
## Common Issues & Solutions

### Issue: High dead feature ratio
```python
# WRONG: no warm-up, features die early
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,
    l1_warm_up_steps=0,  # Bad!
)

# RIGHT: warm up the L1 penalty
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=8e-5,
    l1_warm_up_steps=1000,  # gradually increase the penalty
    use_ghost_grads=True,   # revive dead features
)
```
### Issue: Poor reconstruction (low CE recovery)
```python
# Reduce the sparsity penalty and add capacity
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=5e-5,  # lower = better reconstruction
    d_sae=768 * 16,       # more capacity
)
```
### Issue: Features not interpretable
```python
# Increase sparsity (higher L1)
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,  # higher = sparser, more interpretable
)

# Or use the TopK architecture
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn_kwargs={"k": 50},  # exactly 50 active features
)
```
### Issue: Memory errors during training

```python
cfg = LanguageModelSAERunnerConfig(
    train_batch_size_tokens=2048,  # reduce batch size
    store_batch_size_prompts=4,    # fewer prompts in buffer
    n_batches_in_buffer=8,         # smaller activation buffer
)
```
## Integration with Neuronpedia
Browse pre-trained SAE features at neuronpedia.org:
- Features are indexed by SAE ID
- Example: gpt2-small, layer 8, feature 1234 → `neuronpedia.org/gpt2-small/8-res-jb/1234`
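A small helper for building these links from the pattern above (a hypothetical convenience function, not a Neuronpedia or SAELens API):

```python
def neuronpedia_url(model_id: str, sae_id: str, feature_idx: int) -> str:
    """Build a Neuronpedia feature URL from the pattern shown above (hypothetical helper)."""
    return f"https://neuronpedia.org/{model_id}/{sae_id}/{feature_idx}"

print(neuronpedia_url("gpt2-small", "8-res-jb", 1234))
# https://neuronpedia.org/gpt2-small/8-res-jb/1234
```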
## Key Classes Reference

| Class | Purpose |
|-------|---------|
| `SAE` | Sparse autoencoder model |
| `LanguageModelSAERunnerConfig` | Training configuration |
| `SAETrainingRunner` | Training loop manager |
| `ActivationsStore` | Activation collection and batching |
| `HookedSAETransformer` | TransformerLens + SAE integration |
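`HookedSAETransformer` lets you splice SAEs directly into the forward pass. A sketch, assuming the `run_with_cache_with_saes` API and `hook_sae_acts_post` hook name from recent SAELens versions (check the docs for your version):

```python
from sae_lens import SAE, HookedSAETransformer

model = HookedSAETransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda",
)

# Run with the SAE spliced in; the cache then contains SAE feature activations.
logits, cache = model.run_with_cache_with_saes("The capital of France is", saes=[sae])
feature_acts = cache["blocks.8.hook_resid_pre.hook_sae_acts_post"]  # [batch, pos, d_sae]
```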
## Reference Documentation

For detailed API documentation, tutorials, and advanced usage, see the `references/` folder:
| File | Contents |
|------|----------|
| `references/README.md` | Overview and quick start guide |
| `references/api.md` | Complete API reference for SAE, TrainingSAE, configurations |
| `references/tutorials.md` | Step-by-step tutorials for training, analysis, steering |

## External Resources

### Tutorials

- Basic Loading & Analysis
- Training a Sparse Autoencoder
- ARENA SAE Curriculum

### Papers

- Towards Monosemanticity - Anthropic (2023)
- Scaling Monosemanticity - Anthropic (2024)
- Sparse Autoencoders Find Highly Interpretable Features - Cunningham et al. (ICLR 2024)

### Official Documentation

- SAELens Docs
- Neuronpedia - feature browser

## SAE Architectures

| Architecture | Description | Use Case |
|--------------|-------------|----------|
| Standard | ReLU + L1 penalty | General purpose |
| Gated | Learned gating mechanism | Better sparsity control |
| TopK | Exactly K active features | Consistent sparsity |
```python
# TopK SAE (exactly 50 features active)
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn="topk",
    activation_fn_kwargs={"k": 50},
)
```
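For comparison, a gated SAE uses the same config surface; a sketch, assuming `architecture="gated"` is supported as listed in the table above:

```python
# Gated SAE variant: same config surface, different architecture string.
cfg = LanguageModelSAERunnerConfig(
    architecture="gated",
    d_in=768,
    d_sae=768 * 8,
    l1_coefficient=8e-5,  # gated SAEs still use a sparsity penalty
)
```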