autoresearch "The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies train.py , runs fixed 5-minute GPU experiments, evaluates with a single metric ( val_bpb ), and commits only improvements via git ratcheting. The result: wake up to 100+ experiments logged and a monotonically better model. When to use this skill Setting up autoresearch on a GPU machine for the first time Writing or refining program.md research directives for the agent Launching an overnight autonomous experiment loop Interpreting results.tsv to understand what the agent found Configuring the system for constrained hardware (limited VRAM) Understanding the ratcheting mechanism and git workflow Porting to Apple Silicon (MLX) or Windows RTX Core Architecture Human authors program.md │ ▼ Agent reads program.md + train.py │ ▼ Agent modifies train.py → git commit │ ▼ uv run train.py (exactly 300 seconds) │ ▼ Extract val_bpb + peak_vram_mb │ ┌────┴────┐ improved? no improvement │ │ keep commit git reset HEAD~1 │ │ └──────┬───────┘ │ log to results.tsv │ ▼ repeat ∞ Mutable vs. Immutable Files File Agent access Purpose train.py Read + Write Model, optimizer, training loop (~630 lines) program.md Read-only Human research directives prepare.py Read-only Data pipeline + evaluate_bpb() harness constants.py Read-only TIME_BUDGET=300 , MAX_SEQ_LEN , EVAL_TOKENS pyproject.toml Read-only Locked dependencies (no new packages) results.tsv Append All experiments: kept and discarded Instructions Step 1: Install Prerequisites
Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone the repository
git clone https://github.com/karpathy/autoresearch cd autoresearch
Install locked dependencies
uv sync Step 2: Prepare Data (One-Time, ~2 Minutes)
Downloads FineWeb-Edu parquet shards, trains BPE tokenizer
Last shard is reserved for validation — never seen during training
uv run prepare.py For constrained hardware, edit prepare.py before running:
Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN
256
default: 2048
Step 3: Run a Baseline Experiment
Single 5-minute experiment to verify setup
uv run train.py
run.log 2
&1
Extract key metrics
grep "^val_bpb:|^peak_vram_mb:" run.log Expected output: val_bpb: 0.9979 peak_vram_mb: 38420 Step 4: Author program.md program.md is the human-written research charter the agent reads at the start of every loop iteration. Write it as precise Markdown instructions:
Research Program
Goal Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.
Current Baseline val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)
Directions to Explore 1. Attention variants: MLA, GQA, sliding window, local-global hybrid 2. Layer types: MoE FFN layers, SwiGLU activations 3. Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule 4. Architectural depth/width tradeoffs within VRAM budget
Constraints
Must complete within 300 seconds
Peak VRAM must stay under 39GB
No new packages (use only what is in pyproject.toml)
Do not modify prepare.py or constants.py
Notes from Previous Runs
Depth-12 improvements transfer to depth-24 (scale-invariant gains)
RoPE positional encoding outperformed learned embeddings (+0.008 val_bpb) Effective program.md principles: Be specific about what to explore — vague directives waste experiments Record what has already been tried (prevents redundant experiments) Note hardware constraints explicitly Use the current best val_bpb as a reference point Step 5: Run the Autonomous Agent Loop Point your AI agent (Claude Code, Codex, etc.) at the repository with program.md as its research context. The agent will: Read program.md + current train.py Hypothesize an improvement Modify train.py + commit Execute uv run train.py (300 seconds) Extract val_bpb ; keep or revert via git Append to results.tsv Repeat With Claude Code (OMC):
From inside autoresearch/
Give Claude the context: "Run the autoresearch loop following program.md"
With Claude Code CLI directly: claude "Follow program.md. Run autonomous research loop on train.py. Execute: uv run train.py, extract val_bpb, keep improvements, revert failures. Log everything to results.tsv. Do not stop until I say so." Step 6: Monitor Results
Live monitoring during a run
watch -n 30 "tail -20 results.tsv"
Count kept vs. discarded
awk -F '\t' '{print $4}' results.tsv | sort | uniq -c
Find the best experiment
sort -t $' \t ' -k2 -n results.tsv | head -5
Check current best val_bpb
git log --oneline -5 Step 7: Interpret results.tsv commit val_bpb memory_gb status description a3f2c91 0.9697 37.2 keep SwiGLU activation + depth-12 b8e1d04 0.9821 38.1 discard MoE 4-expert: marginal gain c1a5f30 crash — crash OOM: sequence length 4096 Status Meaning keep val_bpb improved; commit retained on branch discard No improvement; git reset HEAD~1 applied crash OOM, syntax error, or timeout; always reverted Examples Example 1: Overnight Run Summary Session summary: 126 experiments, 18 improvements Best val_bpb: 0.9697 (started: 0.9979) Top improvements: - SwiGLU activation: -0.012 val_bpb - GQA with 4 KV heads: -0.009 val_bpb - Muon momentum 0.92→0.95: -0.006 val_bpb Example 2: Low-VRAM Configuration (6GB GPU)
In prepare.py — edit before uv run prepare.py
MAX_SEQ_LEN
256
was 2048
EVAL_TOKENS
2_097_152
was 20_971_520 (scale down proportionally)
Example 3: Extract Experiments by Category
Find all attention-related experiments
grep -i "attention|GQA|MLA|MHA" results.tsv
List only improvements sorted by gain
awk -F '\t' '$4=="keep"' results.tsv | sort -t $' \t ' -k2 -n Available scripts Run from inside the autoresearch repository directory: Script Purpose Usage setup.sh One-time environment setup bash scripts/setup.sh [--seq-len 512] run-experiment.sh Single 5-min experiment + metric extraction bash scripts/run-experiment.sh run-loop.sh Autonomous loop: run → keep/revert → repeat bash scripts/run-loop.sh [--max 20] show-results.sh Human-readable results.tsv report bash scripts/show-results.sh [--top 10] check-hardware.sh GPU/CUDA/uv availability check (JSON output) bash scripts/check-hardware.sh
Typical overnight session
bash scripts/check-hardware.sh bash scripts/setup.sh --seq-len 512
adjust for your VRAM
Edit program.md with your research directives
bash scripts/run-loop.sh --max 100 --desc "session-1" bash scripts/show-results.sh --kept-only References Detailed documentation in references/ : File Contents references/architecture.md System design, immutability contract, git ratcheting, key design decisions references/program-md-guide.md How to write effective program.md directives; full template + principles references/hardware-config.md VRAM settings by GPU, memory optimization techniques, troubleshooting Best practices Write program.md before running — the agent is only as good as its directives; vague programs waste compute Start with the baseline first — always uv run train.py manually before launching the loop to confirm the setup works Keep MAX_SEQ_LEN in prepare.py consistent — changing it mid-run invalidates val_bpb comparisons Never modify prepare.py or constants.py — the evaluation harness must stay fixed for results to be meaningful Scale improvements before committing — test that a depth-12 improvement also holds at depth-24 before treating it as a fundamental gain Commit program.md updates — version-control your research directives alongside results.tsv for reproducibility Monitor VRAM — add peak_vram_mb constraints in program.md for your GPU's headroom No new dependencies — the agent cannot pip install ; it can only use what is in pyproject.toml Hardware Requirements Hardware Status Notes H100 80GB Recommended Default config, full MAX_SEQ_LEN=2048 A100 40GB Supported Lower MAX_SEQ_LEN if needed RTX 4090 24GB Community Reduce MAX_SEQ_LEN to 512 GTX 1660 Ti 6GB Community fork MAX_SEQ_LEN=256, reduced EVAL_TOKENS Apple Silicon (M-series) MLX port Community fork; different optimizer API Windows RTX Community WSL2 + CUDA recommended Key Metrics Reference Metric Direction Description val_bpb Lower = better Validation bits-per-byte; vocabulary-size-independent peak_vram_mb Lower = more headroom Peak GPU memory during the training run Experiments/hour Higher = faster search ~12 at TIME_BUDGET=300 References GitHub — karpathy/autoresearch nanochat — the underlying LLM training framework Karpathy's original announcement (X/Twitter) DeepWiki — autoresearch architecture MIT License