nanogpt

Installs: 171
Rank: #5057

Install

npx skills add https://github.com/davila7/claude-code-templates --skill nanogpt

nanoGPT - Minimalist GPT Training

Quick start

nanoGPT is a simplified GPT implementation designed for learning and experimentation.

Installation:

pip install torch numpy transformers datasets tiktoken wandb tqdm

Train on Shakespeare (CPU-friendly):

Prepare data

python data/shakespeare_char/prepare.py

Train (5 minutes on CPU)

python train.py config/train_shakespeare_char.py

Generate text

python sample.py --out_dir=out-shakespeare-char

Output:

ROMEO: What say'st thou? Shall I speak, and be a man?

JULIET: I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.

Common workflows

Workflow 1: Character-level Shakespeare

Complete training pipeline:

Step 1: Prepare data (creates train.bin, val.bin)

python data/shakespeare_char/prepare.py
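train.bin and val.bin are flat arrays of uint16 token ids. During training they are memory-mapped and random windows are sliced out as batches; a minimal sketch of that mechanism (train.py's get_batch works essentially like this):

import numpy as np
import torch

block_size, batch_size = 256, 64

# Memory-map the token stream so the whole file never has to fit in RAM
data = np.memmap('data/shakespeare_char/train.bin', dtype=np.uint16, mode='r')

def get_batch():
    # Random starting offsets, one per sequence in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    # Targets are the same windows shifted one token to the right
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y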

Step 2: Train small model

python train.py config/train_shakespeare_char.py

Step 3: Generate text

python sample.py --out_dir=out-shakespeare-char
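sample.py takes its settings the same way as train.py: module-level defaults that can be overridden with --key=value flags. The most useful ones (defaults shown are from the upstream repo and may drift):

# In sample.py
start = "\n"           # Prompt text; "FILE:prompt.txt" reads the prompt from a file
num_samples = 10       # Number of independent completions to draw
max_new_tokens = 500   # Tokens generated per sample
temperature = 0.8      # <1.0 = more conservative, >1.0 = more diverse
top_k = 200            # Restrict sampling to the 200 most likely tokens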

Config (config/train_shakespeare_char.py):

Model config

n_layer = 6       # 6 transformer layers
n_head = 6        # 6 attention heads
n_embd = 384      # 384-dim embeddings
block_size = 256  # 256-character context

Training config

batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500

Hardware

device = 'cpu'   # Or 'cuda'
compile = False  # Set True for PyTorch 2.0

Training time: ~5 minutes (CPU), ~1 minute (GPU)
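train.py has no argument parser: these settings are module-level globals, and configurator.py first executes the given config file and then applies any --key=value overrides. Individual settings can therefore be tweaked from the command line without editing the config file; the values below are just an example:

python train.py config/train_shakespeare_char.py --device=cuda --compile=True --max_iters=2000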

Workflow 2: Reproduce GPT-2 (124M)

Multi-GPU training on OpenWebText:

Step 1: Prepare OpenWebText (takes ~1 hour)

python data/openwebtext/prepare.py

Step 2: Train GPT-2 124M with DDP (8 GPUs)

torchrun --standalone --nproc_per_node=8 \
    train.py config/train_gpt2.py

Step 3: Sample from trained model

python sample.py --out_dir=out

Config (config/train_gpt2.py):

GPT-2 (124M) architecture

n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0

Training

batch_size = 12
gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000

System

compile = True # PyTorch 2.0

Training time: ~4 days (8× A100)
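The "~0.5M tokens" figure follows from the numbers above: each of the 8 GPUs runs 12 sequences of 1,024 tokens per micro-step, and 5 micro-steps are accumulated per optimizer update. A quick check of the arithmetic:

# tokens per optimizer step = batch_size * block_size * gradient_accumulation_steps
# (the 5 * 8 accumulation setting already folds in the 8 GPUs)
tokens_per_step = 12 * 1024 * (5 * 8)
print(tokens_per_step)  # 491520, i.e. roughly 0.5M tokens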

Workflow 3: Fine-tune pretrained GPT-2

Start from OpenAI checkpoint:

In train.py or config

init_from = 'gpt2' # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl

Model loads OpenAI weights automatically

python train.py config/finetune_shakespeare.py
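The automatic weight loading is done by GPT.from_pretrained in nanoGPT's model.py, which pulls the matching HuggingFace GPT-2 checkpoint (hence the transformers dependency) and copies its weights into the local model class. It can also be called directly, for example to inspect the model:

from model import GPT  # nanoGPT's model.py

# Load OpenAI GPT-2 weights; dropout is the only architecture override accepted here
model = GPT.from_pretrained('gpt2', dict(dropout=0.0))
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")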

Example config (config/finetune_shakespeare.py):

Start from GPT-2

init_from = 'gpt2'

Dataset

dataset = 'shakespeare'  # BPE-tokenized Shakespeare; GPT-2 checkpoints use the BPE vocabulary
batch_size = 1
block_size = 1024

Fine-tuning

learning_rate = 3e-5  # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100

Regularization

weight_decay = 1e-1
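After fine-tuning, sampling works exactly as before, pointed at the run's output directory (the upstream config/finetune_shakespeare.py writes to out-shakespeare; adjust if your config differs):

python sample.py --out_dir=out-shakespeare --start="ROMEO:" --num_samples=3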

Workflow 4: Custom dataset

Train on your own text:

data/custom/prepare.py

import numpy as np

Load your data

with open('my_data.txt', 'r') as f:
    text = f.read()

Create character mappings

chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

Tokenize

data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

Split train/val

n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

Save

train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
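One addition worth making to the script above (mirroring data/shakespeare_char/prepare.py): also save the vocabulary mapping, so train.py can pick up vocab_size and sample.py can decode generated ids back to characters. This snippet continues the same script and reuses chars, stoi, and itos from above:

import pickle

# Save vocab metadata next to the binaries; train.py reads vocab_size from it,
# sample.py uses stoi/itos to encode the prompt and decode the output.
meta = {'vocab_size': len(chars), 'itos': itos, 'stoi': stoi}
with open('data/custom/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)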

Train:

python data/custom/prepare.py
python train.py --dataset=custom
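For anything beyond a one-off run, the overrides are usually collected in a small config file instead. The filename and values below are illustrative, following the pattern of config/train_shakespeare_char.py:

# config/train_custom.py  (hypothetical filename)
out_dir = 'out-custom'
dataset = 'custom'   # train.py will read data/custom/train.bin and val.bin
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256
batch_size = 64
learning_rate = 1e-3
max_iters = 5000

Run it with: python train.py config/train_custom.py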

When to use vs alternatives

Use nanoGPT when:

- Learning how GPT works
- Experimenting with transformer variants
- Teaching/education purposes
- Quick prototyping
- Limited compute (can run on CPU)

Simplicity advantages:

- ~300 lines: Entire model in model.py
- ~300 lines: Training loop in train.py
- Hackable: Easy to modify
- No abstractions: Pure PyTorch

Use alternatives instead:

- HuggingFace Transformers: Production use, many models
- Megatron-LM: Large-scale distributed training
- LitGPT: More architectures, production-ready
- PyTorch Lightning: Need a high-level framework

Common issues

Issue: CUDA out of memory

Reduce batch size or context length:

batch_size = 1    # Reduce from 12
block_size = 512  # Reduce from 1024
gradient_accumulation_steps = 40  # Increase to maintain effective batch

Issue: Training too slow

Enable compilation (PyTorch 2.0+):

compile = True # 2× speedup

Use mixed precision:

dtype = 'bfloat16' # Or 'float16'

Issue: Poor generation quality

Train longer:

max_iters = 10000 # Increase from 5000

Lower temperature:

In sample.py

temperature = 0.7  # Lower from 1.0
top_k = 200        # Add top-k sampling
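What those two knobs do inside the generation loop, as a minimal sketch of the standard temperature + top-k step (nanoGPT's generate() in model.py follows this pattern):

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.7, top_k=200):
    # logits: (batch, vocab_size) scores for the next token
    logits = logits / temperature                  # <1.0 sharpens the distribution
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[..., [-1]]] = -float('inf')  # drop everything outside the top k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # sample one token id per row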

Issue: Can't load GPT-2 weights

Install transformers:

pip install transformers

Check model name:

init_from = 'gpt2' # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl

Advanced topics

Model architecture: See references/architecture.md for GPT block structure, multi-head attention, and MLP layers explained simply.

Training loop: See references/training.md for learning rate schedule, gradient accumulation, and distributed data parallel setup.

Data preparation: See references/data.md for tokenization strategies (character-level vs BPE) and binary format details.
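The practical difference between the two tokenization routes, in code (tiktoken is in the install list above; encode_ordinary, which ignores special tokens, is what nanoGPT's BPE prepare scripts use):

import tiktoken

text = "To be, or not to be, that is the question"

# Character-level: vocabulary built from the corpus itself (tiny vocab, long sequences)
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
char_ids = [stoi[ch] for ch in text]

# BPE: fixed GPT-2 vocabulary of 50,257 tokens (larger vocab, much shorter sequences)
enc = tiktoken.get_encoding("gpt2")
bpe_ids = enc.encode_ordinary(text)

print(len(char_ids), "character tokens vs", len(bpe_ids), "BPE tokens")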

Hardware requirements

Shakespeare (char-level):

- CPU: ~5 minutes
- GPU (T4): ~1 minute
- VRAM: <1 GB

GPT-2 (124M):

- 1× A100: ~1 week
- 8× A100: ~4 days
- VRAM: ~16 GB per GPU

GPT-2 Medium (350M):

- 8× A100: ~2 weeks
- VRAM: ~40 GB per GPU

Performance:

- With compile=True: 2× speedup
- With dtype=bfloat16: 50% memory reduction

Resources

- GitHub: https://github.com/karpathy/nanoGPT (⭐ 48,000+)
- Video: "Let's build GPT" by Andrej Karpathy
- Paper: "Attention Is All You Need" (Vaswani et al.)
- OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
- Educational: Best for understanding transformers from scratch
