- paper2code — Arxiv Paper to Working Implementation
- Skill by
- ara.so
- — Daily 2026 Skills collection.
- paper2code is a Claude Code agent skill that converts any arxiv paper URL into a citation-anchored Python implementation. Every code decision references the exact paper section and equation it implements, and all gaps/ambiguities are explicitly flagged rather than silently filled in.
- Install
- npx skills
- add
- PrathamLearnsToCode/paper2code/skills/paper2code
- During install you'll choose:
- Agents
-
- which coding agents get the skill (e.g., Claude Code)
- Scope
-
- Global (recommended) or project-level
- Method
- Symlink (recommended) or copy Then launch your agent: claude Core Commands Basic usage /paper2code https://arxiv.org/abs/1706.03762 With framework override /paper2code https://arxiv.org/abs/2006.11239 --framework jax /paper2code https://arxiv.org/abs/2006.11239 --framework pytorch # default /paper2code https://arxiv.org/abs/2006.11239 --framework tensorflow With mode flag /paper2code 1706.03762 --mode minimal # architecture only (default) /paper2code 1706.03762 --mode full # includes training loop + data pipeline /paper2code 1706.03762 --mode educational # extra comments + pedagogical notebook Bare arxiv ID (no URL required) /paper2code 1706.03762 /paper2code 2106.09685 Output Structure Every run produces a directory named after the paper slug: attention_is_all_you_need/ ├── README.md # Paper summary + quick-start ├── REPRODUCTION_NOTES.md # Ambiguity audit, unspecified choices, known deviations ├── requirements.txt # Pinned dependencies ├── src/ │ ├── model.py # Architecture — every layer cited to paper section │ ├── loss.py # Loss functions with equation references │ ├── data.py # Dataset skeleton with preprocessing TODOs │ ├── train.py # Training loop (full/educational mode) │ ├── evaluate.py # Metric computation │ └── utils.py # Shared utilities ├── configs/ │ └── base.yaml # All hyperparams — each cited or flagged [UNSPECIFIED] └── notebooks/ └── walkthrough.ipynb # Paper section → code → shape checks Citation Anchoring Convention The core value of paper2code is traceability. Every non-trivial decision is tagged: Tag Meaning §X.Y Directly specified in section X.Y §X.Y, Eq. N Implements equation N from section X.Y [UNSPECIFIED] Paper doesn't state this — choice made with alternatives listed [PARTIALLY_SPECIFIED] Paper mentions it but is ambiguous — quote included [ASSUMPTION] Reasonable inference — reasoning explained [FROM_OFFICIAL_CODE] Taken from authors' official implementation Example — model.py with citation anchors import torch import torch . nn as nn import math class MultiHeadAttention ( nn . Module ) : """§3.2 — Multi-Head Attention Implements Eq. 4: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) """ def init ( self , d_model : int , num_heads : int , dropout : float = 0.1 ) : super ( ) . init ( )
§3.2 — d_model = 512, h = 8 stated in Table 1
assert d_model % num_heads == 0 self . d_k = d_model // num_heads
§3.2 — d_k = d_v = d_model / h = 64
self . num_heads = num_heads self . W_q = nn . Linear ( d_model , d_model ) self . W_k = nn . Linear ( d_model , d_model ) self . W_v = nn . Linear ( d_model , d_model ) self . W_o = nn . Linear ( d_model , d_model )
§3.2, Eq. 4 — W^O projection
[UNSPECIFIED] Dropout rate for attention weights not stated in §3.2
Using 0.1 matching the model-wide dropout (§5.4, Table 3)
self . dropout = nn . Dropout ( dropout ) def forward ( self , q , k , v , mask = None ) : batch_size = q . size ( 0 )
§3.2, Eq. 4 — project into h heads
Q
self . W_q ( q ) . view ( batch_size , - 1 , self . num_heads , self . d_k ) . transpose ( 1 , 2 ) K = self . W_k ( k ) . view ( batch_size , - 1 , self . num_heads , self . d_k ) . transpose ( 1 , 2 ) V = self . W_v ( v ) . view ( batch_size , - 1 , self . num_heads , self . d_k ) . transpose ( 1 , 2 )
§3.2.1, Eq. 1 — Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
scores
torch . matmul ( Q , K . transpose ( - 2 , - 1 ) ) / math . sqrt ( self . d_k ) if mask is not None :
§3.2.3 — decoder masks future positions with -inf before softmax
scores
scores . masked_fill ( mask == 0 , float ( '-inf' ) ) attn_weights = torch . softmax ( scores , dim = - 1 ) attn_weights = self . dropout ( attn_weights ) out = torch . matmul ( attn_weights , V )
(batch, heads, seq, d_k)
out
out . transpose ( 1 , 2 ) . contiguous ( ) . view ( batch_size , - 1 , self . num_heads * self . d_k ) return self . W_o ( out )
§3.2, Eq. 4 — W^O output projection
class TransformerBlock ( nn . Module ) : """§3.1 — Encoder/Decoder layer structure""" def init ( self , d_model : int , num_heads : int , d_ff : int , dropout : float = 0.1 ) : super ( ) . init ( ) self . attention = MultiHeadAttention ( d_model , num_heads , dropout )
[ASSUMPTION] Using pre-norm based on stability; paper Figure 1 shows post-norm
Post-norm: x = LayerNorm(x + sublayer(x)) — §3.1
[PARTIALLY_SPECIFIED] "We apply layer normalization" — position ambiguous
self . norm1 = nn . LayerNorm ( d_model ) self . norm2 = nn . LayerNorm ( d_model )
§3.3 — FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
self . ff = nn . Sequential ( nn . Linear ( d_model , d_ff ) , nn . ReLU ( ) ,
§3.3 — "ReLU activation"
nn . Dropout ( dropout ) , nn . Linear ( d_ff , d_model ) , ) self . dropout = nn . Dropout ( dropout ) def forward ( self , x , mask = None ) :
§3.1 — residual connection around each sub-layer
attn_out
self . attention ( self . norm1 ( x ) , self . norm1 ( x ) , self . norm1 ( x ) , mask ) x = x + self . dropout ( attn_out ) x = x + self . dropout ( self . ff ( self . norm2 ( x ) ) ) return x Example — configs/base.yaml with citations
base.yaml — All hyperparameters for attention_is_all_you_need
Each value is either cited from the paper or flagged [UNSPECIFIED]
model : d_model : 512
§3, Table 1 — "d_model = 512"
num_heads : 8
§3.2, Table 1 — "h = 8"
d_ff : 2048
§3.3, Table 1 — "d_ff = 2048"
num_encoder_layers : 6
§3, Table 1 — "N = 6"
num_decoder_layers : 6
§3, Table 1 — "N = 6"
dropout : 0.1
§5.4, Table 3 — "P_drop = 0.1"
max_seq_len : 512
[UNSPECIFIED] not stated; using 512 (common default)
Alternatives: 256, 1024
training : batch_size : 25000
§5.1 — "each batch ~25,000 source + target tokens"
optimizer : adam
§5.3 — "Adam optimizer"
beta1 : 0.9
§5.3 — "β1 = 0.9"
beta2 : 0.98
§5.3 — "β2 = 0.98"
epsilon : 1.0e-9
§5.3 — "ε = 10^-9"
warmup_steps : 4000
§5.3 — "warmup_steps = 4000"
label_smoothing : 0.1
§5.4 — "ε_ls = 0.1"
Example — REPRODUCTION_NOTES.md structure
Reproduction Notes — Attention Is All You Need
Ambiguity Audit
SPECIFIED (high confidence) | Choice | Value | Source | |
|
|
| | d_model | 512 | §3, Table 1 | | num_heads | 8 | §3.2, Table 1 | | optimizer | Adam β1=0.9, β2=0.98 | §5.3 |
PARTIALLY_SPECIFIED (judgment call made) | Choice | Our Decision | Paper Quote | Alternatives | |
|
|
|
| | Norm position | pre-norm | "layer norm before each sub-layer" (§3.1) conflicts with Figure 1 | post-norm |
UNSPECIFIED (our defaults) | Choice | Our Default | Rationale | Alternatives | |
|
|
|
| | LayerNorm epsilon | 1e-6 | common default | 1e-5, 1e-8 | | max_seq_len | 512 | common for WMT | 256, 1024 |
Known Deviations
data.py provides skeleton only; WMT14 preprocessing not implemented
- No beam search decoding (§5 mentions beam size 4, not fully implemented)
- What paper2code Will NOT Do
- Understanding limits prevents wasted debugging time:
- Won't guarantee correctness
- — matches what the paper describes; if the paper is wrong, the code is wrong
- Won't invent details silently
- — gaps are always
- [UNSPECIFIED]
- , never filled confidently
- Won't download datasets
- —
- data.py
- gives a
- Dataset
- skeleton with instructions
- Won't set up training infrastructure
- — no distributed training, no experiment tracking
- Won't implement baselines
- — only the paper's core contribution
- Won't reimplement standard components
- — imports them or notes the dependency
- Common Patterns
- Pattern 1 — Implement a new architecture paper
- /paper2code https://arxiv.org/abs/2010.11929 --mode minimal
- Focus:
- src/model.py
- will contain the full architecture. Review
- REPRODUCTION_NOTES.md
- to understand every ambiguous choice before running.
- Pattern 2 — Reproduce a training method
- /paper2code https://arxiv.org/abs/2006.11239 --mode full --framework pytorch
- Focus:
- src/train.py
- will contain the full training loop.
- configs/base.yaml
- will list every hyperparameter with paper citations.
- Pattern 3 — Educational deep-dive
- /paper2code 1706.03762 --mode educational
- Focus:
- notebooks/walkthrough.ipynb
- walks through each paper section, shows corresponding code, and runs CPU-safe shape checks.
- Pattern 4 — Quick architecture prototype
- /paper2code 2106.09685 # ViT
- Then inspect and run:
- cd
- vision_transformer/
- pip
- install
- -r
- requirements.txt
- python
- -c
- "
- from src.model import VisionTransformer
- import torch
- model = VisionTransformer() # toy config
- x = torch.randn(2, 3, 224, 224)
- print(model(x).shape)
- "
- Troubleshooting
- Skill not triggering
- Confirm install completed:
- npx skills list
- should show
- paper2code-arxiv-implementation
- Use the explicit trigger:
- /paper2code
- Try bare arxiv ID format:
- /paper2code 1706.03762
- Generated code has import errors
- Run
- pip install -r requirements.txt
- first
- Check
- REPRODUCTION_NOTES.md
- for noted dependencies
- Standard components (e.g., HuggingFace transformers) are imported, not reimplemented — install them separately
- "Paper not found" or fetch errors
- Confirm the arxiv ID exists:
- https://arxiv.org/abs/
- Try the full URL instead of bare ID
- Some very new papers (hours old) may not be indexed yet
- Silent assumptions in generated code
- This should not happen by design — if you find one, it's a bug
- Check
- REPRODUCTION_NOTES.md
- first; the assumption may be documented there
- Report via the repo issues if a gap was genuinely filled silently
- Framework-specific issues
- Default framework is PyTorch — omitting
- --framework
- gives PyTorch output
- JAX output requires
- jax
- ,
- flax
- ,
- optax
- — listed in
- requirements.txt
- TensorFlow output requires
- tensorflow>=2.x
- Contributing
- Add a worked example
- Run:
- /paper2code https://arxiv.org/abs/XXXX.XXXXX
- Save output to
- skills/paper2code/worked/{paper_slug}/
- Write
- review.md
- evaluating correctness, flagged ambiguities, and any mistakes
- Submit PR
- Improve guardrails
- Add patterns where the skill makes silent assumptions to
- guardrails/
- .
- Add domain knowledge
- Papers in your subfield reference common components? Add a knowledge file to
- knowledge/
- (e.g.,
- knowledge/graph_neural_networks.md
- ).
- Resources
- Repo
- :
- https://github.com/PrathamLearnsToCode/paper2code
- Worked examples
- :
- skills/paper2code/worked/
- in the repo
- Issues
- :
- https://github.com/PrathamLearnsToCode/paper2code/issues
- License
- MIT