paper2code — Arxiv Paper to Working Implementation

Skill by ara.so, part of the Daily 2026 Skills collection.

paper2code is a Claude Code agent skill that converts any arxiv paper URL into a citation-anchored Python implementation. Every code decision references the exact paper section and equation it implements, and all gaps/ambiguities are explicitly flagged rather than silently filled in.

Install

```bash
npx skills add PrathamLearnsToCode/paper2code/skills/paper2code
```

During install you'll choose:

- Agents: which coding agents get the skill (e.g., Claude Code)
- Scope: Global (recommended) or project-level
- Method: Symlink (recommended) or copy

Then launch your agent:

```bash
claude
```

Core Commands

Basic usage

```bash
/paper2code https://arxiv.org/abs/1706.03762
```

With framework override

```bash
/paper2code https://arxiv.org/abs/2006.11239 --framework jax
/paper2code https://arxiv.org/abs/2006.11239 --framework pytorch     # default
/paper2code https://arxiv.org/abs/2006.11239 --framework tensorflow
```

With mode flag

```bash
/paper2code 1706.03762 --mode minimal      # architecture only (default)
/paper2code 1706.03762 --mode full         # includes training loop + data pipeline
/paper2code 1706.03762 --mode educational  # extra comments + pedagogical notebook
```

Bare arxiv ID (no URL required)

```bash
/paper2code 1706.03762
/paper2code 2106.09685
```

Output Structure

Every run produces a directory named after the paper slug:

```
attention_is_all_you_need/
├── README.md                  # Paper summary + quick-start
├── REPRODUCTION_NOTES.md      # Ambiguity audit, unspecified choices, known deviations
├── requirements.txt           # Pinned dependencies
├── src/
│   ├── model.py               # Architecture; every layer cited to paper section
│   ├── loss.py                # Loss functions with equation references
│   ├── data.py                # Dataset skeleton with preprocessing TODOs
│   ├── train.py               # Training loop (full/educational mode)
│   ├── evaluate.py            # Metric computation
│   └── utils.py               # Shared utilities
├── configs/
│   └── base.yaml              # All hyperparams, each cited or flagged [UNSPECIFIED]
└── notebooks/
    └── walkthrough.ipynb      # Paper section → code → shape checks
```

Citation Anchoring Convention

The core value of paper2code is traceability. Every non-trivial decision is tagged:

| Tag | Meaning |
|---|---|
| `§X.Y` | Directly specified in section X.Y |
| `§X.Y, Eq. N` | Implements equation N from section X.Y |
| `[UNSPECIFIED]` | Paper doesn't state this; choice made with alternatives listed |
| `[PARTIALLY_SPECIFIED]` | Paper mentions it but is ambiguous; quote included |
| `[ASSUMPTION]` | Reasonable inference; reasoning explained |
| `[FROM_OFFICIAL_CODE]` | Taken from authors' official implementation |

Example: model.py with citation anchors

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """§3.2: Multi-Head Attention

    Implements Eq. 4:
        MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
        where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    """

    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        # §3.2: d_model = 512, h = 8 stated in Table 1
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads  # §3.2: d_k = d_v = d_model / h = 64
        self.num_heads = num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # §3.2, Eq. 4: W^O projection

        # [UNSPECIFIED] Dropout rate for attention weights not stated in §3.2
        # Using 0.1 matching the model-wide dropout (§5.4, Table 3)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # §3.2, Eq. 4: project into h heads
        Q = self.W_q(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # §3.2.1, Eq. 1: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # §3.2.3: decoder masks future positions with -inf before softmax
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        out = torch.matmul(attn_weights, V)  # (batch, heads, seq, d_k)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out)  # §3.2, Eq. 4: W^O output projection


class TransformerBlock(nn.Module):
    """§3.1: Encoder/Decoder layer structure"""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)

        # [ASSUMPTION] Using pre-norm based on stability; paper Figure 1 shows post-norm
        # Post-norm: x = LayerNorm(x + sublayer(x)) (§3.1)
        # [PARTIALLY_SPECIFIED] "We apply layer normalization"; position ambiguous
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # §3.3: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # §3.3: "ReLU activation"
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # §3.1: residual connection around each sub-layer
        normed = self.norm1(x)
        attn_out = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x
```

Example: configs/base.yaml with citations

```yaml
# base.yaml: All hyperparameters for attention_is_all_you_need
# Each value is either cited from the paper or flagged [UNSPECIFIED]

model:
  d_model: 512              # §3, Table 1: "d_model = 512"
  num_heads: 8              # §3.2, Table 1: "h = 8"
  d_ff: 2048                # §3.3, Table 1: "d_ff = 2048"
  num_encoder_layers: 6     # §3, Table 1: "N = 6"
  num_decoder_layers: 6     # §3, Table 1: "N = 6"
  dropout: 0.1              # §5.4, Table 3: "P_drop = 0.1"
  max_seq_len: 512          # [UNSPECIFIED] not stated; using 512 (common default)
                            # Alternatives: 256, 1024

training:
  batch_size: 25000         # §5.1: "each batch ~25,000 source + target tokens"
  optimizer: adam           # §5.3: "Adam optimizer"
  beta1: 0.9                # §5.3: "β1 = 0.9"
  beta2: 0.98               # §5.3: "β2 = 0.98"
  epsilon: 1.0e-9           # §5.3: "ε = 10^-9"
  warmup_steps: 4000        # §5.3: "warmup_steps = 4000"
  label_smoothing: 0.1      # §5.4: "ε_ls = 0.1"
```
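To show how the `d_model` and `warmup_steps` values in this config interact, here is a minimal sketch of the warmup schedule given in §5.3 of the Transformer paper (rate rises linearly for `warmup_steps`, then decays with the inverse square root of the step). The function name `transformer_lr` is hypothetical and is not something paper2code is known to generate.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a 1-based optimizer step, following §5.3 of the paper.

    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Print a few points on either side of the warmup boundary.
    for step in (1, 1000, 4000, 20000, 100000):
        print(f"step {step:>6}: lr = {transformer_lr(step):.6e}")
```

The rate peaks at `step == warmup_steps`, so the two config values together fix the maximum learning rate.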

Example: REPRODUCTION_NOTES.md structure

```markdown
# Reproduction Notes: Attention Is All You Need

## Ambiguity Audit

| Item | Choice | Rationale | Alternatives |
|---|---|---|---|
| LayerNorm epsilon | 1e-6 | common default | 1e-5, 1e-8 |
| max_seq_len | 512 | common for WMT | 256, 1024 |

## Known Deviations

- data.py provides skeleton only; WMT14 preprocessing not implemented
- No beam search decoding (§5 mentions beam size 4, not fully implemented)
```
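One way to cross-check the ambiguity audit against the generated files is to count the bracketed tags they contain. The sketch below is an assumption about how you might do that yourself; `count_tags` and `flagged_lines` are hypothetical helpers, not part of paper2code, and the sample text is made up from the examples above.

```python
import re
from collections import Counter

# Bracketed tags from the Citation Anchoring Convention table.
TAG_PATTERN = re.compile(
    r"\[(UNSPECIFIED|PARTIALLY_SPECIFIED|ASSUMPTION|FROM_OFFICIAL_CODE)\]"
)


def count_tags(source: str) -> Counter:
    """Count bracketed citation tags in a block of generated text."""
    return Counter(TAG_PATTERN.findall(source))


def flagged_lines(source: str):
    """Return (line_number, text) pairs for lines carrying a bracketed tag."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if TAG_PATTERN.search(line)
    ]


# Illustrative sample assembled from the config and model examples.
SAMPLE = '''\
max_seq_len: 512   # [UNSPECIFIED] not stated; using 512 (common default)
dropout: 0.1       # §5.4, Table 3
# [ASSUMPTION] Using pre-norm based on stability
# [UNSPECIFIED] Dropout rate for attention weights not stated in §3.2
'''

if __name__ == "__main__":
    print(count_tags(SAMPLE))
    for number, text in flagged_lines(SAMPLE):
        print(number, text)
```

Every flagged line found this way should have a matching row or note in `REPRODUCTION_NOTES.md`.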
What paper2code Will NOT Do

Understanding limits prevents wasted debugging time:

- Won't guarantee correctness: the code matches what the paper describes; if the paper is wrong, the code is wrong
- Won't invent details silently: gaps are always flagged `[UNSPECIFIED]`, never filled confidently
- Won't download datasets: `data.py` gives a `Dataset` skeleton with instructions
- Won't set up training infrastructure: no distributed training, no experiment tracking
- Won't implement baselines: only the paper's core contribution
- Won't reimplement standard components: imports them or notes the dependency
Common Patterns

Pattern 1: Implement a new architecture paper

```bash
/paper2code https://arxiv.org/abs/2010.11929 --mode minimal
```

Focus: `src/model.py` will contain the full architecture. Review `REPRODUCTION_NOTES.md` to understand every ambiguous choice before running.

Pattern 2: Reproduce a training method

```bash
/paper2code https://arxiv.org/abs/2006.11239 --mode full --framework pytorch
```

Focus: `src/train.py` will contain the full training loop. `configs/base.yaml` will list every hyperparameter with paper citations.

Pattern 3: Educational deep-dive

```bash
/paper2code 1706.03762 --mode educational
```

Focus: `notebooks/walkthrough.ipynb` walks through each paper section, shows corresponding code, and runs CPU-safe shape checks.
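To give a feel for what a shape check verifies, the sketch below traces the shapes through the `MultiHeadAttention.forward` example for self-attention, using plain tuples instead of tensors. It is not the notebook's actual content; `attention_shapes` is a hypothetical helper written for illustration.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def attention_shapes(batch: int, seq_len: int, d_model: int, num_heads: int) -> Dict[str, Shape]:
    """Trace shapes through multi-head self-attention without allocating tensors."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads
    return {
        "input": (batch, seq_len, d_model),
        # Linear projection keeps (batch, seq, d_model); view + transpose splits heads.
        "per_head_qkv": (batch, num_heads, seq_len, d_k),
        # Q @ K^T contracts d_k, leaving one score per query/key pair.
        "scores": (batch, num_heads, seq_len, seq_len),
        # softmax(scores) @ V restores the d_k axis.
        "context": (batch, num_heads, seq_len, d_k),
        # transpose + view merges the heads before the output projection.
        "output": (batch, seq_len, num_heads * d_k),
    }


if __name__ == "__main__":
    for name, shape in attention_shapes(2, 10, 512, 8).items():
        print(f"{name:>12}: {shape}")
```

A tensor-based check would assert the same shapes on real outputs; the invariant to look for is that `output` matches `input`.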
Pattern 4: Quick architecture prototype

```bash
/paper2code 2010.11929  # ViT
```

Then inspect and run:

```bash
cd vision_transformer/
pip install -r requirements.txt
python -c "
from src.model import VisionTransformer
import torch
model = VisionTransformer()  # toy config
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)
"
```
Troubleshooting

Skill not triggering

- Confirm install completed: `npx skills list` should show `paper2code-arxiv-implementation`
- Use the explicit trigger: `/paper2code`
- Try bare arxiv ID format: `/paper2code 1706.03762`

Generated code has import errors

- Run `pip install -r requirements.txt` first
- Check `REPRODUCTION_NOTES.md` for noted dependencies
- Standard components (e.g., HuggingFace transformers) are imported, not reimplemented; install them separately

"Paper not found" or fetch errors

- Confirm the arxiv ID exists: `https://arxiv.org/abs/`
- Try the full URL instead of bare ID
- Some very new papers (hours old) may not be indexed yet
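Before blaming the fetch step, it can help to confirm that the input is a well-formed identifier. The sketch below is a rough check under stated assumptions: `normalize_arxiv_id` is a hypothetical helper, not part of the skill, and it only recognises new-style IDs (four digits, a dot, then four or five digits, with an optional version suffix). It does not contact arxiv.org.

```python
import re
from typing import Optional

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def normalize_arxiv_id(text: str) -> Optional[str]:
    """Return the bare arXiv ID found in a URL or ID string, or None if malformed."""
    candidate = text.strip()
    # Drop an abs/pdf URL prefix and a trailing .pdf, if present.
    candidate = re.sub(r"^https?://arxiv\.org/(abs|pdf)/", "", candidate)
    candidate = re.sub(r"\.pdf$", "", candidate)
    match = _ARXIV_ID.fullmatch(candidate)
    return match.group(1) if match else None


if __name__ == "__main__":
    for raw in ("https://arxiv.org/abs/1706.03762", "2106.09685v2", "not-an-id"):
        print(raw, "->", normalize_arxiv_id(raw))
```

A `None` result points to a typo in the ID rather than an indexing delay.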
Silent assumptions in generated code

- This should not happen by design; if you find one, it's a bug
- Check `REPRODUCTION_NOTES.md` first; the assumption may be documented there
- Report via the repo issues if a gap was genuinely filled silently

Framework-specific issues

- Default framework is PyTorch: omitting `--framework` gives PyTorch output
- JAX output requires `jax`, `flax`, `optax` (listed in `requirements.txt`)
- TensorFlow output requires `tensorflow>=2.x`
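A quick way to see whether the packages for a given framework are importable is to probe for them without importing. This is a sketch only: `FRAMEWORK_MODULES` and `missing_modules` are hypothetical names, the JAX and TensorFlow entries follow the list above, and the PyTorch entry uses `torch` as in the generated `model.py` example. It checks presence only, not version constraints.

```python
from importlib.util import find_spec
from typing import Dict, List

# Top-level import names per framework (illustrative mapping).
FRAMEWORK_MODULES: Dict[str, List[str]] = {
    "pytorch": ["torch"],
    "jax": ["jax", "flax", "optax"],
    "tensorflow": ["tensorflow"],
}


def missing_modules(modules: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    for framework, modules in FRAMEWORK_MODULES.items():
        missing = missing_modules(modules)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{framework}: {status}")
```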
Contributing

Add a worked example

1. Run: `/paper2code https://arxiv.org/abs/XXXX.XXXXX`
2. Save output to `skills/paper2code/worked/{paper_slug}/`
3. Write `review.md` evaluating correctness, flagged ambiguities, and any mistakes
4. Submit PR
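Before opening the PR, you may want to confirm the worked-example directory is complete. The sketch below assumes the file list from the Output Structure section plus `review.md`; it leaves out `src/train.py` and the notebook because they may depend on the mode used. `missing_files` is a hypothetical helper, not a script shipped with the repo.

```python
from pathlib import Path
from typing import List

# Paths relative to the worked-example directory (assumed from Output Structure).
EXPECTED_FILES: List[str] = [
    "README.md",
    "REPRODUCTION_NOTES.md",
    "requirements.txt",
    "src/model.py",
    "src/loss.py",
    "src/data.py",
    "src/evaluate.py",
    "src/utils.py",
    "configs/base.yaml",
    "review.md",
]


def missing_files(example_dir: Path, expected: List[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected relative paths that do not exist under example_dir."""
    return [rel for rel in expected if not (example_dir / rel).is_file()]


if __name__ == "__main__":
    import sys

    # Usage: python check_example.py <path to worked example directory>
    target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for rel in missing_files(target):
        print("missing:", rel)
```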
Improve guardrails

Add patterns where the skill makes silent assumptions to `guardrails/`.

Add domain knowledge

Papers in your subfield reference common components? Add a knowledge file to `knowledge/` (e.g., `knowledge/graph_neural_networks.md`).
Resources

- Repo: https://github.com/PrathamLearnsToCode/paper2code
- Worked examples: `skills/paper2code/worked/` in the repo
- Issues: https://github.com/PrathamLearnsToCode/paper2code/issues
- License: MIT
