TorchCode — PyTorch Interview Practice
Skill by ara.so, part of the Daily 2026 Skills collection.
TorchCode is a Jupyter-based, self-hosted coding practice environment for ML engineers. It provides 40 curated problems covering PyTorch fundamentals and architectures (softmax, LayerNorm, MultiHeadAttention, GPT-2, etc.) with an automated judge that gives instant pass/fail feedback, gradient verification, and timing — like LeetCode but for tensors.
Installation & Setup
Option 1: Online (zero install)

- Hugging Face Spaces: https://huggingface.co/spaces/duoan/TorchCode
- Google Colab: every notebook has an "Open in Colab" badge.

Option 2: pip (for use inside Colab or an existing environment)

    pip install torch-judge

Option 3: Docker (pre-built image)

    docker run -p 8888:8888 -e PORT=8888 ghcr.io/duoan/torchcode:latest
    # Open http://localhost:8888

Option 4: Build locally

    git clone https://github.com/duoan/TorchCode.git
    cd TorchCode
    make run
    # Open http://localhost:8888
make run auto-detects Docker or Podman and falls back to a local build if the registry image is unavailable (common on Apple Silicon/arm64).

Judge API

The torch_judge package provides the core API used in every notebook.

    from torch_judge import check, status, hint, reset_progress

    # List all 40 problems and your progress
    status()

    # Run tests for a specific problem
    check("relu")
    check("softmax")
    check("layernorm")
    check("attention")
    check("gpt2")

    # Get a hint without spoilers
    hint("softmax")

    # Reset progress for a problem
    reset_progress("relu")

check() return values

- Colored pass/fail per test case
- Correctness check against the PyTorch reference implementation
- Gradient verification (autograd compatibility)
- Timing measurement

Problem Set Overview

Difficulty levels: Easy → Medium → Hard

| #  | Problem               | Key Concepts                               |
|----|-----------------------|--------------------------------------------|
| 1  | ReLU                  | Activation functions, element-wise ops     |
| 2  | Softmax               | Numerical stability, exp/log tricks        |
| 3  | Linear Layer          | y = xW^T + b, Kaiming init, nn.Parameter   |
| 4  | LayerNorm             | Normalization, affine transform            |
| 5  | Self-Attention        | QKV projections, scaled dot-product        |
| 6  | Multi-Head Attention  | Head splitting, concatenation              |
| 7  | BatchNorm             | Batch vs layer statistics, train/eval      |
| 8  | RMSNorm               | LLaMA-style norm                           |
| 16 | Cross-Entropy Loss    | Log-softmax, logsumexp trick               |
| 17 | Dropout               | Train/eval mode, inverted scaling          |
| 18 | Embedding             | Lookup table, weight[indices]              |
| 19 | GELU                  | torch.erf, Gaussian error linear unit      |
| 20 | Kaiming Init          | std = sqrt(2/fan_in)                       |
| 21 | Gradient Clipping     | Norm-based clipping                        |
| 31 | Gradient Accumulation | Micro-batching, loss scaling               |
| 40 | Linear Regression     | Normal equation, GD from scratch           |

Working Through a Problem

Each problem notebook has the same structure:

    templates/
      01_relu.ipynb       # Blank template: your workspace
      02_softmax.ipynb
      ...
    solutions/
      01_relu.ipynb       # Reference solution (study after attempt)

Typical notebook workflow

    # Cell 1: Import judge
    from torch_judge import check, hint
    import torch
    import torch.nn as nn

    # Cell 2: Your implementation
    def my_relu(x: torch.Tensor) -> torch.Tensor:
        # TODO: implement ReLU without using torch.relu or F.relu
        raise NotImplementedError

    # Cell 3: Run the judge
    check("relu")

Real Implementation Examples

ReLU (Problem 1, Easy)

    def my_relu(x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(x, min=0)

    # Alternative: return x * (x > 0)   (note: gives NaN for -inf inputs, since -inf * 0 is NaN)
    # Alternative: return torch.where(x > 0, x, torch.zeros_like(x))
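The judge is described as checking both correctness against a PyTorch reference and autograd compatibility. The sketch below shows one way such a comparison could be done by hand for ReLU. It is not the torch_judge implementation; `compare_with_reference` is a hypothetical helper introduced only for illustration.

```python
import torch


def my_relu(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x, min=0)


def compare_with_reference(fn, ref, x: torch.Tensor) -> bool:
    """Hypothetical helper: compare outputs and input gradients of fn and ref."""
    x_fn = x.clone().requires_grad_(True)
    x_ref = x.clone().requires_grad_(True)
    y_fn, y_ref = fn(x_fn), ref(x_ref)
    # Backpropagate a simple scalar so both inputs receive gradients.
    y_fn.sum().backward()
    y_ref.sum().backward()
    return torch.allclose(y_fn, y_ref) and torch.allclose(x_fn.grad, x_ref.grad)


# Inputs deliberately avoid 0, where subgradient conventions can differ
# between implementations.
x = torch.tensor([-2.0, -0.5, 0.5, 3.0])
ok = compare_with_reference(my_relu, torch.relu, x)
print(ok)
```

Comparing gradients as well as outputs catches implementations that produce the right values but detach from the autograd graph along the way.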

Softmax (Problem 2, Easy, numerically stable)

    def my_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
        # Subtract max for numerical stability (prevents overflow)
        x_max = x.max(dim=dim, keepdim=True).values
        x_shifted = x - x_max
        exp_x = torch.exp(x_shifted)
        return exp_x / exp_x.sum(dim=dim, keepdim=True)

LayerNorm (Problem 4, Medium)

    def my_layer_norm(
        x: torch.Tensor,
        weight: torch.Tensor,  # gamma (scale)
        bias: torch.Tensor,    # beta (shift)
        eps: float = 1e-5,
    ) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + eps)
        return weight * x_norm + bias

RMSNorm (Problem 8, Medium, LLaMA-style)

    def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        rms = torch.sqrt((x ** 2).mean(dim=-1, keepdim=True) + eps)
        return (x / rms) * weight

Scaled Dot-Product Self-Attention (Problem 5, Medium)

    import torch.nn.functional as F
    import math

    def scaled_dot_product_attention(
        Q: torch.Tensor,  # (B, heads, T, head_dim)
        K: torch.Tensor,
        V: torch.Tensor,
        mask: torch.Tensor = None,
    ) -> torch.Tensor:
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Positions where mask == 0 are excluded from attention
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = F.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V)

Multi-Head Attention (Problem 6, Medium)

    class MyMultiHeadAttention(nn.Module):
        def __init__(self, d_model: int, num_heads: int):
            super().__init__()
            assert d_model % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = d_model // num_heads
            self.d_model = d_model
            self.W_q = nn.Linear(d_model, d_model)
            self.W_k = nn.Linear(d_model, d_model)
            self.W_v = nn.Linear(d_model, d_model)
            self.W_o = nn.Linear(d_model, d_model)

        def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
            B, T, C = x.shape

            def split_heads(t):
                # (B, T, d_model) -> (B, heads, T, head_dim)
                return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

            Q = split_heads(self.W_q(x))
            K = split_heads(self.W_k(x))
            V = split_heads(self.W_v(x))
            attn_out = scaled_dot_product_attention(Q, K, V, mask)
            # (B, heads, T, head_dim) -> (B, T, d_model)
            attn_out = attn_out.transpose(1, 2).contiguous().view(B, T, C)
            return self.W_o(attn_out)

Cross-Entropy Loss (Problem 16, Easy)

    def cross_entropy_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (B, C), targets: (B,) with class indices
        # Use logsumexp trick for numerical stability
        log_sum_exp = torch.logsumexp(logits, dim=-1)  # (B,)
        # Logit of the correct class for each sample
        batch_idx = torch.arange(len(targets), device=logits.device)
        target_logits = logits[batch_idx, targets]  # (B,)
        return (log_sum_exp - target_logits).mean()

Dropout (Problem 17, Easy)

    class MyDropout(nn.Module):
        def __init__(self, p: float = 0.5):
            super().__init__()
            self.p = p

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if not self.training or self.p == 0:
                return x
            # Each element is kept with probability 1 - p
            mask = torch.bernoulli(torch.ones_like(x) * (1 - self.p))
            return x * mask / (1 - self.p)  # inverted scaling
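To see what inverted scaling buys, the sketch below runs the dropout module from above in both modes. The tensor size, seed, and p = 0.3 are arbitrary choices for illustration. In training mode the surviving activations are scaled up by 1 / (1 - p), so the mean stays close to the input mean; in eval mode the input passes through unchanged.

```python
import torch
import torch.nn as nn


class MyDropout(nn.Module):
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0:
            return x
        mask = torch.bernoulli(torch.ones_like(x) * (1 - self.p))
        return x * mask / (1 - self.p)


torch.manual_seed(0)  # arbitrary seed, only for repeatability
drop = MyDropout(p=0.3)
x = torch.ones(100_000)

drop.train()
y_train = drop(x)
train_mean = y_train.mean().item()       # close to 1.0 because of inverted scaling
kept_fraction = (y_train > 0).float().mean().item()  # close to 1 - p

drop.eval()
y_eval = drop(x)                         # identity in eval mode

print(f"train mean: {train_mean:.3f}, kept fraction: {kept_fraction:.3f}")
print("eval is identity:", torch.equal(y_eval, x))
```

Because the scaling happens at training time, no rescaling is needed at inference, which is why the eval branch can simply return x.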

Kaiming Init (Problem 20, Easy)

    import math

    def kaiming_init(weight: torch.Tensor) -> torch.Tensor:
        fan_in = weight.size(1)  # weight is (out_features, in_features)
        std = math.sqrt(2.0 / fan_in)
        with torch.no_grad():
            weight.normal_(0, std)
        return weight

Gradient Clipping (Problem 21, Easy)

    def clip_grad_norm(parameters, max_norm: float) -> float:
        params = [p for p in parameters if p.grad is not None]
        # Global L2 norm over all gradients
        total_norm = torch.sqrt(sum(p.grad.data.norm() ** 2 for p in params))
        clip_coef = max_norm / (total_norm + 1e-6)
        if clip_coef < 1:
            for p in params:
                p.grad.data.mul_(clip_coef)
        return total_norm.item()

Gradient Accumulation (Problem 31, Easy)

    # criterion (the loss function) is assumed to be defined in the enclosing scope
    def train_with_accumulation(model, optimizer, dataloader, accumulation_steps=4):
        optimizer.zero_grad()
        for i, (inputs, targets) in enumerate(dataloader):
            outputs = model(inputs)
            loss = criterion(outputs, targets) / accumulation_steps  # scale loss
            loss.backward()  # gradients accumulate across micro-batches
            if (i + 1) % accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()

Common Patterns & Tips

Numerical stability pattern

Always subtract the max before exp():

    # WRONG: can overflow for large values
    exp_x = torch.exp(x)

    # CORRECT: numerically stable
    exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)

Causal attention mask (for GPT-style models)

    def causal_mask(T: int, device) -> torch.Tensor:
        # Lower-triangular (1, 1, T, T) mask; broadcasts over batch and heads
        return torch.tril(torch.ones(T, T, device=device)).unsqueeze(0).unsqueeze(0)

nn.Module skeleton (used in many problems)

    class MyLayer(nn.Module):
        def __init__(self, ...):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(...))
            self.bias = nn.Parameter(torch.zeros(...))
            self._init_weights()

        def _init_weights(self):
            nn.init.kaiming_uniform_(self.weight)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            ...

Train vs eval mode pattern

    def forward(self, x):
        if self.training:
            # use batch statistics
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # update running stats (detached so they carry no autograd history)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.detach()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.detach()
        else:
            # use running statistics
            mean = self.running_mean
            var = self.running_var
        return (x - mean) / torch.sqrt(var + self.eps) * self.weight + self.bias

Project Structure

    TorchCode/
    ├── templates/          # Blank notebooks for each problem (your workspace)
    │   ├── 01_relu.ipynb
    │   ├── 02_softmax.ipynb
    │   └── ...
    ├── solutions/          # Reference solutions (study after attempting)
    │   └── ...
    ├── torch_judge/        # Auto-grading package
    │   ├── __init__.py     # check(), status(), hint(), reset_progress()
    │   └── tasks/          # Per-problem test cases
    ├── Dockerfile
    ├── Makefile
    └── pyproject.toml      # torch-judge package definition

Troubleshooting

Docker image not available for Apple Silicon (arm64)

    # make run auto-falls back to local build, or force it:
    make build
    make start

check() not found in Colab

    !pip install torch-judge
    # then restart runtime
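Before calling check() in a fresh runtime, it can help to confirm that the package is importable at all. The snippet below is a small sketch using only the standard library; `judge_available` is a hypothetical helper name, not part of torch_judge.

```python
import importlib.util


def judge_available(package: str = "torch_judge") -> bool:
    """Hypothetical helper: True if the package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


if judge_available():
    print("torch_judge is importable")
else:
    print("torch_judge not found: run `pip install torch-judge`, then restart the runtime")
```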

Notebook reset to blank template

Use the toolbar "Reset" button in JupyterLab to reset any notebook to its original blank state. This is useful for re-practicing a problem.

Gradient check fails but output is correct

Ensure your implementation uses PyTorch operations (not NumPy) so autograd works:

    # WRONG: breaks autograd
    import numpy as np
    result = np.exp(x.numpy())

    # CORRECT: autograd compatible
    result = torch.exp(x)

Viewing reference solution

After attempting a problem, open the matching file in solutions/:

    solutions/02_softmax.ipynb

Key Concepts Tested

| Concept                   | Problems                                          |
|---------------------------|---------------------------------------------------|
| Numerical stability       | Softmax, Cross-Entropy, LogSumExp                 |
| Autograd / nn.Parameter   | Linear, LayerNorm, all nn.Module problems         |
| Train vs eval behavior    | BatchNorm, Dropout                                |
| Broadcasting              | LayerNorm, RMSNorm, attention masking             |
| Shape manipulation        | Multi-Head Attention (view, transpose, contiguous) |
| Weight initialization     | Kaiming Init, Linear Layer                        |
| Memory-efficient training | Gradient Accumulation, Gradient Clipping          |
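The shape-manipulation row mentions view, transpose, and contiguous together. The sketch below, with arbitrary small dimensions chosen for illustration, shows why the order matters when merging attention heads: transpose returns a non-contiguous tensor, and calling view on it directly raises a RuntimeError.

```python
import torch

# Arbitrary illustrative sizes
B, heads, T, head_dim = 2, 3, 4, 5
attn_out = torch.randn(B, heads, T, head_dim)

# (B, heads, T, head_dim) -> (B, T, heads, head_dim); only strides change
swapped = attn_out.transpose(1, 2)
print("contiguous after transpose:", swapped.is_contiguous())

view_failed = False
try:
    swapped.view(B, T, heads * head_dim)
except RuntimeError:
    # view cannot express this layout without copying the data
    view_failed = True

# contiguous() copies into a fresh layout, after which view works
merged = swapped.contiguous().view(B, T, heads * head_dim)

# reshape performs the copy implicitly when needed and gives the same result
same_as_reshape = torch.equal(merged, swapped.reshape(B, T, heads * head_dim))
print("view failed:", view_failed, "| merged shape:", tuple(merged.shape))
```

Using reshape is a shorter alternative, though spelling out contiguous().view makes the copy explicit, which can be useful when reasoning about memory.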
