PyTorch Development Patterns Idiomatic PyTorch patterns and best practices for building robust, efficient, and reproducible deep learning applications. When to Activate Writing new PyTorch models or training scripts Reviewing deep learning code Debugging training loops or data pipelines Optimizing GPU memory usage or training speed Setting up reproducible experiments Core Principles 1. Device-Agnostic Code Always write code that works on both CPU and GPU without hardcoding devices.

Good: Device-agnostic

device

torch . device ( "cuda" if torch . cuda . is_available ( ) else "cpu" ) model = MyModel ( ) . to ( device ) data = data . to ( device )

Bad: Hardcoded device

model

MyModel ( ) . cuda ( )

Crashes if no GPU

data

data . cuda ( ) 2. Reproducibility First Set all random seeds for reproducible results.

Good: Full reproducibility setup

def set_seed ( seed : int = 42 ) -

None : torch . manual_seed ( seed ) torch . cuda . manual_seed_all ( seed ) np . random . seed ( seed ) random . seed ( seed ) torch . backends . cudnn . deterministic = True torch . backends . cudnn . benchmark = False

Bad: No seed control

model

MyModel ( )

Different weights every run

Explicit Shape Management Always document and verify tensor shapes.

Good: Shape-annotated forward pass

def forward ( self , x : torch . Tensor ) -

torch . Tensor :

x: (batch_size, channels, height, width)

x

self . conv1 ( x )

-> (batch_size, 32, H, W)

x

self . pool ( x )

-> (batch_size, 32, H//2, W//2)

x

x . view ( x . size ( 0 ) , - 1 )

-> (batch_size, 32H//2W//2)

return self . fc ( x )

-> (batch_size, num_classes)

Bad: No shape tracking

def forward ( self , x ) : x = self . conv1 ( x ) x = self . pool ( x ) x = x . view ( x . size ( 0 ) , - 1 )

What size is this?

return self . fc ( x )

Will this even work?

Model Architecture Patterns Clean nn.Module Structure

Good: Well-organized module

class ImageClassifier ( nn . Module ) : def init ( self , num_classes : int , dropout : float = 0.5 ) -

None : super ( ) . init ( ) self . features = nn . Sequential ( nn . Conv2d ( 3 , 64 , kernel_size = 3 , padding = 1 ) , nn . BatchNorm2d ( 64 ) , nn . ReLU ( inplace = True ) , nn . MaxPool2d ( 2 ) , ) self . classifier = nn . Sequential ( nn . Dropout ( dropout ) , nn . Linear ( 64 * 16 * 16 , num_classes ) , ) def forward ( self , x : torch . Tensor ) -

torch . Tensor : x = self . features ( x ) x = x . view ( x . size ( 0 ) , - 1 ) return self . classifier ( x )

Bad: Everything in forward

class ImageClassifier ( nn . Module ) : def init ( self ) : super ( ) . init ( ) def forward ( self , x ) : x = F . conv2d ( x , weight = self . make_weight ( ) )

Creates weight each call!

return x Proper Weight Initialization

Good: Explicit initialization

def _init_weights ( self , module : nn . Module ) -

None : if isinstance ( module , nn . Linear ) : nn . init . kaiming_normal_ ( module . weight , mode = "fan_out" , nonlinearity = "relu" ) if module . bias is not None : nn . init . zeros_ ( module . bias ) elif isinstance ( module , nn . Conv2d ) : nn . init . kaiming_normal_ ( module . weight , mode = "fan_out" , nonlinearity = "relu" ) elif isinstance ( module , nn . BatchNorm2d ) : nn . init . ones_ ( module . weight ) nn . init . zeros_ ( module . bias ) model = MyModel ( ) model . apply ( model . _init_weights ) Training Loop Patterns Standard Training Loop

Good: Complete training loop with best practices

def train_one_epoch ( model : nn . Module , dataloader : DataLoader , optimizer : torch . optim . Optimizer , criterion : nn . Module , device : torch . device , scaler : torch . amp . GradScaler | None = None , ) -

float : model . train ( )

Always set train mode

total_loss

0.0 for batch_idx , ( data , target ) in enumerate ( dataloader ) : data , target = data . to ( device ) , target . to ( device ) optimizer . zero_grad ( set_to_none = True )

More efficient than zero_grad()

Mixed precision training

with torch . amp . autocast ( "cuda" , enabled = scaler is not None ) : output = model ( data ) loss = criterion ( output , target ) if scaler is not None : scaler . scale ( loss ) . backward ( ) scaler . unscale_ ( optimizer ) torch . nn . utils . clip_grad_norm_ ( model . parameters ( ) , max_norm = 1.0 ) scaler . step ( optimizer ) scaler . update ( ) else : loss . backward ( ) torch . nn . utils . clip_grad_norm_ ( model . parameters ( ) , max_norm = 1.0 ) optimizer . step ( ) total_loss += loss . item ( ) return total_loss / len ( dataloader ) Validation Loop

Good: Proper evaluation

@torch . no_grad ( )

More efficient than wrapping in torch.no_grad() block

def evaluate ( model : nn . Module , dataloader : DataLoader , criterion : nn . Module , device : torch . device , ) -

tuple [ float , float ] : model . eval ( )

Always set eval mode — disables dropout, uses running BN stats

total_loss

0.0 correct = 0 total = 0 for data , target in dataloader : data , target = data . to ( device ) , target . to ( device ) output = model ( data ) total_loss += criterion ( output , target ) . item ( ) correct += ( output . argmax ( 1 ) == target ) . sum ( ) . item ( ) total += target . size ( 0 ) return total_loss / len ( dataloader ) , correct / total Data Pipeline Patterns Custom Dataset

Good: Clean Dataset with type hints

class ImageDataset ( Dataset ) : def init ( self , image_dir : str , labels : dict [ str , int ] , transform : transforms . Compose | None = None , ) -

None : self . image_paths = list ( Path ( image_dir ) . glob ( "*.jpg" ) ) self . labels = labels self . transform = transform def len ( self ) -

int : return len ( self . image_paths ) def getitem ( self , idx : int ) -

tuple [ torch . Tensor , int ] : img = Image . open ( self . image_paths [ idx ] ) . convert ( "RGB" ) label = self . labels [ self . image_paths [ idx ] . stem ] if self . transform : img = self . transform ( img ) return img , label Efficient DataLoader Configuration

Good: Optimized DataLoader

dataloader

DataLoader ( dataset , batch_size = 32 , shuffle = True ,

Shuffle for training

num_workers

4 ,

Parallel data loading

pin_memory

True ,

Faster CPU->GPU transfer

persistent_workers

True ,

Keep workers alive between epochs

drop_last

True ,

Consistent batch sizes for BatchNorm

)

Bad: Slow defaults

dataloader

DataLoader ( dataset , batch_size = 32 )

num_workers=0, no pin_memory

Custom Collate for Variable-Length Data

Good: Pad sequences in collate_fn

def collate_fn ( batch : list [ tuple [ torch . Tensor , int ] ] ) -

tuple [ torch . Tensor , torch . Tensor ] : sequences , labels = zip ( * batch )

Pad to max length in batch

padded

nn . utils . rnn . pad_sequence ( sequences , batch_first = True , padding_value = 0 ) return padded , torch . tensor ( labels ) dataloader = DataLoader ( dataset , batch_size = 32 , collate_fn = collate_fn ) Checkpointing Patterns Save and Load Checkpoints

Good: Complete checkpoint with all training state

def save_checkpoint ( model : nn . Module , optimizer : torch . optim . Optimizer , epoch : int , loss : float , path : str , ) -

None : torch . save ( { "epoch" : epoch , "model_state_dict" : model . state_dict ( ) , "optimizer_state_dict" : optimizer . state_dict ( ) , "loss" : loss , } , path ) def load_checkpoint ( path : str , model : nn . Module , optimizer : torch . optim . Optimizer | None = None , ) -

dict : checkpoint = torch . load ( path , map_location = "cpu" , weights_only = True ) model . load_state_dict ( checkpoint [ "model_state_dict" ] ) if optimizer : optimizer . load_state_dict ( checkpoint [ "optimizer_state_dict" ] ) return checkpoint

Bad: Only saving model weights (can't resume training)

torch . save ( model . state_dict ( ) , "model.pt" ) Performance Optimization Mixed Precision Training

Good: AMP with GradScaler

scaler

torch . amp . GradScaler ( "cuda" ) for data , target in dataloader : with torch . amp . autocast ( "cuda" ) : output = model ( data ) loss = criterion ( output , target ) scaler . scale ( loss ) . backward ( ) scaler . step ( optimizer ) scaler . update ( ) optimizer . zero_grad ( set_to_none = True ) Gradient Checkpointing for Large Models

Good: Trade compute for memory

from torch . utils . checkpoint import checkpoint class LargeModel ( nn . Module ) : def forward ( self , x : torch . Tensor ) -

torch . Tensor :

Recompute activations during backward to save memory

x

checkpoint ( self . block1 , x , use_reentrant = False ) x = checkpoint ( self . block2 , x , use_reentrant = False ) return self . head ( x ) torch.compile for Speed

Good: Compile the model for faster execution (PyTorch 2.0+)

model

MyModel ( ) . to ( device ) model = torch . compile ( model , mode = "reduce-overhead" )

Modes: "default" (safe), "reduce-overhead" (faster), "max-autotune" (fastest)

Quick Reference: PyTorch Idioms Idiom Description model.train() / model.eval() Always set mode before train/eval torch.no_grad() Disable gradients for inference optimizer.zero_grad(set_to_none=True) More efficient gradient clearing .to(device) Device-agnostic tensor/model placement torch.amp.autocast Mixed precision for 2x speed pin_memory=True Faster CPU→GPU data transfer torch.compile JIT compilation for speed (2.0+) weights_only=True Secure model loading torch.manual_seed Reproducible experiments gradient_checkpointing Trade compute for memory Anti-Patterns to Avoid

Bad: Forgetting model.eval() during validation

model . train ( ) with torch . no_grad ( ) : output = model ( val_data )

Dropout still active! BatchNorm uses batch stats!

Good: Always set eval mode

model . eval ( ) with torch . no_grad ( ) : output = model ( val_data )

Bad: In-place operations breaking autograd

x

F . relu ( x , inplace = True )

Can break gradient computation

x += residual

In-place add breaks autograd graph

Good: Out-of-place operations

x

F . relu ( x ) x = x + residual

Bad: Moving data to GPU inside the training loop repeatedly

for data , target in dataloader : model = model . cuda ( )

Moves model EVERY iteration!

Good: Move model once before the loop

model

model . to ( device ) for data , target in dataloader : data , target = data . to ( device ) , target . to ( device )

Bad: Using .item() before backward

loss

criterion ( output , target ) . item ( )

Detaches from graph!

loss . backward ( )

Error: can't backprop through .item()

Good: Call .item() only for logging

loss

criterion ( output , target ) loss . backward ( ) print ( f"Loss: { loss . item ( ) : .4f } " )

.item() after backward is fine

Bad: Not using torch.save properly

torch . save ( model , "model.pt" )

Saves entire model (fragile, not portable)

Good: Save state_dict

torch
.
save
(
model
.
state_dict
(
)
,
"model.pt"
)
Remember: PyTorch code should be device-agnostic, reproducible, and memory-conscious. When in doubt, profile with torch.profiler and check GPU memory with torch.cuda.memory_summary() .

安装

Good: Device-agnostic

device

Bad: Hardcoded device

model

Crashes if no GPU

data

Good: Full reproducibility setup

Bad: No seed control

model

Different weights every run

Good: Shape-annotated forward pass

x: (batch_size, channels, height, width)

x

-> (batch_size, 32, H, W)

x

-> (batch_size, 32, H//2, W//2)

x

-> (batch_size, 32H//2W//2)

-> (batch_size, num_classes)

Bad: No shape tracking

What size is this?

Will this even work?

Good: Well-organized module

Bad: Everything in forward

Creates weight each call!

Good: Explicit initialization

Good: Complete training loop with best practices

Always set train mode

total_loss

More efficient than zero_grad()

Mixed precision training

Good: Proper evaluation

More efficient than wrapping in torch.no_grad() block

Always set eval mode — disables dropout, uses running BN stats

total_loss

Good: Clean Dataset with type hints

Good: Optimized DataLoader

dataloader

Shuffle for training

num_workers

Parallel data loading

pin_memory

Faster CPU->GPU transfer

persistent_workers

Keep workers alive between epochs

drop_last

Consistent batch sizes for BatchNorm

Bad: Slow defaults

dataloader

num_workers=0, no pin_memory

Good: Pad sequences in collate_fn

Pad to max length in batch

padded

Good: Complete checkpoint with all training state

Bad: Only saving model weights (can't resume training)

Good: AMP with GradScaler

scaler

Good: Trade compute for memory

Recompute activations during backward to save memory

x

Good: Compile the model for faster execution (PyTorch 2.0+)

model

Modes: "default" (safe), "reduce-overhead" (faster), "max-autotune" (fastest)

Bad: Forgetting model.eval() during validation

Dropout still active! BatchNorm uses batch stats!

Good: Always set eval mode

Bad: In-place operations breaking autograd

x

Can break gradient computation

In-place add breaks autograd graph

Good: Out-of-place operations

x

Bad: Moving data to GPU inside the training loop repeatedly

Moves model EVERY iteration!

Good: Move model once before the loop

model

Bad: Using .item() before backward

loss

Detaches from graph!