# torchforge: PyTorch-Native Agentic RL Library

torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while it handles distributed training, inference, and weight sync automatically.
## When to Use torchforge

Choose torchforge when you need:

- Clean separation between RL algorithms and infrastructure
- PyTorch-native abstractions (no Ray dependency)
- Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
- Scalable training with the Monarch actor system
- Integration with TorchTitan for model parallelism
Consider alternatives when:

- You need production-ready stability → use miles or verl
- You want Megatron-native training → use slime

torchforge is experimental and APIs may change.
## Key Features

- **Algorithm isolation**: Implement RL algorithms without touching infrastructure
- **Scalability**: From single GPU to thousands via Monarch
- **Modern stack**: TorchTitan (training), vLLM (inference), TorchStore (sync)
- **Loss functions**: GRPO, DAPO, CISPO, GSPO, SAPO built-in

## Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                           │
│ - Define reward models, loss functions, sampling        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                         │
│ - Episode, Group dataclasses                            │
│ - Service interfaces (async/await)                      │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                          │
│ ├── Trainer (TorchTitan FSDP)                           │
│ ├── Generator (vLLM inference)                          │
│ ├── Reference Model (frozen KL baseline)                │
│ └── Reward Actors (compute rewards)                     │
└─────────────────────────────────────────────────────────┘
```
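An application drives these services through async calls. The sketch below is purely illustrative: the free functions `generate`, `compute_reward`, and `train_step` are hypothetical stand-ins for the real Monarch-backed service handles, but the generator → reward → trainer data flow matches the diagram above.

```python
import asyncio

# Hypothetical stand-ins for the services in the diagram above.
async def generate(prompt: str) -> str:            # Generator (vLLM)
    return prompt + " The answer is 4."

async def compute_reward(response: str) -> float:  # Reward actor
    return 1.0 if "4" in response else 0.0

async def train_step(reward: float) -> float:      # Trainer (TorchTitan)
    return 1.0 - reward                            # stand-in "loss"

async def main() -> None:
    response = await generate("What is 2 + 2?")
    reward = await compute_reward(response)
    loss = await train_step(reward)
    print(f"reward={reward:.1f}, loss={loss:.1f}")

asyncio.run(main())
```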
## Installation

```bash
# Create environment
conda create -n forge python=3.12
conda activate forge

# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh

# Verify
python -c "import torch, forge, vllm; print('OK')"
```

### ROCm Installation

```bash
./scripts/install_rocm.sh
```

## Quick Start

### SFT Training (2+ GPUs)

```bash
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
```

### GRPO Training (3+ GPUs)

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

## Workflow 1: GRPO Training for Math Reasoning

Use this workflow for training reasoning models with group-relative advantages.

### Prerequisites Checklist

- 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator); see the check below
- Model from HuggingFace Hub
- Training dataset (GSM8K, MATH, etc.)
### Step 1: Create Configuration

```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8       # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1          # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
```
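A note on `n_samples`: it sets the group size for the group-relative advantages that GRPO is named after. torchforge computes these internally; the sketch below just illustrates the standard formulation from the DeepSeekMath paper, where each response's advantage is its reward normalized within the prompt's group.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """A_i = (r_i - mean(r)) / (std(r) + eps), normalized within one prompt's group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, n_samples = 8 responses, binary math rewards:
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```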
### Step 2: Define Reward Function

```python
# rewards.py
import re

# Built-in reward functions live in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward


# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the answer from a \boxed{...} span in the response
        match = re.search(r"\\boxed\{([^}]+)\}", response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
```

### Step 3: Launch Training

```bash
python -m apps.grpo.main --config config/grpo_math.yaml
```

### Step 4: Monitor Progress

- Check the W&B dashboard for loss curves
- Verify entropy is decreasing (policy becoming more deterministic); see the sketch below
- Monitor KL divergence (should stay bounded)
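One cheap way to eyeball entropy without extra forward passes is to reuse the sampled-token logprobs you already have: the negative mean logprob of sampled tokens is a Monte Carlo estimate of the policy's entropy. This is a diagnostic sketch, not a torchforge API:

```python
import torch

def sampled_entropy(logprobs: torch.Tensor, padding_mask: torch.Tensor) -> float:
    """Monte Carlo estimate: H(pi) ~= E[-log pi(sampled token)], in nats/token."""
    return float(-(logprobs * padding_mask).sum() / padding_mask.sum())

logprobs = torch.tensor([-0.1, -2.3, -0.5, -0.9])  # per-token log-probs of sampled tokens
mask = torch.ones_like(logprobs)                   # no padding in this toy example
print(f"entropy ~ {sampled_entropy(logprobs, mask):.2f} nats/token")
```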
## Workflow 2: Custom Loss Function

Use this workflow to implement new RL algorithms.

### Step 1: Create Loss Class

```python
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn


class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Compute importance ratio
        ratio = torch.exp(logprobs - ref_logprobs)

        # Clipped policy gradient
        clipped_ratio = torch.clamp(ratio, 1 - self.clip_range, 1 + self.clip_range)
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty
        kl = ref_logprobs - logprobs

        # Apply mask and aggregate
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()
        return loss
```
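Before wiring a new loss into training, a cheap sanity check catches shape and sign bugs: with identical policy and reference logprobs and zero advantages, the loss above should be exactly zero.

```python
import torch

from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

logprobs = torch.randn(2, 16)     # fake per-token log-probs, shape (batch, seq)
advantages = torch.zeros(2, 16)   # zero advantages => zero policy-gradient term
padding_mask = torch.ones(2, 16)

# ratio == 1 and kl == 0 when logprobs == ref_logprobs, so the loss is 0.
loss = loss_fn(logprobs, logprobs.clone(), advantages, padding_mask)
print(loss.item())  # 0.0
```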
### Step 2: Integrate into Application

```python
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In the training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```

## Workflow 3: Multi-GPU Distributed Training

Use this workflow for scaling to multiple GPUs or nodes.
### Configuration for Distributed

```yaml
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2       # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2          # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
```
### Launch with SLURM

```bash
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
```
### Launch Locally (Multi-GPU)

```bash
# 8 GPU setup
python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4
```

## Core API Reference

### Training Batch Format

torchforge uses dictionary-based batches for training:
```python
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]

# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]

# train_step returns the loss as a float
loss = trainer.train_step(inputs, targets)
```

### Completion

Generated output from vLLM:

```python
@dataclass
class Completion:
    text: str                 # Generated text
    token_ids: list[int]      # Token IDs
    logprobs: list[float]     # Log probabilities
    metadata: dict            # Custom metadata
```
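The fields line up one-to-one per generated token, so, for example, the sequence log-probability is just the sum of `logprobs`. A standalone sketch (dataclass redeclared here, token IDs illustrative only):

```python
from dataclasses import dataclass

@dataclass
class Completion:
    text: str                 # Generated text
    token_ids: list[int]      # Token IDs
    logprobs: list[float]     # Log probabilities
    metadata: dict            # Custom metadata

c = Completion(
    text=" The answer is 4.",
    token_ids=[576, 4226, 374, 220, 19, 13],
    logprobs=[-0.4, -0.2, -0.1, -0.3, -0.05, -0.2],
    metadata={"prompt": "What is 2 + 2?"},
)
print(sum(c.logprobs))  # log p(response | prompt) under the generator
```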
### Built-in Loss Functions

Loss functions live in the forge.losses module:

```python
from forge.losses import SimpleGRPOLoss, ReinforceLoss

# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)

# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```

#### ReinforceLoss

```python
from forge.losses.reinforce_loss import ReinforceLoss

# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
```
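For orientation, the GRPO objective as defined in the DeepSeekMath paper is shown below; `beta` corresponds to the KL coefficient $\beta$. SimpleGRPOLoss is a simplified variant, so treat this as background rather than its exact implementation:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}A_i,\ \operatorname{clip}(r_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,A_i\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]
$$

where $r_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the importance ratio and $A_i$ is the group-relative advantage of response $i$.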
## Common Issues and Solutions

### Issue: Not Enough GPUs

**Symptoms**: "Insufficient GPU resources" error

**Solutions**:

```yaml
# Reduce service requirements
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true

# Remove ref_model (uses generator weights),
# or run the reference model on CPU:
ref_model:
  with_gpus: false
```
### Issue: OOM During Generation

**Symptoms**: CUDA OOM in vLLM

**Solutions**:

```yaml
# Reduce batch size
grpo:
  n_samples: 4    # Reduced from 8

# Or reduce sequence length
training:
  seq_len: 2048
```
### Issue: Slow Weight Sync

**Symptoms**: Long pauses between training and generation

**Solutions**:

```bash
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
```

```yaml
# Or reduce sync frequency
training:
  sync_interval: 10    # Sync every 10 steps
```
### Issue: Policy Collapse

**Symptoms**: Entropy drops to zero, reward stops improving

**Solutions**:

```yaml
# Increase KL penalty
grpo:
  beta: 0.2    # Increased from 0.1

# Or add an entropy bonus
training:
  entropy_coef: 0.01
```

## Resources

- Documentation: https://meta-pytorch.org/torchforge
- GitHub: https://github.com/meta-pytorch/torchforge
- Discord: https://discord.gg/YsTYBh6PD9
- TorchTitan: https://github.com/pytorch/torchtitan
- Monarch: https://github.com/meta-pytorch/monarch