OpenClaw-RL Training

Skill by ara.so, Daily 2026 Skills collection.

OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via OpenClaw, intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents.

Architecture Overview

Four independent async loops that never block each other:

1. Agent Serving: OpenClaw-compatible API serving rollouts
2. Rollout Collection: captures multi-turn conversations as training trajectories
3. PRM/Judge Evaluation: scores turns using next-state feedback (majority voting optional)
4. Policy Training: GRPO/OPD/Combine training via slime or Tinker

Installation

    git clone https://github.com/Gen-Verse/OpenClaw-RL
    cd OpenClaw-RL

    # Install core dependencies
    pip install -r requirements.txt

    # Install slime (training backend)
    cd slime && pip install -e . && cd ..

    # Optional: install SGLang for fast inference
    pip install sglang

Project Structure

    OpenClaw-RL/
    ├── openclaw-rl/          # Binary RL (GRPO) method
    ├── openclaw-opd/         # On-Policy Distillation method
    ├── openclaw-combine/     # Combined Binary RL + OPD
    ├── openclaw-test/        # Evaluation utilities
    ├── terminal-rl/          # Track 2: Terminal agent RL
    ├── gui-rl/               # Track 2: GUI agent RL
    ├── swe-rl/               # Track 2: SWE agent RL
    ├── toolcall-rl/          # Track 2: Tool-call agent RL
    ├── slime/                # Core training framework
    └── openclaw/             # Runtime / API server

Three Learning Paradigms

1. Binary RL (GRPO)
   A Process Reward Model scores each turn from next-state feedback. Uses GRPO advantage estimation with a PPO-style clipped surrogate loss.

2. On-Policy Distillation (OPD)
   When the next state reveals useful hindsight, a judge extracts a textual hint to augment the prompt, creating an enhanced teacher. The token-level log-probability gap becomes a directional advantage signal.

3. Combination Method (Recommended)
   Merges the Binary RL scalar supervision with the OPD token-level directional signal. Strongest and most robust optimization.

Quick Start: Personal Agent (Track 1)

Binary RL Launch Script

    # openclaw-rl/run_qwen3_7b_openclaw_rl.sh
    export MODEL_PATH=/path/to/qwen3-7b
    export DATA_PATH=/path/to/conversation/data
    export CKPT_SAVE_DIR=/path/to/checkpoints
    bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh

OPD Launch Script

    export MODEL_PATH=/path/to/qwen3-7b
    export JUDGE_MODEL_PATH=/path/to/judge-model
    export DATA_PATH=/path/to/conversation/data
    bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh

Combination Method (One Line)

    # Launch with combined Binary RL + OPD
    bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh

Configuration: Key Environment Variables

    # Model configuration
    export MODEL_PATH=/path/to/base/model
    export JUDGE_MODEL_PATH=/path/to/judge/model   # For OPD
    export PRM_MODEL_PATH=/path/to/prm/model       # For Binary RL

    # Training configuration
    export CKPT_SAVE_DIR=./checkpoints
    export CKPT_ARGS="--save-interval 100 --save-dir $CKPT_SAVE_DIR"

    # Rollout configuration
    export ROLLOUT_ARGS="--rollout-batch-size 64 --num-rollouts-per-prompt 4"

    # Optimizer configuration
    export OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01 --adam-beta1 0.9 --adam-beta2 0.999"

    # GPU partitioning (e.g., 8 GPUs: 4 for training, 4 for rollout)
    export TRAIN_GPUS="0,1,2,3"
    export ROLLOUT_GPUS="4,5,6,7"

    # LoRA (optional, reduces GPU memory)
    export LORA_ARGS="--lora-rank 64 --lora-alpha 128 --lora-dropout 0.05"

LoRA Training

    # Add LoRA args to any launch script
    export LORA_ARGS="--use-lora --lora-rank 64 --lora-alpha 128"

    # Example: LoRA Binary RL
    bash openclaw-rl/run_qwen3_7b_lora_openclaw_rl.sh

Custom Loss / Rollout Functions (Plugin API)

The slime framework exposes extension points, so custom components can be plugged in without modifying core code:

    # Custom loss function
    --custom-loss-function-path ./my_method/custom_loss.py

    # Custom rollout function
    --rollout-function-path ./my_method/custom_rollout.py

    # Custom generation function
    --custom-generate-function-path ./my_method/custom_generate.py

    # Custom reward model
    --custom-rm-path ./my_method/custom_rm.py

Example Custom Loss (TypeScript-style config, Python implementation)
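The example loss in this section takes an `advantages` tensor as input without showing where it comes from. In GRPO, advantages are commonly computed by normalizing each rollout's reward against the other rollouts sampled for the same prompt. The sketch below illustrates that normalization in plain Python. The function name `group_relative_advantages`, the `eps` parameter, and the choice of population standard deviation are assumptions for illustration; the slime backend may compute advantages differently.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's rollout group (GRPO-style).

    Assumption: `rewards` holds the scalar scores of all rollouts sampled for
    the same prompt, e.g. 4 values when --num-rollouts-per-prompt is 4.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when all rewards are equal
    # eps keeps the division finite when every rollout got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two accepted and two rejected rollouts for one prompt
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts scoring above the group mean get a positive advantage and those below get a negative one. If all rollouts receive the same reward, every advantage is zero and the group contributes no gradient signal.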

    # my_method/custom_loss.py
    from typing import Any, Dict

    import torch


    def compute_loss(
        policy_logits: torch.Tensor,
        reference_logits: torch.Tensor,
        rewards: torch.Tensor,
        advantages: torch.Tensor,
        config: Dict[str, Any],
    ) -> torch.Tensor:
        """Custom GRPO-style loss with clipped surrogate objective.

        Note: the subtraction below is a valid log-ratio only if both inputs
        are log-probabilities of the sampled tokens, not raw vocabulary logits.
        `rewards` is accepted for interface compatibility and is unused here.
        """
        # Log-ratio between policy and reference
        log_ratio = policy_logits - reference_logits
        ratio = torch.exp(log_ratio)
        clip_range = config.get("clip_range", 0.2)

        # PPO-style clipped objective
        clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()

        # KL penalty (mean log-ratio as a simple estimate)
        kl_coeff = config.get("kl_coeff", 0.01)
        kl_penalty = kl_coeff * log_ratio.mean()
        return loss + kl_penalty

Example Custom Reward Model

    # my_method/custom_rm.py
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer


    class CustomPRM:
        def __init__(self, model_path: str):
            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
            self.model = AutoModelForSequenceClassification.from_pretrained(
                model_path, torch_dtype=torch.bfloat16
            )
            self.model.eval()

        def score(self, prompt: str, response: str, next_state: str) -> float:
            """Score a turn given prompt, response, and next-state feedback."""
            combined = f"Prompt: {prompt}\nResponse: {response}\nOutcome: {next_state}"
            inputs = self.tokenizer(
                combined, return_tensors="pt", truncation=True, max_length=2048
            )
            with torch.no_grad():
                logits = self.model(**inputs).logits
            # Binary reward: positive class probability
            return torch.softmax(logits.float(), dim=-1)[0, 1].item()


    def get_reward_model(config):
        return CustomPRM(config["prm_model_path"])

Deploying on Tinker (Cloud)

    # One-line cloud deployment: Hybrid RL, OPD, Binary RL all supported
    export TINKER_API_KEY=$TINKER_API_KEY
    export TINKER_ENDPOINT=$TINKER_ENDPOINT

    # Submit job via Ray
    ray job submit --address $TINKER_ENDPOINT \
      --working-dir . \
      -- bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh

Track 2: General Agentic RL

Terminal Agent RL

    export ENV_TYPE=terminal
    export MAX_STEPS=20
    export PARALLEL_ENVS=32   # Number of parallel environment instances
    bash terminal-rl/run_terminal_rl.sh

GUI Agent RL

    export ENV_TYPE=gui
    export SCREENSHOT_BACKEND=playwright   # or selenium
    export PARALLEL_ENVS=16
    bash gui-rl/run_gui_rl.sh

Tool-Call Agent RL

    export ENV_TYPE=toolcall
    export TOOLS_CONFIG=./toolcall-rl/tools_config.json
    export PARALLEL_ENVS=64
    bash toolcall-rl/run_toolcall_rl.sh

SWE Agent RL

    export ENV_TYPE=swe
    export SWE_BENCH_PATH=/path/to/swe-bench
    export PARALLEL_ENVS=8   # SWE environments are heavier
    bash swe-rl/run_swe_rl.sh

Data Format — Conversation Trajectories

OpenClaw-RL automatically classifies API messages. Manual format for custom data:

    {
      "session_id": "user_session_abc123",
      "turns": [
        {
          "type": "main",
          "prompt": "Help me refactor this function to use async/await",
          "response": "Here's the refactored version: ...",
          "next_state": "User accepted the change and said 'perfect, thanks!'",
          "trainable": true
        },
        {
          "type": "side",
          "prompt": "What is 2+2?",
          "response": "4",
          "trainable": false
        }
      ]
    }
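A session in the JSON layout above can be parsed and filtered down to the turns that should reach the trainer. The following is a minimal sketch using only the standard library. The `Turn` dataclass and the helper names `parse_session` and `trainable_turns` are hypothetical and not part of OpenClaw-RL, and treating a missing `trainable` field as false is an assumption.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Sample session mirroring the manual format shown above
SAMPLE = """
{
  "session_id": "user_session_abc123",
  "turns": [
    {"type": "main", "prompt": "Help me refactor this function to use async/await",
     "response": "Here's the refactored version: ...",
     "next_state": "User accepted the change and said 'perfect, thanks!'",
     "trainable": true},
    {"type": "side", "prompt": "What is 2+2?", "response": "4", "trainable": false}
  ]
}
"""


@dataclass
class Turn:
    turn_type: str                    # "main" or "side"
    prompt: str
    response: str
    trainable: bool
    next_state: Optional[str] = None  # side turns may omit feedback


def parse_session(raw: str) -> List[Turn]:
    """Parse one session document into a list of Turn records."""
    data = json.loads(raw)
    return [
        Turn(
            turn_type=t["type"],
            prompt=t["prompt"],
            response=t["response"],
            trainable=bool(t.get("trainable", False)),  # assumed default
            next_state=t.get("next_state"),
        )
        for t in data["turns"]
    ]


def trainable_turns(turns: List[Turn]) -> List[Turn]:
    """Keep only main turns that are flagged as trainable."""
    return [t for t in turns if t.turn_type == "main" and t.trainable]


turns = parse_session(SAMPLE)
selected = trainable_turns(turns)
```

Here `selected` contains only the refactoring turn, since the arithmetic turn has type "side" and `trainable` set to false.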

- main turns: multi-turn interactions that form training trajectories
- side turns: non-trainable system/utility turns excluded from training

OpenClaw API Server Setup

    # Start OpenClaw-compatible API server wrapping your model
    export BASE_MODEL_PATH=/path/to/your/model
    export OPENCLAW_PORT=8000
    export OPENCLAW_HOST=0.0.0.0

    # Using SGLang backend (recommended for speed)
    # --enable-rl-intercept: enable conversation capture for RL
    # --rl-buffer-dir: where to store captured trajectories
    python -m openclaw.server \
      --model-path $BASE_MODEL_PATH \
      --port $OPENCLAW_PORT \
      --backend sglang \
      --enable-rl-intercept \
      --rl-buffer-dir ./rl_buffer

    // Using the server as OpenAI-compatible API in TypeScript
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "http://localhost:8000/v1",
      apiKey: process.env.OPENCLAW_API_KEY ?? "local",
    });

    const response = await client.chat.completions.create({
      model: "your-model-name",
      messages: [{ role: "user", content: "Help me write a sorting algorithm" }],
      stream: true,
    });

    for await (const chunk of response) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }

Majority Voting for Robust PRM Scoring
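Majority voting calls the judge several times for the same turn and aggregates the verdicts into one reward, which dampens the effect of a single noisy judgment. The sketch below shows one plausible aggregation rule for the `MAJORITY_VOTE_N` and `MAJORITY_VOTE_THRESHOLD` settings in this section. The function `majority_vote` is hypothetical, and interpreting the threshold as the minimum fraction of positive verdicts is an assumption rather than documented behavior.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool], threshold: float = 0.6) -> float:
    """Aggregate N binary judge verdicts into a single binary reward.

    Assumption: the reward is 1.0 when the fraction of positive verdicts
    is at least `threshold`, and 0.0 otherwise.
    """
    if not votes:
        raise ValueError("need at least one judge verdict")
    positive_fraction = sum(votes) / len(votes)
    return 1.0 if positive_fraction >= threshold else 0.0


# Five judge calls for one turn (N = 5): four positive, one negative
reward = majority_vote([True, True, False, True, True], threshold=0.6)
```

With four of five verdicts positive the fraction is 0.8, so the turn is rewarded. With only two of five positive it would not be. Larger N reduces variance at the cost of more judge calls per turn.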

    # Enable majority voting for more robust reward estimation
    export MAJORITY_VOTE_N=5             # Number of judge calls per turn
    export MAJORITY_VOTE_THRESHOLD=0.6

    # Add to your launch script args:
    --majority-vote-n $MAJORITY_VOTE_N \
    --majority-vote-threshold $MAJORITY_VOTE_THRESHOLD

Adding a New Method (Contribution Pattern)

    # 1. Create a new top-level folder
    mkdir my-new-method
    cd my-new-method

    # 2. Required files
    touch README.md                     # Document what, how, env vars
    touch run_qwen3_7b_my_method.sh     # Launch script
    touch custom_loss.py                # If custom loss needed
    touch custom_rollout.py             # If custom rollout needed

run_qwen3_7b_my_method.sh (follow existing conventions):

    #!/bin/bash
    set -e

    MODEL_SIZE="7b"
    MODEL_PATH=${MODEL_PATH:-/path/to/qwen3-7b}
    CKPT_SAVE_DIR=${CKPT_SAVE_DIR:-./checkpoints/my-method}

    CKPT_ARGS="--save-interval 50 --save-dir $CKPT_SAVE_DIR"
    ROLLOUT_ARGS="--rollout-batch-size 32 --num-rollouts-per-prompt 4"
    OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01"

    ray job submit --working-dir .. -- \
      python slime/train.py \
      --model-path $MODEL_PATH \
      --custom-loss-function-path my-new-method/custom_loss.py \
      $CKPT_ARGS $ROLLOUT_ARGS $OPTIMIZER_ARGS

Common Patterns

Monitor Training Progress

    # View Ray dashboard
    ray dashboard   # Opens at http://localhost:8265

    # Watch checkpoint saves
    watch -n 10 ls -la $CKPT_SAVE_DIR

    # Stream training logs
    tail -f ./logs/training.log

Resume from Checkpoint

    export RESUME_CKPT=$CKPT_SAVE_DIR/checkpoint-500

    # Add to launch script:
    --resume-from-checkpoint $RESUME_CKPT

Evaluate Trained Checkpoints

    bash openclaw-test/run_eval.sh \
      --model-path $CKPT_SAVE_DIR/checkpoint-latest \
      --eval-tasks "conversation,coding,tool-use"

Troubleshooting

Out of GPU memory during rollout + training:

    # Use LoRA to reduce memory footprint
    export LORA_ARGS="--use-lora --lora-rank 32"

    # Or reduce parallel environments
    export PARALLEL_ENVS=8

    # Or use offloading
    --offload-optimizer-state

Async loop falling behind (buffer overflow):

    # Reduce rollout batch size or increase judge throughput
    export ROLLOUT_ARGS="--rollout-batch-size 16"

    # Or add more judge workers
    --num-judge-workers 4

PRM scores all near 0.5 (reward collapse):

- Verify that next_state fields contain meaningful feedback signals
- Check that the judge model prompt template matches the expected format
- Try increasing majority vote N: --majority-vote-n 7

SGLang server not starting:

    # Check SGLang version compatibility
    pip install sglang==0.4.x   # Check slime/requirements.txt for pinned version

    # Fallback to vLLM backend
    --backend vllm

Ray job submission fails:

    # Start Ray cluster first
    ray start --head --num-gpus=$(nvidia-smi -L | wc -l)

    # Then submit job
    ray job submit --address auto -- bash run.sh

Key References

- Technical Report (arXiv)
- OpenClaw Plugin
- Slime Training Framework
- Tinker Cloud Platform
- SDFT Paper (integrated in openclaw-opd)
- SDPO Paper (integrated in openclaw-opd)
