# NOWAIT Reasoning Optimizer
Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).
## Overview
NOWAIT is a training-free inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by 27-51% without compromising model utility.
## When to Use

- Deploying R1-style reasoning models with limited compute
- Reducing inference latency for production systems
- Optimizing token costs for reasoning tasks
- Working with verbose CoT outputs that need streamlining

## Supported Models

| Model Series | Type | Token Reduction |
|---|---|---|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |
> **Important:** NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.
## Quick Start

### 1. Basic Implementation

```python
from scripts.nowait_processor import NOWAITLogitProcessor

# Initialize processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)

# Use during generation
outputs = model.generate(
    inputs,
    logits_processor=[processor],
    max_new_tokens=32768,
)
```
## Keywords Suppressed

See `references/keywords.md` for the complete list. Core keywords:

`wait`, `alternatively`, `hmm`, `but`, `however`, `check`, `double-check`, `maybe`, `verify`, `again`, `oh`, `ah`
## How It Works

1. **Initialize Keywords**: Identify reflection keywords from empirical analysis
2. **Expand to Token Variants**: Map each keyword to all of its token variants in the vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
3. **Suppress During Inference**: Set the logits of reflection tokens to large negative values during decoding

| Token | Logit (Before) | Logit (After) |
|---|---|---|
| Wait | 0.8 | -inf |
| First | 0.6 | 0.6 |
| Hmm | 0.5 | -inf |
| Let | 0.4 | 0.4 |
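The expand-and-suppress steps above can be sketched in plain Python. This is a minimal illustration, not the repo's `NOWAITLogitProcessor`; the toy vocabulary and the names `expand_keyword_to_token_ids` and `suppress` are hypothetical, chosen for this example only.

```python
import math

def expand_keyword_to_token_ids(vocab, keyword):
    """Map a keyword to every vocabulary token that is a surface variant
    of it, e.g. "wait" -> " wait", "Wait", "WAIT", ".wait"."""
    keyword = keyword.lower()
    return [
        token_id
        for token, token_id in vocab.items()
        # Strip common leading markers (space, dot, BPE prefixes) before comparing.
        if token.lstrip(" .Ġ▁").lower() == keyword
    ]

def suppress(logits, banned_ids, neg=-math.inf):
    """Return a copy of logits with banned token ids set to -inf,
    so those tokens can never be sampled."""
    banned = set(banned_ids)
    return [neg if i in banned else x for i, x in enumerate(logits)]

# Toy vocabulary and logits mirroring the table above.
vocab = {"Wait": 0, "First": 1, "Hmm": 2, "Let": 3, " wait": 4}
banned = (expand_keyword_to_token_ids(vocab, "wait")
          + expand_keyword_to_token_ids(vocab, "hmm"))
logits = [0.8, 0.6, 0.5, 0.4, 0.3]
print(suppress(logits, banned))  # [-inf, 0.6, -inf, 0.4, -inf]
```

In a real decoder this masking runs once per decoding step, on the logits tensor, before sampling.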
## Key Findings

### Why It Works

- NOWAIT does not eliminate self-reflection entirely; it guides models to skip unnecessary "waiting" reasoning
- Models still perform essential verification at key decision points
- Results in more linear, straightforward reasoning paths

### RL vs Distilled Models

| Model Type | NOWAIT Effect | Recommendation |
|---|---|---|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |
Distilled models rely heavily on the CoT structure of their training data; removing reflection tokens disrupts their learned reasoning patterns.
## Integration Examples

### HuggingFace Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")

processor = NOWAITLogitProcessor(tokenizer)

response = model.generate(
    tokenizer(prompt, return_tensors="pt").input_ids,
    logits_processor=[processor],
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7,
)
```
### vLLM

```python
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids

llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

sampling_params = SamplingParams(
    max_tokens=32768,
    bad_words_ids=bad_words_ids,
)
```
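For context, a bad-words list in this convention is a list of token-id sequences, one per banned surface variant. The sketch below shows how such a list could be assembled; the repo's actual logic lives in `scripts/nowait_processor.py`, and both `build_bad_words_ids` and the toy tokenizer here are illustrative stand-ins.

```python
def build_bad_words_ids(tokenizer, keywords):
    """Encode each surface variant of each keyword as its own token-id sequence."""
    bad_words_ids = []
    for kw in keywords:
        # Cover common casings plus leading-space variants.
        variants = {kw, kw.capitalize(), kw.upper(), " " + kw, " " + kw.capitalize()}
        for v in sorted(variants):
            ids = tokenizer.encode(v, add_special_tokens=False)
            if ids:
                bad_words_ids.append(ids)
    return bad_words_ids

# Toy character-level "tokenizer" just to demonstrate the output shape.
class ToyTokenizer:
    def encode(self, text, add_special_tokens=False):
        return [ord(c) for c in text]

ids = build_bad_words_ids(ToyTokenizer(), ["wait"])
print(len(ids))  # 5 variants: " Wait", " wait", "WAIT", "Wait", "wait"
```

With a real tokenizer, each variant may encode to one or several subword ids; the engine blocks any continuation that would complete one of these sequences.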
## Expected Results

| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|---|---|---|---|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |

## Limitations

- Less effective on very simple problems where CoT overhead is already minimal
- Distilled models may suffer accuracy loss on challenging tasks
- Some domains may require model-specific keyword tuning

## References

- Paper: arXiv:2506.08343v2
- Complete keyword list: `references/keywords.md`
- Implementation: `scripts/nowait_processor.py`