- Prompt Repetition
- Problem Being Solved
- LLMs are trained as
- Causal Language Models
- , where each token attends only to
- previous tokens
- . This leads to:
- Context-Question Problem
-
- The question is unknown when processing context
- Options-First MCQ Problem
-
- Cannot fully understand the question context when viewing answer choices
- Position/Index Problem
-
- Attention weights weaken for specific position information in long lists
- Prompt repetition
- enables the second pass to reference the entire first pass, effectively
- mimicking some benefits of bidirectional attention
- .
- When to use this skill
- When using lightweight models
-
- claude-haiku, gemini-flash, gpt-4o-mini, etc.
- Options-First MCQ
-
- Multiple choice where answer choices appear before the question
- Context + Question
-
- Searching for specific information in long contexts
- Index/Position Tasks
-
- Position-based queries in inventories or lists
- NPC Dialogue
-
- Maintaining consistency for game AI characters
- Non-Reasoning Tasks
-
- Tasks that do not use Chain-of-Thought
- How It Works
- Limitations of Causal Attention
- [Context] → [Question]
- ↓
- Cannot reference Question content when processing Context tokens
- Attention weights for Context are already finalized by the time Question tokens appear
- How Prompt Repetition Solves This
- [First Pass] [Second Pass]
- Context → Question → Context' → Question'
- ↑ ↑
- Can reference entire first pass
- In the second repetition, the model
- reprocesses information across the entire first prompt
- and
- strengthens attention weights on key concepts
- , resulting in improved performance.
- Note
- This does not change the model architecture to bidirectional; it is a prompt engineering technique to mitigate the limitations of causal models. Research Results (Google Research 2025) Metric Result Significant improvement (p < 0.1) 47 / 70 benchmarks Performance degradation 0 Neutral 23 Improvement rate 67% Most dramatic improvement: Gemini 2.0 Flash-Lite on NameIndex: 21.33% → 97.33% (+76%p) Tested Models Gemini 2.0 Flash / Flash Lite GPT-4o / GPT-4o-mini Claude 3.7 Sonnet / Claude 3 Haiku Deepseek V3 Tested Benchmarks ARC (Challenge) - Scientific reasoning OpenBookQA - Open-domain QA GSM8K - Math problems MMLU-Pro - Multitask language understanding MATH - Mathematical problem solving NameIndex / MiddleMatch - Custom position tasks Application Procedure Step 1: Verify Auto-Apply Target Models Provider Auto-apply models Excluded models Claude haiku series opus, sonnet Gemini flash, flash-lite pro, ultra OpenAI gpt-4o-mini, gpt-low gpt-4o, gpt-4 Step 2: Determine Repetition Count by Task Type Task Type Keyword Pattern Repetitions Expected Improvement Options-First MCQ A. B. C. D. choices first 2× +15-40%p Index/Position slot , position , index , N-th 3× +50-76%p Context + Question General question 2× +5-15%p With CoT step by step , think through 0× (not applied) ~0% Step 3: Check Token Limits
Check context before auto-apply
max_context
model_context_window * 0.8
80% safety margin
- if
- len
- (
- prompt_tokens
- )
- *
- repetitions
- >
- max_context
- :
- repetitions
- =
- max
- (
- 1
- ,
- int
- (
- max_context
- /
- len
- (
- prompt_tokens
- )
- )
- )
- Step 4: Prompt Transformation
- def
- apply_prompt_repetition
- (
- prompt
- :
- str
- ,
- times
- :
- int
- =
- 2
- )
- -
- >
- str
- :
- """Repeat the prompt a specified number of times
- Args:
- prompt: Original prompt
- times: Number of repetitions (default 2)
- Returns:
- Repeated prompt
- """
- if
- times
- <=
- 1
- :
- return
- prompt
- return
- "\n\n"
- .
- join
- (
- [
- prompt
- ]
- *
- times
- )
- Practical Examples
- Example 1: Options-First MCQ (Greatest Effect)
- Before:
- A. Paris
- B. London
- C. Berlin
- D. Madrid
- Which city is the capital of France?
- Reply with one letter.
- After (repetition ×2 applied):
- A. Paris
- B. London
- C. Berlin
- D. Madrid
- Which city is the capital of France?
- Reply with one letter.
- A. Paris
- B. London
- C. Berlin
- D. Madrid
- Which city is the capital of France?
- Reply with one letter.
- Expected output:
- A
- Accuracy: original 78% → after repetition 93% (+15%p)
- Example 2: Index/Position Tasks (Maximum Effect)
- Before:
- Inventory:
- 1. Iron Sword
- 2. Leather Armor
- 3. Health Potion (x5)
- 4. Magic Staff
- ...
- 25. Dragon Scale
- ...
- 50. Ancient Map
- What item is in slot 25?
- After (repetition ×3 applied):
- Prompt repeated 3 times
- Expected output:
- Dragon Scale
- Accuracy: original 21% → after repetition 97% (+76%p)
- Example 3: Tool Call Prompt Handling
- Note
- Prompts containing tool call instructions are also repeated in their entirety . The full-repetition approach was adopted for implementation simplicity and consistency. Before: Use the calculator tool to compute 234 * 567. What is the result? After (repetition ×2): Use the calculator tool to compute 234 * 567. What is the result? Use the calculator tool to compute 234 * 567. What is the result? Research results show that full repetition including tool call sections is also effective. Production-Ready Implementation Auto-Apply Transformer """prompt_repetition_transformer.py""" from dataclasses import dataclass , field from typing import Optional , Callable , List import re
Context window per model (in tokens)
MODEL_CONTEXT_WINDOWS
{ "claude-3-haiku" : 200_000 , "claude-haiku" : 200_000 , "gemini-flash" : 1_000_000 , "gemini-flash-lite" : 1_000_000 , "gemini-2.0-flash" : 1_000_000 , "gpt-4o-mini" : 128_000 , "gpt-low" : 128_000 , }
Models targeted for auto-apply
AUTO_APPLY_MODELS
list ( MODEL_CONTEXT_WINDOWS . keys ( ) )
CoT patterns (excluded from apply)
COT_PATTERNS
[ r"step by step" , r"think through" , r"let's think" , r"reasoning:" , r"chain of thought" , ]
Position/Index patterns (3× repetition)
POSITION_PATTERNS
[ r"slot \d+" , r"position \d+" , r"index \d+" , r"\d+(st|nd|rd|th)" , r"item \d+" , r"row \d+" , r"column \d+" , ] @dataclass class PromptRepetitionConfig : """Prompt repetition configuration""" default_repetitions : int = 2 position_repetitions : int = 3 separator : str = "\n\n" max_context_ratio : float = 0.8 applied_marker : str = "" class PromptRepetitionTransformer : """Auto-apply prompt repetition transformer for lightweight models""" def init ( self , config : Optional [ PromptRepetitionConfig ] = None ) : self . config = config or PromptRepetitionConfig ( ) def should_apply ( self , model : str , prompt : str ) -
bool : """Determine whether to auto-apply"""
Skip if already applied
if self . config . applied_marker in prompt : return False
Check target model
model_lower
model . lower ( ) if not any ( m in model_lower for m in AUTO_APPLY_MODELS ) : return False
Skip when CoT pattern detected
prompt_lower
prompt . lower ( ) for pattern in COT_PATTERNS : if re . search ( pattern , prompt_lower ) : return False return True def determine_repetitions ( self , prompt : str , model : str ) -
int : """Determine repetition count based on task type""" prompt_lower = prompt . lower ( )
Position/Index pattern detected → 3×
for pattern in POSITION_PATTERNS : if re . search ( pattern , prompt_lower ) : return self . config . position_repetitions return self . config . default_repetitions def estimate_tokens ( self , text : str ) -
int : """Simple token count estimation (speed over precision)"""
Estimate approximately 4 characters = 1 token
return len ( text ) // 4 def transform ( self , prompt : str , model : str ) -
str : """Apply repetition to prompt""" if not self . should_apply ( model , prompt ) : return prompt repetitions = self . determine_repetitions ( prompt , model )
Check context limit
model_lower
model . lower ( ) max_tokens = 128_000
Default value
for m , tokens in MODEL_CONTEXT_WINDOWS . items ( ) : if m in model_lower : max_tokens = tokens break max_allowed = int ( max_tokens * self . config . max_context_ratio ) prompt_tokens = self . estimate_tokens ( prompt )
Reduce repetitions if token limit exceeded
while prompt_tokens * repetitions
max_allowed and repetitions
1 : repetitions -= 1 if repetitions <= 1 : return prompt
Apply repetition + add marker
repeated
self . config . separator . join ( [ prompt ] * repetitions ) return f" { self . config . applied_marker } \n { repeated } " def wrap_llm_call ( self , llm_fn : Callable , model : str ) -
Callable : """Wrap LLM call function""" def wrapped ( prompt : str , ** kwargs ) : transformed = self . transform ( prompt , model ) return llm_fn ( transformed , ** kwargs ) return wrapped How to Measure Effectiveness (Verification) A/B Testing Method def run_ab_test ( prompts : List [ str ] , llm_fn , model : str , ground_truth : List [ str ] ) : """A/B test for prompt repetition effectiveness""" transformer = PromptRepetitionTransformer ( ) results = { "baseline" : [ ] , "repeated" : [ ] } for prompt , expected in zip ( prompts , ground_truth ) :
Baseline
response_a
llm_fn ( prompt ) results [ "baseline" ] . append ( response_a == expected )
With Repetition
repeated_prompt
- transformer
- .
- transform
- (
- prompt
- ,
- model
- )
- response_b
- =
- llm_fn
- (
- repeated_prompt
- )
- results
- [
- "repeated"
- ]
- .
- append
- (
- response_b
- ==
- expected
- )
- baseline_acc
- =
- sum
- (
- results
- [
- "baseline"
- ]
- )
- /
- len
- (
- prompts
- )
- repeated_acc
- =
- sum
- (
- results
- [
- "repeated"
- ]
- )
- /
- len
- (
- prompts
- )
- (
- f"Baseline accuracy:
- {
- baseline_acc
- :
- .2%
- }
- "
- )
- (
- f"Repeated accuracy:
- {
- repeated_acc
- :
- .2%
- }
- "
- )
- (
- f"Improvement:
- {
- repeated_acc
- -
- baseline_acc
- :
- +.2%
- }
- p"
- )
- Key Metrics
- Metric
- Measurement Method
- Accuracy
- Compare correct answer rates
- Consistency
- Variance across 10 runs of same prompt
- Token cost
- Input token increase rate
- Latency
- Compare p50, p99 latency
- When NOT to Use
- Case
- Reason
- Using CoT
- Reasoning process already provides context
- Reasoning models
- (opus, sonnet)
- Already optimized; minimal effect
- Very long prompts
- Risk of exceeding context limit
- Already repeated
- Duplicate application wastes tokens
- Cost-Accuracy Analysis
- Metric
- Baseline
- With Repetition
- Change
- Input tokens
- 500/req
- 1000/req
- +100%
- Output tokens
- 100/req
- 100/req
- 0%
- Latency (p50)
- 450ms
- 460ms
- +2%
- Latency (p99)
- 1200ms
- 1250ms
- +4%
- Accuracy
- 78%
- 89%
- +14%p
- Cost per correct answer
- $0.019
- $0.020
- +5%
- Key insight:
- The prefill phase is highly parallelized on GPU, so doubling input tokens has minimal impact on latency.
- Multi-Agent Integration
- Auto-Apply Strategy Per Agent
- Agent
- Model
- Repetition Applied
- Applied At
- Claude Orchestrator
- opus/sonnet
- Optional
- -
- Claude Executor
- haiku
- Auto
- skill_loader.py
- Gemini Analyst
- flash
- Auto
- On MCP call
- OpenAI
- gpt-4o-mini
- Auto
- skill_loader.py
- Preventing Duplicate Application
- To prevent duplicate application in multi-agent pipelines:
- Use markers
- Detect already-applied prompts with
- marker
- Pass metadata
-
- Pass
- x-prompt-repetition-applied: true
- header between agents
- Orchestrator management
- Claude Orchestrator tracks whether repetition is applied when calling sub-agents Application Pattern [Claude Sonnet] Planning (no repetition needed) ↓ [Gemini Flash] Analysis (repetition ×2 auto-applied, marker added) ↓ [Claude Haiku] Execution (marker detected → skip duplicate apply) skill_loader.py Integration Guide Recommended Implementation
Code to add to skill_loader.py
from prompt_repetition_transformer import PromptRepetitionTransformer class SkillLoader : def init ( self , . . . ) :
... existing code ...
self . prompt_transformer = PromptRepetitionTransformer ( ) def apply_auto_skills ( self , prompt : str , model : str ) -
str : """Handle auto-apply skills"""
Auto-apply prompt-repetition
- for
- skill
- in
- self
- .
- skills
- .
- values
- (
- )
- :
- auto_apply
- =
- skill
- .
- get
- (
- 'data'
- ,
- {
- }
- )
- .
- get
- (
- 'auto-apply'
- ,
- {
- }
- )
- if
- auto_apply
- .
- get
- (
- 'trigger'
- )
- ==
- 'auto'
- :
- target_models
- =
- auto_apply
- .
- get
- (
- 'models'
- ,
- [
- ]
- )
- if
- any
- (
- m
- in
- model
- .
- lower
- (
- )
- for
- m
- in
- target_models
- )
- :
- prompt
- =
- self
- .
- prompt_transformer
- .
- transform
- (
- prompt
- ,
- model
- )
- return
- prompt
- Constraints
- Required Rules
- Lightweight models first
-
- Most effective for haiku, flash, mini series
- Limit repetitions
-
- 2× for general tasks, max 3× for position tasks
- Context monitoring
-
- Be cautious of context overflow due to repetition
- Check markers
-
- Mandatory marker check to prevent duplicate application
- Prohibited Rules
- No padding substitution
-
- Increasing length with
- .
- etc. has no effect (per research)
- Do not combine with CoT
-
- Effects cancel out
- Do not force-apply to reasoning models
-
- Already optimized
- No duplicate application
- Consecutive application without markers wastes tokens Quick Reference === Auto-Apply Target Models === claude-3-haiku, claude-haiku gemini-flash, gemini-flash-lite, gemini-2.0-flash gpt-4o-mini, gpt-low === Repetition Count === General tasks: 2× Position/Index (slot/position/index keywords): 3× With CoT: 0× (not applied) === Effect (Google Research 2025) === Improvement rate: 67% (47/70 benchmarks) Performance degradation: 0 cases Maximum improvement: +76%p (NameIndex) === Cost === Input tokens: +100% Latency: +2% (Prefill parallelization) Cost per correct answer: +5% === Duplicate Application Prevention === Marker: References Prompt Repetition Improves Non-Reasoning LLMs (Leviathan et al., 2025) Chain-of-Thought Prompting Elicits Reasoning (Wei et al., 2023) Re-Reading Improves Reasoning in LLMs (Xu et al., 2024)