Codex Autoresearch Skill by ara.so — Daily 2026 Skills collection. Codex Autoresearch is a Codex skill that runs an autonomous modify→verify→keep/revert loop on your codebase. You describe a measurable goal in one sentence; Codex confirms the plan, then iterates unattended — every improvement stacks in git, every failure reverts automatically — until interrupted or a cap is reached. Inspired by Karpathy's autoresearch concept, generalized beyond ML training to any software metric. Installation Option A — manual copy into your project: git clone https://github.com/leo-lilinxiao/codex-autoresearch.git cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch Option B — Codex skill installer: $skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch The skill lives at .agents/skills/codex-autoresearch/ inside your project. No config file is required before first use. How to Activate Open Codex in your project directory and prefix your goal with $codex-autoresearch : $codex-autoresearch I want to get rid of all any types in my TypeScript code Codex will: Scan the repo and infer scope, metric, verify command, and guard command. Present a confirmation summary — reply go (or correct anything). Run the loop unattended until you interrupt it or the goal is met. You never write config. Codex infers everything. Confirmation Flow Before the loop starts Codex always shows what it found and asks you to confirm. Example exchange: Codex: I found 47 any occurrences across src//*.ts. Confirmed: - Target: eliminate any types in src//.ts - Metric: any count (current: 47), direction: lower - Verify: grep + tsc --noEmit as guard Need to confirm: - Run until all gone, or cap at N iterations? Reply "go" to start, or tell me what to change. You: Go, run overnight. Codex: Starting — baseline: 47. Iterating until interrupted. Up to five confirmation rounds are possible. After that, Codex proceeds. The Loop (internals) PHASE 0: Probe environment (CPU/GPU/RAM/toolchains), check for session resume PHASE 1: Read context + lessons file from prior run (if any) LOOP (forever or N times): 1. Review current state, git history, results log, lessons 2. Pick ONE hypothesis (apply perspectives, filter by environment) -- or N hypotheses if parallel mode is active 3. Make ONE atomic change 4. git commit (before verification) 5. Run verify command → did the target metric improve? Run guard command → did anything else break? 6. Improved → keep (extract lesson) Worse → approved rollback strategy (git revert) Crashed → fix or skip 7. Log the result to results log 8. Health check (disk, git, verify health) 9. If 3+ discards → REFINE; 5+ → PIVOT; 2 PIVOTs → web search 10. Repeat. Never stop. Never ask. The loop runs unbounded unless you say Iterations: N during confirmation. Dual-Gate Verification Two commands serve distinct purposes: Gate Purpose Fails means Verify Did the target metric improve? Change discarded, reverted Guard Did anything else break? Change reworked (up to 2 attempts), then reverted Guard files are never modified by the loop. Example verify + guard pair for a Python coverage run: Verify: pytest --cov=src --cov-report=term 2>&1 | grep TOTAL | awk '{print $NF}' Guard: python -m mypy src --ignore-missing-imports Example for TypeScript type cleanup: Verify: grep -r "any" src --include=".ts" | wc -l Guard: npx tsc --noEmit Modes Codex maps your sentence to one of seven modes automatically — you never pick a mode explicitly. loop — iterate toward a measurable target (default) $codex-autoresearch Improve test coverage in src/ to at least 80% $codex-autoresearch Reduce bundle size — it's currently 2.3 MB, get it under 1 MB plan — turn a vague goal into a validated loop config $codex-autoresearch I want to make our API faster but I don't know where to start Codex will interview you (p95 latency vs throughput? which endpoint?) and produce a ready-to-run loop config. fix — repair errors until count reaches zero $codex-autoresearch pytest is failing, 12 tests broken after the refactor — fix them all debug — evidence-driven root-cause hunting $codex-autoresearch Our API returns 503 randomly under load, no idea why Each iteration tests one falsifiable hypothesis. Codex presents evidence, not guesses. security — read-only STRIDE + OWASP audit $codex-autoresearch Is this code secure? ship — readiness verification and release gating $codex-autoresearch Ship it exec — one-shot execution with no loop $codex-autoresearch Run the benchmark suite and summarize results Inline Configuration (optional) You can override defaults inline during the confirmation step — no file edits needed: Phrase Effect Iterations: 20 Cap the loop at 20 iterations Parallel: 3 Test 3 hypotheses concurrently per round Guard: npm test Override the inferred guard command Verify: Override the inferred verify command Scope: src/api/ Restrict changes to a subdirectory Example during confirmation: You: Go. Iterations: 30, Guard: npm test, Scope: src/api/ Cross-Run Learning At the end of each iteration Codex writes a structured lesson to .agents/skills/codex-autoresearch/lessons.md : Iteration 7 — KEPT Hypothesis: replace explicit any with inferred generic in src/utils/mapper.ts Change: added > to mapKeys() Result: any count 31 → 29 Lesson: Generic constraints on utility functions eliminate clusters of any downstream. On session resume Codex reads this file first. Each new run benefits from prior runs. To resume an interrupted run: $codex-autoresearch Resume Codex re-reads the lessons file, checks git state, re-establishes the baseline, and continues. Parallel Experiments Request parallel mode during confirmation or at any time: You: Go, parallel 4 Codex runs four hypotheses concurrently, keeps the best result, discards the rest. Useful when hypothesis space is large. Pivot Protocol If the loop stalls, escalation happens automatically: Consecutive discards Action 3 REFINE — narrow hypothesis, try smaller atomic changes 5 PIVOT — change strategy entirely 2 PIVOTs Web search — Codex fetches external references to unstick itself You are never asked for permission during escalation. The loop continues. Real Code Examples Example 1 — TypeScript any elimination (Python verify script) If you want a custom verify script instead of a one-liner:

scripts/count_any.py

import subprocess , sys result = subprocess . run ( [ "grep" , "-r" , "--include=*.ts" , r"\bany\b" , "src/" ] , capture_output = True , text = True ) count = len ( result . stdout . strip ( ) . splitlines ( ) ) print ( count ) sys . exit ( 0 )

always exit 0; the number is what matters

Tell Codex during confirmation: Verify: python scripts/count_any.py Guard: npx tsc --noEmit Example 2 — pytest coverage loop (Python)

scripts/coverage_pct.py

import subprocess , re , sys out = subprocess . check_output ( [ "pytest" , "--cov=src" , "--cov-report=term" , "-q" ] , stderr = subprocess . STDOUT , text = True ) match = re . search ( r"TOTAL\s+\d+\s+\d+\s+(\d+)%" , out ) if match : print ( int ( match . group ( 1 ) ) ) sys . exit ( 0 ) print ( 0 ) sys . exit ( 0 ) $codex-autoresearch Improve test coverage — target 85% Verify: python scripts/coverage_pct.py Guard: python -m mypy src Direction: higher Target: 85 Iterations: 50 Example 3 — bundle size loop (Node.js project)

scripts/bundle_size.sh

!/usr/bin/env bash

npm run build --silent 2

/dev/null du -k dist/bundle.js | awk '{print $1}' $codex-autoresearch Reduce our JS bundle size, currently ~2300 KB, target under 900 KB Verify: bash scripts/bundle_size.sh Guard: npm test Direction: lower Target: 900 Example 4 — lint warning count (any language)

scripts/lint_count.sh

!/usr/bin/env bash

npx eslint src/ --format json 2

/dev/null \ | python3 -c "import sys,json; d=json.load(sys.stdin); print(sum(len(f['messages']) for f in d))" $codex-autoresearch Get our ESLint warning count to zero Verify: bash scripts/lint_count.sh Direction: lower Target: 0 Unattended Runs For overnight or long runs, ensure Codex CLI approval settings do not interrupt git commit or git revert commands. The simplest option is to run in a disposable or sandboxed repo clone: git clone . /tmp/autoresearch-sandbox cd /tmp/autoresearch-sandbox

launch Codex here with full permissions

Results accumulate in git history. Pull the winning commits back to your main repo when done:

in your main repo

git fetch /tmp/autoresearch-sandbox main git cherry-pick < winning-commit-sha

Session Artifacts File Contents .agents/skills/codex-autoresearch/lessons.md Structured lessons from every iteration .agents/skills/codex-autoresearch/results.log Full per-iteration log (metric value, kept/reverted, elapsed) .agents/skills/codex-autoresearch/session.json Current session state for resume These files persist across Codex sessions. Delete them to start fresh. Troubleshooting Loop reverts every change: Verify command may be returning a non-numeric value. Test it manually: bash -c "" should print a single number. Metric direction may be wrong. Confirm Direction: lower or Direction: higher during setup. Guard fires on unrelated files: Narrow scope: Scope: src/specific-module/ Or tell Codex explicitly: Do not touch tests/ during confirmation. Session resume picks up wrong baseline: Delete session.json to force a fresh baseline: rm .agents/skills/codex-autoresearch/session.json Parallel mode produces merge conflicts: Codex handles this internally via the pivot protocol, but if it gets stuck, reduce parallelism: Parallel: 2 Codex asks questions mid-loop: This means a guard crash produced ambiguous output. Pre-empt it by specifying Guard: || true if guard failures should be non-fatal, or by giving Codex fuller sandbox permissions so it can run git commands freely. Loop hits PIVOT but makes no progress: Supply a seed hypothesis during confirmation: Hint: try tree-shaking unused imports first Or run plan mode first to produce a richer hypothesis list before switching to loop . Quick Reference

Start a loop

$codex-autoresearch

Resume interrupted run

$codex-autoresearch Resume

Bounded run

$codex-autoresearch — Iterations: 25

Parallel hypotheses

$codex-autoresearch — Parallel: 4

Force a mode

$codex-autoresearch fix pytest has 8 failures, repair them

Read-only audit

$codex-autoresearch security Audit src/api/ for injection vulnerabilities

codex-autoresearch-loop

安装

scripts/count_any.py

always exit 0; the number is what matters

scripts/coverage_pct.py

scripts/bundle_size.sh

!/usr/bin/env bash

scripts/lint_count.sh

!/usr/bin/env bash

launch Codex here with full permissions

in your main repo

Start a loop

Resume interrupted run

Bounded run

Parallel hypotheses

Force a mode

Read-only audit