# ARIS — Autonomous ML Research In Sleep

Skill by ara.so — Daily 2026 Skills collection

ARIS (Auto-Research-In-Sleep) turns Claude Code into an autonomous ML research engine. It chains idea discovery → cross-model review loops → paper writing → compiled PDF into hands-off overnight pipelines. Claude Code drives execution while an external model (Codex/GPT-5.4, GLM, DeepSeek, Kimi, etc.) acts as adversarial reviewer — breaking self-play blind spots that single-model review cannot escape.

## What It Does

| Workflow | Trigger | What Runs |
|---|---|---|
| Idea Discovery | `/idea-discovery` | Literature survey → 8–12 ideas → novelty check → pilot GPU runs → ranked report |
| Auto Review Loop | `/auto-review-loop` | 4-round review/fix cycle, score tracked per round (e.g. 5/10 → 7.5/10) |
| Paper Writing | `/paper-writing` | Narrative → outline → figures → LaTeX → PDF → 2-round auto-improvement |
| Full Pipeline | `/research-pipeline` | Chains all three end-to-end from a single prompt |

## Installation
```sh
# 1. Clone and install skills
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git
cp -r Auto-claude-code-research-in-sleep/skills/* ~/.claude/skills/

# 2. Install Codex MCP (cross-model reviewer)
npm install -g @openai/codex
codex setup                       # set model to gpt-5.4 when prompted
claude mcp add codex -s user -- codex mcp-server

# 3. Verify MCP is connected
claude mcp list                   # should show "codex" in the list
```
## Codex Model Configuration

The reviewer model is read from `~/.codex/config.toml`, not from skill files. Edit directly if needed:

```toml
# ~/.codex/config.toml
model = "gpt-5.4"            # recommended: most rigorous reviewer
# model = "gpt-5.3-codex"
# model = "gpt-5.2-codex"
# model = "o3"
```
## Core Workflows

### Workflow 1 — Idea Discovery

```
/idea-discovery "factorized gap in discrete diffusion language models"
```

Be specific — "NLP" produces weak ideas; "factorized gap in discrete diffusion LMs" targets a real research gap.

What runs:

1. Multi-source literature search (arXiv, Scholar, Zotero, Obsidian, local PDFs)
2. Claude brainstorms 8–12 candidate ideas
3. Codex reviewer cross-checks novelty against literature
4. Pilot GPU experiments on top candidates
5. Ranked idea report saved to `idea_discovery_report.md`

### Workflow 2 — Auto Review Loop

```
/auto-review-loop
```

Run from a directory containing your paper draft or experiment results.

What runs:

1. Claude submits current work to the Codex reviewer
2. Codex returns a structured critique with a score out of 10
3. Claude implements fixes (experiments, writing, ablations)
4. Repeat up to 4 rounds or until the score threshold is met

Score curve saved to `docs/auto_review_score_curve.png`

### Workflow 3 — Paper Writing

```
/paper-writing "NARRATIVE_REPORT.md"
```

Point at a narrative markdown file describing your findings.

What runs:

1. Outline generation (sections, figures, tables)
2. Figure generation from experiment results
3. LaTeX source assembly
4. pdflatex compilation
5. 2-round auto-review-and-improve cycle
6. Final PDF + anti-hallucination BibTeX (fetched from DBLP/CrossRef)

### Full Pipeline

```
/research-pipeline "your research direction"
```

Chains Workflows 1 → 2 → 3 from a single prompt. Wake up to a scored, compiled paper.
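The review/fix cycle in Workflow 2 can be sketched as a bounded loop with an early-exit threshold. This is a hypothetical illustration with stub `reviewer`/`fixer` callables, not the skill's actual implementation:

```python
from typing import Callable

def review_loop(
    work: str,
    reviewer: Callable[[str], tuple[float, str]],  # returns (score out of 10, critique)
    fixer: Callable[[str, str], str],              # applies critique, returns revised work
    max_rounds: int = 4,
    threshold: float = 8.0,
) -> list[float]:
    """Run up to max_rounds review/fix cycles; stop early once score >= threshold."""
    scores: list[float] = []
    for _ in range(max_rounds):
        score, critique = reviewer(work)
        scores.append(score)
        if score >= threshold:
            break
        work = fixer(work, critique)
    return scores
```

The returned score list is exactly what the skill plots as its score curve: one entry per completed review round.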
## Inline Configuration Overrides

Append `— key: value` to any command:

```
/research-pipeline "topic" — AUTO_PROCEED: false
/research-pipeline "topic" — human checkpoint: true
/research-pipeline "topic" — arxiv download: true
/research-pipeline "topic" — AUTO_PROCEED: false, human checkpoint: true
```

| Parameter | Default | Effect |
|---|---|---|
| `AUTO_PROCEED` | `true` | `false` = pause at the idea selection gate before committing GPU time |
| `human checkpoint` | `false` | `true` = pause after each review round for manual feedback |
| `arxiv download` | `false` | `true` = download full PDFs during literature survey (vs metadata only) |
| `DBLP_BIBTEX` | `true` | `false` = use LLM-generated BibTeX (not recommended — hallucination risk) |

## Alternative Model Combinations

No Claude or OpenAI API required — swap in any OpenAI-compatible endpoint via the llm-chat MCP server:
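The `— key: value` suffix has an obvious split-on-delimiter structure. A minimal sketch of how such a suffix might be parsed (a hypothetical helper for illustration; the skill's actual parser is not shown in this README):

```python
def parse_overrides(command: str) -> tuple[str, dict[str, str]]:
    """Split a command into its base part and a dict of `key: value` overrides."""
    base, sep, tail = command.partition(" — ")
    overrides: dict[str, str] = {}
    if sep:  # an override suffix was present
        for pair in tail.split(","):
            key, _, value = pair.partition(":")
            overrides[key.strip()] = value.strip()
    return base, overrides
```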
```sh
# Install the bundled llm-chat MCP server
cd Auto-claude-code-research-in-sleep/mcp-servers/llm-chat
pip install -r requirements.txt

# Configure your provider (GLM-4 shown)
export LLM_CHAT_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
export LLM_CHAT_API_KEY="your-key"
export LLM_CHAT_MODEL="glm-4-plus"

# Add to Claude Code
claude mcp add llm-chat -s user -- python server.py
```

Tested reviewer models:

| Provider | Model | Notes |
|---|---|---|
| OpenAI | gpt-5.4 | Recommended — most rigorous |
| Zhipu AI | glm-4-plus | Strong Chinese-language papers |
| MiniMax | abab6.5s-chat | Fast, cost-effective |
| Moonshot | moonshot-v1-128k | Kimi — long-context papers |
| DeepSeek | deepseek-chat | Code-heavy experiments |
| 01.ai | yi-large | LongCat — long context |
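Any provider works as long as it speaks the OpenAI chat-completions wire format. A sketch of the request shape such a server presumably sends (endpoint path and payload follow the OpenAI-compatible convention; the env var names are those set above, and this helper itself is illustrative, not llm-chat's actual code):

```python
import os

def build_review_request(prompt: str) -> tuple[str, dict]:
    """Build the URL and JSON body for an OpenAI-compatible chat completion call."""
    base = os.environ["LLM_CHAT_BASE_URL"]
    url = f"{base.rstrip('/')}/chat/completions"
    body = {
        "model": os.environ["LLM_CHAT_MODEL"],
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, body
```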
## Anti-Hallucination Citations

BibTeX is fetched from real databases by default — no manual flag needed:

```python
# skills/paper-writing/citation_fetcher.py — pattern used internally
import requests

def fetch_bibtex_dblp(title: str) -> str | None:
    """Fetch real BibTeX from DBLP by paper title."""
    resp = requests.get(
        "https://dblp.org/search/publ/api",
        params={"q": title, "format": "json", "h": 1},
    )
    hits = resp.json().get("result", {}).get("hits", {}).get("hit", [])
    if not hits:
        return None
    key = hits[0]["info"].get("key", "")
    bib_resp = requests.get(f"https://dblp.org/rec/{key}.bib")
    return bib_resp.text if bib_resp.ok else None

def fetch_bibtex_crossref(doi: str) -> str | None:
    """Fallback: fetch BibTeX from CrossRef by DOI."""
    resp = requests.get(
        f"https://api.crossref.org/works/{doi}/transform/application/x-bibtex"
    )
    return resp.text if resp.ok else None
```

Disable with `— DBLP_BIBTEX: false` if working fully offline.

## Optional Integrations

### Zotero
Install the Zotero Better BibTeX plugin, then:

```sh
export ZOTERO_API_KEY="your-zotero-web-api-key"
export ZOTERO_LIBRARY_ID="your-library-id"
export ZOTERO_LIBRARY_TYPE="user"   # or "group"
```
Literature search will query your Zotero library before hitting arXiv.

### Obsidian

```sh
export OBSIDIAN_VAULT_PATH="/path/to/your/vault"
```

The skill will search markdown notes in the vault for related work before external queries.

### Feishu / Lark Notifications

```sh
export FEISHU_WEBHOOK_URL="https://open.feishu.cn/open-apis/bot/v2/hook/your-token"
export FEISHU_MODE="push"   # off | push | interactive
```

| Mode | Behaviour |
|---|---|
| `off` | No notifications |
| `push` | One-way alerts: review scores, experiment completions, checkpoints |
| `interactive` | Mobile approval buttons at `AUTO_PROCEED: false` gates |

## Directory Layout After a Pipeline Run

```
your-project/
├── idea_discovery_report.md        # ranked ideas with novelty scores
├── NARRATIVE_REPORT.md             # auto-generated findings narrative
├── paper/
│   ├── main.tex                    # assembled LaTeX
│   ├── main.pdf                    # compiled output
│   ├── figures/                    # auto-generated plots
│   └── references.bib              # real BibTeX from DBLP/CrossRef
├── experiments/
│   ├── pilot_runs/                 # idea-discovery GPU pilots
│   └── review_round_*/             # per-round experiment results
└── docs/
    └── auto_review_score_curve.png
```

## Python Integration Pattern

Trigger ARIS workflows programmatically from a Python script (e.g. a cron job or CI step):

```python
import json
import subprocess
from pathlib import Path

def run_aris_pipeline(
    research_direction: str,
    output_dir: str = ".",
    auto_proceed: bool = True,
    human_checkpoint: bool = False,
    arxiv_download: bool = False,
) -> dict:
    """
    Launch the ARIS full pipeline via the Claude Code CLI.
    Returns parsed score progression from the review curve JSON.
    """
    overrides = ", ".join([
        f"AUTO_PROCEED: {str(auto_proceed).lower()}",
        f"human checkpoint: {str(human_checkpoint).lower()}",
        f"arxiv download: {str(arxiv_download).lower()}",
    ])
    command = f'/research-pipeline "{research_direction}" — {overrides}'
    result = subprocess.run(
        ["claude", "--print", command],
        cwd=output_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"ARIS pipeline failed:\n{result.stderr}")

    # Parse score progression if available
    score_json = Path(output_dir) / "docs" / "review_scores.json"
    if score_json.exists():
        return json.loads(score_json.read_text())
    return {"stdout": result.stdout}
```
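When `review_scores.json` exists, the returned dict presumably carries a per-round score list. The `rounds`/`score` keys below are an assumed schema, inferred from the nightly-job example; these helpers are illustrative, not part of the skill:

```python
def score_progression(data: dict) -> list[float]:
    """Extract per-round review scores from an assumed `rounds` schema."""
    return [r["score"] for r in data.get("rounds", []) if "score" in r]

def improved(data: dict) -> bool:
    """True if the final review score beats the first round's score."""
    scores = score_progression(data)
    return len(scores) >= 2 and scores[-1] > scores[0]
```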
```python
# Example: nightly research job
if __name__ == "__main__":
    scores = run_aris_pipeline(
        research_direction="token-level uncertainty calibration in autoregressive LMs",
        output_dir="./nightly_research",
        auto_proceed=True,
        human_checkpoint=False,
    )
    print(f"Final review score: {scores.get('rounds', [{}])[-1].get('score')}/10")
```

## Skill Composition

ARIS ships 20 composable sub-skills. Chain them manually for custom workflows:
```
# Literature only
/literature-survey "topic"

# Brainstorm without pilot experiments
/idea-brainstorm "topic" — pilot experiments: false

# Single review round (no loop)
/single-review "path/to/draft.md"

# Proof-writing (community skill)
/proof-writer "theorem statement"

# Write paper from existing narrative, skip review
/paper-writing "NARRATIVE.md" — auto-review: false
```

## Troubleshooting

### Codex MCP not found

```sh
claude mcp list       # verify "codex" appears
codex setup           # re-run setup if missing
claude mcp remove codex && \
claude mcp add codex -s user -- codex mcp-server   # re-add
```
### Skills not loading in Claude Code

```sh
ls ~/.claude/skills/            # verify files copied
# Each skill must be a directory with SKILL.md inside
ls ~/.claude/skills/auto-review-loop/SKILL.md
```

### pdflatex not found during paper writing

```sh
# macOS
brew install --cask mactex-no-gui

# Ubuntu/Debian
sudo apt install texlive-full
```

Then retry — the skill auto-detects pdflatex on PATH.
### Reviewer returns empty critique

Check `~/.codex/config.toml` — ensure the model is set and your API key is valid:

```sh
codex "say hello"    # quick smoke test outside Claude Code
```

### GLM/DeepSeek reviewer not triggering

Verify the llm-chat MCP server is listed and configured:

```sh
claude mcp list              # should show "llm-chat"
echo $LLM_CHAT_BASE_URL      # must be set in the shell that launches claude
```

### Score not improving after 4 rounds

- Add `— human checkpoint: true` and inspect each round's critique file in `experiments/review_round_*/`
- Consider switching reviewer model — a different architecture surfaces different weaknesses
- Lower-level issues (bad data, flawed baselines) need manual intervention before another loop

## Community Skills

| Skill | Description |
|---|---|
| proof-writer | Rigorous theorem proof drafting with anti-hallucination citations |

Add your own skill: create `skills/your-skill-name/SKILL.md` and open a PR.

## Cross-Model Review — Why It Works

```
Claude Code (executor)            Codex / external LLM (reviewer)
─────────────────────             ───────────────────────────────
Fast, fluid code execution   ←→   Deliberate, rigorous critique
Broad context retention           Adversarial probing of blind spots
Narrative generation              Structural weakness detection
```

Single-model self-review falls into local minima — the same pattern-matching that generated the work also evaluates it. Cross-model review is adversarial: the reviewer actively probes weaknesses the executor didn't anticipate. The 1→2 model jump produces the largest quality gain; adding more reviewers yields diminishing returns.