GAN-Style Harness Skill

Inspired by Anthropic's Harness Design for Long-Running Application Development (March 24, 2026)

A multi-agent harness that separates generation from evaluation, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.

Core Insight

When asked to evaluate their own work, agents are pathological optimists: they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.

This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.

When to Use

- Building complete applications from a one-line prompt
- Frontend design tasks requiring high visual quality
- Full-stack projects that need working features, not just code
- Any task where "AI slop" aesthetics are unacceptable
- Projects where you want to invest $50-200 for production-quality output

When NOT to Use

- Quick single-file fixes (use standard claude -p)
- Tasks with tight budget constraints (<$10)
- Simple refactoring (use de-sloppify pattern instead)
- Tasks that are already well-specified with tests (use TDD workflow)

Architecture

         ┌─────────────┐
         │   PLANNER   │
         │  (Opus 4.6) │
         └──────┬──────┘
                │ Product Spec
                │ (features, sprints, design direction)
                ▼
    ┌────────────────────────┐
    │                        │
    │   GENERATOR-EVALUATOR  │
    │     FEEDBACK LOOP      │
    │                        │
    │  ┌──────────┐          │
    │  │GENERATOR │--build-->│──┐
    │  │(Opus 4.6)│          │  │
    │  └────▲─────┘          │  │
    │       │                │  │ live app
    │    feedback            │  │
    │       │                │  │
    │  ┌────┴─────┐          │  │
    │  │EVALUATOR │<-test----│──┘
    │  │(Opus 4.6)│          │
    │  │+Playwright│         │
    │  └──────────┘          │
    │                        │
    │   5-15 iterations      │
    └────────────────────────┘

The Three Agents

1. Planner Agent

Role: Product manager. Expands a brief prompt into a full product specification.

Key behaviors:

- Takes a one-line prompt and produces a 16-feature, multi-sprint specification
- Defines user stories, technical requirements, and visual design direction
- Is deliberately ambitious, because conservative planning leads to underwhelming results
- Produces evaluation criteria that the Evaluator will use later

Model: Opus 4.6 (needs deep reasoning for spec expansion)

2. Generator Agent

Role: Developer. Implements features according to the spec.

Key behaviors:

- Works in structured sprints (or continuous mode with newer models)
- Negotiates a "sprint contract" with the Evaluator before writing code
- Uses full-stack tooling: React, FastAPI/Express, databases, CSS
- Manages git for version control between iterations
- Reads Evaluator feedback and incorporates it in the next iteration

Model: Opus 4.6 (needs strong coding capability)

3. Evaluator Agent

Role: QA engineer. Tests the live running application, not just the code.

Key behaviors:

- Uses Playwright MCP to interact with the live application
- Clicks through features, fills forms, tests API endpoints
- Scores against four criteria (configurable):
  - Design Quality: Does it feel like a coherent whole?
  - Originality: Custom decisions vs. template/AI patterns?
  - Craft: Typography, spacing, animations, micro-interactions?
  - Functionality: Do all features actually work?
- Returns structured feedback with scores and specific issues
- Is engineered to be ruthlessly strict and never praises mediocre work

Model: Opus 4.6 (needs strong judgment + tool use)

Evaluation Criteria

The default four criteria, each scored 1-10:

Evaluation Rubric

Design Quality (weight: 0.3)

1-3: Generic, template-like, "AI slop" aesthetics

4-6: Competent but unremarkable, follows conventions

7-8: Distinctive, cohesive visual identity

9-10: Could pass for a professional designer's work

Originality (weight: 0.2)

1-3: Default colors, stock layouts, no personality

4-6: Some custom choices, mostly standard patterns

7-8: Clear creative vision, unique approach

9-10: Surprising, delightful, genuinely novel

Craft (weight: 0.3)

1-3: Broken layouts, missing states, no animations

4-6: Works but feels rough, inconsistent spacing

7-8: Polished, smooth transitions, responsive

9-10: Pixel-perfect, delightful micro-interactions

Functionality (weight: 0.2)

1-3: Core features broken or missing

4-6: Happy path works, edge cases fail

7-8: All features work, good error handling

9-10: Bulletproof, handles every edge case

Scoring

- Weighted score = sum of (criterion_score * weight)
- Pass threshold = 7.0 (configurable)
- Max iterations = 15 (configurable; 5-15 is typically sufficient)

Usage

Via Command

    # Full three-agent harness
    /project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

    # With custom config
    /project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

    # Frontend design mode (generator + evaluator only, no planner)
    /project:gan-design "Create a landing page for a crypto portfolio tracker"

Via Shell Script

    # Basic usage
    ./scripts/gan-harness.sh "Build a music streaming dashboard"

    # With options
    GAN_MAX_ITERATIONS=10 \
    GAN_PASS_THRESHOLD=7.5 \
    GAN_EVAL_CRITERIA="functionality,performance,security" \
    ./scripts/gan-harness.sh "Build a REST API for task management"

Via Claude Code (Manual)

    # Step 1: Plan
    claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

    # Step 2: Generate (iteration 1)
    claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

    # Step 3: Evaluate (iteration 1)
    claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

    # Step 4: Generate (iteration 2, reads feedback)
    claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

    # Repeat steps 3-4 until the pass threshold is met
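The "repeat until pass threshold" step depends on two small pieces of bookkeeping: naming the feedback file for each iteration and checking the weighted score against the threshold. The sketch below shows one way to do both, assuming the four default criteria and weights from the rubric. The function names (feedback_file, weighted_score, passes_threshold) are hypothetical and are not part of gan-harness.sh. Scores are passed in as arguments because the format of the feedback file is not specified here.

```shell
#!/usr/bin/env bash
# Sketch only: helper functions for the manual loop. All names are hypothetical.

# Zero-padded feedback file name for iteration N, e.g. 1 -> feedback-001.md.
feedback_file() {
  printf 'feedback-%03d.md\n' "$1"
}

# Weighted score from the four default criteria (each scored 1-10).
# Arguments: design originality craft functionality.
# Weights follow the default rubric: 0.3, 0.2, 0.3, 0.2.
weighted_score() {
  awk -v d="$1" -v o="$2" -v c="$3" -v f="$4" \
    'BEGIN { printf "%.2f\n", d*0.3 + o*0.2 + c*0.3 + f*0.2 }'
}

# Succeeds (exit 0) when SCORE meets THRESHOLD.
# THRESHOLD falls back to GAN_PASS_THRESHOLD, then to 7.0.
passes_threshold() {
  awk -v s="$1" -v t="${2:-${GAN_PASS_THRESHOLD:-7.0}}" \
    'BEGIN { exit !(s + 0 >= t + 0) }'
}
```

A driver loop could call weighted_score after each evaluation and stop as soon as passes_threshold succeeds, using feedback_file to pick the file the next Generator run should read.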

Evolution Across Model Capabilities

The harness should simplify as models improve. Following Anthropic's evolution:

Stage 1: Weaker Models (Sonnet-class)

- Full sprint decomposition required
- Context resets between sprints (avoid context anxiety)
- 2-agent minimum: Initializer + Coding Agent
- Heavy scaffolding compensates for model limitations

Stage 2: Capable Models (Opus 4.5-class)

- Full 3-agent harness: Planner + Generator + Evaluator
- Sprint contracts before each implementation phase
- 10-sprint decomposition for complex apps
- Context resets still useful but less critical

Stage 3: Frontier Models (Opus 4.6-class)

- Simplified harness: single planning pass, continuous generation
- Evaluation reduced to a single end-pass (model is smarter)
- No sprint structure needed
- Automatic compaction handles context growth

Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| GAN_MAX_ITERATIONS | 15 | Maximum generator-evaluator cycles |
| GAN_PASS_THRESHOLD | 7.0 | Weighted score to pass (1-10) |
| GAN_PLANNER_MODEL | opus | Model for planning agent |
| GAN_GENERATOR_MODEL | opus | Model for generator agent |
| GAN_EVALUATOR_MODEL | opus | Model for evaluator agent |
| GAN_EVAL_CRITERIA | design,originality,craft,functionality | Comma-separated criteria |
| GAN_DEV_SERVER_PORT | 3000 | Port for the live app |
| GAN_DEV_SERVER_CMD | npm run dev | Command to start dev server |
| GAN_PROJECT_DIR | . | Project working directory |
| GAN_SKIP_PLANNER | false | Skip planner, use spec directly |
| GAN_EVAL_MODE | playwright | playwright, screenshot, or code-only |

Evaluation Modes

| Mode | Tools | Best For |
| --- | --- | --- |
| playwright | Browser MCP + live interaction | Full-stack apps with UI |
| screenshot | Screenshot + visual analysis | Static sites, design-only |
| code-only | Tests + linting + build | APIs, libraries, CLI tools |

Anti-Patterns

- Evaluator too lenient: If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.
- Generator ignoring feedback: Ensure feedback is passed as a file, not inline. The generator should read feedback-NNN.md at the start of each iteration.
- Infinite loops: Always set GAN_MAX_ITERATIONS. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.
- Evaluator testing superficially: The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.
- Evaluator praising its own fixes: Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.
- Context exhaustion: For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.

Results: What to Expect

Based on Anthropic's published results:

| Metric | Solo Agent | GAN Harness | Improvement |
| --- | --- | --- | --- |
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |

The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. This is for projects where quality matters.
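The "Infinite loops" anti-pattern above suggests stopping when scores plateau for 3 iterations. A minimal sketch of that check follows, assuming the weighted score of every iteration so far is available as a list of numbers. The function name plateaued is hypothetical, and "plateau" is interpreted here as "none of the last N scores beats the best score seen before them", which is one reasonable reading rather than the harness's documented rule.

```shell
#!/usr/bin/env bash
# Sketch only: detect a score plateau. The function name is hypothetical.

# Usage: plateaued WINDOW SCORE...
# Succeeds (exit 0) when there are more than WINDOW scores and none of the
# last WINDOW scores is higher than the best score recorded before them.
plateaued() {
  window="$1"
  shift
  printf '%s\n' "$@" | awk -v w="$window" '
    { s[NR] = $1 + 0 }                 # one score per line, oldest first
    END {
      if (NR <= w) exit 1              # not enough history yet
      best = s[1]
      for (i = 2; i <= NR - w; i++) if (s[i] > best) best = s[i]
      for (i = NR - w + 1; i <= NR; i++) if (s[i] > best) exit 1
      exit 0
    }'
}
```

For example, with scores 5.0, 6.2, 6.1, 6.2, 6.0 and a window of 3, the last three iterations never exceed the earlier best of 6.2, so the check succeeds and the run would be flagged for human review.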
References

- Anthropic: Harness Design for Long-Running Apps (original paper by Prithvi Rajasekaran)
- Epsilla: The GAN-Style Agent Loop (architecture deconstruction)
- Martin Fowler: Harness Engineering (broader industry context)
- OpenAI: Harness Engineering (OpenAI's parallel work)
