# Self-Improving Agent Builder

## Purpose

Run a closed-loop improvement cycle on any goal-seeking agent implementation:

```
EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)
```

Each iteration measures L1-L12 progressive test scores, identifies failures with `error_analyzer.py`, runs a research step with hypothesis/evidence/counter-arguments, applies targeted fixes, and gates promotion through regression checks.

## When I Activate

- "improve agent" or "self-improving loop"
- "agent eval loop" or "run improvement cycle"
- "benchmark agents" or "compare SDK implementations"
- "iterate on agent scores" or "fix agent regressions"

## Quick Start

```
User: "Run the self-improving loop on the mini-framework agent for 3 iterations"
Skill: Executes 3 iterations of EVAL->ANALYZE->RESEARCH->IMPROVE->RE-EVAL->DECIDE
       Reports per-iteration scores, net improvement, and commits/reverts.
```

## Runner Script

The self-improvement loop is implemented as a Python CLI:
```bash
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3

# Full options
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
```
Source: `src/amplihack/eval/self_improve/runner.py`

## The Loop (6 Phases per Iteration)

### Phase 1: EVAL

Run the L1-L12 progressive test suite on the current agent implementation.

Execution:

```bash
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
```

Output: per-level scores and an overall baseline.

### Phase 2: ANALYZE

Classify failures using `error_analyzer.py`. It maps each failed question to a failure taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.) and to the specific code component responsible.

```python
from amplihack.eval.self_improve import analyze_eval_results

analyses = analyze_eval_results(level_results, score_threshold=0.6)
```

Each `ErrorAnalysis` maps:

```
failure_mode -> affected_component -> prompt_template
```
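For orientation, here is a minimal sketch of the record shape that mapping implies; the actual `ErrorAnalysis` is defined in `error_analyzer.py` and may carry additional fields (scores, question IDs, etc.).

```python
# Illustrative sketch only; see error_analyzer.py for the real definition.
from dataclasses import dataclass

@dataclass
class ErrorAnalysis:
    failure_mode: str        # e.g. "retrieval_insufficient", "temporal_ordering_wrong"
    affected_component: str  # the code component responsible for the failure
    prompt_template: str     # template suggested to address the failure
```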
### Phase 3: RESEARCH (New)

The critical thinking step that prevents blind changes. For each proposed improvement:

1. **State hypothesis**: what specific change will fix the failure?
2. **Gather evidence**: from eval results, failure patterns, baseline scores
3. **Consider counter-arguments**: what could go wrong? Risk of regression?
4. **Make decision**: apply, skip, or defer, with full reasoning

Decisions are logged in `research_decisions.json` for auditability, as illustrated below.
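Purely as an illustration of what gets audited (field names and values here are assumptions, not the runner's actual schema), one logged entry might look like:

```python
# Hypothetical research_decisions.json entry, shown as a Python dict.
example_decision = {
    "hypothesis": "Tightening the L3 prompt template fixes retrieval_insufficient failures",
    "evidence": ["L3 scored below threshold", "most L3 failures share one failure_mode"],
    "counter_arguments": ["A longer prompt could regress earlier levels"],
    "decision": "apply",
}
```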
Decision criteria (a sketch follows the list):

- **Apply**: clear failure pattern + prompt template available + low score
- **Skip**: score above 50% (likely stochastic variation)
- **Defer**: ambiguous evidence, needs more data
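A minimal sketch of how these criteria could translate to code; the function name, signature, and placement of the 0.5 cutoff are assumptions, not the runner's API.

```python
# Hypothetical apply/skip/defer decision; the runner's actual logic
# lives in src/amplihack/eval/self_improve/.
def decide(score: float, has_prompt_template: bool, clear_pattern: bool) -> str:
    if score > 0.5:
        return "skip"   # likely stochastic variation, not a real failure
    if clear_pattern and has_prompt_template:
        return "apply"  # clear failure pattern + fix available + low score
    return "defer"      # ambiguous evidence, needs more data
```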
### Phase 4: IMPROVE

Apply the improvements approved by the research step, in priority order (sketched after the list):

1. Prompt template improvements (safest, highest impact)
2. Retrieval strategy adjustments
3. Code logic fixes (most risky, needs careful review)
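A minimal sketch of that ordering, assuming illustrative improvement-kind names (the runner defines its own improvement types):

```python
# Hypothetical sketch: apply the safest changes first.
RISK_ORDER = ["prompt_template", "retrieval_strategy", "code_logic"]

def order_improvements(approved: list[dict]) -> list[dict]:
    return sorted(approved, key=lambda imp: RISK_ORDER.index(imp["kind"]))
```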
### Phase 5: RE-EVAL

Re-run the same eval suite after applying fixes to measure impact.
### Phase 6: DECIDE

Promotion gate (sketched after the list):

- Net improvement >= +2% overall score: COMMIT the changes
- Any single level regression > 5%: REVERT all changes
- Otherwise: COMMIT with a marginal-improvement note
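The gate, sketched from the rules above; it assumes the regression check takes precedence when both conditions hold, so consult `runner.py` for the authoritative logic.

```python
# Sketch of the DECIDE gate using the documented default thresholds.
def promotion_gate(net_improvement: float, level_regressions: dict[str, float],
                   improvement_threshold: float = 2.0,
                   regression_tolerance: float = 5.0) -> str:
    if any(r > regression_tolerance for r in level_regressions.values()):
        return "REVERT"            # a single level regressed too far
    if net_improvement >= improvement_threshold:
        return "COMMIT"            # clear net win
    return "COMMIT (marginal)"     # committed, flagged as marginal improvement
```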
## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `sdk_type` | `mini` | Which SDK: mini/claude/copilot/microsoft |
| `max_iterations` | `5` | Maximum improvement iterations |
| `improvement_threshold` | `2.0` | Minimum % improvement to commit |
| `regression_tolerance` | `5.0` | Maximum % regression on any level |
| `levels` | L1-L6 | Which levels to evaluate |
| `output_dir` | `./eval_results/self_improve` | Results directory |
| `dry_run` | `false` | Evaluate only, don't apply changes |
## Programmatic Usage

```python
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)

result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
```
## 4-Way Benchmark Mode

Compare all SDK implementations side by side:

```
User: "Run a 4-way benchmark comparing all SDK implementations"
Skill: Runs the eval suite on mini, claude, copilot, microsoft and
       generates a comparison table with scores, LOC, and coverage.
```
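No dedicated benchmark entry point is documented here; one plausible approximation (an assumption, not a documented mode) is an evaluate-only pass per SDK using the programmatic API:

```python
# Sketch: dry-run each SDK once and collect final scores for comparison.
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

scores = {}
for sdk in ["mini", "claude", "copilot", "microsoft"]:
    config = RunnerConfig(
        sdk_type=sdk,
        max_iterations=1,
        dry_run=True,  # evaluate only, apply no changes
        output_dir=f"./eval_results/benchmark/{sdk}",
    )
    scores[sdk] = run_self_improvement(config).final_scores
```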
## Integration Points

- `src/amplihack/eval/self_improve/runner.py`: self-improvement loop runner
- `src/amplihack/eval/self_improve/error_analyzer.py`: failure classification
- `src/amplihack/eval/progressive_test_suite.py`: L1-L12 eval runner
- `src/amplihack/agents/goal_seeking/sdk_adapters/`: all 4 SDK implementations
- `src/amplihack/eval/metacognition_grader.py`: advanced eval dimensions
- `src/amplihack/eval/teaching_session.py`: L7 teaching quality eval