# self-improving-agent-builder

Installs: 52
Rank: #14149

## Install

npx skills add https://github.com/rysweet/amplihack --skill self-improving-agent-builder

# Self-Improving Agent Builder

## Purpose

Run a closed-loop improvement cycle on any goal-seeking agent implementation:

EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)

Each iteration measures L1-L12 progressive test scores, identifies failures with `error_analyzer.py`, runs a research step with hypothesis/evidence/counter-arguments, applies targeted fixes, and gates promotion through regression checks.

## When I Activate

- "improve agent" or "self-improving loop"
- "agent eval loop" or "run improvement cycle"
- "benchmark agents" or "compare SDK implementations"
- "iterate on agent scores" or "fix agent regressions"

## Quick Start

User: "Run the self-improving loop on the mini-framework agent for 3 iterations"
Skill: Executes 3 iterations of EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE, then reports per-iteration scores, net improvement, and commits/reverts.

## Runner Script

The self-improvement loop is implemented as a Python CLI:

```bash
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3

# Full options
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
```

Source: `src/amplihack/eval/self_improve/runner.py`

## The Loop (6 Phases per Iteration)

### Phase 1: EVAL

Run the L1-L12 progressive test suite on the current agent implementation.

Execution:

```bash
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
```

Output: Per-level scores and overall baseline.

### Phase 2: ANALYZE

Classify failures using `error_analyzer.py`, which maps each failed question to a failure taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.) and the specific code component responsible.

```python
from amplihack.eval.self_improve import analyze_eval_results

analyses = analyze_eval_results(level_results, score_threshold=0.6)
```

Each `ErrorAnalysis` maps a failure to:

`failure_mode` -> `affected_component` -> `prompt_template`
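For orientation, here is a minimal sketch of the shape such a record could take, built only from the fields named above; the actual `ErrorAnalysis` class in `error_analyzer.py` may differ, and the extra fields are hypothetical:

```python
# Hypothetical sketch of an ErrorAnalysis record, inferred from the mapping
# failure_mode -> affected_component -> prompt_template. Fields beyond those
# three (question_id, score) are illustrative, not the module's real API.
from dataclasses import dataclass

@dataclass
class ErrorAnalysisSketch:
    question_id: str         # hypothetical: which failed question this analysis covers
    score: float             # hypothetical: the question's score (flagged when < 0.6)
    failure_mode: str        # e.g. "retrieval_insufficient", "temporal_ordering_wrong"
    affected_component: str  # specific code component responsible for the failure
    prompt_template: str     # prompt template proposed as the targeted fix
```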

### Phase 3: RESEARCH (New)

The critical-thinking step that prevents blind changes. For each proposed improvement:

1. **State a hypothesis**: What specific change will fix the failure?
2. **Gather evidence**: From eval results, failure patterns, and baseline scores.
3. **Consider counter-arguments**: What could go wrong? Is there a risk of regression?
4. **Make a decision**: Apply, skip, or defer, with full reasoning.

Decisions are logged in `research_decisions.json` for auditability.
Decision criteria (a sketch of this logic follows the table):

| Decision | Criteria |
| --- | --- |
| Apply | Clear failure pattern + prompt template available + low score |
| Skip | Score above 50% (likely stochastic variation) |
| Defer | Ambiguous evidence; needs more data |
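A minimal sketch of how these criteria could look in code, assuming the research step reduces to a single decision function; the function name and inputs are illustrative, not the module's actual API:

```python
# Illustrative sketch of the Apply/Skip/Defer criteria above. Hypothetical
# function; the real research step also records full reasoning per decision
# in research_decisions.json.
def research_decision(score: float, clear_pattern: bool, template_available: bool) -> str:
    if score > 0.5:
        # Above 50%: the miss is likely stochastic variation, not a real failure mode.
        return "skip"
    if clear_pattern and template_available:
        # Clear failure pattern + prompt template available + low score.
        return "apply"
    # Ambiguous evidence: defer until more data accumulates.
    return "defer"
```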
### Phase 4: IMPROVE

Apply the improvements approved by the research step, in priority order (a sketch follows the list):

1. Prompt template improvements (safest, highest impact)
2. Retrieval strategy adjustments
3. Code logic fixes (most risky; needs careful review)
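A small sketch of applying that ordering, assuming each approved improvement is tagged with its tier; the tier names and dict shape are hypothetical, not the runner's actual data model:

```python
# Hypothetical tiers: lower rank = safer, applied first. The "kind" key and
# tier names are illustrative only.
PRIORITY = {"prompt_template": 0, "retrieval_strategy": 1, "code_logic": 2}

def order_improvements(improvements: list[dict]) -> list[dict]:
    # Sort approved improvements so the safest changes are applied first.
    return sorted(improvements, key=lambda imp: PRIORITY[imp["kind"]])
```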
### Phase 5: RE-EVAL

Re-run the same eval suite after applying the fixes to measure their impact.

### Phase 6: DECIDE

Promotion gate (sketched below):

- Net improvement >= +2% overall score: COMMIT the changes
- Any single-level regression > 5%: REVERT all changes
- Otherwise: COMMIT with a marginal-improvement note
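As a sketch, the gate reduces to a few comparisons. This assumes the regression check takes precedence over the commit check; the names and signature are illustrative, though the thresholds match the documented defaults:

```python
# Sketch of the Phase 6 promotion gate. Assumes the regression check wins
# over the improvement check; function name and signature are illustrative.
def promotion_gate(
    net_improvement: float,              # overall score delta, percentage points
    per_level_deltas: dict[str, float],  # per-level score deltas, e.g. {"L3": -6.2}
    improvement_threshold: float = 2.0,  # minimum % improvement to commit
    regression_tolerance: float = 5.0,   # maximum % regression on any level
) -> str:
    if any(d < -regression_tolerance for d in per_level_deltas.values()):
        return "REVERT"                  # some level regressed past tolerance
    if net_improvement >= improvement_threshold:
        return "COMMIT"
    return "COMMIT (marginal improvement)"
```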
## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `sdk_type` | `mini` | Which SDK: mini/claude/copilot/microsoft |
| `max_iterations` | `5` | Maximum improvement iterations |
| `improvement_threshold` | `2.0` | Minimum % improvement to commit |
| `regression_tolerance` | `5.0` | Maximum % regression on any level |
| `levels` | `L1`-`L6` | Which levels to evaluate |
| `output_dir` | `./eval_results/self_improve` | Results directory |
| `dry_run` | `false` | Evaluate only, don't apply changes |
## Programmatic Usage

```python
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)

result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
```
## 4-Way Benchmark Mode

Compare all SDK implementations side by side:

User: "Run a 4-way benchmark comparing all SDK implementations"
Skill: Runs the eval suite on mini, claude, copilot, and microsoft, then generates a comparison table with scores, LOC, and coverage.
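One way to drive this from a script is to invoke the documented runner CLI once per SDK in dry-run mode (evaluate only). This is a sketch, not the skill's built-in benchmark entry point; the per-SDK output directory layout is an assumption:

```python
# Sketch: loop the documented runner CLI over all four SDKs with --dry-run
# so no changes are applied. The output-dir layout is an assumption, and
# aggregating the comparison table from the result files is left out.
import subprocess

for sdk in ["mini", "claude", "copilot", "microsoft"]:
    subprocess.run(
        [
            "python", "-m", "amplihack.eval.self_improve.runner",
            "--sdk", sdk,
            "--iterations", "1",
            "--levels", "L1", "L2", "L3", "L4", "L5", "L6",
            "--output-dir", f"./eval_results/benchmark/{sdk}",
            "--dry-run",  # evaluate only, don't apply changes
        ],
        check=True,
    )
```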
## Integration Points

| Path | Role |
| --- | --- |
| `src/amplihack/eval/self_improve/runner.py` | Self-improvement loop runner |
| `src/amplihack/eval/self_improve/error_analyzer.py` | Failure classification |
| `src/amplihack/eval/progressive_test_suite.py` | L1-L12 eval runner |
| `src/amplihack/agents/goal_seeking/sdk_adapters/` | All 4 SDK implementations |
| `src/amplihack/eval/metacognition_grader.py` | Advanced eval dimensions |
| `src/amplihack/eval/teaching_session.py` | L7 teaching quality eval |