# Evolutionary Metric Ranking

Methodology for systematically zooming into high-quality configurations across multiple evaluation metrics using per-metric percentile cutoffs, intersection-based filtering, and evolutionary optimization. Domain-agnostic principles with quantitative trading case studies.

Companion skills: `rangebar-eval-metrics` (metric definitions) | `adaptive-wfo-epoch` (WFO integration) | `backtesting-py-oracle` (SQL validation)
## When to Use This Skill

Use this skill when:

- Ranking and filtering configs/strategies/models across multiple quality metrics
- Searching for optimal per-metric thresholds that select the best subset
- Identifying which metrics are binding constraints vs inert dimensions
- Running multi-objective optimization (Optuna TPE / NSGA-II) over filter parameters
- Performing forensic analysis on optimization results (universal champions, feature themes)
- Designing a metric registry for pluggable evaluation systems
## Core Principles

### P1 - Percentile Ranks, Not Raw Values

Raw metric values live on incompatible scales (Kelly in [-1, 1], trade count in [50, 5000], Omega in [0.8, 2.0]). Percentile ranking normalizes every metric to [0, 100], making cross-metric comparison meaningful.

- Rule: `scipy.stats.rankdata(method='average')` scaled to [0, 100]
- None/NaN/Inf -> percentile 0 (worst)
- "Lower is better" metrics -> negate before ranking (100 = best)

**Why average ties:** Tied values receive the mean of the ranks they would span. This prevents artificial discrimination between genuinely identical values.
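A minimal sketch of this ranking rule. The linear mapping of average ranks onto [0, 100] is an assumption (the exact scaling is not specified here); the None/NaN/Inf -> 0 rule and the lower-is-better flip follow the bullets above.

```python
import numpy as np
from scipy.stats import rankdata

def percentile_rank(values, higher_is_better=True):
    """Map raw metric values onto [0, 100] percentile ranks."""
    arr = np.array([np.nan if v is None else v for v in values], dtype=float)
    if not higher_is_better:
        arr = -arr                      # flip so that 100 = best for every metric
    finite = np.isfinite(arr)           # None/NaN/Inf stay at percentile 0 (worst)
    pct = np.zeros(arr.shape)
    n = int(finite.sum())
    if n == 1:
        pct[finite] = 100.0
    elif n > 1:
        ranks = rankdata(arr[finite], method="average")   # ties share their mean rank
        pct[finite] = (ranks - 1.0) / (n - 1.0) * 100.0   # assumed linear scaling to [0, 100]
    return pct
```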
### P2 - Independent Per-Metric Cutoffs

Each metric gets its own independently-tunable cutoff. `cutoff=20` means "only configs in the top 20% survive this filter." This creates a 12-dimensional (or N-dimensional) search space where each axis controls one quality dimension.

- cutoff=100 -> no filter (everything passes)
- cutoff=50 -> top 50% survives
- cutoff=10 -> top 10% survives (stringent)
- cutoff=0 -> nothing passes

**Why independent, not uniform:** Different metrics have different discrimination power. Uniform tightening (all metrics at the same cutoff) wastes filtering budget on inert dimensions while under-filtering on binding constraints.
### P3 - Intersection = Multi-Metric Excellence

A config survives the final filter only if it passes ALL per-metric cutoffs simultaneously. This intersection logic ensures no single-metric champion sneaks through with terrible performance elsewhere.

```
survivors = metric_1_pass AND metric_2_pass AND ... AND metric_N_pass
```

**Why intersection, not scoring:** Weighted-sum scoring hides metric failures. A config with 99th percentile Sharpe but 1st percentile regularity would score well in a weighted sum but is clearly deficient. Intersection enforces minimum quality across every dimension.
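A compact sketch covering P2 and P3 together. It assumes percentile ranks are already computed (e.g. with `percentile_rank` above) and that `cutoff=c` translates to the pass condition `percentile >= 100 - c`; the function and variable names are illustrative, not from the source.

```python
def apply_cutoffs(percentiles, cutoffs):
    """percentiles: {metric: {config_id: percentile}}; cutoffs: {metric: int in [0, 100]}.

    A config survives only if it passes every per-metric cutoff (set intersection).
    """
    survivors = None
    for metric, cutoff in cutoffs.items():
        # cutoff=20 -> only configs at or above the 80th percentile pass this metric
        passed = {cid for cid, pct in percentiles[metric].items() if pct >= 100 - cutoff}
        survivors = passed if survivors is None else survivors & passed
    return survivors if survivors is not None else set()
```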
### P4 - Start Wide Open, Tighten Evolutionarily

All cutoffs default to 100% (no filter). The optimizer progressively tightens cutoffs to find the combination that best satisfies the chosen objective. This is the opposite of starting strict and relaxing.

- Initial state: all cutoffs = 100 (1008 configs survive)
- After search: each cutoff independently tuned (11 configs survive)

**Why start wide:** Starting strict risks missing the global optimum by immediately excluding configs that would survive under a different cutoff combination. Wide-to-narrow exploration is characteristic of global optimization.
### P5 - Multiple Objectives Reveal Different Truths

No single objective function captures "quality." Run multiple objectives and compare survivor sets. Configs that survive *all* objectives are the most robust.

| Objective | Asks | Reveals |
|---|---|---|
| `max_survivors_min_cutoff` | Most configs at tightest cutoffs? | Efficient frontier of quantity vs stringency |
| `quality_at_target_n` | Best quality in top N? | Optimal cutoffs for a target portfolio size |
| `tightest_nonempty` | Absolute tightest with >= 1 survivor? | Universal champion (sole survivor) |
| `pareto_efficiency` | Survivors vs tightness trade-off? | Full Pareto front (NSGA-II) |
| `diversity_reward` | Are cutoffs non-redundant? | Which metrics provide independent information |

**Cross-objective consistency:** A config that appears in ALL objective survivor sets is the most defensible selection. One that appears in only one is likely an artifact of that objective's bias.
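Because each optimized objective yields a survivor set, cross-objective consistency reduces to a set intersection. A tiny sketch with hypothetical objective names and survivor sets:

```python
# Hypothetical survivor sets produced by three independently optimized objectives.
survivors_by_objective = {
    "max_survivors_min_cutoff": {"cfg_a", "cfg_b", "cfg_c"},
    "quality_at_target_n": {"cfg_b", "cfg_c"},
    "tightest_nonempty": {"cfg_b"},
}
universal_champions = set.intersection(*survivors_by_objective.values())  # -> {"cfg_b"}
```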
### P6 - Binding Metrics Identification

After optimization, identify *binding metrics* - those that would increase the intersection if relaxed to 100%. Non-binding metrics are either already loose or perfectly correlated with a binding metric.

For each metric with cutoff < 100:

- Relax this metric to 100, keep others fixed
- If intersection grows: this metric IS binding
- If intersection unchanged: this metric is redundant at current cutoffs

**Why this matters:** Binding metrics are the actual constraints on your quality frontier. Effort to improve configs should focus on binding dimensions.
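A sketch of this relax-and-recheck probe, reusing the hypothetical `apply_cutoffs` helper sketched under P3; the function name and result handling are illustrative, the procedure is the one described above.

```python
def find_binding_metrics(percentiles, cutoffs):
    """Return the metrics whose relaxation to 100 would grow the intersection."""
    baseline = len(apply_cutoffs(percentiles, cutoffs))
    binding = []
    for metric, cutoff in cutoffs.items():
        if cutoff >= 100:
            continue                                  # already wide open, cannot bind
        relaxed = dict(cutoffs, **{metric: 100})      # relax this metric, keep others fixed
        if len(apply_cutoffs(percentiles, relaxed)) > baseline:
            binding.append(metric)                    # intersection grew -> binding
    return binding
```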
### P7 - Inert Dimension Detection

A metric is *inert* if it provides zero discrimination across the population. Detect this before optimization to reduce dimensionality.

- If max(metric) == min(metric) across all configs: INERT
- If percentile spread < 5 points: NEAR-INERT

**Action:** Remove inert metrics from the search space or permanently set their cutoff to 100. Including them wastes optimization budget.

### P8 - Forensic Post-Analysis

After optimization, perform forensic analysis to extract actionable insights:

- Universal champions - configs surviving ALL objectives
- Feature frequency - which features appear most in survivors
- Metric binding sequence - order in which metrics become binding as cutoffs tighten
- Tightening curve - intersection size vs uniform cutoff (100% -> 5%)
- Metric discrimination power - which metric kills the most configs at each tightening step

## Architecture Pattern

```
Metric JSONL files (pre-computed)
        |
        v
MetricSpec Registry    <-- Defines name, direction, source, cutoff var
        |
        v
Percentile Ranker      <-- scipy.stats.rankdata, None->0, flip lower-is-better
        |
        v
Per-Metric Cutoff      <-- Each metric independently filtered
        |
        v
Intersection           <-- Configs passing ALL cutoffs
        |
        v
Evolutionary Search    <-- Optuna TPE/NSGA-II tunes cutoffs
        |
        v
Forensic Analysis      <-- Cross-objective consistency, binding metrics
```

## MetricSpec Registry

The registry is the single source of truth for metric definitions. Each entry is a frozen dataclass:

```python
@dataclass(frozen=True)
class MetricSpec:
    name: str               # Internal key (e.g., "tamrs")
    label: str              # Display label (e.g., "TAMRS")
    higher_is_better: bool  # Direction for percentile ranking
    default_cutoff: int     # Default percentile cutoff (100 = no filter)
    source_file: str        # JSONL filename containing raw values
    source_field: str       # Field name in JSONL records
```
**Design principle:** Adding a new metric = adding one MetricSpec entry. No other code changes required. The ranking, cutoff, intersection, and optimization machinery is fully generic.
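For illustration, a registry might look like the sketch below (it assumes the MetricSpec dataclass above plus `from dataclasses import dataclass`). The names `tamrs` and `n_trades` appear elsewhere in this document; the direction flags, cutoffs, file/field names, and the `turnover` entry are placeholders.

```python
# Hypothetical registry entries; only the MetricSpec fields come from this document.
metric_registry = [
    MetricSpec(name="tamrs", label="TAMRS", higher_is_better=True,
               default_cutoff=100, source_file="tamrs.jsonl", source_field="tamrs"),
    MetricSpec(name="n_trades", label="Trades", higher_is_better=True,
               default_cutoff=100, source_file="trades.jsonl", source_field="n_trades"),
    MetricSpec(name="turnover", label="Turnover", higher_is_better=False,  # lower is better
               default_cutoff=100, source_file="trades.jsonl", source_field="turnover"),
]
```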
## Env Var Convention

Each metric's cutoff is controlled by a namespaced environment variable:

```
RBP_RANK_CUT_{METRIC_NAME_UPPER} = integer [0, 100]
```

This enables:

- Shell-level override without code changes
- Copy-paste of optimizer output directly into next run
- CI/CD integration via environment configuration
- Mise task integration via `[env]` blocks
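A sketch of how such overrides could be resolved at startup, assuming the `RBP_RANK_CUT_*` convention above; the helper name and the clamping to [0, 100] are illustrative choices, not taken from the project.

```python
import os

def resolve_cutoffs(metric_registry):
    """Per-metric cutoff: env var override if present, else the registry default."""
    cutoffs = {}
    for spec in metric_registry:
        raw = os.environ.get(f"RBP_RANK_CUT_{spec.name.upper()}")
        value = spec.default_cutoff if raw is None else int(raw)
        cutoffs[spec.name] = max(0, min(100, value))  # clamp to the documented [0, 100] range
    return cutoffs
```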
## Evolutionary Optimizer Design

### Sampler Selection

| Scenario | Sampler | Why |
|---|---|---|
| Single-objective | TPE (Tree-structured Parzen Estimator) | Bayesian, handles integer/categorical, good for 10-20 dimensions |
| Multi-objective (2+) | NSGA-II | Pareto-frontier discovery, population-based |

**Determinism:** Always seed the sampler (`seed=42`). Optimization results must be reproducible.
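A minimal seeded single-objective study for the TPE row, using Optuna's standard `create_study`/`TPESampler` API; the trial count is a placeholder and `objective` refers to the callable defined in the sections below.

```python
import optuna

# Single-objective case: seeded TPE. The multi-objective NSGA-II setup appears in Pattern 5.
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),   # seeded for reproducible runs
)
study.optimize(objective, n_trials=10_000)
```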
### Search Space Design

```python
def suggest_cutoffs(trial):
    cutoffs = {}
    for spec in metric_registry:
        cutoffs[spec.name] = trial.suggest_int(spec.name, 5, 100, step=5)
    return cutoffs
```
**Why step=5:** Shrinks each metric's axis to 20 candidate values (vs ~100) while maintaining sufficient granularity. For 12 metrics, this is 20^12 ≈ 4 x 10^15 combinations vs 100^12 = 10^24.

**Why lower bound = 5:** cutoff=0 always produces an empty intersection, and values below 5 are too stringent to be useful in practice.

### Data Pre-Loading (Critical Performance Pattern)
```python
# Load metric data ONCE, share across all trials
metric_data = load_metric_data(results_dir, metric_registry)

def objective(trial):
    cutoffs = suggest_cutoffs(trial)
    # Pass pre-loaded data - avoids disk I/O per trial
    result = run_ranking_with_cutoffs(cutoffs, metric_data=metric_data)
    return obj_fn(result, cutoffs)
```

**Why:** Each trial evaluates in ~6ms when data is pre-loaded (pure NumPy/set operations). Without pre-loading, each trial incurs ~50ms of disk I/O. At 10,000 trials, this is 60 seconds vs 500 seconds.

## Objective Function Patterns

### Pattern 1 - Ratio Optimization

```python
def obj_max_survivors_min_cutoff(result, cutoffs):
    n = result["n_intersection"]
    if n == 0:
        return 0.0
    mean_cutoff = sum(cutoffs.values()) / len(cutoffs)
    return n / mean_cutoff   # More survivors per unit of looseness
```
**Use when:** Exploring the efficiency frontier - how much quality can you get for how much filtering?

### Pattern 2 - Constrained Quality

```python
def obj_quality_at_target_n(result, cutoffs, target_n=10):
    n = result["n_intersection"]
    avg_pct = result["avg_percentile"]
    if n < target_n:
        return avg_pct * (n / target_n)   # Partial credit
    return avg_pct                        # Full credit: maximize quality
```

**Use when:** You have a target portfolio size and want the highest quality subset.

### Pattern 3 - Minimum Budget

```python
def obj_tightest_nonempty(result, cutoffs):
    n = result["n_intersection"]
    if n == 0:
        return 0.0
    total_budget = sum(cutoffs.values())
    max_possible_budget = 100 * len(cutoffs)   # budget with every cutoff wide open
    return max_possible_budget - total_budget  # Lower budget = better
```

**Use when:** Finding the single most universally excellent config.
### Pattern 4 - Diversity Reward

```python
def obj_diversity_reward(result, cutoffs):
    n = result["n_intersection"]
    if n == 0:
        return 0.0
    n_binding = result["n_binding_metrics"]
    n_active = sum(1 for v in cutoffs.values() if v < 100)
    if n_active == 0:
        return 0.0
    efficiency = n_binding / n_active
    return n * efficiency
```
**Use when:** Ensuring that tightened cutoffs provide independent information, not redundant filtering.

### Pattern 5 - Pareto (Multi-Objective)

```python
study = optuna.create_study(
    directions=["maximize", "minimize"],             # max survivors, min cutoff
    sampler=optuna.samplers.NSGAIISampler(seed=42),
)

def objective(trial):
    cutoffs = suggest_cutoffs(trial)
    result = run_ranking_with_cutoffs(cutoffs, metric_data=metric_data)
    return result["n_intersection"], sum(cutoffs.values()) / len(cutoffs)
```
**Use when:** You want to see the full trade-off landscape between two competing objectives.

## Forensic Analysis Protocol

After running all objectives, perform this analysis:

### Step 1 - Cross-Objective Survivor Sets

For each objective:

```
survivors_{objective} = set of configs in final intersection
universal_champions = survivors_1 AND survivors_2 AND ... AND survivors_K
```

If a config survives all K objective functions, it is robust to objective choice.

### Step 2 - Feature Theme Extraction

Count feature appearances across all survivors:

```python
from collections import Counter

feature_counts = Counter()
for config_id in quality_survivors:
    for feature in config_id.split("__"):
        feature_counts[feature.split("_")[0]] += 1
```

Dominant features reveal the underlying market microstructure that the ranking system is selecting for.

### Step 3 - Uniform Tightening Curve

Apply the same cutoff to ALL metrics and plot intersection size:

```
@100%: 1008 survivors (no filter)
@80%:   502 survivors
@60%:   210 survivors
@40%:    68 survivors
@20%:    12 survivors
@10%:     3 survivors
@5%:      0 survivors
```

The shape of this curve reveals whether the metric space has natural clusters or is uniformly distributed.

### Step 4 - Binding Sequence

Tighten uniformly and at each step identify which metric was the "tightest killer" - the metric that eliminated the most configs:

```
@90%: 410 survivors | tightest killer: rachev (-57)
@80%: 132 survivors | tightest killer: headroom (-27)
@70%:  29 survivors | tightest killer: n_trades (-12)
@60%:   6 survivors | tightest killer: dsr (-6)
```

This reveals the binding constraint hierarchy.

## Implementation Checklist

When implementing this methodology in a new domain:

- Define MetricSpec registry (name, direction, source, default cutoff)
- Implement percentile ranking (scipy.stats.rankdata)
- Implement per-metric cutoff application
- Implement set intersection across all metrics
- Add env var override for each cutoff
- Create run_ranking_with_cutoffs() API function
- Add binding metric detection
- Create tightening analysis function
- Write markdown report generator
- Add Optuna optimizer with at least 3 objective functions
- Pre-load metric data for optimizer performance
- Run 5-objective forensic analysis (10K+ trials per objective)
- Extract universal champions (cross-objective consistency)
- Identify inert dimensions (remove from search space)
- Document binding constraint sequence
- Record feature themes in survivors

## Anti-Patterns

| Anti-Pattern | Symptom | Fix | Severity |
|---|---|---|---|
| Weighted-sum scoring | Single metric dominates, others ignored | Use intersection (P3) | CRITICAL |
| Starting strict | Miss global optimum, premature convergence | Start at 100%, tighten (P4) | HIGH |
| Uniform cutoffs only | Over-filters inert metrics, under-filters binding ones | Per-metric independent cutoffs (P2) | HIGH |
| Single objective | Artifact of objective bias | Run 5+ objectives, check consistency (P5) | HIGH |
| Raw value comparison | Scale-dependent, misleading | Always use percentile ranks (P1) | HIGH |
| Including inert metrics | Wastes optimization budget | Detect and remove inert dimensions (P7) | MEDIUM |
| No data pre-loading | Optimizer 10x slower | Pre-load once, share across trials | MEDIUM |
| Unseeded optimizer | Non-reproducible results | Always seed sampler (seed=42) | MEDIUM |
| Missing forensic analysis | Raw numbers without insight | Run full forensic protocol (P8) | MEDIUM |

## References

| Topic | Reference File |
|---|---|
| Range Bar Case Study | case-study-rangebar-ranking.md |
| Objective Functions | objective-functions.md |
| Metric Design Guide | metric-design-guide.md |