# Evolutionary Metric Ranking

Methodology for systematically zooming into high-quality configurations across multiple evaluation metrics using per-metric percentile cutoffs, intersection-based filtering, and evolutionary optimization. Domain-agnostic principles with quantitative trading case studies.

Companion skills: `rangebar-eval-metrics` (metric definitions) | `adaptive-wfo-epoch` (WFO integration) | `backtesting-py-oracle` (SQL validation)
## When to Use This Skill

Use this skill when:

- Ranking and filtering configs/strategies/models across multiple quality metrics
- Searching for optimal per-metric thresholds that select the best subset
- Identifying which metrics are binding constraints vs inert dimensions
- Running multi-objective optimization (Optuna TPE / NSGA-II) over filter parameters
- Performing forensic analysis on optimization results (universal champions, feature themes)
- Designing a metric registry for pluggable evaluation systems
## Core Principles

### P1 - Percentile Ranks, Not Raw Values

Raw metric values live on incompatible scales (Kelly in [-1, 1], trade count in [50, 5000], Omega in [0.8, 2.0]). Percentile ranking normalizes every metric to [0, 100], making cross-metric comparison meaningful.

- Rule: `scipy.stats.rankdata(method='average')` scaled to [0, 100]
- None/NaN/Inf -> percentile 0 (worst)
- "Lower is better" metrics -> negate before ranking (100 = best)

**Why average ties:** Tied values receive the mean of the ranks they would span. This prevents artificial discrimination between genuinely identical values.
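A minimal sketch of this ranking rule. The linear mapping of average ranks onto [0, 100] is an assumption (the exact scaling is not specified here); the None/NaN/Inf -> 0 rule and the lower-is-better flip follow the bullets above.

```python
import numpy as np
from scipy.stats import rankdata

def percentile_rank(values, higher_is_better=True):
    """Map raw metric values onto [0, 100] percentile ranks."""
    arr = np.array([np.nan if v is None else v for v in values], dtype=float)
    if not higher_is_better:
        arr = -arr                      # flip so that 100 = best for every metric
    finite = np.isfinite(arr)           # None/NaN/Inf stay at percentile 0 (worst)
    pct = np.zeros(arr.shape)
    n = int(finite.sum())
    if n == 1:
        pct[finite] = 100.0
    elif n > 1:
        ranks = rankdata(arr[finite], method="average")   # ties share their mean rank
        pct[finite] = (ranks - 1.0) / (n - 1.0) * 100.0   # assumed linear scaling to [0, 100]
    return pct
```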
### P2 - Independent Per-Metric Cutoffs

Each metric gets its own independently-tunable cutoff. `cutoff=20` means "only configs in the top 20% survive this filter." This creates a 12-dimensional (or N-dimensional) search space where each axis controls one quality dimension.

- cutoff=100 -> no filter (everything passes)
- cutoff=50 -> top 50% survives
- cutoff=10 -> top 10% survives (stringent)
- cutoff=0 -> nothing passes

**Why independent, not uniform:** Different metrics have different discrimination power. Uniform tightening (all metrics at the same cutoff) wastes filtering budget on inert dimensions while under-filtering on binding constraints.
### P3 - Intersection = Multi-Metric Excellence

A config survives the final filter only if it passes ALL per-metric cutoffs simultaneously. This intersection logic ensures no single-metric champion sneaks through with terrible performance elsewhere.

```
survivors = metric_1_pass AND metric_2_pass AND ... AND metric_N_pass
```

**Why intersection, not scoring:** Weighted-sum scoring hides metric failures. A config with 99th percentile Sharpe but 1st percentile regularity would score well in a weighted sum but is clearly deficient. Intersection enforces minimum quality across every dimension.
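A compact sketch covering P2 and P3 together. It assumes percentile ranks are already computed (e.g. with `percentile_rank` above) and that `cutoff=c` translates to the pass condition `percentile >= 100 - c`; the function and variable names are illustrative, not from the source.

```python
def apply_cutoffs(percentiles, cutoffs):
    """percentiles: {metric: {config_id: percentile}}; cutoffs: {metric: int in [0, 100]}.

    A config survives only if it passes every per-metric cutoff (set intersection).
    """
    survivors = None
    for metric, cutoff in cutoffs.items():
        # cutoff=20 -> only configs at or above the 80th percentile pass this metric
        passed = {cid for cid, pct in percentiles[metric].items() if pct >= 100 - cutoff}
        survivors = passed if survivors is None else survivors & passed
    return survivors if survivors is not None else set()
```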
### P4 - Start Wide Open, Tighten Evolutionarily

All cutoffs default to 100% (no filter). The optimizer progressively tightens cutoffs to find the combination that best satisfies the chosen objective. This is the opposite of starting strict and relaxing.

- Initial state: all cutoffs = 100 (1008 configs survive)
- After search: each cutoff independently tuned (11 configs survive)

**Why start wide:** Starting strict risks missing the global optimum by immediately excluding configs that would survive under a different cutoff combination. Wide-to-narrow exploration is characteristic of global optimization.
### P5 - Multiple Objectives Reveal Different Truths

No single objective function captures "quality." Run multiple objectives and compare survivor sets. Configs that survive *all* objectives are the most robust.

| Objective | Asks | Reveals |
|---|---|---|
| `max_survivors_min_cutoff` | Most configs at tightest cutoffs? | Efficient frontier of quantity vs stringency |
| `quality_at_target_n` | Best quality in top N? | Optimal cutoffs for a target portfolio size |
| `tightest_nonempty` | Absolute tightest with >= 1 survivor? | Universal champion (sole survivor) |
| `pareto_efficiency` | Survivors vs tightness trade-off? | Full Pareto front (NSGA-II) |
| `diversity_reward` | Are cutoffs non-redundant? | Which metrics provide independent information |

**Cross-objective consistency:** A config that appears in ALL objective survivor sets is the most defensible selection. One that appears in only one is likely an artifact of that objective's bias.
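Because each optimized objective yields a survivor set, cross-objective consistency reduces to a set intersection. A tiny sketch with hypothetical objective names and survivor sets:

```python
# Hypothetical survivor sets produced by three independently optimized objectives.
survivors_by_objective = {
    "max_survivors_min_cutoff": {"cfg_a", "cfg_b", "cfg_c"},
    "quality_at_target_n": {"cfg_b", "cfg_c"},
    "tightest_nonempty": {"cfg_b"},
}
universal_champions = set.intersection(*survivors_by_objective.values())  # -> {"cfg_b"}
```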
### P6 - Binding Metrics Identification

After optimization, identify *binding metrics* - those that would increase the intersection if relaxed to 100%. Non-binding metrics are either already loose or perfectly correlated with a binding metric.

For each metric with cutoff < 100:

- Relax this metric to 100, keep others fixed
- If intersection grows: this metric IS binding
- If intersection unchanged: this metric is redundant at current cutoffs

**Why this matters:** Binding metrics are the actual constraints on your quality frontier. Effort to improve configs should focus on binding dimensions.
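A sketch of this relax-and-recheck probe, reusing the hypothetical `apply_cutoffs` helper sketched under P3; the function name and result handling are illustrative, the procedure is the one described above.

```python
def find_binding_metrics(percentiles, cutoffs):
    """Return the metrics whose relaxation to 100 would grow the intersection."""
    baseline = len(apply_cutoffs(percentiles, cutoffs))
    binding = []
    for metric, cutoff in cutoffs.items():
        if cutoff >= 100:
            continue                                  # already wide open, cannot bind
        relaxed = dict(cutoffs, **{metric: 100})      # relax this metric, keep others fixed
        if len(apply_cutoffs(percentiles, relaxed)) > baseline:
            binding.append(metric)                    # intersection grew -> binding
    return binding
```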
### P7 - Inert Dimension Detection

A metric is *inert* if it provides zero discrimination across the population. Detect this before optimization to reduce dimensionality.

- If max(metric) == min(metric) across all configs: INERT
- If percentile spread < 5 points: NEAR-INERT

**Action:** Remove inert metrics from the search space or permanently set their cutoff to 100. Including them wastes optimization budget.

### P8 - Forensic Post-Analysis

After optimization, perform forensic analysis to extract actionable insights:

- Universal champions - configs surviving ALL objectives
- Feature frequency - which features appear most in survivors
- Metric binding sequence - order in which metrics become binding as cutoffs tighten
- Tightening curve - intersection size vs uniform cutoff (100% -> 5%)
- Metric discrimination power - which metric kills the most configs at each tightening step

## Architecture Pattern

```
Metric JSONL files (pre-computed)
        |
        v
MetricSpec Registry    <-- Defines name, direction, source, cutoff var
        |
        v
Percentile Ranker      <-- scipy.stats.rankdata, None->0, flip lower-is-better
        |
        v
Per-Metric Cutoff      <-- Each metric independently filtered
        |
        v
Intersection           <-- Configs passing ALL cutoffs
        |
        v
Evolutionary Search    <-- Optuna TPE/NSGA-II tunes cutoffs
        |
        v
Forensic Analysis      <-- Cross-objective consistency, binding metrics
```

## MetricSpec Registry

The registry is the single source of truth for metric definitions. Each entry is a frozen dataclass:

```python
@dataclass(frozen=True)
class MetricSpec:
    name: str               # Internal key (e.g., "tamrs")
    label: str              # Display label (e.g., "TAMRS")
    higher_is_better: bool  # Direction for percentile ranking
    default_cutoff: int     # Default percentile cutoff (100 = no filter)
    source_file: str        # JSONL filename containing raw values
    source_field: str       # Field name in JSONL records
```
**Design principle:** Adding a new metric = adding one MetricSpec entry. No other code changes required. The ranking, cutoff, intersection, and optimization machinery is fully generic.
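For illustration, a registry might look like the sketch below (it assumes the MetricSpec dataclass above plus `from dataclasses import dataclass`). The names `tamrs` and `n_trades` appear elsewhere in this document; the direction flags, cutoffs, file/field names, and the `turnover` entry are placeholders.

```python
# Hypothetical registry entries; only the MetricSpec fields come from this document.
metric_registry = [
    MetricSpec(name="tamrs", label="TAMRS", higher_is_better=True,
               default_cutoff=100, source_file="tamrs.jsonl", source_field="tamrs"),
    MetricSpec(name="n_trades", label="Trades", higher_is_better=True,
               default_cutoff=100, source_file="trades.jsonl", source_field="n_trades"),
    MetricSpec(name="turnover", label="Turnover", higher_is_better=False,  # lower is better
               default_cutoff=100, source_file="trades.jsonl", source_field="turnover"),
]
```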
## Env Var Convention

Each metric's cutoff is controlled by a namespaced environment variable:

```
RBP_RANK_CUT_{METRIC_NAME_UPPER} = integer [0, 100]
```

This enables:

- Shell-level override without code changes
- Copy-paste of optimizer output directly into next run
- CI/CD integration via environment configuration
- Mise task integration via `[env]` blocks
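A sketch of how such overrides could be resolved at startup, assuming the `RBP_RANK_CUT_*` convention above; the helper name and the clamping to [0, 100] are illustrative choices, not taken from the project.

```python
import os

def resolve_cutoffs(metric_registry):
    """Per-metric cutoff: env var override if present, else the registry default."""
    cutoffs = {}
    for spec in metric_registry:
        raw = os.environ.get(f"RBP_RANK_CUT_{spec.name.upper()}")
        value = spec.default_cutoff if raw is None else int(raw)
        cutoffs[spec.name] = max(0, min(100, value))  # clamp to the documented [0, 100] range
    return cutoffs
```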
## Evolutionary Optimizer Design

### Sampler Selection

| Scenario | Sampler | Why |
|---|---|---|
| Single-objective | TPE (Tree-structured Parzen Estimator) | Bayesian, handles integer/categorical, good for 10-20 dimensions |
| Multi-objective (2+) | NSGA-II | Pareto-frontier discovery, population-based |

**Determinism:** Always seed the sampler (`seed=42`). Optimization results must be reproducible.
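A minimal seeded single-objective study for the TPE row, using Optuna's standard `create_study`/`TPESampler` API; the trial count is a placeholder and `objective` refers to the callable defined in the sections below.

```python
import optuna

# Single-objective case: seeded TPE. The multi-objective NSGA-II setup appears in Pattern 5.
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),   # seeded for reproducible runs
)
study.optimize(objective, n_trials=10_000)
```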
### Search Space Design

```python
def suggest_cutoffs(trial):
    cutoffs = {}
    for spec in metric_registry:
        cutoffs[spec.name] = trial.suggest_int(spec.name, 5, 100, step=5)
    return cutoffs
```
**Why step=5:** Shrinks each metric's axis to 20 candidate values (vs ~100) while maintaining sufficient granularity. For 12 metrics, this is 20^12 ≈ 4 x 10^15 combinations vs 100^12 = 10^24.

**Why lower bound = 5:** cutoff=0 always produces an empty intersection, and values below 5 are too stringent to be useful in practice.

### Data Pre-Loading (Critical Performance Pattern)
```python
# Load metric data ONCE, share across all trials
metric_data = load_metric_data(results_dir, metric_registry)

def objective(trial):
    cutoffs = suggest_cutoffs(trial)
    # Pass pre-loaded data - avoids disk I/O per trial
    result = run_ranking_with_cutoffs(cutoffs, metric_data=metric_data)
    return obj_fn(result, cutoffs)
```

**Why:** Each trial evaluates in ~6ms when data is pre-loaded (pure NumPy/set operations). Without pre-loading, each trial incurs ~50ms of disk I/O. At 10,000 trials, this is 60 seconds vs 500 seconds.

## Objective Function Patterns

### Pattern 1 - Ratio Optimization

```python
def obj_max_survivors_min_cutoff(result, cutoffs):
    n = result["n_intersection"]
    if n == 0:
        return 0.0
    mean_cutoff = sum(cutoffs.values()) / len(cutoffs)
    return n / mean_cutoff   # More survivors per unit of looseness
```
**Use when:** Exploring the efficiency frontier - how much quality can you get for how much filtering?

### Pattern 2 - Constrained Quality

```python
def obj_quality_at_target_n(result, cutoffs, target_n=10):
    n = result["n_intersection"]
    avg_pct = result["avg_percentile"]
    if n < target_n:
        return avg_pct * (n / target_n)   # Partial credit
    return avg_pct                        # Full credit: maximize quality
```

**Use when:** You have a target portfolio size and want the highest quality subset.

### Pattern 3 - Minimum Budget

```python
def obj_tightest_nonempty(result, cutoffs):
    n = result["n_intersection"]
    if n == 0:
        return 0.0
    total_budget = sum(cutoffs.values())
    max_possible_budget = 100 * len(cutoffs)   # budget with every cutoff wide open
    return max_possible_budget - total_budget  # Lower budget = better
```

**Use when:** Finding the single most universally excellent config.
### Pattern 4 - Diversity Reward

```python
def obj_diversity_reward(result, cutoffs):
    n = result["n_intersection"]
    if n == 0:
        return 0.0
    n_binding = result["n_binding_metrics"]
    n_active = sum(1 for v in cutoffs.values() if v < 100)
    if n_active == 0:
        return 0.0
    efficiency = n_binding / n_active
    return n * efficiency
```
**Use when:** Ensuring that tightened cutoffs provide independent information, not redundant filtering.

### Pattern 5 - Pareto (Multi-Objective)

```python
study = optuna.create_study(
    directions=["maximize", "minimize"],             # max survivors, min cutoff
    sampler=optuna.samplers.NSGAIISampler(seed=42),
)

def objective(trial):
    cutoffs = suggest_cutoffs(trial)
    result = run_ranking_with_cutoffs(cutoffs, metric_data=metric_data)
    return result["n_intersection"], sum(cutoffs.values()) / len(cutoffs)
```
**Use when:** You want to see the full trade-off landscape between two competing objectives.

## Forensic Analysis Protocol

After running all objectives, perform this analysis:

### Step 1 - Cross-Objective Survivor Sets

For each objective:

```
survivors_{objective} = set of configs in final intersection
universal_champions = survivors_1 AND survivors_2 AND ... AND survivors_K
```

If a config survives all K objective functions, it is robust to objective choice.

### Step 2 - Feature Theme Extraction

Count feature appearances across all survivors:

```python
from collections import Counter

feature_counts = Counter()
for config_id in quality_survivors:
    for feature in config_id.split("__"):
        feature_counts[feature.split("_")[0]] += 1
```

Dominant features reveal the underlying market microstructure that the ranking system is selecting for.

### Step 3 - Uniform Tightening Curve

Apply the same cutoff to ALL metrics and plot intersection size:

```
@100%: 1008 survivors (no filter)
@80%:   502 survivors
@60%:   210 survivors
@40%:    68 survivors
@20%:    12 survivors
@10%:     3 survivors
@5%:      0 survivors
```

The shape of this curve reveals whether the metric space has natural clusters or is uniformly distributed.

### Step 4 - Binding Sequence

Tighten uniformly and at each step identify which metric was the "tightest killer" - the metric that eliminated the most configs:

```
@90%: 410 survivors | tightest killer: rachev (-57)
@80%: 132 survivors | tightest killer: headroom (-27)
@70%:  29 survivors | tightest killer: n_trades (-12)
@60%:   6 survivors | tightest killer: dsr (-6)
```

This reveals the binding constraint hierarchy.

## Implementation Checklist

When implementing this methodology in a new domain:

- Define MetricSpec registry (name, direction, source, default cutoff)
- Implement percentile ranking (scipy.stats.rankdata)
- Implement per-metric cutoff application
- Implement set intersection across all metrics
- Add env var override for each cutoff
- Create run_ranking_with_cutoffs() API function
- Add binding metric detection
- Create tightening analysis function
- Write markdown report generator
- Add Optuna optimizer with at least 3 objective functions
- Pre-load metric data for optimizer performance
- Run 5-objective forensic analysis (10K+ trials per objective)
- Extract universal champions (cross-objective consistency)
- Identify inert dimensions (remove from search space)
- Document binding constraint sequence
- Record feature themes in survivors

## Anti-Patterns

| Anti-Pattern | Symptom | Fix | Severity |
|---|---|---|---|
| Weighted-sum scoring | Single metric dominates, others ignored | Use intersection (P3) | CRITICAL |
| Starting strict | Miss global optimum, premature convergence | Start at 100%, tighten (P4) | HIGH |
| Uniform cutoffs only | Over-filters inert metrics, under-filters binding ones | Per-metric independent cutoffs (P2) | HIGH |
| Single objective | Artifact of objective bias | Run 5+ objectives, check consistency (P5) | HIGH |
| Raw value comparison | Scale-dependent, misleading | Always use percentile ranks (P1) | HIGH |
| Including inert metrics | Wastes optimization budget | Detect and remove inert dimensions (P7) | MEDIUM |
| No data pre-loading | Optimizer 10x slower | Pre-load once, share across trials | MEDIUM |
| Unseeded optimizer | Non-reproducible results | Always seed sampler (seed=42) | MEDIUM |
| Missing forensic analysis | Raw numbers without insight | Run full forensic protocol (P8) | MEDIUM |

## References

| Topic | Reference File |
|---|---|
| Range Bar Case Study | case-study-rangebar-ranking.md |
| Objective Functions | objective-functions.md |
| Metric Design Guide | metric-design-guide.md |