ML Fail-Fast Validation
Proof-of-concept (POC) validation patterns that catch issues before you commit to long-running ML experiments.
When to Use This Skill
Use this skill when:
- Starting a new ML experiment that will run for hours
- Validating model architecture before full training
- Checking gradient flow and data pipeline integrity
- Implementing POC validation checklists
- Debugging prediction collapse or gradient explosion issues
1. Why Fail-Fast?
| Without Fail-Fast | With Fail-Fast |
|---|---|
| Discover crash 4 hours in | Catch in 30 seconds |
| Debug from cryptic error | Clear error message |
| Lose GPU time | Validate before commit |
| Silent data issues | Explicit schema checks |
Principle: Validate everything that can go wrong BEFORE the expensive computation.

2. POC Validation Checklist

Minimum Viable POC (5 Checks)

    import json
    from datetime import datetime

    import numpy as np
    import torch
    import torch.nn.functional as F

    # architecture, n_features, seq_len, device, output_dir, train_loader,
    # create_model, create_selector and train_one_epoch are expected to be
    # defined by the surrounding experiment script.

    def run_poc_validation():
        """Fast validation before full experiment."""
        print("=" * 60)
        print("FAIL-FAST POC VALIDATION")
        print("=" * 60)

        # [1/5] Model instantiation
        print("\n[1/5] Model instantiation...")
        model = create_model(architecture, input_size=n_features).to(device)
        x = torch.randn(32, seq_len, n_features).to(device)
        out = model(x)
        assert out.shape == (32, 1), f"Output shape wrong: {out.shape}"
        print(f"  Input: (32, {seq_len}, {n_features}) -> Output: {out.shape}")
        print("  Status: PASS")

        # [2/5] Gradient flow
        print("\n[2/5] Gradient flow...")
        y = torch.randn(32, 1).to(device)
        loss = F.mse_loss(out, y)
        loss.backward()
        grad_norms = [p.grad.norm().item() for p in model.parameters() if p.grad is not None]
        assert len(grad_norms) > 0, "No gradients!"
        assert all(np.isfinite(g) for g in grad_norms), "NaN/Inf gradients!"
        print(f"  Max grad norm: {max(grad_norms):.4f}")
        print("  Status: PASS")

        # [3/5] NDJSON artifact validation
        print("\n[3/5] NDJSON artifact validation...")
        log_path = output_dir / "experiment.jsonl"
        with open(log_path, "a") as f:
            f.write(json.dumps({"phase": "poc_start", "timestamp": datetime.now().isoformat()}) + "\n")
        assert log_path.exists(), "Log file not created"
        print(f"  Log file: {log_path}")
        print("  Status: PASS")

        # [4/5] Epoch selector variation
        print("\n[4/5] Epoch selector variation...")
        epochs = []
        for seed in [1, 2, 3]:
            np.random.seed(seed)  # each seed drives a different simulated run
            selector = create_selector()
            # Simulate different validation results
            for e in range(10, 201, 10):
                selector.record(epoch=e, sortino=np.random.randn() * 0.1, sparsity=np.random.rand())
            epochs.append(selector.select())
        print(f"  Selected epochs: {epochs}")
        # Fail if every seed picks the same epoch
        assert len(set(epochs)) > 1, "Selector not varying"
        print("  Status: PASS")

        # [5/5] Mini training (10 epochs)
        print("\n[5/5] Mini training (10 epochs)...")
        model = create_model(architecture, input_size=n_features).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=0.0005)
        initial_loss = None
        for epoch in range(10):
            loss = train_one_epoch(model, train_loader, optimizer)
            if initial_loss is None:
                initial_loss = loss
        assert np.isfinite(loss), f"Non-finite training loss: {loss}"
        print(f"  Initial loss: {initial_loss:.4f}")
        print(f"  Final loss: {loss:.4f}")
        print("  Status: PASS")

        print("\n" + "=" * 60)
        print("POC RESULT: ALL 5 CHECKS PASSED")
        print("=" * 60)

Extended POC (10 Checks)

Add these for comprehensive validation:

    # [6/10] Data loading
    print("\n[6/10] Data loading...")
    df = fetch_data(symbol, threshold)
    assert len(df) >= min_required_bars, f"Insufficient data: {len(df)} bars"
    print(f"  Loaded: {len(df):,} bars")
    print("  Status: PASS")

    # [7/10] Schema validation
    print("\n[7/10] Schema validation...")
    validate_schema(df, required_columns, "raw_data")
    print("  Status: PASS")

    # [8/10] Feature computation
    print("\n[8/10] Feature computation...")
    df = compute_features(df)
    validate_schema(df, feature_columns, "features")
    print(f"  Features: {len(feature_columns)}")
    print("  Status: PASS")

    # [9/10] Prediction sanity
    print("\n[9/10] Prediction sanity...")
    preds = model(X_test).detach().cpu().numpy()
    pred_std = preds.std()
    target_std = y_test.std()
    pred_ratio = pred_std / target_std
    assert pred_ratio > 0.005, f"Predictions collapsed: ratio={pred_ratio:.4f}"
    print(f"  Pred std ratio: {pred_ratio:.2%}")
    print("  Status: PASS")

    # [10/10] Checkpoint save/load
    print("\n[10/10] Checkpoint save/load...")
    torch.save(model.state_dict(), checkpoint_path)
    model2 = create_model(architecture, input_size=n_features)
    model2.load_state_dict(torch.load(checkpoint_path))
    print("  Status: PASS")

3. Schema Validation Pattern

The Problem

    # BAD: Cryptic error 2 hours into experiment
    KeyError: 'returns_vs'
    # Which file? Which function? What columns exist?

The Solution

    def validate_schema(df, required: list[str], stage: str) -> None:
        """Fail-fast schema validation with actionable error messages."""
        # Handle both DataFrame columns and DatetimeIndex
        available = list(df.columns)
        if hasattr(df.index, 'name') and df.index.name:
            available.append(df.index.name)

        missing = [c for c in required if c not in available]
        if missing:
            raise ValueError(
                f"[{stage}] Missing columns: {missing}\n"
                f"Available: {sorted(available)}\n"
                f"DataFrame shape: {df.shape}"
            )
        print(f"  Schema validation PASSED ({stage}): {len(required)} columns", flush=True)
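To make the error format concrete, the sketch below runs the validator against a toy DataFrame that is missing `volume`. The frame, its values, and the `timestamp` index name are made up for illustration, and the function is a compact copy of the validator above so that the snippet runs on its own.

```python
import pandas as pd


def validate_schema(df: pd.DataFrame, required: list[str], stage: str) -> None:
    """Compact copy of the validator so this snippet is self-contained."""
    available = list(df.columns)
    if df.index.name:
        available.append(df.index.name)
    missing = [c for c in required if c not in available]
    if missing:
        raise ValueError(
            f"[{stage}] Missing columns: {missing}\n"
            f"Available: {sorted(available)}\n"
            f"DataFrame shape: {df.shape}"
        )


# Toy frame with made-up values; "volume" is left out on purpose.
toy = pd.DataFrame(
    {"open": [1.0, 1.1], "high": [1.2, 1.3], "low": [0.9, 1.0], "close": [1.1, 1.2]},
    index=pd.DatetimeIndex(["2024-01-01", "2024-01-02"], name="timestamp"),
)

error_message = ""
try:
    validate_schema(toy, ["open", "high", "low", "close", "volume"], "raw_data")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The message names the stage, the missing column, and every available column (including the index name), which is usually enough to locate the faulty pipeline step without a debugger.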

    # Usage at pipeline boundaries
    REQUIRED_RAW = ["open", "high", "low", "close", "volume"]
    REQUIRED_FEATURES = [
        "returns_vs", "momentum_z", "atr_pct", "volume_z", "rsi_14",
        "bb_pct_b", "vol_regime", "return_accel", "pv_divergence",
    ]

    df = fetch_data(symbol)
    validate_schema(df, REQUIRED_RAW, "raw_data")

    df = compute_features(df)
    validate_schema(df, REQUIRED_FEATURES, "features")

4. Gradient Health Checks

Basic Gradient Check

    import numpy as np
    import torch
    from torch import nn

    def check_gradient_health(model: nn.Module, sample_input: torch.Tensor) -> dict:
        """Verify gradients flow correctly through model."""
        model.train()
        model.zero_grad()  # clear stale gradients so the norms reflect this pass only
        out = model(sample_input)
        loss = out.sum()
        loss.backward()

        stats = {"total_params": 0, "params_with_grad": 0, "grad_norms": []}
        for name, param in model.named_parameters():
            stats["total_params"] += 1
            if param.grad is not None:
                stats["params_with_grad"] += 1
                norm = param.grad.norm().item()
                stats["grad_norms"].append(norm)

                # Check for issues
                if not np.isfinite(norm):
                    raise ValueError(f"Non-finite gradient in {name}: {norm}")
                if norm > 100:
                    print(f"  WARNING: Large gradient in {name}: {norm:.2f}")

        stats["max_grad"] = max(stats["grad_norms"]) if stats["grad_norms"] else 0
        stats["mean_grad"] = np.mean(stats["grad_norms"]) if stats["grad_norms"] else 0
        return stats

Architecture-Specific Checks

    def check_lstm_gradients(model: nn.Module) -> dict:
        """Check LSTM-specific gradient patterns."""
        stats = {}
        for name, param in model.named_parameters():
            if param.grad is None:
                continue

            # Forget-gate bias gradient (the bias itself should not be too negative).
            # With several layers, later parameters overwrite earlier entries.
            if "bias_hh" in name or "bias_ih" in name:
                # LSTM bias: [i, f, g, o] gates
                hidden_size = param.shape[0] // 4
                forget_bias = param.grad[hidden_size:2 * hidden_size]
                stats["forget_bias_grad_mean"] = forget_bias.mean().item()

            # Check hidden-to-hidden weights
            if "weight_hh" in name:
                stats["hh_weight_grad_norm"] = param.grad.norm().item()
        return stats

5. Prediction Sanity Checks

Collapse Detection

    def check_prediction_sanity(preds: np.ndarray, targets: np.ndarray) -> dict:
        """Detect prediction collapse or explosion."""
        stats = {
            "pred_mean": preds.mean(),
            "pred_std": preds.std(),
            "pred_min": preds.min(),
            "pred_max": preds.max(),
            "target_std": targets.std(),
        }

        # Relative threshold (not absolute!)
        stats["pred_std_ratio"] = stats["pred_std"] / stats["target_std"]

        # Collapse detection
        if stats["pred_std_ratio"] < 0.005:  # < 0.5% of target std
            raise ValueError(
                f"Predictions collapsed!\n"
                f"  pred_std: {stats['pred_std']:.6f}\n"
                f"  target_std: {stats['target_std']:.6f}\n"
                f"  ratio: {stats['pred_std_ratio']:.4%}"
            )

        # Explosion detection
        if stats["pred_std_ratio"] > 100:  # > 100x target std
            raise ValueError(
                f"Predictions exploded!\n"
                f"  pred_std: {stats['pred_std']:.2f}\n"
                f"  target_std: {stats['target_std']:.6f}\n"
                f"  ratio: {stats['pred_std_ratio']:.1f}x"
            )

        # Unique value check
        stats["unique_values"] = len(np.unique(np.round(preds, 6)))
        if stats["unique_values"] < 10:
            print(f"  WARNING: Only {stats['unique_values']} unique prediction values")
        return stats

Correlation Check

    def check_prediction_correlation(preds: np.ndarray, targets: np.ndarray) -> float:
        """Check if predictions have any correlation with targets."""
        corr = np.corrcoef(preds.flatten(), targets.flatten())[0, 1]
        if not np.isfinite(corr):
            print("  WARNING: Correlation is NaN (likely collapsed predictions)")
            return 0.0

        # Note: negative correlation may still be useful (short signal)
        print(f"  Prediction-target correlation: {corr:.4f}")
        return corr

6. NDJSON Logging Validation

Required Event Types

    import json
    from pathlib import Path

    REQUIRED_EVENTS = {
        "experiment_start": ["architecture", "features", "config"],
        "fold_start": ["fold_id", "train_size", "val_size", "test_size"],
        "epoch_complete": ["epoch", "train_loss", "val_loss"],
        "fold_complete": ["fold_id", "test_sharpe", "test_sortino"],
        "experiment_complete": ["total_folds", "mean_sharpe", "elapsed_seconds"],
    }

    def validate_ndjson_schema(log_path: Path) -> None:
        """Validate NDJSON log has all required events and fields."""
        events = {}
        with open(log_path) as f:
            for line in f:
                if not line.strip():
                    continue  # tolerate blank lines
                event = json.loads(line)
                phase = event.get("phase", "unknown")
                events.setdefault(phase, []).append(event)

        for phase, required_fields in REQUIRED_EVENTS.items():
            if phase not in events:
                raise ValueError(f"Missing event type: {phase}")
            # Only the first event of each type is inspected
            sample = events[phase][0]
            missing = [field for field in required_fields if field not in sample]
            if missing:
                raise ValueError(f"Event '{phase}' missing fields: {missing}")
        print(f"  NDJSON schema valid: {len(events)} event types")

7. POC Timing Guide

| Check | Typical Time | Max Time | Action if Exceeded |
|---|---|---|---|
| Model instantiation | < 1s | 5s | Check device, reduce model size |
| Gradient flow | < 2s | 10s | Check batch size |
| Schema validation | < 0.1s | 1s | Check data loading |
| Mini training (10 epochs) | < 30s | 2min | Reduce batch, check data loader |
| Full POC (10 checks) | < 2min | 5min | Something is wrong |

8. Failure Response Guide

| Failure | Likely Cause | Fix |
|---|---|---|
| Shape mismatch | Wrong input_size or seq_len | Check feature count |
| NaN gradients | LR too high, bad init | Reduce LR, check init |
| Zero gradients | Dead layers, missing params | Check model architecture |
| Predictions collapsed | Normalizer issue, bad loss | Check sLSTM normalizer |
| Predictions exploded | Gradient explosion | Add/tighten gradient clipping |
| Schema missing columns | Wrong data source | Check fetch function |
| Checkpoint load fails | State dict key mismatch | Check model architecture match |

9. Integration Example

    import sys

    def main():
        # Parse args, setup output dir...

        # PHASE 1: Fail-fast POC
        print("=" * 60)
        print("FAIL-FAST POC VALIDATION")
        print("=" * 60)
        try:
            run_poc_validation()
        except Exception as e:
            print(f"\n{'=' * 60}")
            print(f"POC FAILED: {type(e).__name__}")
            print(f"{'=' * 60}")
            print(f"Error: {e}")
            print("\nFix the issue before running full experiment.")
            sys.exit(1)

        # PHASE 2: Full experiment (only if POC passes)
        print("\n" + "=" * 60)
        print("STARTING FULL EXPERIMENT")
        print("=" * 60)
        run_full_experiment()

10. Anti-Patterns to Avoid

DON'T: Skip validation to "save time"

    # BAD: "I'll just run it and see"
    run_full_experiment()
    # 4 hours later: crash

DON'T: Use absolute thresholds for relative quantities

    # BAD: Absolute threshold
    assert pred_std > 1e-4  # Meaningless for returns ~0.001

    # GOOD: Relative threshold
    assert pred_std / target_std > 0.005  # 0.5% of target std

DON'T: Catch all exceptions silently

    # BAD: Hides real issues
    try:
        result = risky_operation()
    except Exception:
        result = default_value  # What went wrong?

    # GOOD: Catch specific exceptions
    try:
        result = risky_operation()
    except (ValueError, RuntimeError) as e:
        logger.error(f"Operation failed: {e}")
        raise

DON'T: Print without flush

    # BAD: Output buffered, can't see progress
    print(f"Processing fold {i}...")

    # GOOD: See output immediately
    print(f"Processing fold {i}...", flush=True)

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| NaN gradients in POC | Learning rate too high | Reduce LR by 10x, check weight initialization |
| Zero gradients | Dead layers or missing params | Check model architecture, verify requires_grad=True |
| Predictions collapsed | Normalizer issue or bad loss | Check target normalization, verify loss function |
| Predictions exploded | Gradient explosion | Add gradient clipping, reduce learning rate |
| Schema missing columns | Wrong data source or transform | Verify fetch function returns expected columns |
| Checkpoint load fails | State dict key mismatch | Ensure model architecture matches saved checkpoint |
| POC timeout (>5 min) | Data loading or model too large | Reduce batch size, check DataLoader num_workers |
| Mini training no progress | Learning rate too low or frozen parameters | Increase LR, verify optimizer updates all parameters |
| NDJSON validation fails | Missing required event types | Check all phases emit expected fields |
| Shape mismatch error | Wrong input_size or seq_len | Verify feature count matches model input dimension |
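One way to act on the "verify optimizer updates all parameters" advice above is to snapshot the weights, take a single optimizer step, and list whatever did not move. The helper name `find_frozen_parameters` and the toy model are assumptions for illustration, not part of the original checklist.

```python
import torch
from torch import nn


def find_frozen_parameters(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    sample_input: torch.Tensor,
) -> list[str]:
    """Return names of parameters left unchanged by one optimizer step.

    Hypothetical helper: it performs a real update, so run it on a
    throwaway copy of the model rather than the one you intend to train.
    """
    before = {name: p.detach().clone() for name, p in model.named_parameters()}
    model.train()
    optimizer.zero_grad()
    model(sample_input).sum().backward()
    optimizer.step()
    return [
        name
        for name, p in model.named_parameters()
        if torch.equal(before[name], p.detach())
    ]


torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
toy_model[0].weight.requires_grad_(False)  # freeze one tensor on purpose
toy_optimizer = torch.optim.AdamW(
    [p for p in toy_model.parameters() if p.requires_grad], lr=1e-3
)
frozen = find_frozen_parameters(toy_model, toy_optimizer, torch.randn(16, 4))
print(f"  Parameters not updated: {frozen}")
```

A non-empty list is not always an error (some layers may be frozen deliberately), so treat it as a prompt to confirm the freeze was intended.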

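The "Max Time" limits in the timing guide can be enforced rather than eyeballed. The sketch below wraps a check in a context manager that raises once the budget is exceeded; `time_budget` and `MAX_SECONDS` are hypothetical names, and the budget is only evaluated after the wrapped block returns, so it flags slow checks but will not interrupt a hung one.

```python
import time
from contextlib import contextmanager


@contextmanager
def time_budget(check_name: str, max_seconds: float):
    """Raise TimeoutError if the wrapped block ran longer than max_seconds."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    if elapsed > max_seconds:
        raise TimeoutError(
            f"[{check_name}] took {elapsed:.2f}s (budget {max_seconds:.2f}s)"
        )
    print(f"  {check_name}: {elapsed:.2f}s (budget {max_seconds:.2f}s)", flush=True)


# Budgets in seconds, copied from the "Max Time" column of the timing guide.
MAX_SECONDS = {
    "Model instantiation": 5,
    "Gradient flow": 10,
    "Schema validation": 1,
    "Mini training (10 epochs)": 120,
    "Full POC (10 checks)": 300,
}

with time_budget("Schema validation", MAX_SECONDS["Schema validation"]):
    sum(range(1000))  # stand-in for a real check
```

Because the error carries the check name and both durations, a slow POC points directly at the matching "Action if Exceeded" entry.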