Data Cleaning and Variable Screening Quick Start

Run the complete data cleaning pipeline

python

".github/skills/datanalysis-credit-risk/scripts/example.py"

Complete Process Description

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

Get Data

- Load and format raw data

Organization Sample Analysis

- Statistics of sample count and bad sample rate for each organization

Separate OOS Data

- Separate out-of-sample (OOS) samples from modeling samples

Filter Abnormal Months

- Remove months with insufficient bad sample count or total sample count

Calculate Missing Rate

- Calculate overall and organization-level missing rates for each feature

Drop High Missing Rate Features

- Remove features with overall missing rate exceeding threshold

Drop Low IV Features

- Remove features with overall IV too low or IV too low in too many organizations

Drop High PSI Features

- Remove features with unstable PSI

Null Importance Denoising

- Remove noise features using label permutation method

Drop High Correlation Features

- Remove high correlation features based on original gain

Export Report

- Generate Excel report containing details and statistics of all steps

Core Functions

Function

Purpose

Module

get_dataset()

Load and format data

references.func

org_analysis()

Organization sample analysis

references.func

missing_check()

Calculate missing rate

references.func

drop_abnormal_ym()

Filter abnormal months

references.analysis

drop_highmiss_features()

Drop high missing rate features

references.analysis

drop_lowiv_features()

Drop low IV features

references.analysis

drop_highpsi_features()

Drop high PSI features

references.analysis

drop_highnoise_features()

Null Importance denoising

references.analysis

drop_highcorr_features()

Drop high correlation features

references.analysis

iv_distribution_by_org()

IV distribution statistics

references.analysis

psi_distribution_by_org()

PSI distribution statistics

references.analysis

value_ratio_distribution_by_org()

Value ratio distribution statistics

references.analysis

export_cleaning_report()

Export cleaning report

references.analysis

Parameter Description

Data Loading Parameters

DATA_PATH

Data file path (best are parquet format)

DATE_COL

Date column name

Y_COL

Label column name

ORG_COL

Organization column name

KEY_COLS

Primary key column name list

OOS Organization Configuration

OOS_ORGS

Out-of-sample organization list

Abnormal Month Filtering Parameters

min_ym_bad_sample

Minimum bad sample count per month (default 10)

min_ym_sample

Minimum total sample count per month (default 500)

Missing Rate Parameters

missing_ratio

Overall missing rate threshold (default 0.6)

IV Parameters

overall_iv_threshold

Overall IV threshold (default 0.1)

org_iv_threshold

Single organization IV threshold (default 0.1)

max_org_threshold

Maximum tolerated low IV organization count (default 2)

PSI Parameters

psi_threshold

PSI threshold (default 0.1)

max_months_ratio

Maximum unstable month ratio (default 1/3)

max_orgs

Maximum unstable organization count (default 6)

Null Importance Parameters

n_estimators

Number of trees (default 100)

max_depth

Maximum tree depth (default 5)

gain_threshold

Gain difference threshold (default 50)

High Correlation Parameters

max_corr

Correlation threshold (default 0.9)

top_n_keep

Keep top N features by original gain ranking (default 20)

Output Report

The generated Excel report contains the following sheets:

汇总

- Summary information of all steps, including operation results and conditions

机构样本统计

- Sample count and bad sample rate for each organization

分离OOS数据

- OOS sample and modeling sample counts

Step4-异常月份处理

- Abnormal months that were removed

缺失率明细

- Overall and organization-level missing rates for each feature

Step5-有值率分布统计

- Distribution of features in different value ratio ranges

Step6-高缺失率处理

- High missing rate features that were removed

Step7-IV明细

- IV values of each feature in each organization and overall

Step7-IV处理

- Features that do not meet IV conditions and low IV organizations

Step7-IV分布统计

- Distribution of features in different IV ranges

Step8-PSI明细

- PSI values of each feature in each organization each month

Step8-PSI处理

- Features that do not meet PSI conditions and unstable organizations

Step8-PSI分布统计

- Distribution of features in different PSI ranges

Step9-null importance处理

- Noise features that were removed

Step10-高相关性剔除

- High correlation features that were removed

Features

Interactive Input

Parameters can be input before each step execution, with default values supported

Independent Execution

Each step is executed independently without deleting original data, facilitating comparative analysis

Complete Report

Generate complete Excel report containing details, statistics, and distributions

Multi-process Support

IV and PSI calculations support multi-process acceleration
Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction

安装

Run the complete data cleaning pipeline