# ML Model Training

Train machine learning models with proper data handling and evaluation.

## Training Workflow

1. Data Preparation → 2. Feature Engineering → 3. Model Selection → 4. Training → 5. Evaluation

## Data Preparation

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load and clean data
df = pd.read_csv('data.csv')
df = df.dropna()

# Encode categorical variables
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

# Split data (70/15/15)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features (fit on train only -- see Data Leakage below)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```
## Scikit-learn Training

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
print(classification_report(y_val, y_pred))
```
## PyTorch Training

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

model = Model(X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

# Convert arrays to float tensors; BCELoss expects targets shaped like the output
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()
```
## Evaluation Metrics

| Task           | Metrics                                  |
|----------------|------------------------------------------|
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC |
| Regression     | MSE, RMSE, MAE, R²                       |
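The regression metrics in the table can all be computed with scikit-learn. A minimal sketch, using made-up values chosen only for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # less sensitive to outliers than MSE
r2 = r2_score(y_true, y_pred)              # 1.0 would be a perfect fit
```

RMSE is usually the easiest to interpret because it is in the same units as the target variable.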
## Complete Framework Examples

### PyTorch

See `references/pytorch-training.md` for complete training with:

- Custom model classes with BatchNorm and Dropout
- Training/validation loops with early stopping
- Learning rate scheduling
- Model checkpointing
- Full evaluation with classification report

### TensorFlow/Keras

See `references/tensorflow-keras.md` for:

- Sequential model architecture
- Callbacks (EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, TensorBoard)
- Training history visualization
- TFLite conversion for mobile deployment
- Custom training loops
## Best Practices

**Do:**

- Use cross-validation for robust evaluation
- Track experiments with MLflow
- Save model checkpoints regularly
- Monitor for overfitting
- Document hyperparameters
- Use a 70/15/15 train/val/test split

**Don't:**

- Train without a validation set
- Ignore class imbalance
- Skip feature scaling
- Use the test set for hyperparameter tuning
- Forget to set random seeds
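The first "Do" item, cross-validation, can be sketched in a few lines. The synthetic dataset and its parameters below are illustrative, not from this document:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset (parameters are arbitrary)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV: each fold is held out once, giving 5 accuracy estimates
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting mean ± std across folds is far more robust than a single train/val accuracy number.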
## Known Issues Prevention

### 1. Data Leakage

**Problem:** Scaling or transforming data before splitting leaks test-set information into training.

**Solution:** Always split the data first, then fit transformers only on the training data:

```python
# ✅ Correct: fit on train, transform train/val/test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)    # only transform
X_test = scaler.transform(X_test)  # only transform

# ❌ Wrong: fitting on all data
X_all = scaler.fit_transform(X)    # leaks test info!
```
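A `Pipeline` makes this discipline automatic: the scaler is re-fit on each training fold inside cross-validation, so leakage cannot occur by accident. A minimal sketch on synthetic data (the dataset and estimator choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),          # fit only on each fold's training split
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)  # no manual fit/transform bookkeeping
```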
### 2. Class Imbalance Ignored

**Problem:** Training on imbalanced datasets (e.g., 95% class A, 5% class B) yields models that predict only the majority class.

**Solution:** Use class weights or resampling:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
model = RandomForestClassifier(class_weight='balanced')

# Or use SMOTE to oversample the minority class
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```
### 3. Overfitting Due to No Regularization

**Problem:** Complex models memorize the training data and perform poorly on validation/test sets.

**Solution:** Add regularization techniques:

```python
# Dropout in PyTorch
nn.Dropout(0.3)

# Limit tree complexity in scikit-learn
RandomForestClassifier(max_depth=10, min_samples_split=20)

# Early stopping in Keras
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```
### 4. Not Setting Random Seeds

**Problem:** Results are not reproducible across runs, making debugging and comparison impossible.

**Solution:** Set all random seeds:

```python
import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
```
### 5. Using Test Set for Hyperparameter Tuning

**Problem:** Optimizing hyperparameters on the test set overfits to the test data.

**Solution:** Tune via cross-validation on the training set (or on the validation set); use the test set only for final evaluation:

```python
from sklearn.model_selection import GridSearchCV

# ✅ Correct: tune with cross-validation on the training set
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

# Final evaluation on the held-out test set
final_score = best_model.score(X_test, y_test)
```
## When to Load References

Load reference files when you need:

- **PyTorch implementation details:** load `references/pytorch-training.md` for complete training loops with early stopping, learning rate scheduling, and checkpointing
- **TensorFlow/Keras patterns:** load `references/tensorflow-keras.md` for callback usage, custom training loops, and mobile deployment with TFLite