- Feature Engineering
- Overview
- Feature engineering creates and transforms features to improve model performance, interpretability, and generalization through domain knowledge and mathematical transformations.
- When to Use
- When you need to improve model performance beyond using raw features
- When dealing with categorical variables that need encoding for ML algorithms
- When features have different scales and require normalization
- When creating domain-specific features based on business knowledge
- When handling skewed distributions or non-linear relationships
- When preparing data for different types of ML algorithms with specific requirements
- Engineering Techniques
- Encoding: Converting categorical to numerical
- Scaling: Normalizing feature ranges
- Polynomial Features: Higher-order terms
- Interactions: Combining features
- Domain-specific: Business-relevant transformations
- Temporal: Time-based features
- Key Principles
- Create features based on domain knowledge
- Remove redundant features
- Scale features appropriately
- Handle categorical variables
- Create meaningful interactions
- Implementation with Python

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import (
        StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures,
        OneHotEncoder, OrdinalEncoder, LabelEncoder,
    )
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer

    # Create sample dataset
    np.random.seed(42)
    df = pd.DataFrame({
        'age': np.random.uniform(18, 80, 1000),
        'income': np.random.uniform(20000, 150000, 1000),
        'experience_years': np.random.uniform(0, 50, 1000),
        'category': np.random.choice(['A', 'B', 'C'], 1000),
        'city': np.random.choice(['NYC', 'LA', 'Chicago'], 1000),
        'purchased': np.random.choice([0, 1], 1000),
    })

    print("Original Data:")
    print(df.head())
    df.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()

1. Categorical Encoding
    # One-Hot Encoding
    print("\n1. One-Hot Encoding:")
    df_ohe = pd.get_dummies(df, columns=['category', 'city'], drop_first=True)
    print(df_ohe.head())

    # Ordinal Encoding (categories are numbered in sorted order unless `categories` is passed)
    print("\n2. Ordinal Encoding:")
    ordinal_encoder = OrdinalEncoder()
    # fit_transform returns a 2-D array; ravel() flattens it to a single column
    df['category_ordinal'] = ordinal_encoder.fit_transform(df[['category']]).ravel()
    print(df[['category', 'category_ordinal']].head())

    # Label Encoding (LabelEncoder is intended for target labels; used on a feature here for illustration)
    print("\n3. Label Encoding:")
    le = LabelEncoder()
    df['city_encoded'] = le.fit_transform(df['city'])
    print(df[['city', 'city_encoded']].head())

2. Feature Scaling
    print("\n4. Feature Scaling:")
    X = df[['age', 'income', 'experience_years']].copy()

    # StandardScaler (mean=0, std=1)
    scaler = StandardScaler()
    X_standard = scaler.fit_transform(X)

    # MinMaxScaler [0, 1]
    minmax_scaler = MinMaxScaler()
    X_minmax = minmax_scaler.fit_transform(X)

    # RobustScaler (resistant to outliers)
    robust_scaler = RobustScaler()
    X_robust = robust_scaler.fit_transform(X)

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    axes[0, 0].hist(X['age'], bins=30, edgecolor='black')
    axes[0, 0].set_title('Original Age')
    axes[0, 1].hist(X_standard[:, 0], bins=30, edgecolor='black')
    axes[0, 1].set_title('StandardScaler Age')
    axes[1, 0].hist(X_minmax[:, 0], bins=30, edgecolor='black')
    axes[1, 0].set_title('MinMaxScaler Age')
    axes[1, 1].hist(X_robust[:, 0], bins=30, edgecolor='black')
    axes[1, 1].set_title('RobustScaler Age')
    plt.tight_layout()
    plt.show()

3. Polynomial Features
    print("\n5. Polynomial Features:")
    X_simple = df[['age']].copy()
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X_simple)
    X_poly_df = pd.DataFrame(X_poly, columns=['age', 'age^2'])
    print(X_poly_df.head())

    # Visualization
    plt.figure(figsize=(12, 5))
    plt.scatter(df['age'], df['income'], alpha=0.5)
    plt.xlabel('Age')
    plt.ylabel('Income')
    plt.title('Age vs Income')
    plt.grid(True, alpha=0.3)
    plt.show()

4. Feature Interactions
    print("\n6. Feature Interactions:")
    df['age_income_interaction'] = df['age'] * df['income'] / 10000
    # +1 in the denominator avoids division by zero for zero years of experience
    df['age_experience_ratio'] = df['age'] / (df['experience_years'] + 1)
    print(df[['age', 'income', 'age_income_interaction', 'age_experience_ratio']].head())

5. Domain-specific Transformations
    print("\n7. Domain-specific Features:")
    df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100],
                             labels=['Young', 'Middle', 'Senior', 'Retired'])
    df['income_level'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])
    df['log_income'] = np.log1p(df['income'])
    df['sqrt_experience'] = np.sqrt(df['experience_years'])
    print(df[['age', 'age_group', 'income', 'income_level', 'log_income']].head())

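The log transform is mainly useful on right-skewed data, which the uniformly sampled `income` column in the sample dataset is not. The sketch below uses a synthetic log-normal column (an assumption made only for this illustration) and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values; the distribution parameters are arbitrary
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.5, sigma=1.0, size=5000), name="income")

# log1p computes log(1 + x), which stays defined when x is 0
log_income = np.log1p(income)

skew_before = income.skew()
skew_after = log_income.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness near zero after the transform suggests the log scale is a reasonable choice for that column; if the value barely changes, the transform is probably not worth keeping.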
6. Temporal Features (if date data available)
    print("\n8. Temporal Features:")
    dates = pd.date_range('2023-01-01', periods=len(df))
    df['date'] = dates
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
    df['quarter'] = df['date'].dt.quarter
    df['is_weekend'] = df['date'].dt.dayofweek >= 5  # Saturday (5) or Sunday (6)
    print(df[['date', 'year', 'month', 'day_of_week', 'is_weekend']].head())

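Integer calendar fields such as `month` treat December (12) and January (1) as far apart, although they are adjacent in time. One common way to handle this, sketched here as an optional extension and not as part of the original script, is to map the field onto a circle with sine and cosine. The column names `month_sin` and `month_cos` are illustrative.

```python
import numpy as np
import pandas as pd

# One year of daily dates as a stand-in for a real timestamp column
frame = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=365)})
month = frame["date"].dt.month

# Shift months to 0..11, then place them evenly on the unit circle
angle = 2 * np.pi * (month - 1) / 12
frame["month_sin"] = np.sin(angle)
frame["month_cos"] = np.cos(angle)

print(frame.groupby(month)[["month_sin", "month_cos"]].first().round(3))
```

With this encoding, the distance between December and January equals the distance between any other pair of neighbouring months, which distance-based and linear models can exploit.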
7. Feature Standardization Pipeline
    print("\n9. Feature Engineering Pipeline:")

    # Separate numerical and categorical features
    numerical_features = ['age', 'income', 'experience_years']
    categorical_features = ['category', 'city']

    # Create preprocessing pipeline
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_features),
            ('cat', OneHotEncoder(drop='first'), categorical_features),
        ]
    )
    X_processed = preprocessor.fit_transform(df[numerical_features + categorical_features])
    print(f"Processed shape: {X_processed.shape}")

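The script imports `Pipeline` but fits the preprocessor on the full dataset. A sketch of how the same kind of `ColumnTransformer` could be wrapped in a `Pipeline` and fitted on a training split only is shown below. The synthetic data, the `LogisticRegression` model, and the 75/25 split are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic dataset with the same flavour as the sample data
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "age": rng.uniform(18, 80, n),
    "income": rng.uniform(20000, 150000, n),
    "category": rng.choice(["A", "B", "C"], n),
    "purchased": rng.integers(0, 2, n),
})
X = data.drop(columns="purchased")
y = data["purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

# Scaling statistics and category lists are learned from X_train only
model = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
print(f"Predictions on held-out rows: {len(test_predictions)}")
```

Because the preprocessing lives inside the pipeline, the test rows are transformed with statistics learned from the training rows, which keeps test information out of the fitted scaler and encoder.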
8. Feature Statistics
    print("\n10. Feature Statistics:")
    X_for_stats = df[numerical_features].copy()
    X_for_stats['category_A'] = (df['category'] == 'A').astype(int)
    X_for_stats['city_NYC'] = (df['city'] == 'NYC').astype(int)
    feature_stats = pd.DataFrame({
        'Feature': X_for_stats.columns,
        'Mean': X_for_stats.mean(),
        'Std': X_for_stats.std(),
        'Min': X_for_stats.min(),
        'Max': X_for_stats.max(),
        'Skewness': X_for_stats.skew(),
        'Kurtosis': X_for_stats.kurtosis(),
    })
    print(feature_stats)

9. Feature Correlations
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    X_numeric = df[numerical_features].copy()
    X_numeric['purchased'] = df['purchased']
    corr_matrix = X_numeric.corr()
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0])
    axes[0].set_title('Feature Correlation Matrix')

    # Distribution of engineered features
    axes[1].hist(df['age_income_interaction'], bins=30, edgecolor='black', alpha=0.7)
    axes[1].set_title('Age-Income Interaction Distribution')
    axes[1].set_xlabel('Value')
    axes[1].set_ylabel('Frequency')
    plt.tight_layout()
    plt.show()

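A correlation matrix can also be used to act on the "remove redundant features" principle. The sketch below drops one column from every pair whose absolute correlation exceeds a threshold. The synthetic columns and the 0.95 cut-off are illustrative assumptions, and the right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic features: "x_copy" is almost a rescaled duplicate of "x"
rng = np.random.default_rng(0)
base = rng.normal(size=500)
features = pd.DataFrame({
    "x": base,
    "x_copy": base * 2.0 + rng.normal(scale=0.01, size=500),
    "z": rng.normal(size=500),
})

# Keep only the upper triangle so each pair is inspected once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair above the (assumed) threshold
threshold = 0.95
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)
print(f"Dropped: {to_drop}")
```

This filter only detects pairwise linear redundancy; a feature that is a combination of several others would need a different check.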
10. Feature Binning / Discretization
    print("\n11. Feature Binning:")
    df['age_bin_equal'] = pd.cut(df['age'], bins=5)       # equal-width bins
    df['age_bin_quantile'] = pd.qcut(df['age'], q=5)      # equal-frequency bins
    df['income_bins'] = pd.cut(df['income'], bins=[0, 50000, 100000, 150000])
    print("Equal Width Binning:")
    print(df['age_bin_equal'].value_counts().sort_index())
    print("\nEqual Frequency Binning:")
    print(df['age_bin_quantile'].value_counts().sort_index())

11. Missing Value Creation and Handling
    print("\n12. Missing Value Imputation:")
    df_with_missing = df.copy()
    missing_indices = np.random.choice(len(df), 50, replace=False)
    df_with_missing.loc[missing_indices, 'age'] = np.nan

    # Mean imputation
    age_mean = df_with_missing['age'].mean()
    df_with_missing['age_imputed_mean'] = df_with_missing['age'].fillna(age_mean)

    # Median imputation
    age_median = df_with_missing['age'].median()
    df_with_missing['age_imputed_median'] = df_with_missing['age'].fillna(age_median)

    # Forward fill (ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
    df_with_missing['age_imputed_ffill'] = df_with_missing['age'].ffill()

    print(df_with_missing[['age', 'age_imputed_mean', 'age_imputed_median']].head(10))

    print("\nFeature Engineering Complete!")
    # The sample dataset starts with six columns:
    # age, income, experience_years, category, city, purchased
    original_feature_count = 6
    print(f"Original features: {original_feature_count}")
    print(f"Final features available: {len(df.columns)}")

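The `fillna` calls above compute the fill value from the same column they fill. When the imputation has to be reused on new data, scikit-learn's `SimpleImputer` stores the learned value and can also record which rows were missing. The following is a minimal sketch; the five-row `ages` frame and the column name `age_was_missing` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with two missing entries
ages = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan]})

# add_indicator=True appends a 0/1 column marking the originally missing rows
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(ages)

result = pd.DataFrame(imputed, columns=["age", "age_was_missing"])
print(result)
```

The indicator column lets a model learn from the fact that a value was missing, which plain mean or median filling would hide.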
- Best Practices
- Understand your domain before engineering features
- Create features that are interpretable
- Avoid data leakage (using future information)
- Test feature importance after engineering
- Document all transformations
- Use appropriate scaling for different algorithms
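As a sketch of the "test feature importance after engineering" practice, the example below checks whether an engineered interaction term carries signal. It uses a synthetic target that depends on a product of two inputs and a random forest's impurity-based importances. The data, the `a_times_b` column, and the choice of `RandomForestClassifier` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target depends on the product of "a" and "b"
rng = np.random.default_rng(0)
n = 1000
features = pd.DataFrame({
    "a": rng.uniform(0, 1, n),
    "b": rng.uniform(0, 1, n),
    "noise": rng.uniform(0, 1, n),
})
target = (features["a"] * features["b"] > 0.25).astype(int)

# Engineered interaction feature to be evaluated
features["a_times_b"] = features["a"] * features["b"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, target)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances are computed on the training data and can favour high-cardinality features, so a ranking like this is best treated as a first check and confirmed on held-out data.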
- Common Transformations
- Log Transform: For skewed distributions
- Polynomial Features: For non-linear relationships
- Interaction Terms: For combined effects
- Binning: For categorical approximation
- Normalization: For comparison across scales
- Deliverables
- Engineered feature dataset
- Feature transformation documentation
- Correlation analysis of new features
- Distribution comparisons (before/after)
- Feature importance rankings
- Preprocessing pipeline code
- Data dictionary with feature descriptions