feature engineering

安装量: 144
排名: #5962

安装

npx skills add https://github.com/aj-geddes/useful-ai-prompts --skill 'Feature Engineering'
Feature Engineering
Overview
Feature engineering creates and transforms features to improve model performance, interpretability, and generalization through domain knowledge and mathematical transformations.
When to Use
When you need to improve model performance beyond using raw features
When dealing with categorical variables that need encoding for ML algorithms
When features have different scales and require normalization
When creating domain-specific features based on business knowledge
When handling skewed distributions or non-linear relationships
When preparing data for different types of ML algorithms with specific requirements
Engineering Techniques
Encoding
Converting categorical to numerical
Scaling
Normalizing feature ranges
Polynomial Features
Higher-order terms
Interactions
Combining features
Domain-specific
Business-relevant transformations
Temporal
Time-based features Key Principles Create features based on domain knowledge Remove redundant features Scale features appropriately Handle categorical variables Create meaningful interactions Implementation with Python import pandas as pd import numpy as np import matplotlib . pyplot as plt from sklearn . preprocessing import ( StandardScaler , MinMaxScaler , RobustScaler , PolynomialFeatures , OneHotEncoder , OrdinalEncoder , LabelEncoder ) from sklearn . pipeline import Pipeline from sklearn . compose import ColumnTransformer import seaborn as sns

Create sample dataset

np . random . seed ( 42 ) df = pd . DataFrame ( { 'age' : np . random . uniform ( 18 , 80 , 1000 ) , 'income' : np . random . uniform ( 20000 , 150000 , 1000 ) , 'experience_years' : np . random . uniform ( 0 , 50 , 1000 ) , 'category' : np . random . choice ( [ 'A' , 'B' , 'C' ] , 1000 ) , 'city' : np . random . choice ( [ 'NYC' , 'LA' , 'Chicago' ] , 1000 ) , 'purchased' : np . random . choice ( [ 0 , 1 ] , 1000 ) , } ) print ( "Original Data:" ) print ( df . head ( ) ) print ( df . info ( ) )

1. Categorical Encoding

One-Hot Encoding

print ( "\n1. One-Hot Encoding:" ) df_ohe = pd . get_dummies ( df , columns = [ 'category' , 'city' ] , drop_first = True ) print ( df_ohe . head ( ) )

Ordinal Encoding

print ( "\n2. Ordinal Encoding:" ) ordinal_encoder = OrdinalEncoder ( ) df [ 'category_ordinal' ] = ordinal_encoder . fit_transform ( df [ [ 'category' ] ] ) print ( df [ [ 'category' , 'category_ordinal' ] ] . head ( ) )

Label Encoding

print ( "\n3. Label Encoding:" ) le = LabelEncoder ( ) df [ 'city_encoded' ] = le . fit_transform ( df [ 'city' ] ) print ( df [ [ 'city' , 'city_encoded' ] ] . head ( ) )

2. Feature Scaling

print ( "\n4. Feature Scaling:" ) X = df [ [ 'age' , 'income' , 'experience_years' ] ] . copy ( )

StandardScaler (mean=0, std=1)

scaler

StandardScaler ( ) X_standard = scaler . fit_transform ( X )

MinMaxScaler [0, 1]

minmax_scaler

MinMaxScaler ( ) X_minmax = minmax_scaler . fit_transform ( X )

RobustScaler (resistant to outliers)

robust_scaler

RobustScaler ( ) X_robust = robust_scaler . fit_transform ( X )

Visualization

fig , axes = plt . subplots ( 2 , 2 , figsize = ( 12 , 8 ) ) axes [ 0 , 0 ] . hist ( X [ 'age' ] , bins = 30 , edgecolor = 'black' ) axes [ 0 , 0 ] . set_title ( 'Original Age' ) axes [ 0 , 1 ] . hist ( X_standard [ : , 0 ] , bins = 30 , edgecolor = 'black' ) axes [ 0 , 1 ] . set_title ( 'StandardScaler Age' ) axes [ 1 , 0 ] . hist ( X_minmax [ : , 0 ] , bins = 30 , edgecolor = 'black' ) axes [ 1 , 0 ] . set_title ( 'MinMaxScaler Age' ) axes [ 1 , 1 ] . hist ( X_robust [ : , 0 ] , bins = 30 , edgecolor = 'black' ) axes [ 1 , 1 ] . set_title ( 'RobustScaler Age' ) plt . tight_layout ( ) plt . show ( )

3. Polynomial Features

print ( "\n5. Polynomial Features:" ) X_simple = df [ [ 'age' ] ] . copy ( ) poly = PolynomialFeatures ( degree = 2 , include_bias = False ) X_poly = poly . fit_transform ( X_simple ) X_poly_df = pd . DataFrame ( X_poly , columns = [ 'age' , 'age^2' ] ) print ( X_poly_df . head ( ) )

Visualization

plt . figure ( figsize = ( 12 , 5 ) ) plt . scatter ( df [ 'age' ] , df [ 'income' ] , alpha = 0.5 ) plt . xlabel ( 'Age' ) plt . ylabel ( 'Income' ) plt . title ( 'Age vs Income' ) plt . grid ( True , alpha = 0.3 ) plt . show ( )

4. Feature Interactions

print ( "\n6. Feature Interactions:" ) df [ 'age_income_interaction' ] = df [ 'age' ] * df [ 'income' ] / 10000 df [ 'age_experience_ratio' ] = df [ 'age' ] / ( df [ 'experience_years' ] + 1 ) print ( df [ [ 'age' , 'income' , 'age_income_interaction' , 'age_experience_ratio' ] ] . head ( ) )

5. Domain-specific Transformations

print ( "\n7. Domain-specific Features:" ) df [ 'age_group' ] = pd . cut ( df [ 'age' ] , bins = [ 0 , 30 , 45 , 60 , 100 ] , labels = [ 'Young' , 'Middle' , 'Senior' , 'Retired' ] ) df [ 'income_level' ] = pd . qcut ( df [ 'income' ] , q = 3 , labels = [ 'Low' , 'Medium' , 'High' ] ) df [ 'log_income' ] = np . log1p ( df [ 'income' ] ) df [ 'sqrt_experience' ] = np . sqrt ( df [ 'experience_years' ] ) print ( df [ [ 'age' , 'age_group' , 'income' , 'income_level' , 'log_income' ] ] . head ( ) )

6. Temporal Features (if date data available)

print ( "\n8. Temporal Features:" ) dates = pd . date_range ( '2023-01-01' , periods = len ( df ) ) df [ 'date' ] = dates df [ 'year' ] = df [ 'date' ] . dt . year df [ 'month' ] = df [ 'date' ] . dt . month df [ 'day_of_week' ] = df [ 'date' ] . dt . dayofweek df [ 'quarter' ] = df [ 'date' ] . dt . quarter df [ 'is_weekend' ] = df [ 'date' ] . dt . dayofweek

= 5 print ( df [ [ 'date' , 'year' , 'month' , 'day_of_week' , 'is_weekend' ] ] . head ( ) )

7. Feature Standardization Pipeline

print ( "\n9. Feature Engineering Pipeline:" )

Separate numerical and categorical features

numerical_features

[ 'age' , 'income' , 'experience_years' ] categorical_features = [ 'category' , 'city' ]

Create preprocessing pipeline

preprocessor

ColumnTransformer ( transformers = [ ( 'num' , StandardScaler ( ) , numerical_features ) , ( 'cat' , OneHotEncoder ( drop = 'first' ) , categorical_features ) , ] ) X_processed = preprocessor . fit_transform ( df [ numerical_features + categorical_features ] ) print ( f"Processed shape: { X_processed . shape } " )

8. Feature Statistics

print ( "\n10. Feature Statistics:" ) X_for_stats = df [ numerical_features ] . copy ( ) X_for_stats [ 'category_A' ] = ( df [ 'category' ] == 'A' ) . astype ( int ) X_for_stats [ 'city_NYC' ] = ( df [ 'city' ] == 'NYC' ) . astype ( int ) feature_stats = pd . DataFrame ( { 'Feature' : X_for_stats . columns , 'Mean' : X_for_stats . mean ( ) , 'Std' : X_for_stats . std ( ) , 'Min' : X_for_stats . min ( ) , 'Max' : X_for_stats . max ( ) , 'Skewness' : X_for_stats . skew ( ) , 'Kurtosis' : X_for_stats . kurtosis ( ) , } ) print ( feature_stats )

9. Feature Correlations

fig , axes = plt . subplots ( 1 , 2 , figsize = ( 14 , 5 ) ) X_numeric = df [ numerical_features ] . copy ( ) X_numeric [ 'purchased' ] = df [ 'purchased' ] corr_matrix = X_numeric . corr ( ) sns . heatmap ( corr_matrix , annot = True , cmap = 'coolwarm' , center = 0 , ax = axes [ 0 ] ) axes [ 0 ] . set_title ( 'Feature Correlation Matrix' )

Distribution of engineered features

axes [ 1 ] . hist ( df [ 'age_income_interaction' ] , bins = 30 , edgecolor = 'black' , alpha = 0.7 ) axes [ 1 ] . set_title ( 'Age-Income Interaction Distribution' ) axes [ 1 ] . set_xlabel ( 'Value' ) axes [ 1 ] . set_ylabel ( 'Frequency' ) plt . tight_layout ( ) plt . show ( )

10. Feature Binning / Discretization

print ( "\n11. Feature Binning:" ) df [ 'age_bin_equal' ] = pd . cut ( df [ 'age' ] , bins = 5 ) df [ 'age_bin_quantile' ] = pd . qcut ( df [ 'age' ] , q = 5 ) df [ 'income_bins' ] = pd . cut ( df [ 'income' ] , bins = [ 0 , 50000 , 100000 , 150000 ] ) print ( "Equal Width Binning:" ) print ( df [ 'age_bin_equal' ] . value_counts ( ) . sort_index ( ) ) print ( "\nEqual Frequency Binning:" ) print ( df [ 'age_bin_quantile' ] . value_counts ( ) . sort_index ( ) )

11. Missing Value Creation and Handling

print ( "\n12. Missing Value Imputation:" ) df_with_missing = df . copy ( ) missing_indices = np . random . choice ( len ( df ) , 50 , replace = False ) df_with_missing . loc [ missing_indices , 'age' ] = np . nan

Mean imputation

age_mean

df_with_missing [ 'age' ] . mean ( ) df_with_missing [ 'age_imputed_mean' ] = df_with_missing [ 'age' ] . fillna ( age_mean )

Median imputation

age_median

df_with_missing [ 'age' ] . median ( ) df_with_missing [ 'age_imputed_median' ] = df_with_missing [ 'age' ] . fillna ( age_median )

Forward fill

df_with_missing
[
'age_imputed_ffill'
]
=
df_with_missing
[
'age'
]
.
fillna
(
method
=
'ffill'
)
print
(
df_with_missing
[
[
'age'
,
'age_imputed_mean'
,
'age_imputed_median'
]
]
.
head
(
10
)
)
print
(
"\nFeature Engineering Complete!"
)
print
(
f"Original features:
{
len
(
df
.
columns
)
-
5
}
"
)
print
(
f"Final features available:
{
len
(
df
.
columns
)
}
"
)
Best Practices
Understand your domain before engineering features
Create features that are interpretable
Avoid data leakage (using future information)
Test feature importance after engineering
Document all transformations
Use appropriate scaling for different algorithms
Common Transformations
Log Transform
For skewed distributions
Polynomial Features
For non-linear relationships
Interaction Terms
For combined effects
Binning
For categorical approximation
Normalization
For comparison across scales Deliverables Engineered feature dataset Feature transformation documentation Correlation analysis of new features Distribution comparisons (before/after) Feature importance rankings Preprocessing pipeline code Data dictionary with feature descriptions
返回排行榜