Correlation Explorer
Analyze correlations between variables in CSV/Excel datasets.
Features
Correlation Matrix: Compute all pairwise correlations
Heatmap Visualization: Color-coded correlation display
Significance Testing: P-values for correlations
Multiple Methods: Pearson, Spearman, Kendall
Strong Correlations: Find highly correlated pairs
Target Analysis: Correlations with specific variable

Quick Start
from correlation_explorer import CorrelationExplorer
explorer = CorrelationExplorer()
# Load and analyze
explorer.load_csv("sales_data.csv")
matrix = explorer.correlation_matrix()
# Find strong correlations
strong = explorer.find_strong_correlations(threshold=0.7)
print(strong)
# Generate heatmap
explorer.plot_heatmap("correlation_heatmap.png")
CLI Usage
# Compute correlation matrix
python correlation_explorer.py --input data.csv --output correlations.csv
# Generate heatmap
python correlation_explorer.py --input data.csv --heatmap heatmap.png
# Find strong correlations
python correlation_explorer.py --input data.csv --strong --threshold 0.7
# Correlations with target variable
python correlation_explorer.py --input data.csv --target sales
# Use Spearman correlation
python correlation_explorer.py --input data.csv --method spearman
# Include p-values
python correlation_explorer.py --input data.csv --pvalues
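The flags above could be wired up with argparse roughly as follows. This is a minimal sketch, not the tool's actual parser: the `build_parser` helper, the help strings, and the choice of `--input` as the only required argument are assumptions. The flag names come from the commands above, and the 0.7 default mirrors the `find_strong_correlations` signature.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in CLI Usage."""
    parser = argparse.ArgumentParser(prog="correlation_explorer.py")
    parser.add_argument("--input", required=True, help="CSV file to analyze")
    parser.add_argument("--output", help="write the correlation matrix to this CSV")
    parser.add_argument("--heatmap", help="write a heatmap image to this path")
    parser.add_argument("--strong", action="store_true",
                        help="list strongly correlated pairs")
    parser.add_argument("--threshold", type=float, default=0.7,
                        help="cutoff on the absolute correlation (assumed default)")
    parser.add_argument("--target", help="report correlations with this column")
    parser.add_argument("--method", default="pearson",
                        choices=["pearson", "spearman", "kendall"])
    parser.add_argument("--pvalues", action="store_true", help="include p-values")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--input", "data.csv", "--strong", "--threshold", "0.8"]
)
print(args.strong, args.threshold, args.method)
```

Restricting `--method` with `choices` makes a typo such as `--method spearmen` fail at parse time rather than deep inside the analysis.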
API Reference
CorrelationExplorer Class
class CorrelationExplorer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'CorrelationExplorer'
    def load_dataframe(self, df: pd.DataFrame) -> 'CorrelationExplorer'

    # Analysis
    def correlation_matrix(self, method: str = "pearson") -> pd.DataFrame
    def correlation_with_pvalues(self, method: str = "pearson") -> tuple
    def correlate_with_target(self, target: str, method: str = "pearson") -> pd.Series

    # Discovery
    def find_strong_correlations(self, threshold: float = 0.7) -> list
    def find_weak_correlations(self, threshold: float = 0.3) -> list

    # Visualization
    def plot_heatmap(self, output: str, **kwargs) -> str
    def plot_scatter(self, var1: str, var2: str, output: str) -> str

    # Export
    def to_csv(self, output: str) -> str
    def to_json(self, output: str) -> str
Correlation Methods
Method     Best For
pearson    Linear relationships, approximately normal data
spearman   Monotonic (possibly non-linear) relationships, ordinal data
kendall    Small samples, ordinal data
# Pearson (default) - parametric
matrix = explorer.correlation_matrix(method="pearson")
# Spearman - rank-based, non-parametric
matrix = explorer.correlation_matrix(method="spearman")
# Kendall - robust to outliers
matrix = explorer.correlation_matrix(method="kendall")
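To see why the choice of method matters, the sketch below compares Pearson and Spearman on data that is monotonic but not linear. It calls pandas' `DataFrame.corr` directly rather than `CorrelationExplorer`, on the assumption that the two behave alike for these methods. The data is synthetic.

```python
import pandas as pd

# Synthetic data: y always increases with x, but along a cubic curve.
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Spearman works on ranks, so a perfectly monotonic relationship scores 1.0;
# Pearson measures linear fit and therefore lands below that.
print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

A large gap between the two coefficients on real data is a hint that the relationship is monotonic but curved, or that outliers are influencing the Pearson value.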
Output Format
Correlation Matrix
            sales  marketing  customers
sales       1.000      0.854      0.723
marketing   0.854      1.000      0.612
customers   0.723      0.612      1.000
Strong Correlations
[
  {"var1": "sales", "var2": "marketing", "correlation": 0.854, "abs_corr": 0.854},
  {"var1": "sales", "var2": "customers", "correlation": 0.723, "abs_corr": 0.723}
]
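One way such a list could be derived from a correlation matrix is to scan the upper triangle and keep pairs whose absolute correlation meets the threshold. This is a sketch under assumptions, not the library's implementation: `strong_pairs` is a hypothetical helper, and both the inclusive `>=` comparison and the sort order are guesses.

```python
import pandas as pd


def strong_pairs(corr: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return pairs above the diagonal whose |correlation| >= threshold."""
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        # Start at i + 1: each pair is visited once and self-pairs are skipped.
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append({"var1": cols[i], "var2": cols[j],
                              "correlation": r, "abs_corr": abs(r)})
    # Strongest relationships first, regardless of sign.
    return sorted(pairs, key=lambda p: p["abs_corr"], reverse=True)


# The matrix from the Correlation Matrix output example.
names = ["sales", "marketing", "customers"]
corr = pd.DataFrame(
    [[1.000, 0.854, 0.723],
     [0.854, 1.000, 0.612],
     [0.723, 0.612, 1.000]],
    index=names, columns=names,
)
print(strong_pairs(corr, threshold=0.7))
```

Keeping both `correlation` and `abs_corr` lets callers rank by strength while still seeing whether the relationship is positive or negative.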
With P-Values
{
  "correlations": DataFrame,
  "pvalues": DataFrame,
  "significant": [...],  # p < 0.05
}
Example Workflows
Feature Selection
from correlation_explorer import CorrelationExplorer
explorer = CorrelationExplorer()
explorer.load_csv("features.csv")
# Find features correlated with target
target_corr = explorer.correlate_with_target("target")
important_features = target_corr[abs(target_corr) > 0.3].index.tolist()
print(f"Important features: {important_features}")
# Find multicollinear features (to potentially drop)
strong = explorer.find_strong_correlations(threshold=0.9)
print("Highly correlated pairs (consider dropping one):")
for pair in strong:
    print(f"  {pair['var1']} <-> {pair['var2']}: {pair['correlation']:.3f}")
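The target-correlation step can be reproduced with plain pandas, which helps when checking results by hand. The sketch below assumes `correlate_with_target` behaves like `DataFrame.corrwith` with the target column excluded. The dataset and column names are made up for illustration.

```python
import pandas as pd

# Tiny synthetic dataset; column names are illustrative only.
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],    # rises with the target
    "feature_b": [5, 3, 4, 1, 2],    # falls as the target rises
    "feature_c": [3, 1, 2, 1, 3],    # no linear relationship
    "target":    [2, 4, 6, 8, 10],
})

# Correlate every other column with the target (Pearson by default).
target_corr = df.drop(columns="target").corrwith(df["target"])

# Filter on the absolute value so strong negative correlations are kept too.
important = target_corr[target_corr.abs() > 0.3].index.tolist()
print(target_corr.round(3))
print(important)
```

Filtering on the absolute value is what keeps `feature_b` in the list: its correlation is negative, but it is as informative about the target as a positive one of the same size.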
Sales Analysis
from correlation_explorer import CorrelationExplorer
explorer = CorrelationExplorer()
explorer.load_csv("sales_data.csv")
# What drives sales?
sales_corr = explorer.correlate_with_target("revenue")
print("Factors correlated with revenue:")
for var, corr in sales_corr.sort_values(ascending=False).items():
    if var != "revenue":
        print(f"  {var}: {corr:.3f}")
# Visualize
explorer.plot_heatmap("sales_correlations.png")
Data Exploration
from correlation_explorer import CorrelationExplorer
explorer = CorrelationExplorer()
explorer.load_csv("dataset.csv")
# Get full picture
corr, pvals = explorer.correlation_with_pvalues()
# Find all significant correlations
significant = []
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if pvals.iloc[i, j] < 0.05:
            significant.append({
                'var1': corr.columns[i],
                'var2': corr.columns[j],
                'r': corr.iloc[i, j],
                'p': pvals.iloc[i, j],
            })
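The two matrices unpacked in this workflow could be produced along the following lines. This is a minimal sketch assuming Pearson correlation computed with `scipy.stats.pearsonr`. `corr_with_pvalues` is a hypothetical helper, not the library's implementation, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def corr_with_pvalues(df: pd.DataFrame):
    """Pairwise Pearson r and two-sided p-values for the numeric columns."""
    cols = df.select_dtypes(include="number").columns
    n = len(cols)
    # Diagonal: a variable correlates perfectly with itself (r = 1, p = 0).
    corr = pd.DataFrame(np.eye(n), index=cols, columns=cols)
    pvals = pd.DataFrame(np.zeros((n, n)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for j in range(i + 1, n):
            b = cols[j]
            pair = df[[a, b]].dropna()           # drop rows missing either value
            r, p = stats.pearsonr(pair[a], pair[b])
            corr.loc[a, b] = corr.loc[b, a] = r  # both matrices are symmetric
            pvals.loc[a, b] = pvals.loc[b, a] = p
    return corr, pvals


# Synthetic data: y depends on x, noise is independent of both.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})
corr, pvals = corr_with_pvalues(df)
print(corr.round(3))
print(pvals.round(4))
```

With many variables, testing every pair at p < 0.05 will flag some pairs by chance alone, so a multiple-comparison correction such as Bonferroni may be worth applying before acting on the list.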
Heatmap Options
explorer.plot_heatmap(
    output="heatmap.png",
    cmap="coolwarm",      # Color scheme
    annot=True,           # Show values
    figsize=(12, 10),     # Figure size
    vmin=-1, vmax=1,      # Color scale
    title="Correlation Matrix"
)
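Several of these options share names with parameters of `seaborn.heatmap` (`cmap`, `annot`, `vmin`, `vmax`), so one plausible reading is that `plot_heatmap` forwards them. The sketch below shows that pattern and is an assumption rather than the library's code: `plot_heatmap_sketch` is a hypothetical stand-in, and the handling of `figsize` and `title` is a guess.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_heatmap_sketch(corr, output, figsize=(12, 10), title=None, **heatmap_kwargs):
    """Hypothetical stand-in: draw a correlation matrix and save it to `output`."""
    fig, ax = plt.subplots(figsize=figsize)
    # cmap, annot, vmin and vmax pass straight through to seaborn.
    sns.heatmap(corr, ax=ax, **heatmap_kwargs)
    if title:
        ax.set_title(title)
    fig.savefig(output, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory
    return str(output)


names = ["sales", "marketing", "customers"]
corr = pd.DataFrame(
    [[1.0, 0.854, 0.723], [0.854, 1.0, 0.612], [0.723, 0.612, 1.0]],
    index=names, columns=names,
)
out_path = Path(tempfile.mkdtemp()) / "heatmap.png"
saved = plot_heatmap_sketch(corr, out_path, cmap="coolwarm", annot=True,
                            vmin=-1, vmax=1, title="Correlation Matrix")
print(saved)
```

Pinning `vmin=-1` and `vmax=1` keeps the color scale symmetric, so zero correlation maps to the neutral midpoint of a diverging colormap such as `coolwarm` even when the data contains no negative values.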
Dependencies
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0
matplotlib>=3.7.0
seaborn>=0.12.0