Clustering Analyzer

Analyze and cluster data using multiple algorithms with visualization and evaluation.

Features

K-Means: Partition-based clustering with elbow method
DBSCAN: Density-based clustering for arbitrary shapes
Hierarchical: Agglomerative clustering with dendrograms
Evaluation: Silhouette scores, cluster statistics
Visualization: 2D/3D plots, dendrograms, elbow curves
Export: Labeled data, cluster summaries

Quick Start

from clustering_analyzer import ClusteringAnalyzer

analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv")

K-Means clustering

result = analyzer.kmeans(n_clusters=3)
print(f"Silhouette Score: {result['silhouette_score']:.3f}")

Visualize

analyzer.plot_clusters("clusters.png")

CLI Usage

K-Means clustering

python clustering_analyzer.py --input data.csv --method kmeans --clusters 3

Find optimal clusters (elbow method)

python clustering_analyzer.py --input data.csv --method kmeans --find-optimal

DBSCAN clustering

python clustering_analyzer.py --input data.csv --method dbscan --eps 0.5 --min-samples 5

Hierarchical clustering

python clustering_analyzer.py --input data.csv --method hierarchical --clusters 4

Generate plots

python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --plot clusters.png

Export labeled data

python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --output labeled.csv

Select specific columns

python clustering_analyzer.py --input data.csv --columns age,income,spending --method kmeans --clusters 3

API Reference

ClusteringAnalyzer Class

class ClusteringAnalyzer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, columns: list = None) -> 'ClusteringAnalyzer'
    def load_dataframe(self, df: pd.DataFrame, columns: list = None) -> 'ClusteringAnalyzer'

    # Clustering methods
    def kmeans(self, n_clusters: int, **kwargs) -> dict
    def dbscan(self, eps: float = 0.5, min_samples: int = 5) -> dict
    def hierarchical(self, n_clusters: int, linkage: str = "ward") -> dict

    # Optimal clusters
    def find_optimal_clusters(self, max_k: int = 10) -> dict
    def elbow_plot(self, output: str, max_k: int = 10) -> str

    # Evaluation
    def silhouette_score(self) -> float
    def cluster_statistics(self) -> dict

    # Visualization
    def plot_clusters(self, output: str, dimensions: list = None) -> str
    def plot_dendrogram(self, output: str) -> str
    def plot_silhouette(self, output: str) -> str

    # Export
    def get_labels(self) -> list
    def to_dataframe(self) -> pd.DataFrame
    def save_labeled(self, output: str) -> str

Clustering Methods

K-Means

Best for roughly spherical clusters when the number of groups is known in advance:

result = analyzer.kmeans(n_clusters=3)

Returns:

{
    "labels": [0, 1, 2, 0, ...],
    "n_clusters": 3,
    "silhouette_score": 0.65,
    "inertia": 1234.56,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "centroids": [[...], [...], [...]]
}
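To make the "inertia" field concrete: in K-Means it conventionally means the sum of squared distances from each sample to its assigned centroid. The sketch below computes that quantity in plain Python. The helper name `inertia` and the sample points are illustrative assumptions, not part of the ClusteringAnalyzer API, and the analyzer may compute the value on scaled features.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Made-up data: two tight clusters, each point exactly 1.0 away from its centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]

print(inertia(points, labels, centroids))  # 4.0
```

Lower inertia means tighter clusters, but it always decreases as the number of clusters grows, which is why it is read together with the elbow curve rather than minimized outright.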

DBSCAN

Best for arbitrary-shaped clusters:

result = analyzer.dbscan(eps=0.5, min_samples=5)

Returns:

{
    "labels": [0, 0, 1, -1, ...],  # -1 = noise
    "n_clusters": 3,
    "n_noise": 15,
    "silhouette_score": 0.58,
    "cluster_sizes": {0: 150, 1: 200, 2: 100}
}
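The summary fields in this result can all be derived from the label list, with -1 treated as noise rather than as a cluster. The following is a minimal sketch of that bookkeeping, assuming only the -1 convention described above. `summarize_labels` is a hypothetical helper, not an analyzer method.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Derive cluster count, noise count and cluster sizes from a label list."""
    counts = Counter(labels)
    # Noise points are counted separately and do not form a cluster.
    n_noise = counts.pop(noise_label, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Made-up labels: three clusters and two noise points.
summary = summarize_labels([0, 0, 1, -1, 1, 2, -1, 0])
print(summary)
```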

Hierarchical (Agglomerative)

Best for understanding cluster hierarchy:

result = analyzer.hierarchical(n_clusters=4, linkage="ward")

Returns:

{
    "labels": [0, 1, 2, 3, ...],
    "n_clusters": 4,
    "silhouette_score": 0.62,
    "cluster_sizes": {0: 100, 1: 150, 2: 120, 3: 80}
}

Finding Optimal Clusters

Elbow Method

optimal = analyzer.find_optimal_clusters(max_k=10)

Returns:

{
    "optimal_k": 4,
    "inertias": [1000, 800, 500, 300, 280, ...],
    "silhouettes": [0.5, 0.55, 0.6, 0.65, 0.63, ...]
}

Elbow Plot

analyzer.elbow_plot("elbow.png", max_k=10)

Generates a plot of inertia against the number of clusters.
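Reading the elbow off a curve by eye can be automated in several ways, and this document does not state which rule the analyzer applies. One common heuristic, sketched here as an assumption, picks the k whose point lies farthest from the straight line joining the first and last points of the normalized curve. The function name `elbow_k` and the inertia values are made up for illustration.

```python
import math


def elbow_k(ks, inertias):
    """Return the k whose (k, inertia) point is farthest from the first-to-last chord."""
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    k_span = ks[-1] - ks[0]
    i_min, i_max = min(inertias), max(inertias)
    if k_span == 0 or i_max == i_min:
        raise ValueError("curve is degenerate; no elbow to find")

    # Rescale both axes to [0, 1] so the distance is not dominated by inertia units.
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - i_min) / (i_max - i_min) for v in inertias]

    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    chord_length = math.hypot(x1 - x0, y1 - y0)
    distances = [
        abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / chord_length
        for x, y in zip(xs, ys)
    ]
    return ks[distances.index(max(distances))]


# Made-up curve: inertia drops sharply up to k=3, then flattens.
ks = [1, 2, 3, 4, 5, 6]
inertias = [1000, 600, 300, 250, 220, 200]
best_k = elbow_k(ks, inertias)
print(best_k)  # 3
```

Heuristics like this are sensitive to the range of k examined, so comparing the suggested value with the silhouette scores is a reasonable sanity check.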

Cluster Statistics

stats = analyzer.cluster_statistics()

Returns:

{
    "n_clusters": 3,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "cluster_means": {
        0: {"age": 25.5, "income": 45000, ...},
        1: {"age": 45.2, "income": 75000, ...},
        2: {"age": 35.1, "income": 55000, ...}
    },
    "cluster_std": {
        0: {"age": 5.2, "income": 8000, ...},
        ...
    },
    "overall_silhouette": 0.65
}
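The "cluster_means" structure is a per-cluster average of each feature. A minimal sketch of how such a mapping could be built from rows and labels follows. `cluster_means` here is a hypothetical standalone function and the rows are made-up; the analyzer's own implementation is not shown in this document.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(rows, labels):
    """Group rows by cluster label and average every feature within each group."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {feature: mean(r[feature] for r in members) for feature in members[0]}
        for label, members in sorted(grouped.items())
    }


# Made-up rows: two young, lower-income customers and one older, higher-income one.
rows = [
    {"age": 20, "income": 40000},
    {"age": 30, "income": 50000},
    {"age": 50, "income": 90000},
]
means = cluster_means(rows, [0, 0, 1])
print(means)
```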

Visualization

Cluster Plot

2D plot (uses first 2 features or PCA)

analyzer.plot_clusters("clusters_2d.png")

Specify dimensions

analyzer.plot_clusters("clusters.png", dimensions=["age", "income"])

Dendrogram

For hierarchical clustering

analyzer.hierarchical(n_clusters=4)
analyzer.plot_dendrogram("dendrogram.png")

Silhouette Plot

analyzer.plot_silhouette("silhouette.png")

Shows the silhouette coefficient of each sample.
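For reference, the silhouette coefficient of a sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the lowest mean distance to the members of any other cluster. The sketch below implements that standard definition in plain Python with Euclidean distance. It is an illustration on made-up points, not the analyzer's code, and the name `silhouette_samples` is chosen here for convenience.

```python
import math


def silhouette_samples(points, labels):
    """Per-sample silhouette coefficients using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # By convention a single-member cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Made-up 1-D data: two well-separated pairs.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_samples(points, labels)
overall = sum(scores) / len(scores)
print(scores, overall)
```

Values near 1 indicate a sample sits well inside its cluster, values near 0 indicate it lies between clusters, and negative values suggest it may be assigned to the wrong one.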

Export Results

Get Labels

labels = analyzer.get_labels()

[0, 1, 2, 0, 1, ...]

Save Labeled Data

analyzer.save_labeled("labeled_data.csv")

Original data + cluster_label column

Get Full DataFrame

df = analyzer.to_dataframe()

DataFrame with cluster_label column

Example Workflows

Customer Segmentation

analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv", columns=["age", "income", "spending_score"])

Find optimal number of segments

optimal = analyzer.find_optimal_clusters(max_k=8)
print(f"Optimal segments: {optimal['optimal_k']}")

Cluster with optimal k

result = analyzer.kmeans(n_clusters=optimal['optimal_k'])

Get segment characteristics

stats = analyzer.cluster_statistics()
for cluster_id, means in stats["cluster_means"].items():
    print(f"\nSegment {cluster_id}:")
    for feature, value in means.items():
        print(f" {feature}: {value:.2f}")

Save segmented data

analyzer.save_labeled("customer_segments.csv")

Anomaly Detection with DBSCAN

analyzer = ClusteringAnalyzer()
analyzer.load_csv("transactions.csv", columns=["amount", "frequency"])

DBSCAN identifies noise points as potential anomalies

result = analyzer.dbscan(eps=0.3, min_samples=10)

print(f"Found {result['n_noise']} potential anomalies")

Get anomalous records

df = analyzer.to_dataframe()
anomalies = df[df["cluster_label"] == -1]

Document Clustering

After TF-IDF transformation

analyzer = ClusteringAnalyzer()
analyzer.load_dataframe(tfidf_matrix)

Hierarchical clustering to see document relationships

result = analyzer.hierarchical(n_clusters=5)
analyzer.plot_dendrogram("doc_dendrogram.png")

Data Preprocessing

The analyzer automatically:

Handles missing values (imputation)
Scales features (standardization)
Reduces dimensions for visualization (PCA)
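The first two automatic steps can be pictured with a small plain-Python sketch. It assumes mean imputation followed by z-score standardization with the population standard deviation, which is what scikit-learn's StandardScaler uses. Which imputation strategy the analyzer actually applies is not stated here, so treat `impute_and_standardize` and its behavior as illustrative only.

```python
from statistics import mean, pstdev


def impute_and_standardize(column):
    """Fill None with the column mean, then rescale to zero mean and unit variance."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in column]

    mu = mean(filled)
    sigma = pstdev(filled)  # population standard deviation (ddof=0)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


# Made-up column with one missing value.
scaled = impute_and_standardize([1.0, None, 3.0])
print(scaled)
```

Standardizing matters for distance-based methods such as K-Means and DBSCAN because a feature measured in large units (income, for example) would otherwise dominate one measured in small units (age).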

For custom preprocessing:

import pandas as pd
from sklearn.preprocessing import StandardScaler

Preprocess manually

df = pd.read_csv("data.csv")
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

Load preprocessed data

analyzer.load_dataframe(df_scaled)

Dependencies

scikit-learn>=1.3.0
pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
scipy>=1.10.0
