# Distribution Analyzer Expert

Expert in the analysis of statistical distributions.

## Core Principles

1. **Exploratory Data Analysis**
   - Start with visualization (histograms, Q-Q plots)
   - Descriptive statistics (mean, median, std, skewness, kurtosis)
   - Detect outliers and anomalies
2. **Multiple Testing**
   - Kolmogorov-Smirnov test
   - Anderson-Darling test
   - Shapiro-Wilk test
   - Chi-square goodness-of-fit
3. **Model Selection**
   - AIC/BIC criteria
   - Cross-validation
   - Residual analysis

## Distribution Families

### Continuous Distributions

**Normal (Gaussian)**
- Use cases: natural phenomena, measurement errors
- Parameters: μ (mean), σ (std)
- When: symmetric, additive processes
- Test: Shapiro-Wilk

**Log-Normal**
- Use cases: income, stock prices, file sizes
- Parameters: μ, σ (of the log)
- When: multiplicative processes, positive skew
- Test: transform and test normality

**Exponential**
- Use cases: time between events, failure times
- Parameters: λ (rate)
- When: memoryless processes
- Test: K-S test

**Gamma**
- Use cases: wait times, insurance claims
- Parameters: α (shape), β (rate)
- When: sum of exponential events

**Weibull**
- Use cases: reliability, survival analysis
- Parameters: λ (scale), k (shape)
- When: failure time modeling

**Beta**
- Use cases: proportions, probabilities
- Parameters: α, β
- When: bounded [0, 1] data

**Pareto**
- Use cases: wealth, city sizes
- Parameters: α (shape), x_m (scale)
- When: power law, heavy tail

**Student's t**
- Use cases: small samples, heavy tails
- Parameters: ν (degrees of freedom)
- When: normal-like but heavier tails

### Discrete Distributions

**Poisson**
- Use cases: event counts, rare events
- Parameters: λ (rate)
- When: events in a fixed interval

**Binomial**
- Use cases: success/failure trials
- Parameters: n (trials), p (probability)
- When: fixed number of independent trials

**Negative Binomial**
- Use cases: overdispersed counts
- Parameters: r, p
- When: variance > mean

**Geometric**
- Use cases: number of trials until first success
- Parameters: p (probability)
- When: waiting for first success

## Distribution Analyzer Class

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from typing import Dict, Tuple


class DistributionAnalyzer:
    """Comprehensive distribution analysis toolkit."""

    DISTRIBUTIONS = {
        'norm': stats.norm,
        'lognorm': stats.lognorm,
        'expon': stats.expon,
        'gamma': stats.gamma,
        'weibull_min': stats.weibull_min,
        'beta': stats.beta,
        'pareto': stats.pareto,
        't': stats.t,
    }

    def __init__(self, data: np.ndarray):
        self.data = np.asarray(data)
        self.n = len(self.data)
        self.results = {}

    def descriptive_stats(self) -> Dict:
        """Calculate descriptive statistics."""
        return {
            'n': self.n,
            'mean': np.mean(self.data),
            'median': np.median(self.data),
            'std': np.std(self.data),
            'var': np.var(self.data),
            'min': np.min(self.data),
            'max': np.max(self.data),
            'range': np.ptp(self.data),
            'skewness': stats.skew(self.data),
            'kurtosis': stats.kurtosis(self.data),
            'q25': np.percentile(self.data, 25),
            'q50': np.percentile(self.data, 50),
            'q75': np.percentile(self.data, 75),
            'iqr': stats.iqr(self.data),
        }

    def fit_distributions(self) -> Dict:
        """Fit multiple distributions and rank by goodness-of-fit."""
        results = {}
        for name, dist in self.DISTRIBUTIONS.items():
            try:
                # Fit distribution by maximum likelihood
                params = dist.fit(self.data)
                # Calculate log-likelihood
                log_likelihood = np.sum(dist.logpdf(self.data, *params))
                # Calculate AIC and BIC
                k = len(params)
                aic = 2 * k - 2 * log_likelihood
                bic = k * np.log(self.n) - 2 * log_likelihood
                # K-S test against the fitted distribution
                ks_stat, ks_pvalue = stats.kstest(self.data, name, args=params)
                results[name] = {
                    'params': params,
                    'log_likelihood': log_likelihood,
                    'aic': aic,
                    'bic': bic,
                    'ks_statistic': ks_stat,
                    'ks_pvalue': ks_pvalue,
                }
            except Exception as e:
                results[name] = {'error': str(e)}

        # Rank by AIC (lower is better)
        valid_results = {k: v for k, v in results.items() if 'aic' in v}
        ranked = sorted(valid_results.items(), key=lambda x: x[1]['aic'])
        self.results = results
        return {
            'all': results,
            'best': ranked[0] if ranked else None,
            'ranking': [(name, res['aic']) for name, res in ranked],
        }

    def test_normality(self) -> Dict:
        """Comprehensive normality testing."""
        results = {}

        # Shapiro-Wilk (best for n < 5000)
        if self.n < 5000:
            stat, pvalue = stats.shapiro(self.data)
            results['shapiro_wilk'] = {
                'statistic': stat,
                'pvalue': pvalue,
                'is_normal': pvalue > 0.05,
            }

        # D'Agostino-Pearson (requires n >= 20)
        if self.n >= 20:
            stat, pvalue = stats.normaltest(self.data)
            results['dagostino_pearson'] = {
                'statistic': stat,
                'pvalue': pvalue,
                'is_normal': pvalue > 0.05,
            }

        # Anderson-Darling
        result = stats.anderson(self.data, dist='norm')
        results['anderson_darling'] = {
            'statistic': result.statistic,
            'critical_values': dict(zip(
                [f'{cv}%' for cv in result.significance_level],
                result.critical_values,
            )),
            # index 2 corresponds to the 5% significance level
            'is_normal': result.statistic < result.critical_values[2],
        }

        # Kolmogorov-Smirnov
        stat, pvalue = stats.kstest(
            self.data, 'norm',
            args=(np.mean(self.data), np.std(self.data)),
        )
        results['kolmogorov_smirnov'] = {
            'statistic': stat,
            'pvalue': pvalue,
            'is_normal': pvalue > 0.05,
        }
        return results

    def detect_outliers(self, method: str = 'iqr') -> Dict:
        """Detect outliers using various methods."""
        results = {'method': method}
        if method == 'iqr':
            q1, q3 = np.percentile(self.data, [25, 75])
            iqr = q3 - q1
            lower = q1 - 1.5 * iqr
            upper = q3 + 1.5 * iqr
            outliers = self.data[(self.data < lower) | (self.data > upper)]
            results.update({
                'lower_bound': lower,
                'upper_bound': upper,
                'outliers': outliers,
                'n_outliers': len(outliers),
                'outlier_percentage': len(outliers) / self.n * 100,
            })
        elif method == 'zscore':
            z_scores = np.abs(stats.zscore(self.data))
            threshold = 3
            outliers = self.data[z_scores > threshold]
            results.update({
                'threshold': threshold,
                'outliers': outliers,
                'n_outliers': len(outliers),
                'outlier_percentage': len(outliers) / self.n * 100,
            })
        elif method == 'mad':
            median = np.median(self.data)
            mad = np.median(np.abs(self.data - median))
            threshold = 3.5
            modified_z = 0.6745 * (self.data - median) / mad
            outliers = self.data[np.abs(modified_z) > threshold]
            results.update({
                'median': median,
                'mad': mad,
                'outliers': outliers,
                'n_outliers': len(outliers),
                'outlier_percentage': len(outliers) / self.n * 100,
            })
        return results

    def plot_analysis(self, figsize: Tuple = (15, 10)):
        """Generate comprehensive diagnostic plots."""
        fig, axes = plt.subplots(2, 3, figsize=figsize)

        # Histogram with KDE
        axes[0, 0].hist(self.data, bins='auto', density=True, alpha=0.7)
        kde_x = np.linspace(self.data.min(), self.data.max(), 100)
        kde = stats.gaussian_kde(self.data)
        axes[0, 0].plot(kde_x, kde(kde_x), 'r-', lw=2)
        axes[0, 0].set_title('Histogram with KDE')

        # Q-Q Plot (Normal)
        stats.probplot(self.data, dist="norm", plot=axes[0, 1])
        axes[0, 1].set_title('Q-Q Plot (Normal)')

        # Box Plot
        axes[0, 2].boxplot(self.data, vert=True)
        axes[0, 2].set_title('Box Plot')

        # ECDF
        sorted_data = np.sort(self.data)
        ecdf = np.arange(1, self.n + 1) / self.n
        axes[1, 0].step(sorted_data, ecdf, where='post')
        axes[1, 0].set_title('Empirical CDF')

        # P-P Plot
        theoretical = stats.norm.cdf(
            sorted_data, loc=np.mean(self.data), scale=np.std(self.data))
        axes[1, 1].scatter(theoretical, ecdf, alpha=0.5)
        axes[1, 1].plot([0, 1], [0, 1], 'r--')
        axes[1, 1].set_title('P-P Plot (Normal)')

        # Violin Plot
        axes[1, 2].violinplot(self.data)
        axes[1, 2].set_title('Violin Plot')

        plt.tight_layout()
        return fig

    def bootstrap_ci(self, statistic: str = 'mean',
                     n_bootstrap: int = 10000,
                     confidence: float = 0.95) -> Dict:
        """Calculate bootstrap confidence intervals."""
        stat_func = {'mean': np.mean, 'median': np.median, 'std': np.std}[statistic]
        bootstrap_stats = []
        for _ in range(n_bootstrap):
            sample = np.random.choice(self.data, size=self.n, replace=True)
            bootstrap_stats.append(stat_func(sample))
        alpha = 1 - confidence
        lower = np.percentile(bootstrap_stats, alpha / 2 * 100)
        upper = np.percentile(bootstrap_stats, (1 - alpha / 2) * 100)
        return {
            'statistic': statistic,
            'point_estimate': stat_func(self.data),
            'ci_lower': lower,
            'ci_upper': upper,
            'confidence': confidence,
            'standard_error': np.std(bootstrap_stats),
        }
```

## Usage Examples
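The AIC ranking inside `fit_distributions` can be sanity-checked in isolation before running the full class. A minimal sketch using only `numpy` and `scipy` (the seed, sample size, and candidate pair are arbitrary choices for illustration):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)

def aic(dist, data):
    """AIC = 2k - 2*log-likelihood for a maximum-likelihood fit."""
    params = dist.fit(data)
    log_likelihood = np.sum(dist.logpdf(data, *params))
    return 2 * len(params) - 2 * log_likelihood

aic_expon = aic(stats.expon, data)
aic_norm = aic(stats.norm, data)
print(f"expon AIC: {aic_expon:.1f}, norm AIC: {aic_norm:.1f}")
# On exponential data the exponential fit should get the lower (better) AIC.
```

Lower AIC wins; the `ranking` list returned by `fit_distributions` is just this comparison applied across the whole `DISTRIBUTIONS` table.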
```python
# Example usage
import numpy as np

# Generate sample data
np.random.seed(42)
data = np.random.lognormal(mean=2, sigma=0.5, size=1000)

# Create analyzer
analyzer = DistributionAnalyzer(data)

# Descriptive statistics
print("Descriptive Stats:")
print(analyzer.descriptive_stats())

# Fit distributions
print("\nDistribution Fitting:")
fit_results = analyzer.fit_distributions()
print(f"Best fit: {fit_results['best'][0]}")
print(f"AIC: {fit_results['best'][1]['aic']:.2f}")

# Test normality (Anderson-Darling returns no p-value, only critical values)
print("\nNormality Tests:")
norm_results = analyzer.test_normality()
for test, result in norm_results.items():
    pvalue = result.get('pvalue')
    if pvalue is not None:
        print(f"{test}: p-value = {pvalue:.4f}")
    else:
        print(f"{test}: compare statistic to critical values")

# Detect outliers
print("\nOutlier Detection:")
outliers = analyzer.detect_outliers(method='iqr')
print(f"Found {outliers['n_outliers']} outliers ({outliers['outlier_percentage']:.1f}%)")

# Bootstrap CI
print("\nBootstrap Confidence Interval:")
ci = analyzer.bootstrap_ci('mean', confidence=0.95)
print(f"Mean: {ci['point_estimate']:.2f} [{ci['ci_lower']:.2f}, {ci['ci_upper']:.2f}]")
```
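The modified z-score behind `detect_outliers(method='mad')` is easy to verify by hand. A small sketch on contaminated data, using only `numpy` (the injected value 25.0 and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.normal(0, 1, 99), 25.0)  # inject one gross outlier

median = np.median(x)
mad = np.median(np.abs(x - median))
# 0.6745 ≈ Φ⁻¹(0.75): rescales MAD to estimate σ under normality
modified_z = 0.6745 * (x - median) / mad

outliers = x[np.abs(modified_z) > 3.5]
print(outliers)  # should include the injected 25.0
```

Because median and MAD ignore extreme values, the injected point barely shifts the threshold, which is why the MAD method resists masking better than the plain z-score.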
```python
# Generate plots
fig = analyzer.plot_analysis()
plt.savefig('distribution_analysis.png')
```

## Distribution Selection Guide

**Symmetric data:**
- Start with Normal
- If heavy tails: Student's t
- If lighter tails: consider uniform

**Right-skewed data:**
- Log-normal for multiplicative processes
- Gamma for waiting times
- Weibull for reliability
- Exponential if memoryless

**Left-skewed data:**
- Reflect and fit right-skewed
- Beta with appropriate parameters

**Bounded data [0, 1]:**
- Beta distribution
- Truncated normal if near-normal

**Count data:**
- Poisson if mean ≈ variance
- Negative binomial if overdispersed
- Binomial for fixed trials

**Heavy tails:**
- Pareto for power law
- Student's t
- Cauchy (extreme)

## Common Pitfalls

1. **Visual inspection alone**
   - Problem: histograms can be misleading
   - Solution: use multiple tests and Q-Q plots
2. **Ignoring sample size**
   - Problem: tests behave differently as n changes
   - Solution: Shapiro-Wilk for small samples, K-S for large
3. **Single test reliance**
   - Problem: each test has its own assumptions
   - Solution: use multiple complementary tests
4. **Overfitting**
   - Problem: a complex distribution fits noise
   - Solution: AIC/BIC penalize extra parameters
5. **Parameter uncertainty**
   - Problem: point estimates hide uncertainty
   - Solution: bootstrap confidence intervals

## Best Practices

1. **Visualize first**: always start with plots
2. **Multiple tests**: never rely on a single test
3. **Sample size awareness**: choose tests based on sample size
4. **Domain knowledge**: account for the physical meaning of the data
5. **Report uncertainty**: provide confidence intervals
6. **Validate**: use cross-validation for model selection
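The count-data rule in the guide above (Poisson if mean ≈ variance, negative binomial if overdispersed) can be checked numerically with the variance-to-mean ratio. A rough sketch on simulated counts (parameter choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
poisson_counts = rng.poisson(lam=4.0, size=5000)
# Negative binomial with n=2, p=1/3: mean = n(1-p)/p = 4, variance = n(1-p)/p^2 = 12
nbinom_counts = rng.negative_binomial(2, 1/3, size=5000)

def dispersion_index(counts):
    """Variance-to-mean ratio: ≈ 1 for Poisson, > 1 when overdispersed."""
    return np.var(counts) / np.mean(counts)

print(f"Poisson dispersion:  {dispersion_index(poisson_counts):.2f}")   # ≈ 1
print(f"NegBinom dispersion: {dispersion_index(nbinom_counts):.2f}")    # ≈ 3
```

A dispersion index well above 1 is the practical signal to prefer the negative binomial over the Poisson before any formal goodness-of-fit test.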