# Data Quality Auditor

Comprehensive data quality assessment for CSV/Excel datasets.
## Features

- **Completeness**: Missing values analysis
- **Uniqueness**: Duplicate detection
- **Validity**: Type validation and constraints
- **Consistency**: Pattern and format checks
- **Quality Score**: Overall data quality metric
- **Reports**: Detailed HTML/JSON reports

## Quick Start

```python
from data_quality_auditor import DataQualityAuditor

auditor = DataQualityAuditor()
auditor.load_csv("customers.csv")

# Run full audit
report = auditor.audit()
print(f"Quality Score: {report['quality_score']}/100")

# Check specific issues
missing = auditor.check_missing()
duplicates = auditor.check_duplicates()
```
## CLI Usage

```bash
# Full audit
python data_quality_auditor.py --input data.csv

# Generate HTML report
python data_quality_auditor.py --input data.csv --report report.html

# Check specific aspects
python data_quality_auditor.py --input data.csv --missing
python data_quality_auditor.py --input data.csv --duplicates
python data_quality_auditor.py --input data.csv --types

# JSON output
python data_quality_auditor.py --input data.csv --json

# Validate against rules
python data_quality_auditor.py --input data.csv --rules rules.json
```
## API Reference

### DataQualityAuditor Class

```python
class DataQualityAuditor:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'DataQualityAuditor'
    def load_dataframe(self, df: pd.DataFrame) -> 'DataQualityAuditor'

    # Full audit
    def audit(self) -> dict
    def quality_score(self) -> float

    # Individual checks
    def check_missing(self) -> dict
    def check_duplicates(self, subset: list = None) -> dict
    def check_types(self) -> dict
    def check_uniqueness(self) -> dict
    def check_patterns(self, column: str, pattern: str) -> dict

    # Validation
    def validate_column(self, column: str, rules: dict) -> dict
    def validate_dataset(self, rules: dict) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def summary(self) -> str
```
## Quality Checks

### Missing Values

```python
missing = auditor.check_missing()
```

Returns:

```json
{
  "total_cells": 10000,
  "missing_cells": 150,
  "missing_percent": 1.5,
  "by_column": {
    "email": {"count": 50, "percent": 5.0},
    "phone": {"count": 100, "percent": 10.0}
  },
  "rows_with_missing": 120
}
```
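These statistics map directly onto standard pandas operations. The sketch below is not the library's internal code, just a minimal illustration of how each field in the report above could be computed from a DataFrame:

```python
import pandas as pd

def missing_stats(df: pd.DataFrame) -> dict:
    """Compute missing-value statistics in the shape of check_missing()."""
    total_cells = df.size
    missing_cells = int(df.isna().sum().sum())
    return {
        "total_cells": total_cells,
        "missing_cells": missing_cells,
        "missing_percent": round(100 * missing_cells / total_cells, 2),
        # Only report columns that actually have gaps
        "by_column": {
            col: {
                "count": int(df[col].isna().sum()),
                "percent": round(100 * df[col].isna().mean(), 2),
            }
            for col in df.columns
            if df[col].isna().any()
        },
        # Rows where at least one cell is missing
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
    }

df = pd.DataFrame({"email": ["a@x.com", None], "phone": ["1", "2"]})
print(missing_stats(df)["missing_cells"])  # 1
```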
### Duplicates

```python
dups = auditor.check_duplicates()
```

Returns:

```json
{
  "total_rows": 1000,
  "duplicate_rows": 25,
  "duplicate_percent": 2.5,
  "duplicate_groups": [...],
  "by_columns": {
    "email": {"duplicates": 15},
    "phone": {"duplicates": 20}
  }
}
```
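As a hedged sketch (again, not the auditor's actual implementation), the row-level counts above follow naturally from `DataFrame.duplicated`, which also accepts a `subset` of columns like `check_duplicates` does:

```python
import pandas as pd

def duplicate_stats(df: pd.DataFrame, subset=None) -> dict:
    """Count duplicate rows, optionally considering only a column subset."""
    # keep="first" marks every repeat after the first occurrence
    dup_mask = df.duplicated(subset=subset, keep="first")
    return {
        "total_rows": len(df),
        "duplicate_rows": int(dup_mask.sum()),
        "duplicate_percent": round(100 * dup_mask.mean(), 2),
    }

df = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@x.com"]})
print(duplicate_stats(df))
```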
### Type Validation

```python
types = auditor.check_types()
```

Returns:

```json
{
  "columns": {
    "age": {
      "detected_type": "int64",
      "unique_values": 75,
      "sample_values": [25, 30, 45],
      "issues": []
    },
    "date": {
      "detected_type": "object",
      "unique_values": 365,
      "sample_values": ["2023-01-01", "invalid"],
      "issues": ["Mixed date formats detected"]
    }
  }
}
```
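One way an issue like `"Mixed date formats detected"` can be flagged is to parse the column under a single expected format and see whether only some values survive. This helper is hypothetical, not part of the library:

```python
import pandas as pd

def mixed_date_formats(s: pd.Series) -> bool:
    """Return True if some, but not all, values parse under one date format."""
    # errors="coerce" turns non-conforming values into NaT instead of raising
    parsed = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
    return bool(parsed.isna().any() and parsed.notna().any())

dates = pd.Series(["2023-01-01", "invalid"])
print(mixed_date_formats(dates))  # True
```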
## Validation Rules

Define custom validation rules:

```json
{
  "columns": {
    "email": {
      "required": true,
      "unique": true,
      "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
    },
    "age": {
      "type": "integer",
      "min": 0,
      "max": 120
    },
    "status": {
      "allowed_values": ["active", "inactive", "pending"]
    },
    "created_at": {
      "type": "date",
      "format": "%Y-%m-%d"
    }
  }
}
```

```python
results = auditor.validate_dataset(rules)
```
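To make the rule semantics concrete, here is a simplified, illustrative validator (not the library's own) showing how rules such as `required`, `unique`, `min`/`max`, and `allowed_values` might be applied to a single column:

```python
import pandas as pd

def validate_series(s: pd.Series, rules: dict) -> list:
    """Apply a subset of the rule vocabulary to one column; return issue strings."""
    issues = []
    if rules.get("required") and s.isna().any():
        issues.append(f"{int(s.isna().sum())} missing values")
    if rules.get("unique") and s.duplicated().any():
        issues.append("duplicate values present")
    if "min" in rules and (s.dropna() < rules["min"]).any():
        issues.append(f"values below {rules['min']}")
    if "max" in rules and (s.dropna() > rules["max"]).any():
        issues.append(f"values above {rules['max']}")
    if "allowed_values" in rules:
        unexpected = set(s.dropna()) - set(rules["allowed_values"])
        if unexpected:
            issues.append(f"unexpected values: {sorted(unexpected)}")
    return issues

ages = pd.Series([25, 130, -1])
print(validate_series(ages, {"min": 0, "max": 120}))
```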
## Quality Score

The quality score (0-100) is calculated from:

- **Completeness (30%)**: Missing value ratio
- **Uniqueness (25%)**: Duplicate row ratio
- **Validity (25%)**: Type and constraint compliance
- **Consistency (20%)**: Format and pattern adherence

```python
score = auditor.quality_score()
# 85.5
```
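The weighting above amounts to a simple weighted average of the four dimension sub-scores (each on a 0-100 scale). A worked sketch, with illustrative sub-score values:

```python
def weighted_score(completeness: float, uniqueness: float,
                   validity: float, consistency: float) -> float:
    """Combine 0-100 dimension sub-scores using the documented weights."""
    return round(
        0.30 * completeness
        + 0.25 * uniqueness
        + 0.25 * validity
        + 0.20 * consistency,
        2,
    )

print(weighted_score(90, 100, 80, 70))  # 86.0
```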
## Output Formats

### Audit Report

```json
{
  "file": "data.csv",
  "rows": 1000,
  "columns": 15,
  "quality_score": 85.5,
  "completeness": {
    "score": 92.0,
    "missing_cells": 800,
    "details": {...}
  },
  "uniqueness": {
    "score": 97.5,
    "duplicate_rows": 25,
    "details": {...}
  },
  "validity": {
    "score": 78.0,
    "type_issues": [...],
    "details": {...}
  },
  "consistency": {
    "score": 80.0,
    "pattern_issues": [...],
    "details": {...}
  },
  "recommendations": [
    "Column 'phone' has 10% missing values",
    "25 duplicate rows detected",
    "Column 'date' has inconsistent formats"
  ]
}
```
## Example Workflows

### Pre-Import Validation

```python
import sys

auditor = DataQualityAuditor()
auditor.load_csv("import_data.csv")

report = auditor.audit()
if report['quality_score'] < 80:
    print("Data quality below threshold!")
    for rec in report['recommendations']:
        print(f"  - {rec}")
    sys.exit(1)
```
### ETL Pipeline Check

```python
auditor = DataQualityAuditor()
auditor.load_dataframe(transformed_df)

# Check critical columns
email_check = auditor.validate_column("email", {
    "required": True,
    "unique": True,
    "pattern": r"^[\w.+-]+@[\w-]+\.[\w.-]+$"
})

if email_check['issues']:
    raise ValueError(f"Email validation failed: {email_check['issues']}")
```
### Generate Documentation

```python
auditor = DataQualityAuditor()
auditor.load_csv("dataset.csv")

# Generate comprehensive report
auditor.generate_report("quality_report.html", format="html")

# Or get summary text
print(auditor.summary())
```
## Dependencies

```
pandas>=2.0.0
numpy>=1.24.0
```