Data Anonymizer
Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process CSV files at scale.
Quick Start from scripts.data_anonymizer import DataAnonymizer
Anonymize text
anonymizer = DataAnonymizer() result = anonymizer.anonymize("Contact John Smith at john@email.com or 555-123-4567") print(result)
"Contact [NAME] at [EMAIL] or [PHONE]"
Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")
Features PII Detection: Names, emails, phones, SSN, addresses, credit cards, dates Multiple Strategies: Mask, redact, hash, fake data replacement CSV Processing: Anonymize specific columns or auto-detect Reversible Tokens: Optional mapping for de-anonymization Custom Patterns: Add your own PII patterns Audit Report: List all detected PII with locations API Reference Initialization anonymizer = DataAnonymizer( strategy="mask", # mask, redact, hash, fake reversible=False # Enable token mapping )
Text Anonymization
Basic anonymization
result = anonymizer.anonymize(text)
With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])
Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)
Masking Strategies text = "Email john@test.com, call 555-1234"
Mask (default) - replace with type labels
anonymizer.strategy = "mask"
"Email [EMAIL], call [PHONE]"
Redact - replace with asterisks
anonymizer.strategy = "redact"
"Email **, call *"
Hash - replace with hash
anonymizer.strategy = "hash"
"Email a1b2c3d4, call e5f6g7h8"
Fake - replace with realistic fake data
anonymizer.strategy = "fake"
"Email jane@example.org, call 555-9876"
CSV Processing
Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")
Specify columns
anonymizer.anonymize_csv( "input.csv", "output.csv", columns=["name", "email", "phone"] )
Different strategies per column
anonymizer.anonymize_csv( "input.csv", "output.csv", column_strategies={ "name": "fake", "email": "hash", "ssn": "redact" } )
Reversible Anonymization anonymizer = DataAnonymizer(reversible=True)
Anonymize with token mapping
result = anonymizer.anonymize("John Smith: john@test.com") mapping = anonymizer.get_mapping()
Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")
Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret") original = anonymizer.deanonymize(result)
Custom Patterns
Add custom PII pattern
anonymizer.add_pattern( name="employee_id", pattern=r"EMP-\d{6}", label="[EMPLOYEE_ID]" )
CLI Usage
Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt
Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv
Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake
Generate audit report
python data_anonymizer.py --input document.txt --report audit.json
Specific PII types only
python data_anonymizer.py --input doc.txt --types email phone ssn
CLI Arguments Argument Description Default --input Input file Required --output Output file Required --strategy Masking strategy mask --types PII types to detect all --columns CSV columns to process auto --report Generate audit report - --reversible Enable token mapping False Supported PII Types Type Examples Pattern name John Smith, Mary Johnson NLP-based email user@domain.com Regex phone 555-123-4567, (555) 123-4567 Regex ssn 123-45-6789 Regex credit_card 4111-1111-1111-1111 Regex + Luhn address 123 Main St, City, ST 12345 NLP + Regex date_of_birth 01/15/1990, January 15, 1990 Regex ip_address 192.168.1.1 Regex Examples Anonymize Customer Support Logs anonymizer = DataAnonymizer(strategy="mask")
log = """ Ticket #1234: Customer John Doe (john.doe@company.com) called about billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309. Address: 123 Oak Street, Springfield, IL 62701. """
result = anonymizer.anonymize(log) print(result)
Ticket #1234: Customer [NAME] ([EMAIL]) called about
billing issue. SSN on file: [SSN]. Callback number: [PHONE].
Address: [ADDRESS].
GDPR Compliance for Database Export anonymizer = DataAnonymizer(strategy="hash")
Consistent hashing for joins
anonymizer.anonymize_csv( "users.csv", "users_anon.csv", columns=["email", "name", "phone"] )
anonymizer.anonymize_csv( "orders.csv", "orders_anon.csv", columns=["customer_email"] # Same hash as users.email )
Generate Test Data from Production anonymizer = DataAnonymizer(strategy="fake")
Replace real PII with realistic fake data
anonymizer.anonymize_csv( "production_data.csv", "test_data.csv" )
Test data has same structure but fake PII
Dependencies pandas>=2.0.0 faker>=18.0.0
Limitations Name detection may miss unusual names Address detection works best for US formats Custom patterns may be needed for domain-specific PII Fake data replacement doesn't preserve exact format