multi-source-data-merger

Installs: 35
Rank: #19518

Install

```shell
npx skills add https://github.com/letta-ai/skills --skill multi-source-data-merger
```

# Multi Source Data Merger

## Overview

This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files including conflict reports.

## Workflow

### Step 1: Analyze Requirements and Source Files

Before writing any code, thoroughly understand the task:

- Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
- Determine the merge key (e.g., `user_id`, `record_id`) that links records across sources
- Review field mapping requirements - source fields may have different names that map to common output fields
- Understand conflict resolution rules - typically based on source priority ordering
- Identify expected output formats and structure

**Important:** Do not attempt to read binary formats (Parquet, Excel, etc.) as text files - use appropriate libraries.

### Step 2: Set Up Environment

- Create a Python virtual environment using `uv` or `venv`
- Install required dependencies based on source formats:
  - `pandas` - Core data manipulation
  - `pyarrow` - Parquet file support
  - `openpyxl` - Excel file support
  - `lxml` - XML parsing (if needed)
- Verify installations before proceeding

Example environment setup:

```shell
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
```

### Step 3: Write the Merge Script

Structure the script with clear separation of concerns:

- Data reading functions - One per format type
- Field mapping function - Apply column renames
- Data normalization - Handle date formats, type conversions
- Merge logic - Combine records using the merge key
- Conflict resolution - Apply priority rules
- Output generation - Write merged data and conflict reports

Script quality practices:

- Validate syntax before execution: `python -m py_compile script.py`
- Use try-except blocks with informative error messages
- Document assumptions about data formats

### Step 4: Execute and Verify

Run a comprehensive verification
process:

- Check output file existence at expected locations
- Validate merged data contains expected values
- Verify conflict report structure and content
- Run any provided test suites

## Common Pitfalls

### Binary File Handling

**Mistake:** Attempting to read Parquet/Excel files as text
**Solution:** Always use pandas with the appropriate engine (`pyarrow` for Parquet, `openpyxl` for Excel)

### Syntax Errors in Scripts

**Mistake:** Writing long scripts without validation, leading to indentation or syntax errors
**Solution:** Run `python -m py_compile script.py` before execution

### Date Format Normalization

**Mistake:** Assuming consistent date formats across sources
**Solution:** Implement flexible date parsing that handles multiple formats:

- ISO format: `2024-01-15`
- US format: `01/15/2024`
- European format: `15-01-2024`
- Datetime: `2024-01-15T10:30:00`

### Incomplete Script Output

**Mistake:** Writing very long scripts that may get truncated
**Solution:** Break into modular functions, verify complete code visibility

### Environment Path Issues

**Mistake:** Repeating PATH exports in every command
**Solution:** Set PATH once in a setup step or use absolute paths to executables

## Verification Strategies

### Output Validation Checklist

File existence check:

```python
import os

assert os.path.exists("output/merged_data.json")
assert os.path.exists("output/conflict_report.json")
```

Data completeness check:

```python
import json

with open("output/merged_data.json") as f:
    data = json.load(f)
```

```python
# Verify expected record count
assert len(data) == expected_count
```

Conflict report validation:

```python
with open("output/conflict_report.json") as f:
    conflicts = json.load(f)

# Verify conflict structure has required fields
for conflict in conflicts:
    assert "field" in conflict
    assert "selected" in conflict
    assert "sources" in conflict
```

Sample value verification:

```python
# Spot-check specific merged records
record = next(r for r in data if r["user_id"] == "expected_id")
assert record["field_name"] == "expected_value"
```

### Consolidate Verification

Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.

## Edge Cases to Consider

- Empty source files - Handle gracefully with appropriate warnings
- Missing merge keys - Decide whether to skip or error
- Type mismatches - Convert consistently (e.g., `user_id` as string vs integer)
- Null/None values - Determine handling in conflict resolution
- Unicode/encoding - Specify encoding when reading text-based formats
- Records in some sources but not others - Include partial records or require complete matches

## Field Mapping Example

When sources have different field names for the same concept:

```python
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email",
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email",
    },
}

def apply_mapping(df, source_name):
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
```

## Conflict Resolution Pattern

When the same field has different values across sources:

```python
def resolve_conflict(values_by_source, priority_order):
    """Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)
    if len(unique_values) > 1:
        conflict_info = {"sources": values_by_source, "resolved_by": "priority"}
    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info
    return None, conflict_info
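The "one reading function per format type" advice in Step 3 can be sketched as an extension-based dispatch table. This is an illustrative sketch, not part of the skill itself: the `READERS` table and `read_source` name are assumptions, and formats such as XML would need a task-specific reader added.

```python
import os

import pandas as pd

# One reader per format; binary formats go through the appropriate
# pandas engine instead of being read as text.
READERS = {
    ".json": pd.read_json,
    ".csv": pd.read_csv,
    ".parquet": lambda p: pd.read_parquet(p, engine="pyarrow"),
    ".xlsx": lambda p: pd.read_excel(p, engine="openpyxl"),
}

def read_source(path):
    """Dispatch to the right pandas reader based on file extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        reader = READERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported source format: {ext}")
    return reader(path)
```

Failing loudly on an unknown extension is deliberate: silently reading a binary file as text is the pitfall called out above.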
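The date-normalization pitfall above can be handled by trying a fixed list of formats in order. A minimal sketch, assuming the four formats listed in the pitfall; the `normalize_date` name and `DATE_FORMATS` table are illustrative:

```python
from datetime import datetime

# Candidate formats, tried in order; ISO first so unambiguous
# values are never misread by a looser pattern.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # datetime: 2024-01-15T10:30:00
    "%Y-%m-%d",           # ISO: 2024-01-15
    "%m/%d/%Y",           # US: 01/15/2024
    "%d-%m-%Y",           # European: 15-01-2024
]

def normalize_date(value):
    """Return the value as an ISO date string, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Note that truly ambiguous inputs (e.g. `01-02-2024`) cannot be resolved by format trial alone; the per-source format should be documented as one of the script's stated assumptions.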
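Tying the merge key, priority ordering, and conflict reporting together, a record-level driver might look like the following sketch. The `merge_by_key` helper and its plain-dict input shape are assumptions for illustration, not the skill's prescribed implementation:

```python
def merge_by_key(sources, priority_order, merge_key):
    """Merge lists of record dicts from several sources into one record per key.

    sources: dict mapping source name -> list of record dicts
    priority_order: source names from highest to lowest priority
    merge_key: field linking records across sources (e.g. "user_id")
    Returns (merged_records, conflicts).
    """
    # Group records by merge key, remembering which source each came from.
    by_key = {}
    for source, records in sources.items():
        for record in records:
            by_key.setdefault(record[merge_key], {})[source] = record

    merged, conflicts = [], []
    for key, records_by_source in by_key.items():
        out = {merge_key: key}
        fields = {f for r in records_by_source.values() for f in r} - {merge_key}
        for field in sorted(fields):
            values = {s: r.get(field) for s, r in records_by_source.items()}
            unique = {v for v in values.values() if v is not None}
            # Highest-priority non-null value wins.
            selected = next(
                (values[s] for s in priority_order if values.get(s) is not None),
                None,
            )
            if len(unique) > 1:
                conflicts.append(
                    {"field": field, "selected": selected, "sources": values}
                )
            out[field] = selected
        merged.append(out)
    return merged, conflicts
```

Records present in only some sources are included as partial records here; per the edge-case list above, that choice should be confirmed against the task requirements. The conflict entries carry the `field`, `selected`, and `sources` keys that the validation checklist asserts on.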
