check-metadata-typos

Repository: owid/etl
Installs: 45
Rank: #16530

Install

npx skills add https://github.com/owid/etl --skill check-metadata-typos

Check Metadata Typos

Check metadata files for spelling typos using comprehensive spell checking.

Scope Options

Ask the user which scope they want to check:

  1. Current step only - Ask the user to specify the step path (e.g., etl/steps/data/garden/energy/2025-06-27/electricity_mix)
  2. All ETL metadata - Check all active .meta.yml files in etl/steps/data/{garden,meadow,grapher}/ (automatically excludes ~3,570 archived steps)
  3. Snapshot metadata - Check all snapshot .dvc files in snapshots/ (~7,915 files)
  4. All metadata - Check both ETL steps and snapshot metadata files

Note: Archived steps and snapshots (defined in dag/archive/*.yml) are automatically excluded from checking as they are no longer actively maintained.

Implementation Strategy

0. Check codespell installation

IMPORTANT: Check if codespell is installed before attempting to use it. Since codespell is now a dev dependency in the project, it should already be installed, but verify first to avoid reinstalling unnecessarily.

    # Check if codespell is installed
    if ! .venv/bin/codespell --version &> /dev/null; then
        echo "codespell not found, installing..."
        uv add --dev codespell
    else
        echo "codespell is already installed"
    fi

If codespell is not installed and uv add --dev codespell fails, explain to the user how to install it manually.

1. Exclude archived steps and snapshots

IMPORTANT: Do not check archived steps and snapshots as they are no longer in use.

Archived steps and snapshots are defined in dag/archive/*.yml files:

  • ~3,570 deprecated steps (garden, meadow, grapher)
  • ~736 deprecated snapshots

To exclude them, extract their paths and create a list of active files:

    # Extract archived step paths to a file
    for step_type in garden meadow grapher; do
        grep -h "data://${step_type}/" dag/archive/*.yml 2>/dev/null | \
            grep -o "data://${step_type}/[^:]*" | \
            sed 's|data://|etl/steps/data/|' | \
            sed 's|$|.meta.yml|'
    done > /tmp/archived_files.txt

    # Extract archived snapshots (appended to the same file)
    grep -rh "snapshot://" dag/archive/*.yml 2>/dev/null | \
        grep -o "snapshot://[^:]*" | \
        sed 's|snapshot://|snapshots/|' | \
        sed 's|$|.dvc|' | \
        sort -u >> /tmp/archived_files.txt

    # Create list of all metadata files
    find etl/steps/data/garden -name "*.meta.yml" > /tmp/all_meta_files.txt
    find etl/steps/data/meadow -name "*.meta.yml" >> /tmp/all_meta_files.txt
    find etl/steps/data/grapher -name "*.meta.yml" >> /tmp/all_meta_files.txt
    find snapshots -name "*.dvc" >> /tmp/all_meta_files.txt

    # Filter out archived files
    grep -vFf /tmp/archived_files.txt /tmp/all_meta_files.txt > /tmp/active_meta_files.txt

    echo "Total files to check: $(wc -l < /tmp/active_meta_files.txt)"

2. Run codespell with ignore list and exclusions

Use the existing .codespell-ignore.txt file to filter out domain-specific terms.

For option 1 (current step only):

  • Ask the user to provide the step path (e.g., etl/steps/data/garden/energy/2025-06-27/electricity_mix)
  • Construct the full path to the metadata file: the step path followed by /*.meta.yml
  • Run codespell on that specific path:

    # For specific step (option 1)
    # Set STEP_PATH to the step path provided by the user,
    # e.g., etl/steps/data/garden/energy/2025-06-27/electricity_mix
    STEP_PATH=""
    .venv/bin/codespell "${STEP_PATH}"/*.meta.yml \
        --ignore-words=.codespell-ignore.txt

For option 2 (all ETL metadata - garden, meadow, grapher):

    # For all ETL step metadata (option 2)
    find etl/steps/data/garden -name "*.meta.yml" > /tmp/all_step_files.txt
    find etl/steps/data/meadow -name "*.meta.yml" >> /tmp/all_step_files.txt
    find etl/steps/data/grapher -name "*.meta.yml" >> /tmp/all_step_files.txt
    grep -vFf /tmp/archived_files.txt /tmp/all_step_files.txt > /tmp/active_step_files.txt
    cat /tmp/active_step_files.txt | xargs .venv/bin/codespell \
        --ignore-words=.codespell-ignore.txt

Note: Excluding archived steps reduces the scope by ~3,570 files and focuses on actively maintained metadata.

For option 3 (snapshot metadata):

    # For all snapshot metadata (option 3)
    find snapshots -name "*.dvc" > /tmp/all_snapshot_files.txt
    grep -vFf /tmp/archived_files.txt /tmp/all_snapshot_files.txt > /tmp/active_snapshot_files.txt
    cat /tmp/active_snapshot_files.txt | xargs .venv/bin/codespell \
        --ignore-words=.codespell-ignore.txt

Note: Snapshot .dvc files contain metadata in the meta.source.description and meta.source.published_by fields. ~736 archived snapshots are excluded.

For option 4 (all metadata):

    # For all metadata - ETL and snapshots (option 4)
    # Use the active_meta_files.txt created in step 1
    cat /tmp/active_meta_files.txt | xargs .venv/bin/codespell \
        --ignore-words=.codespell-ignore.txt

3. Parse and present results

Extract typos from codespell output and present them in a structured format:

  • Group by typo type (e.g., all instances of "seperate" → "separate")
  • Show file paths (as clickable links when possible)
  • Show line numbers
  • Show suggested corrections

Example output format:

    Found 15 typos across 8 files:

    Most common:
    - "inmigrant" → "immigrant" (5 occurrences in 2 files)
    - "seperate" → "separate" (3 occurrences in 1 file)
    - "accomodation" → "accommodation" (2 occurrences in 1 file)

    Detailed list:
    [file.meta.yml:123] inmigrant → immigrant
    [file.meta.yml:456] seperate → separate
    ...

4. Offer to fix typos

After presenting results, ask the user:

  • Fix all automatically? - Apply all suggested fixes
  • Review each typo? - Go through typos one by one for confirmation
  • Cancel - Exit without making changes

5. Apply fixes (if user confirms)

For automatic fixes:

    # Use sed or Python script to replace typos in files
    # Example (BSD/macOS sed; GNU sed takes -i without the empty '' argument):
    #   sed -i '' 's/seperate/separate/g' file.meta.yml
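The automatic replacement in step 5 could be wrapped in a small helper so the same command works on both macOS and Linux. This is a minimal sketch, not the skill's own implementation: fix_typo is a hypothetical name, and it assumes the typo and its correction contain no "/" or regular-expression metacharacters.

```shell
#!/bin/sh
# fix_typo (hypothetical helper): replace WRONG with RIGHT in each FILE.
# Usage: fix_typo WRONG RIGHT FILE...
# Assumption: WRONG and RIGHT are plain words without "/" or regex metacharacters.
fix_typo() {
    wrong=$1
    right=$2
    shift 2
    for file in "$@"; do
        # "-i.bak" (suffix attached, no space) is accepted by both GNU and BSD sed,
        # unlike "-i ''", which only works on BSD/macOS.
        sed -i.bak "s/${wrong}/${right}/g" "$file" && rm -f "${file}.bak"
    done
}

# Example call (commented out so that sourcing this sketch changes nothing):
# fix_typo seperate separate file.meta.yml
```

Because the substitution is a plain substring match, it also rewrites longer words that contain the typo (for example "seperated" becomes "separated"), which is usually what is wanted here but is worth checking in review mode.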

For reviewed fixes, confirm each change before applying.

6. Verify fixes

After applying fixes, re-run codespell to verify all typos were corrected:

    .venv/bin/codespell <path> --ignore-words=.codespell-ignore.txt

Should return 0 results.

7. Clean up

IMPORTANT: Delete any temporary files created during the check:

    rm -f /tmp/archived_files.txt /tmp/all_meta_files.txt /tmp/active_meta_files.txt \
        /tmp/all_step_files.txt /tmp/active_step_files.txt \
        /tmp/all_snapshot_files.txt /tmp/active_snapshot_files.txt \
        /tmp/codespell_output.txt

The only persistent files should be:

  • The .codespell-ignore.txt whitelist (if it doesn't exist, create it)
  • Modified .meta.yml files (if fixes were applied)

Do NOT create new persistent files in the repo like:

  • ❌ TYPO_CHECK_REPORT.md
  • ❌ scripts/analyze_typos.py
  • ❌ scripts/advanced_spell_checker.py

All analysis logic should be embedded in this command execution, not saved as separate files.
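The grouping asked for in step 3 can be done inline on codespell's output, without saving a script to the repo. The following is a rough sketch under one assumption: that each finding arrives in codespell's default one-line form, "path:line: typo ==> suggestion". The name summarize_typos is hypothetical.

```shell
#!/bin/sh
# summarize_typos (hypothetical helper): read codespell findings on stdin and
# print one line per distinct typo: OCCURRENCES <tab> FILES <tab> "typo ==> suggestion",
# most frequent first.
# Assumption: input lines look like "path:line: typo ==> suggestion".
summarize_typos() {
    awk -F': ' '
        NF >= 2 {
            file = $1
            sub(/:[0-9]+$/, "", file)      # strip the trailing ":line" to get the path
            fix = $2                       # "typo ==> suggestion"
            count[fix]++
            if (!((fix, file) in seen)) {  # count each file once per typo
                seen[fix, file] = 1
                files[fix]++
            }
        }
        END {
            for (fix in count)
                printf "%d\t%d\t%s\n", count[fix], files[fix], fix
        }
    ' | sort -t "$(printf '\t')" -k1,1nr -k3,3
}

# Example call (commented out; requires codespell and the step 1 file list):
# xargs .venv/bin/codespell --ignore-words=.codespell-ignore.txt \
#     < /tmp/active_meta_files.txt | summarize_typos
```

The tab-separated columns map directly onto the "Most common" lines of the example output format (occurrences, then number of files), while the unmodified codespell lines already provide the detailed list.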


Error Handling

  • Check if codespell is installed first (see step 0). If it is not installed and uv add --dev codespell fails, explain to the user how to install it manually with uv sync, or suggest that they check their Python environment
  • If no .meta.yml or .dvc files are found in the specified scope, inform the user
  • If codespell finds no typos, congratulate the user on clean metadata!
  • If file modification fails, report which files couldn't be updated
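The empty-scope case above can be caught before codespell is ever invoked. This is a small sketch only: require_files is a hypothetical helper, and the list file is assumed to be one of the /tmp/active_*_files.txt lists built earlier.

```shell
#!/bin/sh
# require_files (hypothetical helper): succeed only if the list file exists
# and names at least one file; otherwise print a message and return non-zero.
require_files() {
    list=$1
    if [ ! -s "$list" ]; then
        echo "No .meta.yml or .dvc files found in the selected scope." >&2
        return 1
    fi
    # tr strips the padding that some wc implementations add around the count.
    echo "Total files to check: $(wc -l < "$list" | tr -d ' ')"
}

# Example call (commented out; requires codespell and the step 1 file list):
# require_files /tmp/active_meta_files.txt &&
#     xargs .venv/bin/codespell --ignore-words=.codespell-ignore.txt \
#         < /tmp/active_meta_files.txt
```

Guarding the call this way also avoids running xargs on an empty list, where some xargs implementations would still start codespell once with no file arguments.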

Notes

  • Always use American English spelling (e.g., "combating" not "combatting")
  • Technical field names (like variable names with underscores) are typically safe to ignore
  • Acronyms in ALL CAPS should be ignored - they are almost always legitimate acronyms (e.g., TE, INE, DIEA)
  • URLs and domain names should be ignored - codespell may flag parts of URLs (e.g., "ine.es", "corona.fo") but these are correct
  • When in doubt about a flagged word, ask the user before fixing
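The all-caps rule above lends itself to a simple filter on codespell's output. A minimal sketch, assuming findings use the default "path:line: typo ==> suggestion" form and that codespell reports the flagged word with its original capitalization; drop_acronym_findings is a hypothetical name. URL fragments are not handled here, since the output line alone does not show whether the word came from a URL.

```shell
#!/bin/sh
# drop_acronym_findings (hypothetical helper): pass codespell findings through,
# dropping those whose flagged word consists only of upper-case ASCII letters.
# Assumption: input lines look like "path:line: typo ==> suggestion".
drop_acronym_findings() {
    LC_ALL=C awk -F': ' '
        {
            typo = $2
            sub(/ ==> .*/, "", typo)   # keep only the flagged word
            if (typo !~ /^[A-Z]+$/)    # not an ALL CAPS acronym: keep the finding
                print
        }
    '
}

# Example call (commented out; requires codespell and the step 1 file list):
# xargs .venv/bin/codespell --ignore-words=.codespell-ignore.txt \
#     < /tmp/active_meta_files.txt | drop_acronym_findings
```

Acronyms that recur across many files may be better added to .codespell-ignore.txt, so they stop being flagged at the source.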