# Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.

## When to Activate

- Building file processing pipelines (PDF, images, text extraction)
- Processing cost is high and the same files are processed repeatedly
- Need a `--cache/--no-cache` CLI option
- Want to add caching to existing pure functions without modifying them

## Core Pattern

### 1. Content-Hash Based Cache Key

Use file content (not path) as the cache key:

```python
import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files


def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()
```

**Why content hash?** File rename/move = cache hit. Content change = automatic invalidation. No index file needed.

### 2. Frozen Dataclass for Cache Entry

```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument  # The cached result
```
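To make the "rename = cache hit, content change = invalidation" claim from step 1 concrete, here is a minimal sketch that hashes a temporary file, renames it, and then edits it. The helper `demo_rename_and_edit` and the file names are hypothetical and exist only for illustration; `compute_file_hash` is a condensed copy of the function above so the block runs on its own.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_file_hash(path: Path, chunk_size: int = 65536) -> str:
    """Chunked SHA-256 of a file's bytes (condensed copy of the step 1 function)."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha256.update(chunk)
    return sha256.hexdigest()


def demo_rename_and_edit() -> tuple[str, str, str]:
    """Return (hash before rename, hash after rename, hash after edit)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report.txt"  # hypothetical file name
        original.write_bytes(b"quarterly numbers")
        before = compute_file_hash(original)

        # Renaming changes the path but not the bytes.
        renamed = original.rename(Path(tmp) / "report-final.txt")
        after_rename = compute_file_hash(renamed)

        # Editing changes the bytes, so the cache key changes too.
        renamed.write_bytes(b"quarterly numbers, revised")
        after_edit = compute_file_hash(renamed)
    return before, after_rename, after_edit


if __name__ == "__main__":
    before, after_rename, after_edit = demo_rename_and_edit()
    print(before == after_rename)  # same bytes, same key
    print(before == after_edit)    # different bytes, different key
```

Because the key depends only on the bytes, two different files with identical content would also share one cache entry, which is usually the desired behavior for this pattern.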

### 3. File-Based Cache Storage

Each cache entry is stored as `{hash}.json`: O(1) lookup by hash, no index file required.

```python
import json
from typing import Any


def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")


def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # Treat corruption as cache miss
```

### 4. Service Layer Wrapper (SRP)

Keep the processing function pure. Add caching as a separate service layer.

```python
import logging

logger = logging.getLogger(__name__)


def extract_with_cache(
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: cache check -> extraction -> cache write."""
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
```

## Key Design Decisions

| Decision | Rationale |
|---|---|
| SHA-256 content hash | Path-independent, auto-invalidates on content change |
| `{hash}.json` file naming | O(1) lookup, no index file needed |
| Service layer wrapper | SRP: extraction stays pure, cache is a separate concern |
| Manual JSON serialization | Full control over frozen dataclass serialization |
| Corruption returns `None` | Graceful degradation: the file is re-processed and the entry rewritten |
| `cache_dir.mkdir(parents=True)` | Lazy directory creation on first write |

## Best Practices

- **Hash content, not paths**: paths change, content identity doesn't
- **Chunk large files** when hashing: avoid loading entire files into memory
- **Keep processing functions pure**: they should know nothing about caching
- **Log cache hit/miss** with truncated hashes for debugging
- **Handle corruption gracefully**: treat invalid cache entries as misses, never crash

## Anti-Patterns to Avoid

```python
# BAD: Path-based caching (breaks on file move/rename)
cache = {"/path/to/file.pdf": result}

# BAD: Adding cache logic inside the processing function (SRP violation)
def extract_text(path, *, cache_enabled=False, cache_dir=None):
    if cache_enabled:  # Now this function has two responsibilities
        ...

# BAD: Using dataclasses.asdict() with nested frozen dataclasses
# (can cause issues with complex nested types)
data = dataclasses.asdict(entry)  # Use manual serialization instead
```

## When to Use

- File processing pipelines (PDF parsing, OCR, text extraction, image analysis)
- CLI tools that benefit from `--cache/--no-cache` options
- Batch processing where the same files appear across runs
- Adding caching to existing pure functions without modifying them

## When NOT to Use

- Data that must always be fresh (real-time feeds)
- Cache entries that would be extremely large (consider streaming instead)
- Results that depend on parameters beyond file content (e.g., different extraction configs)
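The `--cache/--no-cache` option mentioned above is not shown anywhere in the pattern, so here is one way it might be wired up with the standard library's `argparse`. The program name `extract` and the `--cache-dir` option are assumptions made for illustration; only the `--cache/--no-cache` pair comes from the text.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="extract")  # hypothetical program name
    parser.add_argument("file", type=Path, help="file to process")
    # BooleanOptionalAction (Python 3.9+) generates both --cache and --no-cache.
    parser.add_argument(
        "--cache",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="reuse cached results when available",
    )
    parser.add_argument(
        "--cache-dir",  # hypothetical option, not part of the original pattern
        type=Path,
        default=Path(".cache"),
        help="directory holding the {hash}.json entries",
    )
    return parser


# Parsing an explicit argument list keeps the sketch independent of sys.argv.
args = build_parser().parse_args(["report.pdf", "--no-cache"])
print(args.file, args.cache, args.cache_dir)
```

The parsed values map directly onto the service layer's keyword arguments, for example `extract_with_cache(args.file, cache_enabled=args.cache, cache_dir=args.cache_dir)`, so the processing function itself still knows nothing about the CLI or the cache.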
