chunking-strategy

Installs: 349
Rank: #2664

Install

npx skills add https://github.com/giuseppe-trisciuoglio/developer-kit --skill chunking-strategy
Chunking Strategy for RAG Systems
Overview
Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.
When to Use
Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.
Instructions
Choose Chunking Strategy
Select an appropriate chunking strategy based on the document type and use case:
Fixed-Size Chunking (Level 1)
Use for simple documents without clear structure
Start with 512 tokens and 10-20% overlap
Adjust size based on query type: 256 for factoid, 1024 for analytical
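To make these numbers concrete, here is a minimal fixed-size sketch. It assumes LangChain's `TokenTextSplitter` (which measures chunk size in tokens rather than characters) and a raw `document_text` string; swap `chunk_size` to 256 or 1024 depending on the query type, per the guidance above.

```python
from langchain.text_splitter import TokenTextSplitter

# 512-token chunks with ~15% overlap (middle of the 10-20% range)
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=76)
chunks = splitter.split_text(document_text)  # document_text: raw string to chunk
```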
Recursive Character Chunking (Level 2)
Use for documents with clear structural boundaries
Implement hierarchical separators: paragraphs → sentences → words
Customize separators for document types (HTML, Markdown)
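One way to implement the hierarchical separators is sketched below with LangChain's `RecursiveCharacterTextSplitter`; the separator lists (and the Markdown-oriented variant) are illustrative assumptions, not the only reasonable choices.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Generic hierarchy: paragraphs -> lines -> words -> characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", " ", ""],
)

# Markdown variant: prefer splitting at headings before falling back to paragraphs
md_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],
)
chunks = md_splitter.split_text(markdown_text)  # markdown_text: assumed input string
```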
Structure-Aware Chunking (Level 3)
Use for structured documents (Markdown, code, tables, PDFs)
Preserve semantic units: functions, sections, table blocks
Validate structure preservation post-splitting
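A code-oriented example (splitting Python source on functions and classes with the `ast` module) appears in the Examples section below. For Markdown, one possible structure-aware approach is sketched here, assuming LangChain's `MarkdownHeaderTextSplitter`: each chunk stays within a single section and carries its heading path, which helps with post-split validation.

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on heading levels so each chunk stays within one document section
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = md_splitter.split_text(markdown_text)  # markdown_text: assumed input

# Each returned Document keeps its heading path in metadata, useful for
# checking that section boundaries survived the split.
for doc in sections:
    print(doc.metadata, len(doc.page_content))
```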
Semantic Chunking (Level 4)
Use for complex documents with thematic shifts
Implement embedding-based boundary detection
Configure similarity threshold (0.8) and buffer size (3-5 sentences)
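A bare-bones sentence-to-sentence version appears in the Examples section; the sketch below adds the buffer idea, comparing each new sentence against a rolling window of the previous few sentences. `embed` is a placeholder for whatever embedding function you use (assumed to return a 1-D numpy vector).

```python
import numpy as np

def semantic_chunk_buffered(sentences, embed, similarity_threshold=0.8, buffer_size=3):
    """Start a new chunk where a sentence drifts from its recent context.

    `embed` is an assumed callable: str -> 1-D numpy vector.
    """
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare against a window of the last `buffer_size` sentences,
        # which is less noisy than adjacent-sentence comparison alone.
        window = " ".join(sentences[max(0, i - buffer_size):i])
        a, b = embed(window), embed(sentences[i])
        similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if similarity < similarity_threshold:
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```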
Advanced Methods (Level 5)
Use Late Chunking for long-context embedding models
Apply Contextual Retrieval for high-precision requirements
Monitor computational costs vs. retrieval improvements
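To make Contextual Retrieval concrete, here is a minimal sketch: each chunk gets a short, LLM-generated description of where it sits in the source document prepended before embedding. `generate_context` is a hypothetical callable standing in for your model call; the extra LLM invocation per chunk is exactly the computational cost to monitor.

```python
def contextualize_chunks(document_text, chunks, generate_context):
    """Prepend document-level context to each chunk before embedding/indexing.

    `generate_context(document_text, chunk)` is an assumed LLM call returning
    one or two sentences situating the chunk within the whole document.
    """
    contextualized = []
    for chunk in chunks:
        context = generate_context(document_text, chunk)
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized
```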
Reference detailed strategy implementations in references/strategies.md.
Implement Chunking Pipeline
Follow these steps to implement effective chunking:
Pre-process documents
Analyze document structure and content types
Identify multi-modal content (tables, images, code)
Assess information density and complexity
Select strategy parameters
Choose chunk size based on embedding model context window
Set overlap percentage (10-20% for most cases)
Configure strategy-specific parameters
Process and validate
Apply chosen chunking strategy
Validate semantic coherence of chunks
Test with representative documents
Evaluate and iterate
Measure retrieval precision and recall (a minimal sketch follows this list)
Monitor processing latency and resource usage
Optimize based on specific use case requirements
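A minimal sketch of the precision/recall measurement, assuming you have a per-query gold set of relevant chunk IDs to compare against what the retriever returns:

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Per-query retrieval precision and recall over chunk IDs."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 2 of 4 retrieved chunks are relevant; 2 of 3 relevant chunks retrieved
p, r = retrieval_precision_recall(["c1", "c2", "c7", "c9"], ["c2", "c7", "c5"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```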
Reference detailed implementation guidelines in references/implementation.md.
Evaluate Performance
Use these metrics to evaluate chunking effectiveness:
Retrieval Precision
Fraction of retrieved chunks that are relevant
Retrieval Recall
Fraction of relevant chunks that are retrieved
End-to-End Accuracy
Quality of final RAG responses
Processing Time
Latency impact on overall system
Resource Usage
Memory and computational costs
Reference detailed evaluation framework in references/evaluation.md.
Examples
Basic Fixed-Size Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len,
)
chunks = splitter.split_documents(documents)
```
Structure-Aware Code Chunking
```python
def chunk_python_code(code):
    """Split Python code into semantic chunks"""
    import ast
    tree = ast.parse(code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks
```
Semantic Chunking with Embeddings
```python
def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text based on semantic boundaries"""
    sentences = split_into_sentences(text)
    embeddings = generate_embeddings(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
Best Practices
Core Principles
Balance context preservation with retrieval precision
Maintain semantic coherence within chunks
Optimize for embedding model constraints
Preserve document structure when beneficial
Implementation Guidelines
Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
Test thoroughly with representative documents
Monitor both accuracy metrics and computational costs
Iterate based on specific document characteristics
Common Pitfalls to Avoid
Over-chunking: creating too many small, context-poor chunks
Under-chunking: missing relevant information due to oversized chunks
Ignoring document structure and semantic boundaries
Using a one-size-fits-all approach for diverse content types
Neglecting overlap for boundary-crossing information
Constraints and Warnings
Resource Considerations
Semantic and contextual methods require significant computational resources
Late chunking needs long-context embedding models
Complex strategies increase processing latency
Monitor memory usage for large document processing
Quality Requirements
Validate chunk semantic coherence post-processing
Test with domain-specific documents before deployment
Ensure chunks maintain standalone meaning where possible
Implement proper error handling for edge cases
References
Reference detailed documentation in the references/ folder:
strategies.md - Detailed strategy implementations
implementation.md - Complete implementation guidelines
evaluation.md - Performance evaluation framework
tools.md - Recommended libraries and frameworks
research.md - Key research papers and findings
advanced-strategies.md - 11 comprehensive chunking methods
semantic-methods.md - Semantic and contextual approaches
visualization-tools.md - Evaluation and visualization tools
