# RAG Pipeline Builder

Installs: 51
Rank: #14628

## Install

```bash
npx skills add https://github.com/eddiebe147/claude-settings --skill 'RAG Pipeline Builder'
```

The RAG Pipeline Builder skill guides you through designing and implementing Retrieval-Augmented Generation (RAG) systems that enhance LLM responses with relevant context from your own data. RAG combines the power of large language models with the precision of information retrieval, reducing hallucinations and enabling AI to work with private, current, or domain-specific knowledge.

This skill covers the complete RAG stack: document ingestion, chunking strategies, embedding generation, vector storage, retrieval optimization, context injection, and response generation. It helps you make informed decisions at each stage based on your specific requirements for accuracy, latency, cost, and scale.

Whether you are building a documentation Q&A bot, a customer support system, or an enterprise knowledge assistant, this skill ensures your RAG implementation follows production best practices.

## Core Workflows

### Workflow 1: Design RAG Architecture

1. Define requirements:
   - Data sources and formats
   - Query types and patterns
   - Accuracy requirements
   - Latency budget
   - Scale expectations
2. Choose components:
   - Document loaders
   - Chunking strategy
   - Embedding model
   - Vector database
   - LLM for generation
   - Reranking layer (optional)
3. Design data flow:

   ```
   Documents → Loader → Chunker → Embedder → Vector DB
                                                 ↓
   Query → Embedder → Vector Search → Reranker → Context
                                                 ↓
                        Context + Query → LLM → Response
   ```

4. Document architecture decisions

### Workflow 2: Implement Ingestion Pipeline

1. Set up document loaders:
   - PDF, Markdown, HTML parsers
   - API connectors for live sources
   - Incremental update handling
2. Implement chunking (a `sliding_window` sketch follows the function):

   ```python
   def smart_chunk(doc, chunk_size=500, overlap=50):
       # Respect document structure
       sections = extract_sections(doc)
       chunks = []
       for section in sections:
           if len(section) > chunk_size:
               chunks.extend(sliding_window(section, chunk_size, overlap))
           else:
               chunks.append(section)
       return add_metadata(chunks, doc)
   ```
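The `smart_chunk` example calls a `sliding_window` helper it does not define. A minimal character-based sketch, assuming `chunk_size` and `overlap` are measured in characters:

```python
def sliding_window(text, chunk_size=500, overlap=50):
    # Step forward by (chunk_size - overlap) so consecutive chunks
    # share `overlap` characters of context at their boundary.
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```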
3. Generate embeddings with batching (see the sketch after this list)
4. Store in vector database with metadata
5. Verify ingestion quality
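A minimal sketch of batched embedding generation. Here `embed_batch` is an assumed callable you supply as a wrapper around your provider's embeddings API, and the batch size of 64 is a placeholder to tune against that provider's limits:

```python
def embed_documents(chunks, embed_batch, batch_size=64):
    # Batching cuts per-request overhead and keeps you under rate limits.
    # `embed_batch` (assumed): maps a list of strings to a list of vectors.
    vectors = []
    for start in range(0, len(chunks), batch_size):
        vectors.extend(embed_batch(chunks[start:start + batch_size]))
    return vectors
```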
### Workflow 3: Optimize Retrieval Quality

1. Measure baseline retrieval performance (metric sketches follow this list):
   - Recall@k for known queries
   - Mean Reciprocal Rank (MRR)
   - Relevance scoring
2. Apply optimization techniques:
   - Query expansion/rewriting
   - Hybrid search (semantic + keyword)
   - Reranking with cross-encoders
   - Metadata filtering
3. Tune retrieval parameters:
   - Number of chunks to retrieve (k)
   - Similarity threshold
   - Diversity/MMR settings
4. Validate improvements with a test set
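Sketches of the two standard metrics above, assuming each query has a ranked list of retrieved chunk ids plus a set of known-relevant ids:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of known-relevant chunks that appear in the top-k results.
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mean_reciprocal_rank(runs):
    # `runs` is a list of (retrieved_ids, relevant_ids) pairs, one per query.
    # Each query contributes 1/rank of its first relevant hit (0 if none).
    total = 0.0
    for retrieved_ids, relevant_ids in runs:
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in set(relevant_ids):
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0
```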
## Quick Reference

| Action | Command/Trigger |
| --- | --- |
| Design RAG system | "Help me design a RAG pipeline for [use case]" |
| Choose vector DB | "Which vector database for RAG" |
| Optimize chunking | "Best chunking strategy for [content type]" |
| Improve retrieval | "My RAG has poor retrieval quality" |
| Reduce hallucinations | "RAG still hallucinating, help fix" |
| Scale pipeline | "Scale RAG to [X] documents" |
## Best Practices

### Chunk at Semantic Boundaries

Preserve meaning in chunks:

- Good: Split at paragraphs, sections, or topic boundaries
- Bad: Fixed-size splits that cut sentences mid-thought
- Include section headers as context in chunks (see the sketch after this list)
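One way to carry header context into chunks, assuming `paragraphs` is a section already split at paragraph boundaries:

```python
def chunks_with_headers(section_title, paragraphs):
    # Prefix every chunk with its section title so a chunk retrieved
    # in isolation still tells the LLM where it came from.
    return [f"{section_title}\n\n{p}" for p in paragraphs]
```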
### Include Rich Metadata

Enable filtering and context:

- Source document, section, page number
- Timestamps for temporal relevance
- Categories, tags, or topics
- Use metadata filters before semantic search (sketched below)
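A sketch of filter-then-search; `vector_db.search` and its `filters` argument are hypothetical stand-ins, though most vector stores (Chroma, Qdrant, Pinecone, Weaviate) expose an equivalent:

```python
def filtered_search(vector_db, query_vector, top_k=5):
    # Restrict candidates by metadata first, then rank the survivors
    # by vector similarity. The filter values here are illustrative.
    return vector_db.search(
        vector=query_vector,
        top_k=top_k,
        filters={"category": "billing", "updated_after": "2024-01-01"},
    )
```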
### Use Hybrid Search

Combine semantic and keyword search:

- Semantic: Captures meaning and synonyms
- Keyword (BM25): Catches exact terms, names, codes
- Weight combination based on query type (one weighting sketch follows)
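A minimal score-fusion sketch, assuming each retriever returns a `doc_id -> score` mapping and that `alpha = 0.7` is just a starting point to tune per query type:

```python
def hybrid_scores(semantic, keyword, alpha=0.7):
    # Min-max normalize each score set so the two scales are comparable,
    # then blend: alpha weights the semantic side.
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    return {
        d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
        for d in set(sem) | set(kw)
    }
```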
### Rerank for Quality

Two-stage retrieval improves precision:

- Stage 1: Fast vector search (retrieve 20-50)
- Stage 2: Cross-encoder reranking (keep top 5-10, sketched below)
- Reranking is slower but much more accurate
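A stage-2 sketch using the `sentence-transformers` cross-encoder API; the model name is one common public checkpoint, not something this skill prescribes:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_n=5):
    # Score every (query, chunk) pair jointly -- slower than vector
    # search, but far better at fine-grained relevance.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```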
### Show Your Work

Include citations and sources:

- Return source chunks with responses (sketched below)
- Enable users to verify and explore
- Build trust through transparency
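A sketch of returning sources alongside the answer; `retriever`, `llm`, and the chunk attributes are hypothetical interfaces in the spirit of the Agentic RAG example later in this skill:

```python
def answer_with_sources(query, retriever, llm, k=5):
    # Number the retrieved chunks, answer from them, and hand the
    # numbered sources back so users can verify every claim.
    chunks = retriever.search(query)[:k]
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))
    answer = llm.synthesize(query, context)
    sources = [{"ref": i, "source": c.metadata.get("source")}
               for i, c in enumerate(chunks, start=1)]
    return {"answer": answer, "sources": sources}
```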
### Handle Edge Cases

What happens when retrieval fails?

- No relevant results found
- Conflicting information in sources
- Query outside knowledge base scope

Implement graceful fallbacks for each case.

## Advanced Techniques

### Multi-Index Strategy

Use different indexes for different content types:

- Index 1: FAQs (short, self-contained)
- Index 2: Documentation (long-form, structured)
- Index 3: Conversations (temporal, contextual)

Route queries to the appropriate index based on intent.

### Query Transformation Pipeline

Improve retrieval with query processing:

```python
def transform_query(query):
    # Step 1: Classify query type
    query_type = classify_query(query)

    # Step 2: Extract entities
    entities = extract_entities(query)

    # Step 3: Generate search queries
    if query_type == "factual":
        return generate_keyword_queries(query, entities)
    elif query_type == "conceptual":
        return generate_semantic_queries(query)
    else:
        return [query]  # Use the query as-is
```

### Contextual Compression

Reduce noise in retrieved context:

```
Retrieved chunks (verbose) → LLM compressor → Relevant excerpts only
```
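A minimal compressor sketch; the prompt wording and the `llm.complete` call are assumptions, not an API this skill defines:

```python
def compress_context(query, chunks, llm):
    # Keep only the sentences from each chunk that bear on the query,
    # shrinking the prompt before generation.
    compressed = []
    for chunk in chunks:
        excerpt = llm.complete(
            "From the passage below, copy only the sentences that help "
            "answer the question. Reply with NONE if nothing applies.\n\n"
            f"Question: {query}\n\nPassage: {chunk}"
        ).strip()
        if excerpt and excerpt != "NONE":
            compressed.append(excerpt)
    return compressed
```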

### Agentic RAG

Let the LLM control retrieval:

```python
def agentic_rag(query):
    # LLM decides what to search for
    search_plan = llm.plan_searches(query)

    # Execute searches
    results = []
    for search in search_plan:
        results.extend(retriever.search(search.query, filters=search.filters))

    # LLM synthesizes answer
    return llm.synthesize(query, results)
```
## Evaluation Framework

Continuously measure RAG quality:

Metrics:

- Retrieval: Precision@k, Recall@k, MRR
- Generation: Faithfulness, Answer Relevance, Context Utilization
- End-to-end: Task Success Rate, User Satisfaction

Tools: Ragas, TruLens, LangSmith

## Common Pitfalls to Avoid

- Chunking too large (loses specificity) or too small (loses context)
- Not preserving document structure and hierarchy in chunks
- Ignoring keyword search when exact matches matter
- Retrieving too few chunks (missing information) or too many (context dilution)
- Not handling conflicting information across sources
- Assuming the LLM will always use retrieved context correctly
- Skipping evaluation and monitoring in production
- Not updating embeddings when source documents change
