# RAG Agent Builder
Build powerful Retrieval-Augmented Generation (RAG) applications that enhance LLM capabilities with external knowledge sources, enabling accurate, contextualized AI responses.
## Quick Start
Get started with the example implementations and utility scripts:
Examples: See the `examples/` directory for complete implementations:

- `basic_rag.py` - Simple chunk-embed-retrieve-generate pipeline
- `retrieval_strategies.py` - Hybrid search, reranking, and filtering
- `agentic_rag.py` - Agent-controlled retrieval with iterative refinement
Utilities: See the `scripts/` directory for helper modules:

- `embedding_management.py` - Embedding generation, normalization, and caching
- `vector_db_manager.py` - Vector database abstraction and factory
- `rag_evaluation.py` - Retrieval and answer quality metrics

## Overview
RAG systems combine three key components:
1. Document Retrieval - Find relevant information from knowledge bases
2. Context Integration - Pass retrieved context to the LLM
3. Response Generation - Generate answers grounded in the retrieved information
This skill covers building production-ready RAG applications with various frameworks and approaches.
## Core Concepts

### What is RAG?
RAG augments LLM knowledge with external data:
- Without RAG: The LLM relies only on its training data (which may be outdated or limited)
- With RAG: The LLM uses real-time, custom knowledge in addition to its training knowledge

### When to Use RAG

- Document Q&A: Answer questions about PDFs, books, reports
- Knowledge Base Search: Query internal documentation, wikis
- Enterprise Search: Search proprietary company data
- Context-Specific Assistants: Customer support, HR assistants
- Fact-Heavy Applications: Legal docs, medical records, financial data

### When RAG Might Not Be Needed

- General knowledge questions (ChatGPT-like)
- Real-time data that changes constantly (use tools instead)
- Very simple lookup tasks (use database queries)

## Architecture Patterns

### Basic RAG Pipeline

```
Documents → Chunks → Embeddings → Vector DB
                                      ↓
User Question → Embedding → Retrieval → LLM → Answer
                                ↑        ↓
                            Vector DB  Context
```
### Advanced RAG Patterns

1. Agentic RAG
   - Agent decides what to retrieve and when
   - Can refine queries iteratively
   - Better for complex reasoning

2. Hierarchical RAG
   - Multi-level document structure
   - Search at different levels of detail
   - More flexible organization

3. Hybrid Search RAG
   - Combines keyword search (BM25) + semantic search (embeddings)
   - Captures both exact matches and meaning
   - Better for mixed query types

4. Corrective RAG (CRAG)
   - Evaluates retrieved documents for relevance
   - Retrieves additional sources if needed
   - Ensures high-quality context

## Implementation Components

### 1. Document Processing
Chunking Strategies:
```python
# Simple fixed-size chunks
chunks = split_text(doc, chunk_size=1000, overlap=100)

# Semantic chunks (group by meaning)
chunks = semantic_chunking(doc, max_tokens=512)

# Hierarchical chunks (different levels)
chapters = split_by_heading(doc)
chunks = split_each_chapter(chapters, size=1000)
```
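The helpers above (`split_text`, `semantic_chunking`, `split_by_heading`) are illustrative rather than calls into a specific library. A minimal sketch of a fixed-size chunker with overlap, assuming word-level splitting is good enough for a first pass:

```python
def split_text(doc: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split a document into fixed-size word chunks with overlap.

    chunk_size and overlap are counted in words here; a production chunker
    would usually count tokens with the target model's tokenizer instead.
    """
    words = doc.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunk_words = words[start:start + chunk_size]
        if not chunk_words:
            break
        chunks.append(" ".join(chunk_words))
    return chunks


# Example: 100-word chunks that share 20 words with their neighbors
chunks = split_text("some long document text " * 200, chunk_size=100, overlap=20)
```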
Key Considerations:
- Chunk size affects retrieval quality and cost
- Overlap helps maintain context between chunks
- Semantic chunking preserves meaning better

### 2. Embedding Generation
Popular Embedding Models:
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Open Source: all-MiniLM-L6-v2, all-mpnet-base-v2
- Domain-Specific: Domain-trained embeddings for specialized knowledge
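A minimal sketch of generating and normalizing embeddings with the sentence-transformers library, using all-MiniLM-L6-v2 from the list above; the example sentences are placeholders:

```python
from sentence_transformers import SentenceTransformer

# Load once and reuse; the same model must embed both documents and queries
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "RAG combines retrieval with generation.",
    "Vector databases store and search embeddings.",
]

# normalize_embeddings=True returns unit-length vectors, so dot product == cosine similarity
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this model
```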
Best Practices:
- Use the same embedding model for indexing and for queries
- Store embeddings as normalized vectors
- Update embeddings when documents change

### 3. Vector Databases
Popular Options:
- Pinecone: Managed, serverless, easy to scale
- Weaviate: Open-source, self-hosted, flexible
- Milvus: Open-source, high performance
- Chroma: Lightweight, good for prototypes
- Qdrant: Production-grade, high-performance
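As one concrete example, a sketch of indexing and querying with Chroma (the lightweight option above); the collection name, ids, and documents are illustrative:

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data on disk
collection = client.create_collection("knowledge_base")

# Chroma embeds documents with its default embedding function unless one is supplied
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "RAG combines retrieval with generation.",
        "Vector databases store and search embeddings.",
    ],
    metadatas=[{"source": "intro.md"}, {"source": "storage.md"}],
)

results = collection.query(query_texts=["How does RAG work?"], n_results=2)
print(results["documents"][0])
```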
Selection Criteria:
- Scale requirements (data volume, queries per second)
- Latency needs (real-time vs batch)
- Cost considerations
- Deployment preferences (managed vs self-hosted)

### 4. Retrieval Strategies
Retrieval Methods:
```python
# Similarity search (most common)
results = vector_db.query(question_embedding, k=5)

# Hybrid search (keyword + semantic)
keyword_results = bm25.search(question, k=3)
semantic_results = vector_db.query(embedding, k=3)
results = combine_and_rank(keyword_results, semantic_results)

# Reranking (improve relevance)
retrieved = initial_retrieval(query)
reranked = rerank_by_relevance(retrieved, query)
```
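`combine_and_rank` and `rerank_by_relevance` above are placeholders, not library calls. One common way to merge a keyword result list and a semantic result list is reciprocal rank fusion; a sketch, assuming each list is already ordered by relevance and items are identified by a doc id:

```python
from collections import defaultdict

def combine_and_rank(keyword_results, semantic_results, k=60):
    """Merge two ranked lists of doc ids with reciprocal rank fusion (RRF).

    Each document scores the sum of 1 / (k + rank) over the lists it appears
    in; k=60 is the constant commonly used in the RRF literature.
    """
    scores = defaultdict(float)
    for results in (keyword_results, semantic_results):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: "doc3" ranks high in both lists, so it comes out first
print(combine_and_rank(["doc1", "doc3", "doc2"], ["doc3", "doc4", "doc1"]))
```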
Retrieval Parameters:
- k (number of results): Balance between context and relevance
- Similarity threshold: Filter out low-relevance results
- Diversity: Return varied results vs best matches

### 5. Context Integration
Context Window Management:
```python
# Fit retrieved documents into the context window
def prepare_context(retrieved_docs, max_tokens=3000):
    context = ""
    for doc in retrieved_docs:
        if len(tokenize(context + doc)) <= max_tokens:
            context += doc
        else:
            break
    return context
```
Prompt Design:
```
You are a helpful assistant. Answer the question based on the provided context.

Context:
{context}

Question: {user_question}

Answer:
```
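A sketch of assembling the final prompt from this template and calling a model, reusing `prepare_context` from above (whose `tokenize` call is still a placeholder for a tokenizer of your choice); the OpenAI client and model name are one possible backend, not a required choice:

```python
from openai import OpenAI

PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question based on the provided context.

Context:
{context}

Question: {user_question}

Answer:"""

def answer_question(question, retrieved_docs):
    # Trim the retrieved documents so the prompt fits the context window
    context = prepare_context(retrieved_docs, max_tokens=3000)
    prompt = PROMPT_TEMPLATE.format(context=context, user_question=question)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```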
### 6. Response Generation
Generation Strategies:
- Direct Generation: LLM answers from context
- Summarization: Summarize multiple retrieved docs first
- Fact-Grounding: Ensure answer cites sources
- Iterative Refinement: Refine based on user feedback

## Implementation Patterns

### Pattern 1: Basic RAG
Simplest RAG implementation:
1. Split documents into chunks
2. Generate embeddings for each chunk
3. Store in vector database
4. Retrieve top-k similar chunks for query
5. Pass to LLM with context
Pros: Simple, fast, works well for straightforward QA
Cons: May miss relevant context, no refinement
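A compact sketch of these five steps end to end, reusing the illustrative `split_text` chunker, sentence-transformers, and Chroma pieces from earlier; the model, collection name, and prompt wording are assumptions:

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.Client().create_collection("basic_rag")

def index_documents(docs):
    # Steps 1-3: chunk each document, embed the chunks, store them
    for i, doc in enumerate(docs):
        for j, chunk in enumerate(split_text(doc, chunk_size=200, overlap=20)):
            embedding = model.encode(chunk, normalize_embeddings=True).tolist()
            collection.add(ids=[f"{i}-{j}"], documents=[chunk], embeddings=[embedding])

def answer(question, k=5):
    # Step 4: retrieve the top-k most similar chunks
    query_embedding = model.encode(question, normalize_embeddings=True).tolist()
    hits = collection.query(query_embeddings=[query_embedding], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    # Step 5: hand the context to the LLM (generation call omitted; see the prompt template above)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```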
### Pattern 2: Agentic RAG
Agent controls retrieval:
1. Agent receives user question
2. Decides whether to retrieve documents
3. Formulates retrieval query (may differ from original)
4. Retrieves relevant documents
5. Can iterate or use tools
6. Generates final answer
Pros: Better for complex questions, iterative improvement
Cons: More complex, higher costs
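A sketch of the agent-controlled loop, assuming two hypothetical callables: `llm(prompt)` returning text and `retrieve(query, k)` returning document chunks; the RETRIEVE/ANSWER control format is illustrative, not a fixed recipe:

```python
def agentic_rag(question, llm, retrieve, max_rounds=3):
    """Let the model decide whether and what to retrieve before answering."""
    notes = []
    for _ in range(max_rounds):
        decision = llm(
            f"Question: {question}\nNotes so far: {notes}\n"
            "Reply 'RETRIEVE: <search query>' if more context is needed, otherwise 'ANSWER'."
        )
        if not decision.startswith("RETRIEVE:"):
            break
        # The agent-formulated query may differ from the original question
        query = decision.removeprefix("RETRIEVE:").strip()
        notes.extend(retrieve(query, k=3))
    return llm(f"Answer the question using these notes:\n{notes}\n\nQuestion: {question}")
```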
### Pattern 3: Corrective RAG (CRAG)
Validates retrieved documents:
1. Retrieve documents for question
2. Grade each document for relevance
3. If poor relevance:
   - Try different retrieval strategy
   - Expand search scope
   - Retrieve from different sources
4. Generate answer from validated context
Pros: Higher quality answers, adapts to failures
Cons: More API calls, slower
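A sketch of the grading-and-retry step, again assuming hypothetical `llm` and `retrieve` callables; the 0-1 scoring prompt, the 0.5 threshold, and the "retrieve more candidates" fallback are assumptions:

```python
def corrective_rag(question, llm, retrieve, threshold=0.5):
    docs = retrieve(question, k=5)

    def grade(doc):
        # Ask the LLM for a 0-1 relevance score; fall back to 0 if the reply isn't a number
        reply = llm(
            f"Score from 0 to 1 how relevant this passage is to the question.\n"
            f"Question: {question}\nPassage: {doc}\nScore:"
        )
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0

    relevant = [d for d in docs if grade(d) >= threshold]
    if not relevant:
        # Poor relevance: broaden the search (here simply by retrieving more candidates)
        relevant = retrieve(question, k=15)

    context = "\n\n".join(relevant)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```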
## Popular Frameworks

### LangChain

```python
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

# Load documents
loader = PyPDFLoader("document.pdf")
docs = loader.load()

# Create RAG chain (assumes Pinecone credentials are configured and an index named "rag-index" exists)
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(docs, embeddings, index_name="rag-index")
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

answer = qa.run("What is the document about?")
```
### LlamaIndex

```python
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index
index = GPTVectorStoreIndex.from_documents(documents)

# Query
response = index.as_query_engine().query("What is the main topic?")
```
### CrewAI with RAG

```python
from crewai import Agent, Task, Crew
from tools import retrieval_tool

researcher = Agent(
    role="Research Assistant",
    goal="Research topics using knowledge base",
    tools=[retrieval_tool]
)

research_task = Task(
    description="Research the topic: {topic}",
    agent=researcher
)
```
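The snippet above defines the agent and task but never assembles a crew; a minimal sketch of wiring them together and running it (required Agent/Task fields such as `backstory` and `expected_output` vary across CrewAI versions, so treat this as a sketch):

```python
crew = Crew(agents=[researcher], tasks=[research_task])

# kickoff() runs the tasks; inputs fill template variables such as {topic}
result = crew.kickoff(inputs={"topic": "vector database trade-offs"})
print(result)
```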
## Best Practices

### Document Preparation

✓ Clean and normalize text (remove headers, footers)
✓ Preserve document structure when possible
✓ Add metadata (source, date, category)
✓ Handle PDFs with OCR if scanned
✓ Test chunk sizes for your domain

### Embedding Strategy

✓ Use the same embedding model for indexing and queries
✓ Fine-tune embeddings for domain-specific needs
✓ Normalize embeddings for consistency
✓ Monitor embedding quality metrics

### Retrieval Optimization

✓ Tune k (number of results) for your use case
✓ Use reranking for quality improvement
✓ Implement relevance filtering
✓ Monitor retrieval precision and recall
✓ Cache frequently retrieved documents

### Generation Quality

✓ Include source citations in answers
✓ Prompt the LLM to indicate confidence
✓ Ask it to cite specific documents
✓ Generate summaries for long contexts
✓ Validate answers against the context

### Monitoring & Evaluation

✓ Track retrieval metrics (precision, recall, MRR)
✓ Monitor answer quality and relevance
✓ Log failed retrievals for improvement
✓ Collect user feedback
✓ Iterate based on failures

## Common Challenges & Solutions

### Challenge: Irrelevant Retrieval
Solutions:
- Improve chunking strategy
- Use a better embedding model
- Add document metadata to queries
- Implement reranking
- Use hybrid search

### Challenge: Context Too Large
Solutions:
- Reduce chunk size
- Retrieve fewer results (smaller k)
- Summarize retrieved context
- Use hierarchical retrieval
- Filter by relevance score

### Challenge: Missing Information
Solutions:
- Increase k (retrieve more)
- Improve the embedding model
- Better preprocessing
- Use multiple search strategies
- Add document hierarchy

### Challenge: Slow Performance
Solutions:
- Use a managed vector database
- Cache embeddings
- Batch process documents
- Optimize chunk size
- Use a smaller embedding model for speed

## Evaluation Metrics
Retrieval Metrics:
- Precision: % of retrieved docs that are relevant
- Recall: % of relevant docs that are retrieved
- MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result across queries
- NDCG (Normalized DCG): Quality of the ranking, weighted toward top positions
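A sketch of computing the first three per query, given the ranked list of retrieved ids and the set of relevant ids (NDCG omitted for brevity):

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict:
    """Precision, recall, and reciprocal rank for a single query.

    Average the reciprocal ranks over many queries to get MRR.
    """
    hits = [doc_id for doc_id in retrieved if doc_id in relevant]
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "reciprocal_rank": reciprocal_rank,
    }


# Example: first relevant hit at rank 2 → reciprocal rank 0.5
print(retrieval_metrics(["d7", "d2", "d9"], relevant={"d2", "d4"}))
```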
Answer Quality Metrics:
- Relevance: Does the answer address the question?
- Correctness: Is the answer factually accurate?
- Grounding: Is the answer supported by the context?
- User Satisfaction: Would the user find the answer helpful?

## Advanced Techniques

### 1. Query Expansion
```python
# Expand the query with related terms
expanded_query = query + " " + synonym_expansion(query)
results = retrieve(expanded_query)
```
### 2. Document Compression
```python
# Compress retrieved docs before passing them to the LLM
compressed = compress_documents(retrieved_docs, query)
context = format_context(compressed)
```
### 3. Active Retrieval
```python
# Iteratively refine retrieval based on the LLM output
query = user_question
iterations = 0
while iterations < max_iterations:
    results = retrieve(query)
    answer = generate_with_context(results)
    if answer_complete(answer):
        break
    query = refine_query(answer)
    iterations += 1
```
### 4. Multi-Modal RAG
```python
# Retrieve both text and images
text_results = text_retriever.query(question)
image_results = image_retriever.query(question)
context = combine_multimodal(text_results, image_results)
```
## Resources & References

### Key Papers

- "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.)
- "REALM: Retrieval-Augmented Language Model Pre-Training" (Guu et al.)

### Frameworks

- LangChain: https://python.langchain.com/
- LlamaIndex: https://www.llamaindex.ai/
- Haystack: https://haystack.deepset.ai/

### Vector Databases

- Pinecone: https://www.pinecone.io/
- Weaviate: https://weaviate.io/
- Qdrant: https://qdrant.tech/

### Embedding Models

- OpenAI: https://platform.openai.com/docs/guides/embeddings
- Hugging Face: https://huggingface.co/models?pipeline_tag=sentence-similarity

## Next Steps

1. Choose your stack: Decide on a framework (LangChain, LlamaIndex, etc.)
2. Prepare documents: Process and chunk your knowledge base
3. Select embeddings: Choose an embedding model for your domain
4. Pick a vector DB: Select a storage solution for your scale
5. Build the pipeline: Implement retrieval and generation
6. Evaluate: Test on sample questions and iterate
7. Monitor: Track quality metrics in production