RAG Architecture Patterns for Enterprise Knowledge Bases

From naive RAG to advanced hybrid retrieval — a comprehensive guide to building accurate, scalable knowledge retrieval systems.

December 2025 · 14 min read

Beyond Naive RAG

Retrieval-Augmented Generation has become the standard pattern for grounding LLM responses in factual data. But the gap between a tutorial RAG demo and a production enterprise system is enormous. Naive RAG — embed documents, find nearest vectors, stuff into prompt — works for demos but fails in enterprise settings due to poor recall, hallucinations from irrelevant context, and inability to handle complex queries.

This article walks through the architecture patterns that bridge that gap, from foundational improvements to advanced techniques we deploy at MBB AI Studio.

Pattern 1: Chunking Strategy

How you split documents matters more than which embedding model you use. Common chunking mistakes include:

  • Fixed-size chunks with arbitrary boundaries that split sentences and lose context
  • Chunks that are too small (< 100 tokens) losing semantic meaning
  • Chunks that are too large (> 1000 tokens) diluting the relevant information

Effective chunking strategies for enterprise documents:

Semantic chunking — Split based on topic shifts detected by embedding similarity between consecutive paragraphs:

python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
chunks = splitter.split_documents(documents)

Document-structure-aware chunking — Respect document hierarchy (headers, sections, lists). For PDFs, use layout-aware parsers like Unstructured or Docling that preserve heading structure:

python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",
    chunking_strategy="by_title",
    max_characters=1500,
    combine_text_under_n_chars=200,
)

Parent-child chunking — Index small chunks for precise retrieval but return the parent (larger) chunk for context. This gives you the best of both worlds: high-precision matching with sufficient context for generation.
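In LangChain this pattern is packaged as ParentDocumentRetriever; the dependency-free sketch below illustrates the mechanism itself. The class and its naive keyword-overlap scoring are illustrative stand-ins for a real vector index, not a production implementation:

```python
# Parent-child chunking sketch: index small child chunks for matching,
# but return the larger parent chunk for generation context.
# Keyword overlap stands in for vector similarity here.
from dataclasses import dataclass, field


@dataclass
class ParentChildIndex:
    parents: list[str] = field(default_factory=list)
    children: list[tuple[str, int]] = field(default_factory=list)  # (child_text, parent_idx)

    def add_parent(self, text: str, child_size: int = 50) -> None:
        """Store the parent and split it into small child chunks."""
        idx = len(self.parents)
        self.parents.append(text)
        words = text.split()
        for i in range(0, len(words), child_size):
            self.children.append((" ".join(words[i:i + child_size]), idx))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Match against children, return the corresponding parents."""
        q = set(query.lower().split())
        scored = [
            (len(q & set(child.lower().split())), pidx)
            for child, pidx in self.children
        ]
        scored.sort(reverse=True)
        seen: set[int] = set()
        results: list[str] = []
        for _, pidx in scored:  # dedupe: one hit per parent
            if pidx not in seen:
                seen.add(pidx)
                results.append(self.parents[pidx])
            if len(results) == k:
                break
        return results
```

The key design point is the child-to-parent mapping: precision comes from matching on small spans, while the generator sees the full surrounding section.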

Pattern 2: Hybrid Retrieval

Vector similarity alone misses exact keyword matches, acronyms, and domain-specific terminology. Hybrid retrieval combines dense (vector) and sparse (keyword) search:

python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Dense retriever (vector similarity)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Sparse retriever (BM25 keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents, k=10)

# Combine with reciprocal rank fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Tune based on your data
)

In our experience, hybrid retrieval improves recall by 15-30% over pure vector search for enterprise knowledge bases, especially for technical documentation with domain-specific jargon.

Pattern 3: Query Transformation

User queries are often vague, ambiguous, or poorly structured. Transform them before retrieval:

Query expansion — Generate multiple reformulations of the query to cast a wider retrieval net:

python
def expand_query(original_query: str) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this question
    that might match relevant documents differently:

    Original: {original_query}

    Return only the 3 alternatives, one per line."""

    result = llm.invoke(prompt)
    alternatives = result.content.strip().split("\n")
    return [original_query] + alternatives

HyDE (Hypothetical Document Embeddings) — Generate a hypothetical answer, embed it, and use that embedding for retrieval. This works because a hypothetical answer is semantically closer to the actual document than the question itself:

python
def hyde_retrieval(query: str) -> list[Document]:
    # Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a brief paragraph that would answer: {query}"
    )
    # Embed the hypothetical answer
    hyde_embedding = embeddings.embed_query(hypothetical.content)
    # Search with the hypothetical embedding
    return vectorstore.similarity_search_by_vector(hyde_embedding, k=10)

Step-back prompting — For specific questions, generate a broader question first:

  • Original: "What is the maximum batch size for model X on A100?"
  • Step-back: "What are the performance characteristics and configuration options for model X?"

The broader question retrieves more relevant context that likely contains the specific answer.
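A minimal sketch of step-back retrieval, with the LLM call and retriever injected as plain callables so the backend stays pluggable. The prompt wording and the merge strategy are illustrative assumptions:

```python
# Step-back retrieval sketch: ask the LLM for a broader question,
# retrieve with both the broad and the original query, and merge.
from typing import Callable


def step_back_retrieval(
    query: str,
    ask_llm: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    # 1. Generate the broader "step-back" question.
    broad = ask_llm(
        "Rewrite this specific question as a broader question "
        f"about the same general topic:\n\n{query}"
    )
    # 2. Retrieve with both: the broad query widens recall, the
    #    original keeps precision. Dedupe while preserving order.
    seen: set[str] = set()
    merged: list[str] = []
    for doc in retrieve(broad) + retrieve(query):
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged
```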

Pattern 4: Re-Ranking

Initial retrieval casts a wide net (top 20-50 results). A re-ranker then scores each result's relevance to the specific query and returns the top-k:

python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Broad initial retrieval
    candidates = hybrid_retriever.invoke(query)  # wide candidate pool

    # Score each candidate
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)

    # Return top-k by reranker score
    scored_docs = sorted(
        zip(candidates, scores), key=lambda x: x[1], reverse=True
    )
    return [doc for doc, score in scored_docs[:k]]

Cross-encoder re-rankers are significantly more accurate than bi-encoder similarity for relevance scoring because they process the query and document jointly. The trade-off is speed — re-ranking 30 candidates takes 50-100ms vs. 5ms for vector search. This is why we use a two-stage pipeline: fast retrieval followed by accurate re-ranking.

Pattern 5: Contextual Compression

Even after re-ranking, retrieved chunks may contain irrelevant information that wastes prompt tokens and confuses the LLM. Contextual compression extracts only the relevant portions:

python
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)

def compressed_retrieval(query: str) -> list[Document]:
    docs = retrieve_and_rerank(query, k=5)
    compressed = []
    for doc in docs:
        result = compressor.compress_documents([doc], query)
        if result:
            compressed.extend(result)
    return compressed

This reduces token usage by 40-60% while maintaining answer quality — a significant cost savings at scale.

Pattern 6: Multi-Index Architecture

Enterprise knowledge bases span multiple data sources with different structures. Instead of one monolithic index, use specialized indexes:

Index    | Data Source             | Embedding Model          | Chunk Size
---------|-------------------------|--------------------------|---------------
docs     | Technical documentation | text-embedding-3-large   | 800 tokens
code     | Code repositories       | code-specific embeddings | Function-level
tickets  | Support tickets         | text-embedding-3-small   | Full ticket
policies | Policy documents        | text-embedding-3-large   | Section-level

A routing layer classifies the incoming query and selects the appropriate index(es):

python
def route_query(query: str) -> list[str]:
    classification = llm.invoke(
        f"Classify this query into one or more categories: "
        f"docs, code, tickets, policies\n\nQuery: {query}"
    )
    return parse_categories(classification.content)

This approach dramatically improves precision by searching only relevant knowledge domains.

Pattern 7: Evaluation and Continuous Improvement

You can't improve what you can't measure. Build an evaluation pipeline:

Retrieval metrics:

  • Recall@k — What percentage of relevant documents are in the top-k results?
  • MRR (Mean Reciprocal Rank) — How high is the first relevant result ranked?
  • NDCG — Are the most relevant results ranked highest?
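Recall@k and MRR are straightforward to compute yourself, given each query's ranked result IDs and its set of relevant IDs (the function names here are ours, not from a library):

```python
# Retrieval metrics over one query's ranked results.
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, 0.0 if none is found.
    Average this over all queries to get MRR."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```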

End-to-end metrics:

  • Answer correctness — Scored by an LLM judge against ground-truth answers
  • Faithfulness — Does the answer only use information from the retrieved context?
  • Answer relevance — Does the answer actually address the question asked?

Build an evaluation dataset of 100-200 query-answer pairs manually curated by domain experts. Run evaluations on every pipeline change:

python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_recall

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_correctness, context_recall],
)
print(results)

Putting It All Together

A production RAG pipeline combines these patterns into a multi-stage pipeline:

1. Query transformation — Expand/reformulate the query
2. Hybrid retrieval — Dense + sparse search across routed indexes
3. Re-ranking — Cross-encoder scoring of candidates
4. Contextual compression — Extract relevant portions
5. Generation — LLM produces an answer with source citations
6. Post-processing — Validate citations, check for hallucination markers
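As a sketch, the six stages compose into a single function. Each stage is injected as a callable mirroring the patterns above; the signatures are illustrative, not a fixed API:

```python
# End-to-end pipeline sketch: each stage is a pluggable callable,
# so individual patterns can be swapped or A/B tested independently.
def rag_pipeline(query, *, transform, retrieve, rerank,
                 compress, generate, postprocess):
    queries = transform(query)           # 1. query transformation
    candidates = []
    for q in queries:                    # 2. hybrid retrieval per variant
        candidates.extend(retrieve(q))
    top = rerank(query, candidates)      # 3. re-ranking
    context = compress(query, top)       # 4. contextual compression
    answer = generate(query, context)    # 5. generation with citations
    return postprocess(answer, context)  # 6. citation/hallucination checks
```

Keeping stages as separate callables also makes the evaluation loop from Pattern 7 easy to run against any single-stage change.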

Conclusion

Enterprise RAG is not a single technique — it's an architecture. Each pattern addresses a specific failure mode of naive RAG: poor chunking causes lost context, single-mode retrieval misses keywords, unranked results dilute quality, and lack of evaluation means silent degradation. At MBB AI Studio, we implement these patterns incrementally with clients, measuring improvement at each stage. Start with hybrid retrieval and re-ranking — these two changes alone typically improve answer quality by 30-40% over naive RAG.