Feb 15, 2026 · 8 min read

Building RAG Pipelines That Actually Work in Production

RAG · LLMs · Engineering

Umesh Bhati

Full-Stack & AI Engineer

Why Most RAG Demos Fail in Production

Building a RAG demo is easy. Getting it to actually work in production — reliably, at scale, with real enterprise data — is where most teams struggle.

I've shipped several RAG systems now, from internal research tools to client-facing AI assistants. Here's what I've learned.

The Chunking Problem

Most tutorials chunk documents at fixed sizes (512 tokens, 1024 tokens) with fixed overlap. This is fine for demos. In production, it creates terrible retrieval because:

  • A chunk that cuts mid-sentence loses context
  • Fixed chunks don't respect document structure (headers, paragraphs, code blocks)
  • Overlap is usually too small or creates duplicate retrievals

What actually works: semantic chunking that uses embedding similarity to find natural break points, combined with parent-child chunking, where you store and embed small chunks but retrieve the larger context window around them.
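The parent-child idea is simple enough to sketch in a few lines. This is an illustrative stand-in, not any particular library's API: it splits on paragraph boundaries (so chunks never cut mid-sentence) and pairs each small chunk with a wider parent window. The `Chunk` type and `parent_child_chunks` name are my own.

```python
# Parent-child chunking sketch: embed the small `text`, but hand the
# larger `parent` window to the LLM at generation time.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str    # small chunk that gets embedded and searched
    parent: str  # larger context window returned to the LLM

def parent_child_chunks(document: str, window: int = 3) -> list[Chunk]:
    # Split on blank lines so chunks respect paragraph structure.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks = []
    for i, para in enumerate(paragraphs):
        # Parent = the paragraph plus its neighbours, for wider context.
        lo = max(0, i - window // 2)
        hi = min(len(paragraphs), lo + window)
        chunks.append(Chunk(text=para, parent="\n\n".join(paragraphs[lo:hi])))
    return chunks

doc = "Intro paragraph.\n\nSetup steps.\n\nDetailed config.\n\nTroubleshooting."
chunks = parent_child_chunks(doc)
```

In a real pipeline you would embed `chunk.text` and stash `chunk.parent` in the vector store's metadata, so retrieval returns the small chunk but the prompt gets the wide context.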

Retrieval Quality is Everything

Your embedding model matters less than you think. Your retrieval strategy matters a lot.

Naive vector search misses:

  • Keyword-heavy queries (product names, error codes)
  • Multi-hop questions that need multiple chunks
  • Queries that depend on recently updated content that has just been reindexed

What actually works: hybrid search (vector + BM25) with re-ranking by a cross-encoder model. Add a query expansion step that uses an LLM to generate alternative phrasings.
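The fusion step of hybrid search is often done with Reciprocal Rank Fusion (RRF), which merges rankings without needing comparable scores. The sketch below hard-codes two rankings as stand-ins for real vector and BM25 retrievers; only the fusion logic is the point, and all names are illustrative.

```python
# Reciprocal Rank Fusion: each document scores sum(1 / (k + rank)) across
# the input rankings; documents that rank well in *either* list surface.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # from embedding similarity
bm25_hits   = ["doc_c", "doc_a", "doc_d"]  # keyword match: error codes, names
fused = rrf_fuse([vector_hits, bm25_hits])
```

In production the fused list would then go through a cross-encoder re-ranker, which scores each (query, chunk) pair jointly and is far more accurate than either first-stage retriever.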

Evaluation is Non-Negotiable

You cannot improve what you don't measure. Set up evaluation from day one.

Key metrics:

  • **Context precision**: Is the retrieved context relevant to the question?
  • **Context recall**: Does the retrieved context contain the answer?
  • **Answer faithfulness**: Does the LLM answer stick to the retrieved context?
  • **Answer relevance**: Does the answer actually address the question?

Tools: RAGAS, DeepEval, or roll your own with GPT-4 as a judge.
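Before reaching for an LLM judge, the retrieval-side metrics can be computed with plain set arithmetic against a small hand-labelled set of relevant chunks per question. This is a simplified, set-based version of context precision and recall (RAGAS and DeepEval compute LLM-judged variants); the function names are mine.

```python
# Set-based retrieval metrics over chunk IDs, given hand-labelled
# relevant chunks. No model calls needed, so it runs in CI.
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Of what we retrieved, how much was relevant?
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Of what was relevant, how much did we retrieve?
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c2", "c3", "c5"}
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```

Answer faithfulness and answer relevance do need a judge model, but these two cheap metrics catch most retrieval regressions before a judge ever runs.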

Production Checklist

  • Semantic/parent-child chunking
  • Hybrid search with re-ranking
  • Query expansion
  • Evaluation framework set up on day 1
  • Caching layer for repeated queries
  • Observability (LangSmith, Langfuse)
  • Graceful fallbacks when retrieval fails

The difference between a demo and a production system is mostly operational maturity, not algorithmic cleverness.
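Two of the operational items above, caching and graceful fallbacks, fit in one small wrapper. The retriever and generator below are stubs; the wiring (normalized cache key, fallback path when retrieval comes back empty) is the pattern, and every name here is illustrative.

```python
# Operational sketch: query-level cache plus a graceful fallback when
# retrieval returns nothing, so users never see a raw error.
import hashlib

_cache: dict[str, str] = {}

def answer(query: str, retrieve, generate) -> str:
    # Normalize before hashing so trivially repeated queries hit the cache.
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    chunks = retrieve(query)
    if not chunks:  # graceful fallback, never a stack trace
        return "I couldn't find anything relevant; try rephrasing your question."
    result = generate(query, chunks)
    _cache[key] = result
    return result

calls = []
def fake_retrieve(q):
    calls.append(q)
    return ["chunk"] if "rag" in q.lower() else []

def fake_generate(q, chunks):
    return f"answer using {len(chunks)} chunk(s)"

first = answer("What is RAG?", fake_retrieve, fake_generate)
second = answer("What is RAG?", fake_retrieve, fake_generate)  # cache hit
missing = answer("unrelated question", fake_retrieve, fake_generate)
```

A real deployment would use Redis or similar instead of a dict, add a TTL so reindexed content invalidates stale answers, and log fallback hits into your observability tool.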

Questions or thoughts? Find me on X or send an email.