Why Most RAG Demos Fail in Production
Building a RAG demo is easy. Getting it to actually work in production — reliably, at scale, with real enterprise data — is where most teams struggle.
I've shipped several RAG systems now, from internal research tools to client-facing AI assistants. Here's what I've learned.
The Chunking Problem
Most tutorials chunk documents at a fixed size (512 or 1,024 tokens) with fixed overlap. This is fine for demos. In production it degrades retrieval because:

- Fixed boundaries split sentences, tables, and code blocks mid-thought, so each embedding captures a fragment rather than a complete idea.
- Document structure (headings, sections, list hierarchy) is ignored, stripping the context that disambiguates a chunk.
- One size that suits dense prose is wrong for sparse formats like slides, FAQs, and contracts.
What actually works: Semantic chunking using embedding similarity to find natural break points, combined with parent-child chunking where you store small chunks but retrieve with larger context windows.
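Here's a minimal sketch of the semantic-chunking idea: embed each sentence, then start a new chunk wherever the similarity between adjacent sentences drops below a threshold. The `toy_embed` function below is a stand-in bag-of-words embedder for illustration only; in production you would call a real embedding model, and the threshold would be tuned on your corpus.

```python
import math
import re

VOCAB = {}

def toy_embed(text):
    # Placeholder embedding: a sparse bag-of-words vector keyed by a shared
    # vocabulary. Swap in a real embedding model here in production.
    counts = {}
    for word in re.findall(r"\w+", text.lower()):
        idx = VOCAB.setdefault(word, len(VOCAB))
        counts[idx] = counts.get(idx, 0) + 1
    return counts

def cosine(a, b):
    # Cosine similarity between two sparse vectors (dicts of index -> count).
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.3):
    # Start a new chunk wherever adjacent sentences are dissimilar, i.e.
    # their similarity falls below the threshold -- the "natural break points".
    embeddings = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

For the parent-child part, each small chunk produced here would additionally store a pointer to its enclosing section so that retrieval matches on the small chunk but the LLM receives the larger parent window.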
Retrieval Quality is Everything
Your embedding model matters less than you think. Your retrieval strategy matters a lot.
Naive vector search misses:

- Exact matches on keywords, IDs, part numbers, and acronyms, which embeddings blur together.
- Queries whose vocabulary differs from the document's, even when the intent is identical.
- Negations and qualifiers ("not covered", "except"), which barely move the embedding.
What actually works: Hybrid search (vector + BM25) with re-ranking using a cross-encoder model. Add a query expansion step using an LLM to generate alternative phrasings.
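One common way to combine the vector and BM25 result lists is reciprocal rank fusion (RRF), sketched below; the cross-encoder re-ranker would then run only on the fused top-k. The function name `rrf_fuse` is my own; `k=60` is the conventional RRF smoothing constant.

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: merge several ranked lists of document IDs.
    # Each document scores sum(1 / (k + rank)) over every list it appears in,
    # so documents ranked highly by multiple retrievers rise to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization, which is why it is popular for fusing vector similarities (cosine in [-1, 1]) with BM25 scores (unbounded): it uses only the ranks.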
Evaluation is Non-Negotiable
You cannot improve what you don't measure. Set up evaluation from day one.
Key metrics:

- Retrieval: recall@k and MRR against a labeled set of query-to-chunk pairs.
- Generation: faithfulness (is the answer grounded in the retrieved context?) and answer relevance.
- End to end: answer correctness on a golden question set, tracked per release.
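The retrieval metrics are simple enough to compute without any framework; a minimal sketch, assuming you have labeled relevant chunk IDs per query:

```python
def recall_at_k(retrieved, relevant, k=5):
    # Fraction of the relevant chunk IDs that appear in the top-k results.
    hits = sum(1 for doc_id in relevant if doc_id in retrieved[:k])
    return hits / len(relevant)

def mrr(all_retrieved, all_relevant):
    # Mean Reciprocal Rank over a labeled query set: for each query, take
    # 1 / rank of the first relevant result (0 if none), then average.
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

Run these on every retrieval change; a chunking or fusion tweak that looks harmless often shows up immediately as a recall@k regression.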
Tools: RAGAS, DeepEval, or roll your own with GPT-4 as a judge.
Production Checklist

- Version your chunks, embeddings, and prompts so any answer can be reproduced.
- Cache embeddings and frequent queries; re-embed only what changed.
- Set timeouts and fallbacks on every model call.
- Log retrieved chunks alongside answers so failures are debuggable.
- Monitor retrieval metrics on live traffic, not just offline.
The difference between a demo and a production system is mostly operational maturity, not algorithmic cleverness.