Building Production RAG: Beyond the Tutorial

What it actually takes to build a RAG system that won't embarrass you in production—lessons from building a client-safe knowledge platform for a creative agency.


Most RAG tutorials show you how to stuff documents into a vector database and call an LLM. That's maybe 10% of building a production system. The other 90%? That's where things get interesting—and where I spent the past few months building a knowledge platform for a creative agency.

Here's what I learned building a system that actually works.

The Real Problem

The client was a branding agency managing assets for 50+ clients. Their "knowledge management" was a sprawling Google Drive with thousands of documents—brand guidelines, past campaigns, style guides, logo specs. Finding anything meant:

  1. Knowing which folder to look in (good luck)
  2. Remembering file names from projects three years ago
  3. Opening 10 documents to find the one sentence you needed

Worse, there was genuine risk. What happens when an intern accidentally uses Client A's brand colors for Client B? Or when a new copywriter references the wrong tone guide?

The ask: Build a system where teams can search, ask questions, and generate drafts—grounded in authoritative sources, with zero risk of cross-client data leakage.

Why Most RAG Demos Fail in Production

Before diving into what I built, let's talk about why the standard RAG setup breaks down:

⚠️The Tutorial Gap

Demos optimize for "wow factor." Production optimizes for "won't blow up at 3am."

Problem 1: Naive chunking loses context

Splitting documents every 500 tokens ignores structure. Your chunk might start mid-sentence, missing the heading that explains what it's about. Citations become useless when the chunk says "As mentioned above..." but "above" got split into a different chunk.

Problem 2: Vector search alone isn't enough

Vector similarity is great for semantic matching but terrible for exact terms. Ask for "hex code #3B82F6" and pure vector search might return chunks about "color palettes" without the actual code.

Problem 3: No isolation means data leakage

Multi-tenant systems need isolation at the query level, not just the application level. "The LLM won't leak data" is not a security strategy.

Problem 4: Citations can be hallucinated

Telling the LLM to "cite sources" doesn't mean it will cite correct sources. It might cite [Source 3] when the information came from [Source 7]—or cite a source that doesn't exist at all.

The Architecture That Actually Works

Here's what I ended up building:

┌─────────────────────────────────────────────────────────┐
│                    INGESTION PIPELINE                    │
│  Google Drive → Export → Extract → Chunk → Embed → DB   │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                  RETRIEVAL LAYER                         │
│  Hybrid Search (Vector + Keyword) → RRF Fusion → Rerank │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│                    RAG LAYER                             │
│  Build Prompt → LLM → Extract Citations → Validate      │
└─────────────────────────────────────────────────────────┘

Let me break down the non-obvious parts.

Structure-Aware Chunking

Instead of naive splitting, I detect document structure:

import re

def is_heading(line: str) -> bool:
    # Markdown headings: "# Title" through "###### Title"
    if re.match(r'^#{1,6}\s+\S', line):
        return True
    # ALL-CAPS headings (common in brand docs)
    if line.isupper() and len(line) >= 3:
        return True
    # Numbered sections: "1. Title", "2.3 Title"
    if re.match(r'^\d+\.(\d+\.?)?\s+[A-Z]', line):
        return True
    return False

Each chunk gets a heading_context field preserving its parent heading. When the LLM cites a chunk, users see "Section: Brand Colors > Primary Palette" instead of an orphaned paragraph.
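A chunker that carries this context can be sketched roughly like so (the `Chunk` dataclass and `chunk_with_headings` names are illustrative, not the production code, and the real version also handles nesting and overlap):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    heading_context: str  # e.g. "Brand Colors"

def chunk_with_headings(lines, is_heading):
    """Walk the document, attaching the current heading to each chunk."""
    heading_path, chunks, buffer = [], [], []

    def flush():
        # Emit the buffered lines as one chunk under the current heading
        if buffer:
            chunks.append(Chunk(text="\n".join(buffer),
                                heading_context=" > ".join(heading_path)))
            buffer.clear()

    for line in lines:
        if is_heading(line):
            flush()
            # Simplified: flat heading path, no nesting levels
            heading_path = [line.lstrip('#').strip()]
        else:
            buffer.append(line)
    flush()
    return chunks
```

The key point is that heading state travels with the text, so a chunk is never orphaned from the section that explains it.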

Hybrid Search with RRF

Pure vector search misses exact matches. Pure keyword search misses semantic similarity. The solution is Reciprocal Rank Fusion:

def rrf_fusion(vector_results, keyword_results, k=60):
    # Map each chunk ID to its rank in each result list
    vector_ranks = {cid: rank for rank, cid in enumerate(vector_results)}
    keyword_ranks = {cid: rank for rank, cid in enumerate(keyword_results)}

    scores = {}
    for chunk_id in set(vector_results) | set(keyword_results):
        score = 0.0
        if chunk_id in vector_ranks:
            score += 1.0 / (k + vector_ranks[chunk_id])
        if chunk_id in keyword_ranks:
            score += 1.0 / (k + keyword_ranks[chunk_id])
        scores[chunk_id] = score
    return sorted(scores.items(), key=lambda x: -x[1])

The beauty of RRF: it doesn't require normalizing scores between different search methods. It only uses ranks. A document appearing in both lists gets boosted; one appearing only in vector search still gets considered.

Two-Stage Retrieval

Fast retrieval gets you recall. Slow reranking gets you precision. I do both:

  1. Stage 1: Hybrid search returns top 50 candidates (fast, ~100ms)
  2. Stage 2: Cohere reranking picks the best 10 (slower, ~500ms, but much more accurate)

The reranking model understands query-document relevance better than cosine similarity. Worth the latency.
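The two stages compose into a small pipeline. A sketch, with `hybrid_search` and `rerank` as hypothetical callables standing in for the real search layer and the Cohere client:

```python
def retrieve(query, hybrid_search, rerank, candidates=50, top_k=10):
    """Two-stage retrieval: broad recall first, then precise reranking.

    Assumed (illustrative) signatures:
      hybrid_search(query, limit) -> list of chunks
      rerank(query, chunks, top_k) -> list of (chunk, relevance_score)
    """
    # Stage 1: cast a wide net (fast, recall-oriented)
    pool = hybrid_search(query, limit=candidates)
    if not pool:
        return []
    # Stage 2: cross-encoder picks the best few (slow, precision-oriented)
    return rerank(query, pool, top_k=top_k)
```

Keeping the stages as injected callables also makes the pipeline trivial to unit-test with fakes.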

Client Isolation at the Query Level

This was non-negotiable. Every database query includes a client filter:

# CRITICAL: Empty client_ids returns empty, not everything
if not client_ids:
    logger.warning("Empty client_ids - returning empty")
    return []

# Every query is scoped
WHERE client_id = ANY($1::uuid[])

The isolation happens at the repository layer, not the application layer. Even if a bug bypasses the router logic, the database won't return unauthorized data.

Citation Validation

LLMs hallucinate citations. My solution:

  1. Strict format: Citations must be [Source N]
  2. Regex extraction: Parse all citations from response
  3. Range validation: Skip any citation outside the actual source count
  4. Refusal detection: Pattern match against 14+ refusal phrases

REFUSAL_PHRASES = [
    "i don't have enough information",
    "i cannot answer",
    "no relevant information",
    "not enough information in the provided sources",
    # ... more patterns
]

If the LLM refuses to answer, I detect it and show users a clear "insufficient information" message rather than a hedged non-answer.
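Steps 1–4 can be sketched as a single validation pass (the `validate_response` name and return shape are illustrative, not the production code):

```python
import re

def validate_response(answer: str, num_sources: int, refusal_phrases) -> dict:
    """Extract [Source N] citations, drop out-of-range ones, flag refusals."""
    # Step 2: regex extraction of every citation in the response
    cited = {int(n) for n in re.findall(r'\[Source (\d+)\]', answer)}
    # Step 3: range validation against the actual source count
    valid = sorted(n for n in cited if 1 <= n <= num_sources)
    # Step 4: refusal detection via case-insensitive phrase matching
    lowered = answer.lower()
    refused = any(phrase in lowered for phrase in refusal_phrases)
    return {"citations": valid, "refused": refused}
```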

The Testing Strategy

363 tests might sound excessive. It's not.

When you're building a system where mistakes have real consequences—wrong brand colors, leaked client data, missed citations—you need coverage that matches the risk:

  • 16 tests for client isolation alone (vector search, keyword search, hybrid search, edge cases)
  • 15 tests for the ask endpoint (auth, validation, integration)
  • 23 tests for chunking (heading detection, splitting, overlap)
  • 14 tests for audit logging (sanitization, action mapping, completeness)

The tests aren't just for correctness. They're documentation of security guarantees.

💡Test what matters

Don't test that FastAPI returns 200. Test that User A cannot see User B's data under any retrieval method.
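To make that concrete, here is the shape such a test takes, using a tiny in-memory stand-in for the repository (everything here is illustrative; the real tests run against the actual search paths):

```python
class InMemoryRepo:
    """Minimal in-memory stand-in for the repository layer."""
    def __init__(self):
        self.chunks = []  # list of (client_id, text)

    def insert_chunk(self, client_id, text):
        self.chunks.append((client_id, text))

    def search(self, client_ids, query):
        if not client_ids:  # empty scope => no access, not all access
            return []
        return [(cid, text) for cid, text in self.chunks
                if cid in client_ids and query in text]

def test_client_a_cannot_see_client_b():
    repo = InMemoryRepo()
    repo.insert_chunk("client-b", "Client B secret palette")
    # A search scoped to client A must never surface client B's data
    assert repo.search(["client-a"], "palette") == []
    # An empty scope returns nothing, not everything
    assert repo.search([], "palette") == []
```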

Key Decisions and Tradeoffs

pgvector over Pinecone

Pinecone is great. But for this project:

  • Simpler ops (one database, not two)
  • No vendor lock-in
  • Client isolation via SQL WHERE clauses
  • Audit logging in the same transaction

Tradeoff: pgvector requires more tuning at scale. For 2,000 documents, it's not a problem.

Denormalized client_id

I store client_id on both documents and chunks tables. Yes, it's redundant. But:

  • No joins needed for retrieval queries
  • B-tree index on client_id makes filtering fast
  • Isolation is enforced at the data level

Tradeoff: Storage overhead. Worth it for query simplicity.

Explicit Refusal Behavior

Instead of hoping the LLM refuses gracefully, I enforce it:

  • Prompt template includes explicit refusal instructions
  • Response parsing detects refusal patterns
  • Confidence scoring based on chunk count and citation count

Tradeoff: Some valid answers might get flagged as low-confidence. Better than hallucinated answers.
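The confidence heuristic can be as simple as this sketch (thresholds and weights here are illustrative, not the production values):

```python
def confidence_score(num_chunks: int, num_citations: int, refused: bool) -> float:
    """Heuristic confidence from retrieval depth and citation usage."""
    if refused or num_chunks == 0:
        return 0.0
    # How much of the retrieved context the answer actually cited
    citation_ratio = min(num_citations / num_chunks, 1.0)
    # Thin retrieval lowers the ceiling regardless of citations
    base = 0.5 if num_chunks < 3 else 0.8
    return round(base * (0.5 + 0.5 * citation_ratio), 2)
```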

What I'd Do Differently

Start with evaluation earlier. I built the eval framework after most features were done. Should have been iteration 1.

More aggressive caching. Embedding the same query twice is wasteful. Query caching would help with common questions.
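A first cut at that cache is small: key on normalized query text and only call the embedding API on a miss (the `EmbeddingCache` name and normalization choices are illustrative):

```python
import hashlib

class EmbeddingCache:
    """Query-embedding cache keyed on normalized query text."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # called only on cache misses
        self.store = {}
        self.hits = 0

    def get(self, query):
        # Normalize so trivially different phrasings share an entry
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.embed_fn(query)
        return self.store[key]
```

A production version would add TTL-based eviction, but even this shape eliminates the duplicate API calls for common questions.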

Better observability from day one. Structured logging is good. Distributed tracing would be better. When something goes wrong in production, you want to see the full request path.

The Result

The system now handles:

  • Sub-second hybrid search with 95%+ recall
  • Grounded Q&A where every claim has a citation
  • Draft generation using past approved work as examples
  • Full audit trail of who accessed what

More importantly, it's a system rather than a collection of API calls. The architecture handles edge cases, the tests prevent regressions, and the isolation guarantees are enforced at every layer.

Final Thoughts

Building RAG for production isn't about the LLM. It's about everything around the LLM:

  • Ingestion that preserves document structure
  • Retrieval that combines multiple signals
  • Isolation that's enforced at the data layer
  • Validation that catches LLM mistakes
  • Testing that proves your guarantees

The LLM is the easy part. The hard part is building a system that won't embarrass you when it matters.


Building something similar? I'd love to hear about your approach. Reach out on LinkedIn or email.