Sahaib's Tech Stack

By Sahaib Singh in InsideTheStack — 08 Dec 2025

RAG That Actually Works

And why 90 percent of people implement it wrong

Most RAG systems look impressive in demos.
In real usage, they quietly fall apart.

The problem is not RAG as a concept.
The problem is treating it like a search feature instead of an engineering system.

Why RAG matters

When RAG is bad, everything downstream breaks:

hallucinations that sound confident
irrelevant answers pulled from the wrong place
retrieved context that overrides the answer instead of grounding it
exploding token bills with no accuracy gain

When RAG is good, it feels like your own ChatGPT, as if it had been trained on your data, answering with restraint and relevance.

That gap comes down entirely to implementation.

What actually makes RAG work

RAG reliability depends on four things. Miss one and the system degrades fast.

Chunking strategy: the choice between semantic and fixed-size chunks changes recall and coherence dramatically.
Embedding model quality: weak embeddings guarantee weak retrieval, no matter how good your LLM is.
Vector database structure: flat storage without metadata is a dead end.
Prompt formatting: how retrieved context is injected matters more than most people realize.
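
To make the prompt formatting point concrete, here is a minimal sketch of one way to inject retrieved context. The `Chunk` type, the `source` field, and the instruction wording are my own assumptions for illustration, not a standard template.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical metadata field, e.g. a file name


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each retrieved chunk and tell the model to stay within them."""
    if not chunks:
        context = "(no context retrieved)"
    else:
        context = "\n\n".join(
            f"[{i}] (source: {c.source})\n{c.text.strip()}"
            for i, c in enumerate(chunks, start=1)
        )
    return (
        "Answer using only the numbered context below. "
        "Cite sources like [1]. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    demo = [Chunk("Refunds are accepted within 30 days.", "policy.pdf")]
    print(build_prompt("What is the refund window?", demo))
```

Numbering the chunks and naming their sources gives the model something to cite, and gives you something to check the answer against later.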

A simple but critical tradeoff:

small chunks improve recall
large chunks preserve semantic cohesion

There is no universal right answer. Only intentional design.
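
The tradeoff is easier to see with a knob you can actually turn. Below is a rough sketch of fixed-size chunking by word count with an optional overlap. The function name and parameters are hypothetical, and a real pipeline would more likely count tokens or split on sentence boundaries.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words.

    Each window shares `overlap` words with the previous one.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    doc = "a b c d e f g h i j"
    print(chunk_words(doc, size=4, overlap=1))  # smaller, overlapping chunks
    print(chunk_words(doc, size=10))            # one large chunk
```

Running the same document through a few `size` and `overlap` settings, then checking which setting retrieves the right passage for real questions, is one way to make the choice intentional.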

Where most systems fail

These mistakes show up again and again:

chunks that are too large and lose focus
chunks that are too small and add noise
embedding models chosen casually
no reranking layer
missing metadata filters
zero visibility into what was retrieved

Fixing these alone can improve accuracy by 40 to 60 percent.
Not by switching models. By fixing plumbing.
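
As one example of that plumbing, here is a small in-memory sketch of metadata filtering applied before similarity ranking. `Record`, `search`, and the `where` argument are names I made up for illustration, and the vectors are toy values. A real vector database exposes its own filter syntax.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vector, k=3, where=None):
    """Keep records whose metadata matches `where`, then rank by cosine."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in where.items())
    ]
    scored = [(cosine(query_vector, r.vector), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    records = [
        Record("A", [1.0, 0.0], {"source": "handbook"}),
        Record("B", [0.9, 0.1], {"source": "blog"}),
        Record("C", [0.0, 1.0], {"source": "handbook"}),
    ]
    for score, rec in search(records, [1.0, 0.0], k=2, where={"source": "handbook"}):
        print(f"{score:.3f} {rec.text} {rec.metadata}")
```

Without the filter, the near-duplicate from the wrong source ranks second. With it, that record never reaches the prompt.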

What actually worked for me

Hard-earned lessons, not theory:

start with a strong embedding model
always add a reranking stage
use metadata filters such as source, page, and timestamp
intentionally restrict prompts to your domain
log retrieval results and inspect failures
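
Here is a minimal sketch of the reranking and logging lessons together. The scoring function is a deliberately crude stand-in (term overlap) so the example runs without any model. In practice you would pass a real reranker, such as a cross-encoder, as `score_fn`. The logger name and function names are hypothetical.

```python
import logging

logger = logging.getLogger("rag.retrieval")  # hypothetical logger name


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query terms found in the passage.

    This is a placeholder for a real reranking model, not a recommendation.
    """
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, passages, top_n=3, score_fn=overlap_score):
    """Rescore first-stage candidates, log every score, keep the best top_n."""
    scored = sorted(
        ((score_fn(query, p), p) for p in passages),
        key=lambda pair: pair[0],
        reverse=True,
    )
    for rank, (score, passage) in enumerate(scored, start=1):
        logger.info(
            "query=%r rank=%d score=%.3f kept=%s passage=%r",
            query, rank, score, rank <= top_n, passage[:80],
        )
    return [p for _, p in scored[:top_n]]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    candidates = [
        "shipping takes five days",
        "refund policy allows returns within 30 days",
        "our refund window is 30 days",
    ]
    print(rerank("refund window days", candidates, top_n=2))
```

The log lines record what was dropped as well as what was kept, which is what you need when you go back to inspect a bad answer.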

RAG is not prompt engineering.
It is systems engineering.

The real takeaway

RAG fails when you copy templates.
It works when you design for intent, grounding, and traceability.

Not perfect.
Not flashy.
But predictable and trustworthy.

That is the only kind of AI system worth shipping.

Closing

This post is part of InsideTheStack, focused on how real AI applications are built, not how they are marketed.

Follow along for more.

#InsideTheStack #RAG #AIEngineering