RAG That Actually Works
And why 90 percent of people implement it wrong
Most RAG systems look impressive in demos.
In real usage, they quietly fall apart.
The problem is not RAG as a concept.
The problem is treating it like a search feature instead of an engineering system.
Why RAG matters
When RAG is bad, everything downstream breaks:
- hallucinations that sound confident
- irrelevant answers pulled from the wrong place
- context overwriting instead of grounding
- exploding token bills with no accuracy gain
When RAG is good, it feels like your own ChatGPT, trained on your data, answering with restraint and relevance.
That gap is entirely implementation.

What actually makes RAG work
RAG reliability depends on four things. Miss one and the system degrades fast.
- Chunking strategy
Semantic vs fixed-size chunks change recall and coherence dramatically. - Embedding model quality
Weak embeddings guarantee weak retrieval, no matter how good your LLM is. - Vector database structure
Flat storage without metadata is a dead end. - Prompt formatting
How retrieved context is injected matters more than most people realize.
A simple but critical tradeoff:
- small chunks improve recall
- large chunks preserve semantic cohesion
There is no universal right answer. Only intentional design.

Where most systems fail
These mistakes show up again and again:
- chunks that are too large and lose focus
- chunks that are too small and add noise
- embedding models chosen casually
- no reranking layer
- missing metadata filters
- zero visibility into what was retrieved
Fixing these alone can improve accuracy by 40 to 60 percent.
Not by switching models. By fixing plumbing.

What actually worked for me
Hard-earned lessons, not theory:
- start with a strong embedding model first
- always add a reranking stage
- use metadata filters like source, page, timestamp
- restrict prompts to the domain intentionally
- log retrieval results and inspect failures
RAG is not prompt engineering.
It is systems engineering.

The real takeaway
RAG fails when you copy templates.
It works when you design for intent, grounding, and traceability.
Not perfect.
Not flashy.
But predictable and trustworthy.
That is the only kind of AI system worth shipping.

Closing
This post is part of InsideTheStack, focused on how real AI applications are built, not how they are marketed.
Follow along for more.
#InsideTheStack #RAG #AIEngineering