RAG Retrieval Architectures: When Better Embeddings Stop Helping (2026)
By Arvid Andersson
Most RAG projects start vector-first: embed the documents, store them, retrieve by similarity. It works in the demo. Then a user searches for something exact (a product code, an error number, a specific name) and the system misses it, because vector search ranks by meaning, not by literal tokens. This post covers that failure and the retrieval architectures that fix it, as of June 2026.
The failure pure vector search hides
An embedding turns text into a point in a space where nearby points mean similar things. That is exactly what you want for "find documents about return policies" and exactly what you do not want for "find SKU-4471". The model encodes "SKU-4471" as something close to other codes that look and read like it, so the literal one a user typed can sit just outside the top results. The same happens with error codes, part numbers, ticket IDs, acronyms, and rare proper nouns.
This failure stays hidden through testing. Semantic queries work, the demo looks done, and then a user types the one exact thing they expected to match and loses trust in the whole system. It compounds in chat, where users phrase things literally and expect literal matches. A bigger or better embedding model does not fix it, because the problem is not embedding quality: exact matching is a different job from semantic similarity.
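To make the contrast concrete, here is a minimal sketch of the lexical side: a small BM25 scorer written from scratch. The tokenizer, the sample documents, and the parameter defaults (`k1=1.5`, `b=0.75`) are illustrative assumptions, not something taken from a particular search engine. The point is only that a lexical scorer gives a near-miss code like "SKU-4417" a score of zero for the query "SKU-4471", which an embedding model will usually not do.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep hyphenated identifiers such as "sku-4471" as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical catalogue snippets, for illustration only.
docs = [
    "SKU-4471 stainless steel kettle, 1.7 litre",
    "SKU-4417 stainless steel kettle, 1.2 litre",
    "Return policy for kitchen appliances",
]
scores = bm25_scores("SKU-4471", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the first document contains the literal token, so it is the only one with a non-zero score. That all-or-nothing behaviour is what hybrid search adds back on top of semantic ranking.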
A retrieval ladder, in order
Climb only as far as your evaluation says you need to. Each rung costs more.
1. Hybrid search. Run BM25 (keyword) and vector queries together, merge with rank fusion. Recovers exact tokens without losing semantic recall. The biggest single win for most projects.
2. Reranking. A cross-encoder re-scores the top candidates against the query and reorders them. Usually the next biggest quality jump after hybrid.
3. Query and metadata work. Query expansion, metadata filters, and better chunking. Helps when the right chunk exists but is not surfaced.
4. Structure-aware retrieval. Entity-aware retrieval or a graph layer (GraphRAG) for questions about entities, relationships, and identities that no single chunk answers.
Step 1: hybrid search
Hybrid search runs a lexical query (BM25 or full-text) and a vector query, then merges the two result sets, commonly with Reciprocal Rank Fusion. Lexical search nails exact terms and rare tokens; vector search nails paraphrase and meaning. The merge recovers what either alone would miss. This is the rung most projects should reach for first.
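The merge step described above can be sketched in a few lines. This is a minimal Reciprocal Rank Fusion implementation under stated assumptions: each retriever returns an ordered list of document IDs, the constant `k=60` is the commonly used default, and the document IDs are made up for illustration. Databases with native hybrid support do this internally, so treat it as an explanation of the idea rather than any product's exact behaviour.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a lexical and a vector query.
lexical = ["doc_sku", "doc_catalog", "doc_returns"]
vector = ["doc_returns", "doc_sku", "doc_faq"]

fused = reciprocal_rank_fusion([lexical, vector])
print(fused)
```

RRF uses only rank positions, not raw scores, which is why it can merge BM25 scores and cosine similarities without any score normalization. Documents that appear in both lists rise to the top; documents found by only one retriever still survive in the fused list.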
You can get hybrid two ways. A vector database with native hybrid support handles both signals in one query: Weaviate and Qdrant use BM25-based sparse plus dense vectors, and Pinecone pairs dense vectors with its own sparse model. Or you use a search engine built around hybrid from the start: Typesense and Meilisearch combine full-text and vector search with typo tolerance, and Azure AI Search offers full-text, vector, and hybrid in one managed service. The right choice depends on whether you already run a vector DB or want search-engine ergonomics (faceting, typo tolerance, geo) alongside retrieval.
Step 2: reranking
Hybrid search widens the net for recall, but the merged ranking is approximate. A reranker fixes the ordering: a cross-encoder reads the query and each candidate together and scores true relevance, then reorders. It is usually the highest-leverage quality change after hybrid, the lever to reach for once recall is good but precision lags.
Rerankers ship as hosted APIs alongside embeddings. Jina AI offers a multilingual reranker (with open weights on Hugging Face) next to its embeddings and reader. Voyage AI (now part of MongoDB) provides rerank models with a free tier, focused on retrieval quality. Cohere offers a widely used Rerank endpoint. The cost note that matters: rerank only the top candidates (often 50 to 100), not the whole result set, since a cross-encoder is far more expensive per document than a vector lookup.
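The "rerank only the top candidates" rule is easy to express in code. The sketch below assumes a scoring function `score_fn(query, doc)` that returns a relevance score; in a real system that would be a call to a cross-encoder or a hosted rerank API. Here a toy word-overlap scorer stands in for it so the example runs on its own. The function names and sample documents are hypothetical.

```python
from typing import Callable


def rerank_top_n(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 50,
) -> list[str]:
    """Re-score only the first top_n candidates; leave the tail in its original order."""
    head, tail = candidates[:top_n], candidates[top_n:]
    reranked_head = sorted(head, key=lambda doc: score_fn(query, doc), reverse=True)
    return reranked_head + tail


def toy_overlap_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: fraction of query words present in the document.
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)


# Hypothetical merged candidates from a hybrid query, in their fused order.
candidates = [
    "shipping times for international orders",
    "how to reset your password",
    "password reset email not arriving",
    "gift card balance",
]
reranked = rerank_top_n("password reset email", candidates, toy_overlap_score, top_n=3)
print(reranked)
```

The expensive scorer runs `top_n` times per query regardless of index size, which keeps latency and cost bounded.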
Step 3: query and metadata work
If the right chunk exists but does not surface, the lever moves upstream. Query expansion generates variants of the incoming query (synonyms, expanded acronyms, alternate phrasings), searches them in parallel, and merges the results. It is most useful when users type terse or ambiguous queries. Metadata filters narrow the search space before retrieval (by date, source, language, or a field like country), which improves both precision and latency. And chunking strategy decides whether a relevant passage is even retrievable as a unit. These are worth tuning once hybrid and reranking are in place, not before.
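A minimal sketch of the two upstream levers, assuming a hand-maintained acronym table and chunks that carry a flat metadata dictionary. The `ACRONYMS` table, the `Chunk` class, and the field names are all hypothetical. Production systems often generate query variants with an LLM and apply filters inside the database query rather than in application code.

```python
from dataclasses import dataclass, field

# Hypothetical acronym table; a real one would come from your own domain.
ACRONYMS = {"sso": "single sign-on", "mfa": "multi-factor authentication"}


@dataclass
class Chunk:
    text: str
    metadata: dict[str, str] = field(default_factory=dict)


def expand_query(query: str) -> list[str]:
    """Return the original query plus one variant per expandable acronym."""
    words = query.lower().split()
    variants = [query]
    for i, word in enumerate(words):
        if word in ACRONYMS:
            variants.append(" ".join(words[:i] + [ACRONYMS[word]] + words[i + 1:]))
    return variants


def filter_chunks(chunks: list[Chunk], **required: str) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required field exactly."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]


chunks = [
    Chunk("SSO setup guide", {"country": "SE", "language": "en"}),
    Chunk("SSO setup guide (US edition)", {"country": "US", "language": "en"}),
    Chunk("Guide för SSO", {"country": "SE", "language": "sv"}),
]

variants = expand_query("SSO setup")
in_scope = filter_chunks(chunks, country="SE", language="en")
print(variants, [chunk.text for chunk in in_scope])
```

Each variant would then be sent through the same retrieval path and the result lists merged, for example with rank fusion.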
Step 4: structure-aware retrieval
Some questions cannot be answered by ranking chunks at all. "Who is X" or "how is X related to Y" over a corpus where the answer is spread across many passages, with no single chunk acting as a summary, is a retrieval-structure problem. Entity-aware approaches build profiles or summaries per entity; GraphRAG-style approaches build a graph of entities and relationships and traverse it before or alongside chunk retrieval. RAG frameworks like LlamaIndex and Haystack provide building blocks for these patterns. This rung is real work and worth it only when evaluation shows ranking is not the bottleneck.
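As a rough illustration of the graph idea, the sketch below stores entity relationships as edges tagged with the chunk they were extracted from, then uses breadth-first search to collect the chunks that connect two entities. It is a deliberately simplified toy, not GraphRAG itself and not the LlamaIndex or Haystack API. The class, the entity names, and the chunk IDs are hypothetical, and the hard part in practice (extracting entities and relations from text) is assumed to have happened already.

```python
from __future__ import annotations

from collections import defaultdict, deque


class EntityGraph:
    def __init__(self) -> None:
        # entity -> list of (relation, neighbouring entity, source chunk id)
        self.edges: dict[str, list[tuple[str, str, str]]] = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str, chunk_id: str) -> None:
        # Store both directions so traversal can start from either entity.
        self.edges[subject].append((relation, obj, chunk_id))
        self.edges[obj].append((f"inverse:{relation}", subject, chunk_id))

    def chunks_connecting(self, start: str, goal: str, max_hops: int = 3) -> list[str] | None:
        """Return chunk ids along a shortest path from start to goal, or None."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            entity, path = queue.popleft()
            if entity == goal:
                return path
            if len(path) >= max_hops:
                continue
            for _relation, neighbour, chunk_id in self.edges[entity]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, path + [chunk_id]))
        return None


graph = EntityGraph()
graph.add("Alice", "manages", "Team Atlas", "chunk-12")
graph.add("Team Atlas", "owns", "Billing Service", "chunk-87")

evidence = graph.chunks_connecting("Alice", "Billing Service")
print(evidence)
```

Neither chunk alone answers "how is Alice related to Billing Service", and neither is likely to rank highly for that query on similarity alone. The traversal surfaces both as evidence to pass to the generator.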
Build the eval set before you climb
The order above is a default, not a prescription. The way to know which rung you actually need is to build a small evaluation set first: a few dozen real queries, each paired with the documents that should answer it. Then sort the failures into buckets: wrong documents retrieved, right document retrieved but the answer missed it, query too vague. Each bucket points at a different rung. Adding features without measuring just moves the failure around. For the generation side of this (catching answers that cite the right chunks but draw the wrong conclusion), see the companion post on evaluating RAG quality beyond RAGAS.
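The bucketing step can be as simple as the sketch below. It assumes each eval case records the query, the IDs of the documents that should answer it, the IDs the system retrieved, and two manually assigned labels (`answer_correct` and `vague`). The field names, bucket names, and sample cases are made up for illustration.

```python
from collections import Counter


def hit_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> bool:
    """True if any relevant document appears in the top k retrieved results."""
    return any(doc_id in relevant for doc_id in retrieved[:k])


def bucket(case: dict, k: int = 5) -> str:
    if case["vague"]:
        return "query too vague"
    if not hit_at_k(case["retrieved"], case["relevant"], k):
        return "wrong documents retrieved"
    if not case["answer_correct"]:
        return "right document retrieved, answer missed it"
    return "ok"


# Hypothetical eval cases; in practice these come from real user queries.
eval_set = [
    {"query": "SKU-4471 warranty", "relevant": {"d1"}, "retrieved": ["d7", "d9", "d3"],
     "answer_correct": False, "vague": False},
    {"query": "return policy", "relevant": {"d2"}, "retrieved": ["d2", "d5"],
     "answer_correct": True, "vague": False},
    {"query": "it broke", "relevant": {"d4"}, "retrieved": ["d8"],
     "answer_correct": False, "vague": True},
    {"query": "error E-1042", "relevant": {"d6"}, "retrieved": ["d6", "d1"],
     "answer_correct": False, "vague": False},
]

buckets = Counter(bucket(case) for case in eval_set)
for name, count in buckets.most_common():
    print(f"{name}: {count}")
```

Running the same eval set against pure vector retrieval and then against hybrid shows directly whether the "wrong documents retrieved" bucket shrinks, which is the signal that the first rung was the right one to climb.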
Tools by step
| Step | What it does | Tools |
|---|---|---|
| Hybrid (native) | Vector DB with built-in keyword + vector | Weaviate, Qdrant, Pinecone |
| Hybrid (search engine) | Full-text + vector search engine | Typesense, Meilisearch, Azure AI Search |
| Reranking | Cross-encoder re-scoring of candidates | Jina AI, Voyage AI, Cohere |
| Structure-aware | Entity and graph retrieval patterns | LlamaIndex, Haystack |
Frequently asked questions
Why does my RAG system miss exact matches like SKUs or error codes?
Vector search ranks by semantic similarity, not literal token match. An embedding of "SKU-4471" or "error E-1042" lands near other codes that look similar in meaning, so the exact one a user typed can fall outside the top results. Embeddings are strong at meaning and weak at literal identifiers. The fix is hybrid search: run a keyword (BM25) query alongside the vector query and merge the results, so exact tokens are matched exactly while semantic recall is preserved.
What is hybrid search in RAG?
Hybrid search combines lexical search (BM25 or full-text) with vector search and merges the two result sets, usually with Reciprocal Rank Fusion. Lexical search catches exact terms, identifiers, and rare words; vector search catches paraphrases and semantic matches. Together they recover cases that either method alone would miss. Most production RAG systems use hybrid rather than pure vector search once they hit real queries.
Do I need a reranker if I already use hybrid search?
Often yes. Hybrid search widens the candidate set for recall, but the top of that merged list is not necessarily ordered by true relevance. A reranker (a cross-encoder model) re-scores the top candidates against the query and reorders them, which usually gives the largest precision gain after hybrid search. It adds latency and cost, so rerank a small candidate set (for example the top 50 to 100), not the whole index.
When should I move beyond hybrid search and reranking?
When evaluation shows the failures are no longer about ranking. If the right chunk is never retrieved regardless of method, the problem is upstream: chunking, metadata filters, or missing structure. Questions about entities, relationships, and identities across a corpus often need entity-aware retrieval or a graph layer (GraphRAG) rather than better embeddings. Build an eval set first so you change the layer that is actually failing.
Is hybrid search worth it for a small RAG project?
It depends on the documents. If your corpus contains identifiers, product names, codes, or rare technical terms that users search for literally, hybrid search pays off quickly. If the content is prose where meaning dominates and exact tokens rarely matter, pure vector search may be enough. The practical test is to build a small eval set with the queries you actually expect, then compare pure vector against hybrid on it.

