Evaluating RAG Quality Beyond RAGAS (2026)

By Arvid Andersson

RAGAS catches the obvious failures: unsupported claims, off-topic retrieval, answers that ignore the question. But teams keep shipping RAG responses that look grounded, cite real chunks, and are still subtly wrong. This post is about that specific failure mode, why faithfulness scoring misses it, and a layered method for catching it, with the tools that fit each layer. It reflects the tooling landscape as of June 2026.

The gap: faithful is not the same as correct

RAGAS popularized the core RAG metrics: faithfulness, answer relevance, context precision, and context recall. Faithfulness asks whether the retrieved chunks support the answer's claims. That is valuable, and it catches a lot. But it has a blind spot: it checks that each claim can be traced to the cited chunks, not that the inference drawn from those chunks is correct.

So a system can retrieve a real, relevant chunk, cite it accurately, pass faithfulness, and still return a wrong answer, because it drew an incorrect conclusion from a correct source. This is the "sounds right, isn't right" case, and it is exactly the one that slips through a faithfulness-only setup. Closing the gap takes more than one metric.

RAG evaluation in layers

Each layer catches what the one before it misses. Add layers as the stakes rise.

0. UI provenance. Show the exact source chunk in the product so users and reviewers can catch bad inferences themselves.
1. Reference-free scoring. Faithfulness, context relevance, answer relevance. Catches unsupported claims and bad retrieval without labels.
2. Semantic correctness via judge. An LLM-as-judge scores whether the inference is right, not just supported. Powerful, but the judge must be calibrated.
3. Ground-truth checks. Compare against a gold answer or a curated knowledge base for high-stakes responses.

Layer 1: reference-free scoring (the baseline)

The first layer needs no labeled data. It compares the answer to the retrieved context: does every claim trace to a chunk (faithfulness), were the retrieved chunks relevant (context relevance), does the answer address the question (answer relevance). RAGAS is the common starting point here. TruLens frames the same idea as the RAG triad (context relevance, groundedness, answer relevance) and adds tracing so you can see which step produced a bad result. DeepEval offers the same RAG metrics inside a pytest-style framework, which fits teams that want evals in their existing test suite.
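To make the shape of a reference-free check concrete, here is a minimal sketch that scores faithfulness as the fraction of answer sentences whose content words mostly appear in at least one retrieved chunk. The word-overlap heuristic, the function names, the stopword list, and the 0.6 threshold are all assumptions made for illustration. RAGAS, TruLens, and DeepEval rely on LLM-based claim extraction and verification rather than word overlap, so treat this as a toy stand-in, not as any library's method.

```python
import re

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for", "on", "it", "that"}

def content_words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def is_supported(claim: str, chunks: list[str], threshold: float = 0.6) -> bool:
    """A claim counts as supported if enough of its content words occur in one chunk."""
    words = content_words(claim)
    if not words:
        return True  # nothing checkable in this sentence
    return any(len(words & content_words(c)) / len(words) >= threshold for c in chunks)

def faithfulness(answer: str, chunks: list[str]) -> float:
    """Fraction of answer sentences that are supported by the retrieved chunks."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not claims:
        return 0.0
    return sum(is_supported(c, chunks) for c in claims) / len(claims)

# Hypothetical example data.
chunks = ["Refunds are available within 30 days of purchase for annual plans."]
grounded = "Refunds are available within 30 days of purchase."
unsupported = "Refunds are available within 30 days. Shipping is free worldwide."

print(faithfulness(grounded, chunks))     # 1.0
print(faithfulness(unsupported, chunks))  # 0.5
```

No gold answer appears anywhere in the computation, which is what makes the layer cheap to run and also why it cannot tell you whether a supported claim is actually correct.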

Layer 2: semantic correctness, and the judge problem

To catch the inference gap, you add a judge: an LLM scores whether the answer is actually correct, not just supported. This is where most teams close the "grounded but wrong" hole. Patronus AI, Galileo, and Future AGI provide hallucination and correctness scoring built on this approach, and Braintrust and PromptFoo let you define and run custom judge-based scorers against datasets.
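A custom judge-based scorer can be sketched roughly as follows. The prompt wording, the three-way verdict, and the `call_llm` callable are hypothetical: `call_llm` stands in for whatever model client you use, and none of this reflects the API of the products named above. The one deliberate design choice is that unparseable or unexpected judge output maps to "unsure" instead of a pass.

```python
import json
from typing import Callable

# Hypothetical rubric; tune the wording against your own human-graded examples.
JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Is the conclusion in the answer correct given the context, not merely mentioned in it?
Reply with JSON only: {{"verdict": "correct" | "incorrect" | "unsure", "reason": "<one sentence>"}}"""

VALID_VERDICTS = {"correct", "incorrect", "unsure"}

def judge_correctness(question: str, answer: str, context: str,
                      call_llm: Callable[[str], str]) -> dict:
    """Ask an LLM judge for a verdict; fall back to 'unsure' on malformed output."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw = call_llm(prompt)
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "unsure", "reason": "judge returned unparseable output"}
    if not isinstance(result, dict) or result.get("verdict") not in VALID_VERDICTS:
        return {"verdict": "unsure", "reason": "judge returned an unexpected verdict"}
    return result

# Stub standing in for a real model call, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return '{"verdict": "incorrect", "reason": "The policy covers annual plans only."}'

verdict = judge_correctness(
    question="Can I get a refund on my monthly plan?",
    answer="Yes, refunds are available within 30 days.",
    context="Refunds are available within 30 days of purchase for annual plans.",
    call_llm=fake_llm,
)
print(verdict["verdict"])
```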

The catch: if your evaluation depends on an LLM judge, you have the same hallucination problem one layer up. Judges hallucinate too. The honest mitigation is calibration. Compare the judge against a human-graded baseline, measure agreement, and re-baseline whenever the judge model updates. When agreement drifts past your tolerance, tighten the rubric. Or stop treating the judge as pass/fail and use it to flag responses for human review instead. A judge you have never compared to humans is not a measurement. It is a second opinion you have not checked.
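Measuring agreement can be as simple as the sketch below, which computes raw agreement and Cohen's kappa (agreement corrected for chance) between judge labels and human labels on the same responses. The label lists and the 0.6 kappa tolerance are made-up placeholders; pick a tolerance that fits your own risk.

```python
from collections import Counter

def agreement_and_kappa(judge: list[str], human: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two label lists over the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical calibration sample: the same 8 responses graded by humans and by the judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

observed, kappa = agreement_and_kappa(judge, human)
print(observed, round(kappa, 3))  # 0.75 0.467

MIN_KAPPA = 0.6  # placeholder tolerance, not a recommendation
judge_mode = "pass/fail gate" if kappa >= MIN_KAPPA else "flag for human review"
print(judge_mode)  # flag for human review
```

Rerunning the same script against the same human-graded sample after every judge model update is one way to make re-baselining routine.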

Layer 3: ground-truth checks for high stakes

Some responses are expensive to get subtly wrong, such as those in regulated or safety-critical domains. For those, the layer above the judge is a comparison against ground truth: a gold answer set, or factuality checks against a curated knowledge base. This is the most expensive layer to build and maintain. That is the point. You spend it where the cost of a wrong answer justifies it, not everywhere.
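A gold-set check can start out very simple. The sketch below assumes a hypothetical in-house gold set that maps each normalized question to the key facts a correct answer must contain, and passes an answer only if every fact appears in it. The `GOLD_SET` contents and the containment rule are illustrative assumptions. Real gold sets usually need richer matching for paraphrases, units, and negation.

```python
import re
from typing import Optional

def normalize(text: str) -> str:
    """Lowercase and collapse to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical gold set: normalized question -> facts a correct answer must state.
GOLD_SET = {
    "what is the refund window for annual plans": ["30 days"],
}

def check_against_gold(question: str, answer: str) -> Optional[bool]:
    """True/False if the question has a gold entry, None if it is not covered."""
    required = GOLD_SET.get(normalize(question))
    if required is None:
        return None
    # Pad with spaces so "30 days" does not match inside "130 days".
    padded = f" {normalize(answer)} "
    return all(f" {normalize(fact)} " in padded for fact in required)

question = "What is the refund window for annual plans?"
print(check_against_gold(question, "Annual plans can be refunded within 30 days."))  # True
print(check_against_gold(question, "Annual plans can be refunded within 60 days."))  # False
print(check_against_gold("Do you ship to Norway?", "Yes."))                          # None
```

Returning `None` for uncovered questions keeps "not checked" distinct from "checked and wrong", which matters when the gold set covers only the high-stakes slice of traffic.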

Layer 0: let users catch it (the cheapest fix)

The highest-leverage change is often not an eval tool at all: show the exact chunk that informed each part of the response, directly in the UI. When users can see the source behind a claim, they catch the cases where the source does not actually support it. No automated detector required. The same approach works for internal QA review. Surfacing provenance reduces how much you have to lean on automated hallucination detection in the first place. It is also usually cheaper to build than a calibrated judge pipeline.
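One way to make this possible is to carry chunk references alongside each part of the answer instead of returning a single string. The sketch below is a rough illustration: the `Chunk` and `AnswerSegment` types, the field names, and the file name are hypothetical, and a real UI would render links or hover cards, not indented text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source: str  # e.g. a document title or path
    text: str

@dataclass(frozen=True)
class AnswerSegment:
    text: str
    chunk_ids: tuple[str, ...]  # chunks that informed this part of the answer

def render_with_provenance(segments: list[AnswerSegment], chunks: list[Chunk]) -> str:
    """Plain-text rendering: each answer segment followed by its source chunks."""
    by_id = {c.chunk_id: c for c in chunks}
    lines = []
    for seg in segments:
        lines.append(seg.text)
        if not seg.chunk_ids:
            # Make unsourced statements visible instead of hiding them.
            lines.append("    [no source attached]")
        for cid in seg.chunk_ids:
            chunk = by_id.get(cid)
            if chunk is None:
                lines.append(f"    [missing chunk: {cid}]")
            else:
                lines.append(f'    [{chunk.source}] "{chunk.text}"')
    return "\n".join(lines)

# Hypothetical example data.
chunks = [Chunk("c1", "refund-policy.md",
                "Refunds are available within 30 days of purchase for annual plans.")]
segments = [
    AnswerSegment("Annual plans can be refunded within 30 days.", ("c1",)),
    AnswerSegment("Monthly plans follow the same policy.", ()),
]
rendered = render_with_provenance(segments, chunks)
print(rendered)
```

In this example the second segment has no source at all, which is exactly the kind of statement a reviewer should be able to spot at a glance.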

RAG evaluation tools at a glance

Tool | Layer | What it does | Open source
RAGAS | 1 | Reference-free RAG metrics (faithfulness, relevance, context precision/recall) | Yes
TruLens | 1 | RAG triad scoring plus tracing to locate the failing step | Yes
DeepEval | 1 | RAG metrics in a pytest-style framework for your test suite | Yes
Patronus AI | 2 | Hallucination and correctness scoring via LLM-as-judge | No
Galileo | 2 | Evaluation and guardrails with correctness scoring | No
Future AGI | 2 | Eval, simulation, and guardrails in one platform | Yes
Braintrust | 2 | Custom judge-based scorers run against datasets | No
PromptFoo | 2 | Define and run custom evals, including judge-based scorers | Yes

Open-source status as of June 2026. Layer 3 (ground-truth checks) is typically built in-house against your own gold set, not a single tool.

How much of this do you actually need?

Match the depth to the stakes. Before building heavy hallucination detection, confirm the failure is actually happening at a rate that matters. Do not just hypothesize it. For low-stakes retrieval ("find me the doc"), Layer 1 plus good source provenance is often enough, and user feedback handles the long tail. For high-stakes RAG, the full stack is justified. Over-investing in detection for a rare failure that is cheap to catch another way is its own kind of waste.

A practical progression: start with Layer 1 reference-free scoring and UI provenance, measure whether grounded-but-wrong answers are actually reaching users, and add a calibrated judge (Layer 2) only when the data shows you need it. Reserve ground-truth checks (Layer 3) for the responses where a wrong answer is genuinely costly.
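The progression above can be written down as a small decision rule. This is only a sketch: the function name, the inputs, and especially the 2% threshold are hypothetical placeholders, and the grounded-but-wrong rate is assumed to come from a human-reviewed sample of production responses.

```python
def recommended_layers(high_stakes: bool, grounded_but_wrong_rate: float,
                       judge_threshold: float = 0.02) -> list[int]:
    """Return the evaluation layers (0-3) suggested by the stakes and the measured failure rate."""
    if not 0.0 <= grounded_but_wrong_rate <= 1.0:
        raise ValueError("rate must be a fraction between 0 and 1")
    layers = [0, 1]  # UI provenance and reference-free scoring are the baseline
    if high_stakes or grounded_but_wrong_rate >= judge_threshold:
        layers.append(2)  # calibrated LLM judge
    if high_stakes:
        layers.append(3)  # ground-truth checks where wrong answers are costly
    return layers

print(recommended_layers(high_stakes=False, grounded_but_wrong_rate=0.0))   # [0, 1]
print(recommended_layers(high_stakes=False, grounded_but_wrong_rate=0.05))  # [0, 1, 2]
print(recommended_layers(high_stakes=True, grounded_but_wrong_rate=0.0))    # [0, 1, 2, 3]
```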

Frequently asked questions

What does RAGAS not catch?

RAGAS faithfulness scores whether the retrieved chunks support the answer's claims. It does not score whether the inference drawn from those chunks is correct. An answer can cite a real, relevant chunk, pass faithfulness, and still be wrong because it reasoned badly from a true source. That "grounded but incorrect" case is the gap most production teams hit, and it needs a layer beyond faithfulness.

Can I trust an LLM judge to evaluate RAG?

Only after calibrating it. An LLM judge has the same hallucination risk one layer up, so treat its score as a signal, not ground truth. Calibrate the judge against a human-graded baseline, measure judge-versus-human agreement, and re-baseline whenever the judge model updates. When agreement drifts, either tighten the rubric or downgrade the judge from pass/fail to "flag for human review." A judge you have never compared to humans is not a measurement.

How do I evaluate RAG without ground-truth data?

Start with reference-free metrics that compare the answer to the retrieved context rather than to a gold answer: faithfulness, context relevance, and answer relevance (TruLens calls the equivalent set the RAG triad, with groundedness in place of faithfulness). These catch unsupported claims and off-topic retrieval without labels. For correctness of inference you eventually need either a human-graded set or a curated knowledge base to check against, but reference-free scoring is a useful first layer while you build that set.

Is RAGAS or DeepEval better for RAG evaluation?

They overlap and many teams use both. RAGAS is focused on the RAG-specific metric set and is a common baseline. DeepEval is a broader, pytest-style evaluation framework that includes RAG metrics alongside general LLM tests, which fits teams that want evals in their existing test suite. The choice is less about which is better and more about whether you want a RAG-focused library or a general eval framework that also covers RAG.

How much RAG evaluation does my application actually need?

Match the depth to the stakes. For low-stakes retrieval ("help me find a document"), lightweight reference-free scoring plus clear source provenance in the UI is often enough, and user feedback catches the rest. For high-stakes domains (medical, legal, financial), layered evaluation with calibrated judges and ground-truth checks is worth the cost. Over-investing in hallucination detection for a failure that occurs rarely and is cheap to catch by other means is a common mistake.
