
True, Relevant, and Wrong: The Applicability Problem in RAG

Retrieval Augmented Generation is often sold as a safety upgrade — mainly as a way to help reduce hallucinations. A RAG system is expected to retrieve official docs, cite sources, and not make things up. The appeal is conceptually simple: ground the model in trusted text and hallucinations largely disappear. That promise is not just casual hype. A 2024 Stanford study of AI legal research tools notes that vendors market RAG as authoritative grounding, including a Thomson Reuters executive’s claim that it can reduce hallucinations “to nearly zero” (Magesh et al., 2024, Stanford RegLab/HAI; see also the 2024 arXiv preprint).

That story can be true in small, stable domains, where the governing policies barely diverge, edge cases are rare, and yesterday’s answer is usually safe today. But as the corpus grows and the business matures, the ground truth stops being a single body of text and becomes a set of variants.

At scale, the dominant failure mode changes. Organizations accumulate conditional truths. Policies split by region, eligibility, plan tier, rollout cohort, product version, internal organization, and effective date. Each document can remain “correct,” yet the system can still produce the most expensive kind of failure: an answer that is well cited, authoritative, and completely wrong.

So the core question shifts.

  • Not “is this statement supported by a source?”
  • But “does this supported statement apply here?”

The problem is applicability: whether correct information governs this situation, right now.

The principle is simple. Generate a response only after selecting the correct branch of reality. The execution is hard because the branching variables are often implicit, scattered across prose, and missing from the initial question.

One question, many realities

Let’s imagine an electric utility company that sells small appliances and administers warranty eligibility and replacement programs. It ships a support chatbot to handle product questions and warranty claims.

Now let’s imagine a customer asks:

“My toaster isn’t working — can I get a replacement?”

In a small corpus, retrieval feels like safety. There is only one version of the answer, so finding the right topic means finding the right answer: “Toaster replacement” content is retrieved, the model summarizes it, and a citation is attached. In a mature support corpus, the same question hides the variables that determine the correct answer, including the model of the toaster, when it was bought, where the customer lives, and which program issued it.

A naive RAG system retrieves a generally relevant warranty chunk, for example “fails within 24 months,” and produces a confident “yes,” with steps and a citation. The claim is then rejected by the actual eligibility workflow because a different, still valid, policy branch applies.

Nothing here was fabricated. The failure is selection: the system never chose which policy branch it was answering for.

This is also why “relevance” stops meaning “right” as the corpus grows. Similarity search is optimized for topical alignment, text about “toaster replacement,” not for compatibility with an eligibility state, configuration, or timeline.
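To make the selection failure concrete, here is a minimal sketch of a naive retriever that ranks purely by topical similarity. The toy corpus, the keyword-overlap stand-in for embedding similarity, and the scope fields are illustrative assumptions, not a real system:

```python
# Minimal sketch: a naive retriever that ranks purely by topical similarity.
# The corpus, the keyword-overlap scoring, and the scope fields are illustrative.
CORPUS = [
    {"id": "warranty-retail",
     "text": "Toaster replacement: retail units are covered for 24 months from purchase.",
     "scope": {"program": "retail"}},
    {"id": "warranty-program-x",
     "text": "Toaster replacement: Program X appliances are covered for 12 months from enrollment.",
     "scope": {"program": "program_x"}},
]

def topical_score(query: str, text: str) -> float:
    """Stand-in for embedding similarity: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)

def naive_retrieve(query: str, k: int = 1) -> list[dict]:
    """Rank by topic only; the customer's actual program never enters the decision."""
    return sorted(CORPUS, key=lambda d: topical_score(query, d["text"]), reverse=True)[:k]

top = naive_retrieve("my toaster is not working, can I get a replacement?")[0]
print(top["id"], top["scope"])
# Both documents are topically near-identical, so whichever happens to rank first
# gets summarized and cited, whether or not its scope matches this customer.
```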

At scale, the most misleading outputs often look the most professional. Retrieval pulls multiple correct chunks that assume different conditions, and the model blends them into one fluent procedure. We end up with a “franken-answer”: each chunk may be true on its own, but the combined answer becomes incoherent because it spans mutually incompatible assumptions.

What a "franken-answer" looks like

A naive RAG system might retrieve several correct snippets from different policy branches and blend them into one answer that sounds complete but contradicts itself.

Franken-answer (bad): "Yes—you can get a replacement toaster if your unit failed within 24 months of purchase. Complete the in-chat troubleshooting checklist, and we'll confirm eligibility. If approved, we'll ship a replacement or issue a voucher if shipping isn't available in your area. Appliances from Program X are covered for 12 months from enrollment and must be registered first.

Next steps:

  1. Share your region and purchase details.
  2. Confirm retail purchase or Program X.
  3. Provide proof of purchase and registration (if applicable).
  4. After troubleshooting, we'll ship your replacement (or send a voucher).

Sources: Warranty policy (24 months); Region B fulfillment (voucher); Program X addendum (12 months + registration); Updated workflow (Jan 1)."

Why this fails:

  • Promises 24-month replacement and 12-month Program X coverage—without choosing which applies.
  • Says we'll ship a replacement and only offer vouchers in some regions—without selecting which.
  • Mixes "purchase date" and "enrollment date" logic.
  • Blends the post-Jan 1 troubleshooting requirement into all cases, even when it doesn't apply.

This is the applicability problem at its worst: nothing is invented, everything is cited, yet no real customer can follow the combined procedure.

The missing step: scope becomes a retrieval constraint

When teams say “RAG works,” the claim often means: given the right page, the model can summarize it. At scale, the hard work moves earlier. The system must decide which page is eligible to be right for the current case.

That requires representing scope as machine-usable constraints. Think region, plan tier, enrollment status, channel, product or version, and effective date.

This is not a call for “more context” in the prompt. It is a call to change candidate selection — and the gap between what retrieval optimizes for (topical similarity) and what correctness requires (scope compatibility) only widens as the corpus grows.

Scale makes this gap visible because “truth” often lives outside the prose being embedded. Large commercial systems routinely encode policy in configuration, cohorts, and feature-flag rules that never appear verbatim in customer-facing documentation. This is a familiar pattern from feature-flagged software (Kästner et al., ICSE-SEIP 2020). The retrieval layer may return impeccably written policy text while missing the operational condition that selects which policy governs.

What is missing is a compatibility envelope: the set of conditions that make an answer applicable to a specific case. Outside that envelope, a retrieved document is not “less relevant” — it is a disallowed truth. And most RAG architectures have no representation of this envelope at all.
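One way to make the envelope concrete is to attach a small constraint object to each chunk and treat it as an admission test rather than a ranking signal. This is a sketch under assumed field names (`program`, `region`), not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Envelope:
    """The conditions under which a chunk is allowed to be true for a given case."""
    constraints: dict = field(default_factory=dict)

    def applies_to(self, case: dict) -> bool:
        # Admissible only if every declared constraint matches a known fact of the case.
        # Unknown facts fail closed: if the case does not state the value, it does not match.
        return all(case.get(key) == value for key, value in self.constraints.items())

chunk = {
    "text": "Program X appliances are covered for 12 months from enrollment.",
    "envelope": Envelope({"program": "program_x"}),
}
case = {"program": "retail", "region": "B"}

# Outside the envelope the chunk is not "less relevant": it is excluded outright.
print(chunk["envelope"].applies_to(case))  # False
```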

The facets of applicability

The toaster example illustrates one axis of the problem — branching by customer attributes. But applicability fractures along many axes simultaneously, and each one introduces its own class of failure. Here are some of the facets we’ll discuss in this series:

Temporal applicability

Documents that were true last month may not be true today — and the corpus often contains both versions. This is not simply an “effective date” problem. Older documents tend to have richer detail, more internal links, and more embedding weight from historical usage. In many corpora, stale truth actively outranks current truth because the retrieval layer has no concept of temporal validity windows. The January policy update does not automatically suppress the December version it replaced. Both coexist, and similarity search has no basis for preferring one over the other.
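A minimal sketch of what a temporal validity window could look like, assuming each document carries illustrative `effective_from` and `effective_to` fields the retrieval layer can check against the query date:

```python
from datetime import date

# Illustrative documents: both are "true", but only one is in force on the query date.
POLICIES = [
    {"id": "workflow-2023", "effective_from": date(2023, 1, 1), "effective_to": date(2023, 12, 31),
     "text": "Replacements ship after a claim is filed."},
    {"id": "workflow-2024", "effective_from": date(2024, 1, 1), "effective_to": None,
     "text": "Replacements require in-chat troubleshooting before shipping."},
]

def in_force(doc: dict, as_of: date) -> bool:
    """Temporal validity window: a document applies only within its effective dates."""
    started = doc["effective_from"] <= as_of
    not_ended = doc["effective_to"] is None or as_of <= doc["effective_to"]
    return started and not_ended

current = [d["id"] for d in POLICIES if in_force(d, date(2024, 3, 15))]
print(current)  # ['workflow-2024']; the superseded version coexists but no longer governs
```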

Compositional applicability

The franken-answer above demonstrates this, but it is a general phenomenon worth naming explicitly. A single response often requires assembling facts from multiple retrieved chunks, each carrying its own implicit scope. No individual chunk is wrong. The act of combining them is the error. This is deeply hard to detect because it is invisible to any per-chunk evaluation — each source checks out, but the composite answer describes a procedure that no real customer can follow. The failure lives in the seams between documents, not in the documents themselves.
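A pre-composition check is one way to catch this seam-level failure: before blending chunks, verify that their scopes are mutually compatible. The sketch below assumes each chunk carries a simple `scope` dict; real systems would need richer scope representations:

```python
def scopes_conflict(a: dict, b: dict) -> bool:
    """Two chunks conflict if they pin the same dimension to different values."""
    shared = set(a) & set(b)
    return any(a[key] != b[key] for key in shared)

def composable(chunks: list[dict]) -> bool:
    """Refuse to blend chunks drawn from mutually incompatible branches."""
    for i, first in enumerate(chunks):
        for second in chunks[i + 1:]:
            if scopes_conflict(first["scope"], second["scope"]):
                return False
    return True

retrieved = [
    {"id": "retail-24mo", "scope": {"program": "retail"}},
    {"id": "programx-12mo", "scope": {"program": "program_x"}},
    {"id": "region-b-voucher", "scope": {"region": "B"}},
]

# The first two chunks assume different programs; composing all three is exactly
# how the franken-answer above gets assembled.
print(composable(retrieved))  # False
```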

Implicit conditions

Some conditions that determine applicability are stated directly in the text (e.g., "for customers enrolled after January 1st"). A retrieval system can at least surface that sentence, even if it cannot enforce it. But many conditions are never written down at all. They exist only in configuration systems, enrollment databases, or internal tooling.

Return to the toaster example. Whether a customer qualifies under the standard warranty or the utility program may depend on an enrollment flag in a CRM that was set during onboarding. No document says "if enrollment_flag = PROGRAM_X, use the 12-month policy." The customer-facing documentation simply describes each program's terms separately, assuming the reader already knows which one they are in.

The retrieval corpus contains the consequences of the enrollment flag — two different policy documents — but not the branching condition itself. The system retrieves warranty text with no way to determine which warranty applies, because the condition that resolves the ambiguity was never written down. When the knowledge that determines applicability does not exist in the text being retrieved, no amount of retrieval optimization can surface it.
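If the branching condition lives only in an operational system, the case has to be enriched before candidate selection. The sketch below is hypothetical: `CRM`, `resolve_program`, and the `enrollment_flag` values are illustrative stand-ins for whatever internal system actually holds the condition:

```python
# Hypothetical CRM record: the branching condition exists only as an enrollment flag,
# never as retrievable prose.
CRM = {"cust-001": {"enrollment_flag": "PROGRAM_X"}}

def resolve_program(customer_id: str) -> str:
    """Turn an operational flag into a scope attribute the retrieval layer can filter on."""
    flag = CRM.get(customer_id, {}).get("enrollment_flag")
    return "program_x" if flag == "PROGRAM_X" else "retail"

def build_case(customer_id: str, stated: dict) -> dict:
    """Merge what the user said with what only internal systems know."""
    return {**stated, "program": resolve_program(customer_id)}

case = build_case("cust-001", {"region": "B"})
print(case)  # {'region': 'B', 'program': 'program_x'}
# Only with this enrichment can candidate selection choose between the
# 24-month retail branch and the 12-month Program X branch.
```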

Authority conditions

Not all documents carry equal weight, even when they describe the same topic. A CEO's product directive and a junior PM's feature spec may both discuss the same initiative, but they are not interchangeable sources of truth. Authority depends on who created the document, in what capacity, and with what organizational standing. Yet the retrieval layer flattens this entirely — every chunk is an equally weighted string of text. When two documents conflict, the system has no basis for choosing the authoritative source over the informal one. Worse, the informal document may be longer, more detailed, and more semantically rich, causing it to outrank the authoritative source in similarity search. The organizational hierarchy that determines which document governs is invisible to the embedding.
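One possible mitigation is to break conflicts between admissible documents by an explicit authority rank rather than by similarity score. The ranks and source labels below are assumptions for illustration; in practice they would come from document provenance, not from the text:

```python
# Illustrative authority ranks; real systems would derive these from provenance
# (owner, role, review status), not from the document text itself.
AUTHORITY = {"executive_directive": 3, "policy_team": 2, "informal_spec": 1}

docs = [
    {"id": "pm-feature-spec", "source": "informal_spec", "similarity": 0.91,
     "text": "Replacements are approved automatically."},
    {"id": "warranty-policy-v7", "source": "policy_team", "similarity": 0.84,
     "text": "Replacements require an eligibility review."},
]

def pick_governing(conflicting: list[dict]) -> dict:
    """Break conflicts by authority first and similarity second,
    the opposite of what pure similarity ranking does."""
    return max(conflicting, key=lambda d: (AUTHORITY[d["source"]], d["similarity"]))

print(pick_governing(docs)["id"])  # 'warranty-policy-v7', despite the lower similarity
```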

Granularity mismatch

The user’s question operates at one level of specificity (“can I get a replacement?”) while the corpus stores truth at many levels — some hyper-specific (model X, region B, after January 1), some general (“all appliances carry a manufacturer warranty”). The system must determine which granularity level is appropriate, and getting this wrong in either direction is a failure. Too specific and you apply a narrow exception as if it were the general rule. Too general and you give advice that does not hold for the specific case. The correct level of granularity is not a property of the question or the document — it is a property of their intersection.
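A simple heuristic for that intersection: among the documents whose conditions actually hold for this case, prefer the most specific one, and fall back to the general rule otherwise. A sketch, with illustrative `scope` dicts:

```python
def applies(scope: dict, case: dict) -> bool:
    """A document is admissible only if every condition it declares holds for the case."""
    return all(case.get(key) == value for key, value in scope.items())

def most_specific(docs: list[dict], case: dict) -> dict | None:
    """Among admissible documents, prefer the one that pins the most dimensions:
    a narrow exception wins only when its conditions actually hold."""
    admissible = [d for d in docs if applies(d["scope"], case)]
    return max(admissible, key=lambda d: len(d["scope"]), default=None)

docs = [
    {"id": "general-warranty", "scope": {}},                             # all appliances
    {"id": "model-x-region-b", "scope": {"model": "X", "region": "B"}},  # narrow exception
]

print(most_specific(docs, {"model": "X", "region": "B"})["id"])  # model-x-region-b
print(most_specific(docs, {"model": "Y", "region": "A"})["id"])  # general-warranty
```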

Ambiguity

Users do not know what they do not know. “My toaster is broken, can I get a replacement?” feels like a complete question. It is massively underspecified — but only relative to the corpus’s branching structure, not relative to everyday language. Standard query disambiguation handles linguistic ambiguity (“bank” means financial institution or riverbank). Applicability disambiguation is different: the question is linguistically clear, but the answer space branches in ways the user cannot anticipate because they do not know the topology of the policy corpus. The system must recognize underspecification that the user has no reason to suspect exists.
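Underspecification can be detected relative to the corpus itself: if admissible candidates disagree along a dimension the case does not supply, the system should ask rather than guess. A sketch, with assumed scope dimensions:

```python
def unresolved_dimensions(candidates: list[dict], case: dict) -> set[str]:
    """Dimensions where admissible branches disagree and the case supplies no value:
    the question is underspecified relative to the corpus, not relative to language."""
    dimensions = {key for doc in candidates for key in doc["scope"]}
    unresolved = set()
    for dim in dimensions:
        values = {doc["scope"][dim] for doc in candidates if dim in doc["scope"]}
        if len(values) > 1 and dim not in case:
            unresolved.add(dim)
    return unresolved

candidates = [
    {"id": "retail-24mo", "scope": {"program": "retail"}},
    {"id": "programx-12mo", "scope": {"program": "program_x"}},
]

# "My toaster is broken" carries no program signal, so the system should ask a
# clarifying question here rather than guess a branch.
print(unresolved_dimensions(candidates, {"region": "B"}))  # {'program'}
```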

Path convergence

Multiple policy branches can produce the same surface-level answer — “yes, you qualify for a replacement” — via entirely different logic paths, with different downstream consequences: shipping versus vouchers, different documentation requirements, different timelines. The system can appear correct while having reasoned through the wrong branch. The error surfaces only later, in a rejected claim or a confused follow-up. This makes evaluation especially treacherous. Checking the final answer is not enough. You must verify that the answer was derived from the correct scope — and in many architectures, that reasoning path is never recorded.
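One way to make that reasoning path auditable is to record the resolved scope and governing documents alongside every answer. A minimal sketch; the field names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnswerTrace:
    """Record not just the answer but the branch it was derived from, so a rejected
    claim can later be traced to a scope error rather than a wording error."""
    answer: str
    resolved_scope: dict
    governing_docs: list[str]
    decided_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

trace = AnswerTrace(
    answer="Yes, you qualify for a replacement; a voucher will be issued.",
    resolved_scope={"program": "program_x", "region": "B"},
    governing_docs=["program-x-addendum", "region-b-fulfillment"],
)
print(trace.resolved_scope, trace.governing_docs)
```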

Applicability is an infrastructure problem

RAG retrieves what is written. Applicability decides what is allowed to be true. That distinction sounds philosophical until you try to fix it with prompting or reranking alone.

At scale, the dominant errors are not fabrication. They are category errors — selecting the wrong plan, the wrong date window, the wrong jurisdiction, the wrong release train. The system cites impeccable sources and still fails because it answered from outside the compatibility envelope. And the standard metrics miss it entirely. Citation rate and hallucination counts can look healthy while the system routinely selects the wrong branch and only discovers the mistake downstream, in a rejected claim or an escalation.

Catching this requires evaluation that tests for scope, not just grounding:

  • Near-duplicate questions where the correct answer flips based on region, plan, version, or date.
  • Prompts that deliberately omit key qualifiers.
  • Document sets that are individually correct but mutually incompatible.

The central metric is not fluency or even citation coverage. It is whether the system chose the correct scope before it started generating.
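A scope-flip test set is one way to measure this directly: near-duplicate cases that differ in a single attribute and must produce different answers. The cases, expected strings, and `answer_fn` placeholder below are illustrative assumptions, not a benchmark:

```python
# Scope-flip evaluation: near-duplicate cases that differ in one attribute and must
# produce different answers. `answer_fn` stands in for the system under test.
FLIP_CASES = [
    {"query": "Can I get a toaster replacement?",
     "case": {"program": "retail", "region": "A"}, "expected": "24-month replacement"},
    {"query": "Can I get a toaster replacement?",
     "case": {"program": "program_x", "region": "A"}, "expected": "12-month replacement"},
    {"query": "Can I get a toaster replacement?",
     "case": {"program": "retail", "region": "B"}, "expected": "voucher"},
]

def scope_accuracy(answer_fn) -> float:
    """Fraction of cases answered from the correct branch, not merely with a citation."""
    hits = sum(1 for c in FLIP_CASES if c["expected"] in answer_fn(c["query"], c["case"]))
    return hits / len(FLIP_CASES)

# A system that ignores scope gives the same fluent, well-cited answer to all three
# near-duplicates and scores poorly here, even with a perfect citation rate.
print(round(scope_accuracy(lambda query, case: "24-month replacement, see warranty policy"), 2))  # 0.33
```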

But measurement only tells you how often you fail. Fixing it requires building what most RAG architectures currently lack: a compatibility envelope — a layer that encodes what is allowed to be true, for whom, and under what conditions.

What comes next

In this post, we named the problem. The rest of this series builds toward a framework for solving it.

Every piece of knowledge carries properties that determine when it applies — temporal validity, scope conditions, prerequisites — but those properties are almost always implicit, buried in prose or missing from the corpus entirely. Before a system can enforce applicability, knowledge itself needs a meta-layer that makes these conditions explicit and machine-readable. That is where we start. (Part 2: Knowledge Needs a Meta-Layer)

With that foundation in place, we turn to the query. A user's question arrives underspecified in ways neither the user nor the system initially understands. Introspection is the process of examining that query to extract the signals that determine which knowledge applies — recognizing what is stated, what is missing, and what the system must resolve before it can safely answer. This is technically deep work, and the post that tackles it is the intellectual core of the series. We will introduce a framework drawn from thermodynamics to build intuition for how this works: a query starts hot, with many possible answers, and each extracted signal cools the system — collapsing the space of possibly-applicable knowledge until a confident answer becomes reachable, or the system recognizes it cannot get there yet. (Part 3: Introspection as Signal Extraction)

Signals alone are not enough. They must drive action. Disambiguation is the agentic layer that takes extracted signals and routes the query to the correct assistant or knowledge bases — selecting which sources are in scope, handling cases where multiple bases apply, and deciding when to branch or ask rather than guess. This is where the architecture becomes multi-agent and the challenge shifts from understanding the query to orchestrating a response across a system of specialized knowledge. (Part 4: Disambiguation as Applicability Routing)

Finally, we bring all of it together and ask what the full experience looks like for the developer building it and the user interacting with it: meta-knowledge, signal extraction, routing, and confidence thresholds composed into an end-to-end architecture that someone can confidently ship. (Part 5: The Meta-layer in practice)
