TruthLens: Real-Time Hallucination Mitigation
Version: 1.0
Authors: vLLM Semantic Router Team
Date: December 2025
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their tendency to generate hallucinations—fluent but factually incorrect or ungrounded content—remains a critical barrier to enterprise AI adoption. Industry surveys consistently show that hallucination risks are among the top concerns preventing organizations from deploying LLM-powered applications in production environments, particularly in high-stakes domains such as healthcare, finance, and legal services.
We propose TruthLens, a real-time hallucination detection and mitigation framework integrated into the vLLM Semantic Router. By positioning hallucination control at the inference gateway layer, TruthLens provides a model-agnostic, centralized solution that addresses the "accuracy-latency-cost" triangle through configurable mitigation strategies. Users can select from three operational modes based on their tolerance for cost and accuracy trade-offs: (1) Lightweight Mode—single-round detection with warning injection, (2) Standard Mode—iterative self-refinement with the same model, and (3) Premium Mode—multi-model cross-verification and collaborative correction. This design enables organizations to deploy trustworthy AI systems while maintaining control over operational costs and response latency.
1. Introduction: The Hallucination Crisis in Enterprise AI
1.1 The Core Problem
Hallucinations represent the most significant barrier to enterprise AI adoption today. Unlike traditional software bugs, LLM hallucinations are:
- Unpredictable: They occur randomly across different queries and contexts
- Convincing: Hallucinated content often appears fluent, confident, and plausible
- High-stakes: A single hallucination in medical, legal, or financial domains can cause irreversible harm
- Invisible: Without specialized detection, users cannot distinguish hallucinations from accurate responses
Industry Impact by Domain:
| Domain | Hallucination Risk Tolerance | Typical Mitigation Approach |
|---|---|---|
| Healthcare | Near-zero (life-critical) | Mandatory human verification, liability concerns |
| Financial Services | Very low (regulatory) | Compliance-driven review processes |
| Legal | Very low (liability) | Restricted to internal research and drafting |
| Customer Support | Moderate | Escalation protocols for uncertain responses |
| Creative/Marketing | High tolerance | Minimal intervention required |
Note: Based on enterprise deployment patterns observed across industry surveys (McKinsey 2024, Gartner 2024, Menlo Ventures 2024).
1.2 Why Existing Solutions Fall Short
Current approaches to hallucination mitigation operate at the wrong layer of the AI stack. Model-level techniques (e.g., fine-tuning for factuality) are tied to a specific model and must be repeated for every backend, while application-level safeguards (prompt engineering, per-application RAG checks) duplicate effort across teams and provide no centralized policy, cost control, or observability.
1.3 Why vLLM Semantic Router is the Ideal Solution Point
The vLLM Semantic Router occupies a unique position in the AI infrastructure stack that makes it ideally suited for hallucination mitigation:
Key Advantages of Gateway-Level Hallucination Control:
| Advantage | Description |
|---|---|
| Model-Agnostic | Works with any LLM backend without modification |
| Centralized Policy | Single configuration point for all applications |
| Cost Control | Organization-wide visibility into accuracy vs. cost trade-offs |
| Incremental Adoption | Policies can be enabled gradually, per routing decision and per domain |
| Observability | Unified metrics, logging, and alerting for hallucination events |
| Defense in Depth | Complements (not replaces) RAG and prompt engineering |
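To illustrate gateway-level policy control, the sketch below shows how per-domain mitigation policies might be expressed and resolved at the router. The domain names, field names, and `resolve_policy` helper are hypothetical and only illustrate the idea of a single, centrally managed configuration; they are not the actual Semantic Router configuration schema.

```python
# Hypothetical, centrally managed per-domain hallucination policies.
# Modes correspond to the Lightweight / Standard / Premium modes from the abstract.
HALLUCINATION_POLICIES = {
    "healthcare": {"mode": "premium", "response_threshold": 0.10},
    "finance": {"mode": "premium", "response_threshold": 0.15},
    "legal": {"mode": "standard", "response_threshold": 0.20},
    "customer_support": {"mode": "lightweight", "response_threshold": 0.40},
    "marketing": {"mode": "off", "response_threshold": 1.00},
}

def resolve_policy(domain: str) -> dict:
    """Resolve the policy for a routed domain, defaulting to the strictest setting."""
    return HALLUCINATION_POLICIES.get(
        domain, {"mode": "premium", "response_threshold": 0.10}
    )

print(resolve_policy("finance"))  # {'mode': 'premium', 'response_threshold': 0.15}
```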
1.4 Formal Problem Definition
We formalize hallucination detection in Retrieval-Augmented Generation (RAG) systems as a token-level sequence labeling problem.
Definition 1 (RAG Context). Let a RAG interaction be defined as a tuple (C, Q, R) where:
- C = {c₁, c₂, ..., cₘ} is the retrieved context (set of documents/passages)
- Q is the user query
- R = (r₁, r₂, ..., rₙ) is the generated response as a sequence of n tokens
Definition 2 (Grounded vs. Hallucinated Tokens). A token rᵢ in response R is:
- Grounded if there exists evidence in C that supports the claim containing rᵢ
- Hallucinated if rᵢ contributes to a claim that:
  - (a) Contradicts information in C (contradiction hallucination), or
  - (b) Cannot be verified from C and is not common knowledge (ungrounded hallucination)
Definition 3 (Hallucination Detection Function). The detection task is to learn a function:
f: (C, Q, R) → Y
where Y = (y₁, y₂, ..., yₙ) and yᵢ ∈ {0, 1} indicates whether token rᵢ is hallucinated.
Definition 4 (Hallucination Score). Given predictions Y and confidence scores P = (p₁, ..., pₙ) where pᵢ = P(yᵢ = 1), we define:
- Token-level score: s_token(rᵢ) = pᵢ
- Span-level score: For a contiguous span S = (rᵢ, ..., rⱼ), s_span(S) = max(pᵢ, ..., pⱼ)
- Response-level score: s_response(R) = 1 − ∏(1 − pᵢ), where the product runs over all i with pᵢ > τ_token
Definition 5 (Mitigation Decision). Given threshold τ, the system takes action:
- Action(R) = PASS if s_response(R) < τ
- Action(R) = MITIGATE if s_response(R) ≥ τ
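The scoring and decision rules above translate directly into a few lines of code. The sketch below is a minimal reading of Definitions 4 and 5, assuming token-level hallucination probabilities are already produced by a detector; the threshold values are placeholders, not recommended settings.

```python
import math

def response_score(token_probs: list[float], tau_token: float = 0.5) -> float:
    """Definition 4: s_response(R) = 1 - prod(1 - p_i) over tokens with p_i > tau_token."""
    flagged = [p for p in token_probs if p > tau_token]
    return 1.0 - math.prod(1.0 - p for p in flagged)

def span_score(span_probs: list[float]) -> float:
    """Definition 4: s_span(S) = max(p_i, ..., p_j) over a contiguous span."""
    return max(span_probs)

def mitigation_action(token_probs: list[float], tau: float = 0.3) -> str:
    """Definition 5: PASS if the response-level score stays below tau, else MITIGATE."""
    return "PASS" if response_score(token_probs) < tau else "MITIGATE"

# One strongly flagged token (p = 0.81) pushes the response over the threshold.
print(mitigation_action([0.05, 0.02, 0.81, 0.10]))  # MITIGATE
```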
2. Related Work: State-of-the-Art in Hallucination Mitigation
2.1 Taxonomy of Hallucination Types
Before reviewing detection methods, we establish a taxonomy of hallucination types:
Type 1: Intrinsic Hallucination — Generated content contradicts the provided context.
Example: Context says "The meeting is on Tuesday." Response says "The meeting is scheduled for Wednesday."
Type 2: Extrinsic Hallucination — Generated content cannot be verified from the context and is not common knowledge.
Example: Context discusses a company's Q3 earnings. Response includes Q4 projections not mentioned anywhere.
Type 3: Fabrication — Entirely invented entities, citations, or facts.
Example: "According to Smith et al. (2023)..." where no such paper exists.
| Type | Detection Difficulty | Mitigation Approach |
|---|---|---|
| Intrinsic | Easier (direct contradiction) | Context re-grounding |
| Extrinsic | Medium (requires knowledge boundary) | Uncertainty expression |
| Fabrication | Harder (requires external verification) | Cross-reference checking |
2.2 Detection Methods
| Category | Representative Work | Mechanism | Accuracy | Latency | Cost |
|---|---|---|---|---|---|
| Encoder-Based | LettuceDetect (2025), Luna (2025) | Token classification with ModernBERT/DeBERTa | F1: 75-79% | 15-35ms | Low |
| Self-Consistency | SelfCheckGPT (2023) | Multiple sampling + consistency check | Varies | N× base | High |
| Cross-Model | Finch-Zk (2025) | Multi-model response comparison | F1: +6-39% | 2-3x base | High |
| Internal States | MIND (ACL 2024) | Hidden layer activation analysis | High | <10ms | Requires instrumentation |
2.2.1 Encoder-Based Detection (Deep Dive)
LettuceDetect (Kovács et al., 2025) frames hallucination detection as token-level sequence labeling (a minimal inference sketch appears after the list below):
- Architecture: ModernBERT-large (395M parameters) with classification head
- Input: Concatenated [Context, Query, Response] with special tokens
- Output: Per-token probability of hallucination
- Training: Fine-tuned on RAGTruth dataset (18K examples)
- Key Innovation: Long-context handling (8K tokens) enables full RAG context inclusion
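The following is a minimal inference sketch of this setup using the Hugging Face transformers token-classification API. The checkpoint name, the prompt template, and the assumption that label index 1 means "hallucinated" are placeholders for illustration, not the official LettuceDetect release details.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder identifier; substitute the actual published checkpoint.
MODEL_ID = "path/to/lettucedetect-style-token-classifier"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
model.eval()

def token_hallucination_probs(context: str, query: str, response: str) -> list[float]:
    """Return P(hallucinated) for each token of the concatenated input."""
    # Concatenate [Context, Query, Response]; the exact template is an assumption.
    text = f"Context: {context}\nQuestion: {query}\nAnswer: {response}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        logits = model(**inputs).logits              # shape: (1, seq_len, num_labels)
    return torch.softmax(logits, dim=-1)[0, :, 1].tolist()  # prob of "hallucinated" label
```

In TruthLens, these per-token probabilities would feed directly into the response-level score and mitigation decision defined in Section 1.4.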
Performance on RAGTruth Benchmark:
| Model | Token F1 | Example F1 | Latency |
|---|---|---|---|
| LettuceDetect-large | 79.22% | 74.8% | ~30ms |
| LettuceDetect-base | 76.5% | 71.2% | ~15ms |
| Luna (DeBERTa) | 73.1% | 68.9% | ~25ms |
| GPT-4 (zero-shot) | 61.2% | 58.4% | ~2s |
Why Encoder-Based for TruthLens: The combination of high accuracy, low latency, and fixed cost makes encoder-based detection ideal for gateway-level deployment.
2.2.2 Self-Consistency Methods
SelfCheckGPT (Manakul et al., 2023) exploits the observation that hallucinations are inconsistent across samples:
- Mechanism: Generate N responses, measure consistency
- Intuition: Factual content is reproducible; hallucinations vary
- Limitation: Requires N LLM calls (typically N=5-10)
Theoretical Basis: Grounded facts are assigned high probability by the model and therefore recur across independent samples, whereas hallucinated claims are sampled with low per-sample probability and rarely repeat consistently.
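The sketch below illustrates this sampling-and-consistency idea in simplified form, assuming an `llm_sample` callable that returns one stochastic completion per call and using lexical overlap as a crude stand-in for the more sophisticated consistency scorers used by the actual SelfCheckGPT variants.

```python
def consistency_score(sentence: str, samples: list[str]) -> float:
    """Fraction of resampled responses whose wording supports the sentence (lexical proxy)."""
    words = set(sentence.lower().split())
    supported = sum(
        len(words & set(sample.lower().split())) / max(len(words), 1) > 0.5
        for sample in samples
    )
    return supported / max(len(samples), 1)

def selfcheck(sentences: list[str], llm_sample, n_samples: int = 5) -> list[float]:
    """Hallucination score per sentence: low cross-sample consistency => likely hallucinated."""
    samples = [llm_sample() for _ in range(n_samples)]  # the N extra LLM calls are the main cost
    return [1.0 - consistency_score(s, samples) for s in sentences]
```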
2.2.3 Cross-Model Verification
Finch-Zk (2025) leverages model diversity:
- Mechanism: Compare responses from different model families
- Key Insight: Different models hallucinate differently
- Segment-Level Correction: Replace inconsistent segments with the higher-confidence version
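A simplified sketch of segment-level cross-model verification in this spirit is shown below; the segment alignment, the lexical similarity test, and the per-segment confidence inputs are all assumptions that stand in for the richer comparison Finch-Zk performs.

```python
from difflib import SequenceMatcher

def segments_agree(seg_a: str, seg_b: str, threshold: float = 0.75) -> bool:
    """Treat two aligned segments as consistent if they are lexically close (crude proxy)."""
    return SequenceMatcher(None, seg_a.lower(), seg_b.lower()).ratio() >= threshold

def cross_model_correct(segs_a: list[str], segs_b: list[str],
                        conf_a: list[float], conf_b: list[float]) -> list[str]:
    """Keep segments the two model families agree on; where they disagree,
    take the segment backed by the higher per-segment confidence."""
    corrected = []
    for sa, sb, ca, cb in zip(segs_a, segs_b, conf_a, conf_b):
        if segments_agree(sa, sb):
            corrected.append(sa)                      # agreement: keep as-is
        else:
            corrected.append(sa if ca >= cb else sb)  # disagreement: higher confidence wins
    return corrected
```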