Hallucination is not a prompting problem — it is a structural side effect of the RLHF training objective. This issue starts with a token-level risk map and decomposes "make the model lie less" into four engineerable layers.
The awkward 2025 reality: Claude 4.7 / GPT-5 still post 5–15% hallucination rates on SimpleQA and HaluEval, and the more specific the question, the more elaborate the fabrication. Even more counter-intuitive — models actually know they don't know. Kadavath et al. (2022) showed that the last hidden layer carries a strong P(True) signal, yet the generated output almost never says "I don't know." That gap isn't something prompt engineering can close — it's the structural result of RLHF reward models systematically preferring confident answers over abstentions. This issue assumes you already know what hallucination is and what RAG is (last week); we go straight into four engineering layers: ① understanding the RLHF bias mechanism → ② knowing which tokens are ~100% fabricated → ③ a three-layer grounding stack → ④ hallucination-aware eval. Every layer maps to actual code or config changes — not just "write a better prompt."
The core evidence is OpenAI's 2025 paper "Why Language Models Hallucinate": a controlled experiment showing the same base model has a higher hallucination rate after RLHF than before. Decomposed: (1) pretraining uses cross-entropy loss, so the optimal model matches the true probability distribution including "this token should have 30% probability of being X" — uncertainty is preserved; (2) RLHF uses pairwise reward, and annotators shown "I don't know" vs. "specific but possibly wrong" prefer the latter (because the former is "unhelpful") — the reward model encodes this preference; (3) during RL fine-tuning, the model abandons calibration to maximize reward, producing the most confident answer rather than the most truthful distribution.
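Point (1) can be checked numerically. The sketch below assumes a single binary token whose true probability is 0.3 (an illustrative number, not taken from the paper) and shows that the expected cross-entropy of a prediction q is lowest at q = 0.3, so this loss gives the model no reason to round its belief up to near-certainty.

```python
import math

def expected_cross_entropy(p_true: float, q: float) -> float:
    """Expected log loss when the event occurs with probability p_true
    and the model predicts probability q."""
    return -(p_true * math.log(q) + (1 - p_true) * math.log(1 - q))

def best_prediction(p_true: float, steps: int = 99) -> float:
    """Grid-search q over 0.01, 0.02, ..., 0.99 for the lowest expected loss."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return min(grid, key=lambda q: expected_cross_entropy(p_true, q))

P_TRUE = 0.3  # illustrative value, an assumption of this sketch
print("loss-minimizing prediction:", best_prediction(P_TRUE))
print("calibrated beats overconfident:",
      expected_cross_entropy(P_TRUE, 0.3) < expected_cross_entropy(P_TRUE, 0.99))
```

A pairwise reward that prefers the confident answer has no such property, which is why calibration can erode in step (3).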
Anthropic's Kadavath et al. 2022 made this concrete with linear probes: on the last hidden state of a base model, a few hundred samples train a P(True) probe with AUC > 0.85 — meaning the model internally knows it doesn't know. But the post-RLHF generation manifold has been collapsed toward a confident pole by the reward signal. So the engineering response is not "make the model smarter" — it is to restore the calibration signal, via logprobs, self-consistency, or an explicit prompt frame that gives abstention positive weight.
Give the model an "abstention-is-legitimate" frame and use logprobs to calibrate confidence (the frame works on both OpenAI and Anthropic; logprobs need an API that exposes them):
import math
from collections import Counter

import anthropic
from openai import OpenAI

client = anthropic.Anthropic()

# -- Key 1: system prompt explicitly rewards "I don't know" --
SYS = """You will be asked a factual question. Your response MUST start
with one of three tokens, then explain:
KNOWN: I am confident in the answer and can cite the source class.
UNSURE: I have partial information but cannot verify specifics.
UNKNOWN: I do not know; do not speculate.
Saying UNKNOWN when uncertain is rewarded, not penalized.
Hallucinating a specific answer is the worst outcome."""

def ask(q, temperature=None):
    # Pass temperature only when the caller sets it, so the default call is unchanged
    extra = {} if temperature is None else {"temperature": temperature}
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=300,
        system=SYS,
        messages=[{"role": "user", "content": q}],
        **extra,
    )
    return msg.content[0].text

# -- Key 2: extract token-level uncertainty via logprobs (OpenAI API) --
# Claude doesn't expose logprobs yet; OpenAI / Gemini / open-source do.
oa = OpenAI()

def answer_with_confidence(q):
    r = oa.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": q}],
        logprobs=True,
        top_logprobs=5,
    )
    choice = r.choices[0]
    text = choice.message.content
    token_lps = [t.logprob for t in (choice.logprobs.content or [])] if choice.logprobs else []
    if not token_lps:
        return text, None  # no logprobs returned: confidence unknown
    # exp(mean token logprob) = geometric-mean token probability (1 / perplexity),
    # used as a rough confidence proxy in (0, 1]
    avg_lp = sum(token_lps) / len(token_lps)
    conf = math.exp(avg_lp)
    if conf < 0.5:
        text = f"[LOW-CONF {conf:.2f}] " + text
    return text, conf

# -- Key 3: self-consistency as a fallback signal (without logprobs) --
# Run the same question 5x at temperature=0.7; high divergence → untrustworthy
def consistency_check(q, n=5, temperature=0.7):
    answers = [ask(q, temperature=temperature) for _ in range(n)]
    # Crude agreement signal: share of samples with the most common leading label
    # (KNOWN / UNSURE / UNKNOWN). Compare the extracted answers for a stronger check.
    labels = [a.split(":", 1)[0].strip().upper() for a in answers]
    agreement = Counter(labels).most_common(1)[0][1] / n
    return answers, agreement
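Averaging logprobs over a whole answer can hide the one token that was invented, since a single uncertain identifier is diluted by many confident filler tokens. A minimal sketch of a per-token alternative follows. It assumes you have already pulled the top-k logprob values for each position out of the API response; the (token, logprobs) pair shape and the 1.0-nat threshold are choices made for this sketch, not part of any API.

```python
import math

def token_entropy(top_logprobs: list[float]) -> float:
    """Shannon entropy (in nats) over the renormalized top-k alternatives."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_uncertain_positions(tokens, threshold=1.0):
    """tokens: list of (token_text, [top-k logprobs]) pairs (hypothetical shape).
    Returns the positions whose entropy exceeds the threshold."""
    flagged = []
    for text, lps in tokens:
        h = token_entropy(lps)
        if h > threshold:
            flagged.append((text, h))
    return flagged

# Synthetic example: one peaked position, one nearly uniform position
example = [
    ("Paris", [math.log(0.96)] + [math.log(0.01)] * 4),
    ("1706", [math.log(0.2)] * 5),
]
print(flag_uncertain_positions(example))
```

Only the near-uniform position is flagged, which is the behavior you want for identifier-like tokens.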
Why do these six token classes hallucinate so absurdly often? The root cause: they have weak statistical signatures in the pretraining corpus. URL strings, commit hashes — these are "high-entropy non-inferable" literal strings with no generalizable semantic regularity, only verbatim memorization. At positions the model hasn't memorized, next-token sampling generates "plausible-looking" characters (URL-shape correct, function-name style correct, date format correct), but the underlying distribution is nearly uniform — the model has no idea it's making things up. OpenAI 2025 quantifies this: when prompted to produce an "arXiv paper ID," gpt-4-class models output format-correct but nonexistent IDs in > 50% of cases.
Even more counter-intuitive is the specificity reverse principle: ask "list 3 papers on Transformers" — fabrication rate ~30%. Ask "list 3 papers from ICLR 2023 on Transformer attention optimization, with title, authors, arXiv ID" — fabrication rate > 80%. The more specific the question, the more hallucination is activated, because the prompt pushes the model into a specific but training-sparse region of state-space, and the only fallback is "style-completing the details." This contradicts human intuition — we expect that asking more specifically yields more precision — so engineering must make this a first-class assumption: actively detect, actively ground.
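One way to act on this at request time is to score how many verbatim-memorized details a prompt asks for and route high scorers to grounding. The sketch below is a heuristic under stated assumptions: the cue list, the category names, and the threshold of 2 are all hypothetical and would need tuning per domain.

```python
import re

# Hypothetical cue list: phrases that request verbatim-memorized details
SPECIFICITY_CUES = {
    "identifier": r"\b(arxiv id|doi|isbn|url|link|commit hash)\b",
    "attribution": r"\b(authors?|title|venue|page numbers?)\b",
    "exact_value": r"\b(exact|precise|specific) (date|number|figure|version)\b",
    "venue_year": r"\b(iclr|neurips|icml|acl|cvpr)\s*\d{4}\b",
}

def specificity_score(prompt: str) -> int:
    """Count how many high-risk details the prompt asks for."""
    text = prompt.lower()
    return sum(len(re.findall(pat, text)) for pat in SPECIFICITY_CUES.values())

def needs_grounding(prompt: str, threshold: int = 2) -> bool:
    """Route to the grounded pipeline when the prompt requests many specifics."""
    return specificity_score(prompt) >= threshold

broad = "list 3 papers on Transformers"
narrow = ("list 3 papers from ICLR 2023 on Transformer attention optimization, "
          "with title, authors, arXiv ID")
print(specificity_score(broad), needs_grounding(broad))
print(specificity_score(narrow), needs_grounding(narrow))
```

The two prompts are the ones from the paragraph above; the score only measures what is being asked for, not whether the model can answer it.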
Use regexes to detect high-risk tokens in the output, then verify, rewrite, or refuse when necessary (an LLM-as-judge pass can sit on top of the regex scan):
import re

# -- High-risk token class regexes (scan LLM output) --
RISK_PATTERNS = {
    "url":        r'https?://[^\s)]+',
    "arxiv":      r'arXiv:\d{4}\.\d{4,5}',
    "doi":        r'10\.\d{4,9}/[-._;()/:A-Za-z0-9]+',
    "commit":     r'\b[a-f0-9]{7,40}\b',  # broad: also hits long numbers and hex-looking words
    "fn_call":    r'\b[a-z_]+\.[a-z_]+\([^)]*\)',
    "exact_date": r'\b\d{4}-\d{2}-\d{2}\b',
    "page_num":   r'\b(?:p\.?|page)\s*\d{1,4}\b',
    "isbn":       r'\bISBN[-: ]?(?:\d{9}[\dX]|\d{13})\b',
}

def scan_high_risk(text):
    hits = {}
    for kind, pat in RISK_PATTERNS.items():
        m = re.findall(pat, text)
        if m:
            hits[kind] = m
    return hits

# -- Usage 1: post-generation audit --
def answer_with_risk_audit(q):
    raw = ask(q)
    risks = scan_high_risk(raw)
    if risks:
        # trigger grounding: verify each URL / DOI via web search tool
        # (verify_with_tools is a placeholder for that step; it is not defined here)
        verified = verify_with_tools(raw, risks)
        return verified
    return raw

# -- Usage 2: proactively counter specificity-reverse at prompt time --
SAFE_SPECIFICITY = """When asked for citations, URLs, function signatures,
or exact numeric details: DO NOT fabricate.
- If you cannot verify the exact reference, say "I recall a paper by [author class] on [topic]
around [approximate year] but cannot verify the exact title/ID."
- Prefer naming conventions over fabricated specifics:
"the original Transformer paper (Vaswani et al. ~2017)" is preferred over
"Vaswani et al., 'Attention is All You Need', arXiv:1706.03762" if uncertain."""
# Note: well-remembered facts keep their detail; uncertain ones degrade gracefully
[...] the bare pattern \b[a-f0-9]{7,40}\b also matches ordinary hex-looking strings such as "abcdef0", so combine it with context (require a nearby "git" / "commit" keyword); (3) treating specificity suppression as deleting all specifics. That is wrong: well-remembered high-frequency facts ("Python sort uses Timsort") should keep their detail; only degrade rare, low-frequency tokens; (4) treating "specificity reverse" as an iron law. It is a tendency, not an absolute. In domains with high training coverage (mainstream library APIs), specificity actually improves accuracy. Calibrate per domain.
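The context requirement for commit hashes can be sketched as follows. The keyword list and the 40-character window are assumptions made for illustration; widen or narrow both for your own corpus.

```python
import re

HEX_RUN = re.compile(r"\b[a-f0-9]{7,40}\b")
# Hypothetical keyword list signalling that a hex run is really a commit hash
CONTEXT = re.compile(r"\b(git|commit|sha|revert|cherry-pick)\b", re.IGNORECASE)

def find_commit_hashes(text: str, window: int = 40) -> list[str]:
    """Return hex runs that have a git-related keyword within `window` characters."""
    hits = []
    for m in HEX_RUN.finditer(text):
        start = max(0, m.start() - window)
        nearby = text[start:m.end() + window]
        if CONTEXT.search(nearby):
            hits.append(m.group())
    return hits

print(find_commit_hashes("Fixed in commit 3f2a9c1 last week."))
print(find_commit_hashes("The string abcdef0 is just a hex-looking word."))
print(find_commit_hashes("Order 20240115 shipped on time."))
```

Only the first sentence yields a hit; the other two contain hex-shaped runs with no git context nearby.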
The engineering rationale for each layer:
[...] enum: ["yes","no","unknown"] and the model is physically incapable of producing a fourth option. That is an order of magnitude stronger than a prompt-level constraint: it is a hard constraint, not advice.
Stacked ROI, per the Min et al. FActScore measurements: pure generator hallucination rate ~25%; adding tool grounding drops it to ~12%, adding structured output to ~7%, and adding a verification chain to ~5%. Each layer removes roughly 30–50% of the hallucinations that remain, and the failure modes don't overlap (what the tool check misses, structured output catches; what structured output misses, the verifier catches).
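The per-layer contribution can be worked out from the quoted rates alone. The sketch below uses only those four figures; the stage labels are shorthand chosen here.

```python
# Hallucination rates quoted in the text above
STAGES = [
    ("generator only", 0.25),
    ("+ tool grounding", 0.12),
    ("+ structured output", 0.07),
    ("+ verification chain", 0.05),
]

def relative_reductions(stages):
    """Fraction of the remaining hallucinations removed by each added layer."""
    out = []
    for (_, prev), (name, cur) in zip(stages, stages[1:]):
        out.append((name, 1 - cur / prev))
    return out

for name, cut in relative_reductions(STAGES):
    print(f"{name}: removes {cut:.0%} of what was left")

overall = 1 - STAGES[-1][1] / STAGES[0][1]
print(f"overall reduction: {overall:.0%}")
```

The reductions compound multiplicatively, which is why the later layers look small in absolute points but still matter.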
Minimal three-layer implementation (generator → tool verify → structured → verifier):
import json

import anthropic

client = anthropic.Anthropic()

# -- Layer A: tool fact-check --
TOOLS = [{
    "name": "verify_citation",
    "description": "Verify if a citation (arXiv ID, DOI, URL) actually exists.",
    "input_schema": {
        "type": "object",
        "properties": {
            "identifier": {"type": "string"},
            "kind": {"type": "string", "enum": ["arxiv", "doi", "url"]},
        },
        "required": ["identifier", "kind"],
    },
}]
VERIFIABLE_KINDS = {"arxiv", "doi", "url"}

def verify_citation(identifier, kind):
    # Real impl: call Semantic Scholar / arXiv API / HEAD request
    if kind == "arxiv":
        return arxiv_api.exists(identifier)  # arxiv_api: placeholder client, not shown
    # ...
    return False  # kinds without a checker count as unverified

# -- Layer B: structured output with strict schema --
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "string",
                       "enum": ["known", "unsure", "unknown"]},
        "citations": {"type": "array",
                      "items": {"type": "string",
                                "description": "Must be in verified_set or omitted."}},
    },
    "required": ["answer", "confidence"],
}
# The schema has to reach the model somehow; here it is attached as the
# input_schema of a forced tool call (submit_answer is a name chosen for this sketch).
ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Return the final structured answer.",
    "input_schema": ANSWER_SCHEMA,
}

def generate(q, verified_citations):
    msg = client.messages.create(
        model="claude-opus-4-7", max_tokens=800,
        tools=TOOLS + [ANSWER_TOOL],
        tool_choice={"type": "tool", "name": "submit_answer"},
        system=f"Cite only from this verified set: {verified_citations}. "
               "Set confidence=unknown rather than guessing. "
               "Return the answer by calling submit_answer.",
        messages=[{"role": "user", "content": q}],
    )
    # Return the tool input as a plain dict so it can be serialized downstream
    return next(b.input for b in msg.content if b.type == "tool_use")

# -- Layer C: verifier (different model / different prompt) --
def verify(answer_json, question):
    # Use Haiku as a cheap verifier; flip prompt angle to "find flaws"
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        system="You are a strict fact-checker. List every claim "
               "in the answer that CANNOT be verified from common knowledge. "
               "Output JSON: {unverifiable: [...]}",
        messages=[{"role": "user", "content":
                   f"Question: {question}\nAnswer: {json.dumps(answer_json)}"}],
    )
    raw = msg.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The verifier may wrap its JSON in prose; surface that instead of crashing
        return {"unverifiable": [], "parse_error": raw}

# -- Pipeline --
# quick_draft and annotate_unverified are application-specific helpers (not shown)
def grounded_answer(q):
    draft = quick_draft(q)                                # Layer 0: draft
    risks = scan_high_risk(draft)                         # from §02
    verified = {k: [v for v in vs if verify_citation(v, k)]
                for k, vs in risks.items()
                if k in VERIFIABLE_KINDS}                 # Layer A
    final = generate(q, verified)                         # Layer B (structured)
    flags = verify(final, q)                              # Layer C
    return annotate_unverified(final, flags)
[...] required: ["citations"], or the model is forced to fabricate; citations are always optional; (3) verifier prompt phrased as "is this correct": this triggers sycophancy, because the verifier tends to agree with the generator; phrase it as "list unverifiable claims" (steelman in reverse); (4) tool fact-check has no timeout: one slow API hangs the whole pipeline; use async + p95 cutoff; (5) overhead too high: three layers run serially means 3–4× latency; only run the full chain on high-risk queries (containing citation/numeric/date), and skip the verifier on simple ones.
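Pitfall (4) can be handled with the standard library alone. In the sketch below, fake_check stands in for a real lookup and the 0.2-second cutoff is arbitrary; in practice the cutoff would be set from the p95 latency you observe. A check that times out or raises is treated as "not verified", so a slow API degrades the answer instead of hanging the pipeline.

```python
import asyncio

async def verify_with_timeout(check, identifier, timeout_s=2.0):
    """Run an async verifier; timeouts and errors both count as 'not verified'."""
    try:
        return bool(await asyncio.wait_for(check(identifier), timeout=timeout_s))
    except asyncio.TimeoutError:
        return False
    except Exception:
        return False

async def verify_all(check, identifiers, timeout_s=2.0):
    """Check all identifiers concurrently, each with its own cutoff."""
    results = await asyncio.gather(
        *(verify_with_timeout(check, i, timeout_s) for i in identifiers)
    )
    return dict(zip(identifiers, results))

async def fake_check(identifier):
    """Stand-in for a real lookup (hypothetical): 'slow' simulates a hung API."""
    await asyncio.sleep(5.0 if identifier == "slow" else 0.01)
    return identifier != "missing"

print(asyncio.run(verify_all(fake_check, ["ok", "missing", "slow"], timeout_s=0.2)))
```

Because the checks run concurrently, total latency is bounded by the cutoff rather than by the sum of the individual calls.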
The root cause is benchmark design misalignment. SimpleQA, TriviaQA, and HaluEval results are usually reported as a headline accuracy = correct / total, which lumps "I don't know" together with "wrong answer" at zero points. The optimal strategy for the model becomes: never abstain, always be confident. This rewards exactly the bias RLHF already installed; eval and training objective push hallucination in the same direction.
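The incentive is easy to see as an expected-value calculation. The scoring rule below (+1 for a correct answer, minus a penalty for a wrong one, 0 for abstaining) is an assumption chosen for illustration; plain accuracy is the special case where the penalty is 0.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong.
    Abstaining always scores 0."""
    return p_correct - (1 - p_correct) * wrong_penalty

def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when guessing beats the abstention score of 0."""
    return expected_score(p_correct, wrong_penalty) > 0

def break_even(wrong_penalty: float) -> float:
    """Confidence above which answering has positive expected score."""
    return wrong_penalty / (1 + wrong_penalty)

# Plain accuracy (penalty 0): even a 5% guess beats abstaining
print(should_answer(0.05, wrong_penalty=0.0))
# Penalty 1: guessing only pays above 50% confidence
print(should_answer(0.05, wrong_penalty=1.0), break_even(1.0))
# Penalty 3: guessing only pays above 75% confidence
print(break_even(3.0))
```

With a zero penalty the break-even confidence is 0, so a model optimized against that metric has no reason ever to abstain.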
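The 2024–2025 academic consensus shifted to the selective prediction framework: split the model into two components: (1) answer function: produce a concrete answer; (2) confidence function: self-rate reliability. Evaluation trio: (a) the coverage-accuracy curve (accuracy over the most confident fraction of answers, summarized by its AUC); (b) ECE (Expected Calibration Error: the gap between stated confidence and observed accuracy); (c) abstention F1 (whether the model abstains on the questions it should abstain on).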
The key insight from OpenAI's 2025 Why-Hallucinate paper: under selective scoring, the "reasonable ranking" of all major models reshuffles — some benchmark leaders are calibration-bottom-feeders. If your application is hallucination-sensitive (medicine, law, finance, research), you cannot trust benchmark rankings; you must run selective eval yourself.
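A toy comparison shows how such a reshuffle can happen. The two models and their tallies below are invented for illustration and do not describe any real system; the penalized score uses +1 / -1 / 0 for correct / wrong / abstained, which is one possible selective scoring rule.

```python
from dataclasses import dataclass

@dataclass
class Tally:
    correct: int
    wrong: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.abstained

def plain_accuracy(t: Tally) -> float:
    """Abstentions and wrong answers both score zero."""
    return t.correct / t.total

def penalized_score(t: Tally, wrong_penalty: float = 1.0) -> float:
    """Wrong answers cost points; abstentions stay at zero."""
    return (t.correct - wrong_penalty * t.wrong) / t.total

# Made-up tallies over 100 questions each
MODELS = {
    "always_answers": Tally(correct=70, wrong=30, abstained=0),
    "abstains_when_unsure": Tally(correct=60, wrong=5, abstained=35),
}

for name, t in MODELS.items():
    print(name, plain_accuracy(t), penalized_score(t))
```

Under plain accuracy the always-answering model leads (0.70 vs 0.60); once wrong answers carry a cost, the order flips (0.40 vs 0.55).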
A minimal selective-eval framework (model-agnostic):
import numpy as np
from sklearn.metrics import auc

# samples: list of dicts built from (question, gold_answer, model_answer, model_confidence)
#   keys used below: "id", "correct" (bool), "conf" (float), "abstained" (bool)
# conf ∈ [0,1]: logprob avg / self-report / consistency rate

def coverage_accuracy_curve(samples):
    # sort by confidence descending
    sorted_s = sorted(samples, key=lambda s: -s["conf"])
    points = []
    correct_so_far = 0
    for i, s in enumerate(sorted_s, 1):
        correct_so_far += int(s["correct"])
        coverage = i / len(sorted_s)
        accuracy = correct_so_far / i
        points.append((coverage, accuracy))
    return points  # plot + AUC

def ece(samples, n_bins=10):
    # Expected Calibration Error
    bins = np.linspace(0, 1, n_bins + 1)
    ece_val, total = 0.0, len(samples)
    for i in range(n_bins):
        last = i == n_bins - 1  # last bin is closed on the right so conf == 1.0 is counted
        bucket = [s for s in samples
                  if bins[i] <= s["conf"]
                  and (s["conf"] < bins[i + 1] or (last and s["conf"] <= bins[i + 1]))]
        if not bucket:
            continue
        avg_conf = np.mean([s["conf"] for s in bucket])
        acc = np.mean([s["correct"] for s in bucket])
        ece_val += (len(bucket) / total) * abs(avg_conf - acc)
    return ece_val  # lower is better; <0.05 is very good

def abstention_f1(samples, gold_difficulty):
    # gold_difficulty: which questions SHOULD be abstained (hard set / OOD set)
    tp = sum(1 for s in samples
             if s["abstained"] and gold_difficulty[s["id"]] == "hard")
    fp = sum(1 for s in samples
             if s["abstained"] and gold_difficulty[s["id"]] == "easy")
    fn = sum(1 for s in samples
             if not s["abstained"] and gold_difficulty[s["id"]] == "hard")
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9)

# -- Reporting --
def selective_report(samples, gold_difficulty):
    curve = coverage_accuracy_curve(samples)  # needs at least 2 samples for the AUC
    half = (len(curve) + 1) // 2 - 1          # first point with coverage >= 50%
    return {
        "accuracy@100": curve[-1][1],
        "accuracy@50": curve[half][1],
        "AUC": auc(*zip(*curve)),
        "ECE": ece(samples),
        "abstention_F1": abstention_f1(samples, gold_difficulty),
    }
The four sections above aren't independent tricks — they map to four engineering layers of hallucination mitigation. Adoption path by ROI:
Once you've done these five steps, your application's user trust in hallucination-sensitive scenarios will jump a tier — not because the model got smarter, but because it now admits ignorance, annotates uncertainty, and triggers external verification. These three properties matter more to final product quality than "swap in a bigger model."