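Hallucination is not a symptom of badly written prompts; it is a structural by-product of the RLHF training objective. This issue starts from a token-level risk map and breaks "make the model fabricate less" into 4 layers that can each be engineered.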
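The awkward reality of 2025: Claude 4.7 / GPT-5 still show hallucination rates in the 5-15% range on SimpleQA and HaluEval, and the more specific the question, the more detailed the fabrication. More puzzling, the model does know that it doesn't know: the probing experiments in Kadavath et al. (2022) showed long ago that the last-layer hidden state carries a strong signal encoding "P(True)", yet at generation time the model almost never outputs "I don't know". Prompt engineering cannot close this gap, because it is the structural result of RLHF reward models systematically preferring confident answers and penalizing abstention. This issue assumes you already know what hallucination is and what RAG is (covered last issue) and does not repeat the definitions. It goes straight to the 4 engineering layers: ① understand the mechanism behind the RLHF bias → ② know which tokens are fabricated almost 100% of the time → ③ three layers of grounding defense → ④ hallucination-aware eval. Each layer maps to code or configuration you can actually change, not a generic "write better prompts".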
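The core evidence comes from OpenAI's 2025 paper "Why Language Models Hallucinate": in controlled experiments they show that the hallucination rate of the same base model goes up, not down, after RLHF. The cause breaks down into three steps: (1) pretraining uses a cross-entropy loss, so the model's optimal strategy is to match the true probability distribution, including uncertainty of the form "this token should be X with 30% probability"; (2) RLHF uses a pairwise reward, and annotators shown "I don't know" versus "specific but possibly wrong" prefer the latter (because the former is "useless"), so the reward model learns that preference; (3) during RL fine-tuning, the model gives up calibration to maximize reward and generates the most confident answer rather than the answer with the highest true probability.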
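Kadavath et al. 2022 (Anthropic) nail this mechanism down with linear probe experiments: on the last-layer hidden state of the base model, a P(True) probe can be trained from only a few hundred samples and reaches AUC > 0.85, which means the model internally has a very clear sense of what it does not know. After RLHF, however, the generation space is pulled by the reward toward a single "confident" pole. So the engineering response is not "make the model smarter" but "restore the calibration signal": through logprobs, through self-consistency, or through a prompt framework that explicitly tells the model abstention earns a positive score.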
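Give the model a prompt framework in which abstention is legitimate, and calibrate confidence with logprobs (the prompt framework works with both OpenAI and Anthropic; the logprob part needs an API that exposes logprobs):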
import anthropic, math

client = anthropic.Anthropic()

# --- Key 1: the system prompt states explicitly that "I don't know" scores positively ---
SYS = """You will be asked a factual question. Your response MUST start
with one of three tokens, then explain:
KNOWN: I am confident in the answer and can cite the source class.
UNSURE: I have partial information but cannot verify specifics.
UNKNOWN: I do not know; do not speculate.
Saying UNKNOWN when uncertain is rewarded, not penalized.
Hallucinating a specific answer is the worst outcome."""

def ask(q):
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=300,
        system=SYS,
        messages=[{"role": "user", "content": q}]
    )
    # The first content block holds the text reply
    return msg.content[0].text
# --- Key 2: extract token-level uncertainty from logprobs (OpenAI API) ---
# Claude does not currently expose logprobs, but OpenAI / Gemini / open-source models do.
import math
from openai import OpenAI

oa = OpenAI()

def answer_with_confidence(q):
    r = oa.chat.completions.create(
        model="gpt-5", messages=[{"role": "user", "content": q}],
        logprobs=True, top_logprobs=5
    )
    text = r.choices[0].message.content
    # Mean token logprob -> exp() -> geometric-mean token probability, used as an uncertainty proxy
    logprobs = [t.logprob for t in r.choices[0].logprobs.content]
    if not logprobs:
        # Empty completion: nothing to score, treat as zero confidence
        return text, 0.0
    avg_lp = sum(logprobs) / len(logprobs)
    conf = math.exp(avg_lp)  # in (0, 1]
    if conf < 0.5:
        text = f"[LOW-CONF {conf:.2f}] " + text
    return text, conf
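# --- Key 3: self-consistency as an auxiliary calibration signal (when there are no logprobs) ---
# Run the same question 5 times at temperature=0.7; large disagreement between answers -> not trustworthy.
# Note: ask() above uses the API default temperature; pass temperature=0.7 there to match this setup.
def consistency_check(q, n=5):
    answers = [ask(q) for _ in range(n)]
    # Next step (not implemented here): have another LLM judge whether the answers are "substantively consistent"
    return answers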
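The self-consistency step above stops at collecting the sampled answers. As a rough sketch of what could come next, and assuming the answers are short enough that normalized exact match is a fair stand-in for the LLM judge, an agreement rate can be computed with plain string handling. The names `normalize` and `agreement_rate` are illustrative, not part of any library.

```python
import re
from collections import Counter

def normalize(answer: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", answer.lower())
    return " ".join(text.split())

def agreement_rate(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the share of samples that gave it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    top, votes = counts.most_common(1)[0]
    return top, votes / len(answers)

# Hand-written sample answers, standing in for 5 sampled model outputs
samples = ["Paris.", "paris", "Paris", "Lyon", "PARIS!"]
top, rate = agreement_rate(samples)
print(top, rate)
```

The resulting rate can be used as the `conf` value wherever a logprob-based confidence is unavailable. For long free-form answers exact match is too strict, which is where an LLM judge becomes necessary.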
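Why is the fabrication rate for these 6 token classes so absurdly high? The root cause is that their statistical signature in the pretraining corpus is extremely weak. URL strings and commit hashes are "high-entropy, non-inferable" literal strings: there is no generalizable semantic regularity for the model to learn, so it can only match what it memorized verbatim. Where nothing was memorized, next-token sampling produces characters that "look plausible" (right URL shape, right function-naming style, right date format), but the underlying probability distribution is close to uniform, and the model does not know it is fabricating. OpenAI's 2025 experiments quantify this: when the prompt asks for an "arXiv paper ID", gpt-4-class models return a well-formed but nonexistent ID in more than 50% of outputs.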
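Even more counterintuitive is the specificity reverse principle: ask the model to "list 3 papers about Transformers" and the fabrication rate is ~30%; ask it to "list 3 ICLR 2023 papers on Transformer attention optimization, with title, authors, and arXiv ID" and the fabrication rate exceeds 80%. The more specific the question, the easier it is to trigger hallucination, because the prompt pushes the model into a region that is specific but sparsely covered in training, where all it can do is "fill in the details by style". This runs against human intuition (people assume that asking more specifically should give a more accurate answer), so in engineering terms it has to be treated as a first-class assumption, with active detection and active grounding.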
Use regex + LLM-as-judge to detect high-risk tokens, and rewrite the prompt or refuse to output when necessary:
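import re

# --- Regexes for the high-risk token classes (scanned against the LLM output) ---
# These are deliberately loose: e.g. the commit pattern also matches long all-digit
# strings, so hits are candidates for verification, not proof of fabrication.
RISK_PATTERNS = {
    "url": r'https?://[^\s)]+',
    "arxiv": r'arXiv:\d{4}\.\d{4,5}',
    "doi": r'10\.\d{4,9}/[-._;()/:A-Za-z0-9]+',
    "commit": r'\b[a-f0-9]{7,40}\b',
    "fn_call": r'\b[a-z_]+\.[a-z_]+\([^)]*\)',
    "exact_date": r'\b\d{4}-\d{2}-\d{2}\b',
    "page_num": r'\b(?:p\.?|page)\s*\d{1,4}\b',
    "isbn": r'\bISBN[-: ]?(?:\d{9}[\dX]|\d{13})\b'
}

def scan_high_risk(text):
    # Return {class name: [matched strings]} for every class with at least one hit
    hits = {}
    for kind, pat in RISK_PATTERNS.items():
        m = re.findall(pat, text)
        if m:
            hits[kind] = m
    return hits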
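# --- Usage 1: audit after generation ---
def answer_with_risk_audit(q):
    raw = ask(q)
    risks = scan_high_risk(raw)
    if risks:
        # Trigger grounding: verify every URL / DOI with a web search tool.
        # verify_with_tools is a placeholder that is not defined in this snippet.
        verified = verify_with_tools(raw, risks)
        return verified
    return raw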
# --- Usage 2: counter the specificity reverse principle in the prompt itself ---
SAFE_SPECIFICITY = """When asked for citations, URLs, function signatures,
or exact numeric details: DO NOT fabricate.
- If you cannot verify the exact reference, say "I recall a paper by [author class] on [topic]
  around [approximate year] but cannot verify the exact title/ID."
- Prefer naming conventions over fabricated specifics:
  "the original Transformer paper (Vaswani et al. ~2017)" is preferred over
  "Vaswani et al., 'Attention is All You Need', arXiv:1706.03762" if uncertain."""
# Note: details the model remembers clearly are kept, and details it does not remember
# are downgraded to vaguer wording. That is the desired behavior.
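The engineering rationale for each layer:
enum: ["yes","no","unknown"], and the model is physically unable to invent a fourth option. This is an order of magnitude stronger than a prompt constraint: it is not advice, it is a hard constraint. ROI of stacking the three layers: on the FActScore evaluation (Min et al.), the hallucination rate of a pure generator is ~25%; adding tool grounding brings it down to ~12%, adding structured output to ~7%, and adding a verification chain to ~5%. Each layer independently contributes a relative reduction of roughly 30-50%, and the failure modes do not overlap (what the tool layer misses the structured layer catches, and what the structured layer misses the verifier catches).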
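The per-layer contribution can be checked directly from the four rates quoted above. The sketch below only redoes the arithmetic on those figures (25% → 12% → 7% → 5%); it introduces no new measurements, and the helper name `relative_reductions` is illustrative.

```python
# Hallucination rates quoted in the text, in pipeline order
rates = [
    ("generator only", 0.25),
    ("+ tool grounding", 0.12),
    ("+ structured output", 0.07),
    ("+ verification chain", 0.05),
]

def relative_reductions(stages):
    """For each added layer, the fraction of the remaining hallucinations it removes."""
    out = []
    for (_, prev), (name, cur) in zip(stages, stages[1:]):
        out.append((name, 1 - cur / prev))
    return out

reductions = relative_reductions(rates)
overall = 1 - rates[-1][1] / rates[0][1]

for name, r in reductions:
    print(f"{name}: {r:.0%} relative reduction")
print(f"overall: {overall:.0%}")
```

Worked through, the three layers remove about 52%, 42%, and 29% of what the previous stage left, so "30-50% each" is an approximation, and the stacked effect is an 80% reduction relative to the bare generator.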
A minimal implementation of the three layers of defense (generator → tool verify → structured → verifier):
import anthropic, json, re

client = anthropic.Anthropic()

# --- Layer A: tool fact-check ---
TOOLS = [{
    "name": "verify_citation",
    "description": "Verify if a citation (arXiv ID, DOI, URL) actually exists.",
    "input_schema": {"type": "object", "properties": {
        "identifier": {"type": "string"},
        "kind": {"type": "string", "enum": ["arxiv", "doi", "url"]}
    }, "required": ["identifier", "kind"]}
}]

def verify_citation(identifier, kind):
    # Real implementation: call Semantic Scholar / the arXiv API / issue a HEAD request.
    # arxiv_api is a placeholder client that is not defined in this snippet.
    if kind == "arxiv":
        return arxiv_api.exists(identifier)
    # ...
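# --- Layer B: structured output with a strict schema ---
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "string",
                       "enum": ["known", "unsure", "unknown"]},
        "citations": {"type": "array",
                      "items": {"type": "string",
                                "description": "Must be in verified_set or omitted."}}
    },
    "required": ["answer", "confidence"]
}

# The schema has to reach the API to constrain anything. One way to do that with the
# Messages API is forced tool use: expose the schema as a tool and require the model
# to call it. The tool name "emit_answer" is illustrative.
ANSWER_TOOL = {
    "name": "emit_answer",
    "description": "Return the final answer in the required structure.",
    "input_schema": ANSWER_SCHEMA
}

def generate(q, verified_citations):
    msg = client.messages.create(
        model="claude-opus-4-7", max_tokens=800,
        tools=[ANSWER_TOOL],
        tool_choice={"type": "tool", "name": "emit_answer"},
        system=f"Cite only from this verified set: {verified_citations}. "
               "Set confidence=unknown rather than guessing. "
               "Return the answer by calling emit_answer.",
        messages=[{"role": "user", "content": q}]
    )
    # The forced tool call carries the structured answer as a plain dict
    return next(b.input for b in msg.content if b.type == "tool_use")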
# --- Layer C: verifier (different model / different prompt) ---
def verify(answer_json, question):
    # Use Haiku as a cheap verifier; the prompt switches the stance to "find the holes"
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        system="You are a strict fact-checker. List every claim "
               "in the answer that CANNOT be verified from common knowledge. "
               "Output JSON: {unverifiable: [...]}",
        messages=[{"role": "user", "content":
                   f"Question: {question}\nAnswer: {json.dumps(answer_json)}"}]
    )
    # Assumes the reply is bare JSON; json.loads raises if the model wraps it in prose
    return json.loads(msg.content[0].text)
# --- Pipeline ---
# quick_draft and annotate_unverified are placeholders that are not defined in this snippet.
def grounded_answer(q):
    draft = quick_draft(q)                       # Layer 0: draft
    risks = scan_high_risk(draft)                # from section 02
    verified = {k: [v for v in vs if verify_citation(v, k)]
                for k, vs in risks.items()}      # Layer A
    final = generate(q, verified)                # Layer B (structured)
    flags = verify(final, q)                     # Layer C
    return annotate_unverified(final, flags)
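required: ["citations"], otherwise the model is forced to invent citations; citations must always stay optional; (3) writing the verifier prompt as "judge whether this is correct", which triggers sycophancy and biases the verifier toward agreeing with the generator; state explicitly "find the claims that cannot be verified" (a reversed steelman); (4) no timeout on the tool fact-check, so one slow API drags down the whole pipeline; use async plus a p95 cutoff; (5) too much overhead: three layers in series mean 3-4x latency; send only high-risk queries (containing citation/numeric/date) through the full chain and let simple queries skip the verifier.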
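For the timeout pitfall in item (4), one possible shape is to run all checks concurrently and give each a hard deadline, treating a timeout as "unverified" instead of blocking the pipeline. This is a minimal sketch using only the standard library; `fake_checker` is a hypothetical stand-in for a real lookup, and the 0.1 s cutoff is an arbitrary value for the demo rather than a measured p95.

```python
import asyncio

async def check_with_timeout(checker, identifier, timeout_s=2.0):
    """Run one async checker; a timeout yields None (unverified) instead of stalling."""
    try:
        return await asyncio.wait_for(checker(identifier), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None

async def verify_all(checker, identifiers, timeout_s=2.0):
    """Check all identifiers concurrently, each under its own deadline."""
    results = await asyncio.gather(
        *(check_with_timeout(checker, i, timeout_s) for i in identifiers)
    )
    return dict(zip(identifiers, results))

async def fake_checker(identifier):
    # Hypothetical checker: "slow" simulates a hanging API, "bogus" a nonexistent citation
    await asyncio.sleep(0.5 if identifier == "slow" else 0.01)
    return identifier != "bogus"

results = asyncio.run(verify_all(fake_checker, ["ok", "bogus", "slow"], timeout_s=0.1))
print(results)
```

Keeping three outcomes (True, False, None) matters downstream: a citation that timed out should be flagged as unverified, not silently dropped or silently trusted.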
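The root of the problem is misaligned benchmark design. SimpleQA, TriviaQA, and HaluEval generally use accuracy = correct / total, which scores both "I don't know" and a wrong answer as 0. The model's optimal strategy becomes: never abstain, always sound confident. That rewards exactly the bias RLHF has already planted, so the eval and the training objective push hallucination in the same direction.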
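The incentive can be made concrete with a small expected-value calculation. The sketch below assumes a simple scoring rule (+1 for a correct answer, minus a fixed penalty for a wrong one, 0 for abstaining); the penalty values are illustrative, not taken from any specific benchmark.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong. Abstaining scores 0."""
    return p_correct - (1 - p_correct) * wrong_penalty

def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Answer only if it beats the 0 points earned by abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0

def break_even(wrong_penalty: float) -> float:
    """Confidence above which answering has positive expected score."""
    return wrong_penalty / (1 + wrong_penalty)

for penalty in (0.0, 1.0, 3.0):
    print(f"penalty={penalty}: answer when p > {break_even(penalty):.2f}")
```

With no penalty (plain accuracy) the break-even confidence is 0, so guessing is always at least as good as abstaining. A penalty of 1 moves the threshold to 0.5 and a penalty of 3 moves it to 0.75, which is the kind of scoring under which abstention becomes a rational choice.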
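In 2024-2025 the academic consensus shifted toward the selective prediction framework, which splits the model into two components: (1) an answer function that gives the concrete answer, and (2) a confidence function that self-assesses whether that answer is reliable. The eval trio is the coverage-accuracy curve, Expected Calibration Error (ECE), and abstention F1.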
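A key point in OpenAI's 2025 Why-Hallucinate paper: once you switch to selective scoring metrics, the ranking of all current mainstream models gets reshuffled, and some benchmark "leaders" turn out to be at the bottom on calibration. The implication: if you work in hallucination-sensitive settings (medical, legal, financial, research), you cannot trust benchmark rankings and have to run a selective eval yourself.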
A minimal usable selective eval framework (adaptable to any model):
import numpy as np
from sklearn.metrics import auc

# Data: a list of dicts with keys "id", "correct" (bool), "conf", "abstained" (bool),
# built from (question, gold_answer, model_answer, model_confidence).
# conf is in [0, 1] and can be the mean logprob / self-report / consistency rate.
def coverage_accuracy_curve(samples):
    # Sort by confidence, highest first
    sorted_s = sorted(samples, key=lambda s: -s["conf"])
    points = []
    correct_so_far = 0
    for i, s in enumerate(sorted_s, 1):
        correct_so_far += int(s["correct"])
        coverage = i / len(sorted_s)     # fraction of questions answered so far
        accuracy = correct_so_far / i    # accuracy on that answered fraction
        points.append((coverage, accuracy))
    return points  # plot it and compute the AUC
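def ece(samples, n_bins=10):
    # Expected Calibration Error
    bins = np.linspace(0, 1, n_bins + 1)
    ece_val, total = 0.0, len(samples)
    for i in range(n_bins):
        last = (i == n_bins - 1)
        # Half-open bins [lo, hi); the last bin also includes conf == 1.0,
        # otherwise fully confident samples would fall outside every bin
        bucket = [s for s in samples
                  if bins[i] <= s["conf"] < bins[i + 1]
                  or (last and s["conf"] == 1.0)]
        if not bucket:
            continue
        avg_conf = np.mean([s["conf"] for s in bucket])
        acc = np.mean([s["correct"] for s in bucket])
        ece_val += (len(bucket) / total) * abs(avg_conf - acc)
    return ece_val  # lower is better; < 0.05 counts as very good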
def abstention_f1(samples, gold_difficulty):
    # gold_difficulty: which questions should be abstained on (labeled via a hard set or an OOD set)
    tp = sum(1 for s in samples
             if s["abstained"] and gold_difficulty[s["id"]] == "hard")
    fp = sum(1 for s in samples
             if s["abstained"] and gold_difficulty[s["id"]] == "easy")
    fn = sum(1 for s in samples
             if not s["abstained"] and gold_difficulty[s["id"]] == "hard")
    # The 1e-9 terms avoid division by zero when a count is empty
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9)
# --- Report ---
def selective_report(samples, gold_difficulty):
    curve = coverage_accuracy_curve(samples)
    # curve[i] has coverage (i + 1) / n, so 50% coverage sits at index n // 2 - 1
    half = max(len(curve) // 2 - 1, 0)
    return {
        "accuracy@100": curve[-1][1],
        "accuracy@50": curve[half][1],
        "AUC": auc(*zip(*curve)),  # needs at least 2 samples
        "ECE": ece(samples),
        "abstention_F1": abstention_f1(samples, gold_difficulty)
    }
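The 4 points in this issue are not independent tricks; they correspond to the 4 engineering layers of hallucination governance. The rollout path, ordered by ROI: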
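Complete the 5 steps and user trust in your application will move up a tier in hallucination-sensitive settings, not because the model got smarter, but because it admits ignorance, labels uncertainty, and triggers external verification. Those three things affect final product quality far more than "switching to an even bigger model".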