DAY 07 / PHASE 1 · ENGINEERING

Memory & State Management

Four-Layer State Breakdown · Short-Term Compression · 3 Long-Term Architectures · Self-Maintained Profile

2026-05-25 · BigCat

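An LLM is a stateless pure function; every "memory" is state you force into your token budget. Doing that well versus doing it badly is an order-of-magnitude difference.

Prerequisite → ai-ml-daily Day 1: LLM Basics (KV cache mechanics)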

// WHY THIS MATTERS

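Ask 100 people building AI products how they manage memory, and 90 will say "I put it in a vector DB and retrieve it." That is the root cause of most AI agents starting to talk nonsense after three turns: all state is treated as one thing and handled with one tool (embedding + RAG). Experienced builders see it differently: the context window is a scarce resource, state comes in at least four kinds, and each kind has completely different write, read, and invalidation logic. This issue does not cover tuning chroma parameters (that is 101 material). It covers four things: how to split tangled state into four layers (conversation / scratchpad / profile / knowledge) and manage each separately; why the three short-term tiers (truncate first, then summarize, then hierarchical) must be chosen per scenario; why vector retrieval solves only a third of the long-term memory problem, with the other two thirds needing structured KV and an episodic event log; and how the self-maintained profile of the MemGPT / Letta line gets an agent to "understand you better the more you use it." By the end you should be able to draw your own agent's state topology on a whiteboard and know which path each edge should take.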

// 01

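Split State into Four Layers: Mix Them and You Get an Accident Scene

Claim: 90% of agent memory bugs come from one misconception: managing conversation history, intermediate results, user profile, and domain knowledge in the same context. Their lifetimes, trustworthiness, and retrieval methods are completely different.

Background and Principles

Chapter one of any OS textbook tells you memory is layered: register / cache / RAM / disk. LLM "memory" has the same kind of layering; most people just have not noticed. Anthropic's Building Effective Agents, as this series reads it, keeps returning to one idea, "context is a budget": you must actively decide where every token goes. The following four kinds of state must be separated explicitly: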

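Conversation State: the message history of the current session. Lifetime = this one session. High trust (the user just said it). Writes are append-only; reads are usually newest-first or the full history.
Scratchpad / Working Memory: intermediate results produced while the agent executes a task, such as tool call output, reasoning steps, and partial plans. Lifetime = a single task. **Discard after use**; the vast majority of scratchpad content should not survive across tasks.
User Profile: user facts that stay stable across sessions, such as name, preferences, long-term goals, and past decisions. Lifetime = as long as the user stays with the product. Trust needs verification (what the user said may be outdated). Writes need LLM extraction + deduplication + conflict resolution; reads usually inject the whole profile into the system prompt.
Knowledge Base: domain knowledge that is independent of any particular user, such as documents, codebases, and product specs. Lifetime = the version of the knowledge itself. Reads must use query-aware retrieval (this is what RAG is actually good at).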

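Handling all four kinds of state with one mechanism (stuffing them into a vector DB) hits three pitfalls at once: (1) retrieving conversation history by embedding loses ordering, so "the previous sentence" may not be retrieved at all; (2) the user profile drowns among 1M chunks, and cosine similarity fails to pick out the key facts; (3) the scratchpad is kept forever, so the next task is polluted by unrelated intermediate results. The correct approach is to give each layer its own storage and its own read/write policy.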

Agent state in four layers (longest-lived on top, shortest-lived at the bottom)
┌────────────────────────────────────────────────────────────────────┐
│ Knowledge Base · permanent     · query-aware retrieval (RAG)       │
├────────────────────────────────────────────────────────────────────┤
│ User Profile   · cross-session · LLM-extracted, injected in full   │
├────────────────────────────────────────────────────────────────────┤
│ Conversation   · one session   · sliding window + summary, ordered │
├────────────────────────────────────────────────────────────────────┤
│ Scratchpad     · one task      · discarded after use               │
└────────────────────────────────────────────────────────────────────┘
Each layer has its own read / write policy; mix them and things break.

Worked Example

Model the four layers explicitly as Python classes, so that prompt assembly is forced to go through the layered interface:

# memory_layers.py: enforce layering; no dumping everything into one dict
from dataclasses import dataclass, field
from typing import Protocol

class MemoryLayer(Protocol):       # interface sketch; each layer below implements only what it needs
    def read(self, query: str) -> object: ...
    def write(self, content: str, meta: dict) -> None: ...

@dataclass
class Conversation:                # short-term, append-only
    messages: list = field(default_factory=list)
    def read(self, _): return self.messages[-20:]   # sliding window by default

@dataclass
class Scratchpad:                  # single task, scoped by task_id
    by_task: dict = field(default_factory=dict)
    def clear(self, task_id): self.by_task.pop(task_id, None)

@dataclass
class Profile:                     # stable cross-session facts
    facts: dict = field(default_factory=dict)  # {"name":"BigCat", ...}
    def read(self, _): return self.facts            # full profile

@dataclass
class Knowledge:                   # documents / code, the RAG entry point
    index: object   # your vector store
    def read(self, query): return self.index.search(query, k=5)

# Prompt assembly is forced through the four layers
def build_context(query, task_id, conv, scratch, profile, kb):
    return {
        "system": f"User profile:\n{profile.read(None)}",
        "messages": conv.read(None),
        "working": scratch.by_task.get(task_id, []),  # only the current task's scratch
        "retrieved": kb.read(query),                  # only what is relevant to the query
    }

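The point is not the code itself but that it forces you to answer four questions: which profile fields go into the system prompt in full? When is the scratchpad cleared? How large is the conversation window? What is the top-k for knowledge retrieval? An agent whose designer cannot answer all four will have a memory incident in production.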

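Failure modes: (1) retrieving the user profile by embedding: a factual query like "what is my name" often is not picked up by cosine similarity, so the profile must be injected in full; (2) keeping the scratchpad across tasks: the stack trace from the last debugging session pollutes the next Q&A; (3) sending the conversation straight into a vector DB: ordering is lost and "the previous sentence" cannot be retrieved; (4) sharing one store across all four layers: rebuilding the index wipes the profile along with everything else.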
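Failure mode (2) is easiest to avoid when cleanup is structural rather than something the caller has to remember. The following is a minimal sketch, assuming a hypothetical `Scratchpad.task()` context manager (it is not part of the listing above), that drops a task's entries when the task block exits, even if the task raised:

```python
from contextlib import contextmanager

class Scratchpad:
    """Working memory keyed by task_id (hypothetical helper for illustration)."""

    def __init__(self):
        self.by_task = {}

    def write(self, task_id, item):
        self.by_task.setdefault(task_id, []).append(item)

    def read(self, task_id):
        return list(self.by_task.get(task_id, []))

    @contextmanager
    def task(self, task_id):
        """Scope a task: its entries are dropped when the block exits, even on error."""
        try:
            yield self
        finally:
            self.by_task.pop(task_id, None)

pad = Scratchpad()
with pad.task("t1"):
    pad.write("t1", "stack trace from tool call")
    during = pad.read("t1")   # visible while the task is running
after = pad.read("t1")        # gone once the task has ended
```

The design choice here is that the scratchpad has no code path that keeps data past the task boundary; anything worth keeping has to be written to another layer on purpose.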

Further resources · Anthropic, Building Effective Agents, anthropic.com/research/building-effective-agents · Lilian Weng, LLM Powered Autonomous Agents (Memory section), lilianweng.github.io/posts/2023-06-23-agent

// 02

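Three Tiers of Short-Term Memory: When to Use Truncate / Summarize / Hierarchical

Claim: when a session grows long, the first reaction should not be "add summarization" but "truncate if you can." Summarization is lossy compression; applied blindly, it grinds away key details.

Background and Principles

What do you do when the context overflows? There are three tiers, each with different complexity and loss: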

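Truncate (sliding window): keep the most recent N messages and drop everything before them. Zero cost, no distortion of what is kept, reproducible. Fits customer support, Q&A, and single-turn instructions, where the conversation is only relevant in the short term and old messages have no value.
Summarize (incremental summary): once a threshold is reached, compress old messages into a summary attached at the top of the system prompt. Medium loss, medium cost. Fits research, planning, and creative work, where looking back is needed. Claude Code and long ChatGPT sessions both use this approach (Anthropic's documentation calls it "compaction"; we covered it in detail in issue 2).
Hierarchical (layers + retrieval): old messages are chunked by passage or topic into a vector store and retrieved by query when needed. Low loss, but the architecture is complex. The MemGPT paper (Packer et al. 2023) pioneered this line: it uses a "main context", an "external context", and self-edit functions so that the LLM does its own page-in/page-out.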

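The selection criterion is simple: "Will I later need to look up specific details from early in the conversation?" Customer support usually will not (truncate); novel writing will, but not with precision (summarize); an investment-research retrospective needs exact quotes (hierarchical).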
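The criterion above reduces to two yes/no questions. A minimal sketch of that decision, assuming hypothetical flag names (`needs_old_details`, `needs_exact_quotes`) that you would set per product rather than compute:

```python
def choose_short_term_mode(needs_old_details: bool, needs_exact_quotes: bool) -> str:
    """Map the 'will I look up early details later?' question to a tier."""
    if not needs_old_details:
        return "truncate"        # old messages have no value: drop them
    if needs_exact_quotes:
        return "hierarchical"    # early details must be recalled verbatim
    return "summarize"           # the gist is enough

support_bot = choose_short_term_mode(needs_old_details=False, needs_exact_quotes=False)
novel_writer = choose_short_term_mode(needs_old_details=True, needs_exact_quotes=False)
research_review = choose_short_term_mode(needs_old_details=True, needs_exact_quotes=True)
```

The function defaults to the cheapest tier and only moves to a costlier one when a concrete need is stated.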

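Three short-term memory tiers (loss vs. complexity)
simple ←────────────────────────────────────────────────────→ complex
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ Truncate         │ → │ Summarize        │ → │ Hierarchical     │
│ sliding window   │   │ incremental      │   │ chunk + retrieve │
│ zero cost        │   │ +1 LLM call      │   │ + vector store   │
│ old msgs dropped │   │ keeps a summary  │   │ exact recall     │
└──────────────────┘   └──────────────────┘   └──────────────────┘
  support / Q&A          planning / writing     research / code review
Default to the leftmost tier; move right only when old details are truly needed.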

Worked Example

All three tiers can live in one manager behind the same interface, selected by a `mode` setting:

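# conversation_manager.py: three tiers behind one interface, selected by `mode`
class ConversationManager:
    def __init__(self, mode="truncate", window=20, summary_threshold=40,
                 compact=None, vector=None):
        self.messages, self.summary = [], None
        self.mode, self.window, self.threshold = mode, window, summary_threshold
        self.compact = compact  # callable(old_summary, old_msgs) -> str; LLM-backed incremental summary
        self.vector = vector    # store with .search(query, k); assumed to already index older messages

    def add(self, msg): self.messages.append(msg)

    def build(self):
        if self.mode == "truncate":
            return self.messages[-self.window:]

        if self.mode == "summarize":
            if len(self.messages) > self.threshold:
                old = self.messages[:-self.window]
                self.summary = self.compact(self.summary, old)  # old summary + old messages -> new summary
                self.messages = self.messages[-self.window:]
            head = [{"role": "system", "content": f"Prior summary:\n{self.summary}"}] \
                   if self.summary else []
            return head + self.messages

        if self.mode == "hierarchical":
            recent = self.messages[-self.window:]
            # Use the latest user message as the retrieval query
            query = next((m["content"] for m in reversed(recent) if m["role"] == "user"), "")
            old_chunks = self.vector.search(query, k=5)
            recall = {"role": "system",
                      "content": "Recalled context:\n" + "\n".join(map(str, old_chunks))}
            return [recall] + recent

        raise ValueError(f"unknown mode: {self.mode}")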

# compaction prompt (do not lose the key details)
COMPACT = """Summarize the prior conversation. PRESERVE:
- All factual decisions made
- All open questions / TODOs
- Any explicit user preferences mentioned
DISCARD: greetings, clarifications already resolved, model's apologies.
Existing summary: {old_summary}
New messages: {new_msgs}
Output (≤300 tokens):"""

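The PRESERVE / DISCARD lists in the compaction prompt are what determine summary quality. The version given in Anthropic's documentation puts particular emphasis on "preserve decisions and open TODOs", which are exactly the two kinds of information most easily lost in research and coding scenarios.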
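To make the incremental part concrete, here is a minimal sketch of a `compact` step, assuming a hypothetical `llm` callable that takes a prompt string and returns text, and an abbreviated template (the PRESERVE / DISCARD lists are omitted for brevity). `fake_llm` is a stand-in so the flow can be followed without a model:

```python
COMPACT_TEMPLATE = (
    "Summarize the prior conversation.\n"
    "Existing summary: {old_summary}\n"
    "New messages: {new_msgs}\n"
    "Output (<=300 tokens):"
)

def compact(old_summary, old_msgs, llm):
    """Fold only the newly aged-out messages into the existing summary."""
    new_msgs = "\n".join(f"{m['role']}: {m['content']}" for m in old_msgs)
    prompt = COMPACT_TEMPLATE.format(
        old_summary=old_summary or "(none)", new_msgs=new_msgs)
    return llm(prompt)

calls = []

def fake_llm(prompt):
    # Stand-in for a real model call: records the prompt, returns a placeholder.
    calls.append(prompt)
    return f"summary v{len(calls)}"

s1 = compact(None, [{"role": "user", "content": "Use Postgres."}], fake_llm)
s2 = compact(s1, [{"role": "user", "content": "Ship on Friday."}], fake_llm)
```

The second call sees the previous summary plus the new message only, never the full history, which is what keeps the cost per compaction roughly constant.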

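Failure modes: (1) summarizing from the start: for conversations under 10 turns, summarizing wastes tokens and loses information, so truncate first; (2) re-summarizing the entire history every time: cost explodes, and the correct approach is incremental summarization (old summary + new messages → new summary); (3) hierarchical retrieval that loses ordering: if retrieved chunks are not sorted by timestamp, the LLM treats yesterday's decision as today's, so chunk metadata must carry a ts; (4) a summary prompt without a PRESERVE list: the summarizer decides on its own what to drop and often drops things like user preferences, which are short and look unimportant but are actually critical.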
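For failure mode (3), one simple remedy is to sort retrieved chunks chronologically and print the date next to each one before they reach the prompt. A minimal sketch, assuming each chunk is a dict with hypothetical `ts` (epoch seconds) and `text` keys:

```python
from datetime import datetime, timezone

def render_recall_block(chunks):
    """Order retrieved chunks oldest-first and prefix each with its UTC date."""
    ordered = sorted(chunks, key=lambda c: c["ts"])
    lines = []
    for c in ordered:
        stamp = datetime.fromtimestamp(c["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
        lines.append(f"[{stamp}] {c['text']}")
    return "\n".join(lines)

# Retrieval returns hits by similarity, so the newer decision may come first.
hits = [
    {"ts": 1_700_086_400, "text": "Decided to use Postgres."},
    {"ts": 1_700_000_000, "text": "Considering SQLite vs Postgres."},
]
block = render_recall_block(hits)
```

With explicit dates in the recalled text, the model can tell an earlier deliberation from the later decision instead of guessing from position.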

Further resources · Packer et al., MemGPT: Towards LLMs as Operating Systems, arxiv.org/abs/2310.08560 · Anthropic, Long context prompting, docs.anthropic.com/.../long-context-tips

// 03

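Three Long-Term Memory Architectures: Vector / Structured KV / Episodic Event Log

Claim: vector retrieval can only answer "find a passage similar to X." It cannot answer "how did I decide last time" or "what are my preferences." Long-term memory needs three separate stores, split by query type.

Background and Principles

Treating a vector DB as universal long-term memory is a 2023 myth. The experienced approach is to split storage by query pattern into three kinds, each with its own job: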

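Vector store (semantic retrieval): good at open-ended "find similar" and "find related" queries. Fits documents, fuzzy recall of chat history, and case-based reasoning. Fails on exact facts, list enumeration, and temporal queries.
Structured KV / SQL (exact facts): good at "what is the value of X" and "list all Y." Fits user preferences, entity attributes, and structured profiles. Fails on fuzzy queries. MemGPT's core memory is essentially structured KV.
Episodic event log (temporal events): good at "what happened last time", "what did I do in the past two weeks", and "why did the last attempt fail." An append-only time series indexed by (user_id, ts). Fits reflection, retrospectives, and tracing decision paths. The reflective memory in the Reflexion paper (Shinn et al. 2023) has this structure.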

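These three are complementary, not mutually exclusive. A mature agent has all of them at once: a vector store for documents, KV for the profile, and an event log for behavioral history. The official documentation of Letta (the product version of MemGPT) calls this the "memory hierarchy" and explicitly distinguishes core_memory (KV), archival_memory (vector), and recall_memory (event log). Microsoft GraphRAG takes a fourth path (knowledge graphs), which suits scenarios dense in entity relationships but has a high barrier to entry; ordinary teams should build the first three well before considering it.

Worked Example

When using three stores to build a personal agent that can "remember you", the key is write routing: when a piece of information comes in, decide which store it goes to: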

# memory_router.py: route each incoming piece of information to the right store
# Assumed to be provided by your application (not defined here):
#   llm(prompt, text) -> parsed JSON dict, classify(query) -> str,
#   profile_kv, event_log, vector_store, embed, now
ROUTE_PROMPT = """Classify this user utterance into ONE memory type:
- FACT      : a stable fact about the user (name, role, preference)
- EVENT     : an action/decision/experience that happened
- DOCUMENT  : reference content (article, code, doc)
- NONE      : transient (greeting, ack, clarification)
Return JSON: {"type": "...", "extract": "..."}"""

def ingest(utterance, user_id):
    r = llm(ROUTE_PROMPT, utterance)
    if r["type"] == "FACT":
        profile_kv.upsert(user_id, r["extract"])      # structured KV
    elif r["type"] == "EVENT":
        event_log.append(user_id, ts=now(), text=r["extract"])
    elif r["type"] == "DOCUMENT":
        vector_store.add(embed(r["extract"]), meta={"user": user_id})
    # NONE → drop

# Reads are routed by query type as well
def recall(query, user_id):
    intent = classify(query)  # fact / temporal / semantic
    if intent == "fact":
        return profile_kv.get_all(user_id)            # full profile
    if intent == "temporal":
        return event_log.range(user_id, last="7d")    # by time
    return vector_store.search(query, k=5, filter={"user": user_id})

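The essence of this routing is that the question determines the storage. "What is my name?" goes to KV; "What did we talk about last week?" goes to the event log; "Find something I mentioned earlier about meditation" goes to vector. Solving every problem with one kind of storage is a beginner's illusion.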

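Failure modes: (1) using only a vector store: a query like "list all my preferences" cannot be answered by top-k cosine similarity; (2) no retention policy for the event log: after half a year, all 100k events are retrieved and the prompt blows up, so a decay / summarize strategy is needed; (3) no conflict resolution in KV: the user said "I'm vegetarian" six months ago and "I eat meat now" recently, both entries remain, and the LLM is confused; overwrite on write and keep a timestamp; (4) the three stores do not share a user_id schema: results cannot be joined at retrieval time, which is as good as not storing them.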
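Failure mode (3) can be handled at write time. The sketch below is an in-memory stand-in, not a real store: `ProfileKV` and its method names are hypothetical, and timestamps are plain numbers supplied by the caller. The newest write wins, and superseded or late-arriving values are kept in a history list for auditing:

```python
class ProfileKV:
    """In-memory stand-in for a structured KV store (hypothetical, for illustration)."""

    def __init__(self):
        self.current = {}   # (user_id, key) -> (value, ts)
        self.history = []   # superseded or stale writes: (user_id, key, value, ts)

    def upsert(self, user_id, key, value, ts):
        slot = (user_id, key)
        existing = self.current.get(slot)
        if existing is not None and ts < existing[1]:
            # Late-arriving older fact: keep it for audit, do not overwrite.
            self.history.append((user_id, key, value, ts))
            return
        if existing is not None:
            # The old value is superseded: move it to history.
            self.history.append((user_id, key, *existing))
        self.current[slot] = (value, ts)

    def get_all(self, user_id):
        return {k: v for (uid, k), (v, _) in self.current.items() if uid == user_id}

kv = ProfileKV()
kv.upsert("u1", "diet", "vegetarian", ts=100)
kv.upsert("u1", "diet", "eats meat", ts=200)   # newer fact overwrites
kv.upsert("u1", "diet", "vegan", ts=50)        # stale fact is only logged
profile = kv.get_all("u1")
```

Only one value per key ever reaches the prompt, while the history list still answers "what did this field used to say, and when."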

Further resources · Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning, arxiv.org/abs/2303.11366 · Letta (MemGPT) docs, Memory hierarchy, docs.letta.com/concepts/memory · Edge et al., From Local to Global: GraphRAG, arxiv.org/abs/2404.16130

// 04

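Self-Maintained User Profile: Let the LLM Update Its Own Notes "About You"

Claim: treat the user profile as a document the LLM maintains itself, not as a database table. Give it tools to read and write, and let it decide what to remember, what to change, and what to forget. This is the core mechanism behind "understands you better the more you use it."

Background and Principles

The structured KV in section 3 solves "what to store and how to read it", but one engineering question remains: who decides what gets written. There are three options: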

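Explicit user declaration: have the user fill in a form or write a system prompt. Low coverage, and users find it tedious.
Rule-based extraction: write regexes or rules that recognize "my name is X" or "I like Y." Brittle, with incomplete coverage.
LLM self-maintained: give the LLM an update_profile tool and let it decide after each turn whether to update the profile. MemGPT / Letta / ChatGPT memory / Claude memory all take this route.

The hard part of self-maintenance is not getting the model to write but getting it to write correctly. The profile is injected in full into the system prompt of the next session, so any garbage in it pollutes every session from then on. Four problems have to be solved: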

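Extraction granularity: the user mentioned in passing that they go running; should that be written? Too fine and the profile explodes; too coarse and key facts are missed. Rule of thumb: write only stable facts that still hold across sessions.
Conflict resolution: when a new fact contradicts an old one ("I'm vegetarian" → "I eat meat now"), overwrite or append? Default to overwrite and keep a changelog.
Forgetting: the profile cannot grow without bound. Use LRU, explicit expiry, or user-initiated deletion. Letta uses "archival" to move inactive facts into a vector store.
Hallucination filtering: the LLM may infer things the user never said (for example, "the user mentioned a child's homework → infer they have a child in primary school"). Every self-maintained write must cite a specific message ID as evidence; no evidence, no write.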

Worked Example

Give the agent a profile tool, and after each turn run a "does the profile need updating?" pass:

# profile_tools.py: tools that let the LLM maintain the profile itself
# Assumed to be provided by your application: llm(prompt) -> parsed JSON dict,
# profile_kv, changelog, now
import json

# Literal JSON braces are doubled ({{ }}) so str.format() leaves them intact.
PROFILE_UPDATE_PROMPT = """Review the latest user message and decide if the user
profile should be updated. Only write facts that:
1. Are explicitly stated by the user (cite message)
2. Are likely to be true beyond this session
3. Are not already in the profile (or contradict it)

Current profile: {profile_json}
Latest message (id={msg_id}): {message}

Output JSON:
{{
  "action": "add" | "update" | "delete" | "none",
  "key": "...",
  "value": "...",
  "evidence_msg_id": "...",   // required; with no evidence, answer "none"
  "reason": "..."
}}"""

def maybe_update_profile(user_id, latest_msg, msg_id):
    current = profile_kv.get_all(user_id)
    decision = llm(PROFILE_UPDATE_PROMPT.format(
        profile_json=json.dumps(current), msg_id=msg_id, message=latest_msg))
    if decision["action"] == "none": return
    if decision.get("evidence_msg_id") != msg_id: return  # missing or wrong evidence: reject
    profile_kv.apply(user_id, decision)
    changelog.append(user_id, decision, ts=now())   # traceable

# Inject the profile at the start of every session
def build_system_prompt(user_id):
    p = profile_kv.get_all(user_id)
    return f"""You are a personal assistant for the following user.
Stable facts about them (use to personalize, but verify before acting):
{json.dumps(p, indent=2, ensure_ascii=False)}

Important: if a fact seems outdated, ASK the user to confirm rather than
silently override the profile."""

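Three details determine quality: a required evidence_msg_id (blocks hallucination), a changelog (traceability, so when the user asks "how do you know I'm vegetarian" there is an answer), and a system prompt that tells the model to verify proactively (avoids confident mistakes based on an outdated profile). Only with all three in place does the profile go from a toy to an engineering artifact.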

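Failure modes: (1) the LLM writes freely with no audit: half a year later, 30% of the profile is model hallucination; (2) no forgetting mechanism: the profile grows to 5000 tokens, and injecting it in full eats half the context budget; (3) acting on the profile as ground truth: the user's preferences have changed but the agent still recommends based on the old profile, and the experience collapses, so confirm before any high-stakes decision; (4) sharing the profile globally across multiple agents: privacy and scope get out of control, so the profile should be isolated per agent or per scenario.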
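For failure mode (2), a budget check before injection is enough to get started. A minimal sketch, assuming a hypothetical fact shape (`value`, `last_used`) and a crude token estimate of one token per four characters, which is only an approximation and should be replaced by your model's tokenizer:

```python
def enforce_budget(facts, max_tokens, estimate=lambda s: max(1, len(s) // 4)):
    """Evict least-recently-used facts until the profile fits the token budget.

    facts: dict of key -> {"value": str, "last_used": number}
    Returns (kept, archived); archived facts can go to a vector store.
    """
    kept = dict(facts)
    archived = {}

    def total():
        return sum(estimate(f"{k}: {v['value']}") for k, v in kept.items())

    while kept and total() > max_tokens:
        oldest = min(kept, key=lambda k: kept[k]["last_used"])
        archived[oldest] = kept.pop(oldest)
    return kept, archived

facts = {
    "name": {"value": "BigCat", "last_used": 3},
    "diet": {"value": "vegetarian", "last_used": 2},
    "old_hobby": {"value": "x" * 80, "last_used": 1},   # long and stale
}
kept, archived = enforce_budget(facts, max_tokens=10)
```

Evicted facts are returned rather than deleted, so they can be moved to an archival store and retrieved on demand instead of being injected every session.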

Further resources · Letta official tutorial, Building stateful agents, docs.letta.com · Simon Willison, How ChatGPT memory works, simonwillison.net/2024/Apr/16 · Park et al., Generative Agents: Interactive Simulacra of Human Behavior (reflection / memory stream design), arxiv.org/abs/2304.03442

// Capstone Exercise · Draw the state topology of an agent you have at hand (30 minutes)

Pick an agent you are building or use often (personal research bot / coding agent / customer support / writing assistant) and map out its state topology in the following 6 steps:

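List the four layers (§1, 5 min): write down what Conversation / Scratchpad / Profile / Knowledge each hold and how long each lives. Which layer is currently empty? An empty layer is a potential bug.
Choose a short-term strategy (§2, 5 min): what is the distribution of session lengths? If P95 exceeds 30 turns, move from truncate to summarize; move on to hierarchical only if exact details must be traced back.
Assess long-term needs (§3, 10 min): list the 10 query types users ask most often and mark each as vector / KV / event log. If 8 of them land on vector, ask whether you really need the other two stores; building them without the demand is over-engineering.
Design write routing (§3-4, 5 min): when a user utterance comes in, who decides where it is written? Rules / LLM router / explicit user action? Does every write carry an evidence field?
Set a forgetting strategy (§4, 3 min): what is the profile's token cap? What happens when it is exceeded (LRU / archival / user review)? How quickly does the event log decay?
Draw one diagram (2 min): on a whiteboard or in draw.io, draw the four layers plus routing arrows. If you cannot draw it, the design has not taken shape yet.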

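After 30 minutes you should have: one state topology diagram, a read/write policy for each of the four layers, the decision logic for write routing, and a forgetting mechanism. This is the key document for taking an agent from a demo to a product that "understands you better the more you use it." The next time a colleague asks "how does our agent remember things", you hand over this diagram instead of mumbling "it's in chroma."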

// ENGLISH GLOSSARY

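State: umbrella term for an agent's "memory"; this issue splits it into four layers: conversation / scratchpad / profile / knowledge.
Conversation State: the message history of the current session; append-only, with a single-session lifetime.
Scratchpad / Working Memory: intermediate results of an agent's task execution (tool output, reasoning); discarded after use.
User Profile: user facts that are stable across sessions, usually stored as structured KV and injected in full into the system prompt.
Knowledge Base: domain knowledge independent of the user; the entry point for query-aware retrieval.
Truncate / Sliding Window: keep the most recent N messages and drop the rest; the simplest short-term memory strategy.
Summarize / Compaction: compress old messages into a summary attached at the top of the system prompt; lossy, but keeps the key points.
Hierarchical Memory: old messages are chunked into a vector store and retrieved when needed; the core architecture of MemGPT.
MemGPT / Letta: the "LLM as OS" layered memory framework proposed by Packer et al. in 2023; Letta is its productization.
Episodic Memory / Event Log: an append-only log recording user actions and decisions in time order; suits retrospectives and temporal queries.
Self-Maintained Profile: a mechanism in which the LLM calls a tool on its own to update the user profile; ChatGPT memory and Claude memory take this route.
Evidence-Bound Write: every profile write must carry a source message ID as evidence; used to block hallucination.
Conflict Resolution: the strategy for handling a new fact that contradicts an old one (overwrite + changelog is the default).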

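// Going Deeper

Context windows are already 1M+ tokens. Is a long-term memory architecture still necessary? Why not just put the entire history in?

No, for three reasons: (1) cost: 1M tokens is $3+ per call, and without a prefix cache hit latency can reach 30s+; (2) lost-in-the-middle (Liu et al. 2023): recall of information in the middle of a long context degrades sharply (this series puts it below 50%), so putting something in does not mean the model can use it; (3) signal-to-noise: 90% of the history is noise, and injecting all of it dilutes the instructions and scatters the model's attention. What 1M context changes is that the ceiling is higher, not that you no longer have to choose what goes into the context. Context engineering becomes more important, not less.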
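The cost argument in point (1) is simple arithmetic. A minimal sketch, assuming a hypothetical input price of $3 per million tokens (chosen only to match the "$3+ per call" figure above; check your provider's current pricing) and made-up traffic numbers:

```python
def input_cost_usd(tokens, usd_per_million):
    """Input-token cost of a single call."""
    return tokens / 1_000_000 * usd_per_million

PRICE = 3.0              # assumed USD per 1M input tokens
CALLS_PER_DAY = 200      # assumed traffic for one user-facing agent

full_history = input_cost_usd(1_000_000, PRICE)   # stuff everything in
curated = input_cost_usd(8_000, PRICE)            # profile + window + top-k chunks

daily_full = full_history * CALLS_PER_DAY
daily_curated = curated * CALLS_PER_DAY
ratio = full_history / curated
```

Under these assumptions the curated context is two orders of magnitude cheaper per call, before counting the latency and recall problems in points (1) and (2).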

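MemGPT's "self-edit memory" lets the LLM do its own page-in/page-out. It looks elegant, so why has industry not adopted it at scale?

Three engineering realities: (1) poor predictability: the LLM decides when to page which segment in, which makes issues hard to reproduce when debugging; (2) extra LLM call cost: every decision step adds one more inference, doubling both latency and tokens; (3) unfriendly to small models: self-editing requires strong meta-cognition, is only stable at the GPT-4 / Claude Opus level, and small models often page out key facts. Most products (ChatGPT memory / Claude memory) take a plainer route: a fixed trigger (run one extraction per turn) plus simple KV, putting controllability first.

Is a user profile "the bigger the better", or is there an optimal size? How do you measure it?

There is an optimal range, usually 200-1500 tokens. The reasoning: (1) too short and personalization is weak, so the agent behaves generically; (2) too long and it dilutes the instructions, draws attention away, and produces "follows profile item 7 but ignores the user's current instruction." To measure it, build a held-out task eval with profile length on the x axis and task pass rate on the y axis; you should see an inverted-U curve. Letta reportedly caps core memory at around 2000 tokens internally, with archival memory backed by vectors as a fallback, based on this kind of experience.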

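What do you do once the episodic event log reaches 100k entries? Retain everything, or apply a forgetting curve?

You must forget, but with tiered decay rather than a blunt TTL. Three tiers: (1) the last 7 days are retained raw; (2) 7-90 days are rolled up into weekly summaries, with the raw events archived; (3) beyond 90 days, events become monthly summaries stored in the vector store. This maps neatly onto human working / short-term / long-term memory. Critical "life events" (major decisions, items the user explicitly marked important) go on a separate timeline and are kept permanently. The reflection mechanism in the Generative Agents paper (Park et al. 2023) is the academic prototype of this idea: it periodically distills episodic events into higher-level reflections.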
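The three tiers plus the permanent timeline amount to a small classification function. A minimal sketch, assuming hypothetical tier names and the 7-day and 90-day boundaries given above:

```python
def retention_tier(age_days, important=False):
    """Decide how an event is kept, based on its age and an 'important' flag."""
    if important:
        return "permanent"         # life events live on their own timeline
    if age_days <= 7:
        return "raw"               # keep the original event
    if age_days <= 90:
        return "weekly_summary"    # roll up by week, archive the raw event
    return "monthly_summary"       # roll up by month into the vector store

tiers = [retention_tier(d) for d in (0, 7, 8, 90, 91, 400)]
pinned = retention_tier(400, important=True)
```

A periodic job can run this over the log and trigger summarization whenever an event crosses into the next tier.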

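For the same user across different agents (work assistant / writing assistant / investment-research assistant), should the profile be shared or isolated?

Isolate by default, share on demand. The temptation of sharing is "write the profile once and every agent understands me"; the reality is scope confusion (a work agent should not make decisions using your health data), privacy leakage, and cross-contamination (an investment-research agent's wrong inference gets written into the main profile as an "interest preference"). The correct architecture: (1) one minimal global identity (name, language preference, time zone); (2) a scoped local profile per agent; (3) sharing across agents only with the user's explicit approval. This boundary design matters more than technology choice; for personal AI infra it is a political question, not only an engineering one.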
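The three-part architecture above can be sketched as a small store. Everything here is hypothetical (the `ScopedProfiles` class, its method names, and the choice of global keys); it only illustrates the visibility rule that an agent sees the global identity, its own local facts, and whatever the user has explicitly granted:

```python
class ScopedProfiles:
    """Global identity + per-agent local profiles + explicit user grants."""

    GLOBAL_KEYS = {"name", "language", "timezone"}

    def __init__(self):
        self.global_identity = {}
        self.local = {}        # agent -> {key: value}
        self.grants = set()    # (owner_agent, key, reader_agent), user-approved

    def write(self, agent, key, value):
        if key in self.GLOBAL_KEYS:
            self.global_identity[key] = value
        else:
            self.local.setdefault(agent, {})[key] = value

    def grant(self, owner, key, reader):
        """Record that the user approved sharing one key with one other agent."""
        self.grants.add((owner, key, reader))

    def view(self, agent):
        visible = dict(self.global_identity)
        for owner, key, reader in self.grants:
            if reader == agent and key in self.local.get(owner, {}):
                visible[key] = self.local[owner][key]
        visible.update(self.local.get(agent, {}))   # own facts take precedence
        return visible

profiles = ScopedProfiles()
profiles.write("work", "name", "BigCat")
profiles.write("health", "condition", "asthma")
profiles.write("writing", "tone", "dry")
before = profiles.view("work")
profiles.grant("writing", "tone", "work")
after_grant = profiles.view("work")
```

Grants are per key and per reader, so approving one shared preference never exposes the rest of another agent's profile.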

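// Further Reading

Packer et al. · MemGPT: Towards LLMs as Operating Systems: the founding paper on layered memory and the theoretical basis of this issue
Letta · Memory hierarchy: the MemGPT team's product documentation, with the core / archival / recall three-layer architecture
Shinn et al. · Reflexion: verbal RL + reflective memory, the prototype of the episodic idea
Park et al. · Generative Agents: the classic implementation of memory stream + reflection
Liu et al. · Lost in the Middle: empirical evidence that information in the middle of a long context gets lost
Lilian Weng · LLM Powered Autonomous Agents: a systematic survey of Memory / Planning / Tool use
Anthropic · Building Effective Agents: the official statement of context as a budget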