An LLM is a stateless pure function; every "memory" is state you forced into a token budget. The gap between people who do this well and badly is an order of magnitude.
Ask 100 AI builders "how do you manage memory" and 90 will say "I stuff everything in a vector DB and retrieve." That's exactly why most agents start hallucinating after three turns — treating every kind of state as one kind of thing, handled by one tool (embedding + RAG). The senior mental model: context window is a scarce resource, state has at least four distinct kinds, and each has completely different write / read / expiry semantics. This week is not about tuning Chroma parameters (that's 101); it's about four things: how to split tangled state into Conversation / Scratchpad / Profile / Knowledge and manage each separately; why short-term memory should pick from Truncate > Summarize > Hierarchical by scenario; why vector retrieval only solves about a third of long-term memory and you need Structured KV + Episodic Event Log for the other two-thirds; and how the MemGPT / Letta lineage achieves "gets to know you better over time" via self-maintained profile. By the end you should be able to sketch your agent's state topology on a whiteboard and know which line goes where.
// 01
Split State Into Four Layers — Mixing Them Is the Accident
Claim: 90% of agent memory bugs come from one mistake — putting conversation history, intermediate results, user profile, and domain knowledge into one shared context. Their lifecycles, trustworthiness, and access patterns are nothing alike.
Background & Principles
Chapter 1 of any OS textbook tells you memory is tiered: registers / cache / RAM / disk. An LLM's "memory" is also tiered — most people just don't realize it. Anthropic's Building Effective Agents repeats one line: "context is a budget" — you must actively decide what each token is spent on. The four classes below should be tracked separately:
Conversation State: the message history of the current session. Lifecycle = one session. High trust (the user just said it). Write is append-only, read is usually reverse-chronological or full.
Scratchpad / Working Memory: intermediate output of an in-progress task — tool calls, reasoning steps, partial plans. Lifecycle = one task. Burn after reading; almost no scratchpad should survive across tasks.
User Profile: stable facts about the user across sessions — name, preferences, long-term goals, prior decisions. Lifecycle = the user's lifetime. Trust needs validation (what the user said may be stale). Write requires LLM extraction + dedup + conflict resolution; read usually means full-inject into system prompt.
Knowledge Base: domain knowledge unrelated to a specific user — docs, codebase, product spec. Lifecycle = the knowledge's own version. Read must be query-aware retrieval (that's where RAG truly shines).
Cramming all four into one mechanism (stuff a vector DB) hits three failures simultaneously: (1) embedding-based retrieval over conversation loses temporal ordering — "the previous message" may not even rank; (2) user profile drowns in 1M chunks, cosine similarity can't pick the key fact; (3) scratchpad persists forever and pollutes the next task. The right answer is independent storage + independent read/write policy per layer.
Four-layer agent state (lifecycle short → long)
┌─────────────────────────────────────────────────────────┐
│ Knowledge Base · forever · query-aware retrieval RAG │
├─────────────────────────────────────────────────────────┤
│ User Profile · cross-session · LLM-extract · full │
├─────────────────────────────────────────────────────────┤
│ Conversation · session · window+summary · in-order │
├─────────────────────────────────────────────────────────┤
│ Scratchpad · task · burn after use · no carryover│
└─────────────────────────────────────────────────────────┘
per-layer read/write policy; mixing breaks things
Hands-on Example
Model the four layers explicitly as Python classes; force prompt assembly through the layered interface:
# memory_layers.py: force the split; ban "one dict for everything"
from dataclasses import dataclass, field
from typing import Protocol

class MemoryLayer(Protocol):
    def read(self, query: str) -> str: ...
    def write(self, content: str, meta: dict) -> None: ...

@dataclass
class Conversation:                          # short-term, append-only
    messages: list = field(default_factory=list)
    def read(self, _): return self.messages[-20:]    # sliding window default

@dataclass
class Scratchpad:                            # per-task scope
    by_task: dict = field(default_factory=dict)
    def clear(self, task_id): self.by_task.pop(task_id, None)

@dataclass
class Profile:                               # cross-session stable facts
    facts: dict = field(default_factory=dict)        # {"name": "BigCat", ...}
    def read(self, _): return self.facts             # full

@dataclass
class Knowledge:                             # docs/code, RAG entry
    index: object                                    # your vector store
    def read(self, query): return self.index.search(query, k=5)

# assemble through the four layers; task_id is passed in explicitly
def build_context(query, task_id, conv, scratch, profile, kb):
    return {
        "system": f"User profile:\n{profile.read(None)}",
        "messages": conv.read(None),
        "working": scratch.by_task.get(task_id, []),
        "retrieved": kb.read(query),                 # only what matches query
    }
The point isn't the code — it's that the code forces you to answer four questions: which profile fields go fully into system? when does scratch clear? how large is the conv window? what's k for kb retrieval? An agent whose author can't answer all four will ship a memory incident.
Failure modes: (1) Retrieving user profile via embeddings — factual queries like "what's my name" often miss; profile must be fully injected. (2) Scratchpad persists across tasks — last debug's stack trace pollutes today's chat. (3) Sending conversation straight to a vector DB — temporal order is lost, "the previous message" is unretrievable. (4) All four sharing one store, so a rebuild wipes profile too.
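Failure mode (2) is the easiest one to rule out structurally. A minimal sketch, assuming a scratchpad keyed by task ID as in the example above; `task_scope` and `note` are hypothetical names introduced here, not part of any library:

```python
from contextlib import contextmanager


class Scratchpad:
    """Working memory keyed by task; entries never outlive their task."""

    def __init__(self):
        self.by_task = {}

    def note(self, task_id, item):
        self.by_task.setdefault(task_id, []).append(item)

    def clear(self, task_id):
        self.by_task.pop(task_id, None)


@contextmanager
def task_scope(scratch, task_id):
    """Yield a per-task note function; wipe the task's notes on exit, even on error."""
    try:
        yield lambda item: scratch.note(task_id, item)
    finally:
        scratch.clear(task_id)


scratch = Scratchpad()
with task_scope(scratch, "debug-42") as note:
    note("stack trace: KeyError in parse()")
    during = list(scratch.by_task["debug-42"])   # visible while the task runs

after = dict(scratch.by_task)                    # empty once the task ends
```

Tying the clear to the end of the `with` block means nobody has to remember to call it; yesterday's stack trace cannot leak into today's chat.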
Short-Term Memory in Three Tiers: When to Truncate / Summarize / Go Hierarchical
Claim: When the conversation gets long, your first reflex should NOT be "add summarization" — it should be "can I just truncate?" Summarization is lossy; using it blindly grinds important detail to dust.
Background & Principles
Context overflowed — what now? Three tiers, different complexity and loss:
Truncate (sliding window): keep the last N messages, drop the rest. Zero cost, no distortion of what it keeps (everything outside the window is simply gone), reproducible. Use for: support, Q&A, single-turn instruction, i.e. conversations where old messages have no future value.
Summarize (incremental compaction): at some threshold, compress old messages into a summary that sits at the top of the system prompt. Medium loss, medium cost. Use for: research, planning, creative writing — anywhere you may need to look back. Claude Code and ChatGPT long chats both do this (Anthropic calls it "compaction"; we covered it in Day 2).
Hierarchical (chunk + retrieve): old messages get chunked into a vector store; query-aware retrieval pulls them back when needed. Low loss, but architecturally complex. The MemGPT paper (Packer et al. 2023) opened this lane: "main context" + "external context" + self-edit functions let the LLM page in/out itself.
The selection rule is simple — "will I ever need to look up specific old details?" Support: usually no (truncate). Long-form writing: yes, but inexact (summarize). Investment research / code review: yes, with exact quotes (hierarchical).
Three short-term tiers (loss vs complexity)
          simple ←──────────────────────────────→ complex
    ┌──────────┐   ┌──────────┐   ┌──────────────┐
loss│ Truncate │ → │Summarize │ → │ Hierarchical │
low │ window   │   │ compact  │   │ chunk+search │
 ↑  │ free     │   │ +1 LLM   │   │ +vector DB   │
    │ no recall│   │ gist kept│   │ exact recall │
    └──────────┘   └──────────┘   └──────────────┘
     support / QA   plan / write   research / code
start at the left; move right only if old detail is needed
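The selection rule fits in a few lines. This is a sketch of the decision only; the two boolean inputs are assumptions about how you would describe your own workload, and `pick_tier` is a hypothetical helper:

```python
from enum import Enum


class Tier(Enum):
    TRUNCATE = "truncate"
    SUMMARIZE = "summarize"
    HIERARCHICAL = "hierarchical"


def pick_tier(needs_old_detail: bool, needs_exact_quotes: bool) -> Tier:
    """Start at the cheapest tier; move right only when old detail is needed."""
    if not needs_old_detail:
        return Tier.TRUNCATE          # support, Q&A
    if not needs_exact_quotes:
        return Tier.SUMMARIZE         # planning, long-form writing
    return Tier.HIERARCHICAL          # research, code review
```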
Hands-on Example
All three tiers can live in one manager. The mode selects the tier, and summarize mode compacts automatically once the message count passes a threshold:
# conversation_manager.py: three tiers in one manager, selected by mode
# compact(), latest_user_msg() and recall_block() are helpers you supply;
# vector is any store exposing search(query, k), with old messages indexed elsewhere.
class ConversationManager:
    def __init__(self, mode="truncate", window=20, summary_threshold=40, vector=None):
        self.messages, self.summary = [], None
        self.mode, self.window, self.threshold = mode, window, summary_threshold
        self.vector = vector                         # only used in hierarchical mode

    def add(self, msg): self.messages.append(msg)

    def build(self):
        if self.mode == "truncate":
            return self.messages[-self.window:]
        if self.mode == "summarize":
            if len(self.messages) > self.threshold:
                old = self.messages[:-self.window]
                self.summary = compact(self.summary, old)   # LLM incremental
                self.messages = self.messages[-self.window:]
            head = [{"role": "system", "content": f"Prior summary:\n{self.summary}"}] \
                if self.summary else []
            return head + self.messages
        if self.mode == "hierarchical":
            recent = self.messages[-self.window:]
            old_chunks = self.vector.search(latest_user_msg(recent), k=5)
            return [recall_block(old_chunks)] + recent
        raise ValueError(f"unknown mode: {self.mode}")

# compaction prompt (don't lose the important details)
COMPACT = """Summarize the prior conversation. PRESERVE:
- All factual decisions made
- All open questions / TODOs
- Any explicit user preferences mentioned
DISCARD: greetings, clarifications already resolved, model's apologies.
Existing summary: {old_summary}
New messages: {new_msgs}
Output (≤300 tokens):"""
The PRESERVE / DISCARD list in the compaction prompt is what determines summary quality. Anthropic's docs explicitly call out "preserve decisions and open TODOs" — exactly the two things research and coding sessions lose most often.
Failure modes: (1) Summarizing too early — within 10 turns it just wastes tokens and adds loss; truncate first. (2) Re-summarizing the entire history every turn — cost explodes; do it incrementally (old summary + new messages → new summary). (3) Hierarchical retrieval losing temporal order — chunks come back without timestamp sorting, the LLM treats yesterday's decision as today's; always carry ts in chunk metadata. (4) Compaction prompt with no PRESERVE list — the summarizer decides what to drop, and it often drops user preferences ("short and unimportant" but actually critical).
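Failure mode (3) comes down to one sort and one label. A minimal sketch of a `recall_block` helper, assuming each retrieved chunk is a dict with a Unix-seconds `ts` and a `text` field; that chunk shape is an assumption made here, not something a particular vector store guarantees:

```python
from datetime import datetime, timezone


def _day(ts):
    """Format a Unix timestamp as a UTC calendar date."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


def recall_block(chunks):
    """Render retrieved chunks oldest-first, each labelled with its date."""
    ordered = sorted(chunks, key=lambda c: c["ts"])
    lines = [f"[{_day(c['ts'])}] {c['text']}" for c in ordered]
    return {"role": "system", "content": "Recalled from earlier:\n" + "\n".join(lines)}
```

Similarity rank decides which chunks come back; the timestamp decides the order in which the model reads them, so a superseded decision always appears before the one that replaced it.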
Claim: Vector retrieval can only answer "find something similar to X". It cannot answer "what did I decide last time" or "what are my preferences". Long-term memory needs three stores split by query type.
Background & Principles
Treating a vector DB as universal long-term memory was a 2023 myth. The senior pattern is splitting by query mode into three stores, each doing its own job:
Vector store (semantic retrieval): great for "find similar" / "find relevant" open-ended queries. Use for: docs, fuzzy chat recall, case-based reasoning. Fails at: exact facts, list enumeration, temporal queries.
Structured KV / SQL (exact facts): great for "what is the value of X" or "list all Y". Use for: user preferences, entity attributes, structured profile. Fails at: fuzzy queries. MemGPT's core memory is essentially structured KV.
Episodic event log (temporal events): great for "what happened last time", "what did I do in the past two weeks", "why did the last attempt fail". An append-only time series, indexed by (user_id, ts). Use for: reflection, postmortem, decision trace. The reflective memory in Reflexion (Shinn et al. 2023) is this structure.
These three are not mutually exclusive — they are complementary. A mature agent has all three simultaneously: vector for docs + KV for profile + event log for behavior history. Letta (the productized MemGPT) calls this the "memory hierarchy" and explicitly distinguishes core_memory (KV), archival_memory (vector), and recall_memory (event log). Microsoft GraphRAG takes a fourth path (knowledge graph), suitable for entity-relation-dense domains, but with high overhead — most teams should nail the first three first.
Hands-on Example
Implement a personal agent that "remembers you" with all three stores; the key is write routing — given an incoming utterance, decide which store it goes to:
# memory_router.py: one utterance in, routed to the right store
# llm(), classify(), embed(), now() and the three stores are placeholders you supply;
# llm() is assumed to return the parsed JSON as a dict.
ROUTE_PROMPT = """Classify this user utterance into ONE memory type:
- FACT : a stable fact about the user (name, role, preference)
- EVENT : an action/decision/experience that happened
- DOCUMENT : reference content (article, code, doc)
- NONE : transient (greeting, ack, clarification)
Return JSON: {"type": "...", "extract": "..."}"""

def ingest(utterance, user_id):
    r = llm(ROUTE_PROMPT, utterance)
    if r["type"] == "FACT":
        profile_kv.upsert(user_id, r["extract"])             # structured KV
    elif r["type"] == "EVENT":
        event_log.append(user_id, ts=now(), text=r["extract"])
    elif r["type"] == "DOCUMENT":
        vector_store.add(embed(r["extract"]), meta={"user": user_id})
    # NONE → drop

# on read, route by query type
def recall(query, user_id):
    intent = classify(query)                                 # fact / temporal / semantic
    if intent == "fact":
        return profile_kv.get_all(user_id)                   # full profile
    if intent == "temporal":
        return event_log.range(user_id, last="7d")           # by time
    return vector_store.search(query, k=5, filter={"user": user_id})
The essence of this router — the question chooses the store. "What's my name?" → KV. "What did we talk about last week?" → event log. "Find a passage I wrote about meditation" → vector. One store for everything is a beginner's illusion.
Failure modes: (1) Vector store only: queries like "list all my preferences" fail, because top-k similarity search cannot enumerate. (2) Event log without retention: six months in, retrieving 100k events blows the prompt; you need decay or summarization. (3) KV without conflict resolution: the user said "I'm vegetarian" half a year ago and "I eat meat now" recently, both are present, and the model gets confused; on write, overwrite and keep the timestamp. (4) The three stores not sharing a user_id schema: retrieval can't join across them, which is equivalent to not storing at all.
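The fix for failure mode (3), "overwrite + keep timestamp", can be sketched as an in-memory store. `ProfileKV` and its fields are hypothetical; a real deployment would put the same logic behind a SQL table or a KV service:

```python
import time


class ProfileKV:
    """Per-user key/value facts; a new value overwrites the old one and is logged."""

    def __init__(self):
        self.facts = {}      # (user_id, key) -> {"value": ..., "ts": ...}
        self.history = []    # superseded values, oldest first

    def upsert(self, user_id, key, value, ts=None):
        ts = time.time() if ts is None else ts
        prev = self.facts.get((user_id, key))
        if prev is not None and prev["ts"] > ts:
            return False                      # stale write: keep the newer fact
        if prev is not None and prev["value"] != value:
            self.history.append({"user": user_id, "key": key, **prev})
        self.facts[(user_id, key)] = {"value": value, "ts": ts}
        return True

    def get_all(self, user_id):
        return {k: v["value"] for (u, k), v in self.facts.items() if u == user_id}
```

Only one value per key ever reaches the prompt, so the model never sees "vegetarian" and "eats meat" side by side, while the history list keeps the old value for audit.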
Self-Maintained User Profile: Let the LLM Edit Its Own Notes "About You"
Claim: Treat the user profile as an LLM-maintained living document, not a database table — give the model read + write tools and let it decide what to remember, what to update, what to forget. This is the core mechanism behind "the agent gets to know you better".
Background & Principles
Section 3's structured KV solves "what to store and how to read"; one engineering question remains — who decides what to write. Three options:
User explicitly declares: have the user fill a form or write a system prompt. Low coverage, annoying.
Rule extraction: regex / rules to spot "my name is X" / "I like Y". Brittle, low recall.
LLM self-maintained: give the LLM an update_profile tool; after each turn it decides whether to update. MemGPT / Letta / ChatGPT memory / Claude memory all take this path.
The hard part of self-maintained isn't "let it write" — it's "let it write the right thing". The profile is fully injected into the next session's system prompt; garbage in it poisons forever. Four problems to solve:
Granularity: "user mentioned they went jogging" — write or not? Too fine → profile explodes; too coarse → miss the key. Heuristic: only write facts that remain true across sessions.
Conflict resolution: when new contradicts old ("I'm vegetarian" → "I eat meat now"), overwrite or append? Default: overwrite + keep a changelog.
Forgetting: profile cannot grow unbounded. LRU / explicit TTL / user-initiated delete. Letta uses "archival" to move inactive facts into the vector store.
Hallucination filter: the LLM may infer facts the user never stated ("user mentioned their kid's homework → assume they have a school-aged child"). Every self-maintained write must cite a specific message ID as evidence; no evidence, no write.
Hands-on Example
Give the agent a profile tool; after each turn run a "does this need updating" check:
# profile_tools.py: self-maintained profile tool for the LLM
# llm(), profile_kv, changelog and now() are placeholders you supply;
# llm() is assumed to return the parsed JSON as a dict.
import json

# Literal braces are doubled because the prompt goes through str.format().
PROFILE_UPDATE_PROMPT = """Review the latest user message and decide if the user
profile should be updated. Only write facts that:
1. Are explicitly stated by the user (cite message)
2. Are likely to be true beyond this session
3. Are not already in the profile (or contradict it)
Current profile: {profile_json}
Latest message (id={msg_id}): {message}
Output JSON:
{{
  "action": "add" | "update" | "delete" | "none",
  "key": "...",
  "value": "...",
  "evidence_msg_id": "...",  // required; no evidence → "none"
  "reason": "..."
}}"""

def maybe_update_profile(user_id, latest_msg, msg_id):
    current = profile_kv.get_all(user_id)
    decision = llm(PROFILE_UPDATE_PROMPT.format(
        profile_json=json.dumps(current), msg_id=msg_id, message=latest_msg))
    if decision["action"] == "none": return
    if not decision.get("evidence_msg_id"): return    # no evidence → reject
    profile_kv.apply(user_id, decision)
    changelog.append(user_id, decision, ts=now())     # traceable

# inject profile at session start
def build_system_prompt(user_id):
    p = profile_kv.get_all(user_id)
    return f"""You are a personal assistant for the following user.
Stable facts about them (use to personalize, but verify before acting):
{json.dumps(p, indent=2, ensure_ascii=False)}
Important: if a fact seems outdated, ASK the user to confirm rather than
silently override the profile."""
Three details decide quality: mandatory evidence_msg_id (blocks hallucination), changelog (traceable — when the user asks "how do you know I'm vegetarian" you can answer), and a system prompt that pushes the model to verify (so it doesn't act overconfidently on stale profile). All three together turn the profile from toy into engineered artifact.
Failure modes: (1) Free LLM writes with no audit — six months later 30% of the profile is model hallucination. (2) No forgetting — profile grows to 5000 tokens and eats half your context budget on every call. (3) Treating profile as ground truth and acting on it — preferences changed, the agent still recommends from the stale profile, UX collapses; any high-stakes decision must confirm. (4) One global profile shared across agents — scope / privacy chaos; profiles should be scoped per agent or per use-case.
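Failure mode (2) needs a hard ceiling rather than good intentions. A rough sketch of a budget check that keeps the most recently used facts and hands the rest to an archive; the `last_used` field and the whitespace token count are assumptions made here, so swap in your real tokenizer:

```python
def enforce_budget(facts, max_tokens, count_tokens=lambda s: len(s.split())):
    """Split facts into (kept, archived) so that kept fits within max_tokens.

    facts: {key: {"value": str, "last_used": float}}
    count_tokens: whitespace split is a crude stand-in for a real tokenizer.
    """
    kept, archived, used = {}, {}, 0
    # Most recently used facts get first claim on the budget (greedy, not optimal).
    by_recency = sorted(facts.items(), key=lambda kv: kv[1]["last_used"], reverse=True)
    for key, fact in by_recency:
        cost = count_tokens(f"{key}: {fact['value']}")
        if used + cost <= max_tokens:
            kept[key] = fact
            used += cost
        else:
            archived[key] = fact
    return kept, archived
```

Run it before each session-start injection: `kept` goes into the system prompt, `archived` goes to the vector store where it can still be retrieved on demand.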
Going deeper · Letta tutorial Building stateful agents, docs.letta.com ·
Simon Willison How ChatGPT memory works, simonwillison.net/2024/Apr/16 ·
Park et al. Generative Agents: Interactive Simulacra of Human Behavior (reflection / memory stream design), arxiv.org/abs/2304.03442
// Putting it together · Sketch your agent's state topology (30 min)
Pick an agent you're building or use a lot (personal research bot / coding agent / support / writing assistant) and walk through these 6 steps:
List the four layers (§1, 5 min): write down what each of Conversation / Scratchpad / Profile / Knowledge holds, and the lifecycle. Which layer is empty? An empty layer is a latent bug.
Pick a short-term strategy (§2, 5 min): what's your session length distribution? If P95 > 30 turns, upgrade from truncate to summarize; if you need exact recall, upgrade to hierarchical.
Evaluate long-term needs (§3, 10 min): list the 10 most common user query types and label each with vector / KV / event log. If 8 are vector, you probably don't need the other two stores yet; adding them now would be over-engineering.
Design the write router (§3-4, 5 min): when a user utterance comes in, who decides where it goes? Rules / LLM router / user-explicit? Does write carry an evidence field?
Set the forgetting policy (§4, 3 min): profile token ceiling? what happens when exceeded (LRU / archival / user review)? event log decay window?
Draw one diagram (2 min): on a whiteboard / draw.io, sketch the four layers and the routing arrows. If you can't draw it, your design isn't done.
In 30 minutes you should have: a state topology diagram, read/write policy per layer, write-routing logic, and a forgetting policy. This is the document that upgrades an agent from demo to "gets to know you" product. Next time a teammate asks "how does our agent remember things", you hand them this diagram — no more mumbling "we stuff it into Chroma".
// Deep Thinking
Context windows are already 1M+ tokens — do we still need long-term memory architecture? Why not stuff all history?
Yes, you still need it, for three reasons. (1) Cost: at roughly $3 per million input tokens, a 1M-token prompt is $3+ per call, and without prefix-cache hits latency can be 30s+. (2) Lost-in-the-middle (Liu et al. 2023): models use information placed in the middle of a long context far less reliably than information at the start or end; stuffing it in doesn't mean the model can use it. (3) Signal-to-noise: history is ~90% noise; full injection dilutes instructions and splits attention. A 1M context raises the ceiling but doesn't remove the question of what gets into context; context engineering becomes more important, not less.
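The cost point is plain arithmetic. A back-of-envelope sketch, assuming the $3 per million input tokens implied above, a 50-turn session, and a 20,000-token curated context; the last two numbers are illustrative assumptions, not measurements:

```python
def input_cost_usd(tokens_per_call, calls, usd_per_million_tokens=3.0):
    """Input-token spend, ignoring output tokens and any cache discounts."""
    return tokens_per_call * calls * usd_per_million_tokens / 1_000_000


full_history = input_cost_usd(1_000_000, calls=50)   # stuff everything, 50 turns
curated = input_cost_usd(20_000, calls=50)           # layered context, 50 turns
```

Under these assumptions the stuffed session costs $150 in input tokens against $3 for the curated one, before latency and recall quality are even considered.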
MemGPT's "self-edit memory" — LLM does its own page-in/page-out — looks elegant. Why hasn't industry adopted it at scale?
Three engineering realities: (1) Low predictability — the LLM decides when and what to page in; reproducing bugs in debug is hard. (2) Extra LLM-call cost — every decision step adds an inference, doubling latency and tokens. (3) Unfriendly to small models — self-edit requires strong meta-cognition; only GPT-4 / Claude Opus class is stable, smaller models often page out the key fact. Most products (ChatGPT memory / Claude memory) use the more boring path: fixed triggers (extract every turn) + simple KV — controllability wins.
Is user profile "the more the better" or is there an optimal size? How would you measure it?
There's an optimal range, typically 200–1500 tokens. Evidence: (1) too short → weak personalization, agent generic; (2) too long → dilutes instructions, splits attention, model "follows profile item #7 but ignores current user instruction". To measure: build a holdout task eval, x-axis = profile length, y-axis = task pass rate; you'll see an inverted U. Letta caps core memory around 2000 tokens and uses archival/vector as fallback — based on this kind of empirical signal.
What about an episodic event log that accumulates 100k+ events — retain everything or apply a forgetting curve?
Must forget, but via "tiered decay" rather than a blunt TTL. Three tiers: (1) last 7 days: raw retain; (2) 7–90 days: weekly summary, archive originals; (3) 90+ days: monthly summary into vector store. This maps cleanly to human working / short-term / long-term memory. Significant "life events" (major decisions, user-flagged important) live on a separate timeline kept forever. The reflection mechanism in the Generative Agents paper (Park et al. 2023) is the academic prototype — periodically distill episodic events into higher-level reflections.
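The three tiers plus the keep-forever timeline can be sketched as a retention planner. The event shape (`id`, `ts`, optional `pinned`) and the tier names are assumptions made here; the summarization step itself is left out:

```python
from datetime import datetime, timedelta, timezone


def decay_tier(event_ts, now):
    """Map an event's age onto the three retention tiers."""
    age = now - event_ts
    if age <= timedelta(days=7):
        return "raw"
    if age <= timedelta(days=90):
        return "weekly_summary"
    return "monthly_summary"


def plan_retention(events, now):
    """Group event IDs by tier; pinned events go to a separate keep-forever bucket."""
    plan = {"raw": [], "weekly_summary": [], "monthly_summary": [], "pinned": []}
    for ev in events:
        tier = "pinned" if ev.get("pinned") else decay_tier(ev["ts"], now)
        plan[tier].append(ev["id"])
    return plan
```

A nightly job could call `plan_retention` and then summarize each non-raw bucket, which keeps the planning logic testable without an LLM in the loop.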
For one user across multiple agents (work assistant / writing assistant / investment research), should profiles be shared or scoped?
Default scoped, share on request. The temptation: "write profile once, every agent knows me". The reality: scope confusion (the work agent shouldn't use your health data to decide), privacy leak, and profile cross-pollution (an investment agent mis-infers and writes to the main profile). Correct architecture: (1) one minimal global identity (name, language pref, timezone); (2) each agent has a scoped local profile; (3) cross-agent sharing requires explicit user approval. This boundary design matters more than tooling — it's the politics of personal AI infra, not just engineering.
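The three-part architecture can be sketched as a store whose read path enforces the boundary. `ProfileStore`, `grant`, and `view` are hypothetical names, and the grant here stands in for whatever explicit approval flow your product uses:

```python
class ProfileStore:
    """One minimal global identity plus a local profile per agent."""

    def __init__(self, global_identity):
        self.global_identity = dict(global_identity)
        self.local = {}        # agent -> {key: value}
        self.grants = set()    # (owner_agent, key, reader_agent), approved by the user

    def write(self, agent, key, value):
        """Agents can only write into their own scope."""
        self.local.setdefault(agent, {})[key] = value

    def grant(self, owner, key, reader):
        """Record the user's explicit approval to share one fact with one agent."""
        self.grants.add((owner, key, reader))

    def view(self, agent):
        """What `agent` may see: global identity, granted facts, its own facts."""
        visible = dict(self.global_identity)
        for owner, key, reader in self.grants:
            if reader == agent and key in self.local.get(owner, {}):
                visible[key] = self.local[owner][key]
        visible.update(self.local.get(agent, {}))
        return visible
```

Because writes are always local, a wrong inference by one agent stays in that agent's scope and cannot pollute the global identity.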