DAY 07 / PHASE 1 · ENGINEERING

Memory & State Management

Four-Layer State · Short-term Compression · 3 Long-term Architectures · Self-Maintained Profile

2026-05-25 · BigCat

An LLM is a stateless pure function; every "memory" is state you forced into a token budget. The gap between people who do this well and badly is an order of magnitude.

Foundation concepts → ai-ml-daily Day 1: LLM Basics (KV Cache)

// WHY THIS MATTERS

Ask 100 AI builders "how do you manage memory" and 90 will say "I stuff everything in a vector DB and retrieve." That's exactly why most agents start hallucinating after three turns — treating every kind of state as one kind of thing, handled by one tool (embedding + RAG). The senior mental model: context window is a scarce resource, state has at least four distinct kinds, and each has completely different write / read / expiry semantics. This week is not about tuning Chroma parameters (that's 101); it's about four things: how to split tangled state into Conversation / Scratchpad / Profile / Knowledge and manage each separately; why short-term memory should pick from Truncate > Summarize > Hierarchical by scenario; why vector retrieval only solves about a third of long-term memory and you need Structured KV + Episodic Event Log for the other two-thirds; and how the MemGPT / Letta lineage achieves "gets to know you better over time" via self-maintained profile. By the end you should be able to sketch your agent's state topology on a whiteboard and know which line goes where.

// 01

Split State Into Four Layers — Mixing Them Is the Accident

Claim: 90% of agent memory bugs come from one mistake — putting conversation history, intermediate results, user profile, and domain knowledge into one shared context. Their lifecycles, trustworthiness, and access patterns are nothing alike.

Background & Principles

Chapter 1 of any OS textbook tells you memory is tiered: registers / cache / RAM / disk. An LLM's "memory" is also tiered — most people just don't realize it. Anthropic's Building Effective Agents repeats one line: "context is a budget" — you must actively decide what each token is spent on. The four classes below should be tracked separately:

Cramming all four into one mechanism (stuff a vector DB) hits three failures simultaneously: (1) embedding-based retrieval over conversation loses temporal ordering — "the previous message" may not even rank; (2) user profile drowns in 1M chunks, cosine similarity can't pick the key fact; (3) scratchpad persists forever and pollutes the next task. The right answer is independent storage + independent read/write policy per layer.

Four-layer agent state (lifecycle short → long)
┌────────────────────────────────────────────────────────────────┐
│ Knowledge Base · forever       · query-aware retrieval RAG     │
├────────────────────────────────────────────────────────────────┤
│ User Profile   · cross-session · LLM-extract    · full         │
├────────────────────────────────────────────────────────────────┤
│ Conversation   · session       · window+summary · in-order     │
├────────────────────────────────────────────────────────────────┤
│ Scratchpad     · task          · burn after use · no carryover │
└────────────────────────────────────────────────────────────────┘
per-layer read/write policy; mixing breaks things

Hands-on Example

Model the four layers explicitly as Python classes; force prompt assembly through the layered interface:

# memory_layers.py — force the split; ban "one dict for everything"
from dataclasses import dataclass, field
from typing import Protocol

class MemoryLayer(Protocol):
    def read(self, query: str) -> str: ...
    def write(self, content: str, meta: dict) -> None: ...

@dataclass
class Conversation:                # short-term, append-only
    messages: list = field(default_factory=list)
    def read(self, _): return self.messages[-20:]   # sliding window default

@dataclass
class Scratchpad:                  # per-task scope
    by_task: dict = field(default_factory=dict)
    def clear(self, task_id): self.by_task.pop(task_id, None)

@dataclass
class Profile:                     # cross-session stable facts
    facts: dict = field(default_factory=dict)  # {"name":"BigCat", ...}
    def read(self, _): return self.facts            # full

@dataclass
class Knowledge:                   # docs/code, RAG entry
    index: object   # your vector store
    def read(self, query): return self.index.search(query, k=5)

# —— assemble through the four layers ——
def build_context(query, conv, scratch, profile, kb, task_id):
    # task_id is passed in explicitly; the scratchpad is scoped per task
    return {
        "system": f"User profile:\n{profile.read(None)}",
        "messages": conv.read(None),
        "working": scratch.by_task.get(task_id, []),   # current task only
        "retrieved": kb.read(query),         # only what matches query
    }

The point isn't the code — it's that the code forces you to answer four questions: which profile fields go fully into system? when does scratch clear? how large is the conv window? what's k for kb retrieval? An agent whose author can't answer all four will ship a memory incident.
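
One way to make those four answers explicit is to write them down as a small policy object. The sketch below is illustrative only: `MemoryPolicy`, `select_profile`, and every field name are hypothetical and not part of the listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPolicy:
    """One explicit answer per question; all names here are illustrative."""
    profile_fields: tuple[str, ...]   # profile keys injected in full into system
    clear_scratch_on: str             # event that wipes the scratchpad
    conv_window: int                  # recent messages kept verbatim
    kb_top_k: int                     # chunks retrieved per query

    def __post_init__(self):
        if self.conv_window <= 0 or self.kb_top_k <= 0:
            raise ValueError("conv_window and kb_top_k must be positive")


def select_profile(facts: dict, policy: MemoryPolicy) -> dict:
    """Keep only the profile keys the policy allows into the system prompt."""
    return {k: facts[k] for k in policy.profile_fields if k in facts}


policy = MemoryPolicy(
    profile_fields=("name", "language"),
    clear_scratch_on="task_end",
    conv_window=20,
    kb_top_k=5,
)
facts = {"name": "BigCat", "language": "en", "last_stack_trace": "..."}
system_facts = select_profile(facts, policy)
print(system_facts)
```

If the policy cannot be filled in, that is usually the signal that one of the four questions has not been decided yet.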

Failure modes: (1) Retrieving user profile via embeddings — factual queries like "what's my name" often miss; profile must be fully injected. (2) Scratchpad persists across tasks — last debug's stack trace pollutes today's chat. (3) Sending conversation straight to a vector DB — temporal order is lost, "the previous message" is unretrievable. (4) All four sharing one store, so a rebuild wipes profile too.
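
Failure mode (2) can be made hard to hit by tying scratchpad lifetime to a scope. The following is a minimal sketch, assuming a context manager is an acceptable way to express "burn after use"; the `task()` method is an illustrative addition, not part of the `Scratchpad` class shown earlier.

```python
from contextlib import contextmanager


class Scratchpad:
    """Per-task working notes that are dropped when the task scope exits."""

    def __init__(self):
        self.by_task = {}

    @contextmanager
    def task(self, task_id):
        notes = self.by_task.setdefault(task_id, [])
        try:
            yield notes
        finally:
            # Runs on normal exit and on exceptions, so nothing carries over.
            self.by_task.pop(task_id, None)


scratch = Scratchpad()
with scratch.task("debug-42") as notes:
    notes.append("stack trace: ...")
    inside = dict(scratch.by_task)   # snapshot while the task is active

print(inside)            # notes exist during the task
print(scratch.by_task)   # and are gone afterwards
```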

Going deeper · Anthropic Building Effective Agents, anthropic.com/research/building-effective-agents · Lilian Weng LLM Powered Autonomous Agents (Memory section), lilianweng.github.io/posts/2023-06-23-agent

// 02

Short-Term Memory in Three Tiers: When to Truncate / Summarize / Go Hierarchical

Claim: When the conversation gets long, your first reflex should NOT be "add summarization" — it should be "can I just truncate?" Summarization is lossy; using it blindly grinds important detail to dust.

Background & Principles

Context overflowed — what now? Three tiers, different complexity and loss:

The selection rule is simple — "will I ever need to look up specific old details?" Support: usually no (truncate). Long-form writing: yes, but inexact (summarize). Investment research / code review: yes, with exact quotes (hierarchical).
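
The selection rule fits in a few lines. This is a sketch of the rule exactly as stated above; the function and parameter names are hypothetical.

```python
def pick_tier(needs_old_detail: bool, needs_exact_quotes: bool = False) -> str:
    """Map 'will I need old details?' onto the cheapest tier that suffices."""
    if not needs_old_detail:
        return "truncate"
    return "hierarchical" if needs_exact_quotes else "summarize"


print(pick_tier(False))        # support chat
print(pick_tier(True))         # long-form writing
print(pick_tier(True, True))   # investment research / code review
```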

Three short-term tiers (loss vs complexity)
      simple ←──────────────────────────────→ complex
      ┌──────────┐   ┌──────────┐   ┌──────────────┐
  loss│ Truncate │ → │Summarize │ → │ Hierarchical │
  low │ window   │   │ compact  │   │ chunk+search │
   ↑  │ free     │   │ +1 LLM   │   │ +vector DB   │
      │ no recall│   │ gist kept│   │ exact recall │
      └──────────┘   └──────────┘   └──────────────┘
      support / QA   plan / write   research / code
start at the left; move right only if old detail is needed

Hands-on Example

All three tiers can live in one manager. The mode is chosen explicitly; in summarize mode, compaction only kicks in once the message count passes a threshold:

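# conversation_manager.py: three tiers, selected by mode
# compact(), latest_user_msg() and recall_block() are helpers you supply.
class ConversationManager:
    def __init__(self, mode="truncate", window=20, summary_threshold=40, vector=None):
        self.messages, self.summary = [], None
        self.mode, self.window, self.threshold = mode, window, summary_threshold
        self.vector = vector   # vector store; only needed for "hierarchical"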

    def add(self, msg): self.messages.append(msg)

    def build(self):
        if self.mode == "truncate":
            return self.messages[-self.window:]

        if self.mode == "summarize":
            if len(self.messages) > self.threshold:
                old = self.messages[:-self.window]
                self.summary = compact(self.summary, old)  # LLM incremental
                self.messages = self.messages[-self.window:]
            head = [{"role":"system","content":f"Prior summary:\n{self.summary}"}] \
                   if self.summary else []
            return head + self.messages

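        if self.mode == "hierarchical":
            recent = self.messages[-self.window:]
            # older messages must already be chunked and indexed in self.vector
            old_chunks = self.vector.search(latest_user_msg(recent), k=5)
            return [recall_block(old_chunks)] + recent

        raise ValueError(f"unknown mode: {self.mode}")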

# —— compaction prompt (don't lose the important details) ——
COMPACT = """Summarize the prior conversation. PRESERVE:
- All factual decisions made
- All open questions / TODOs
- Any explicit user preferences mentioned
DISCARD: greetings, clarifications already resolved, model's apologies.
Existing summary: {old_summary}
New messages: {new_msgs}
Output (≤300 tokens):"""

The PRESERVE / DISCARD list in the compaction prompt is what determines summary quality. Anthropic's docs explicitly call out "preserve decisions and open TODOs" — exactly the two things research and coding sessions lose most often.
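
To see the incremental behaviour without calling a model, the summarize tier can be run against a stub. This is a sketch under stated assumptions: `SummarizingBuffer` and `compact_stub` are hypothetical names, and the stub simply concatenates text where a real system would make an LLM call with the compaction prompt.

```python
class SummarizingBuffer:
    """Summarize tier only: compact the overflow, keep the last `window` messages."""

    def __init__(self, compact, window=3, threshold=5):
        self.compact = compact
        self.window, self.threshold = window, threshold
        self.messages, self.summary = [], None

    def add(self, msg):
        self.messages.append(msg)

    def build(self):
        if len(self.messages) > self.threshold:
            old = self.messages[:-self.window]
            # Old summary + only the overflow messages -> new summary.
            self.summary = self.compact(self.summary, old)
            self.messages = self.messages[-self.window:]
        head = ([{"role": "system", "content": f"Prior summary:\n{self.summary}"}]
                if self.summary else [])
        return head + self.messages


calls = []  # how many messages each compaction call received


def compact_stub(old_summary, new_msgs):
    """Stand-in for the LLM call; joins text so the flow is observable."""
    calls.append(len(new_msgs))
    parts = ([old_summary] if old_summary else []) + [m["content"] for m in new_msgs]
    return " | ".join(parts)


buf = SummarizingBuffer(compact_stub)
for i in range(9):
    buf.add({"role": "user", "content": f"m{i}"})
    context = buf.build()

print(calls)
print(buf.summary)
```

Each compaction call sees only the messages that fell out of the window plus the previous summary, never the full history.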

Failure modes: (1) Summarizing too early — within 10 turns it just wastes tokens and adds loss; truncate first. (2) Re-summarizing the entire history every turn — cost explodes; do it incrementally (old summary + new messages → new summary). (3) Hierarchical retrieval losing temporal order — chunks come back without timestamp sorting, the LLM treats yesterday's decision as today's; always carry ts in chunk metadata. (4) Compaction prompt with no PRESERVE list — the summarizer decides what to drop, and it often drops user preferences ("short and unimportant" but actually critical).
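
For failure mode (3), one possible shape for the recall block is shown below: retrieved chunks are re-sorted by timestamp and each line carries its date. The chunk layout (`text`, `ts`, `score`) is an assumption for illustration, not the schema of any particular vector store.

```python
from datetime import datetime, timezone


def recall_block(chunks):
    """Render retrieved chunks oldest-first, each prefixed with its date."""
    ordered = sorted(chunks, key=lambda c: c["ts"])
    lines = [f"[{c['ts'].date().isoformat()}] {c['text']}" for c in ordered]
    return {
        "role": "system",
        "content": "Recalled earlier context (oldest first):\n" + "\n".join(lines),
    }


# Similarity order (highest score first) is not chronological order.
hits = [
    {"text": "Considered shipping on Monday",
     "ts": datetime(2026, 5, 18, tzinfo=timezone.utc), "score": 0.95},
    {"text": "Decided to ship on Friday",
     "ts": datetime(2026, 5, 20, tzinfo=timezone.utc), "score": 0.91},
    {"text": "First raised the release date",
     "ts": datetime(2026, 5, 11, tzinfo=timezone.utc), "score": 0.88},
]
block = recall_block(hits)
print(block["content"])
```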

Going deeper · Packer et al. MemGPT: Towards LLMs as Operating Systems, arxiv.org/abs/2310.08560 · Anthropic Long context prompting tips, docs.anthropic.com/.../long-context-tips

// 03

Three Long-Term Memory Architectures: Vector / Structured KV / Episodic Event Log

Claim: Vector retrieval can only answer "find something similar to X". It cannot answer "what did I decide last time" or "what are my preferences". Long-term memory needs three stores split by query type.

Background & Principles

Treating a vector DB as universal long-term memory was a 2023 myth. The senior pattern is splitting by query mode into three stores, each doing its own job:

These three are not mutually exclusive — they are complementary. A mature agent has all three simultaneously: vector for docs + KV for profile + event log for behavior history. Letta (the productized MemGPT) calls this the "memory hierarchy" and explicitly distinguishes core_memory (KV), archival_memory (vector), and recall_memory (event log). Microsoft GraphRAG takes a fourth path (knowledge graph), suitable for entity-relation-dense domains, but with high overhead — most teams should nail the first three first.

Hands-on Example

Implement a personal agent that "remembers you" with all three stores; the key is write routing — given an incoming utterance, decide which store it goes to:

# memory_router.py — one utterance in, routed to the right store
ROUTE_PROMPT = """Classify this user utterance into ONE memory type:
- FACT      : a stable fact about the user (name, role, preference)
- EVENT     : an action/decision/experience that happened
- DOCUMENT  : reference content (article, code, doc)
- NONE      : transient (greeting, ack, clarification)
Return JSON: {"type": "...", "extract": "..."}"""

def ingest(utterance, user_id):
    r = llm(ROUTE_PROMPT, utterance)
    if r["type"] == "FACT":
        profile_kv.upsert(user_id, r["extract"])      # structured KV
    elif r["type"] == "EVENT":
        event_log.append(user_id, ts=now(), text=r["extract"])
    elif r["type"] == "DOCUMENT":
        vector_store.add(embed(r["extract"]), meta={"user":user_id})
    # NONE → drop

# —— on read, route by query type ——
def recall(query, user_id):
    intent = classify(query)  # fact / temporal / semantic
    if intent == "fact":
        return profile_kv.get_all(user_id)            # full profile
    if intent == "temporal":
        return event_log.range(user_id, last="7d")    # by time
    return vector_store.search(query, k=5, filter={"user":user_id})

The essence of this router — the question chooses the store. "What's my name?" → KV. "What did we talk about last week?" → event log. "Find a passage I wrote about meditation" → vector. One store for everything is a beginner's illusion.
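
The `classify` step is left undefined in the listing above. As a placeholder, a keyword heuristic is enough to exercise the routing; the patterns below are assumptions chosen to cover the three example questions, and a production router would more likely use an LLM or a trained classifier.

```python
import re

# Hypothetical keyword rules; temporal cues are checked before fact cues.
TEMPORAL = re.compile(
    r"\b(last|yesterday|ago|earlier|previously|when did)\b", re.IGNORECASE)
FACT = re.compile(r"\b(my|i am|i'm|do i|am i)\b", re.IGNORECASE)


def classify(query: str) -> str:
    """Return 'temporal', 'fact', or 'semantic' for a recall query."""
    if TEMPORAL.search(query):
        return "temporal"
    if FACT.search(query):
        return "fact"
    return "semantic"


for q in ("What's my name?",
          "What did we talk about last week?",
          "Find a passage I wrote about meditation"):
    print(q, "->", classify(q))
```

A heuristic like this misroutes plenty of real queries (for example "find my notes on meditation"), which is exactly why the intent classifier deserves its own evaluation set.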

Failure modes: (1) Vector store only — queries like "list all my preferences" cosine-fail every time. (2) Event log without retention — six months in, retrieving 100k events blows the prompt; need decay / summarize. (3) KV without conflict resolution — user said "I'm vegetarian" half a year ago and "I eat meat now" recently, both present, model gets confused; on write, overwrite + keep timestamp. (4) The three stores not sharing a user_id schema — retrieval can't join, equivalent to not storing.
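
The "overwrite + keep timestamp" rule from failure mode (3) might look like the sketch below. `ProfileKV` here is a hypothetical in-memory stand-in for the `profile_kv` store used above, with a keyed `upsert` signature chosen for illustration.

```python
from datetime import datetime, timezone


class ProfileKV:
    """Latest value wins per key; every accepted write is kept in history."""

    def __init__(self):
        self.current = {}   # user_id -> {key: (value, ts)}
        self.history = []   # (user_id, key, old_value, new_value, ts)

    def upsert(self, user_id, key, value, ts=None):
        ts = ts or datetime.now(timezone.utc)
        facts = self.current.setdefault(user_id, {})
        old = facts.get(key)
        if old is not None and old[1] > ts:
            return False    # a newer value is already stored; ignore stale write
        self.history.append((user_id, key, old[0] if old else None, value, ts))
        facts[key] = (value, ts)
        return True

    def get_all(self, user_id):
        return {k: v for k, (v, _) in self.current.get(user_id, {}).items()}


kv = ProfileKV()
kv.upsert("u1", "diet", "vegetarian", ts=datetime(2025, 11, 1, tzinfo=timezone.utc))
kv.upsert("u1", "diet", "eats meat", ts=datetime(2026, 5, 1, tzinfo=timezone.utc))
late = kv.upsert("u1", "diet", "vegan", ts=datetime(2025, 1, 1, tzinfo=timezone.utc))
print(kv.get_all("u1"), late)
```

Only one value per key ever reaches the prompt, while the history still records what changed and when.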

Going deeper · Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning, arxiv.org/abs/2303.11366 · Letta (MemGPT) docs Memory hierarchy, docs.letta.com/concepts/memory · Edge et al. From Local to Global: GraphRAG, arxiv.org/abs/2404.16130

// 04

Self-Maintained User Profile: Let the LLM Edit Its Own Notes "About You"

Claim: Treat the user profile as an LLM-maintained living document, not a database table — give the model read + write tools and let it decide what to remember, what to update, what to forget. This is the core mechanism behind "the agent gets to know you better".

Background & Principles

Section 3's structured KV solves "what to store and how to read"; one engineering question remains — who decides what to write. Three options:

The hard part of self-maintained isn't "let it write" — it's "let it write the right thing". The profile is fully injected into the next session's system prompt; garbage in it poisons forever. Four problems to solve:

  1. Granularity: "user mentioned they went jogging" — write or not? Too fine → profile explodes; too coarse → miss the key. Heuristic: only write facts that remain true across sessions.
  2. Conflict resolution: when new contradicts old ("I'm vegetarian" → "I eat meat now"), overwrite or append? Default: overwrite + keep a changelog.
  3. Forgetting: profile cannot grow unbounded. LRU / explicit TTL / user-initiated delete. Letta uses "archival" to move inactive facts into the vector store.
  4. Hallucination filter: the LLM may infer facts the user never stated ("user mentioned their kid's homework → assume they have a school-aged child"). Every self-maintained write must cite a specific message ID as evidence; no evidence, no write.
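
The "no evidence, no write" rule in problem 4 can be enforced in code rather than trusted to the prompt. The sketch below is deliberately conservative and rests on assumptions: `validate_write` is a hypothetical helper, messages are stored as `id -> (role, text)`, and a literal substring match stands in for whatever evidence check a real system would use.

```python
def validate_write(decision: dict, messages: dict) -> bool:
    """Accept a profile write only if it cites a real user message.

    `messages` maps message ID -> (role, text).
    """
    if decision.get("action") in (None, "none"):
        return False
    msg = messages.get(decision.get("evidence_msg_id"))
    if msg is None:
        return False                 # cited message does not exist
    role, text = msg
    if role != "user":
        return False                 # evidence must come from the user
    if decision["action"] == "delete":
        return True                  # deletions carry no value to check
    value = str(decision.get("value", "")).strip().lower()
    # Strict on purpose: the stated value must literally appear in the message.
    return bool(value) and value in text.lower()


messages = {
    "m1": ("user", "I'm vegetarian, by the way."),
    "m2": ("user", "I need to help with my kid's homework tonight."),
    "m3": ("assistant", "You seem to enjoy hiking."),
}
stated = {"action": "add", "key": "diet", "value": "vegetarian",
          "evidence_msg_id": "m1"}
inferred = {"action": "add", "key": "family", "value": "has a school-aged child",
            "evidence_msg_id": "m2"}
print(validate_write(stated, messages), validate_write(inferred, messages))
```

The substring test rejects paraphrases too, so it trades recall for safety; loosening it is a judgment call per product.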

Hands-on Example

Give the agent a profile tool; after each turn run a "does this need updating" check:

# profile_tools.py: self-maintained profile tool for the LLM
# llm(), profile_kv, changelog and now() are helpers you supply; llm() is
# assumed to return the parsed JSON decision as a dict.
import json

# Literal braces are doubled so str.format() leaves the JSON template intact.
PROFILE_UPDATE_PROMPT = """Review the latest user message and decide if the user
profile should be updated. Only write facts that:
1. Are explicitly stated by the user (cite message)
2. Are likely to be true beyond this session
3. Are not already in the profile (or contradict it)

Current profile: {profile_json}
Latest message (id {msg_id}): {message}

Output JSON:
{{
  "action": "add" | "update" | "delete" | "none",
  "key": "...",
  "value": "...",
  "evidence_msg_id": "...",   // required; no evidence → "none"
  "reason": "..."
}}"""

def maybe_update_profile(user_id, latest_msg, msg_id):
    current = profile_kv.get_all(user_id)
    decision = llm(PROFILE_UPDATE_PROMPT.format(
        profile_json=json.dumps(current), msg_id=msg_id, message=latest_msg))
    if decision["action"] == "none": return
    if not decision.get("evidence_msg_id"): return  # no evidence → reject
    profile_kv.apply(user_id, decision)
    changelog.append(user_id, decision, ts=now())   # traceable

# —— inject profile at session start ——
def build_system_prompt(user_id):
    p = profile_kv.get_all(user_id)
    return f"""You are a personal assistant for the following user.
Stable facts about them (use to personalize, but verify before acting):
{json.dumps(p, indent=2, ensure_ascii=False)}

Important: if a fact seems outdated, ASK the user to confirm rather than
silently override the profile."""

Three details decide quality: mandatory evidence_msg_id (blocks hallucination), changelog (traceable — when the user asks "how do you know I'm vegetarian" you can answer), and a system prompt that pushes the model to verify (so it doesn't act overconfidently on stale profile). All three together turn the profile from toy into engineered artifact.

Failure modes: (1) Free LLM writes with no audit — six months later 30% of the profile is model hallucination. (2) No forgetting — profile grows to 5000 tokens and eats half your context budget on every call. (3) Treating profile as ground truth and acting on it — preferences changed, the agent still recommends from the stale profile, UX collapses; any high-stakes decision must confirm. (4) One global profile shared across agents — scope / privacy chaos; profiles should be scoped per agent or per use-case.
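
Failure mode (2) calls for a hard ceiling on profile size. A minimal sketch, assuming least-recently-used eviction and a whitespace word count as a stand-in for a real tokenizer; `enforce_ceiling` and its parameters are hypothetical.

```python
def enforce_ceiling(facts, last_used, max_tokens,
                    count_tokens=lambda s: len(s.split())):
    """Evict least-recently-used facts until the profile fits the budget.

    Returns (kept, archived). `last_used` maps key -> last access time
    (any sortable value); keys missing from it are evicted first.
    """
    kept, archived = dict(facts), {}

    def size():
        return sum(count_tokens(f"{k}: {v}") for k, v in kept.items())

    for key in sorted(kept, key=lambda k: last_used.get(k, 0)):
        if size() <= max_tokens:
            break
        archived[key] = kept.pop(key)
    return kept, archived


facts = {
    "name": "BigCat",
    "diet": "vegetarian",
    "hobby": "likes long trail runs on weekends",
}
last_used = {"name": 300, "diet": 200, "hobby": 100}
kept, archived = enforce_ceiling(facts, last_used, max_tokens=5)
print(kept)
print(archived)
```

The archived facts are not deleted; they are the candidates for a vector store or a user review queue.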

Going deeper · Letta tutorial Building stateful agents, docs.letta.com · Simon Willison How ChatGPT memory works, simonwillison.net/2024/Apr/16 · Park et al. Generative Agents: Interactive Simulacra of Human Behavior (reflection / memory stream design), arxiv.org/abs/2304.03442

// Putting it together · Sketch your agent's state topology (30 min)

Pick an agent you're building or use a lot (personal research bot / coding agent / support / writing assistant) and walk through these 6 steps:

  1. List the four layers (§1, 5 min): write down what each of Conversation / Scratchpad / Profile / Knowledge holds, and the lifecycle. Which layer is empty? An empty layer is a latent bug.
  2. Pick a short-term strategy (§2, 5 min): what's your session length distribution? If P95 > 30 turns, upgrade from truncate to summarize; if you need exact recall, upgrade to hierarchical.
  3. Evaluate long-term needs (§3, 10 min): list the 10 most common user query types and label each with vector / KV / event log. If 8 of the 10 are vector, you probably don't need the other two stores yet; adding them now would be over-engineering.
  4. Design the write router (§3-4, 5 min): when a user utterance comes in, who decides where it goes? Rules / LLM router / user-explicit? Does write carry an evidence field?
  5. Set the forgetting policy (§4, 3 min): profile token ceiling? what happens when exceeded (LRU / archival / user review)? event log decay window?
  6. Draw one diagram (2 min): on a whiteboard / draw.io, sketch the four layers and the routing arrows. If you can't draw it, your design isn't done.

In 30 minutes you should have: a state topology diagram, read/write policy per layer, write-routing logic, and a forgetting policy. This is the document that upgrades an agent from demo to "gets to know you" product. Next time a teammate asks "how does our agent remember things", you hand them this diagram — no more mumbling "we stuff it into Chroma".

// Deep Thinking

Context windows are already 1M+ tokens — do we still need long-term memory architecture? Why not stuff all history?
Yes, you still need it, for three reasons. (1) Cost: a full 1M-token prompt can run $3+ per call, and without prefix-cache hits latency can reach 30s+. (2) Lost-in-the-middle (Liu et al. 2023): accuracy drops markedly when the relevant passage sits in the middle of a long context; stuffing it in doesn't mean the model can use it. (3) Signal-to-noise: most history is noise for the current query; full injection dilutes instructions and splits attention. A 1M context raises the ceiling but doesn't remove the question of what gets into context. Context engineering becomes more important, not less.
MemGPT's "self-edit memory" — LLM does its own page-in/page-out — looks elegant. Why hasn't industry adopted it at scale?
Three engineering realities: (1) Low predictability — the LLM decides when and what to page in; reproducing bugs in debug is hard. (2) Extra LLM-call cost — every decision step adds an inference, doubling latency and tokens. (3) Unfriendly to small models — self-edit requires strong meta-cognition; only GPT-4 / Claude Opus class is stable, smaller models often page out the key fact. Most products (ChatGPT memory / Claude memory) use the more boring path: fixed triggers (extract every turn) + simple KV — controllability wins.
Is user profile "the more the better" or is there an optimal size? How would you measure it?
There's an optimal range, typically 200–1500 tokens. Reasoning: (1) too short → weak personalization, agent generic; (2) too long → dilutes instructions, splits attention, model "follows profile item #7 but ignores current user instruction". To measure: build a holdout task eval, x-axis = profile length, y-axis = task pass rate; you should expect an inverted U. Letta caps the size of each core memory block by default and uses archival/vector memory as fallback, which is consistent with this kind of empirical signal.
What about an episodic event log that accumulates 100k+ events — retain everything or apply a forgetting curve?
Must forget, but via "tiered decay" rather than a blunt TTL. Three tiers: (1) last 7 days: raw retain; (2) 7–90 days: weekly summary, archive originals; (3) 90+ days: monthly summary into vector store. This maps cleanly to human working / short-term / long-term memory. Significant "life events" (major decisions, user-flagged important) live on a separate timeline kept forever. The reflection mechanism in the Generative Agents paper (Park et al. 2023) is the academic prototype — periodically distill episodic events into higher-level reflections.
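
The three tiers plus the pinned timeline reduce to a small age check. This sketch uses the 7-day and 90-day boundaries from the answer above; the function name and tier labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def decay_tier(event_ts, now, pinned=False):
    """Decide how an event should be stored, based on its age."""
    if pinned:
        return "keep_forever"        # major decisions, user-flagged events
    age = now - event_ts
    if age <= timedelta(days=7):
        return "raw"                 # retained verbatim
    if age <= timedelta(days=90):
        return "weekly_summary"      # summarized, originals archived
    return "monthly_summary"         # summarized into the vector store


now = datetime(2026, 5, 25, tzinfo=timezone.utc)
for days in (1, 30, 200):
    print(days, decay_tier(now - timedelta(days=days), now))
print(decay_tier(now - timedelta(days=400), now, pinned=True))
```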
For one user across multiple agents (work assistant / writing assistant / investment research), should profiles be shared or scoped?
Default scoped, share on request. The temptation: "write profile once, every agent knows me". The reality: scope confusion (the work agent shouldn't use your health data to decide), privacy leak, and profile cross-pollution (an investment agent mis-infers and writes to the main profile). Correct architecture: (1) one minimal global identity (name, language pref, timezone); (2) each agent has a scoped local profile; (3) cross-agent sharing requires explicit user approval. This boundary design matters more than tooling — it's the politics of personal AI infra, not just engineering.
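
The three-part architecture (minimal global identity, scoped local profiles, explicit grants) could be sketched as follows. `ProfileStore` and its methods are hypothetical names; a real system would also need persistence and an approval flow behind `grant`.

```python
class ProfileStore:
    """Global identity for everyone; everything else is scoped per agent."""

    def __init__(self, global_identity):
        self.global_identity = dict(global_identity)
        self.scoped = {}      # agent -> {key: value}
        self.grants = set()   # (from_agent, to_agent, key), user-approved

    def write(self, agent, key, value):
        self.scoped.setdefault(agent, {})[key] = value

    def grant(self, from_agent, to_agent, key):
        """Record an explicit user approval to share one key."""
        self.grants.add((from_agent, to_agent, key))

    def view(self, agent):
        """What this agent may see: global + granted keys + its own scope."""
        merged = dict(self.global_identity)
        for src, dst, key in self.grants:
            if dst == agent and key in self.scoped.get(src, {}):
                merged[key] = self.scoped[src][key]
        merged.update(self.scoped.get(agent, {}))
        return merged


store = ProfileStore({"name": "BigCat", "timezone": "UTC"})
store.write("health", "diet", "vegetarian")
store.write("work", "role", "engineer")

before = store.view("work")
store.grant("health", "work", "diet")
after = store.view("work")
print(before)
print(after)
```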

// Further Reading