The context window is the model's working set, much like a process's virtual memory: it looks as if it can hold anything, but it has a hard ceiling, and performance degrades as you approach it. Backend analogues: a database buffer pool, PostgreSQL's work_mem, the CPU's L1/L2 cache. Sizing is always a trade-off, never "bigger is better".
Transformer attention is O(n²): every token computes a similarity score against every other token in the window. A 1M-token window means ~10¹² similarity ops per attention head per layer, and KV cache memory grows linearly with length (each token's key and value vectors must be stored for every layer). Window size is a hardware cost problem, not a "make it big" knob.
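To make the quadratic-compute and linear-memory claims concrete, here is a back-of-the-envelope calculator. It is a minimal sketch: `ModelShape` and its default numbers are hypothetical and do not describe any real model, and the pair count is for full (non-causal) attention in a single head of a single layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape; the numbers are illustrative, not a real model's."""
    n_layers: int = 32
    n_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 / bf16


def attention_pairs(n_tokens: int) -> int:
    """Pairwise similarity scores for full attention, per head per layer."""
    return n_tokens * n_tokens


def kv_cache_bytes(n_tokens: int, shape: ModelShape = ModelShape()) -> int:
    """KV cache size: one key and one value vector per token, per KV head, per layer."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value
    return n_tokens * per_token


for n in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9,} tokens: {attention_pairs(n):.1e} pairs, KV cache {gib:.1f} GiB")
```

Doubling the window doubles the KV cache but quadruples the pair count, which is why long windows are priced the way they are.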
2026 state of the art: Claude 4.7 / GPT-5 around 200K–1M, Gemini 2.5 around 2M, most open-source models 32K–128K. But the counter-intuitive core finding of Liu et al. 2023, "Lost in the Middle", is this: when a key fact sits at the start or end of a long context, recall is roughly 80%; when it sits in the middle, recall drops to roughly 50%. Recall plotted against the fact's position is U-shaped: high at both ends, lowest in the middle.
Even 2025 models that advertise "100% needle-in-haystack recall" still show clear position bias on multi-needle reasoning tasks. Treating the context window as "unlimited attention" is the canonical misconception.
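One practical response to position bias is to reorder retrieved material so that the most relevant pieces sit at the edges of the prompt. The sketch below assumes the documents arrive already ranked best-first by some retriever; `reorder_for_edges` is a hypothetical helper name, not a library function.

```python
def reorder_for_edges(docs_best_first: list[str]) -> list[str]:
    """Place the highest-ranked documents at the start and end, lowest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_best_first):
        # Alternate: rank 0 to the front, rank 1 to the back, rank 2 to the front, ...
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["A (best)", "B", "C", "D", "E (worst)"]
print(reorder_for_edges(ranked))
```

With five ranked documents, the best lands first, the second-best lands last, and the weakest ends up in the middle, where a miss costs the least.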
    from anthropic import Anthropic

    client = Anthropic()  # needs ANTHROPIC_API_KEY
    MODEL = "claude-opus-4-7"

    # Always measure tokens via the SDK before deciding to stuff something in
    with open("report.txt", encoding="utf-8") as f:
        long_doc = f.read()

    messages = [{
        "role": "user",
        # Put the critical question LAST to ride the U-curve's recency end
        "content": f"<document>{long_doc}</document>\n\nBased on the above: what is the key conclusion?",
    }]

    # Pre-flight count: no generation happens here
    estimate = client.messages.count_tokens(model=MODEL, messages=messages)
    print(estimate.input_tokens, "input tokens (pre-flight count)")

    resp = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
    print(resp.usage.input_tokens, "input tokens")  # track usage
Treat a prompt like an HTTP request: the static system prompt + tool schemas + few-shot examples are your "static assets". Re-running attention on them every call is the equivalent of not using a CDN. Context caching persists that prefix's KV cache into GPU memory or fast SSD so the next hit skips prefill — same family as Redis cache, HTTP 304, Postgres prepared statements.
In a typical agent app, ~80% of the prompt is a stable prefix (persona + tool descriptions + earlier chat history); only the latest user turn is new. Without caching, the LLM re-runs prefill from scratch on every call: a 50K-token fixed prefix means 50K tokens reprocessed each time, on the order of 50K² ≈ 2.5 × 10⁹ pairwise attention scores per head per layer. Caching mechanics:
    from anthropic import Anthropic

    client = Anthropic()

    # Placeholder path: the 50K-token filing that gets queried repeatedly
    with open("filing.txt", encoding="utf-8") as f:
        long_company_filing = f.read()

    # cache_control marks "this block and everything before it" as a cacheable prefix
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=[
            {"type": "text", "text": "You are a senior financial analyst..."},
            {"type": "text", "text": long_company_filing,
             "cache_control": {"type": "ephemeral"}},  # cache breakpoint
        ],
        messages=[{"role": "user", "content": "Which segment drove revenue growth?"}],
    )
    print(resp.usage.cache_creation_input_tokens)  # first call: write
    print(resp.usage.cache_read_input_tokens)      # later: hit (90% cheaper)
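To see when caching pays off, the arithmetic can be sketched as below. The `read_mult=0.10` default mirrors the "90% cheaper" figure above. The `write_mult=1.25` premium on the first call is an assumption to check against your provider's current pricing. The model also assumes every later call arrives before the cache entry expires. `input_cost_units` is a hypothetical helper.

```python
def input_cost_units(prefix_tokens: int, dynamic_tokens: int, calls: int,
                     cached: bool, write_mult: float = 1.25,
                     read_mult: float = 0.10) -> float:
    """Total input cost over `calls` requests, in units of one uncached input token."""
    if calls <= 0:
        return 0.0
    if not cached:
        return float(calls * (prefix_tokens + dynamic_tokens))
    # First call writes the prefix into the cache (assumed premium).
    first = prefix_tokens * write_mult + dynamic_tokens
    # Every later call reads the prefix from the cache (assumed always a hit).
    later = (calls - 1) * (prefix_tokens * read_mult + dynamic_tokens)
    return first + later


uncached = input_cost_units(50_000, 500, calls=20, cached=False)
cached = input_cost_units(50_000, 500, calls=20, cached=True)
print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}  ratio: {cached / uncached:.2f}")
```

Under these assumptions a single call is more expensive with caching (the write premium), so the break-even arrives on the second hit. Prefixes that are queried only once are not worth marking.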
Long-conversation context management = database log rotation + compaction. MySQL binlog doesn't grow forever — past a threshold, rotate, compress, archive. LLM context is identical: as conversations grow, compression is a hard requirement, not an option.
Past 50K–100K tokens, three problems hit at once: window overflow (hard limit), attention dilution (lost-in-the-middle), linear cost growth per turn (all history recomputed). Four mainstream strategies, ordered by fidelity-vs-cost:
Anthropic's 2025 Memory tool and automatic compaction ship this machinery built in: context is summarized automatically near the limit, with no app-level code. But what to compress, and at what granularity, remains a product decision that no default can make for you.
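    from anthropic import Anthropic

    client = Anthropic()
    SUMMARY_MODEL = "claude-haiku-4-5-20251001"  # cheap model for summarization

    def count_tokens(messages):
        # Server-side count via the SDK; assumes every message has string content
        return client.messages.count_tokens(model=SUMMARY_MODEL, messages=messages).input_tokens

    def compress_history(messages, max_tokens=8000):
        # Simple two-tier: keep last 6 turns verbatim, summarize the rest
        recent = messages[-6:]
        older = messages[:-6]
        if not older or count_tokens(messages) < max_tokens:
            return messages
        older_text = "\n".join(f"{m['role']}: {m['content']}" for m in older)
        summary = client.messages.create(
            model=SUMMARY_MODEL,
            max_tokens=400,
            messages=[{"role": "user", "content":
                       f"Summarize as 5 bullets (key facts, decisions, open issues):\n{older_text}"}],
        ).content[0].text
        # Reassemble: summary + recent verbatim (recent details preserved).
        # The Messages API has no "system" role inside `messages`, so the summary
        # goes in as a leading user message; it could also be passed via `system=`.
        return [{"role": "user", "content":
                 f"<earlier_conversation_summary>\n{summary}\n</earlier_conversation_summary>"}] + recent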
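For comparison, the cheapest and lowest-fidelity alternative to summarization is a plain sliding window: drop the oldest messages until the rest fits a token budget. This is a different technique from the two-tier summary above, shown only to mark the other end of the fidelity-vs-cost range. `estimate_tokens` is a crude hypothetical stand-in (about four characters per token) that should be replaced with a real tokenizer count.

```python
def estimate_tokens(text: str) -> int:
    """Crude placeholder: ~4 characters per token. Swap in a real tokenizer count."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the newest messages whose estimated total fits within `budget` tokens."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return kept[::-1]  # restore chronological order


history = [{"role": "user", "content": "x" * 40} for _ in range(5)]
print(len(trim_to_budget(history, budget=25)), "messages kept")
```

Nothing older than the window survives, so any fact the user stated early on is simply gone. That loss is what the summarization tier is paying to avoid.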
LLMs are stateless — every API call is a cold start. To make one "remember you", you must architect a memory hierarchy like an OS: CPU register (current context) → L1/L2 cache (cached prefix) → RAM (recent sessions) → disk (vector store + notes). "Memory management" is designing that hierarchy.
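The hierarchy idea can be sketched as a toy two-tier store. A small "hot" tier stands in for what currently fits in context, and an unbounded "cold" tier stands in for the disk or vector-store layer. `TieredMemory` and its LRU eviction policy are illustrative assumptions, not how any particular product manages memory.

```python
from collections import OrderedDict
from typing import Optional


class TieredMemory:
    """Toy two-tier store: a small LRU 'hot' tier in front of an unbounded 'cold' tier."""

    def __init__(self, hot_capacity: int = 2) -> None:
        self.hot: "OrderedDict[str, str]" = OrderedDict()  # stands in for the context window
        self.cold: dict = {}                                # stands in for disk / vector store
        self.hot_capacity = hot_capacity

    def write(self, key: str, value: str) -> None:
        self.cold[key] = value  # everything lands in the durable tier

    def read(self, key: str) -> Optional[str]:
        if key in self.hot:
            self.hot.move_to_end(key)  # mark as most recently used
            return self.hot[key]
        value = self.cold.get(key)
        if value is not None:
            self.hot[key] = value  # promote on access
            if len(self.hot) > self.hot_capacity:
                self.hot.popitem(last=False)  # evict the least recently used entry
        return value


mem = TieredMemory(hot_capacity=2)
for k in ("allergy", "favorite_color", "last_topic"):
    mem.write(k, f"value of {k}")
    mem.read(k)
print(list(mem.hot))  # only the two most recently read keys remain hot
```

The design question the sketch leaves open is the important one: what decides promotion and eviction. Here it is recency; in a real system it is usually relevance to the current query.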
Stuffing "every conversation from the last three months" into context is impossible (window overflow) and pointless (lost in the middle). Four memory types borrowed from cognitive science, each mapped to a storage strategy:
Mainstream implementations: MemGPT / Letta use an OS-paging metaphor so the LLM decides what to swap in/out; mem0 / LangMem wrap episodic + semantic into SDKs; OpenAI ChatGPT "memory" and Anthropic's Memory tool are productized versions. Common pattern: extract → store → retrieve → inject.
Key engineering call: what is worth remembering? Not "everything said in chat"; instead, use an extraction LLM to judge "will this affect a future answer?" Mem0's 2025 paper reports that active extraction + dedup scores ~26% higher (relative, on an LLM-as-judge metric) than OpenAI's memory, and cuts p95 latency by ~91% compared with feeding in the full conversation history.
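The extract → store → retrieve → inject loop, with dedup, can be sketched without any external service. Everything here is a toy stand-in. `worth_remembering` replaces the extraction LLM with a first-person keyword check, and dedup and retrieval use word overlap where a real system would use embeddings. None of the names below are mem0 APIs.

```python
import re


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))


def worth_remembering(utterance: str) -> bool:
    """Stand-in for the extraction LLM: keep first-person statements about the user."""
    return bool(tokenize(utterance) & {"i", "i'm", "my", "we"})


class MemoryStore:
    def __init__(self, dedup_threshold: float = 0.8) -> None:
        self.facts: list = []
        self.dedup_threshold = dedup_threshold

    def add(self, utterance: str) -> bool:
        """Extract, then dedup by Jaccard word overlap. Returns True if stored."""
        if not worth_remembering(utterance):
            return False
        new = tokenize(utterance)
        for fact in self.facts:
            old = tokenize(fact)
            if len(new & old) / len(new | old) >= self.dedup_threshold:
                return False  # near-duplicate of something already stored
        self.facts.append(utterance)
        return True

    def search(self, query: str, limit: int = 3) -> list:
        """Rank facts by word overlap with the query; drop zero-overlap facts."""
        q = tokenize(query)
        scored = [(len(q & tokenize(f)), f) for f in self.facts]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [f for _, f in scored[:limit]]


def inject(facts: list) -> str:
    return "<known_about_user>\n" + "\n".join(facts) + "\n</known_about_user>"


store = MemoryStore()
store.add("I'm allergic to peanuts")
store.add("I'm allergic to peanuts")    # rejected as a duplicate
store.add("The weather is nice today")  # rejected by the extraction check
store.add("My kid loves blue")
print(inject(store.search("cake for my kid")))
```

Note what the toy retriever misses: the peanut allergy shares no words with "cake for my kid", so it is not injected. Closing that gap is the reason production systems retrieve by semantic similarity instead of lexical overlap.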
    from anthropic import Anthropic
    from mem0 import Memory  # pip install mem0ai

    client = Anthropic()
    m = Memory()  # mem0's default config uses OpenAI models (assumes OPENAI_API_KEY is set)

    # 1) Write: mem0 uses an LLM to extract "worth-remembering" facts, dedup, store in vector DB
    m.add("I'm allergic to peanuts; my kid loves blue", user_id="bigcat")
    m.add("Last week we discussed Mamba: concluded SSMs suit long sequences", user_id="bigcat")

    # 2) Retrieve: pull memories relevant to the current query (not entire history)
    hits = m.search("birthday cake ideas for my kid", user_id="bigcat", limit=3)

    # 3) Inject only the relevant few
    ctx = "\n".join(h["memory"] for h in hits["results"])
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=f"<known_about_user>\n{ctx}\n</known_about_user>",
        messages=[{"role": "user", "content": "Recommend a birthday cake recipe for Saturday"}],
    )
    # Intended effect: the model avoids peanuts and may suggest blue decorations
    print(resp.content[0].text)