AI/ML Deep Dive: Context Engineering

Day 8 · 2026-05-25

For: experienced engineers outside the AI/ML field

Engineering counterpart → super-individual D2: Context Engineering (lost-in-the-middle, information layout)

Context Window

hardware limit · attention

One-line analogy

The context window is the model's working set — like a process's virtual memory: you imagine it holds anything, but it has a hard ceiling and performance degrades the closer you get to it. Backend analogues: DB buffer pool, SQL work_mem, CPU L1/L2 cache — sizing is always a trade-off, never "bigger is better".

What it solves + how it works

Transformer attention is O(n²): every token computes a similarity score against every other token in the window. A 1M-token window means ~10¹² pairwise scores per attention head per layer, and KV cache memory grows linearly with length (each token's Key and Value vectors must be stored for every layer). Window size is a hardware cost problem, not a "make it big" knob.
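To make the scaling concrete, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte values) are hypothetical placeholders, not the configuration of any named model.

```python
# Back-of-the-envelope scaling for attention and the KV cache.
# All model dimensions below are hypothetical placeholders.

def attention_pairs(n_tokens: int) -> int:
    """Token-to-token similarity scores in full self-attention (per head, per layer)."""
    return n_tokens * n_tokens

def kv_cache_bytes(n_tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Memory for stored Keys and Values: grows linearly with n_tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return n_tokens * per_token

for n in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9,} tokens: {attention_pairs(n):.1e} pairs, {gib:,.1f} GiB KV cache")
```

The quadratic term is compute and the linear term is memory; both have to fit on real hardware, which is why window size is priced rather than free.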

2026 state-of-the-art: Claude 4.7 / GPT-5 around 200K–1M, Gemini 2.5 around 2M, most open-source 32K–128K. But the counter-intuitive core finding from Liu et al. 2023, "Lost in the Middle", is about position: a key fact at the start or end of a long context is recalled far more reliably than one buried in the middle. Roughly 80% recall at the edges versus roughly 50% in the middle is the typical shape; exact values vary by model and task. Attention is U-shaped:

Key-fact position → model recall
start 80%
1/4 in 60%
middle 48% (trough)
3/4 in 65%
end 82%
↑ U-shape: long context ≠ everything is "seen"

Even 2025 models that advertise "100% needle-in-haystack recall" still show clear position bias on multi-needle reasoning tasks. Treating the context window as "unlimited attention" is the canonical misconception.
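One common way to work with the U-shape rather than against it is to reorder retrieved chunks so the strongest ones sit at the edges. The sketch below is a simple reordering heuristic under that assumption; the function name is hypothetical and the input is assumed to be sorted best-first already.

```python
def order_for_edges(chunks_best_first):
    """Place the highest-ranked chunks at the start and end, weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["A (most relevant)", "B", "C", "D", "E (least relevant)"]
print(order_for_edges(ranked))
```

With five chunks, the best lands first, the second-best lands last, and the least relevant ends up in the middle where recall is weakest anyway.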

Code example

from anthropic import Anthropic
client = Anthropic()  # needs ANTHROPIC_API_KEY

with open("report.txt", encoding="utf-8") as f:
    long_doc = f.read()

messages = [{
    "role": "user",
    # Put the critical question LAST: ride the U-curve's recency end
    "content": f"<document>{long_doc}</document>\n\nBased on the above: what is the key conclusion?"
}]

# Measure tokens via the SDK BEFORE deciding to stuff something in
estimate = client.messages.count_tokens(model="claude-opus-4-7", messages=messages)
print(estimate.input_tokens, "input tokens (pre-flight estimate)")

resp = client.messages.create(model="claude-opus-4-7", max_tokens=1024, messages=messages)
print(resp.usage.input_tokens, "input tokens (billed)")  # track usage

Pitfall + scenario

"With a 1M window I'll dump all company docs and skip RAG" — wrong. A 200K input is both expensive (cost scales linearly in tokens) and slow (TTFT scales with input length), and middle content still gets ignored. Both Anthropic and Google explicitly recommend RAG + short context > raw long context, especially with stable corpora and varied queries.

📌 BigCat scenario: when reading a long paper or book, don't dump the whole text; build an 8–32K focused context of current chapter + abstract + your specific question. Quality far beats stuffing 200K, at roughly 10× lower per-call cost.
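The per-call cost gap is plain arithmetic, since input cost is linear in tokens. A minimal sketch, assuming a hypothetical price of 5 USD per million input tokens (substitute your provider's real rate):

```python
def input_cost_usd(input_tokens: int, usd_per_million: float) -> float:
    """Input-side cost of one call; linear in token count."""
    return input_tokens / 1_000_000 * usd_per_million

PRICE = 5.0  # hypothetical USD per million input tokens

stuffed = input_cost_usd(200_000, PRICE)  # dump everything
focused = input_cost_usd(20_000, PRICE)   # chapter + abstract + question

print(f"stuffed: ${stuffed:.2f}  focused: ${focused:.2f}  ratio: {stuffed / focused:.0f}x")
```

The ratio does not depend on the price chosen, only on the two token counts.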

Takeaway + reflection

💡 The context window is a working set, not a database. The goal is "put in the right things", not "put in more things".
🤔 When you defaulted to "longer context = smarter model", what cheaper engineering options did you miss?

Context Caching (Prompt Caching)

optimization · cost

One-line analogy

Treat a prompt like an HTTP request: the static system prompt + tool schemas + few-shot examples are your "static assets". Re-running attention on them every call is the equivalent of not using a CDN. Context caching persists that prefix's KV cache into GPU memory or fast SSD so the next hit skips prefill — same family as Redis cache, HTTP 304, Postgres prepared statements.

What it solves + how it works

In a typical agent app, ~80% of the prompt is invariant between consecutive calls (persona + tool descriptions + earlier chat history, which only grows by appending); only the latest user turn changes. Without caching, the LLM re-runs prefill from scratch on every call: for a 50K-token fixed prefix, that is 50K tokens of prefill recomputed each time. Caching mechanics:

Hit condition: the prefix must match token-for-token. Change a punctuation mark and it invalidates — as brittle as a Redis key hash;
Order is critical: invariant content goes first, dynamic content last. Inserting the user query in the middle of the system prompt = cache never hits;
Pricing shape: Anthropic prompt caching costs ~25% more on write, ~90% less on read. It pays for itself on the first hit (1.25 + 0.10 = 1.35 vs 2.00 for two uncached calls); write once, read many is the play;
TTL: Anthropic defaults to 5 min, 1-hour TTL is purchasable; OpenAI auto-caches for ~5–10 min, free but uncontrollable.
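The pricing shape above can be checked with a few lines of arithmetic. This sketch uses only the +25% write and −90% read figures from the list; costs are expressed as multiples of what one uncached pass over the prefix would cost, and it assumes every later call arrives within the TTL.

```python
WRITE_MULT = 1.25  # cache write: ~25% premium over the base input price
READ_MULT = 0.10   # cache read: ~90% discount

def cost_without_cache(calls: int) -> float:
    """Every call pays the full prefix price."""
    return float(calls)

def cost_with_cache(calls: int) -> float:
    """First call writes the cache; every later call (within the TTL) reads it."""
    if calls == 0:
        return 0.0
    return WRITE_MULT + READ_MULT * (calls - 1)

def break_even_calls() -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    calls = 1
    while cost_with_cache(calls) >= cost_without_cache(calls):
        calls += 1
    return calls

print("break-even at call number:", break_even_calls())
for n in (1, 2, 10, 100):
    print(n, "calls:", cost_without_cache(n), "uncached vs", round(cost_with_cache(n), 2), "cached")
```

A single call that is never followed by a hit is the only losing case, which is why reuse frequency matters more than prefix size.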

Prompt layout (order matters)
① system prompt + ② tools schema + ③ few-shot + ④ chat history + ⑤ user query
└────────── cache hit (KV reused, prefill skipped) ──────────┘   └ recomputed each call ┘

Anti-pattern: user query → system prompt = cache never hits
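Why order matters can be shown without any API call: compare how much of the prompt two consecutive requests share from the first character onward. This is a sketch; characters stand in for tokens, and the prompt text and helper names are made up for illustration.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

STATIC = "SYSTEM: You are a support agent.\nTOOLS: search, refund\nEXAMPLES: ...\n"

def static_first(query: str) -> str:
    return STATIC + "USER: " + query          # invariant content first

def query_first(query: str) -> str:
    return "USER: " + query + "\n" + STATIC   # anti-pattern: dynamic content first

q1, q2 = "Where is my order?", "Cancel my subscription."
print("static first:", shared_prefix_len(static_first(q1), static_first(q2)), "chars reusable")
print("query first: ", shared_prefix_len(query_first(q1), query_first(q2)), "chars reusable")
```

In the anti-pattern the shared prefix ends at the first character of the query, so none of the static content after it can be reused.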

Code example

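from anthropic import Anthropic
client = Anthropic()

# The filing to query repeatedly (the file name is a placeholder)
with open("company_filing.txt", encoding="utf-8") as f:
    long_company_filing = f.read()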

# cache_control marks "this block and everything before it" as a cacheable prefix
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system=[
        {"type": "text", "text": "You are a senior financial analyst..."},
        {"type": "text",
         "text": long_company_filing,  # 50K-token filing, queried repeatedly
         "cache_control": {"type": "ephemeral"}}  # ← cache breakpoint
    ],
    messages=[{"role": "user", "content": "Which segment drove revenue growth?"}]
)
print(resp.usage.cache_creation_input_tokens)  # first call: write
print(resp.usage.cache_read_input_tokens)      # later: hit (90% cheaper)

Pitfall + scenario

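"I'll put cache_control on everything, including content that changes, to save more" is wrong. A prefix that gets rewritten every turn (history that is re-summarized, reordered, or stamped with the current time) never hits, and you pay the +25% write premium every time. Append-only chat history is different: earlier turns stay byte-identical, so a breakpoint that moves forward with the conversation can still hit. Rule: only cache prefixes that stay invariant. Test: can you predict that the next request will contain this content byte-identical? If not, don't cache it.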

📌 BigCat scenario: cache a ~20K system prompt holding your child's learning profile + past mistakes + preferred frameworks. Each "explain today's math problem" then only pays for the user query. The monthly token bill drops visibly.
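The "byte-identical" test from the pitfall can be automated locally by fingerprinting the blocks you intend to cache and comparing fingerprints across requests. This is a sketch of a client-side check only; the helper name is hypothetical and the hash has nothing to do with how a provider keys its cache internally.

```python
import hashlib
import json

def prefix_fingerprint(blocks: list) -> str:
    """Stable SHA-256 over the exact content of the would-be cached prefix."""
    payload = json.dumps(blocks, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

stable = [{"type": "text", "text": "You are a senior financial analyst."}]
# Same prefix, but with a wall-clock timestamp injected on each request:
with_clock_a = stable + [{"type": "text", "text": "Now: 2026-05-25T09:00:00Z"}]
with_clock_b = stable + [{"type": "text", "text": "Now: 2026-05-25T09:00:01Z"}]

print("stable prefix identical:", prefix_fingerprint(stable) == prefix_fingerprint(list(stable)))
print("clocked prefix identical:", prefix_fingerprint(with_clock_a) == prefix_fingerprint(with_clock_b))
```

Logging this fingerprint next to cache_read_input_tokens makes it easy to see whether a miss was caused by your own prefix drifting.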

Takeaway + reflection

💡 Caching is the simplest, most underused lever in prompt engineering: almost no design effort for a 50–90% cost reduction on the cached portion.
🤔 Across your AI toolchain, which prompts are "near-static + frequently reused"? Are they cached right now?

Context Compression

data flow · summarization

One-line analogy

Long-conversation context management = database log rotation + compaction. MySQL binlog doesn't grow forever — past a threshold, rotate, compress, archive. LLM context is identical: as conversations grow, compression is a hard requirement, not an option.

What it solves + how it works

Past 50K–100K tokens, three problems hit at once: window overflow (hard limit), attention dilution (lost-in-the-middle), linear cost growth per turn (all history recomputed). Four mainstream strategies, ordered by fidelity-vs-cost:

① Sliding Window — keep the last N turns. Cheapest but loses the most; fits short-horizon tasks with no long-term memory needs;
② Summarization — LLM-compress old messages into 200–500 token summaries, replace original text. The default choice; beware "summary of summaries" drift over many cycles;
③ Hierarchical — tiered summaries: last 5 turns full text + turns 5–20 grouped summary + 20+ one-line summary. LSM-tree shaped;
④ Selective / RAG-style — store conversation history in an external vector DB, retrieve relevant slices per turn. Most expensive, highest fidelity.
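Strategies ① and ③ are simple enough to sketch without a model call. The token estimate below (about 4 characters per token) is a crude assumption, not a tokenizer, and the function names are illustrative.

```python
def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)

def sliding_window(messages: list, budget: int) -> list:
    """Strategy 1: keep the newest messages that fit in the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = rough_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return kept[::-1]                       # restore chronological order

def tier_for(turns_ago: int) -> str:
    """Strategy 3: pick a fidelity tier by age, using the 5 / 20 turn cut-offs."""
    if turns_ago < 5:
        return "verbatim"
    if turns_ago < 20:
        return "group_summary"
    return "one_line_summary"

history = [{"role": "user", "content": ch * 40} for ch in "abc"]
print([m["content"][0] for m in sliding_window(history, budget=25)])
print([tier_for(age) for age in (0, 7, 42)])
```

The sliding window silently drops whatever falls off the end; the tiered version keeps something from every age band at decreasing resolution.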

Anthropic's 2025 Memory tool and automatic compaction ship this machinery built-in: context auto-summarizes near the limit without app-level code. But what to compress and at what granularity remains a product decision that no built-in default can make for you.

Code example

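from anthropic import Anthropic
client = Anthropic()
SUMMARY_MODEL = "claude-haiku-4-5-20251001"  # cheap model for summarization

def count_tokens(messages):
    # Server-side token count for a message list
    return client.messages.count_tokens(model=SUMMARY_MODEL, messages=messages).input_tokens

def compress_history(messages, max_tokens=8000):
    # Simple two-tier: keep the last 6 messages verbatim, summarize the rest.
    # Returns (system_text_or_None, messages_to_send).
    if len(messages) <= 6 or count_tokens(messages) < max_tokens:
        return None, messages
    cut = len(messages) - 6
    while cut < len(messages) - 1 and messages[cut]["role"] != "user":
        cut += 1  # the verbatim tail should start on a user message
    older, recent = messages[:cut], messages[cut:]

    older_text = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.messages.create(
        model=SUMMARY_MODEL,
        max_tokens=400,
        messages=[{"role": "user",
                   "content": f"Summarize as 5 bullets (key facts, decisions, open issues):\n{older_text}"}]
    ).content[0].text

    # The Messages API has no "system" role inside `messages`:
    # pass the summary through the top-level `system` parameter instead.
    return f"<earlier_conversation_summary>\n{summary}\n</earlier_conversation_summary>", recent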

Pitfall + scenario

"Compress everything, saving tokens always wins" — wrong. Crushing explicit user instructions ("never use library X"), critical numbers, or confirmed decisions into a 5-bullet summary almost always drops them. Rules: (1) keep user's literal preferences verbatim; (2) never compress numbers/code/IDs; (3) "discussed but rejected alternatives" are safe to compress. Compression is a lossy codec — know what you're willing to lose.

📌 BigCat scenario: a "long-term project notebook" with Claude — each week have the model summarize last week's chat into a 200-word "decisions + open questions + verbatim user preferences" block, archive raw messages to markdown. After a year the main window stays light, but the full history is traceable.
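The "keep verbatim" rules can be enforced before any summarizer runs by splitting history into pinned and compressible messages. The patterns below are deliberately naive, hypothetical heuristics (instruction words, any digit, inline code); a real system would need something more careful.

```python
import re

INSTRUCTION = re.compile(r"\b(never|always|must|do not|don't)\b", re.IGNORECASE)
EXACT_DATA = [
    re.compile(r"\d"),        # numbers, dates, IDs
    re.compile(r"`[^`]+`"),   # inline code
]

def split_for_compression(messages):
    """Return (pinned, compressible): pinned messages must survive verbatim."""
    pinned, compressible = [], []
    for msg in messages:
        text = msg["content"]
        is_user_rule = msg["role"] == "user" and bool(INSTRUCTION.search(text))
        has_exact_data = any(p.search(text) for p in EXACT_DATA)
        (pinned if is_user_rule or has_exact_data else compressible).append(msg)
    return pinned, compressible

history = [
    {"role": "user", "content": "Never use library X."},
    {"role": "assistant", "content": "We considered a queue but rejected it."},
    {"role": "user", "content": "Budget is 4200 USD."},
]
pinned, compressible = split_for_compression(history)
print(len(pinned), "pinned,", len(compressible), "compressible")
```

Only the compressible list goes to the summarizer; the pinned messages are re-attached unchanged.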

Takeaway + reflection

💡 Compression isn't a cost trick, it's a lossy-encoding decision — you must explicitly declare what to keep and what to drop.
🤔 In your long AI conversations, which info must be preserved verbatim, which can become a one-line summary? That list IS your "personal AI protocol".

Memory Management

architecture · state

One-line analogy

LLMs are stateless — every API call is a cold start. To make one "remember you", you must architect a memory hierarchy like an OS: CPU register (current context) → L1/L2 cache (cached prefix) → RAM (recent sessions) → disk (vector store + notes). "Memory management" is designing that hierarchy.

What it solves + how it works

Stuffing "every conversation from the last three months" into context is impossible (window overflow) and pointless (lost in the middle). Four memory types borrowed from cognitive science, each mapped to a storage strategy:

LLM memory hierarchy (with OS analogy)

① Working Memory ≈ CPU register — current context window content
② Procedural ≈ ROM / firmware — long-lived system prompts, workflows (cache-friendly)
③ Episodic ≈ RAM / SSD — past conversation history, time-indexed, retrieved on demand
④ Semantic ≈ knowledge base — facts distilled from conversation, stored in a vector DB ("user dislikes emoji", "project due 6/30")

Read/write path: each query → retrieve relevant slices of ③④ → inject into ① → model reasons → extract new facts → write back to ④
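The read/write path can be sketched end to end with a toy store. Keyword overlap stands in for vector similarity here, and the class and function names are hypothetical; the point is the retrieve, inject, write-back shape, not the retrieval quality.

```python
class TinyMemory:
    """Toy semantic store: keyword overlap stands in for vector similarity."""

    def __init__(self):
        self.facts = []

    def write(self, fact: str) -> None:
        if fact not in self.facts:          # naive exact-match dedup
            self.facts.append(fact)

    def retrieve(self, query: str, limit: int = 3) -> list:
        q = set(query.lower().split())
        scored = [(len(q & set(f.lower().split())), f) for f in self.facts]
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [f for _, f in hits[:limit]]

def build_context(memory: TinyMemory, query: str) -> str:
    """Retrieve relevant facts and inject them into the working context."""
    facts = "\n".join(memory.retrieve(query))
    return f"<known_about_user>\n{facts}\n</known_about_user>\n\n{query}"

mem = TinyMemory()
mem.write("kid loves blue")
mem.write("project due 6/30")
print(build_context(mem, "cake for my kid"))
mem.write("kid loves blue")                 # write-back of a known fact is a no-op
```

Only the matching fact reaches the prompt; the unrelated deadline stays in storage.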

Mainstream implementations: MemGPT / Letta use an OS-paging metaphor so the LLM decides what to swap in/out; mem0 / LangMem wrap episodic + semantic into SDKs; OpenAI ChatGPT "memory" and Anthropic's Memory tool are productized versions. Common pattern: extract → store → retrieve → inject.

Key engineering call: what is worth remembering? Not "everything said in chat"; instead, use an extraction LLM to judge "will this affect a future answer?" Mem0's paper (2025) reports that active extraction + dedup yields about 26% higher accuracy (relative, versus OpenAI's built-in memory) and about 91% lower p95 latency than feeding the full history as context.

Code example

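from anthropic import Anthropic
from mem0 import Memory  # pip install mem0ai

client = Anthropic()
m = Memory()  # default config; needs an LLM/embedding provider key (OpenAI unless configured otherwise)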

# 1) Write: mem0 uses an LLM to extract "worth-remembering" facts, dedup, store in vector DB
m.add("I'm allergic to peanuts; my kid loves blue", user_id="bigcat")
m.add("Last week we discussed Mamba — concluded SSMs suit long sequences", user_id="bigcat")

# 2) Retrieve: pull memories relevant to the current query (not entire history)
hits = m.search("birthday cake ideas for my kid", user_id="bigcat", limit=3)

# 3) Inject — only the relevant few
ctx = "\n".join(h["memory"] for h in hits["results"])
resp = client.messages.create(
    model="claude-opus-4-7", max_tokens=512,
    system=f"<known_about_user>\n{ctx}\n</known_about_user>",
    messages=[{"role": "user", "content": "Recommend a birthday cake recipe for Saturday"}])
# Intended effect: the model can avoid peanuts and suggest blue decorations

Pitfall + scenario

"More memory = smarter" — wrong. The dominant failure mode of memory systems is contamination: a wrong fact ("user hates Python" — really one frustrated comment) gets persisted and poisons all future answers. Production rules: (1) assign confidence scores (explicit statement > inference), (2) let users view + delete, (3) run periodic "memory audits" — an LLM checks each memory is still valid.

📌 BigCat scenario: build a "BigCat operating manual" memory — your work preferences (dark theme, terse output, no emoji), family facts (kid's age range, allergies, interests), preferred thinking frames (Buddhism / complexity science / distributed-systems analogies). Have every AI tool read this shared memory for a consistent experience across tools — far better than repeating yourself each session.
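The production rules against contamination map onto a small data model: a confidence gate on write, a soft-delete flag, and an audit log. A minimal sketch; the 0.7 threshold, field names, and class names are hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryItem:
    text: str
    source: str            # "explicit" (user said it) or "inferred" (model guessed)
    confidence: float
    active: bool = True    # soft-delete flag: never physically removed

@dataclass
class GovernedMemory:
    min_confidence: float = 0.7   # hypothetical threshold
    items: List[MemoryItem] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def add(self, text: str, source: str, confidence: float) -> bool:
        """Rule 1: only persist facts above the confidence threshold."""
        if confidence < self.min_confidence:
            self.audit_log.append(f"rejected: {text}")
            return False
        self.items.append(MemoryItem(text, source, confidence))
        self.audit_log.append(f"stored: {text}")
        return True

    def forget(self, text: str) -> None:
        """Rule 2: user-initiated delete, kept as an invalidated record."""
        for item in self.items:
            if item.text == text and item.active:
                item.active = False
                self.audit_log.append(f"invalidated: {text}")

    def visible(self) -> List[str]:
        return [i.text for i in self.items if i.active]

store = GovernedMemory()
store.add("allergic to peanuts", "explicit", 0.95)
store.add("user hates Python", "inferred", 0.30)   # one frustrated comment
print(store.visible(), store.audit_log)
```

The rejected inference never reaches storage, but the audit log still records that it was seen, which helps when debugging a surprising answer later.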

Takeaway + reflection

💡 Context is working set; Memory is persistent storage — conflating them is the most common architectural mistake in AI apps.
🤔 If "working with AI for a year" is the design of a stateful system, what does your "data schema" look like? Which fields need persistence, and which are just session state?

Deep Questions

1. How does "prompt engineering" (Day 3) relate to "context engineering" (today)? Is one an upgrade of the other, or are they different layers?

They are different layers of abstraction, not versions. Prompt engineering is about "for this single LLM call, what's the most effective prompt?" — CoT, few-shot, role play — fundamentally single-inference techniques. Context engineering is about "across many calls, across long time spans, what should the model see?" — caching strategy, compression strategy, memory architecture — fundamentally system-level data flow design. A great prompt engineer can be terrible at context architecture (writes systems that demo perfectly but fall apart in week 2 of usage), and vice versa. The relationship is "SQL query optimization" vs "database schema design" — one looks at queries, the other at schemas. BigCat, your distributed-systems background is a massive advantage on context engineering: cache coherence, data lifecycle, hot/cold tiering — all directly transferable. The 2024–2025 industry consensus: once models reach Claude 4.7 / GPT-5 quality, the bottleneck for app quality moves from model to context engineering. That's why "context engineering" became the term of 2025.

2. KV caches get evicted under GPU memory pressure. What's the same and different vs Redis LRU or OS page replacement? What should you consider when designing your own caching strategy?

Similarities: all are bounded storage that retains hot entries and evicts cold ones, the classic caching problem. Anthropic and OpenAI don't publish eviction details; LRU/LFU variants plus an explicit TTL are the usual industry pattern. Differences: (a) eviction granularity: Redis evicts whole keys, while a KV cache is keyed by prefix, so a partial mismatch invalidates everything after the first differing token; (b) asymmetric write cost: Redis reads and writes cost about the same, whereas prompt caching is +25% on write and −90% on read, so the read:write ratio dominates ROI; (c) externally opaque: you can't see the provider's GPU load, so two identical prompts can land as a hit and a miss back-to-back. Design moves: (1) prefer the 1-hour cache over the default 5-minute one when the prefix is reused more than once an hour but less often than every 5 minutes (more frequent hits keep the 5-minute entry alive on their own); (2) pre-warm off-peak with a no-op call; (3) monitor the cache_read / cache_creation token ratio: at the prices above, the write premium is only recovered once reads exceed roughly 0.3× writes (0.25 / 0.9), and a ratio that stays near that floor means most writes are never reused; (4) stabilize prefix boundaries: never embed wall-clock timestamps or random session IDs, which change the prefix and wreck the hash.
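Monitoring the read/write balance takes only the two cache counters from each response. The sketch below assumes you have copied those counters into plain dicts (the aggregation function is hypothetical) and uses the +25% / −90% prices quoted in this section; savings are expressed in base-price token equivalents.

```python
def summarize_cache_usage(usages, write_premium=0.25, read_discount=0.90):
    """Aggregate cache counters and estimate net savings vs. no caching."""
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    return {
        "read_tokens": reads,
        "write_tokens": writes,
        "read_write_ratio": reads / writes if writes else float("inf"),
        # Each read token saves 0.90 of base price; each write token costs 0.25 extra.
        "net_saved_token_equivalents": read_discount * reads - write_premium * writes,
    }

calls = [
    {"cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0},   # write
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000},   # hit
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000},   # hit
]
print(summarize_cache_usage(calls))
```

A negative net figure means writes are expiring before they are reused, which points at TTL choice or prefix instability rather than at caching itself.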

3. In 2026, when do you pick long context (1M–2M tokens) over RAG? When is long context "good enough" that RAG isn't needed?

When long context shipped in 2024 the industry chanted "RAG is dead"; 2025 reality slapped that down, and the two coexist in most apps. Decision axes: (a) knowledge stability: fixed corpora (one book, one contract) fit long context + caching; dynamic corpora (news, product docs, user notes) require RAG; (b) query diversity: queries tightly aligned with the doc ("what does clause N of this contract say") work in long context; divergent queries ("of all our customers last year, which look like A") need RAG, because "global aggregation" recall in long context is poor; (c) cost budget: long context bills the full input on every call, which a 100K-MAU app cannot afford, while RAG pulls 5–10K tokens per call, so cost grows with call volume rather than corpus size. Where you really don't need RAG: single-doc analysis (paper reader, contract review) + doc ≤ 500K tokens + low query frequency + cacheable prefix. Everything else is hybrid: RAG narrows to 50K–100K tokens of candidates, then long context does the final pass. BigCat, your "read paper / read filing" use case falls in the first group (long context alone); a "personal long-term knowledge base" falls in the second (hybrid). They need different architectures.
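The decision axes can be written down as a small rule of thumb. This is a sketch that encodes the heuristics from this answer, including the 500K-token cut-off; the function and parameter names are hypothetical, and real systems would weigh these factors rather than AND them together.

```python
def choose_architecture(corpus_is_stable: bool, queries_target_one_doc: bool,
                        doc_tokens: int, high_call_volume: bool) -> str:
    """Rule of thumb: long context alone only when every condition is favorable."""
    long_context_ok = (
        corpus_is_stable
        and queries_target_one_doc
        and doc_tokens <= 500_000
        and not high_call_volume
    )
    if long_context_ok:
        return "long context + prompt caching"
    return "hybrid: RAG narrows candidates, long context does the final pass"

# Reading one filing a few times a day:
print(choose_architecture(True, True, 120_000, False))
# Personal knowledge base that changes daily:
print(choose_architecture(False, False, 5_000_000, False))
```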

4. Memory systems introduce "AI memory contamination" — wrong facts persist and poison future answers. What engineering patterns transfer from "dirty-data governance" in databases?

Fundamentally the same problem: low-quality writes → persistent storage → poisoned downstream inference, the core challenge of any stateful system. Transferable patterns: (1) pre-write validation (schema check) — an independent LLM judges "is this fact asserted or inferred? user statement or AI hypothesis?", and only high-confidence items persist — DB constraint check; (2) soft delete + audit log — never DROP, mark invalidated, keep history so you can debug "why did the model suddenly get this wrong"; (3) periodic GC + consistency check — sweep memory with an LLM, flag contradictions ("user prefers dark mode" vs "user prefers light recently") and surface them; (4) user-visible + editable — every memory has a UI for view/delete/edit — this is both informed consent and dirty-data governance; (5) multi-source corroboration — high-stakes facts (medical, legal) require two independent confirmations before persisting. Mem0 / Letta partially do 1–3, but visibility and tiered confidence are widely under-built — BigCat, this is exactly what you should enforce when building your own "personal AI manual".
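Pattern (5), multi-source corroboration, is the easiest of these to prototype: hold a fact as pending until enough distinct sources have asserted it. A minimal sketch; the class name, the source identifiers, and the default of two sources are illustrative assumptions.

```python
from collections import defaultdict

class CorroborationGate:
    """Persist a high-stakes fact only after N distinct sources confirm it."""

    def __init__(self, required_sources: int = 2):
        self.required = required_sources
        self.pending = defaultdict(set)   # fact -> set of source ids seen so far
        self.persisted = set()

    def observe(self, fact: str, source_id: str) -> bool:
        """Record one sighting; return True once the fact is persisted."""
        self.pending[fact].add(source_id)
        if len(self.pending[fact]) >= self.required:
            self.persisted.add(fact)
        return fact in self.persisted

gate = CorroborationGate()
first = gate.observe("allergic to penicillin", "chat-session-1")
repeat = gate.observe("allergic to penicillin", "chat-session-1")  # same source again
second = gate.observe("allergic to penicillin", "uploaded-record")
print(first, repeat, second)
```

Repeating the claim within the same source does not count as corroboration; only a second, distinct source promotes the fact.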

5. If "working with AI" is cognitive outsourcing, which memories should you outsource to AI, and which must stay in your own head? What does this question mean for personal growth?

This is the deepest personal-level extension of context engineering. Safe to outsource: (a) factual memory (names, dates, config values) — AI beats you 100×, zero-loss outsource; (b) procedural memory (how to configure Kubernetes, how to write a PR template) — AI + docs is enough; occasional review keeps skills sharp; (c) infrequent specialist knowledge (legal clauses, tax workflow used yearly) — pure outsource. Must stay in your head: (i) core judgment — "what is good code" / "what is a good decision" — outsource it and you lose agency; (ii) emotional bonds — every detail of conversations with your kid, your loved ones' preferences — outsourcing the memory outsources the relationship; (iii) cross-domain intuition — your Buddhism × distributed systems × complexity science connections are combinatorial memory; outsource the parts and the connections vanish; (iv) embodied skills — anything that requires "learning by doing" — outsource it and you never learn it. The deeper point: context engineering isn't just technical, it forces you to answer "who am I?" — which memories make up the non-outsourceable part of you. BigCat, the heart of "AI super-individual" is knowing precisely: which AI capabilities augment you vs atrophy you. That line is a personal "context schema design" — only you can draw it.
