DAY 02 / PHASE 1 · ENGINEERING

Context Engineering

Lost-in-the-Middle · Information Layout · Chunk Design · Context Compaction

2026-05-22 · BigCat

"200K context" is not "200K of usable attention".

Foundation concepts → ai-ml-daily Day 1: LLM Basics (Attention, Tokenization, Positional Encoding)

// WHY THIS MATTERS

Every frontier lab is racing on context window size: Claude Sonnet 4.5 hits 1M, Gemini 2.5 hits 2M, GPT-5 caught up to 400K. So everyone starts stuffing whole codebases / company-wide docs / 30-turn chat histories into a single prompt, then gets confused: "Why can't the model answer? I literally put the answer in there." The answer is that a long context window is not the same as effective context. Three things get in the way: attention degrades on long sequences; where you place information in the prompt determines whether the model can use it; and how you chunk matters far more than which embedding model you pick. This issue covers four things: measuring the real usable length of long context, where to physically place information in a prompt, why 90% of RAG systems lose to their own chunking strategy, and how to compact long histories without losing the critical signal.

// 01

Lost-in-the-Middle: The Real Effective Length of Long Context

Claim: putting critical information in the middle of a prompt is burying it in the model's blind spot.

Background & Principles

Liu et al. 2023's Lost in the Middle: How Language Models Use Long Contexts (TACL 2024) is the foundational reference. They had models find an answer across a set of documents, placing the ground-truth document at positions 1, 5, 10, 15, 20 and plotting accuracy. The tested models (including GPT-3.5-Turbo, Claude 1.3, and LongChat) tended toward the same U-curve: accuracy peaks when the answer sits at the head or tail and sags in the middle. For GPT-3.5-Turbo the drop exceeded 20 points, and in the worst case it fell below the model's closed-book score (56.1%), i.e. worse than turning retrieval off entirely.

Two reasons. First, in pretraining, token dependencies are mostly local or anchored to the document head (causal masking biases attention toward recent tokens; RoPE-style positional encodings have a long-range decay that weakens attention between distant tokens). Second, post-training instruction data is mostly "short prompt → answer"; the model has seen few samples of "instruction at the tail, key facts in the middle".

In 2025 NoLiMa (Modarressi et al.) went further: when question and answer share little or no lexical overlap, most of the tested models, all advertised at 128K+ context, fall below half of their own short-context baseline at 32K. Chroma's 2025 Context Rot report tested 18 frontier models (Claude Sonnet 4, GPT-4.1, Gemini 2.5) and concluded the same: effective context is much shorter than advertised max context, and the gap grows with task difficulty.

Hands-on Example

Run a minimal NIAH (Needle in a Haystack) test to measure the real effective length of your go-to model:

import random

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
needle = "BigCat's meeting password is PURPLE-OWL-9182."

def haystack(n_tokens, needle_pos):
    # Pad to roughly n_tokens with Paul Graham essays (pg_essays.txt must exist locally)
    with open("pg_essays.txt", encoding="utf-8") as f:
        filler = f.read()
    chunks = filler.split("\n\n")
    random.shuffle(chunks)  # fresh filler order on every call
    text = "\n\n".join(chunks)[:n_tokens * 4]  # ~4 char/token approximation
    pos = int(len(text) * needle_pos)  # needle_pos is a 0..1 fraction of the haystack
    return text[:pos] + "\n\n" + needle + "\n\n" + text[pos:]

for length in [8000, 32000, 100000, 180000]:
    for pos in [0.1, 0.5, 0.9]:
        ctx = haystack(length, pos)
        r = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=128,
            messages=[{"role": "user", "content": ctx +
                "\n\nQ: What is BigCat's meeting password? Answer with the password only."}])
        print(length, pos, r.content[0].text.strip())

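You'll end up with a (length × position) table of answers. A single run per cell is only a pass/fail, so repeat each cell several times (the filler is reshuffled on every call) to turn it into a hit rate. Trust only the cells with ≥ 95% hit rate; beyond those, regardless of the marketed K, don't put critical information there.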
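To turn repeated runs into that hit table, something like the following could work. It is a minimal sketch: `hit_table` and `trusted_cells` are hypothetical helper names, the replies in `trials` are made-up stand-ins for real model output, and a "hit" is assumed to mean the reply contains the password string.

```python
from collections import defaultdict

NEEDLE_ANSWER = "PURPLE-OWL-9182"  # the secret embedded in the needle sentence

def hit_table(trials, answer=NEEDLE_ANSWER):
    """trials: iterable of (length, position, model_reply) tuples.
    Returns {(length, position): hit_rate}; a hit means the reply contains the answer."""
    hits, totals = defaultdict(int), defaultdict(int)
    for length, pos, reply in trials:
        totals[(length, pos)] += 1
        hits[(length, pos)] += int(answer in reply)
    return {cell: hits[cell] / totals[cell] for cell in totals}

def trusted_cells(table, min_rate=0.95):
    """Cells whose hit rate meets the bar; the rest are off-limits for critical info."""
    return sorted(cell for cell, rate in table.items() if rate >= min_rate)

# Made-up replies standing in for repeated runs of the NIAH loop
trials = [
    (8000, 0.5, "PURPLE-OWL-9182"),
    (8000, 0.5, "PURPLE-OWL-9182"),
    (100000, 0.5, "PURPLE-OWL-9182"),
    (100000, 0.5, "I could not find a password."),
]
table = hit_table(trials)
print(table)
print(trusted_cells(table))
```

With only two trials per cell the rates are coarse; in practice you would want enough repetitions per cell for a 95% bar to mean something.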

Failure modes: many teams measure "100% pass at 1M context" on a single-needle NIAH and confidently stuff multiple contracts, emails, and chat logs into a prompt. NIAH only tests literal lookup, while real tasks require multi-hop reasoning + integration — 1–2 orders of magnitude harder. BABILong, RULER, and LongBench v2 are far closer to reality.
Going deeper · Liu et al. 2023, arxiv.org/abs/2307.03172 · Chroma 2025 Context Rot, research.trychroma.com/context-rot · Greg Kamradt's original NIAH repo, github.com/gkamradt/LLMTest_NeedleInAHaystack

// 02

Information Layout Principles: U-Shape + Anchor + Restate

Claim: a prompt is not a bag of tokens. It's an attention pipeline with a gradient.

Background & Principles

Since head and tail carry the highest attention weight, your prompt's physical layout should mirror that importance gradient. Anthropic's official long-context guidance (claude.com/docs · Long context tips) is blunt: put long reference documents at the very top of the prompt and the question at the end of the user turn. Anthropic reports that in its own tests, placing the query at the end can improve response quality by up to 30%, especially with complex, multi-document inputs.

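Three concrete moves, the ones named in this section's title (U-shape, anchor, restate), as laid out in the diagram below: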

┌──────────────── Prompt physical layout ─────────────────┐
│  ↑ HIGH attention                                       │
│  ┌─────────────────────────────────────────────────┐    │
│  │ SYSTEM: role + guardrails + tools schema        │    │ ← cache-friendly
│  ├─────────────────────────────────────────────────┤    │
│  │ REFERENCE DOCS (large, stable per task)         │    │ ← put long stuff here
│  │   <doc id="1">...</doc>                         │    │
│  │   <doc id="2">...</doc>                         │    │
│  ├─────────────────────────────────────────────────┤    │
│  │   … mid-context attention sag …                 │    │ ← never put critical info here
│  ├─────────────────────────────────────────────────┤    │
│  │ CRITICAL CONSTRAINTS (restate)                  │    │
│  │ CURRENT QUESTION (specific, last)               │    │ ← what the model attends to first
│  └─────────────────────────────────────────────────┘    │
│  ↑ HIGH attention                                       │
└─────────────────────────────────────────────────────────┘
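One way to keep this layout from drifting is to assemble the user turn in code rather than by hand. The helper below is a minimal sketch, not an Anthropic API: `build_prompt` and its parameters are hypothetical names, and the tag structure simply mirrors the `<documents>` template used in this section's hands-on example.

```python
def build_prompt(documents, constraints, question):
    """documents: list of (source, text) pairs.
    Returns one user-turn string: reference docs first,
    restated constraints next, the question last."""
    parts = ["<documents>"]
    for i, (source, text) in enumerate(documents, start=1):
        parts.append(f'  <document index="{i}">')
        parts.append(f"    <source>{source}</source>")
        parts.append(f"    <content>{text}</content>")
        parts.append("  </document>")
    parts.append("</documents>")
    parts.append("")
    parts.append("Instructions (restated):")
    parts.extend(f"- {c}" for c in constraints)
    parts.append("")
    parts.append("Question:")
    parts.append(question)  # the question is always the final line
    return "\n".join(parts)

prompt = build_prompt(
    documents=[("contracts/acme-2024.pdf", "Clause 12: either party may terminate...")],
    constraints=["Cite only clauses explicitly written in the documents above."],
    question="Can we unilaterally terminate the Acme contract at month 18?",
)
print(prompt)
```

Because the ordering lives in one function, a later edit cannot quietly move the question above the documents.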

Hands-on Example

<documents>
  <document index="1">
    <source>contracts/acme-2024.pdf</source>
    <content>...8K tokens...</content>
  </document>
  <document index="2">...</document>
</documents>

Instructions (restated):
- Cite only clauses explicitly written in the documents above. Do not extrapolate.
- First list the verbatim passages you will cite, then give the conclusion.

Question:
Can we unilaterally terminate the Acme contract at month 18? Walk through the basis clause by clause.

Failure modes: in a long-running chat, dropping a new instruction into system and expecting the model to "re-read the system prompt". The model does not "re-read" anything; it just runs attention over every token once per call. Instructions need to be restated in the most recent user turn to actually land.
Going deeper · Anthropic Long context tips, docs.claude.com/.../long-context-tips · Lilian Weng's working-memory section in LLM Powered Autonomous Agents, lilianweng.github.io/posts/2023-06-23-agent

// 03

RAG's Real Bottleneck: Chunk Design Matters Way More Than the Embedding Model

Claim: 90% of RAG systems are broken — and it's not the embedding, it's how you cut up the document.

Background & Principles

Standard team workflow when shipping RAG: fixed 512-token chunk + 50 overlap + OpenAI embedding + simple top-k. Quality is bad, so they swap embedding models, then swap vector DBs. Same outcome. Why:

The three things that actually determine your RAG ceiling: chunk boundaries (semantic), chunk context (contextual prefix), and retrieval fusion (hybrid + rerank). Embedding model choice, among reasonable options, is a ±5% knob.

Hands-on Example

Anthropic's recommended Contextual Retrieval flow (drop-in for your own corpus):

# Step 1: generate context for each chunk (batch with Claude Haiku + prompt caching)
context_prompt = """
<document>{whole_doc}</document>
Here is the chunk we want to situate within the whole document:
<chunk>{chunk}</chunk>
Please give a short succinct context to situate this chunk within
the overall document for improving search retrieval. Answer only
with the context, nothing else.
"""

# Step 2: prepend context to chunk, then build embedding + BM25 indexes
indexed_chunk = f"{generated_context}\n\n{original_chunk}"

# Step 3: at query time, hybrid (dense + BM25), top-150 → rerank to top-20
dense_hits = vector_store.search(query, k=150)
bm25_hits  = bm25_index.search(query, k=150)
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
top20 = cohere_rerank(query, fused[:150], top_n=20)
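The `reciprocal_rank_fusion` call in step 3 is not defined in the flow above. A minimal sketch of the usual formulation follows, assuming each retriever returns a list of chunk ids ordered best-first; the chunk ids are made up, and k=60 is the conventional smoothing constant rather than anything tuned for this corpus.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of chunk ids (best first).
    Each list contributes 1 / (k + rank) per id; ids are returned by total score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the order is deterministic
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Made-up result lists standing in for the dense and BM25 retrievers
dense_ids = ["c3", "c1", "c7"]
bm25_ids = ["c1", "c9", "c3"]
fused_ids = reciprocal_rank_fusion([dense_ids, bm25_ids])
print(fused_ids)  # → ['c1', 'c3', 'c9', 'c7']
```

Chunks found by both retrievers rise to the top even when neither ranked them first, which is the reason for fusing before the reranker sees anything.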

Critical engineering detail: in step 1, {whole_doc} forms the same cached prefix across all chunks of that document; with prompt caching enabled, the repeated document tokens are billed at about 1/10 of the normal input price. Anthropic's own number: generating the contextualized chunks is a one-time cost of about $1.02 per million document tokens, which is cheaper than swapping embedding models and significantly more effective.
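A sketch of how the step 1 request could be shaped so the document prefix is cacheable. `contextualize_request` and `CHUNK_PROMPT` are hypothetical names and the model id is a placeholder; the `cache_control` marker on a system text block is Anthropic's prompt-caching mechanism, which only takes effect above a model-specific minimum prompt length. The function only builds the request arguments, so it runs without an API key.

```python
# Per-chunk instruction; wording follows the context prompt in step 1
CHUNK_PROMPT = (
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>{chunk}</chunk>\n"
    "Please give a short succinct context to situate this chunk within "
    "the overall document for improving search retrieval. Answer only "
    "with the context, nothing else."
)

def contextualize_request(whole_doc, chunk, model):
    """Build keyword arguments for client.messages.create().
    The document comes first and is marked cacheable; only the chunk varies."""
    return {
        "model": model,
        "max_tokens": 200,  # assumed cap for a 1-2 sentence context
        "system": [{
            "type": "text",
            "text": f"<document>{whole_doc}</document>",
            "cache_control": {"type": "ephemeral"},  # cache the prefix up to here
        }],
        "messages": [{"role": "user", "content": CHUNK_PROMPT.format(chunk=chunk)}],
    }

requests = [
    contextualize_request("FULL DOCUMENT TEXT", chunk, model="your-haiku-model-id")
    for chunk in ["chunk A", "chunk B"]
]
# With the SDK installed and a key set: client.messages.create(**requests[0])
```

Every request for the same document shares a byte-identical system block, which is what lets the second and later chunks read the document from cache.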

Failure modes: treating contextual retrieval as a silver bullet. It helps a lot on chunks that need external information to disambiguate; it does almost nothing for chunks that read independently (e.g. per-endpoint API docs) and just adds cost. Run a baseline eval first, then decide which corpora to apply it to.
Going deeper · Anthropic Introducing Contextual Retrieval 2024, anthropic.com/news/contextual-retrieval · LlamaIndex chunking experiments, llamaindex.ai/blog · Jina AI Late Chunking, arxiv.org/abs/2409.04701

// 04

Context Compaction: Controlled Compression of Long Conversations

Claim: long sessions aren't a "window overflowed" problem — they're a "signal diluted by noise" problem.

Background & Principles

Agents and long chats often see accuracy fall off a cliff around turn 50. The cause isn't always token overflow — often you're still well under max context, but the model is already confused. This is context rot: relevant signal diluted by irrelevant history, combined with lost-in-the-middle, pushes attention onto irrelevant tokens.

Four mainstream strategies for long history; the engineering choice maps to different trade-offs:

Hands-on Example

Minimal compaction loop for a long-running coding agent:

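import anthropic

client = anthropic.Anthropic()

def count_tokens(history):
    # Assumed helper: rough ~4 chars/token estimate; swap in a real tokenizer
    return sum(len(str(m["content"])) for m in history) // 4

def serialize(turns):
    # Assumed helper: flatten turns into "role: content" lines
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in turns)

def maybe_compact(history, model="claude-sonnet-4-6", threshold=120_000):
    if count_tokens(history) < threshold:
        return history
    # Keep the last 6 messages verbatim; compress everything before into a structured summary
    recent, old = history[-6:], history[:-6]
    summary = client.messages.create(model=model, max_tokens=2000,
        system="You compress agent histories. Output strict JSON.",
        messages=[{"role":"user","content": f"""
Compress the following turns into JSON with keys:
- goal: the user's overall objective
- decisions: list of decisions made (with rationale)
- artifacts: files/code created or modified (paths only)
- open_questions: anything explicitly deferred
- forbidden: constraints user said NOT to do
<turns>{serialize(old)}</turns>
"""}]).content[0].text
    # The Messages API accepts only "user"/"assistant" roles inside messages, so the
    # summary rides in a user message (or pass it via the top-level system parameter)
    return [{"role":"user","content":f"<prior_session>{summary}</prior_session>"}] + recent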

Key lesson: the compression schema matters more than the compression algorithm. Forcing the model to emit fixed keys (goal / decisions / artifacts / open_questions / forbidden) lets the next turn directly reference <prior_session>.forbidden — far more robust than free-form summarization.

Failure modes: (1) summarizing code snippets along with everything else — next turn the agent has to re-fetch the original, costing more tokens, not fewer. Rule: preserve code / file paths / exact identifiers verbatim; only compress natural language. (2) Compacting every turn — the cache shatters every time and you spend more, not less. Use hysteresis: trigger at 120K, drop to 60K, don't re-trigger until you cross 120K again.
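The hysteresis rule in (2) can be sketched as a split-point calculation: do nothing below the high-water mark, and once it is crossed, summarize enough old turns that the history lands at or under the low-water mark. `split_for_compaction`, `summary_budget`, and `min_recent` are hypothetical names, and the per-turn token counts are made up.

```python
def split_for_compaction(turn_tokens, high=120_000, low=60_000,
                         summary_budget=2_000, min_recent=6):
    """turn_tokens: token count per turn, oldest first.
    Returns i such that turns[:i] get summarized and turns[i:] stay verbatim.
    Returns 0 when no compaction is needed (or nothing is old enough to fold)."""
    if sum(turn_tokens) < high:
        return 0  # below the trigger: leave history (and the prompt cache) alone
    keep_budget = low - summary_budget  # room left for verbatim turns
    kept, i = 0, len(turn_tokens)
    # Walk back from the newest turn: always keep min_recent turns,
    # then keep older ones only while they still fit in the budget.
    while i > 0 and (len(turn_tokens) - i < min_recent
                     or kept + turn_tokens[i - 1] <= keep_budget):
        kept += turn_tokens[i - 1]
        i -= 1
    return i

quiet = split_for_compaction([5_000] * 20)   # 100K total: under the trigger
busy = split_for_compaction([5_000] * 26)    # 130K total: compact
print(quiet, busy)  # → 0 15
```

In the second case the 11 newest turns (55K) survive verbatim next to a 2K summary, so the session has roughly 60K of headroom before the trigger fires again.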
Going deeper · Anthropic Effective context engineering for AI agents engineering blog, anthropic.com/engineering/effective-context-engineering-for-ai-agents · MemGPT paper, arxiv.org/abs/2310.08560 · Simon Willison Context engineering tag, simonwillison.net/tags/context-engineering

// Putting it together · End-to-end personal RAG rebuild

Wire the four ideas above into a weekend project: use your own Notion / Obsidian notes as corpus, build a personal knowledge agent you'll actually use daily.

  1. Chunk: split by Markdown headers (NOT fixed size). Each ##-and-above section is one chunk, with frontmatter + breadcrumb (file path + heading chain).
  2. Contextualize: use Haiku 4.5 to generate a 1–2 sentence context prefix per chunk (cache the whole document as prefix; the per-chunk incremental token cost is tiny).
  3. Retrieve: dense (voyage-3-large or BGE-M3) + BM25 → RRF → Cohere rerank.
  4. Assemble: top-K documents at the front of <documents>; instructions + question at the very tail, with a tail-end restate of "cite only content explicitly present in the documents".
  5. Long sessions: trigger summarization compaction past 80K; keep the most recent 6 turns + structured summary. Code blocks are never compressed.
  6. Eval: write 30 questions with ground-truth answers yourself (you wrote the notes, you're the oracle); track hit rate and citation accuracy. Next issue covers how to automate this eval.
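Step 1 can be sketched in a few lines. This is a minimal illustration, not a full Markdown parser: `chunk_markdown` is a hypothetical helper, it splits on `#` and `##` headings only, skips heading-like lines inside fenced code, and leaves frontmatter handling out.

```python
import re

HEADING = re.compile(r"^(#{1,2})\s+(.*)$")  # level-1 and level-2 headings only
FENCE = "`" * 3                             # Markdown code-fence marker

def chunk_markdown(path, text):
    """Split on # / ## headings and prefix each chunk with a breadcrumb
    (file path + heading chain). Deeper headings stay inside their parent chunk."""
    chunks, chain, body = [], [], []

    def flush():
        content = "\n".join(body).strip()
        if content:
            crumb = " > ".join([path] + chain)
            chunks.append(f"[{crumb}]\n{content}")
        body.clear()

    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                      # close the previous section first
            level = len(match.group(1))
            del chain[level - 1:]        # drop headings at this level or deeper
            chain.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

note = "# Project\nintro\n## Setup\nstep one\n### Detail\nmore\n## Usage\nrun it\n"
for chunk in chunk_markdown("notes/p.md", note):
    print(chunk, end="\n---\n")
```

The breadcrumb line doubles as a cheap context prefix: even before step 2 runs, each chunk carries the file and section it came from.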

Once this is running you'll see that chunking + contextualization + rerank lifts quality more than swapping your embedding model five times.

// Deep Thinking

Is lost-in-the-middle an inherent flaw of the transformer architecture or a training-data bias? Will future models fix it?
Both. Architecturally: positional encoding precision degrades on long context, and attention gets diluted in the middle. Training-wise: most instruction-tuning data is short, so the model never learned that "the middle still matters". Newer models (GPT-4o, Claude 3.5 at 200K) improve, partly through long-context training, but the gap has not closed: Liu et al. 2023 measured drops of 20+ points on earlier models, and the NoLiMa and Context Rot evaluations cited in section 01 still find large gaps on harder tasks. Not disappearing soon.
For RAG chunking, fixed-character vs. semantic-boundary cutting — how big is the gap, and why do frameworks default to the former?
Semantic boundary typically recalls 10–30% better — embedding models are trained on semantic units; half-sentence embeddings are noise. Frameworks default to fixed-character because it's simple and doesn't require document parsing. Production should be hybrid: cut by paragraph/heading, fall back to max_tokens.
U-shape layout puts critical info at head and tail. If you only have one critical thing, head or tail?
Tail. Recency bias is stronger (attention weight to the most recent tokens is the highest), and on most tasks "say it again at the end" beats "say it once at the start". Exception: role/persona must go at the head (it conditions the entire output style); the task instruction goes at the tail. That's why standard system-prompt templates put role on top, user query at the bottom.
Context compaction via LLM summarization is standard practice, but it hallucinates. How do you compress without distortion?
Three mechanisms: (1) structured compression (force JSON, not free prose; constrain schema); (2) two-pass verification (after compaction, a separate prompt verifies "is X actually in the original"); (3) retain references (compacted content cites the original chunk id; fetch on demand). Claude Code's conversation compaction uses (1) and (2).
Should chunks be 200 tokens or 800? What are the variables?
Three variables: (1) question granularity (detail questions want small chunks, macro want large); (2) embedding-model behavior on long inputs (OpenAI text-embedding-3 accepts up to 8,191 tokens, but one vector for a very long chunk averages several topics together and blurs the match; training lengths are not published, so measure on your own corpus); (3) retrieval k (with k=1, large chunks preserve info; with k=10, small chunks preserve diversity). Production sweet spot is 200–500 tokens + ~50 overlap.
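The hybrid rule from the chunking answer above (cut by paragraph, fall back to a size cap) combined with this size range can be sketched as follows. `pack_paragraphs` and `split_long` are hypothetical names, and tokens are approximated by whitespace-separated words, which is an assumption; swap in a real tokenizer for actual budgets.

```python
def split_long(words, max_tokens, overlap):
    """Fallback for an oversized paragraph: sliding windows that share `overlap` words."""
    step = max_tokens - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(words[start:start + max_tokens])
        if start + max_tokens >= len(words):
            break
    return pieces

def pack_paragraphs(text, max_tokens=500, overlap=50):
    """Greedily pack whole paragraphs into chunks of at most max_tokens words.
    Only paragraphs that exceed the cap on their own are cut mid-paragraph."""
    chunks, current = [], []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        pieces = split_long(words, max_tokens, overlap) if len(words) > max_tokens else [words]
        for piece in pieces:
            if current and len(current) + len(piece) > max_tokens:
                chunks.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "a1 a2 a3\n\nb1 b2 b3\n\n" + " ".join(f"c{i}" for i in range(1, 11))
packed = pack_paragraphs(sample, max_tokens=6, overlap=2)
print(packed)
```

Overlap is applied only inside the fallback windows here; paragraph boundaries are treated as clean cuts that need none.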

// Further Reading