DAY 21 / PHASE 2 · SYSTEMS

AI for Research

Deep Research Loop · Research Harness · Source Curation · Citation Tracing

2026-06-04 · BigCat

Depth doesn't come from one big retrieval — it comes from "read, then search again".

// WHY THIS MATTERS

By 2026, "Deep Research" has gone from a feature name to a class of systems: OpenAI, Gemini, Perplexity, and Anthropic each ship one. Yet most people treat it as a "stronger search box" — and get back a report that reads like an expert wrote it, with citations that are half wrong when you check. This issue isn't about "what Deep Research is." It's about using it as an engineering system, and building your own research agent. Four points: (1) why Deep Research is fundamentally a search-read-reflect agent loop, not RAG, and when it's overkill; (2) building a research harness with an orchestrator + fan-out subagents, and the boundary set by its 15× token cost; (3) source curation — found ≠ trustworthy; (4) citation tracing and adversarial verification — the most dangerous failure of a research agent is to confidently synthesize a conclusion with no source backing.

// 01

Deep Research is an agent loop, not RAG

Claim: depth comes from iteration — "generate the next query from what you just read" — which no single big retrieval can reach.

Background & Principle

Plain RAG is one-shot: embed query → retrieve top-k → stuff into context → generate. Its ceiling is "did you get the keywords right the first time." Deep Research is different — a multi-turn agentic loop: plan → search → read → search again based on what you just learned → … → synthesize. OpenAI states Deep Research was trained with end-to-end RL on browsing + reasoning tasks; the model learned to plan, backtrack, and react to real-time information. Gemini Deep Research describes it almost identically — "browses the way you do: searching, finding interesting pieces, then starting a new search based on what it learned." The key difference is that queries are generated dynamically: you don't know what to search at first; only after reading the first batch do you know what to ask next. RAG can't do this — its query is frozen the moment it retrieves.

The cost is real: a Deep Research report typically runs 5–30 minutes and burns far more tokens than a normal chat. So the decision isn't "can it," it's "is it worth it."

RAG (one-shot)            Deep Research (iterative agent loop)
┌──────────┐              ┌─────────────────────────────────────┐
│ query    │              │ plan ─▶ search ─▶ read ─┐           │
│    ▼     │              │   ▲                     │ learned   │
│ retrieve │              │   └──── new query ◀─────┘ something │
│    ▼     │              │         ▼ (enough?)                 │
│ generate │              │ synthesize + cite                   │
└──────────┘              └─────────────────────────────────────┘
depth = recall quality    depth = #iterations × per-round curation
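The contrast in the diagram can be sketched on a toy corpus. Everything here is an illustrative assumption: the three fictional pages, the substring-matching `search`, and the "follow newly seen proper nouns" rule, which stands in for an LLM generating the next query.

```python
import re

# Fictional pages keyed by topic; a stand-in for the open web.
PAGES = {
    "acme": "Acme builds batteries. Its main supplier is Borealis.",
    "borealis": "Borealis mines ore in Zedland.",
    "zedland": "Zedland restricts ore exports.",
}


def search(query: str) -> list[str]:
    """Toy search: return pages whose key appears in the query."""
    q = query.lower()
    return [text for key, text in PAGES.items() if key in q]


def one_shot(query: str) -> list[str]:
    """RAG-style: the query is frozen, so only direct hits are ever seen."""
    return search(query)


def iterative(query: str, max_iters: int = 8) -> list[str]:
    """Agent-style: each round derives new queries from what was just read."""
    notes: list[str] = []
    seen: set[str] = set()
    frontier = [query]
    for _ in range(max_iters):
        if not frontier:
            break  # nothing new to ask
        q = frontier.pop(0)
        for text in search(q):
            if text in notes:
                continue
            notes.append(text)
            # Stand-in for next_query(): follow proper nouns not seen before.
            for name in re.findall(r"\b[A-Z][a-z]+\b", text):
                if name.lower() not in seen:
                    seen.add(name.lower())
                    frontier.append(name)
    return notes
```

Asked about "risks for Acme", the one-shot path only ever sees the Acme page. The loop reaches the Zedland export restriction because it learned the supplier's name from the first page and searched again.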

Hands-on

A selection checklist — match the task; don't default to Deep Research:

# A single search is enough (seconds)
- Fact lookups: "latest version / release date of X"
- Known keywords, answer on a single page

# RAG (you have a private corpus)
- Q&A over a fixed doc set (internal wiki, contracts, papers)
- One query hits the relevant chunk

# Deep Research (minutes, worth the wait)
- Open-ended, needs cross-source synthesis: "compare X across A/B/C"
- Answer requires "read a batch first to know what to search next"
- Output valuable enough to justify 15× cost + tens of minutes

The test is "what to search next depends on what you just read" — only then go Deep Research; otherwise a cheaper path is just as accurate and far faster.
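The checklist can be written down as a small triage function. This is only a sketch: the `Task` fields and the returned labels are hypothetical names chosen for illustration, and the booleans still have to come from your own judgment.

```python
from dataclasses import dataclass


@dataclass
class Task:
    next_query_depends_on_reading: bool   # the test stated above
    worth_minutes_and_tokens: bool = False
    has_private_corpus: bool = False


def choose_path(task: Task) -> str:
    """Return the cheapest path that fits the task."""
    if task.next_query_depends_on_reading and task.worth_minutes_and_tokens:
        return "deep_research"
    if task.has_private_corpus:
        return "rag"
    return "single_search"
```

The order matters: Deep Research is chosen only when both the iteration test and the value test pass, and everything else falls through to a cheaper path.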

Failure mode: using Deep Research for fact lookups. Ask "what's stock X right now" and it'll run 10 minutes synthesizing secondhand takes — worse than one search that grabs first-party data. The iterative loop pays off only when synthesis is needed; for fact lookups it's pure waste plus secondhand noise.

Resources · OpenAI Introducing deep research, openai.com/index/introducing-deep-research · Gemini Deep Research, gemini.google/overview/deep-research

// 02

Build a research harness: orchestrator + fan-out subagents

Claim: research is a natural breadth-first, parallelizable multi-agent case — but 15× token cost draws its boundary.

Background & Principle

Anthropic's How we built our multi-agent research system gives a copyable architecture: a Lead Researcher (orchestrator) splits the question into subdirections and spawns multiple subagents in parallel; each searches one subdirection in its own context window, then returns a summary to the lead for synthesis. They report this beats a single agent by 90%+ on internal research evals, because spreading reasoning across independent contexts sidesteps a single agent's context ceiling. The same post pours cold water too: multi-agent burns roughly 15× the tokens of a normal chat, so it is worthwhile only when the outcome's value far exceeds the cost and the task splits in parallel. This connects to Day 13's "coordination tax": research is one of the rare cases where coordination gains > coordination cost, because subdirections are weakly dependent (searching company A and searching company B don't block each other).
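The 15× boundary is easy to turn into arithmetic. In this sketch only the multiplier comes from the text above; the token price, the chat baseline, and the 10× safety margin are made-up assumptions to replace with your own numbers.

```python
MULTI_AGENT_MULTIPLIER = 15  # the "roughly 15x a normal chat" figure quoted above


def multi_agent_cost(chat_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimated USD cost of a multi-agent run, scaled from a chat baseline."""
    return chat_tokens * MULTI_AGENT_MULTIPLIER * usd_per_million_tokens / 1_000_000


def worth_fanning_out(outcome_value_usd: float,
                      chat_tokens: int,
                      usd_per_million_tokens: float,
                      parallelizable: bool,
                      margin: float = 10.0) -> bool:
    """Fan out only if the task splits AND value exceeds cost by `margin` times."""
    if not parallelizable:
        return False
    cost = multi_agent_cost(chat_tokens, usd_per_million_tokens)
    return outcome_value_usd >= margin * cost
```

With a hypothetical 20,000-token chat baseline at 10 USD per million tokens, the multi-agent estimate is 3 USD, so under a 10× margin the answer needs to be worth at least 30 USD to you.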

Hands-on

A minimal research-harness skeleton. The point: the orchestrator only plans and synthesizes and never reads raw pages; each subagent searches and reads one direction in an isolated context:

def research(question):
    # 1) ORCHESTRATOR: split into weakly-dependent subdirections
    subqs = plan(question)        # LLM emits 3-5 orthogonal subquestions
    # 2) FAN-OUT: one isolated-context subagent per direction (parallel)
    findings = parallel_map(subagent, subqs)
    # 3) SYNTHESIZE: lead sees only summaries + cites, never raw pages
    return synthesize(question, findings)

def subagent(subq, max_iters=8):
    notes = []
    for _ in range(max_iters):
        q = next_query(subq, notes)      # dynamically generate next query
        hits = web_search(q)
        for h in rank_sources(hits)[:3]:   # see §3: curate first
            notes.append(extract(fetch(h.url), keep_quote=True))
        if enough(subq, notes):          # loop control, don't search forever
            break
    return compress(notes)   # to {claim, url, quote}, drop raw text

Four keys: orthogonal subquestions (overlap = wasted parallelism), context isolation (a subagent's web noise doesn't pollute the lead), a max_iters per subagent (prevents infinite searching), and return {claim, url, quote}, not raw text (leaves anchors for §4's citation tracing).
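The skeleton leaves `parallel_map` undefined. One possible stand-in, assuming subagents are I/O-bound (network calls to search and model APIs), is a thread pool from the standard library. The `max_workers` default is an arbitrary choice, not a recommendation from the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(fn: Callable[[T], R], items: list[T], max_workers: int = 5) -> list[R]:
    """Run fn over items concurrently; results come back in input order."""
    if not items:
        return []
    workers = min(max_workers, len(items))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order and re-raises the first
        # exception from a worker when its result is consumed.
        return list(pool.map(fn, items))
```

Because each call to `fn` builds its own `notes` list, threads share nothing, which is the code-level version of context isolation. If one failed subagent should not sink the whole run, wrap `fn` in a try/except that returns an error record instead.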

Failure mode: Anthropic's real pitfalls — early agents spawned 50 subagents for a simple query, scoured the web for nonexistent sources, and distracted each other with excessive updates. The root cause was the orchestrator prompt not giving a clear budget ("how many, when to stop"). The main tuning lever for a research harness is the orchestrator prompt, not the model.
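A budget in the prompt is a request; a budget in code is a guarantee. The sketch below pairs the two. All class and method names are hypothetical, and the default caps simply echo the "3-5 subquestions, max_iters=8" numbers used in the skeleton above.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the harness tries to exceed a hard cap."""


@dataclass
class SearchMeter:
    """Counts searches for one subagent."""
    limit: int
    used: int = 0

    def charge(self) -> None:
        if self.used >= self.limit:
            raise BudgetExceeded(f"search limit {self.limit} reached")
        self.used += 1


@dataclass
class Budget:
    """Run-wide caps, mirrored into the orchestrator prompt."""
    max_subagents: int = 5
    max_searches_each: int = 8
    spawned: int = 0

    def spawn(self) -> SearchMeter:
        if self.spawned >= self.max_subagents:
            raise BudgetExceeded(f"subagent limit {self.max_subagents} reached")
        self.spawned += 1
        return SearchMeter(self.max_searches_each)

    def prompt_clause(self) -> str:
        """Sentence to paste into the orchestrator prompt."""
        return (f"Spawn at most {self.max_subagents} subagents; "
                f"each may run at most {self.max_searches_each} searches.")
```

Call `budget.spawn()` before starting each subagent and `meter.charge()` before each `web_search`, so a runaway plan fails loudly instead of silently burning tokens.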

Resources · Anthropic How we built our multi-agent research system, anthropic.com/engineering/multi-agent-research-system · Simon Willison's notes, simonwillison.net/2025/Jun/14

// 03

Source curation: found ≠ trustworthy

Claim: the bottleneck on research-agent quality is rarely "searched enough" — it's "did you curate sources before reading."

Background & Principle

web_search returns results ranked by relevance, not credibility. SEO farms, AI-generated rehashes, and stale content mix in with first-party authoritative sources in the top-k. If a subagent blindly fetches the top few, it feeds noise-as-fact into the synthesis layer — and an LLM won't spontaneously question source authority; it assumes whatever you give it is trustworthy. So curation must be an explicit harness step that happens before fetch. Usable signals: primary vs secondary (official docs/original papers > blog retellings > aggregators), domain authority (official domains, arXiv, known institutions > content farms), recency (down-weight old pages for fast-moving topics), and dedup (many sites copying one press release is essentially a single source — don't count it as "cross-source confirmation"). This step is also the prerequisite for §4's cross-source corroboration — you must first know which sources are genuinely independent.

Hands-on

A lightweight source ranking, inserted between search and fetch:

def rank_sources(hits, topic=None):
    # Keys are class labels as returned by domain_class(), not raw hostnames
    AUTHORITY = {"arxiv.org": 3, ".gov": 3, ".edu": 2,
                 "official_docs": 3, "content_farm": -2}
    def score(h):
        s = AUTHORITY.get(domain_class(h.url), 0)   # unknown classes score 0
        s += 2 if is_primary(h) else 0              # weight primary sources
        s -= recency_penalty(h.date, topic)         # down-weight stale for fast topics
        return s
    uniq = dedup_by_content(hits)   # merge copies: don't count reprints as independent
    return sorted(uniq, key=score, reverse=True)

An even simpler version: state it directly in the subagent's prompt — "prefer primary sources; trace secondhand retellings back to the original, and tag single-source key conclusions as [needs corroboration]." Front-loading curation is far cheaper than fixing it afterward.
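Two of the ranking helpers are simple enough to sketch with the standard library. This is one possible shape, not the author's implementation: the class labels, the content-farm set, the one-point-per-year penalty, and the cap of 3 are all assumptions, and the signature of `recency_penalty` here takes an explicit `fast_moving` flag rather than a topic.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlparse

CONTENT_FARMS = {"example-farm.test"}  # hypothetical blocklist, fill in your own


def domain_class(url: str) -> str:
    """Map a URL to a coarse authority class based on its hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        return "arxiv"
    if host.endswith(".gov"):
        return "gov"
    if host.endswith(".edu"):
        return "edu"
    if host in CONTENT_FARMS:
        return "content_farm"
    return "unknown"


def recency_penalty(published: Optional[date], fast_moving: bool, today: date) -> float:
    """0 for slow topics or unknown dates; else one point per year of age, capped at 3."""
    if not fast_moving or published is None:
        return 0.0
    years = (today - published).days / 365.25
    return min(3.0, max(0.0, years))
```

The `fast_moving` switch is the guard against penalizing classic sources on slow-moving topics: for math or history it is simply `False` and age costs nothing.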

Failure mode: treating "N sites say so" as strong evidence. They're likely all reprinting one press release — that's pseudo-multi-source. Dedup by content, not URL, or your "cross-source corroboration" is self-deception. Another trap: over-applying recency penalties on slow-moving topics (math, history), killing the classic authoritative sources.
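"Dedup by content, not URL" can be approximated with word shingles and Jaccard similarity. This is a swapped-in, deliberately simple technique rather than anything the sources above prescribe; the shingle size and the 0.6 threshold are guesses to tune, and the function name is hypothetical.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word windows over the lowercased, punctuation-free text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_duplicates(texts: list[str], threshold: float = 0.6) -> list[list[int]]:
    """Group indices of near-identical texts; each group counts as ONE source."""
    groups: list[list[int]] = []
    reps: list[set] = []          # shingles of each group's first member
    for i, text in enumerate(texts):
        s = shingles(text)
        for group, rep in zip(groups, reps):
            if jaccard(s, rep) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
            reps.append(s)
    return groups
```

The number of groups, not the number of URLs, is the count to use when asking whether a claim has cross-source confirmation. Paraphrased rewrites of one press release can still slip under the threshold, which is why tracing to the primary source stays necessary.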

Resources · OpenAI deep research (designed to cite first-party sentences/passages), openai.com/index/introducing-deep-research · This series, Day 11 Hallucination (citations/URLs are high-fabrication tokens), hallucination-day11

// 04

Citation tracing & adversarial verification

Claim: a research agent's most dangerous failure isn't "can't find it" — it's "confidently synthesizing a conclusion with no source backing."

Background & Principle

Following Day 11: citations, URLs, dates, and numbers are the tokens LLMs most easily fabricate — and those are exactly the substance of a research report. A flawless-reading review may have half its citations hallucinated — correct format, 404 link, or a real link whose content never said that. Two fixes. First, grounded generation: the synthesis layer may only cite {claim, url, quote} the subagent actually fetched, leaving no room for "free-form citation." Anthropic's Citations API makes this native — it chunks source docs into sentences and has Claude cite the exact sentences it actually used, which Anthropic reports is more reliable than prompt-only approaches (and you aren't charged output tokens for the quoted text). Second, adversarial verification: after synthesis, run another pass that goes "each claim → back to the quote → grade the support" — the evaluator pattern applied to research.

Hands-on

A quote-grounded synthesis + verification prompt, copy-paste ready:

# Synthesis stage: ban free-form citation
Write the review using ONLY the FINDINGS below. Each conclusion must end with [url].
If a conclusion has no directly-supporting sentence in FINDINGS → write
"insufficient evidence". Do not fill the gap.

FINDINGS = [{claim, url, quote}, ...]   # from subagents, with source sentence

# Verification stage (separate pass, adversarial): check each one
Check every cited conclusion in the review above:
1. Is the conclusion directly supported by its quote? (yes/partial/no)
2. Does it rest on a single source? If so → tag [single-source, unconfirmed]
3. For "no" → delete it or downgrade to "some claim that...".
Output the revised review + a claim→evidence-strength table.

The core is to split "generate" and "verify" into two passes with different stances. Letting the model self-check while writing is nearly useless (it rationalizes what it just wrote); an independent adversarial pass is what catches fabricated citations.
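Before spending a model call on the adversarial pass, a mechanical lint can catch the cheapest violations of the synthesis rules. The sketch assumes one conclusion per line and the `[url]` suffix format from the prompt above; the function name and message wording are hypothetical.

```python
import re

# A conclusion must end with a bracketed http(s) URL.
CITE = re.compile(r"\[(https?://[^\]\s]+)\]\s*$")


def lint_review(review: str, findings: list[dict]) -> list[str]:
    """Flag lines that cite nothing, or cite a URL no subagent fetched."""
    allowed = {f["url"] for f in findings}
    problems: list[str] = []
    for n, raw in enumerate(review.splitlines(), start=1):
        line = raw.strip()
        if not line or "insufficient evidence" in line.lower():
            continue  # blank lines and explicit gaps are allowed
        match = CITE.search(line)
        if match is None:
            problems.append(f"line {n}: no citation")
        elif match.group(1) not in allowed:
            problems.append(f"line {n}: cites unfetched url {match.group(1)}")
    return problems
```

This only proves that every citation points at something that was actually fetched. Whether the quote supports the sentence is still the verification pass's job.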

Failure mode: (1) verifying only "does the URL open" — a real link whose content doesn't support the sentence is the most insidious error; you must check whether the quote truly comes from that URL and truly says that. (2) Generating and verifying in one pass, one stance — that's the model grading itself, biased high (backward rationalization). (3) Trusting the model's self-reported confidence — it isn't calibrated to actual correctness.
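Failure (1) has a mechanical half that needs no model: does the quote actually occur in the text fetched from that URL? A minimal sketch follows, where `fetch_text` is a hypothetical callable you supply (URL in, page text out) and matching is exact after case, whitespace, and Unicode compatibility normalization.

```python
import re
import unicodedata
from typing import Callable


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms and case, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def quote_in_page(quote: str, page_text: str) -> bool:
    """True if the non-empty quote occurs verbatim in the page."""
    q = normalize(quote)
    return bool(q) and q in normalize(page_text)


def unverified(findings: list[dict], fetch_text: Callable[[str], str]) -> list[dict]:
    """Return findings whose quote is absent from their own URL's text."""
    return [f for f in findings
            if not quote_in_page(f["quote"], fetch_text(f["url"]))]
```

A finding that passes has a real quote from a real page. It may still fail to support the claim, so this check narrows the adversarial pass rather than replacing it.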

Resources · Anthropic Introducing Citations on the API, claude.com/blog/introducing-citations-api · Simon Willison Anthropic's new Citations API, simonwillison.net/2025/Jan/24

// Capstone · Build a personal research agent this weekend

Thread the four points into a research harness that does a half-hour topic survey for you. The goal isn't to clone Deep Research — it's to walk "iterative loop + curation + citation tracing" end to end yourself.

  1. Type check (§1): first confirm your question truly needs iterative synthesis ("what to search next depends on what you just read"). Otherwise one search is done — don't build an agent.
  2. Orchestrator (§2): split into 3–5 orthogonal subdirections; hard-code the budget in the prompt — "at most N subagents, at most 8 searches each."
  3. Subagent + curation (§2/§3): isolated context per subagent; rank_sources before fetch, take top-3; return {claim, url, quote}, drop the raw text.
  4. Grounded synthesis (§4): the synthesis layer may only cite the returned quotes; if unsupported, write "insufficient evidence".
  5. Adversarial verify (§4): a separate pass checks each claim→quote, tags single-source/unconfirmed, deletes fabricated citations.
  6. Self-eval: write 5 questions with known answers; compare "your agent" vs "just asking Deep Research" on citation accuracy and token cost. You'll feel §1's claim firsthand: whether it's worth it depends on whether the task warrants iteration.
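For step 6, a small tally keeps the comparison honest. The record fields below are hypothetical, and the counts are ones you fill in by hand after checking each citation against its source.

```python
from dataclasses import dataclass


@dataclass
class Run:
    system: str               # e.g. "mine" or "hosted"
    citations_checked: int    # citations you manually verified
    citations_supported: int  # of those, how many the source really supports
    tokens: int               # total tokens consumed for this question


def summarize(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Per-system citation accuracy and mean token cost per question."""
    out: dict[str, dict[str, float]] = {}
    for system in sorted({r.system for r in runs}):
        rs = [r for r in runs if r.system == system]
        checked = sum(r.citations_checked for r in rs)
        supported = sum(r.citations_supported for r in rs)
        out[system] = {
            "citation_accuracy": supported / checked if checked else 0.0,
            "tokens_per_question": sum(r.tokens for r in rs) / len(rs),
        }
    return out
```

Accuracy is pooled over all checked citations rather than averaged per question, so a question with many citations weighs more. Switch to a per-question mean if that suits your test set better.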

Once you've done this, you'll instinctively peel back any "AI research product" — where's its iterative loop, how does it curate, how does it trace citations — instead of being fooled by a report that "reads professional."

// GLOSSARY

Deep Research
A class of agent systems that autonomously run open-ended surveys via an iterative search-read-reflect loop and produce cited reports.
Search-Read-Reflect Loop
The core research-agent cycle: search → read → generate the next query from what you read; unlike one-shot RAG.
Orchestrator-Worker
A lead agent splits subdirections, fans them out to isolated-context subagents, then collects and synthesizes.
Fan-out
Expanding one research question into multiple weakly-dependent subtasks in parallel.
Source Curation
Ranking/filtering search results by primacy/authority/recency/dedup before fetching.
Primary vs Secondary Source
First-party (original paper/official docs) vs secondhand retellings; agents should prefer primary and trace back.
Grounded Generation
Allowing the model to cite only retrieved content; no "free-form" citation.
Citations API
Anthropic's native capability: chunk source docs into sentences and have Claude cite the exact sentences used.
Citation Tracing
Tracing each conclusion back to the supporting quote and verifying it.
Adversarial Verification
A separate, adversarial pass that grades each claim's evidence strength to catch fabricated citations.

// DEEP THINKING

Deep Research is an end-to-end RL-trained "browsing reasoner." Where's the capability gap between that and a research harness you wire from off-the-shelf APIs?
The gap is in stopping judgment and query quality. RL training internalizes "when to backtrack, when it's enough" — an implicit policy ground out by reward signals; a harness can only crudely approximate it with max_iters + a heuristic enough(). Next is query-generation "taste" — the RL model has seen vast browsing trajectories and knows what to search next more efficiently. But the harness's controllability is its edge: you can precisely insert curation, citation tracing, domain allowlists. Conclusion: hosted Deep Research suits open exploration; a self-built harness suits auditable, customizable cases needing private-source access — not substitutes, different trade-offs.
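One crude way to approximate "when it's enough" in a harness is a novelty check: stop once several consecutive rounds add no unseen claims. The signature below differs from the `enough(subq, notes)` used in §2 and is purely illustrative; exact-match novelty and the patience of 2 are assumptions.

```python
def enough(notes_per_round: list[list[str]], patience: int = 2, min_new: int = 1) -> bool:
    """True once the last `patience` rounds each added fewer than `min_new` new claims."""
    seen: set[str] = set()
    dry_rounds = 0
    for round_notes in notes_per_round:
        new = {note.strip().lower() for note in round_notes} - seen
        seen |= new
        # A productive round resets the counter; a dry round extends it.
        dry_rounds = dry_rounds + 1 if len(new) < min_new else 0
    return dry_rounds >= patience
```

This captures only "I keep reading the same thing". It has no sense of whether the question is actually answered, which is the part an RL-trained policy learns and a heuristic does not.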
Anthropic reports multi-agent beats single-agent by 90%+ but burns 15× tokens. When is a single long-context agent the better choice?
When the task is strongly serially dependent or needs globally consistent judgment. Research suits multi-agent because subdirections are weakly dependent and parallelizable (searching A and searching B don't block each other). Conversely, if every step depends on the previous step's exact conclusion (multi-step math, legal analysis needing strict cross-reference consistency), splitting into subagents lets context drift between agents and decisions diverge, which is exactly the failure Cognition's Don't Build Multi-Agents retrospective describes. Test: can subtasks be done independently and stitched together? Yes → fan-out; no → single long-context agent.
"N websites all say so" feels like strong evidence — why is it dangerous in AI research?
Because independence is an illusion. Vast amounts of web content reprint or rehash one press release, and a growing share is AI-generated secondhand content whose pages cite each other. On the surface there are N URLs; in essence it is a single source amplified N times: pseudo-multi-source. Worse, once AI-generated content pollutes search results, you can get an echo chamber where "a model's fabrication gets written into a page, then searched by the next model as fact." So real corroboration must dedup by content and trace to primary, asking "how many genuinely independent first-party sources," not "how many URLs."
Why is "let the model verify citations while writing the review" nearly useless, requiring a separate adversarial pass?
In the same pass the model is in generation mode and tends to rationalize what it just wrote (backward rationalization) — it's already committed to the claim, so self-checking only finds reasons to support it. Same as a person proofreading right after writing and missing errors. A separate pass with a critic's stance ("assume this is fabricated — does the evidence really support it?") breaks the generation inertia and exposes holes. This is the essence of the evaluator-optimizer pattern: generator and evaluator must be different "personas," ideally different context and prompt framing, or evaluation degrades into self-endorsement.
As more of the web becomes AI-generated, what structural problem does a research agent's "curation" face?
The core problem is the dilution and disguise of primary sources. AI-generated content can perfectly mimic authoritative format (fake citations, fake data, plausible methodology), so traditional domain-authority/format signals fail. Likely evolution: rely on verifiable provenance (content signatures, provenance chains, official APIs rather than web scraping), rely on institutional trust anchors (trust only a few verifiable first-party channels), and make "can this trace to a responsible person/institution" a hard gate. This upgrades §3's curation from a "ranking problem" to a "provenance problem" — trustworthiness is no longer a property of the search result, but of the source's identity.

// FURTHER READING