Depth doesn't come from one big retrieval — it comes from "read, then search again".
By 2026, "Deep Research" has gone from a feature name to a class of systems: OpenAI, Gemini, Perplexity, and Anthropic each ship one. Yet most people treat it as a "stronger search box" — and get back a report that reads like an expert wrote it, with citations that are half wrong when you check. This issue isn't about "what Deep Research is." It's about using it as an engineering system, and building your own research agent. Four points: (1) why Deep Research is fundamentally a search-read-reflect agent loop, not RAG, and when it's overkill; (2) building a research harness with an orchestrator + fan-out subagents, and the boundary set by its 15× token cost; (3) source curation — found ≠ trustworthy; (4) citation tracing and adversarial verification — the most dangerous failure of a research agent is to confidently synthesize a conclusion with no source backing.
Plain RAG is one-shot: embed query → retrieve top-k → stuff into context → generate. Its ceiling is "did you get the keywords right the first time." Deep Research is different — a multi-turn agentic loop: plan → search → read → search again based on what you just learned → … → synthesize. OpenAI states Deep Research was trained with end-to-end RL on browsing + reasoning tasks; the model learned to plan, backtrack, and react to real-time information. Gemini Deep Research describes it almost identically — "browses the way you do: searching, finding interesting pieces, then starting a new search based on what it learned." The key difference is that queries are generated dynamically: you don't know what to search at first; only after reading the first batch do you know what to ask next. RAG can't do this — its query is frozen the moment it retrieves.
The cost is real: a Deep Research report typically runs 5–30 minutes and burns far more tokens than a normal chat. So the decision isn't "can it," it's "is it worth it."
A selection checklist — match the task; don't default to Deep Research:
# A single search is enough (seconds)
- Fact lookups: "latest version / release date of X"
- Known keywords, answer on a single page
# RAG (you have a private corpus)
- Q&A over a fixed doc set (internal wiki, contracts, papers)
- One query hits the relevant chunk
# Deep Research (minutes, worth the wait)
- Open-ended, needs cross-source synthesis: "compare X across A/B/C"
- Answer requires "read a batch first to know what to search next"
- Output valuable enough to justify 15× cost + tens of minutes
The test is "what to search next depends on what you just read" — only then go Deep Research; otherwise a cheaper path is just as accurate and far faster.
Anthropic's How we built our multi-agent research system gives a copyable architecture: a Lead Researcher (orchestrator) splits the question into subdirections and spawns multiple subagents in parallel, each searching one subdirection in its own context window, then returns summaries to the lead for synthesis. They report this beats a single agent by 90%+ on internal research evals — because spreading reasoning across independent contexts sidesteps a single agent's context ceiling. The same post pours cold water too: multi-agent burns roughly 15× the tokens of a normal chat — worthwhile only when the outcome's value far exceeds cost and the task splits in parallel. This connects to Day 13's "coordination tax": research is one of the rare cases where coordination gains > coordination cost, because subdirections are weakly dependent (searching company A vs B don't block each other).
A minimal research-harness skeleton — the point: the orchestrator only issues queries and never reads; each subagent reads one direction with isolated context:
def research(question):
# 1) ORCHESTRATOR: split into weakly-dependent subdirections
subqs = plan(question) # LLM emits 3-5 orthogonal subquestions
# 2) FAN-OUT: one isolated-context subagent per direction (parallel)
findings = parallel_map(subagent, subqs)
# 3) SYNTHESIZE: lead sees only summaries + cites, never raw pages
return synthesize(question, findings)
def subagent(subq, max_iters=8):
notes = []
for _ in range(max_iters):
q = next_query(subq, notes) # dynamically generate next query
hits = web_search(q)
for h in rank_sources(hits)[:3]: # see §3: curate first
notes.append(extract(fetch(h.url), keep_quote=True))
if enough(subq, notes): break # loop control, don't search forever
return compress(notes) # to {claim, url, quote}, drop raw text
Four keys: orthogonal subquestions (overlap = wasted parallelism), context isolation (a subagent's web noise doesn't pollute the lead), a max_iters per subagent (prevents infinite searching), and return {claim, url, quote}, not raw text (leaves anchors for §4's citation tracing).
web_search returns results ranked by relevance, not credibility. SEO farms, AI-generated rehashes, and stale content mix in with first-party authoritative sources in the top-k. If a subagent blindly fetches the top few, it feeds noise-as-fact into the synthesis layer — and an LLM won't spontaneously question source authority; it assumes whatever you give it is trustworthy. So curation must be an explicit harness step that happens before fetch. Usable signals: primary vs secondary (official docs/original papers > blog retellings > aggregators), domain authority (official domains, arXiv, known institutions > content farms), recency (down-weight old pages for fast-moving topics), and dedup (many sites copying one press release is essentially a single source — don't count it as "cross-source confirmation"). This step is also the prerequisite for §4's cross-source corroboration — you must first know which sources are genuinely independent.
A lightweight source ranking, inserted between search and fetch:
def rank_sources(hits):
AUTHORITY = {"arxiv.org":3, "*.gov":3, ".edu":2,
"official_docs":3, "content_farm":-2}
def score(h):
s = AUTHORITY.get(domain_class(h.url), 0)
s += 2 if is_primary(h) else 0 # weight primary sources
s -= recency_penalty(h.date, topic) # down-weight stale for fast topics
return s
uniq = dedup_by_content(hits) # merge copies: don't count reprints as independent
return sorted(uniq, key=score, reverse=True)
An even simpler version: state it directly in the subagent's prompt — "prefer primary sources; trace secondhand retellings back to the original, and tag single-source key conclusions as [needs corroboration]." Front-loading curation is far cheaper than fixing it afterward.
Following Day 11: citations, URLs, dates, and numbers are the tokens LLMs most easily fabricate — and those are exactly the substance of a research report. A flawless-reading review may have half its citations hallucinated — correct format, 404 link, or a real link whose content never said that. Two fixes. First, grounded generation: the synthesis layer may only cite {claim, url, quote} the subagent actually fetched, leaving no room for "free-form citation." Anthropic's Citations API makes this native — it chunks source docs into sentences and has Claude cite the exact sentences it actually used, which Anthropic reports is more reliable than prompt-only approaches (and you aren't charged output tokens for the quoted text). Second, adversarial verification: after synthesis, run another pass that goes "each claim → back to the quote → grade the support" — the evaluator pattern applied to research.
A quote-grounded synthesis + verification prompt, copy-paste ready:
# Synthesis stage: ban free-form citation
Write the review using ONLY the FINDINGS below. Each conclusion must end with [url].
If a conclusion has no directly-supporting sentence in FINDINGS → write
"insufficient evidence". Do not fill the gap.
FINDINGS = [{claim, url, quote}, ...] # from subagents, with source sentence
# Verification stage (separate pass, adversarial): check each one
Check every cited conclusion in the review above:
1. Is the conclusion directly supported by its quote? (yes/partial/no)
2. Does it rest on a single source? If so → tag [single-source, unconfirmed]
3. For "no" → delete it or downgrade to "some claim that...".
Output the revised review + a claim→evidence-strength table.
The core is to split "generate" and "verify" into two passes with different stances. Letting the model self-check while writing is nearly useless (it rationalizes what it just wrote); an independent adversarial pass is what catches fabricated citations.
Thread the four points into a research harness that does a half-hour topic survey for you. The goal isn't to clone Deep Research — it's to walk "iterative loop + curation + citation tracing" end to end yourself.
rank_sources before fetch, take top-3; return {claim, url, quote}, drop the raw text.Once you've done this, you'll instinctively peel back any "AI research product" — where's its iterative loop, how does it curate, how does it trace citations — instead of being fooled by a report that "reads professional."