DAY 44 / PHASE 4 · ENGINEERING

AI-Assisted Data Analysis

NL2SQL · Code Interpreter · Statistical Traps · Analysis-Agent Loop

2026-06-23 · BigCat

Getting the model to "fetch a number" is easy; getting it to fetch right, compute right, and explain right is a whole engineering stack.

// WHY THIS MATTERS

"Query your database in natural language" is the most dazzling demo and the most fragile in production—because it stacks three hard problems: grounding a vague question onto a precise schema (NL2SQL), handing computation to executable code instead of model hallucination, and turning results into trustworthy conclusions without falling into statistical traps. This issue does not explain "what text-to-SQL is"—it assumes you've already had Claude write SQL and used a Code Interpreter. We cover the engineering: why a large schema must be retrieved before generation, why "let the model write code and run it" beats "let the model just give a number," and how a data agent reconciles its own numbers so it doesn't confidently compute the wrong thing. The brutal BIRD/Spider 2.0 numbers run through the whole piece—they tell you the gap from demo to production isn't model intelligence, it's the verification scaffolding around the model.

// 01

The NL2SQL Bottleneck Is Schema Linking, Not SQL Syntax

Claim: the model has long known how to write SQL; what it can't do is figure out "which tables and columns to use in your real 3,000-column database."

Background & Principle

On academic Spider, SOTA execution accuracy passed 90% long ago. But on BIRD (large databases close to real business, arXiv 2305.03111), the strongest model of the day hit only ~40% execution accuracy vs. 93% for humans; on Spider 2.0 (enterprise-grade, single databases often >1,000 columns, multiple dialects, arXiv 2411.07763, ICLR 2025 Oral), even o1-preview reached only ~21%. Nearly all the gap comes from schema linking: mapping "active users" or "last month's revenue" to concrete tables, columns, join paths and filters. In real databases column names are usr_st, status codes are magic numbers, and synonymous columns are scattered across tables—this is a semantic-alignment problem, not a syntax one.

The right engineering posture: don't stuff the whole schema into context. A large database's DDL is tens of thousands of tokens, which both blows the budget and dilutes attention. The fix is to treat schema linking as retrieval: pull the most relevant subset of tables/columns for the question, then feed it with real sample values (so the model knows status takes 'A'/'C', not the guessed 'active').

Hands-on

# Don't: stuff the full DDL into the prompt (token blowup + attention dilution)
# Do: retrieve relevant tables -> attach sample values -> then generate

def build_schema_context(question, db):
    tables = retrieve_tables(question, db, top_k=8)   # vector recall of relevant tables
    ctx = []
    for t in tables:
        samples = db.query(f"SELECT * FROM {t} LIMIT 3")  # anchor semantics with real values
        ctx.append(f"{t.ddl}\n-- samples: {samples}")
    return "\n\n".join(ctx)

# Add a one-line business glossary to fix column-name / magic-number ambiguity
GLOSSARY = "active user = users.status='A' AND last_login > now()-30d"

That single line of business glossary (metric definition) often pays off more than upgrading the model—it fixes the organization's definition of "active" and "revenue" so the model stops improvising.

Failure modes: (1) recall misses a key table → the generated SQL is syntactically valid but semantically wrong, and silently returns a plausible-looking number; this is more dangerous than an error. (2) Giving DDL without sample values → the model guesses enum values / date formats literally and returns an empty set. Always alert on "returned 0 rows" separately—an empty set is often a sign of schema-linking failure, not "no data exists."
Going deeper · BIRD Can LLM Already Serve as a Database Interface?, arXiv 2305.03111 · Spider 2.0 Real-World Enterprise Text-to-SQL, arXiv 2411.07763
// 02

The Code-Interpreter Paradigm: Have the Model Write Executable Code, Not Emit Numbers

Claim: a model computing a 12-digit weighted average will hallucinate; have it write pandas and run it—if it's wrong, it errors. Outsource computation to a deterministic runtime.

Background & Principle

An LLM is a probabilistic text machine, not an arithmetic engine. Letting it "do mental math" on aggregates, correlations, or year-over-year deltas yields results that are neither trustworthy nor reproducible. The Code-Interpreter paradigm rewrites analysis as: model generates Python/SQL → execute in a sandbox → feed the real output (numbers + charts) back to the model. Three wins: computation goes to a deterministic runtime, errors get caught by execution feedback and self-corrected, and the code itself is an auditable record of the analysis. Anthropic's code execution tool and OpenAI's Code Interpreter are both server-side sandboxes: the script consumes the tool results and the model only sees the final output—which also saves tokens. Anthropic's advanced-tool-use engineering blog notes that orchestrating tool calls inside code execution (programmatic tool calling) cut billed input tokens by roughly 38% on a multi-tool benchmark with no change in accuracy.

Key engineering point: don't have the model "describe" charts either—have it draw them. Emitting chart JSON/SVG directly invites hallucinated coordinates and values; the right move is generating matplotlib/plotly code that executes into a real figure.

Hands-on

import anthropic
client = anthropic.Anthropic()

resp = client.beta.messages.create(
    model="claude-sonnet-4-6",
    betas=["code-execution-2025-05-22"],
    max_tokens=4096,
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    messages=[{"role": "user", "content":
        "From sales.csv, compute month-over-month by region and plot a line chart. "
        "Read with pandas, assert no nulls before computing, and print the numbers too."}],
)
# Model writes code -> sandbox runs it -> numbers & chart come from a real run, reproducible
# On error, execution feedback flows back; the model self-corrects the next round of code

The phrases "assert no nulls before computing" and "print the numbers too" are engineering discipline: force the model to write assumptions into the code and expose intermediate values, so you can verify.

Failure modes: (1) sandbox has no network / missing packages → the model burns turns on trial and error; spell out the environment and available packages up front. (2) Loading a huge dataset straight into memory with read_csv → OOM; for big tables, push the aggregation down to the database and only pull results back into the sandbox. (3) Blindly trusting the chart—the model may silently drop half the rows via dropna and still draw a "beautiful" trend. Always require it to print len(df) and the before/after row-count delta.
Going deeper · Anthropic Code Execution Tool, docs.anthropic.com/.../code-execution-tool · Anthropic Advanced Tool Use, anthropic.com/engineering/advanced-tool-use
// 03

Engineering Defense Against Statistical Traps: A Data Story ≠ Stringing Numbers Into Prose

Claim: the most dangerous thing about an analysis agent isn't miscalculating—it's wrapping a statistical error in fluent narrative and selling it as "insight."

Background & Principle

The "data story" is the selling point of these tools and also the biggest trap: models are naturally gifted at spinning any numbers into causal narrative. Common crashes: Simpson's paradox (trend reverses after grouping), survivorship bias (computing satisfaction only over retained users), type/NULL silent errors (summing the string '100', NULL treated as 0 dragging down the mean), and p-hacking (the agent auto-runs dozens of slices and picks the "significant" one to tell a story about). The model won't spontaneously question the data-generating process—it'll only make your numbers sound more convincing.

Engineering defense isn't "tell the model to be careful" (useless)—it's hardening skepticism into mandatory steps: EDA first (missingness, distributions, outliers, cardinality) before any conclusion; make the model state the assumptions and definitions behind every claim explicitly; apply multiple-comparison correction to auto-slicing or simply cap the slice count. The single most effective move: require every number in the narrative to carry the query/code that produced it, traceable in one click.

Hands-on

# Bake statistical discipline into the system prompt as hard constraints (excerpt)
You are a data analyst. Before any output you MUST:
1. Do EDA first: report row count, per-column missingness, distribution & outliers
2. Before any aggregation, state explicitly: NULL handling, grain, who's excluded
3. Distinguish [correlation] from [causation]; no unverified causal wording
4. If drilling down by dimension, check overall vs. grouped trend (Simpson check)
5. Append [source: code block N] after every key number for traceability
# Bad: "East region has the highest conversion, so increase spend"
# Good: "East has the highest conversion (12%), but its sample is only 3% of
#        total with an abnormal new-customer ratio; confirm representativeness first"
Failure mode: treating "let the agent freely explore the data to find insights" as a feature. Open-ended exploration = an automated p-hacking machine: run enough slices and you'll always find a "significant" result, and the agent will confidently narrate it as a story. Exploration is fine, but its output must be tagged "hypothesis to verify"—never taken directly as a conclusion; real conclusions go through pre-declared metrics and verification.
Going deeper · Simon Willison Datasette Agent (a real tool—and its limits—for NL querying SQLite + charting), simonwillison.net/2026/May/21/datasette-agent
// 04

The Analysis-Agent Loop: Numbers Must Reconcile

Claim: only what you can verify deserves to be an agent; an analysis agent's "verifiability" is reconciling its numbers.

Background & Principle

Upgrading a single query into an analysis agent (multi-round plan → query → inspect → refine) lets you handle open tasks that depend on intermediate results ("look at the distribution first, then decide how to slice"); the cost is it can go off the rails without noticing. What separates a toy from a trustworthy agent is a reconciliation layer: cross-validating the agent's numbers against an independent ground truth. Cheap, effective reconciliation: total checks (sum of all groups = total table rows/amount), cross-path validation (the same metric computed two different ways should match), and magnitude sanity (is the result within the historical range). These are hardwired into the harness, not left to the model to judge.

┌──────── Analysis-Agent Loop (with reconciliation) ────────┐ │ │ │ Question ─▶ (1) Schema retrieval (tables/cols + samples) │ │ │ │ │ ▼ │ │ (2) Generate SQL/code ─▶ (3) Sandbox run ─┐ │ │ ▲ │ │ │ │ error / empty set ▼ │ │ └────────────────────── (4) Inspect result │ │ (rows/dist/types) │ │ │ │ │ ┌─────────────┴───────────┐ │ │ ▼ pass fail ▼ │ │ (5) Reconcile (hard checks) back to (2) │ │ · group sum = total? │ │ · two methods agree? │ │ · magnitude in historical range? │ │ │ pass │ │ ▼ │ │ (6) Narrative + chart (each number sourced)│ └────────────────────────────────────────────────────────────┘

Hands-on

# Reconciliation as a hard gate: numbers that fail checks are bounced back
def reconcile(result, db):
    checks = []
    # 1) sum of groups == full-table total
    grp_sum = sum(r.value for r in result.rows)
    total   = db.query("SELECT SUM(amount) FROM orders")[0]
    checks.append(("group_sum==total", abs(grp_sum-total) < 1e-6))
    # 2) magnitude sanity: within historical ±3σ
    checks.append(("magnitude_ok", in_historical_range(result)))
    failed = [name for name, ok in checks if not ok]
    if failed:
        return {"pass": False, "retry_hint": f"reconcile failed:{failed}, check join double-counting"}
    return {"pass": True}

The most common root cause of failed reconciliation is join-induced double counting (a one-to-many join doubles the amount)—"group sum ≠ total" catches it at a glance, while the model itself almost never will.

Failure mode: using LLM-as-judge for reconciliation. Having another model "check whether this analysis is right" only catches logical-phrasing issues, not numeric errors—it can't do the math either. Numeric reconciliation must be deterministic (SQL/code), not another model layer on top.
Going deeper · Spider 2.0 (why enterprise workflows need multi-step + verification), github.com/xlang-ai/Spider2 · Anthropic Building Effective Agents (when to escalate to an agent), anthropic.com/engineering/building-effective-agents

// CAPSTONE · Build Yourself a "Reconcilable Analysis Agent"

String the four points into a weekend project: pick a real dataset you know well (orders / fitness / personal finance) and build an analysis agent that can Q&A + chart + self-reconcile.

  1. Schema layer: write a business glossary (metric definitions + magic-number meanings) and attach LIMIT 3 sample values when retrieving relevant tables. First verify: a weaker model + good glossary vs. a strong model with no glossary—which is more accurate?
  2. Execution layer: route all computation through code execution / a sandbox; forbid the model from "mental math." Push big-table aggregation to the DB, pull only results back.
  3. Statistical discipline: bake the 5 hard constraints from §3 into the system prompt; conclusions must carry source traceability.
  4. Reconciliation layer: implement reconcile() with at least "group sum = total" and "magnitude sanity"; bounce back to recompute on failure.
  5. Eval: write 10 questions with ground truth (answers you computed/checked by hand), run a "no-reconcile" vs. "with-reconcile" version, and tally error rate. Most people find that a large share of the errors the reconciliation layer catches are join double-counting and NULL handling—errors completely invisible from the narrative alone.

Once you've built this, you'll look at any "AI data analysis" product and first hunt for its reconciliation layer—anything without one is, at bottom, a random-number generator that tells stories.

// DEEP THINKING

40% on BIRD, 21% on Spider 2.0—does that mean NL2SQL is simply unusable in production?
Not unusable—just not usable as "fully automatic." Those numbers are the hard metric for "one shot, no human in the loop." The right production shape is human-in-the-loop: the agent proposes SQL + its assumptions + sample results, and someone with domain knowledge reviews before execution; or you constrain it to parameterized, pre-validated metrics/templates. Treat NL2SQL as a "draft generator that must be reviewed" rather than an "automatic answer machine," and 21% one-shot accuracy plus human review yields overall usability far above 21%. The key is a UI that makes review cheap.
If computation is outsourced to code and reconciliation to deterministic checks, what irreplaceable value does the LLM actually add in data analysis?
The LLM's value is at the two ends—"translation" and "narrative"—not the computation in the middle. Front end: translating a vague business question into precise queries/code (schema linking + intent understanding), which rule systems do poorly. Back end: translating a pile of numbers into human-readable, contextualized conclusions. The arithmetic, aggregation, and statistical tests in the middle should go to a deterministic runtime—let the LLM do the semantic work it's good at and let code do the precise computation it's good at. That division of labor is exactly the philosophy of the code-interpreter paradigm.
"Let the agent freely explore the data for insights" sounds like a killer feature—why does this piece list it as an anti-pattern?
Because open exploration is statistically equivalent to automated multiple comparisons. With enough dimensions, you can find "significant" slices even in purely random data—a mathematical inevitability, not a model flaw. The agent will also wrap noise in fluent narrative as insight, faster and more confidently than a human's p-hacking. It's not that you can't explore—it's that exploration's output must be strictly demoted to "hypothesis to verify," and only promoted to a conclusion after pre-registered metrics + independent-sample validation. Conflating "explore" with "verify" is the most insidious danger of these tools.
Reconciliation catches numeric errors, but can it catch "asking the wrong question"—e.g., a metric definition that's wrong to begin with?
No. Reconciliation only guarantees "given this definition, the number is correct"—it can't guarantee "this definition is what the business wants." Definition errors are an upstream problem, addressed only by fixing the business glossary (metric definitions) up front + human confirmation of the definition. That's why this piece keeps stressing the glossary: it takes value judgments like "how do we define an active user?" out of the model's hands and gives them to organizational consensus. Reconciliation guarantees correctness; the glossary guarantees relevance—you need both. That's precisely the boundary of purely technical means.
Is the analysis agent's plan→query→inspect→refine loop fundamentally the same as a coding agent's loop? Where do they differ?
Same skeleton (agentic loop + verifiable output), but the nature of "verification" differs. A coding agent's verification is discrete: tests pass/fail, compiles or errors—clear signals. An analysis agent's verification is continuous and partial: a number "looking reasonable" doesn't mean it's right, and reconciliation only covers the formalizable parts (totals, magnitudes), not "is this correlation causal." So analysis agents rely more on humans to gate the un-formalizable semantic correctness—the weaker the verification signal in a domain, the lower the ceiling on autonomy.

// FURTHER READING