Getting the model to "fetch a number" is easy; getting it to fetch right, compute right, and explain right is a whole engineering stack.
"Query your database in natural language" is the most dazzling demo and the most fragile in production—because it stacks three hard problems: grounding a vague question onto a precise schema (NL2SQL), handing computation to executable code instead of model hallucination, and turning results into trustworthy conclusions without falling into statistical traps. This issue does not explain "what text-to-SQL is"—it assumes you've already had Claude write SQL and used a Code Interpreter. We cover the engineering: why a large schema must be retrieved before generation, why "let the model write code and run it" beats "let the model just give a number," and how a data agent reconciles its own numbers so it doesn't confidently compute the wrong thing. The brutal BIRD/Spider 2.0 numbers run through the whole piece—they tell you the gap from demo to production isn't model intelligence, it's the verification scaffolding around the model.
On academic Spider, SOTA execution accuracy passed 90% long ago. But on BIRD (large databases close to real business, arXiv 2305.03111), the strongest model of the day hit only ~40% execution accuracy vs. 93% for humans; on Spider 2.0 (enterprise-grade, single databases often >1,000 columns, multiple dialects, arXiv 2411.07763, ICLR 2025 Oral), even o1-preview reached only ~21%. Nearly all the gap comes from schema linking: mapping "active users" or "last month's revenue" to concrete tables, columns, join paths and filters. In real databases column names are usr_st, status codes are magic numbers, and synonymous columns are scattered across tables—this is a semantic-alignment problem, not a syntax one.
The right engineering posture: don't stuff the whole schema into context. A large database's DDL is tens of thousands of tokens, which both blows the budget and dilutes attention. The fix is to treat schema linking as retrieval: pull the most relevant subset of tables/columns for the question, then feed it with real sample values (so the model knows status takes 'A'/'C', not the guessed 'active').
# Don't: stuff the full DDL into the prompt (token blowup + attention dilution)
# Do: retrieve relevant tables -> attach sample values -> then generate
def build_schema_context(question, db):
tables = retrieve_tables(question, db, top_k=8) # vector recall of relevant tables
ctx = []
for t in tables:
samples = db.query(f"SELECT * FROM {t} LIMIT 3") # anchor semantics with real values
ctx.append(f"{t.ddl}\n-- samples: {samples}")
return "\n\n".join(ctx)
# Add a one-line business glossary to fix column-name / magic-number ambiguity
GLOSSARY = "active user = users.status='A' AND last_login > now()-30d"
That single line of business glossary (metric definition) often pays off more than upgrading the model—it fixes the organization's definition of "active" and "revenue" so the model stops improvising.
An LLM is a probabilistic text machine, not an arithmetic engine. Letting it "do mental math" on aggregates, correlations, or year-over-year deltas yields results that are neither trustworthy nor reproducible. The Code-Interpreter paradigm rewrites analysis as: model generates Python/SQL → execute in a sandbox → feed the real output (numbers + charts) back to the model. Three wins: computation goes to a deterministic runtime, errors get caught by execution feedback and self-corrected, and the code itself is an auditable record of the analysis. Anthropic's code execution tool and OpenAI's Code Interpreter are both server-side sandboxes: the script consumes the tool results and the model only sees the final output—which also saves tokens. Anthropic's advanced-tool-use engineering blog notes that orchestrating tool calls inside code execution (programmatic tool calling) cut billed input tokens by roughly 38% on a multi-tool benchmark with no change in accuracy.
Key engineering point: don't have the model "describe" charts either—have it draw them. Emitting chart JSON/SVG directly invites hallucinated coordinates and values; the right move is generating matplotlib/plotly code that executes into a real figure.
import anthropic
client = anthropic.Anthropic()
resp = client.beta.messages.create(
model="claude-sonnet-4-6",
betas=["code-execution-2025-05-22"],
max_tokens=4096,
tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
messages=[{"role": "user", "content":
"From sales.csv, compute month-over-month by region and plot a line chart. "
"Read with pandas, assert no nulls before computing, and print the numbers too."}],
)
# Model writes code -> sandbox runs it -> numbers & chart come from a real run, reproducible
# On error, execution feedback flows back; the model self-corrects the next round of code
The phrases "assert no nulls before computing" and "print the numbers too" are engineering discipline: force the model to write assumptions into the code and expose intermediate values, so you can verify.
read_csv → OOM; for big tables, push the aggregation down to the database and only pull results back into the sandbox. (3) Blindly trusting the chart—the model may silently drop half the rows via dropna and still draw a "beautiful" trend. Always require it to print len(df) and the before/after row-count delta.
The "data story" is the selling point of these tools and also the biggest trap: models are naturally gifted at spinning any numbers into causal narrative. Common crashes: Simpson's paradox (trend reverses after grouping), survivorship bias (computing satisfaction only over retained users), type/NULL silent errors (summing the string '100', NULL treated as 0 dragging down the mean), and p-hacking (the agent auto-runs dozens of slices and picks the "significant" one to tell a story about). The model won't spontaneously question the data-generating process—it'll only make your numbers sound more convincing.
Engineering defense isn't "tell the model to be careful" (useless)—it's hardening skepticism into mandatory steps: EDA first (missingness, distributions, outliers, cardinality) before any conclusion; make the model state the assumptions and definitions behind every claim explicitly; apply multiple-comparison correction to auto-slicing or simply cap the slice count. The single most effective move: require every number in the narrative to carry the query/code that produced it, traceable in one click.
# Bake statistical discipline into the system prompt as hard constraints (excerpt)
You are a data analyst. Before any output you MUST:
1. Do EDA first: report row count, per-column missingness, distribution & outliers
2. Before any aggregation, state explicitly: NULL handling, grain, who's excluded
3. Distinguish [correlation] from [causation]; no unverified causal wording
4. If drilling down by dimension, check overall vs. grouped trend (Simpson check)
5. Append [source: code block N] after every key number for traceability
# Bad: "East region has the highest conversion, so increase spend"
# Good: "East has the highest conversion (12%), but its sample is only 3% of
# total with an abnormal new-customer ratio; confirm representativeness first"
Upgrading a single query into an analysis agent (multi-round plan → query → inspect → refine) lets you handle open tasks that depend on intermediate results ("look at the distribution first, then decide how to slice"); the cost is it can go off the rails without noticing. What separates a toy from a trustworthy agent is a reconciliation layer: cross-validating the agent's numbers against an independent ground truth. Cheap, effective reconciliation: total checks (sum of all groups = total table rows/amount), cross-path validation (the same metric computed two different ways should match), and magnitude sanity (is the result within the historical range). These are hardwired into the harness, not left to the model to judge.
# Reconciliation as a hard gate: numbers that fail checks are bounced back
def reconcile(result, db):
checks = []
# 1) sum of groups == full-table total
grp_sum = sum(r.value for r in result.rows)
total = db.query("SELECT SUM(amount) FROM orders")[0]
checks.append(("group_sum==total", abs(grp_sum-total) < 1e-6))
# 2) magnitude sanity: within historical ±3σ
checks.append(("magnitude_ok", in_historical_range(result)))
failed = [name for name, ok in checks if not ok]
if failed:
return {"pass": False, "retry_hint": f"reconcile failed:{failed}, check join double-counting"}
return {"pass": True}
The most common root cause of failed reconciliation is join-induced double counting (a one-to-many join doubles the amount)—"group sum ≠ total" catches it at a glance, while the model itself almost never will.
String the four points into a weekend project: pick a real dataset you know well (orders / fitness / personal finance) and build an analysis agent that can Q&A + chart + self-reconcile.
LIMIT 3 sample values when retrieving relevant tables. First verify: a weaker model + good glossary vs. a strong model with no glossary—which is more accurate?reconcile() with at least "group sum = total" and "magnitude sanity"; bounce back to recompute on failure.Once you've built this, you'll look at any "AI data analysis" product and first hunt for its reconciliation layer—anything without one is, at bottom, a random-number generator that tells stories.