Getting a model to emit valid JSON is the tutorial; getting your schema out of it without dropping IQ is the engineering.
// WHY THIS MATTERS
Structured output is the type boundary between an agent and the systems around it — tool-call arguments, extracted fields, routing labels all ride on it. Many people think "please return JSON" settles the matter, then production teaches them three lessons on repeat: the model occasionally returns "almost-valid" JSON wrapped in markdown fences, reasoning quality mysteriously drops once you enforce a schema, and a max_tokens truncation hands you half a JSON object that blows up json.loads. This issue skips "what JSON is" and covers four things that decide reliability: why the four ways of getting JSON are not equivalent, the hidden reasoning tax of constrained decoding and how to dodge it, how to design a schema given that it is itself a prompt, and the fault-tolerance stack for when "100% schema" still fails. Every point assumes you've already called tool use in the API.
// 01
Four Ways to Get JSON: Not the Same Order of Reliability
Claim: "prompt-for-JSON / JSON mode / tool use / constrained decoding" are four guarantee strengths; mixing them up is the root of most parse crashes.
Background & principle
"Make the model output JSON" looks like one thing, but the underlying mechanisms differ wildly, from weak to strong:
Prompt + prefill: write "return JSON only" in the system, then prefill a { to skip pleasantries. Zero guarantee — the model can still add fences, comments, or misspell fields.
JSON mode: the API guarantees valid JSON, but not that it matches your schema. Parseable ≠ correct fields.
Tool use / function calling: describe structure via the tool's input_schema and the model fills params. Stronger than JSON mode, but still not 100% in practice — OpenAI's own number for function calling is about 86% schema compliance.
Constrained decoding (Structured Outputs): compile the schema into a grammar at the decoding layer and, at each step, only allow valid tokens to be sampled. Both OpenAI and Anthropic now offer a strict mode (Anthropic has opened Structured Outputs for Claude Sonnet 4.5 / Opus 4.1), claiming 100% schema compliance and zero parse errors.
The key engineering intuition: the first two are "pray afterwards," the last two are "constrain beforehand." Constrained decoding works by converting the JSON Schema into a CFG/FSM and masking every illegal token at each generation step — so it is physically impossible to emit a structure-violating token. That's also why it's cheaper than retry-until-valid: instead of generating then validating, you never let the wrong token be born.
Four tiers of getting JSON · guarantee / cost
weak ──────────────────────────────────────────────────────────────▶ strong
┌───────────────┬───────────┬───────────┬─────────────┬──────────────┐
│               │ prompt +  │ JSON mode │ tool use    │ constrained  │
│               │ prefill { │           │             │ decoding     │
│               │           │           │             │ (struct out) │
├───────────────┼───────────┼───────────┼─────────────┼──────────────┤
│ valid JSON?   │ no        │ yes       │ yes         │ yes          │
│ matches       │ no        │ no        │ ~86%        │ yes, 100%    │
│ schema?       │           │           │             │              │
│ mechanism     │ text pray │ text pray │ train pref. │ decode mask  │
└───────────────┴───────────┴───────────┴─────────────┴──────────────┘
per step (constrained decoding): schema → grammar → mask illegal tokens → sample
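To make "mask every illegal token at each step" concrete, here is a toy sketch. It assumes a hand-written FSM and a made-up table of model scores (FSM, LOGITS, constrained_step and generate are all illustrative names, not any engine's API). Real engines compile the grammar from the JSON Schema and work on tokenizer IDs, but the per-step logic has the same shape.

```python
import math

# Hand-written FSM for the toy grammar: { "label" : ("pos" | "neg") }
# Each state maps a legal token to the next state; anything else is illegal.
FSM = {
    "start": {"{": "key"},
    "key": {'"label"': "colon"},
    "colon": {":": "value"},
    "value": {'"pos"': "close", '"neg"': "close"},
    "close": {"}": "done"},
}

# Made-up model scores: the "model" prefers chatty or fenced tokens
# that would break a parser if they were ever emitted.
LOGITS = {"Sure": 3.0, "```": 2.5, '"maybe"': 2.0, "{": 1.0, '"pos"': 0.9,
          '"label"': 0.5, ":": 0.4, '"neg"': 0.2, "}": 0.1}


def constrained_step(state, logits):
    """Mask every token the FSM forbids in this state, then take the best legal one."""
    allowed = FSM[state]
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    token = max(masked, key=masked.get)
    return token, allowed[token]


def generate():
    """Run the FSM to completion, applying the mask at every step."""
    state, out = "start", []
    while state != "done":
        token, state = constrained_step(state, LOGITS)
        out.append(token)
    return "".join(out)


print(generate())
```

Even though "Sure" and the code fence score highest at every step, they are never sampled, so there is nothing to validate or retry afterwards.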
Hands-on
Same extraction task, using Anthropic strict structured output (decode-layer guarantee) rather than prompt-for-JSON:
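import json
import anthropic

client = anthropic.Anthropic()
review_text = "Battery died after two days, but support replaced it fast."  # placeholder input

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["pos", "neg", "neutral"]},
        "key_entities": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["sentiment", "key_entities"],
    "additionalProperties": False,
}

# Depending on your SDK version, structured outputs may sit behind a beta flag; check the current docs.
r = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=1024,
    messages=[{"role": "user", "content": review_text}],
    output_format={"type": "json_schema", "schema": schema},  # decode-layer constraint
)
# The result arrives as a JSON string in a text block (.input belongs to tool_use blocks).
data = json.loads(r.content[0].text)  # schema-valid unless truncated or refused; no fence-stripping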
Note additionalProperties:false — strict mode usually requires it, or the model invents extra fields.
Failure mode: using JSON mode but assuming it manages schema. JSON mode only promises "parseable." The model spells sentiment as sentyment, or wraps an extra layer like {"result": {...}} — all valid JSON, and your data["sentiment"] KeyErrors. Want schema-level guarantees? Use constrained decoding, don't lean on JSON mode.
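The gap between "parseable" and "matches my schema" is easy to see with the standard library alone. The sketch below uses two hypothetical model replies that mirror the failure described above; parses and has_expected_shape are illustrative helpers, not part of any SDK.

```python
import json

REQUIRED_KEYS = {"sentiment", "key_entities"}


def parses(raw):
    """What JSON mode promises: the text is valid JSON."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


def has_expected_shape(raw):
    """What your code actually needs: exactly the expected top-level keys."""
    obj = json.loads(raw)
    return isinstance(obj, dict) and set(obj) == REQUIRED_KEYS


misspelled = '{"sentyment": "pos", "key_entities": ["battery"]}'
wrapped = '{"result": {"sentiment": "pos", "key_entities": ["battery"]}}'
correct = '{"sentiment": "pos", "key_entities": ["battery"]}'

for name, raw in [("misspelled", misspelled), ("wrapped", wrapped), ("correct", correct)]:
    print(name, parses(raw), has_expected_shape(raw))
```

All three replies parse; only the last one survives the shape check, which is the part JSON mode never promised.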
// 02
The Hidden Tax of Constrained Decoding: Reason First, Then Constrain
Claim: slapping a strong schema directly onto reasoning output lowers accuracy; the fix isn't to drop structure, it's to split "reasoning" from "structure" into two stages.
Background & principle
The 2024 EMNLP paper Let Me Speak Freely? (Tam et al.) dropped a bomb: across multiple reasoning benchmarks, forcing models to answer in JSON/XML formats significantly degraded reasoning accuracy, with stricter constraints degrading more. The intuitive explanation: constrained decoding masks out the tokens the model would have used to "think out loud," effectively stripping its chain-of-thought.
But the story isn't over. The Outlines team (.txt) published a rebuttal, Say What You Mean: on reproduction, they found that as long as the schema is well-designed and good few-shot structure examples are provided, structured generation doesn't drop scores — it can slightly raise them. The two aren't actually contradictory; together they yield one shippable conclusion: it's not structure that hurts, it's squeezing out the reasoning field that hurts.
So the real tactic is field-ordering engineering: JSON is generated autoregressively, so when the model writes later fields it can see the earlier ones it already wrote. Put the reasoning field before the answer field and the model does its CoT inside the structure, then produces the answer; put answer first and you force it to answer before thinking. When you need to go all the way, use two stages: stage one is free CoT with no constraint, stage two is the light task of "extract the conclusion into a schema."
Hands-on
# ❌ answer first: the model must commit to an answer before it reasons
bad_shape = {"answer": ..., "reasoning": ...}

# ✅ reasoning first: CoT embedded in structure, answer follows it
schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string",
                      "description": "Think step by step before answering."},
        "answer": {"type": "number"},
    },
    # keys are generally emitted in "properties" order; verify this for your engine
    "required": ["reasoning", "answer"],
}

# ✅✅ reasoning-heavy task: two stages, CoT stage unconstrained
# ask() and prompt stand in for your own model-call wrapper and task prompt
cot = ask(prompt)  # free text, full reasoning
out = ask(f"Extract JSON from this analysis:\n{cot}", schema=schema)
Failure mode: wrapping a whole strict-JSON constraint around a reasoning model (one with thinking), which double-strips it — thinking is already outside the structure, and the body gets masked too. These models are most stable with "free thinking + final answer via structured output"; don't let grammar constraints reach into the thinking block.
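Field order is easy to regress during a refactor, so it can be worth a cheap guard in CI. The sketch below is one possible check. It assumes your engine emits keys in "properties" order (confirm this against your provider's documentation), and reasoning_comes_first is a hypothetical helper name.

```python
def reasoning_comes_first(schema, reasoning="reasoning", answer="answer"):
    """True if the reasoning field is declared before the answer field.

    Relies on Python dicts preserving insertion order, and on the decoding
    engine generating keys in the order they appear under "properties".
    """
    keys = list(schema.get("properties", {}))
    if reasoning not in keys or answer not in keys:
        return False
    return keys.index(reasoning) < keys.index(answer)


good = {"type": "object",
        "properties": {"reasoning": {"type": "string"}, "answer": {"type": "number"}}}
bad = {"type": "object",
       "properties": {"answer": {"type": "number"}, "reasoning": {"type": "string"}}}

print(reasoning_comes_first(good), reasoning_comes_first(bad))
```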
// 03
Schema Is a Prompt: Field Design Decides Fill Quality
Claim: a schema isn't just a constraint for the parser — it's also an instruction the model reads; field names, descriptions, enums, and nesting depth all change output quality.
Background & principle
Constrained decoding guarantees "structurally legal," but whether the filled-in content is correct still depends on how the schema is written. This shares a root with Day 4's tool-use conclusion: description matters more than field name. A few repeatedly-validated design disciplines:
Write a description for every field: the model treats it as a field-level instruction. "due_date" plus "ISO 8601, null if absent" is far more reliable than a bare field name.
Use enum to converge categories: make classification fields enums, not free strings — constrained decoding physically guarantees a value within the set, stronger than "please pick one of the following."
Flat beats deeply nested: nesting beyond three levels noticeably lowers fill accuracy and slows or even breaks some constraint engines. Flatten what you can.
Be wary of optional / union: in strict mode OpenAI requires all fields in required, expressing optionality via a "type union null"; heavy anyOf unions have limited support in most constraint engines. "Explicitly fill null when missing" is more stable than a pile of optional fields.
Leave room for injection defense: when extracting from untrusted text, don't let the model stuff raw text into fields that decide control flow — that's the Day 24 prompt-injection boundary.
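The "explicit null instead of optional" discipline can be applied mechanically. The sketch below is a minimal converter for flat, top-level schemas only; make_strict is a hypothetical helper, and it does not handle nesting, anyOf, or fields without a "type" key.

```python
import copy


def make_strict(schema):
    """Return a copy where every property is required and optional ones accept null."""
    strict = copy.deepcopy(schema)
    props = strict.get("properties", {})
    already_required = set(strict.get("required", []))
    for name, spec in props.items():
        if name in already_required or "type" not in spec:
            continue  # leave required fields and untyped specs untouched
        types = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if "null" not in types:
            spec["type"] = types + ["null"]  # optional becomes "string or null"
    strict["required"] = list(props)
    strict["additionalProperties"] = False
    return strict


loose = {
    "type": "object",
    "properties": {
        "summary": {"type": "string", "description": "One-line problem"},
        "due_date": {"type": "string", "description": "ISO 8601, null if absent"},
    },
    "required": ["summary"],
}
strict = make_strict(loose)
print(strict["required"], strict["properties"]["due_date"]["type"])
```

The model now has to write due_date every time, and a missing date becomes an explicit null instead of a silently absent key.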
Hands-on
A flat schema with descriptions and enums via Pydantic (both OpenAI/Anthropic SDKs eat it directly):
from enum import Enum
from pydantic import BaseModel, Field

class Priority(str, Enum):
    low = "low"
    med = "med"
    high = "high"

class Ticket(BaseModel):
    """Extract a support ticket from a user message."""
    summary: str = Field(description="One-line problem, <=80 chars")
    priority: Priority = Field(description="high only if blocking/data-loss")
    due_date: str | None = Field(description="ISO 8601 or null")
    # flat + per-field description + enum + explicit null
This schema does three jobs at once: constrain the structure, use descriptions as field-level instructions, and use the enum to physically pin priority to three values. The reader isn't just the parser — it's the model itself.
Failure mode: (1) cramming all decision logic into field names (shouldEscalateToTier2BecauseSLA) without a description — the model can only guess the semantics. (2) Deep nesting + a screen full of optionals: the model fills sloppily, the constraint engine slows down, and during debugging you can't tell whether the model erred or the schema is too contrived.
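Both failure modes can be caught before the schema ever reaches a model. The sketch below is a small linter over a plain JSON Schema dict. lint_schema is a hypothetical helper, the three-level threshold comes from the rule of thumb in this section, and it only walks "properties" and "items".

```python
def lint_schema(schema, path="$", depth=1, max_depth=3):
    """Collect warnings for undescribed fields and objects nested past max_depth."""
    warnings = []
    if schema.get("type") == "object":
        if depth > max_depth:
            warnings.append(f"{path}: nested {depth} levels deep")
        for name, spec in schema.get("properties", {}).items():
            child = f"{path}.{name}"
            if "description" not in spec:
                warnings.append(f"{child}: missing description")
            warnings.extend(lint_schema(spec, child, depth + 1, max_depth))
    elif schema.get("type") == "array":
        # arrays do not add a nesting level on their own; look inside the items
        warnings.extend(lint_schema(schema.get("items", {}), path + "[]", depth, max_depth))
    return warnings


ticket = {
    "type": "object",
    "properties": {
        "shouldEscalateToTier2BecauseSLA": {"type": "boolean"},
        "summary": {"type": "string", "description": "One-line problem"},
    },
}
for warning in lint_schema(ticket):
    print(warning)
```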
// 04
The Fault-Tolerance Stack: When "100% Schema" Still Fails
Claim: constrained decoding guarantees legal structure, but not that it's semantically right, untruncated, or not a refusal — production reliability comes from an outer validate-repair-fallback stack.
Background & principle
"100% schema compliance" is an easily-misread promise. It guarantees the emitted tokens satisfy the grammar, but it can't handle any of these:
Truncation: you hit max_tokens, stop_reason == "max_tokens", and what you got is structurally unclosed output. Always check stop_reason first, don't blindly parse.
Refusal: the model may take a refusal branch, in which case the structured channel returns something other than your schema. Handle this path separately.
Semantic errors: the schema says email: string, the model fills "none" — a legal string, but not an email. Structure right, semantics wrong.
Streaming partials: mid-stream JSON is incomplete and needs an incremental/tolerant parser; you can't wait until the end to parse.
So the real production shape isn't "call once, get a dict," it's a validate-repair-fallback chain — the same resilience thinking as Day 3's harness "feed the error back to the model": handing a validation error back to the model to self-repair beats throwing an exception.
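The first two failure paths can be triaged before any parsing happens. The sketch below assumes the response exposes a stop_reason string and the raw text; the "refusal" value is an assumption, since some providers signal refusals through a separate field instead. parse_reply is a hypothetical helper.

```python
import json


def parse_reply(stop_reason, text):
    """Classify a structured reply before trusting it. Returns (status, payload)."""
    if stop_reason == "max_tokens":
        return ("truncated", None)   # output was cut off; do not parse
    if stop_reason == "refusal":     # assumed value; check your provider's docs
        return ("refused", text)
    try:
        return ("ok", json.loads(text))
    except json.JSONDecodeError as exc:
        return ("invalid", str(exc))


half = '{"summary": "Login fa'
print(parse_reply("max_tokens", half))
print(parse_reply("end_turn", '{"summary": "Login fails"}'))
```

Returning a status tag rather than raising keeps the caller in charge of whether to retry, split the task, or fall back.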
Hands-on
from pydantic import ValidationError

# call(), Truncated and Unrepairable are placeholders for your own wrapper and exception types
def extract(text, max_repair=2):
    msgs = [{"role": "user", "content": text}]
    for _ in range(max_repair + 1):
        r = call(msgs, schema=Ticket)
        if r.stop_reason == "max_tokens":  # 1) truncation: don't parse
            raise Truncated("raise max_tokens / split task")
        try:
            return Ticket.model_validate(r.data)  # 2) semantic validation
        except ValidationError as e:
            msgs += [{"role": "assistant", "content": str(r.data)},
                     {"role": "user",
                      "content": f"Validation failed, fix and resend: {e}"}]  # 3) feed back
    raise Unrepairable()  # 4) fallback: human/default
Four layers backstop in order: truncation detection → semantic validation (Pydantic validators cover emails, ranges, cross-field constraints the schema can't express) → feed the error back for self-repair → fall back if unfixable. Constrained decoding only covers the "structure" layer; the other three you build yourself.
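The loop is easier to trust once you have seen it run end to end. The sketch below replays the four layers with the standard library only: a stub model (fake_call) that answers wrongly on the first turn, and a hand-written validate function standing in for Pydantic validators. The names, the email and seats fields, and the "end_turn" stop reason are all illustrative assumptions.

```python
from dataclasses import dataclass


class Truncated(Exception):
    """Raised when the reply hit the token limit."""


class Unrepairable(Exception):
    """Raised when the reply is still invalid after all repair attempts."""


@dataclass
class Reply:
    stop_reason: str
    data: dict


def validate(data):
    """Semantic checks a JSON Schema cannot express; returns a list of problems."""
    problems = []
    if "@" not in data.get("email", ""):
        problems.append("email must contain '@'")
    if not 1 <= data.get("seats", 0) <= 100:
        problems.append("seats must be between 1 and 100")
    return problems


def extract(call, text, max_repair=2):
    msgs = [{"role": "user", "content": text}]
    for _ in range(max_repair + 1):
        r = call(msgs)
        if r.stop_reason == "max_tokens":           # 1) truncation
            raise Truncated("raise max_tokens or split the task")
        problems = validate(r.data)                 # 2) semantic validation
        if not problems:
            return r.data
        msgs += [{"role": "assistant", "content": str(r.data)},
                 {"role": "user",                   # 3) feed the error back
                  "content": "Validation failed, fix and resend: " + "; ".join(problems)}]
    raise Unrepairable(f"still invalid after {max_repair} repairs")  # 4) fallback


def fake_call(msgs):
    """Stub model: wrong on the first turn, corrected once it sees feedback."""
    if len(msgs) == 1:
        return Reply("end_turn", {"email": "none", "seats": 3})
    return Reply("end_turn", {"email": "ana@example.com", "seats": 3})


print(extract(fake_call, "Book 3 seats for Ana"))
```

Swapping fake_call for a real client wrapper is the only change needed to exercise the same control flow against a live model.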
Failure mode: blindly trusting "structured output can't be wrong," skipping validation and truncation checks, and piping r.data straight downstream. One day an extra-long input triggers truncation and half a JSON object lands in the database, or "none" gets treated as an email and fires an alert — legal structure masking semantic garbage, and the bug is especially hard to trace.
// Capstone · Upgrade a Fragile Extractor to Production Grade
Take any script you have that "asks the model to return JSON" and harden it layer by layer per this issue's four points — half an hour takes it from demo grade to production grade:
Swap the mechanism (§1): move from prompt-for-JSON / JSON mode up to constrained decoding (output_format / response_format), and delete the legacy regex fence-stripping and parse-retry code.
Order the fields (§2): if the task involves reasoning, put a reasoning field in the schema before the answer; for reasoning-heavy tasks switch to two stages with an unconstrained CoT stage.
Fix the schema (§3): add a description to every field, swap classification fields to enums, flatten nesting, change optionals to "explicit null."
Build an eval: produce 20 inputs with ground truth and compare "schema compliance × semantic correctness" before and after. You'll find compliance was already 100% — what actually improves is semantic correctness.
After this, your mental model of "structured output" shifts from "beg the model for JSON" to "pin the structure at the decode layer, instruct in the schema, backstop semantics at the outer layer" — which is exactly the line between demo and production.
// KEY TERMS
Structured Output
Making a model emit output conforming to a predefined schema rather than free text. This issue's subject.
JSON Mode
A weak mode where the API guarantees valid JSON but not conformance to your schema.
Constrained Decoding
Masking illegal tokens by grammar at the decode layer, physically guaranteeing legal structure. Aka guided/grammar-constrained decoding.
Schema Compliance
The fraction of outputs matching a given JSON Schema. Constrained decoding can reach 100%; function calling measures ~86%.
CFG / FSM
Context-free grammar / finite-state machine. Constrained decoding compiles a schema into these to generate the token mask.
Prefill
Pre-filling the assistant's opening (e.g. {) to force the model into JSON; a weak-guarantee tactic.
Field Ordering
The generation order of schema fields; putting reasoning before answer preserves CoT within the structure.
Refusal Path
The branch a model takes when refusing; it returns no target schema and needs separate handling.
Validate-Repair Loop
A fault-tolerance loop that feeds validation errors back for self-repair, echoing harness recovery.
additionalProperties
A JSON Schema keyword; set it to false to forbid the model from adding fields. Often required by strict mode.
// DEEP DIVE
If constrained decoding can guarantee 100% legal structure, why don't tool use / function calling simply always use it instead of tolerating an ~86% compliance rate?
History and trade-offs. Early function calling relied on training the model to "prefer" schema-shaped output, a text-layer preference rather than a hard decode-layer constraint, so some invalid outputs slip through. Constrained decoding requires the inference engine to compile the schema into a grammar and maintain a token mask, which adds engineering cost and latency overhead and offers only limited support for dynamic or recursive schemas. Major platforms now offer strict constraints as an optional mode layered on top of tool use (OpenAI's strict, Anthropic's structured outputs). The direction is to merge the two, but strict mode is not forced by default, to stay compatible with old interfaces and complex-schema scenarios.
Let Me Speak Freely and the .txt rebuttal look opposed. If you had to design one experiment to settle it, how would you control variables?
The key confound is "the constraint itself" vs "the prompt/field-order changes the constraint drags along." I'd fix model and temperature and build three arms: (A) free CoT + free answer; (B) free CoT field first + constrained answer field after; (C) pure constraint with answer field first. If B≈A>C, the harm comes from squeezing out reasoning, not the constraint (supports .txt); if A>B≈C, the constraint itself is lossy (supports the original paper). Add an arm controlling presence/absence of few-shot structure examples to isolate "the model hasn't seen this structure." Most reproductions point to: field layout and example quality dominate, not the mere presence of a constraint.
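The decision rule for the three arms can be written down before running anything, which keeps the reading of the results honest. The sketch below encodes it with an arbitrary tolerance; interpret and the 0.02 default are assumptions, and a real study would use a significance test rather than a fixed threshold.

```python
def interpret(acc_a, acc_b, acc_c, tol=0.02):
    """Map accuracies of arms A, B, C to the hypothesis they support.

    A: free CoT + free answer
    B: reasoning field first, constrained answer after
    C: constrained, answer field first
    """
    def close(x, y):
        return abs(x - y) <= tol

    if close(acc_a, acc_b) and acc_b - acc_c > tol:
        return "ordering"      # B≈A>C: squeezing out reasoning is what hurts
    if acc_a - acc_b > tol and close(acc_b, acc_c):
        return "constraint"    # A>B≈C: the constraint itself is lossy
    return "inconclusive"


print(interpret(0.80, 0.79, 0.65))
print(interpret(0.80, 0.66, 0.65))
```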
Constrained decoding masks illegal tokens at each step. Could it erase a token the model finds high-probability but "currently illegal," forcing it into a low-quality path?
Yes — this is the real cost of constrained decoding, called distribution distortion. The FSM only sees "is this legal grammar," not "is this better semantics." If the optimal token is illegal right now, it's masked, the model is forced to sample among the remaining legal tokens, and it may slide onto a low-probability branch and "snowball" further off course. Mitigations: don't over-restrict the schema (avoid narrow patterns / huge enums), provide enough structure examples so the model's high-probability distribution naturally lands in the legal region, and keep reasoning outside the constraint. This also explains why "stricter schema is better" is an illusion — over-strictness amplifies the distortion.
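A small numeric sketch shows the distortion. The probabilities below are made up for illustration, and renormalize is a hypothetical helper that mimics what masking does to the next-token distribution.

```python
def renormalize(probs, legal):
    """Drop illegal tokens and rescale the survivors so they sum to 1."""
    kept = {tok: p for tok, p in probs.items() if tok in legal}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# The model's preferred token is outside the enum, so it gets masked.
model_probs = {'"unclear"': 0.70, '"pos"': 0.18, '"neg"': 0.12}
constrained = renormalize(model_probs, legal={'"pos"', '"neg"'})

discarded = 1 - sum(p for tok, p in model_probs.items() if tok in constrained)
print(constrained, round(discarded, 2))
```

Here roughly 70% of the probability mass is thrown away, and a token the model gave 0.18 becomes the clear favorite. The output is legal, but it is no longer what the model most wanted to say.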
If the downstream can accept both, would you have the agent return structured results inside a tool call, or in the final message as structured output? What's the engineering difference?
A tool call carries the semantics of "the agent wants to invoke an external capability," and the result is fed back into the loop for the agent to continue; a final structured output is "this turn's deliverable." If the result is to be consumed by the agent itself and feed later decisions, a tool call is more natural (and carries schema validation natively); if it's a terminal deliverable for an external consumer (API response, DB write), a final structured output is more direct, saving a round trip through the model. The trap in mixing: forcing a terminal deliverable into a tool call makes the loop spin once more and requires extra logic to judge "is this tool call a real invocation or just returning a result."
Structured output turns an LLM into a "type-safe function." What does this mean for system architecture — does it move where we draw the line between AI and deterministic code?
It means an LLM can be embedded into a traditional type system as an impure function with a type signature: input text, output a schema-validated typed object. The boundary can thus be drawn finer — no longer the coarse "AI module vs code module" isolation, but a type contract at each call site. But stay clear-eyed: type safety ≠ semantic safety; a schema blocks KeyError, not "right type, hallucinated value." So the new boundary is: use structured output to kill "shape uncertainty" at the decode layer, and pour the saved effort into validating and evaluating "value correctness." Architecturally this pushes AI calls to look more and more like RPC — with schemas, retries, fallbacks, and SLAs.