Streaming isn't about showing the answer earlier — it's about turning a half-formed, ever-shifting, possibly-cut-off intermediate state into an engineerable object.
Almost every LLM product ships streaming, but most people use only its shallowest layer: typing tokens onto the screen one by one. The real engineering difficulty lies beyond the typewriter effect — every frame that arrives is incomplete: a half-rendered Markdown table, JSON cut off mid-quote, a tool call whose arguments aren't fully collected, and a "Stop" button the user can hit at any moment. None of these are prompt problems; they're state-management problems in the harness layer. Treating streaming as "show it sooner" buries three classes of production incidents: feeding half a JSON to parse and crashing, executing a tool input that hasn't finished generating, and the user closing the tab while the server keeps burning tokens and firing side effects. This issue nails down four things: streaming is a perception optimization, not a throughput one; incremental parsing must tolerate the partial; the buffering trade-off in streaming tool calls; and rolling back side effects on cancellation.
An LLM call's latency splits into two parts: TTFT (time to first token, from request to the first token landing) and generation time (inter-token latency × token count). Streaming changes neither term, so total wall-clock time is nearly identical; it only lets the user start reading right after TTFT instead of waiting for the whole response. Research and product experience converge: for the same total time, a streaming interface is perceived as far faster, because it eliminates the worst stretch, staring at a blank screen.
This essence yields two counterintuitive decisions. First: monitor and optimize p95 of TTFT, not total latency — an endpoint that takes 10s total but emits the first token in 200ms feels far better than one that takes 4s total but stalls 3s before moving. Second: streaming does not raise throughput; turning it on for a pure backend batch job is pointless and only adds parsing complexity.
import time, anthropic
client = anthropic.Anthropic()
def render(text):                        # placeholder for the UI's incremental paint
    print(text, end="", flush=True)
t0 = time.time(); ttft = None
with client.messages.stream(model="claude-opus-4-8",
                            max_tokens=1024,
                            messages=[{"role": "user", "content": "Write..."}]) as stream:
    for text in stream.text_stream:
        if ttft is None:
            ttft = time.time() - t0      # ← the metric to actually watch
            print(f"TTFT={ttft*1000:.0f}ms")
        render(text)                     # incremental paint
total = time.time() - t0                 # total time: streaming can't save it
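The first decision above (watch the p95 of TTFT rather than total latency) can be sketched offline. This is a minimal illustration: the timing samples are made up for the example, and `percentile` and `summarize` are hypothetical helpers using a simple nearest-rank definition, not part of any SDK.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceil(pct * n / 100), 1-based
    return ordered[rank - 1]

# Hypothetical per-request timings in seconds: (ttft, total)
endpoint_a = [(0.2, 10.0)] * 18 + [(0.4, 10.5)] * 2   # slow overall, fast first token
endpoint_b = [(3.0, 4.0)] * 18 + [(3.5, 4.5)] * 2     # fast overall, long blank screen

def summarize(requests):
    """Report the two numbers side by side so the TTFT tail is visible."""
    ttfts = [ttft for ttft, _ in requests]
    totals = [total for _, total in requests]
    return {"ttft_p95": percentile(ttfts, 95), "total_p95": percentile(totals, 95)}

summary_a = summarize(endpoint_a)
summary_b = summarize(endpoint_b)
```

Ranking by `total_p95` would favor endpoint B, while ranking by `ttft_p95` favors endpoint A, which is the one the text argues feels faster.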
Plain-text streaming is easy (just append). The moment output has structure, it gets hard. Standard Markdown renderers, JSON.parse, XML parsers all assume input is complete and valid — but in a stream you always hold an intermediate state: an unclosed code block, a half-written table, a string cut off at "add. Hand that to a strict parser and you get one of two outcomes: it throws and crashes, or it waits for completeness before rendering — and the latter erases streaming's entire value.
The fix is progressive parsing designed for the partial: render an open container the instant a code block starts, draw a table row as each arrives, show a temporary closing quote for an unclosed string. Vercel's Streamdown packages this as a streaming replacement for react-markdown — it assumes input is never complete and "remends" unterminated syntax before rendering. Structured data is the same: use a tolerant JSON parser (partial JSON / trailing-strings mode) that turns {"city":"Bei into a partial object like {"city":"Bei"}, letting the UI fill fields as they stream.
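For the Markdown half of this idea, a much-simplified sketch may help. It is not Streamdown's implementation: `remend` here is a hypothetical helper that handles only one case, an unclosed fenced code block, and it assumes fences are written with three backticks at the start of a line.

```python
FENCE = "`" * 3   # three backticks, built this way to keep the example readable

def remend(markdown: str) -> str:
    """Return a renderable copy of partial Markdown by closing an open code fence."""
    open_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            open_fence = not open_fence       # each fence line toggles the state
    if open_fence:
        separator = "" if markdown.endswith("\n") else "\n"
        return markdown + separator + FENCE   # temporary close, for display only
    return markdown

def render_frames(chunks):
    """Yield one renderable snapshot per incoming chunk."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        yield remend(buf)                     # buf itself is never modified
```

The design point is that the repair is applied to a copy on every frame: the accumulated buffer stays exactly as streamed, so the temporary closing fence disappears as soon as the real one arrives.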
# Tolerant parsing: partial-json turns half JSON into a partial object
from partial_json_parser import loads
buf = ""
for chunk in stream:                 # chunks don't respect JSON boundaries
    buf += chunk
    try:
        obj = loads(buf)             # even half-formed yields a partial object
        render_partial(obj)          # render each field as it lands
    except Exception:
        continue                     # this frame isn't enough; wait for the next
# Key: only the obj after the stream ends is the "final state"; mid-stream is preview
Two common mistakes. (1) Running strict JSON.parse on every frame and swallowing the throw with try/catch: it runs, but you re-parse for nothing each frame and risk mistaking a momentarily-valid intermediate for the final. (2) Worse: treating a partially parsed result as final and firing a side effect, e.g. executing a delete the instant you stream {"action":"delete", when the full intent was {"action":"delete_draft","confirm":false}. An intermediate state is for rendering preview only, never to drive decisions or side effects.
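One way to enforce the "preview only" rule is to route every decision through a single gate that refuses to return anything until the stream has ended and the payload validates. The sketch below is an assumption-laden illustration: `finalize` and `REQUIRED_FIELDS` are hypothetical names, and the two-field schema is borrowed from the delete_draft example above.

```python
import json

REQUIRED_FIELDS = {"action", "confirm"}   # hypothetical schema for this example

def finalize(buf: str, stream_done: bool):
    """Return the parsed command only when it is safe to act on; otherwise None."""
    if not stream_done:
        return None                       # mid-stream: preview only, never a decision
    try:
        obj = json.loads(buf)             # strict parse, once, after the stream ends
    except json.JSONDecodeError:
        return None                       # truncated or malformed: never act
    if not isinstance(obj, dict) or not REQUIRED_FIELDS.issubset(obj):
        return None                       # parses, but the intent is incomplete
    return obj
```

With this gate, the tolerant parser can keep feeding the UI on every frame, while side-effecting code only ever sees the value returned by `finalize`.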
In Anthropic's SSE stream, a response is a series of content blocks, each shaped content_block_start → many content_block_delta → content_block_stop. Text flows as text_delta; tool arguments flow as input_json_delta, in a partial_json field — you must concatenate the partial_json fragments in arrival order to reconstruct the argument JSON, and these fragments don't respect JSON boundaries (they can break at any character). By default the SDK buffers to content_block_stop, where you finally get validated, complete JSON — safe but latent: large arguments (say, having the model write a big chunk of file content) wait until the whole thing is accumulated.
To remove that wait, Anthropic offers fine-grained tool streaming (beta header fine-grained-tool-streaming-2025-05-14): arguments stream out without buffering or JSON validation, sharply cutting first-byte latency for large parameters. The cost is in the docs: no guarantee the stream is valid JSON — if max_tokens is reached, the stream may stop mid-parameter, leaving you broken JSON that can never be completed. It's a clean trade: lower latency ⇄ handle the partial yourself.
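import json
acc, name = "", None
for ev in stream:
    if ev.type == "content_block_start" and ev.content_block.type == "tool_use":
        name, acc = ev.content_block.name, ""   # new tool call: remember its name, reset the buffer
    elif ev.type == "content_block_delta" \
            and ev.delta.type == "input_json_delta":
        acc += ev.delta.partial_json            # concat in order; don't parse early
    elif ev.type == "content_block_stop" and name is not None:
        args = json.loads(acc or "{}")          # default mode: valid here (empty buffer = no arguments)
        run_tool(name, args)                    # ← the only safe execution point
        name = None
# Under fine-grained beta, acc may still be broken JSON at stop;
# use a tolerant parser + validation, and on failure treat the call
# as incomplete and DO NOT execute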
Two common mistakes. (1) Executing on partial_json before content_block_stop: arguments aren't collected, so you act on half an intent. (2) Enabling fine-grained yet still assuming the end is valid JSON, skipping the broken-input check: once max_tokens truncates, json.loads throws, or worse, a tolerant parser fabricates a semantically corrupted argument that gets executed. Fine-grained presumes you can safely handle a half-formed parameter.
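The broken-input check can be made explicit by having the assembly step return a status alongside the arguments, so the caller cannot execute without looking at it. This is a sketch under stated assumptions: `assemble_tool_input`, the status strings, and the `required` field set are all hypothetical, and the fragments are plain strings standing in for the partial_json values of real events.

```python
import json

def assemble_tool_input(fragments, saw_stop: bool, required: set):
    """Concatenate partial_json fragments and return (args, status)."""
    acc = "".join(fragments)              # fragments may split at any character
    if not saw_stop:
        return None, "incomplete"         # no content_block_stop seen: never execute
    try:
        # Strict parse only. Treating an empty buffer as "no arguments" is an assumption.
        args = json.loads(acc or "{}")
    except json.JSONDecodeError:
        return None, "broken_json"        # e.g. the stream was truncated mid-parameter
    if not isinstance(args, dict) or not required.issubset(args):
        return None, "missing_fields"     # valid JSON, but not the full intent
    return args, "ok"

fragments = ['{"pa', 'th":"a.txt","con', 'tent":"hello"}']
args, status = assemble_tool_input(fragments, True, {"path", "content"})
```

Only the "ok" status should reach the tool runner; every other status is a reason to mark the call incomplete and stop.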
Cancellation hides a truth: the client no longer looking is not the same as generation stopping. To actually stop the bleeding, you must use an AbortController (or the SDK's cancel) to explicitly abort that HTTP/SSE connection so the server sees the disconnect and stops generating. Closing the UI and discarding later chunks isn't enough; that just means you stopped looking while tokens still burn. This is the first layer: stop the bleed.
The second layer is harder, in streaming agents: at the moment of cancellation, a tool may already have executed a side effect (order created, email sent, file half-written). Streaming sharpens this because output lands as it generates — where the cancel point falls is uncertain. Three principles: (a) explicitly mark the already-streamed partial output as partial/aborted — never persist it as a complete result; (b) side-effecting tools must be idempotent, and already-fired effects go through compensation/rollback (see Day 39); (c) pair a long stream with a checkpoint, so "cancel" means "stop at a clean breakpoint" rather than "freeze at an arbitrary half-state."
// Frontend: truly abort generation, not just drop chunks
const ctrl = new AbortController();
const stream = await client.messages.stream(
  { model: "claude-opus-4-8", max_tokens: 2048, messages },
  { signal: ctrl.signal });               // ← bind the cancel signal
stopBtn.onclick = () => ctrl.abort();     // disconnect; server stops generating
let acc = "";
try {
  for await (const ev of stream) {
    // raw stream events carry text inside content_block_delta / text_delta
    if (ev.type === "content_block_delta" && ev.delta.type === "text_delta")
      acc += ev.delta.text;
  }
  commit(acc);                            // only a normal finish is the complete state
} catch (e) {
  if (!ctrl.signal.aborted) throw e;      // not a cancel: surface the real error
  save_as_partial(acc);                   // mark partial; trigger effect compensation
}
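Principle (b) above, idempotent effects plus compensation, can be sketched as a small ledger that the agent routes every side-effecting tool through. Everything here is hypothetical and in-memory: `EffectLedger`, its method names, and the toy `orders` list are illustrative only, and a real system would persist the ledger and handle compensations that can themselves fail.

```python
class EffectLedger:
    """Records side effects by idempotency key and undoes them in reverse order."""

    def __init__(self):
        self._done = {}    # idempotency key -> result of the effect
        self._undo = []    # (key, compensation) in execution order

    def run(self, key, effect, compensate):
        if key in self._done:             # same key again: do not fire twice
            return self._done[key]
        result = effect()
        self._done[key] = result
        self._undo.append((key, compensate))
        return result

    def rollback(self):
        """Called on cancel: compensate already-fired effects, last one first."""
        rolled_back = []
        while self._undo:
            key, compensate = self._undo.pop()
            compensate(self._done.pop(key))
            rolled_back.append(key)
        return rolled_back

orders = []

def create_order():
    orders.append("order-1")
    return "order-1"

ledger = EffectLedger()
ledger.run("create-order", create_order, orders.remove)
ledger.run("create-order", create_order, orders.remove)   # retry: ignored
```

On an abort, the handler that saves the partial output would also call `ledger.rollback()`, so cancellation correctness depends on the ledger rather than on where the cancel point happened to fall.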
Take any streaming chat/agent you run and check it against this list — each item maps to a failure mode above:
- Accumulate input_json_delta, then parse and run at content_block_stop; enable fine-grained only if you truly need it and can safely handle the partial.
- Use AbortController to actually disconnect, not just stop rendering; verify the server logs show generation actually halting.
- Mark interrupted output as partial: not persisted, not trusted, not driving downstream.
Do this and you'll see: making streaming "look like it's moving" takes ten lines; making it survive all three forms of partial (half-formed, truncated, cut off) is the real line between demo and production.
A single tool input is only complete at content_block_stop, and that can't be broken; acting on half an argument is a pure bug. But an agent's "think-and-act" happens at a coarser granularity: once one tool call fully completes, the agent immediately issues the next, without waiting for the whole reply. So the correct boundary is: a structure (a single tool input) must be complete to be used, but a sequence (multiple tool calls) can advance in a stream. Fine-grained streaming tries to push the line inward a bit, letting huge arguments be consumed as they stream, but it explicitly requires you to prove you can safely handle the partial, shifting completeness validation from platform to you.
[...] (2) treat max_tokens and the server-side timeout as a hard upper bound, so the worst case of "cancel failed" is bounded; (3) build cancellation semantics on idempotency + compensation rather than on "stopped successfully", i.e. assume generation may still be running and rely on a rollback-able side-effect layer. In short: cancel is best-effort bleeding control; correctness rests on the Day 39 resilience stack, not on disconnect always landing.
Enable fine-grained streaming when the parameter is large content: even if it is truncated at max_tokens, the already-streamed portion is usable (just mark it incomplete). Never enable it when the parameter is a small control structure, e.g. {"action":..., "target_id":..., "confirm":...}: it is small, so the buffer latency is negligible, yet its fields are tightly coupled, so a missing field corrupts semantics or turns dangerous. The test: is the parameter "content" (large, incrementally consumable, partial-usable) or an "instruction" (small, tightly coupled, must be complete)? Content: enable. Instruction: don't.