DAY 41 / PHASE 4 · PRODUCTION

Streaming & Interruption Engineering

TTFT · Partial Parsing · Streaming Tool Use · Cancellation

2026-06-20 · BigCat

Streaming isn't about showing the answer earlier — it's about turning a half-formed, ever-shifting, possibly-cut-off intermediate state into an engineerable object.

// WHY THIS MATTERS

Almost every LLM product ships streaming, but most people use only its shallowest layer: typing tokens onto the screen one by one. The real engineering difficulty lies beyond the typewriter effect — every frame that arrives is incomplete: a half-rendered Markdown table, JSON cut off mid-quote, a tool call whose arguments aren't fully collected, and a "Stop" button the user can hit at any moment. None of these are prompt problems; they're state-management problems in the harness layer. Treating streaming as "show it sooner" buries three classes of production incidents: feeding half a JSON to parse and crashing, executing a tool input that hasn't finished generating, and the user closing the tab while the server keeps burning tokens and firing side effects. This issue nails down four things: streaming is a perception optimization, not a throughput one; incremental parsing must tolerate the partial; the buffering trade-off in streaming tool calls; and rolling back side effects on cancellation.

// 01

The Essence of Streaming: It Optimizes TTFT, Not Total Time

Claim: streaming shaves off zero total generation time; it swaps a stretch of "nothing happening" for "continuous feedback" — what it fools is perception.

Background & Principle

An LLM call's latency splits into two parts: TTFT (time to first token, from sending the request to the first token landing) and generation time (inter-token latency × token count). Streaming shrinks neither term, so wall-clock time is nearly identical; it just lets the user start reading right after TTFT instead of waiting for the whole response. Research and product experience converge: for the same total time, a streaming interface is perceived as far faster, because it eliminates the worst stretch, staring at a blank screen.

This essence yields two counterintuitive decisions. First: monitor and optimize p95 of TTFT, not total latency — an endpoint that takes 10s total but emits the first token in 200ms feels far better than one that takes 4s total but stalls 3s before moving. Second: streaming does not raise throughput; turning it on for a pure backend batch job is pointless and only adds parsing complexity.
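To make "monitor p95 of TTFT" concrete, here is a minimal sketch of computing percentiles over collected samples. The sample values and the helper name ttft_percentiles are made up for illustration; in practice the samples would come from your own request logs.

```python
import statistics

def ttft_percentiles(samples_ms):
    """Return (p50, p95) of TTFT samples given in milliseconds."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

# Hypothetical samples: mostly fast first tokens, with a slow tail.
samples = [180, 200, 210, 190, 220, 205, 195, 215, 2900, 3100]
p50, p95 = ttft_percentiles(samples)
mean = statistics.mean(samples)
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  mean={mean:.0f}ms")
```

The median looks healthy while the tail does not, which is why an average or a p50 alone can hide the requests where users actually stare at a blank screen.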

Same total time, wildly different feel: optimize TTFT, not total

No-stream  [████████ blank 8s wait ████████] → dumped at once
           user: "Did it freeze?"
Stream     [▌200ms] → char▌char▌char▌char▌char▌char (over 8s)
           user: "It's moving, feels fast" (same 8s)

Hands-on

import time
import anthropic

client = anthropic.Anthropic()                 # reads ANTHROPIC_API_KEY from the environment

def render(text):                              # stand-in for your UI's incremental paint
    print(text, end="", flush=True)

t0 = time.perf_counter()                       # monotonic clock, safe for measuring intervals
ttft = None
with client.messages.stream(model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{"role":"user","content":"Write..."}]) as stream:
    for text in stream.text_stream:
        if ttft is None:
            ttft = time.perf_counter() - t0    # ← the metric to actually watch
            print(f"TTFT={ttft*1000:.0f}ms")
        render(text)                           # incremental paint
total = time.perf_counter() - t0               # total time: streaming can't save it

Failure mode: reporting streaming as a perf win ("latency dropped"). Total latency didn't drop at all; you optimized perception. If TTFT itself is high (prompt too long, no prompt caching, model deep in CoT), streaming won't save you: the user still stares at a blank screen. The fix is to cut TTFT first (cache the prefix, shorten the system prompt, stream a placeholder/skeleton), not to bet on the typewriter.

Resources · Redis Streaming LLM Responses: Make Your AI App Feel Fast, redis.io/blog/streaming-llm-responses · DeepInfra LLM Provider KPIs 101: TTFT, Throughput, E2E, deepinfra.com/blog

// 02

Incremental Parsing: Treat Every Frame as Partial; a Strict Parser Kills Streaming

Claim: feeding stream chunks to a parser that demands complete input degrades streaming back to buffering — with added crash risk.

Background & Principle

Plain-text streaming is easy (just append). The moment output has structure, it gets hard. Standard Markdown renderers, JSON.parse, XML parsers all assume input is complete and valid — but in a stream you always hold an intermediate state: an unclosed code block, a half-written table, a string cut off at "add. Hand that to a strict parser and you get one of two outcomes: it throws and crashes, or it waits for completeness before rendering — and the latter erases streaming's entire value.

The fix is progressive parsing designed for the partial: render an open container the instant a code block starts, draw a table row as each arrives, show a temporary closing quote for an unclosed string. Vercel's Streamdown packages this as a streaming replacement for react-markdown — it assumes input is never complete and "remends" unterminated syntax before rendering. Structured data is the same: use a tolerant JSON parser (partial JSON / trailing-strings mode) that turns {"city":"Bei into a partial object like {"city":"Bei"}, letting the UI fill fields as they stream.
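To show what "tolerant" means mechanically, here is a minimal hand-rolled sketch that closes an unterminated string and any open brackets before handing the text to the strict parser. It is an illustration only, not how Streamdown or any partial-JSON library is actually implemented, and parse_partial is a hypothetical name. It deliberately gives up (returns None) when the cut falls somewhere it cannot repair, such as right after a colon.

```python
import json

def parse_partial(buf: str):
    """Best-effort parse of a truncated JSON prefix; None if not parseable yet."""
    stack = []            # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in buf:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    repaired = buf
    if in_string:
        if escaped:
            repaired = repaired[:-1]      # drop a dangling backslash
        repaired += '"'                   # temporary closing quote
    repaired += "".join(reversed(stack))  # close containers, innermost first
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None                       # cut mid-key, after ':' or ',': wait for more

preview = parse_partial('{"city":"Bei')
not_yet = parse_partial('{"city":')
nested = parse_partial('{"city":"Beijing","tags":["a","b')
```

Each result is a preview of the current frame; the caller re-runs the function on the growing buffer as chunks arrive.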

Hands-on

# Tolerant parsing: partial-json-parser turns half-formed JSON into a partial object
from partial_json_parser import loads   # pip install partial-json-parser

def render_partial(obj):                # stand-in for your UI update
    print(obj)

stream = ['{"city":"Bei', 'jing","temp', '":21}']   # stand-in chunks; real ones come from the model

buf = ""
obj = None
for chunk in stream:            # chunks don't respect JSON boundaries
    buf += chunk
    try:
        obj = loads(buf)            # even half-formed input yields a partial object
    except Exception:
        continue                    # this frame isn't enough; wait for the next
    render_partial(obj)             # render each field as it lands
# Key: only the obj after the stream ends is the "final state"; mid-stream is preview

Failure mode: (1) running strict JSON.parse on every frame and swallowing the throw with try/catch. It runs, but you re-parse for nothing each frame and risk mistaking a momentarily-valid intermediate for the final. (2) Worse: treating a partially parsed result as final and firing a side effect, e.g. executing a delete the instant the stream reads {"action":"delete (which a tolerant parser closes into {"action":"delete"}), when the full intent was {"action":"delete_draft","confirm":false}. An intermediate state is for rendering a preview only, never for driving decisions or side effects.
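One way to enforce the "preview only" rule is structurally: give intermediate buffers a callback that can only render, and reach the acting callback only after the stream has ended and a strict parse has succeeded. The sketch below assumes hypothetical on_preview / on_final callbacks and uses two lists to record what each one saw.

```python
import json

def consume(chunks, on_preview, on_final):
    """Pass every intermediate buffer to on_preview; call on_final once, at the end."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        on_preview(buf)          # rendering only; must not trigger side effects
    final = json.loads(buf)      # strict parse: a truncated stream raises here
    on_final(final)              # the only place allowed to act
    return final

previews, executed = [], []
chunks = ['{"action":"delete', '_draft","confirm":false}']
consume(chunks, previews.append, executed.append)
```

The first preview contains the misleading prefix {"action":"delete, but nothing acts on it; the executor only ever sees the complete delete_draft object. If the stream is cut short, json.loads raises and on_final is never called.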

Resources · Vercel Streamdown — Markdown for AI streaming, streamdown.ai / github.com/vercel/streamdown · Vercel AI SDK streamObject / partial object, ai-sdk.dev/docs

// 03

Streaming Tool Calls: Accumulating input_json_delta, and the Fine-grained Trade-off

Claim: tool-argument streaming defaults to "buffer until you get valid JSON"; to get large arguments sooner, enable fine-grained — and own the risk of broken JSON yourself.

Background & Principle

In Anthropic's SSE stream, a response is a series of content blocks, each shaped content_block_start → many content_block_delta → content_block_stop. Text flows as text_delta; tool arguments flow as input_json_delta, in a partial_json field. You must concatenate the partial_json fragments in arrival order to reconstruct the argument JSON, and the fragments don't respect JSON boundaries (they can break at any character). By default the arguments are buffered and validated before they reach you, so the accumulated string is complete, valid JSON by content_block_stop. That is safe but slow: a large argument (say, the model writing a big chunk of file content) is held back until it has been fully generated.

To remove that wait, Anthropic offers fine-grained tool streaming (beta header fine-grained-tool-streaming-2025-05-14): arguments stream out without buffering or JSON validation, sharply cutting first-byte latency for large parameters. The cost is in the docs: no guarantee the stream is valid JSON — if max_tokens is reached, the stream may stop mid-parameter, leaving you broken JSON that can never be completed. It's a clean trade: lower latency ⇄ handle the partial yourself.

A streaming response with a tool call: how events rebuild the args

message_start
└ content_block_start (type=tool_use, name=edit_file)
  ├ delta: partial_json = '{"path":"a.'
  ├ delta: partial_json = 'py","content":"def '
  ├ delta: partial_json = 'main():..."}'     ← fragments unaligned
  └ content_block_stop                        ← default: validate + deliver full JSON here
message_delta (stop_reason)
message_stop

Default:      parse only at stop  → safe, slow
Fine-grained: use each delta now  → fast, may be broken (max_tokens cut)

Hands-on

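import json

acc, name = "", None
for ev in stream:                             # stream: the event iterator from the SDK
    if ev.type == "content_block_start" \
       and ev.content_block.type == "tool_use":
        acc, name = "", ev.content_block.name # new tool call: reset buffer, remember the tool
    elif ev.type == "content_block_delta" \
       and ev.delta.type == "input_json_delta":
        acc += ev.delta.partial_json          # concat in order; don't parse early
    elif ev.type == "content_block_stop" and name is not None:
        args = json.loads(acc or "{}")        # default mode: valid here; empty buffer = no arguments
        run_tool(name, args)                  # ← the only safe execution point
        name = None                           # text blocks also emit stop; ignore those
# Under fine-grained beta, acc may still be broken JSON at stop;
# use a tolerant parser + validation, and on failure treat the call
# as incomplete and DO NOT execute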

Failure mode: (1) parsing and executing the accumulated partial_json before content_block_stop — arguments aren't collected, so you act on half an intent. (2) Enabling fine-grained yet still assuming the end is valid JSON, skipping the broken-input check — once max_tokens truncates, json.loads throws, or worse, a tolerant parser fabricates a semantically corrupted argument that gets executed. Fine-grained presumes you can safely handle a half-formed parameter.
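The broken-input check under fine-grained streaming can be as small as a strict parse plus a required-field check at content_block_stop. This is a sketch under assumptions: finalize_tool_args and the required-key set are hypothetical, and a real harness would validate against the tool's full input schema rather than just key presence.

```python
import json

def finalize_tool_args(acc: str, required: set):
    """Return parsed args only if acc is complete JSON with every required key; else None."""
    try:
        args = json.loads(acc)       # strict on purpose: never repair a truncated argument
    except json.JSONDecodeError:
        return None                  # e.g. cut off by max_tokens mid-parameter
    if not isinstance(args, dict) or not required.issubset(args):
        return None                  # parses, but the intent is incomplete
    return args

required = {"path", "content"}
complete = finalize_tool_args('{"path":"a.py","content":"def main(): pass"}', required)
truncated = finalize_tool_args('{"path":"a.py","content":"def ', required)
missing = finalize_tool_args('{"path":"a.py"}', required)
```

A None result means "treat the call as incomplete and do not execute". A tolerant parser is still fine for previewing the truncated buffer, but its output should not reach run_tool.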

Resources · Anthropic Streaming Messages, docs.claude.com/.../streaming · Anthropic Fine-grained tool streaming, docs.claude.com/.../fine-grained-tool-streaming

// 04

Cancellation & Rollback: Disconnect ≠ Stop; Side Effects Already Fired Must Be Compensated

Claim: a user hitting "Stop" or closing the tab only severs your receiving end; the server may keep generating, keep billing, and a tool may already have committed its side effect.

Background & Principle

Cancellation hides a truth: client disconnect ≠ generation stop. To actually stop the bleeding, you must use an AbortController (or the SDK's cancel) to explicitly abort that HTTP/SSE connection so the server gets the disconnect and stops generating — closing the UI and discarding later chunks isn't enough; that just means you stopped looking while tokens still burn. This is the first layer: stop the bleed.

The second layer is harder, in streaming agents: at the moment of cancellation, a tool may already have executed a side effect (order created, email sent, file half-written). Streaming sharpens this because output lands as it generates — where the cancel point falls is uncertain. Three principles: (a) explicitly mark the already-streamed partial output as partial/aborted — never persist it as a complete result; (b) side-effecting tools must be idempotent, and already-fired effects go through compensation/rollback (see Day 39); (c) pair a long stream with a checkpoint, so "cancel" means "stop at a clean breakpoint" rather than "freeze at an arbitrary half-state."
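Principle (b) can be sketched as a small effect log: each side-effecting call is keyed so a retry does not fire twice, and each registers a compensation that a cancel replays in reverse order. EffectLog, create_order, and the in-memory orders list are all hypothetical stand-ins; a real system would persist the log and call the actual service's undo operation.

```python
class EffectLog:
    """Records executed side effects so a cancel can undo them, newest first."""

    def __init__(self):
        self._done = {}   # idempotency key -> result of the first execution
        self._undo = []   # compensation callbacks, in execution order

    def run(self, key, action, compensate):
        if key in self._done:                 # same call replayed: no second effect
            return self._done[key]
        result = action()
        self._done[key] = result
        self._undo.append(lambda: compensate(result))
        return result

    def rollback(self):
        while self._undo:
            self._undo.pop()()                # last effect is compensated first

orders = []                                   # stand-in for an external system

def create_order(name):
    orders.append(name)
    return name

log = EffectLog()
log.run("order-1", lambda: create_order("A"), orders.remove)
log.run("order-1", lambda: create_order("A"), orders.remove)  # retry, deduplicated
log.run("order-2", lambda: create_order("B"), orders.remove)
before_cancel = list(orders)
log.rollback()                                # user hit Stop
```

Some effects (a sent email) have no true inverse; for those the compensation is a follow-up action such as a correction notice, which is one more reason to place such tools behind a confirmation step.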

After "Stop" is hit, the three things to actually do

User cancels
 │
 ├─① abort connection → server stops generating (else still billing/running)
 ├─② mark output      → tokens received = partial; don't persist or trust
 └─③ handle effects   → compensate executed tools; discard unexecuted ones
                         plus idempotency keys to avoid a "resend" second hit

Hands-on

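// Frontend: truly abort generation, not just drop chunks
const ctrl = new AbortController();
const stream = await client.messages.stream(
    { model: "claude-opus-4-8", max_tokens: 2048, messages },
    { signal: ctrl.signal });                  // ← bind the cancel signal

stopBtn.onclick = () => ctrl.abort();          // disconnect; server stops generating

let acc = "";
try {
    for await (const ev of stream) {
        // raw events carry text in content_block_delta / text_delta, not in ev.text
        if (ev.type === "content_block_delta" && ev.delta.type === "text_delta")
            acc += ev.delta.text;
    }
} catch (e) {
    if (!ctrl.signal.aborted) throw e;         // a real failure, not a user cancel
}
// Decide from the signal itself rather than the error's name
if (ctrl.signal.aborted) save_as_partial(acc); // mark partial; trigger effect compensation
else commit(acc);                              // only a normal finish is the complete state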

Failure mode: (1) assuming "close the tab / drop chunks" stops it. Without an abort signal the server often runs the whole generation to completion: the bill stands, and tool side effects that were already queued fire anyway. (2) After cancel, persisting or returning the half-streamed output as the final answer, corrupting data. (3) Cancel lands right after a non-idempotent side-effecting tool with no compensation: the user thinks "I hit Stop, so it didn't happen," but the email already went out.

Resources · MDN AbortController, developer.mozilla.org/.../AbortController · This series Day 39 Agent Error Recovery & Resilience (idempotency / compensation / checkpoint)

// PUTTING IT TOGETHER · Harden a streaming endpoint for "partial safety"

Take any streaming chat/agent you run and check it against this list — each item maps to a failure mode above:

Instrument TTFT: timestamp at the first delta; monitor p50/p95 of TTFT, not just total latency. High TTFT → cut the prefix (cache, shorten the system prompt), don't bet on the typewriter.
Swap in tolerant rendering: Markdown via a streaming renderer like Streamdown; structured fields via partial-json parsing — half-formed is preview only.
Execute tool args only after stop: accumulate input_json_delta, parse + run at content_block_stop; enable fine-grained only if you truly need it and can safely handle the partial.
Wire a real cancel: bind the Stop button to AbortController to actually disconnect, not just stop rendering; verify the server logs show generation actually halting.
Tag intermediate state: any output left incomplete by cancel/truncation is marked partial — not persisted, not trusted, not driving downstream.
Backstop side effects: every "write" tool in a streaming agent gets an idempotency key + compensation; on cancel, roll back fired effects in reverse.

Do this and you'll see: making streaming "look like it's moving" takes ten lines; making it survive all three forms of partial — half-formed, truncated, cut off — is the real line between demo and production.

// DEEP THINKING

If streaming optimizes perception, not total time, why not just hide the process behind a polished loading animation and "pretend to think," then present the finished result at once?

Because streaming provides more than "feedback": it provides falsifiable progress and a chance to intervene early. A loading spinner is unfalsifiable. It looks the same at one second as at ten, so the user can't tell whether the system is really working or heading the right way. Streaming lets the user judge as it generates: if the first sentence goes off the rails, they hit Stop immediately, saving the next 9 seconds of wasted generation and tokens. This is exactly the premise of section 04's cancellation: streaming turns "waiting" into "interruptible collaboration." It matters most for long agent tasks: the streamed intermediate steps let the user catch, at step 3, that the agent misread the intent, instead of waiting for all 40 steps. Hiding the process forfeits this human-in-the-loop layer.

Section 02 says an intermediate state is for rendering a preview only and must never drive side effects. But isn't the whole value of a streaming agent to think and act in stride? Doesn't this red line neuter it into "think first, then act"?

The red line sits inside a single unfinished structure, not across the agent's whole lifecycle. A tool call's argument JSON is untrustworthy half-formed until content_block_stop — that can't be broken; acting on half an argument is a pure bug. But an agent's "think-and-act" happens at a coarser granularity: once one tool call fully completes, the agent immediately issues the next, without waiting for the whole reply. So the correct boundary is: a structure (a single tool input) must be complete to be used, but a sequence (multiple tool calls) can advance in a stream. Fine-grained streaming tries to push the line inward a bit — letting huge arguments be consumed as they stream — but it explicitly requires you to prove you can safely handle the partial, shifting completeness validation from platform to you.

Cancellation relies on AbortController to truly disconnect. But many LLM gateways/proxies sit in the middle — does the client's abort really propagate all the way to the upstream inference engine and stop decoding? And if it doesn't?

This is a real architectural hole. To stop billing and generation, the abort signal must propagate hop by hop: browser → your backend → gateway → model provider → inference engine. Any hop that swallows the disconnect lets upstream run the whole decode to completion. In reality, multi-tier proxies, connection pools, and buffering often eat this signal. The pragmatic move is to not bet all your loss-stopping on disconnect: (1) ensure abort passthrough at the topmost hop you control; (2) set a tight max_tokens and server-side timeout as a hard upper bound, so the worst case of "cancel failed" is bounded; (3) build cancellation semantics on idempotency + compensation rather than on "stopped successfully" — i.e. assume generation may still be running and rely on a rollback-able side-effect layer. In short: cancel is best-effort bleeding control; correctness rests on the Day 39 resilience stack, not on disconnect always landing.
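Mitigation (2), the hard upper bound, can be sketched as a wrapper around whatever chunk iterator your backend relays. Everything here is an assumption for illustration: bounded is a hypothetical helper, the chunk budget stands in for a max_tokens-style cap, and close() stands in for tearing down the real upstream connection.

```python
import time

def bounded(chunks, max_chunks, deadline_s, clock=time.monotonic):
    """Relay chunks until a chunk budget or wall-clock deadline is exhausted."""
    start = clock()
    for i, chunk in enumerate(chunks):
        if i >= max_chunks or clock() - start > deadline_s:
            close = getattr(chunks, "close", None)
            if close:
                close()          # release the upstream generator / connection
            return
        yield chunk

closed = []

def endless():
    """Stand-in for an upstream that never received the client's abort."""
    try:
        while True:
            yield "tok"
    finally:
        closed.append(True)      # runs when the wrapper closes us

out = list(bounded(endless(), max_chunks=5, deadline_s=10.0))
```

Even though the upstream would have produced tokens forever, the relay stops after the budget and closes it, so a lost cancel costs a bounded amount rather than a full generation.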

Fine-grained tool streaming trades "lower latency" for "possibly broken JSON." In which concrete scenarios is this trade worth it? In which should you never enable it?

Worthwhile scenarios share one trait: the parameter is huge, and its half-formed content already has display value. The classic case is the model streaming a large file body, long document, or big code block — the user wants to watch it appear line by line rather than wait for the whole file to accumulate and pop out at once; even if the tail is cut by max_tokens, the already-streamed portion is usable (just mark it incomplete). Never enable it when the parameter is a small control structure, e.g. {"action":..., "target_id":..., "confirm":...} — small, so the buffer latency is negligible, yet its fields are tightly coupled, so a missing field corrupts semantics or turns dangerous. The test: is the parameter "content" (large, incrementally consumable, partial-usable) or an "instruction" (small, tightly coupled, must be complete)? Content: enable. Instruction: don't.

All four of this issue's topics (perception/parsing/tool/cancel) are about fighting the "intermediate state." Does that hint at a more general engineering principle?

The implied principle: the moment you turn a "result" into a "process," you must define a legal state for every instant of that process. The non-streaming world has only two states: no result, has result — simple. Streaming shatters it into infinitely many intermediate states, each frame forced to answer "what can this half-thing be used for right now?" This is isomorphic to distributed systems moving from "transactions" to "eventual consistency": you trade a latency/perception gain for the obligation to explicitly handle a pile of intermediate states that didn't exist before. The deeper lesson: the real cost of any "real-time / incremental" rework is not making it stream, but supplying semantics and a safety boundary for every partial instant in flight. Streaming is just this universal trade-off projected onto the LLM interface.

// FURTHER READING

Anthropic · Streaming Messages — authoritative definition of SSE event structure, text_delta and input_json_delta
Anthropic · Fine-grained Tool Streaming — the low-latency ⇄ broken-JSON trade-off and how to enable it
Vercel · Streamdown — a streaming Markdown renderer that assumes input is never complete
Redis · Streaming LLM Responses — a systematic take on TTFT, perceived latency, and streaming UX
MDN · AbortController — the standard mechanism to truly abort a connection / cancel a request