Real HITL isn't "pop a dialog and wait for you to click yes" — it's teaching the agent to pause gracefully, hand the decision back to you, and pick up seamlessly afterward.
Day 31 covered "put a human gate before irreversible actions" — but that was one line of input(). Once the agent runs in the background, runs hundreds of turns, and you're not watching the terminal, synchronous blocking breaks: the agent freezes waiting on a terminal nobody's at; or you get annoyed and just set everything to auto. Production HITL isn't a dialog box, it's an orchestration discipline: when is it worth interrupting a human (confidence routing), how do you ask without blocking (async approval + state persistence), how does a human take over and hand back (interruptible checkpoints), and how does all of it stay auditable (approval audit). Today we upgrade require_human from one line of input() into a system that can suspend while you sleep and resume from a tap on your phone. The hard part of HITL was never "should we ask a human" — it's "what does the agent do while it asks."
Day 31 sorted by reversibility (reversible runs free, irreversible gets a gate). Add a second axis here: confidence × impact. The cross yields three tiers — auto (high confidence + low impact, run free), ask (low confidence or medium-high impact, ask a human), deny (over-budget / unauthorized, just refuse — it never enters the human queue). The key trap: confidence can't be self-reported by the model — it systematically overestimates. Use verifiable proxy signals instead: retrieval hit count, whether a tool errored, self-consistency across samples, whether output matches the schema. "I'm confident" doesn't count; the evidence does.
def confidence(ctx):  # verifiable signals, not "how sure are you?"
    sigs = [ctx.retrieval_hits >= 2,
            not ctx.tool_errored,
            ctx.self_consistency >= 0.8,  # agreement across samples
            ctx.schema_ok]
    return sum(sigs) / len(sigs)

def route(action, ctx):
    if action.over_budget or action.unauthorized:
        return "deny"  # unauthorized isn't "ask"; it's a bug
    c, impact = confidence(ctx), action.impact
    if c >= 0.8 and impact == "low":
        return "auto"
    return "ask"  # low confidence or high impact → ask
Core mental model: HITL's cost is human attention, and attention is a scarce, depletable resource. Routing's goal isn't "safer" — it's "spend each human intervention where it counts."
Sending deny into the human queue too: unauthorized actions don't belong there; mixing them in just buries the ask items that genuinely need judgment.
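One of the proxy signals above, self-consistency across samples, can be made concrete. The following is a minimal sketch, assuming "agreement" means the share of sampled answers that match the most common one after trivial normalization. The function names and the normalization step are illustrative assumptions, not part of the original routing code.

```python
from collections import Counter


def self_consistency(samples: list[str]) -> float:
    """Fraction of samples that agree with the most common answer."""
    if not samples:
        return 0.0  # no samples means no evidence of agreement
    # Assumption: whitespace and case differences do not count as disagreement.
    normalized = [s.strip().lower() for s in samples]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def confidence_from_signals(retrieval_hits: int, tool_errored: bool,
                            samples: list[str], schema_ok: bool) -> float:
    """Share of verifiable signals that pass (same thresholds as confidence())."""
    signals = [
        retrieval_hits >= 2,
        not tool_errored,
        self_consistency(samples) >= 0.8,
        schema_ok,
    ]
    return sum(signals) / len(signals)
```

With four boolean signals the score can only be 0, 0.25, 0.5, 0.75, or 1.0, so a 0.8 cutoff effectively means "every signal must pass." Free-form answers would likely need a stronger equivalence check than lowercasing.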
input() is synchronous blocking: a background agent that uses it will die. The core of production HITL is turning "wait for a human" from blocking into suspend + callback.

When the agent hits the ask tier, input() freezes the whole worker, idling on a terminal that may have nobody at it for hours. The fix follows durable execution: checkpoint state outside the process (not in memory) → send a request to an approval channel (Slack / phone push / queue) → yield resources and exit → once the human decides, resume from the checkpoint via callback or polling. Two iron rules: state must live outside the process (a crash mustn't lose the approval); resume must be idempotent (replaying the same approval can't execute the action twice). Attach an SLA timeout to each approval: hung too long → default-deny or escalate, never wait forever.
async def step(ctx):
    action = plan(ctx)
    tier = route(action, ctx)
    if tier == "deny":
        return refuse(action)  # refuse() is a placeholder: deny must never fall through to execute()
    if tier == "ask":
        aid = save_checkpoint(ctx)    # ① state out of process, with approval_id
        notify("slack", action, aid)  # ② fire request, don't await
        raise Suspend(aid)            # ③ yield worker, exit
    return execute(action)

def on_approval(aid, decision):  # triggered by callback after the human acts
    if seen(aid):
        return  # ⑤ idempotent: replays don't re-execute
    mark_seen(aid)
    ctx = load_checkpoint(aid)
    if decision == "approve":
        resume_from(ctx)  # ⑥ continue from checkpoint
    else:
        resume_from(ctx, denied=True)  # feed denial back to the agent too
(1) input() blocks the whole worker: one pending approval drags down every concurrent task. (2) State only in memory: a restart / crash loses the approval and all progress. (3) No idempotency: a replayed callback or a double-tap executes the transfer twice. (4) No timeout: one unattended ask suspends the task forever.
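The two iron rules and the SLA timeout can be sketched with nothing more than the standard library. This is a minimal illustration, assuming SQLite as the out-of-process store and default-deny on expiry. The table layout and the names open_store, save_checkpoint, and record_decision are hypothetical choices for this sketch, not the API of any particular framework.

```python
import json
import sqlite3
import time
import uuid
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # Pass a file path in real use so the state survives a process restart.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS approvals ("
        "aid TEXT PRIMARY KEY, state TEXT NOT NULL, "
        "deadline REAL NOT NULL, decision TEXT)"
    )
    return db


def save_checkpoint(db: sqlite3.Connection, state: dict, sla_seconds: float,
                    now: Optional[float] = None) -> str:
    """Persist agent state and return the approval id to send to the human."""
    now = time.time() if now is None else now
    aid = uuid.uuid4().hex
    db.execute(
        "INSERT INTO approvals (aid, state, deadline) VALUES (?, ?, ?)",
        (aid, json.dumps(state), now + sla_seconds),
    )
    db.commit()
    return aid


def record_decision(db: sqlite3.Connection, aid: str, decision: str,
                    now: Optional[float] = None):
    """Return (state, effective_decision) the first time, None on any replay."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT state, deadline FROM approvals WHERE aid = ?", (aid,)
    ).fetchone()
    if row is None:
        return None  # unknown approval id
    state, deadline = row
    if now > deadline:
        decision = "deny"  # SLA expired: default-deny rather than wait forever
    # Only a row whose decision is still NULL can be claimed, so a replayed
    # callback or a double-tap updates zero rows.
    cur = db.execute(
        "UPDATE approvals SET decision = ? WHERE aid = ? AND decision IS NULL",
        (decision, aid),
    )
    db.commit()
    if cur.rowcount == 0:
        return None  # already decided: do not resume or execute again
    return json.loads(state), decision
```

The caller resumes the agent only when record_decision returns a value. Escalating to a backup approver instead of denying on timeout would be a different policy at the `now > deadline` branch.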
Thinking of human intervention as only approve / reject wastes it. Real workflows have two more valuable moves: edit (fix its plan / draft / params and continue) and takeover (a human does a few steps, then hands back). To support both, the agent loop needs three things: interruptible (not a while True running to the death), serializable state, and resumable from any checkpoint. The commonly-missed part: after a human edits, you must feed the change back into the agent's context — otherwise its next step overwrites your edit with the old plan. So "plan" and "draft" must be explicit, human-overwritable state, not buried in prompt history.
def loop(state):
    while not state.done:
        if stop_requested():
            return snapshot(state)  # interruptible: save & exit anytime
        proposal = plan(state)
        decision = await_human(proposal)  # approve / edit / takeover
        if decision.kind == "approve":
            state = execute(proposal, state)
        elif decision.kind == "edit":
            state.plan = decision.edited  # human-edited plan
            state.ctx += f"[human edited plan: {decision.note}]"  # feed back! or it's overwritten
        elif decision.kind == "takeover":
            # human does steps, then resume; apply_human_steps() is a placeholder
            # that folds those steps into the state instead of replacing it
            state = apply_human_steps(state, decision.human_steps)
    return state
A counterintuitive but important point: edit is cheaper than reject. Reject sends the agent back to square one (it may repeat the same mistake); edit nudges it straight onto the right track — one correction beats ten rejections.
(1) A bare while True loop: to step in you can only kill it, losing all progress and restarting from scratch. (2) The human edits the draft but it isn't fed back into context: the agent ignores your change next turn. (3) Incomplete checkpoint state: on resume a key variable is missing and the agent runs "amnesiac."
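Pitfalls (2) and (3) both come down to what lives in the state object. Here is a minimal sketch of explicit, serializable state, assuming JSON is enough for the fields involved. AgentState and its field names are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    goal: str
    plan: list = field(default_factory=list)   # explicit and human-overwritable
    draft: str = ""
    notes: list = field(default_factory=list)  # context the agent reads next turn
    step: int = 0
    done: bool = False


def snapshot(state: AgentState) -> str:
    """Serialize every field so any worker can resume from this point."""
    return json.dumps(asdict(state))


def restore(blob: str) -> AgentState:
    """Rebuild the state from a snapshot produced by snapshot()."""
    return AgentState(**json.loads(blob))


def apply_edit(state: AgentState, edited_plan: list, note: str) -> AgentState:
    """Overwrite the plan and record the edit where the agent will see it."""
    state.plan = list(edited_plan)
    state.notes.append(f"[human edited plan: {note}]")
    return state
```

Because the plan is a field rather than something buried in prompt history, a human edit survives a snapshot and restore cycle, and the note travels with it.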
Each approval should record at least: the action, the agent's reason, the confidence, who approved, how long it took, the outcome. This log has two uses. First, retune routing thresholds — an ask class you approve 95% of the time → demote to auto; an auto class that caused an incident → promote to ask. The §1 thresholds aren't set once; they grow out of audit data. Second, escalation & delegation — high-impact actions require more than one approver / a specific role (lightweight RBAC); an approval hung too long auto-escalates to a backup. Finally, the single best move against approval fatigue: batch similar low-risk actions (approve 10 on one screen) rather than popping 10 dialogs.
from statistics import mean

def log_approval(rec):  # every approval to the store: retune + accountability
    db.insert(action=rec.action, reason=rec.agent_reason,
              conf=rec.confidence, approver=rec.who,
              latency=rec.decided_at - rec.asked_at, outcome=rec.result)

def retune(action_type):  # regress thresholds from history
    rows = db.query(action_type, last_days=14)
    if not rows:
        return  # no history yet: mean() raises on an empty sample
    rate = mean(r.outcome == "approved" for r in rows)
    if rate > 0.95 and no_incidents(rows):
        suggest("ask→auto")  # always approved = redundant ask
    if any(r.outcome == "incident" for r in rows):
        suggest("auto→ask")
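The other two ideas from this section, batching similar low-risk asks and requiring more than one approver for high-impact actions, can be sketched as follows. This is an illustration under assumed data shapes: the dictionary keys, the "owner" role, and the two-approver rule are hypothetical, not values prescribed by the text.

```python
from collections import defaultdict


def batch_pending(pending: list[dict]) -> dict:
    """Group low-impact pending approvals by action type for one-screen review."""
    batches = defaultdict(list)
    for item in pending:
        if item["impact"] == "low":  # higher-impact items stay individual
            batches[item["action_type"]].append(item["id"])
    return dict(batches)


def is_approved(impact: str, approvers: list[dict]) -> bool:
    """Assumed policy: high impact needs two distinct users, one with role 'owner'."""
    distinct_users = {a["user"] for a in approvers}
    if impact != "high":
        return len(distinct_users) >= 1
    has_owner = any(a["role"] == "owner" for a in approvers)
    return len(distinct_users) >= 2 and has_owner
```

Counting distinct users rather than approval events keeps one person from satisfying the quorum by tapping twice.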
Chain the four into a weekend project: give a background agent you run a layer that "suspends while you sleep, resumes from a tap on your phone," then red-team it yourself.
Build a (confidence, impact) → {auto, ask, deny} table; proxy confidence with verifiable signals (retrieval hits / tool errors / self-consistency), never the model's self-report.
Move the ask tier from input() into checkpoint-to-store + Slack/push + suspend-exit; don't block the worker.
Run retune() to see which asks to demote and which autos to promote.

Once you've built this, you'll instinctively ask of any "autonomous agent product": what does it do while it asks a human (block or suspend), can I edit its intermediate work, does it keep an approval trail? That beats being dazzled by the "one-click full-auto" in the demo.
If you are watching the run yourself, input() is plenty; don't add a queue. The moment it (a) runs in the background / on cron, (b) a single run spans beyond your attention window (tens of minutes+), or (c) multiple instances run concurrently, synchronous blocking bites: workers freeze, a crash loses all progress. The criterion isn't "project size," it's "are you present when it breaks." If not, you need durable execution. If you are, the machinery is just overhead.

The no_incidents guard in retune() is the defense: only suggest demotion when "always approved and no incidents," not on "always approved" alone. But the sturdier practice is to never let the regression auto-change thresholds: only propose (suggest(), not apply()), with your periodic review. Feeding your bias into a system that then automates the bias is HITL's most insidious failure, so the audit itself must be audited.