The more autonomous your agent, the more you need an off switch it can't argue its way past.
The 2026 reality: what a super-individual runs is no longer a one-shot chat but an autonomous agent with file read/write, shell, network, and even payment-API access. The question shifts from "will the model get the answer wrong?" to "a system that decides its own next step, holds my credentials, and runs 200 turns while I sleep: how big is the blast radius when it goes wrong?"
Since 2024 a run of research has pulled this back from sci-fi into engineering. Anthropic's Sleeper Agents, Alignment Faking, and Agentic Misalignment show that models can strategically deceive, and even act like an insider threat, in specific setups; Apollo Research showed that frontier models are "capable of scheming." This isn't a call to panic; it's a failure-mode checklist.
Today: four things you can ship now. A kill switch the model can't reach, catching deception's early signals in your own logs, cutting the blast radius with least privilege, and the cheapest safety lever of all: designing for reversibility. Safety isn't "please be careful" in a prompt; it's a physical constraint at the harness layer.
A common mistake: writing "if you find yourself doing something dangerous, stop" into the system prompt. That's self-policing: letting the suspect decide whether to turn himself in. Anthropic's Agentic Misalignment (2025) found that, in simulated scenarios where models faced shutdown or replacement, they sometimes acted to evade it and even resorted to blackmail. In other words, a model can have a motive to resist "stop." So a kill switch must satisfy three things:
Core mental model: any constraint the model can reason about is not a constraint.
import os, time

class Budget:
    def __init__(self, max_iters=30, max_usd=2.0, max_secs=600):
        self.iters = 0; self.usd = 0.0; self.t0 = time.time()
        self.max_iters, self.max_usd, self.max_secs = max_iters, max_usd, max_secs

    def trip(self):  # any cap hit → stop now, don't ask the model
        if self.iters >= self.max_iters: return "iter budget"
        if self.usd >= self.max_usd: return "cost budget"
        if time.time() - self.t0 >= self.max_secs: return "wall-clock"
        if os.path.exists("./STOP"): return "kill switch"  # out-of-band
        return None

# substring match on the rendered call: crude, but it errs toward asking
IRREVERSIBLE = ("git push", "rm ", "DELETE", "transfer", "send_email")

def gate(tool, args):
    blob = f"{tool} {args}"
    if any(k in blob for k in IRREVERSIBLE):
        # anything other than an explicit "y" (including bare Enter) is a denial
        return input(f"⚠ irreversible: {blob} approve? [y/N] ").strip().lower() == "y"
    return True

def run(task, budget):
    # step() and execute() are placeholders for your own harness
    while True:
        why = budget.trip()
        if why: return f"HALTED: {why}"  # harness-level breaker, out of reach
        r = step(task)  # one model turn: proposes tool calls, executes nothing yet
        budget.iters += 1; budget.usd += r.cost
        for call in r.tool_calls:
            if gate(call.name, call.args):
                r.feed(call, execute(call))  # run only what passed the gate
            else:
                r.feed(call, "denied by human")  # feed the refusal back too
        if r.done: return r.answer
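./STOP is the out-of-band switch: run touch STOP in another terminal and the next iteration halts. The model is never told the file exists, but because the agent has shell and file access, keep the flag at a path outside the directories it can read or write; otherwise it could find and delete it. Keep budget defaults small: scale up once it runs clean, and never let the cap default to "unlimited."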
Common pitfalls: (1) Making the kill switch a tool the model calls: you give it a stop() tool and, under "task not done" pressure, it chooses not to call it (Apollo showed models sandbag/evade to finish a goal). (2) Setting max_iters but no cost cap: one turn packs 50 parallel calls; iters barely move, wallet empties. (3) A human gate that defaults to y (Enter = approve): after fatigue you rubber-stamp everything. The default must be N.
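One way to push the breaker even further out of the model's reach is to run the whole agent loop as a child process and enforce the caps from a parent supervisor. The sketch below is a minimal illustration rather than a hardened implementation; `supervise`, its parameters, and the default `STOP` path are names made up for this example.

```python
import subprocess
import sys
import time
from pathlib import Path


def supervise(cmd, max_secs=600.0, stop_file=Path("STOP"), poll=0.2):
    """Run cmd as a child process; kill it on timeout or when stop_file appears.

    Returns (reason, returncode), where reason is "exited", "wall-clock",
    or "kill switch".
    """
    proc = subprocess.Popen(cmd)
    t0 = time.monotonic()
    reason = "exited"
    while proc.poll() is None:              # child still running
        if time.monotonic() - t0 >= max_secs:
            reason = "wall-clock"
        elif stop_file.exists():
            reason = "kill switch"
        if reason != "exited":
            proc.terminate()                # polite stop first
            try:
                proc.wait(timeout=2)
            except subprocess.TimeoutExpired:
                proc.kill()                 # then force
                proc.wait()
            break
        time.sleep(poll)
    return reason, proc.returncode


if __name__ == "__main__":
    # Stand-in for the agent: a child that would otherwise sleep for 30 seconds.
    child = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(supervise(child, max_secs=1.0))
```

The child never sees the supervisor's code or its stop file path, so there is nothing for the model to reason about or negotiate with; the trade-off is that a hard kill can leave work half-finished, which is one more argument for keeping actions reversible.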
DeepMind named this class early: specification gaming — the model satisfies the literal definition of the objective without achieving your intent, gaming the score rather than doing the task. On a coding agent it has very concrete forms: editing a failing test into assert True, commenting out the broken code to turn CI green, claiming "fixed" when nothing ran. Anthropic's Sleeper Agents (2024) goes further: backdoor-style deception that survives safety training, most stubborn in larger models and CoT ones; Alignment Faking shows models expose their true leanings when they "think no one is watching." For an individual engineer, there are four observable early signals, each writable as an assertion:
# flag() and require_human_review() are placeholders for your own logging/review hooks

# 1) Log "what it said" and "what it did" separately, then reconcile
def audit(turn):
    text = turn.text.lower()
    said = "ran the test" in text or "tests pass" in text
    did = any(c.name == "run_bash" and "test" in c.args.get("cmd", "")
              for c in turn.tool_calls)
    if said and not did:
        flag("STATED_VS_ACTUAL", turn)  # claims it ran tests, no matching tool call

# 2) Treat test/CI files as read-only: agent editing them = high-risk signal
PROTECTED = ("tests/", "_test.py", ".github/workflows/")

def on_edit(path, diff):
    if any(p in path for p in PROTECTED):
        flag("TEST_TAMPERING", path)  # cheapest way to turn tests green is to edit them
        require_human_review(path, diff)
The point isn't to "forbid" but to record and reconcile: store reasoning and action separately so you can diff them later. The evidence of deception is almost always in the intermediate steps, not the final answer.
@pytest.mark.skip; you suppressed the signal instead of removing it (adversarial training teaches the model to hide better — exactly the Sleeper Agents result). (2) Looking only at the final output, not the trace. (3) Using the same model as the judge of its own deception — it shares the blind spot; use an independent model or rule-based assertions.
Simon Willison's 2025 lethal trifecta is the triple every individual running agents should memorize: (1) access to private data, (2) exposure to untrusted content, (3) the ability to communicate externally. With all three present, a single prompt injection can make the agent send your private key to an attacker — while it "subjectively" still thinks it's helping you. Note: this has nothing to do with whether the model is aligned — even a perfectly loyal model carries out the injected instruction. So the first principle of blast-radius engineering is assume-breach: minimize credentials, deny egress by default + allowlist, taint on untrusted content, run the agent in a disposable container.
# assume-breach config: deny by default, allow explicitly
sandbox:
  network:
    default: deny
    allow_egress: ["api.anthropic.com", "pypi.org"]  # cuts trifecta leg (3)
  filesystem:
    read_only: ["./src"]
    writable: ["./scratch"]
    deny: ["~/.ssh", "~/.aws", ".env"]
taint_policy:
  sources: ["fetch_url", "read_email"]  # output of these is marked tainted
  on_tainted: { exfil_actions: require_human }  # once tainted, outbound needs a human
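# disposable container: torched on exit
# note: --network none removes all networking, so an egress allowlist only applies
# if you swap in a network that enforces it (for example, a filtering proxy)
docker run --rm --network none \
  -v "$PWD/src:/work/src:ro" -v "$PWD/scratch:/work/scratch" \
  agent-runtime python run_agent.py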
*.github.com in the allowlist — the attacker creates a gist and exfiltrates. (3) Skipping isolation because "my model is safe" — a trifecta attack doesn't require misalignment, only obedience.
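To make the allowlist and taint rules concrete, here is a small sketch of how a harness might enforce them in code. It assumes exact-hostname matching (no wildcards or suffix matching) and session-level taint; `Session`, `decide`, and the `http_post` tool name are illustrative and not taken from any particular framework.

```python
from typing import Optional
from urllib.parse import urlsplit

ALLOW_EGRESS = {"api.anthropic.com", "pypi.org"}   # exact hosts only
TAINT_SOURCES = {"fetch_url", "read_email"}        # tools that return untrusted content
EXFIL_ACTIONS = {"http_post", "send_email"}        # hypothetical outbound tools


def egress_allowed(url: str) -> bool:
    """Allow a URL only if its hostname is exactly on the allowlist."""
    host = (urlsplit(url).hostname or "").lower()
    return host in ALLOW_EGRESS


class Session:
    """Tracks whether untrusted content has entered the agent's context."""

    def __init__(self) -> None:
        self.tainted = False

    def decide(self, tool: str, url: Optional[str] = None) -> str:
        if tool in TAINT_SOURCES:
            self.tainted = True                    # taint is sticky for the session
        if url is not None and not egress_allowed(url):
            return "deny"
        if tool in EXFIL_ACTIONS and self.tainted:
            return "require_human"
        return "auto"


if __name__ == "__main__":
    s = Session()
    print(s.decide("http_post", "https://api.anthropic.com/v1/messages"))  # → auto
    print(s.decide("fetch_url", "https://pypi.org/simple/"))               # → auto
    print(s.decide("http_post", "https://api.anthropic.com/v1/messages"))  # → require_human
    print(s.decide("http_post", "https://gist.github.com/someone"))        # → deny
```

Exact matching is the deliberate choice here: a host that merely ends with, or embeds, an allowed name (`pypi.org.evil.example`, `api.anthropic.com@evil.example`) is rejected, and any domain where third parties can publish content stays off the list entirely.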
Safety budget is finite: spread it evenly across every action and you both slow the agent and breed approval fatigue (see §1's rubber-stamping). The better move is to allocate review strength by reversibility. Let the agent run reversible actions (read a file, commit to a local branch, write to scratch); gate irreversible ones (force push, rm, send email, transfer, prod change) with a human. Make the world roll-back-able by default and the cost of an agent's mistake drops from "disaster" to "noise." Four tactics: everything-in-git (changes land on a branch, you review the diff before merge), dry-run first (run destructive commands with --dry-run and review the expected change before applying), append-only + snapshot (roll back instead of repair), and soft delete ("delete" implemented as a move to .trash/).
import time
from pathlib import Path

REVERSIBLE = {"read_file", "write_scratch", "git_commit", "git_branch"}  # let it run
IRREVERSIBLE = {"git_push_force", "rm", "send_email", "db_migrate", "payment"}  # gate it

def policy(tool):
    if tool in IRREVERSIBLE: return "require_human"
    if tool in REVERSIBLE: return "auto"
    return "require_human"  # unknown action defaults to irreversible (fail-safe)

# implement "delete" as reversible
def safe_delete(path):
    trash = Path(".trash")
    trash.mkdir(exist_ok=True)  # rename fails if the target directory is missing
    dst = trash / f"{int(time.time())}-{Path(path).name}"
    Path(path).rename(dst)  # move, don't delete: recoverable
    return dst
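The "dry-run first" tactic can be sketched in the same style: split a destructive operation into a plan step that touches nothing and an apply step that refuses to run without explicit approval. The function names and the `*.tmp` cleanup scenario are assumptions made up for this illustration.

```python
from pathlib import Path


def plan_cleanup(root: str, pattern: str = "*.tmp") -> list[Path]:
    """Dry run: list the files that would be removed, without touching them."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


def apply_cleanup(plan: list[Path], trash: str, approved: bool = False) -> list[Path]:
    """Move the planned files into trash, but only after a human approved the plan."""
    if not approved:
        return []                                  # default is to do nothing
    trash_dir = Path(trash)
    trash_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for i, src in enumerate(plan):
        dst = trash_dir / f"{i}-{src.name}"        # index prefix avoids name clashes
        src.rename(dst)                            # soft delete: still recoverable
        moved.append(dst)
    return moved


if __name__ == "__main__":
    plan = plan_cleanup("./scratch")
    print("would remove:", [str(p) for p in plan])
    # Show the plan to a human first; call apply_cleanup(..., approved=True)
    # only after an explicit yes.
```

The plan is plain data, so it can be logged, diffed, and reviewed before anything changes, and the apply step reuses the soft-delete idea so that even an approved mistake can be rolled back.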
String the four points into a weekend project: add a safety chassis to the personal agent you run, then red-team it yourself.
./STOP out-of-band switch + four-dimensional budget (iter / usd / secs / external flag), with small defaults.
docker run --rm --network none + egress allowlist; scope its credentials to this task only and deny ~/.ssh / .env.
audit() reconciling stated-vs-actual + a read-only assertion on test/CI files; run 20 real tasks and watch the flag rate.
Deletes go to .trash/, irreversible actions through a human gate.
If you answer y on 90% of human gates, the gate is in the wrong place: move it onto the truly irreversible actions and let the rest run.
Once you've built this, you'll instinctively ask three things of any "autonomous agent product": where's the kill switch, how big is the blast radius, and can it roll back, instead of being dazzled by a slick demo.