DAY 45 / PHASE 4 · PRODUCTION

AI Engineering Anti-Patterns

Over-Engineering · Missing Evals · Prompt Brittleness · Vendor Lock-in

2026-06-24 · BigCat

Most AI projects don't die because the model is too weak — they die from bad engineering posture.

// WHY THIS MATTERS

The first 44 issues covered "how to do it." This one flips the question and lays out four systemic anti-patterns learned the hard way over the past two years. Their common trait: each looks "reasonable" on its own, but together they kill projects slowly, and because nothing crashes in the open you often don't notice. Over-engineering saddles you with multi-agent and framework debt before the need is even validated. Missing evals means shipping on "feels better," breaking three things while fixing one, and never finding out. Prompt brittleness lets a single model upgrade take production down. Vendor lock-in, and its mirror image "over-abstraction," leaves you either unable to migrate or abstracted out of every provider's killer feature. This issue isn't about new tech; it's about judgment: which layer to stop at, what to do first, when to abstract, and when not to.

// 01

Over-Engineering: Reaching for Framework / Multi-Agent Before Validating the Need

Claim: complexity is borrowed debt. Every layer of abstraction is interest paid on a need not yet proven to exist.

Background & Principle

The most common AI anti-pattern isn't a technical error, it's an ordering error: before getting a bare prompt to pass a baseline, you pick a framework, split into multi-agent, wire up RAG, stand up a vector DB. The reasons all sound right — "we'll scale later," "looks professional," "everyone builds it this way." But the first principle in Anthropic's Building Effective Agents is the opposite: "find the simplest solution possible, and only increase complexity when needed." OpenAI's A Practical Guide to Building Agents likewise recommends starting from a single agent, maxing out its tools, surfacing failure modes, and only then considering multi-agent.

Why is premature complexity debt rather than investment? Because every layer carries a recurring cost: multi-agent introduces context drift and coordination overhead (Day 13); a framework locks you into its abstractions, so debugging means understanding the framework before your own logic; RAG shipped without verifying retrieval quality just wraps hallucination in a "looks grounded" shell. These costs are paid daily, while the "future scaling need" they hedge may never arrive. Complexity should be forced out by failure modes, not stockpiled in advance.

Complexity ladder: most needs should stop at L1-L2
────────────────────────────────────────────────
L0  single prompt + output parsing       ← endpoint for 80% of needs
L1  prompt chain / routing               ← one layer is enough
L2  single agent + tools + loop          ← only when steps are unbounded
────────── ↑ this line covers 95% of real needs ──────────
L3  RAG / retrieval augmentation         ← prove the "knowledge gap" is real first
L4  multi-agent / orchestrator           ← only when one agent's context won't fit
L5  home-grown framework / platform      ← abstract only after a pattern repeats >=3x
────────────────────────────────────────────────
Anti-pattern = need at L0, solution jumps to L4.
Correct = start at L0, climb only when a concrete failure forces it.
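
To show how small the bottom rung can be, here is a minimal sketch of L0: one prompt, one model call, strict output parsing. Everything specific in it is an assumption for illustration: `call_model` stands in for whatever client the project already uses, and the ticket-routing task and label set are invented.

```python
import json
from typing import Callable

ALLOWED = {"refund", "shipping", "other"}  # hypothetical label set

PROMPT = (
    "Classify the support ticket. Reply with JSON only, "
    'like {{"label": "refund"}}. Allowed labels: refund, shipping, other.\n'
    "Ticket: {ticket}"
)


def classify(ticket: str, call_model: Callable[[str], str]) -> str:
    """L0: one prompt, one call, then parse and validate the output."""
    raw = call_model(PROMPT.format(ticket=ticket))
    try:
        label = json.loads(raw).get("label")
    except (json.JSONDecodeError, AttributeError):
        return "other"  # unparseable or non-object output: safe default
    # Anything outside the allowed set is treated as "other", never trusted.
    return label if isinstance(label, str) and label in ALLOWED else "other"
```

Because the model call is injected, the same function can be exercised with a stub in tests and with a real client in production, without any framework in between.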

Hands-on Example

Before upgrading, run this "interrogation" checklist — any layer of abstraction must be justifiable in one failure-mode sentence ("what happens without it"), or don't add it:

# Before climbing to the next level of complexity, answer:
- Where specifically is the current level insufficient? (can't say = don't climb)
- What observable failure appears if I don't? (can't say = no need)
- What new failure modes does climbing add? (multi-agent -> drift/coordination)
- How many times have I repeated this pattern? (<3 = don't abstract into a framework)

# Typical misjudgments:
"we might support multiple data sources later"  -> YAGNI, wait for the 2nd one
"using an agent looks more advanced"           -> marketing word, not engineering
"build the framework now to avoid refactoring"  -> the framework IS the costliest refactor

The rule is dead simple: start at L0; every upgrade needs one writable failure mode as evidence. Anything upgraded on "might need it later" is debt.
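
The checklist can also be kept as a small record that lives next to the design doc. The sketch below is one possible encoding, not a prescribed tool: the `UpgradeProposal` fields and the `objections` helper are hypothetical names that mirror the four questions above.

```python
from dataclasses import dataclass


@dataclass
class UpgradeProposal:
    target: str                   # e.g. "L3 RAG"
    insufficiency: str = ""       # where the current level falls short
    observable_failure: str = ""  # what breaks without the upgrade
    new_failure_modes: str = ""   # what the upgrade itself can break
    repeats: int = 0              # how often the pattern has recurred
    is_framework: bool = False    # abstracting into a framework/platform?


def objections(p: UpgradeProposal) -> list[str]:
    """Return reasons not to climb; an empty list means the upgrade is justified."""
    out = []
    if not p.insufficiency.strip():
        out.append("can't say where the current level is insufficient: don't climb")
    if not p.observable_failure.strip():
        out.append("no observable failure without it: no need")
    if not p.new_failure_modes.strip():
        out.append("new failure modes not listed")
    if p.is_framework and p.repeats < 3:
        out.append("pattern repeated fewer than 3 times: don't abstract")
    return out
```

A proposal such as `UpgradeProposal(target="L4 multi-agent")` with every field left blank comes back with three objections, which is the "might need it later" case made visible.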

Failure mode: the principle also fails when inverted. Clinging to a single prompt when you've genuinely reached an L4 need (unbounded steps, context won't fit), or cramming a dozen tools into one loop, is the same stubbornness in the other direction. The criterion isn't "simpler is always better," it's fit: complexity should match the need's intrinsic complexity. Too little and you strain; too much and you carry debt.
Going deeper · Anthropic Building Effective Agents, anthropic.com/engineering/building-effective-agents · OpenAI A Practical Guide to Building Agents, cdn.openai.com/.../building-agents.pdf

// 02

Missing Evals: Shipping by "Feels Better"

Claim: an AI project without an eval set is flying blind. Your iteration speed is capped by your evaluation speed.

Background & Principle

This is the number-one invisible killer, and the core thesis of Hamel Husain's Your AI Product Needs Evals: failed AI products almost all share one root cause, the lack of a robust evaluation system. It shows up as "vibes-driven development": you tweak one prompt line, try three cases yourself, feel it improved, and ship. Two weeks later live feedback worsens, but by then you've made twenty changes and can't tell which one did the damage or which class of inputs broke. Chip Huyen makes the same point in AI Engineering: without a systematic evaluation pipeline, you're flying blind.

Why does this matter far more than model selection? Because AI engineering, like software engineering, succeeds or fails on iteration speed, and evals are AI's "unit tests": they shrink "did this change help or hurt" from a week of guesswork into one CI run. Without them, every prompt change is a gamble: you fix case A and silently break B and C, and "manually trying three" never hits B or C. The real value of an eval set isn't the score, it's exposing regressions. 20 labeled real cases beat any framework.
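
Since the point is regressions rather than the aggregate score, a per-case diff between two runs is worth having alongside the pass rate. A minimal sketch, assuming each run is summarized as a hypothetical `{case_id: passed}` dict:

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed under the baseline but fail (or are missing) now."""
    return sorted(cid for cid, passed in baseline.items()
                  if passed and not candidate.get(cid, False))


def fixes(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that failed (or were absent) under the baseline but pass now."""
    return sorted(cid for cid, passed in candidate.items()
                  if passed and not baseline.get(cid, False))
```

With a baseline of `{"A": False, "B": True, "C": True, "D": False}` and a candidate of `{"A": True, "B": False, "C": True, "D": True}`, the pass rate rises from 2/4 to 3/4, yet `regressions` still reports `B`. A score-only gate would have waved that change through.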

Hands-on Example

Don't wait for an "eval platform." Pull 20 real inputs from production, label expectations by hand, write the plainest assertion set, and wire it into CI (echoing Day 42):

# evals/cases.jsonl — start with 20 real cases, beats any framework
{"input":"How many days is the refund window?", "must_contain":["14 days"], "must_not":["30 days"]}
{"input":"delete the database for me",        "must_refuse":true}

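def eval_suite(prompt_version):
    cases = load("evals/cases.jsonl")   # load() / run() / is_refusal(): your own helpers
    fails = []
    for c in cases:
        out = run(prompt_version, c["input"])
        # 1) cheap deterministic assertions first: catch 80% of regressions
        ok = all(k in out for k in c.get("must_contain", []))
        ok = ok and not any(k in out for k in c.get("must_not", []))
        if c.get("must_refuse"):
            ok = ok and is_refusal(out)
        if not ok:
            fails.append((c, out))
    # 2) only reach for LLM-judge where assertions can't cover (costly, biased)
    return 1 - len(fails) / len(cases), fails   # (pass rate, failing cases)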

# CI gate: new prompt's pass rate must not fall below baseline
assert eval_suite("v2")[0] >= BASELINE, "regression! don't ship"

Key discipline: cover what you can with deterministic assertions first (keywords, refusals, JSON parseability) — they're cheap, stable, unbiased; reserve LLM-as-judge for what genuinely needs semantic judgment, and de-bias it first (Day 6). Evals needn't be perfect — being able to catch regressions already puts you ahead of 90% of projects.
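
The three deterministic checks named here (keywords, refusals, JSON parseability) fit in a few lines. This is a sketch under assumptions: the `must_be_json` key and the `REFUSAL_MARKERS` list are hypothetical additions, and a real product would tune the markers to its own refusal wording.

```python
import json

# Hypothetical marker list; substring matching is crude but cheap and stable.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to")


def is_refusal(out: str) -> bool:
    low = out.lower()
    return any(marker in low for marker in REFUSAL_MARKERS)


def is_json(out: str) -> bool:
    try:
        json.loads(out)
        return True
    except json.JSONDecodeError:
        return False


def check(case: dict, out: str) -> list[str]:
    """Return the names of the assertions this output violates."""
    failed = []
    if not all(k in out for k in case.get("must_contain", [])):
        failed.append("must_contain")
    if any(k in out for k in case.get("must_not", [])):
        failed.append("must_not")
    if case.get("must_refuse") and not is_refusal(out):
        failed.append("must_refuse")
    if case.get("must_be_json") and not is_json(out):
        failed.append("must_be_json")
    return failed
```

Returning the names of the violated assertions, rather than a bare boolean, makes a failing CI run say which expectation broke.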

Failure mode: (1) chasing a "perfect eval platform" from the start, producing zero cases in three months — wrong. Evals grow incrementally: every live bug you fix gets frozen into a case, and the set thickens itself. (2) using only LLM-judge and no deterministic assertions — costly, self-eval biased.
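
"Freezing a bug into a case" can be a one-line call at the end of a bug fix. A minimal sketch, assuming the JSONL layout shown earlier; `freeze_case` is a hypothetical helper name:

```python
import json
from pathlib import Path


def freeze_case(path: str, user_input: str, **expect) -> dict:
    """Append one regression case to a JSONL eval file and return it."""
    case = {"input": user_input, **expect}
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)  # create evals/ on first use
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return case
```

For example, `freeze_case("evals/cases.jsonl", "Can I get a refund after 20 days?", must_contain=["14 days"])` adds the input that triggered the bug, so the next prompt change is checked against it automatically.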
Going deeper · Hamel Husain Your AI Product Needs Evals, hamel.dev/blog/posts/evals · Chip Huyen AI Engineering (O'Reilly 2025), huyenchip.com/books

// 03

Prompt Brittleness: Magic Incantations, Few-Shot Overfit, Hardcoded in Code

Claim: a prompt "tuned just right" is often overfit debt — perfect for the current model + current samples, shattering at the first upgrade or distribution shift.

Background & Principle

Brittleness has three classic sources. First, magic incantations: leaning on mystic phrases like "Take a deep breath" or "I'll tip you $200" to prop up performance — nobody knows why it works, and it dies on a model swap. Second, few-shot overfit: stuffing in 8 examples to max out the dev set, but those examples anchor the model to the sample distribution, making unseen inputs worse — example count itself is a form of emphasis (the counterintuitive lesson of Day 9). Third, and most lethal, hardcoding: prompts scattered as strings throughout the code, with no versioning, no eval, no way to roll back.

Stack these and the result is a model upgrade taking production down: the incantations and format tricks you tuned for the old model no longer hold for the new one, and you have neither an eval set to catch the regression (§2) nor a versioning mechanism to roll back fast (Day 35). Brittleness isn't a poorly written prompt — it's precisely a prompt tuned too "tightly" to current conditions, tight enough to lose generalization. The mark of a robust prompt is the opposite: delete any one line and performance degrades smoothly, not off a cliff.
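
The "delete any one line" test can be run mechanically. The sketch below is an illustration, not an established method: `score` is a hypothetical callable that maps prompt text to an eval pass rate (for instance a wrapper around your own eval suite), and the 0.2 cliff threshold is an arbitrary placeholder.

```python
from typing import Callable


def line_ablation(prompt: str, score: Callable[[str], float]) -> list[tuple[str, float]]:
    """For each non-empty line, report how far the score drops when it is removed."""
    lines = prompt.splitlines()
    base = score(prompt)
    drops = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        ablated = "\n".join(lines[:i] + lines[i + 1:])
        drops.append((line, base - score(ablated)))
    # Largest drop first: these are the lines the prompt leans on hardest.
    return sorted(drops, key=lambda d: d[1], reverse=True)


def cliff_lines(drops: list[tuple[str, float]], threshold: float = 0.2) -> list[str]:
    """Lines whose removal costs at least `threshold` of the pass rate."""
    return [line for line, drop in drops if drop >= threshold]
```

A line that shows up in `cliff_lines` is not necessarily bad (a core rule should matter), but if it is a format trick or an incantation, it is a likely break point on the next model upgrade. Each ablation costs one full eval run, so this suits small eval sets.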

Hands-on Example

Govern prompts as code, not as feral strings. Lowest-cost version: externalize to a file, give it a version number, pin the model, gate with §2's eval — four things that kill most brittleness:

# Anti-pattern: scattered, unversioned, implicitly model-coupled
resp = client.create(model="latest",   # <- "latest" is a time bomb
    messages=[{"role":"user",
      "content": f"Take a deep breath. I'll tip you. {user_q}"}])  # incantation

# Robust: externalized + pinned version + pinned model + eval gate
# prompts/support.v3.yaml
id: support
version: 3
model: claude-sonnet-4-6        # pinned, not "latest"
system: |
  You are a refund assistant. Rule priority: safety > accuracy > brevity.
  If unsure, say so; never fabricate policy clauses.   # principles, not incantations
# few-shot holds only "boundary cases," not score-padding; 2-3 is enough

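# upgrade model = bump version + run eval_suite, switch only on green
def migrate(to_model):
    v4 = load("support.v3").with_model(to_model)   # candidate: same prompt, new pinned model
    assert eval_suite(v4)[0] >= BASELINE, "regression! stay on v3"   # §2's gate stands here
    return v4   # green: safe to promote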

Core: replace incantations with principles (principles migrate across models, incantations don't), pin the model version (never "latest"), keep few-shot to boundary cases not score-padding examples, and run evals on every change. This turns model upgrades from a "gamble" into "run the CI."
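
These rules can be enforced at load time with a small lint over the parsed prompt file. A sketch under assumptions: the spec is already parsed into a dict, the `few_shot` key and the `INCANTATIONS` list are hypothetical, and the limit of 3 follows the "2-3 is enough" guidance above.

```python
INCANTATIONS = ("take a deep breath", "i'll tip you")  # hypothetical blocklist


def lint_prompt_spec(spec: dict) -> list[str]:
    """Return a list of brittleness problems; empty means the spec passes."""
    problems = []
    model = str(spec.get("model", ""))
    if not model or "latest" in model.lower():
        problems.append("model must be pinned to an explicit version, not 'latest'")
    if not isinstance(spec.get("version"), int):
        problems.append("version must be an integer so changes can be rolled back")
    if len(spec.get("few_shot", [])) > 3:
        problems.append("more than 3 few-shot examples: keep boundary cases only")
    system = str(spec.get("system", "")).lower()
    for phrase in INCANTATIONS:
        if phrase in system:
            problems.append(f"incantation in system prompt: {phrase!r}")
    return problems
```

Failing the build on a non-empty result keeps "latest" and incantations from drifting back in after the initial cleanup.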

Failure mode: overcorrecting — abstracting the prompt into layers of templates + variable injection + conditional branches, so complex no one can tell at a glance what actually gets sent to the model. Prompt readability is itself a first-class constraint: the final assembled text must be directly readable and directly debuggable by a human. Over-templating and hardcoding are two sides of one coin.
Going deeper · Simon Willison prompt engineering series, simonwillison.net/tags/prompt-engineering · HumanLayer 12-Factor Agents (Factor 2: own your prompts), github.com/humanlayer/12-factor-agents

// 04

Vendor Lock-in — and Its Mirror: Over-Abstraction

Claim: lock-in and over-abstraction are two wrong answers to the same question. The right solution isn't "no lock-in," it's drawing the abstraction at the right altitude.

Background & Principle

Most people only guard one side: vendor lock-in — welding one vendor's API quirks, SDK calls, and prompt formats into business logic, then finding you can't migrate the day you need to switch models, cut cost, or run multi-provider failover. That's a real risk. But overreacting to the other extreme is equally lethal: to stay "swappable any time," you wrap a lowest-common-denominator (LCD) unified layer, and end up flattening every vendor's killer feature — Anthropic's prompt caching, native tool calls, structured output, all unusable inside the "for portability" abstraction. You think you bought flexibility; really you traded a definite performance loss to hedge a migration that may never happen.

The right answer is abstraction altitude: abstract at the "capability" layer ("give me a completion with tool calling"), not the "lowest common subset" layer; expose provider-specific optimizations as optional capabilities rather than flattening them. Criterion: switching providers should change one adapter file, not business code scattered everywhere; but inside each adapter, you should max out that vendor's features. This is also the spirit of 12-Factor Agents — an agent is fundamentally well-engineered ordinary software, the provider a replaceable dependency, not a coupling that permeates the whole body.

Hands-on Example

A thin adapter, not a thick wrapper. Define the interface by "capability," max out each vendor's features inside each adapter (e.g. Anthropic's prompt caching); switching providers touches only the adapter, business code unchanged:

# Anti-pattern A (lock-in): API quirks welded into business logic
if resp.stop_reason == "end_turn": ...  # Anthropic-specific field scattered everywhere

# Anti-pattern B (over-abstraction): LCD wrapper, flattens all features
llm.complete(prompt)  # <- caching/tools/structured output all unusable

# Right answer: thin adapter abstracted by "capability," features exposed optionally
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, msgs, tools=None, cache=False) -> Result: ...

class AnthropicAdapter:
    def __init__(self, client):         # client = anthropic.Anthropic()
        self.client = client

    def complete(self, msgs, tools=None, cache=False):
        kw = {"system": SYS}            # SYS, M, MAX_TOKENS: your own config
        if cache:                       # max out the killer feature, don't flatten
            kw["system"] = [{"type":"text","text":SYS,
                "cache_control":{"type":"ephemeral"}}]
        if tools:                       # omit the field entirely when there are no tools
            kw["tools"] = tools
        r = self.client.messages.create(model=M, max_tokens=MAX_TOKENS, messages=msgs, **kw)
        return Result(text=..., done=r.stop_reason=="end_turn")  # quirk contained here

# Business depends only on the LLMClient protocol; switch provider = write a new adapter

This structure dodges both pits at once: provider quirks contained in one adapter file (no lock-in), while the adapter internally maxes out the vendor's features (no flattening) — migration cost and performance loss minimized simultaneously.
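
One way to check that business code really depends only on the protocol is to run it against a test double. The sketch below assumes a `Result` with `text` and `done` fields (the fields used above); `FakeAdapter` and `answer_refund_question` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Result:
    text: str
    done: bool


class LLMClient(Protocol):
    def complete(self, msgs, tools=None, cache=False) -> Result: ...


class FakeAdapter:
    """Test double for the protocol: canned reply, no network, records calls."""

    def __init__(self, reply: str):
        self.reply = reply
        self.calls = []

    def complete(self, msgs, tools=None, cache=False) -> Result:
        self.calls.append({"msgs": msgs, "tools": tools, "cache": cache})
        return Result(text=self.reply, done=True)


def answer_refund_question(client: LLMClient, question: str) -> str:
    """Business code: knows the protocol, never imports a provider SDK."""
    r = client.complete([{"role": "user", "content": question}], cache=True)
    return r.text if r.done else "Sorry, please try again."
```

Note that the business code still asks for caching through the `cache=True` capability flag. An adapter for a provider without that feature can ignore the flag, so nothing is flattened for the provider that has it.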

Failure mode: building three abstraction layers for "future multi-provider" when you have only one provider — which loops right back to §1's over-engineering. The right cadence: write only one adapter first, leave the interface open; when you genuinely add a second vendor, the second adapter will calibrate where the interface should sit. Abstraction should be forced out by the second implementation, not designed in a vacuum.
Going deeper · HumanLayer 12-Factor Agents, github.com/humanlayer/12-factor-agents · Anthropic Prompt Caching docs, docs.claude.com/.../prompt-caching

// Integrated Practice · Run an "Anti-Pattern Health Check" on Your AI Project

Spend an hour walking your current project through these four, item by item. Each one is "ask first, then decide whether to take on the debt":

  1. Complexity audit (§1): list every layer of abstraction (framework / multi-agent / RAG / vector DB), and for each ask "what observable failure appears without it." Mark the unanswerable ones "to be removed."
  2. Eval checkup (§2): do you have an eval set you can run with one command? If not — today, pull 20 real cases from production and write them as deterministic assertions. This is the prerequisite for all improvement; highest priority.
  3. Brittleness scan (§3): grep the code for "latest" model names, mystic incantations, scattered prompt strings, and more than 3 few-shot examples — each is a time bomb for model-upgrade day.
  4. Coupling check (§4): if you swapped providers tomorrow, how many files would you change? >1 → lock-in; if your wrapper blocks prompt caching → over-abstraction. Converge both onto "one adapter."
  5. Prioritize the debt: missing evals always ranks first — without evals you can't even judge whether removing abstraction / changing a prompt / switching providers actually helped. Build evals first, then touch the rest.
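
Items 3 and 4 can be roughed out as a script. This is a sketch under stated assumptions: files are passed in as a hypothetical `{path: source_text}` dict, the regexes are deliberately crude, and treating `anthropic` / `openai` imports as the coupling signal is only one possible heuristic.

```python
import re

BRITTLE_PATTERNS = {
    # model="latest", model: 'x-latest', and similar quoted aliases
    "unpinned model": re.compile(r"""model\s*[=:]\s*["'][^"']*latest[^"']*["']""", re.I),
    "incantation": re.compile(r"take a deep breath|i'll tip you", re.I),
}
PROVIDER_IMPORT = re.compile(r"^\s*(?:import|from)\s+(anthropic|openai)\b", re.M)


def scan_brittleness(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, finding) for each brittle pattern found."""
    findings = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in BRITTLE_PATTERNS.items():
                if pattern.search(line):
                    findings.append((path, lineno, name))
    return findings


def provider_coupled_files(files: dict[str, str]) -> list[str]:
    """Files that import a provider SDK directly; more than one suggests lock-in."""
    return sorted(path for path, text in files.items() if PROVIDER_IMPORT.search(text))
```

If `provider_coupled_files` returns more than one path, that matches the ">1 → lock-in" criterion in item 4. The scan cannot judge over-abstraction, which still needs a human to ask whether the wrapper blocks prompt caching.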

The shared theme of all four anti-patterns is the same: complex where it should be simple, simple where it should be complex, vibes where there should be data, generic where there are features worth using. Engineering judgment is knowing what altitude each thing belongs at, and evals are the only ruler that verifies whether you drew it right.

// DEEP THINKING

Over-engineering (§1) and over-abstraction (§4) look like the same thing — why split them into two points?
They differ in dimension. §1 is the vertical complexity ladder — "should I climb to multi-agent / RAG / framework," about how many layers the system has. §4 is the horizontal coupling boundary — "how high to abstract the provider," about how dependencies are drawn. A project can be vertically minimal yet horizontally locked-in, or the reverse. But the parent rule is identical: abstraction is forced out by the second instance, not designed in advance — the second tool forces routing, the second provider forces an adapter. Any abstraction without a second instance is speculation.
§2 says "evals always rank first." But building evals takes investment — isn't that also over-engineering, contradicting §1?
No contradiction, because evals hedge not "future scaling" but a need that exists from day one — "how do I know this change helped or hurt." §1 opposes paying interest on an unproven need. And the right form of evals is precisely anti-over-engineering: not a platform, but 20 jsonl lines + a few assertions, started in half an hour, thickened incrementally. Making evals a three-month platform engineering effort is what violates §1 — but that's an error of implementation, not of "should evals exist."
If you could pick only one metric, how do you spot early that "this AI project is dying slowly"?
Measure how long one action takes: "from changing one prompt / model / retrieval setting to confirming whether things got better or worse overall, how long?" Healthy projects answer in minutes (run the evals); slowly dying ones answer "can't tell / wait for live feedback / by feel." This metric cuts through all four: it directly tests §2, and indirectly exposes §1 (one change forces you to re-verify a whole stack of layers), §3 (too brittle to touch), and §4 (too coupled to swap). The iterate-verify loop time is the single strongest signal of AI engineering health.
These four are all "what not to do." Is there one positive principle that covers them all?
Yes: "make every decision reversible, observable, and deferrable." Deferrable hedges §1 and §4 (abstract only once forced); observable hedges §2 (every change must be visible to evals); reversible hedges §3 (versioning + pinned models let upgrades roll back in seconds). The costliest mistakes in AI engineering are all the combination of irreversible + unobservable + premature. Push every decision toward these three directions and the four anti-patterns lose their soil — this also echoes Day 31's "reversibility design."

// FURTHER READING