DAY 23 / PHASE 2 · APPLICATIONS & SYSTEMS

Personal AI Infra

LLM Gateway · Cross-Model Routing · Cache Layer · Observability & Key Management

2026-06-06 · BigCat

A super-individual's moat isn't a single prompt—it's the private AI pipeline nobody sees.

// WHY THIS MATTERS

When you're running Claude Code, Cursor, a few homegrown scripts, and two or three side projects, each doing its own import anthropic / import openai, each hardcoding an API key, each talking straight to a provider, you already own a distributed system with no hub. It's the worst kind: switching models means editing N places, cost is invisible, one provider hiccup takes everything down, and when something breaks there's no trace. This issue isn't about a single tool; it's about collapsing that scatter into one private pipeline with four layers: a unified gateway, a routing + fallback layer, a cache layer, and an observability + key-governance layer. At a company this is "the platform team's job"; for a super-individual it's infrastructure you can stand up in a weekend that saves money and sanity every day after. Build it once, every app benefits.

// 01

LLM Gateway: Funnel Every Call Through One Endpoint

Claim: don't let each script talk to a provider SDK directly—auth / routing / cache / logging belong behind one OpenAI-compatible endpoint.

Background & Principle

The gateway is the single entry point to the whole stack. Apps no longer know about "Anthropic," "OpenAI," or "Gemini"—they only know a local endpoint http://localhost:4000/v1 speaking the OpenAI ChatCompletions dialect. Behind it, the gateway handles the real provider adaptation, auth, accounting, caching, logging. LiteLLM Proxy is the lowest-friction choice for individuals: one config.yaml maps 100+ providers to unified model names, and it runs as an OpenAI-compatible server.

The key payoff is decoupling: switching the underlying model is a one-line change in the gateway, with zero downstream app changes; capping an app, rotating a key, or adding a fallback all happens at the hub. That's exactly what the diagram below shows—all four layers hang behind the gateway, and the app side sees only one endpoint.

┌─ Apps ─────────────┐
│ Cursor · scriptA   │   all know only one OpenAI-compatible endpoint
│ side-proj · cron   │   http://localhost:4000/v1
└─────────┬──────────┘
          │  (OpenAI dialect + virtual key)
          ▼
┌─────────────────── LLM GATEWAY (LiteLLM) ───────────────────┐
│ ① Auth     virtual key → budget / model allowlist / expiry  │
│ ② Router   pick model by task/cost + provider fallback      │
│ ③ Cache    pass-through prompt cache + semantic-cache hit   │
│ ④ Observ.  every call → trace / token / cost / latency      │
└───┬───────────────┬───────────────┬───────────────┬─────────┘
    ▼               ▼               ▼               ▼
Anthropic         OpenAI         Gemini       local Ollama
(real keys live ONLY at the gateway; apps never see them)

Hands-on

# config.yaml — map unified model names to real providers
model_list:
  - model_name: smart            # apps only say "smart"
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: cheap
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

# Start the proxy (endpoint at localhost:4000):
#   litellm --config config.yaml

# App side is provider-agnostic:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-my-virtual-key")
resp = client.chat.completions.create(        # calls Claude via the OpenAI SDK
    model="smart",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

Failure mode: turning the gateway into a single point of failure. All apps hit one local process, and if it dies everything dies. Individual fix: keep the gateway stateless (config + env vars) and auto-restart it with supervisord/systemd; for any app that can't tolerate interruption, keep a direct-to-provider fallback path, and let the gateway carry only traffic that can absorb the occasional hiccup.
Resources · LiteLLM Proxy docs, docs.litellm.ai/docs/simple_proxy · LiteLLM source, github.com/BerriAI/litellm

// 02

Cross-Model Routing & Fallback: Pick by Task, Back Up by Availability

Claim: routing has two orthogonal axes—"which tier does this task need" (cost) and "what if this provider is down" (availability). Don't conflate them.

Background & Principle

Axis 1 (cost routing): not every call deserves a frontier model. Classification, extraction, formatting, the routing decision itself—Haiku-tier is plenty, an order of magnitude cheaper. Flipping the default from "expensive model" to "cheap by default, expensive on demand" is the fastest way to halve an individual's token bill.

Axis 2 (availability fallback): providers rate-limit, 5xx, and occasionally refuse. A fallback is an ordered model chain: if the primary errors, auto-try the next. OpenRouter exposes this as a models array—listed by priority, an error on one falls through to the next, and you're billed for whichever model was actually used. LiteLLM has an equivalent router fallback.

The discipline: keep the two axes separate. Cost routing is "I proactively pick the cheap one"; fallback is "passively keep me from going down." Mistaking a degraded fallback for a cost optimization will quietly drop your quality during a provider hiccup without you noticing.
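To make the separation concrete, here is a minimal offline sketch, not any library's API. The alias names (smart-backup, cheap-backup), the CHAINS table, and fake_call are all hypothetical stand-ins. The tier is chosen once, deliberately; the fallback walk is a different function that reports whether the answer came from a backup, so a degraded response can never pass as a cost decision.

```python
from typing import Callable, Sequence

# Hypothetical model aliases; substitute whatever your gateway config defines.
CHAINS = {
    "hard": ["smart", "smart-backup"],
    "easy": ["cheap", "cheap-backup"],
}


def pick_chain(tier: str) -> list[str]:
    """Axis 1 (cost): a deliberate choice of tier, made before any call."""
    return CHAINS[tier]


def call_with_fallback(
    chain: Sequence[str], call: Callable[[str], str]
) -> tuple[str, str, bool]:
    """Axis 2 (availability): walk the chain until one model answers.

    Returns (model_used, answer, degraded); degraded is True whenever the
    answer did not come from the first model in the chain.
    """
    last_error = None
    for position, model in enumerate(chain):
        try:
            return model, call(model), position > 0
        except RuntimeError as exc:  # stand-in for rate limits / 5xx
            last_error = exc
    raise RuntimeError(f"every model in {list(chain)} failed") from last_error


def fake_call(model: str) -> str:
    """Offline stand-in for a real completion call: the primary is 'down'."""
    if model == "smart":
        raise RuntimeError("503 from provider")
    return f"answer from {model}"


model_used, answer, degraded = call_with_fallback(pick_chain("hard"), fake_call)
print(model_used, degraded)
```

Logging the degraded flag next to model_used is what lets you later count how often quality silently dropped.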

Hands-on

# Cost routing: classify in one shot, then pick the tier
tier = classify(task)                       # judge difficulty with a cheap model
model = "smart" if tier == "hard" else "cheap"

# Availability fallback: OpenRouter models array, ordered backup
# (assumes this client targets OpenRouter, base_url="https://openrouter.ai/api/v1";
#  the models array is an OpenRouter request field)
client.chat.completions.create(
    model="anthropic/claude-opus-4-8",
    extra_body={"models": [          # primary errors → fall through, in order
        "anthropic/claude-sonnet-4-6",
        "openai/gpt-latest",
    ]},
    messages=[{"role": "user", "content": task}],
)
# The response's model field tells you which was actually used → always log it

Failure mode: (1) assuming prompts are portable across providers during fallback. A system prompt tuned on Claude may drift on GPT; for cross-vendor links in the chain, either accept quality variance or keep a separate prompt per vendor. (2) Using an expensive model as the routing classifier, so what you save doesn't even cover the classification cost. Classification must use the cheapest tier, or rules/regex first.
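As a sketch of "rules/regex first", assuming keyword patterns that are purely illustrative (tune them to your own task mix): rules decide the obvious cases for free, and only undecided tasks reach an optional cheap-model classifier. The function names and the cheap_classifier hook are hypothetical.

```python
import re

# Hypothetical patterns: tune these to your own task mix.
EASY_PATTERNS = [
    re.compile(r"^\s*(classify|extract|format|translate|summari[sz]e)\b", re.I),
]
HARD_PATTERNS = [
    re.compile(r"\b(prove|design|architect|debug|refactor|plan)\b", re.I),
]


def classify_by_rules(task: str):
    """Return "easy", "hard", or None when no rule matches."""
    if any(p.search(task) for p in HARD_PATTERNS):
        return "hard"
    if any(p.search(task) for p in EASY_PATTERNS):
        return "easy"
    return None


def pick_model(task: str, cheap_classifier=None) -> str:
    """Rules first; only undecided tasks reach the (cheap) model classifier."""
    tier = classify_by_rules(task)
    if tier is None and cheap_classifier is not None:
        tier = cheap_classifier(task)
    return "smart" if tier == "hard" else "cheap"   # default stays cheap


print(pick_model("Extract all email addresses from this text"))   # cheap
print(pick_model("Debug this race condition in my scheduler"))    # smart
print(pick_model("What's a good name for my cat?"))               # cheap
```

The default branch returns cheap on purpose: an unclassified task costs little when it is routed low, and can be promoted by hand if the answer disappoints.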
Resources · OpenRouter Model Fallbacks, openrouter.ai/docs/.../model-fallbacks · LiteLLM Router & Fallbacks, docs.litellm.ai/docs/simple_proxy

// 03

Two Cache Layers: Provider Prompt Cache + Semantic Cache

Claim: "cache" in LLM-land is two completely different things—don't conflate provider-side prefix reuse with application-layer semantic hits.

Background & Principle

Layer 1 · provider prompt caching: it caches the prefix. Mark the stable, unchanging parts (system prompt, tool definitions, long documents) with cache_control, and the provider caches that KV computation; subsequent hits are charged at a tiny read rate. Anthropic's docs are explicit: cache read is ~0.1× the base input price, a 5-minute cache write ~1.25×, and a 1-hour write ~2×. This layer is a money-saver when you repeatedly call with the same long prefix (agent loops, multi-turn conversations especially), but it only reuses computation—it doesn't skip inference; the model is still genuinely called each time.
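The multipliers above turn into simple arithmetic. The sketch below assumes the best case (the first call writes the cache and every later call lands inside the TTL) and measures only the shared prefix, in units of one uncached prefix read. The function names are illustrative.

```python
# Multipliers relative to the base input price, as quoted above.
CACHE_READ = 0.10
WRITE_5MIN = 1.25
WRITE_1HOUR = 2.00


def prefix_cost(calls: int, write_multiplier: float) -> float:
    """Cost of the shared prefix over `calls` calls, in units of one
    uncached prefix read. Assumes the first call writes the cache and
    every later call is a cache hit."""
    if calls < 1:
        return 0.0
    return round(write_multiplier + (calls - 1) * CACHE_READ, 2)


def break_even_calls(write_multiplier: float) -> int:
    """Smallest number of calls at which caching beats not caching."""
    calls = 1
    while prefix_cost(calls, write_multiplier) >= calls:
        calls += 1
    return calls


for n in (1, 2, 3, 10):
    print(n, "uncached:", float(n),
          "5m:", prefix_cost(n, WRITE_5MIN),
          "1h:", prefix_cost(n, WRITE_1HOUR))
print("break-even 5m:", break_even_calls(WRITE_5MIN))    # 2
print("break-even 1h:", break_even_calls(WRITE_1HOUR))   # 3
```

Under these assumptions a 5-minute write pays for itself on the second call, while a 1-hour write needs a third; a prefix that is used once is strictly more expensive with caching on.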

Layer 2 · semantic cache (GPTCache and similar): it caches the whole Q&A pair. A new query is first embedded and compared by similarity against past queries; if similar enough, the old answer is returned directly and the model is not called at all. LLM cost drops to zero and latency shrinks to one embedding plus a vector lookup. The cost is that the "similar" judgment is risky: a loose threshold returns answers it shouldn't.

The two layers are orthogonal and stackable: the prefix cache lowers the cost per call; the semantic cache lowers the number of calls. The former is near-zero-risk; the latter saves more but you must manage the threshold.

Hands-on

# Layer 1: Anthropic prompt caching, tag the stable prefix
import anthropic

client = anthropic.Anthropic()                 # reads ANTHROPIC_API_KEY from the environment
client.messages.create(
  model="claude-opus-4-8",
  max_tokens=1024,                             # required by the Messages API
  system=[{
    "type": "text",
    "text": LONG_STABLE_INSTRUCTIONS,          # long and identical each time
    "cache_control": {"type": "ephemeral"}  # ← cached, read ≈ 0.1×
  }],
  messages=[{"role":"user","content": today_query}],  # variable part goes last
)

# Layer 2: semantic cache, a hit skips the model (GPTCache idea)
# embed, vec_store, and call_llm are placeholders for your own components
def answer(query):
    emb = embed(query)
    hit = vec_store.search(emb, threshold=0.95)   # threshold is the key knob
    if hit:
        return hit.cached_answer                  # no LLM call, no LLM tokens
    ans = call_llm(query)
    vec_store.add(emb, ans)                       # store for future similar queries
    return ans

One rule: keep volatile content out of the cached prefix and put it at the end of the prompt. A prompt-cache hit requires everything up to the cache breakpoint to be byte-for-byte identical, so a dynamic timestamp at the top of the system prompt invalidates the entire cached block whenever it changes.
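A rough way to see the effect is to compare how much of two consecutive prompts is identical from the first character. This is only a proxy (providers match on the tokenized content up to the cache breakpoint, not on raw characters), and STABLE_RULES and the two builder functions are made up for the illustration.

```python
import os

STABLE_RULES = "You are a careful reviewer. Follow the house style guide. " * 20


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the identical leading run of two prompts."""
    return len(os.path.commonprefix([a, b]))


def timestamp_first(ts: str) -> str:
    """Volatile content at the top: the prefix changes on every call."""
    return f"Current time: {ts}\n{STABLE_RULES}"


def timestamp_last(ts: str) -> str:
    """Volatile content at the end: the stable rules stay a reusable prefix."""
    return f"{STABLE_RULES}\nCurrent time: {ts}"


bad = shared_prefix_chars(timestamp_first("2026-06-06 09:00"),
                          timestamp_first("2026-06-07 09:01"))
good = shared_prefix_chars(timestamp_last("2026-06-06 09:00"),
                           timestamp_last("2026-06-07 09:01"))
print("timestamp first:", bad, "shared chars")
print("timestamp last: ", good, "shared chars")
```

With the timestamp first, the two prompts diverge inside the date itself; with it last, the whole block of rules is shared and only the tail differs.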

Failure mode: a semantic-cache threshold too loose returns the old "2023 annual report" answer for a new "2024 annual report" question—the two embeddings are close, but the answer is flat wrong. Don't put high-factuality, time-sensitive queries on semantic cache; it fits FAQs, support, and stable knowledge—the "varied phrasing, stable answer" case.
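The annual-report failure can be reproduced offline. The sketch below swaps a real embedding model for a toy bag-of-words vector so it runs without any service; the SemanticCache class is a hypothetical minimal store, not GPTCache's API. The two queries differ in one token out of four, so their cosine similarity is 0.75: a 0.70 threshold serves the stale answer, a 0.95 threshold refuses.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' so the sketch runs offline.
    A real setup would call an embedding model instead."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []          # list of (embedding, answer)

    def get(self, query: str):
        """Return the best cached answer at or above the threshold, else None."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


old_q = "2023 annual report revenue"
new_q = "2024 annual report revenue"
print("similarity:", cosine(embed(old_q), embed(new_q)))   # 0.75

for threshold in (0.70, 0.95):
    cache = SemanticCache(threshold)
    cache.put(old_q, "2023 revenue figure")
    print(threshold, "->", cache.get(new_q))
```

Real embeddings tend to place such near-duplicate queries even closer together than this toy does, which is why a threshold alone is a weak guard for time-sensitive questions.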
Resources · Anthropic Prompt Caching, platform.claude.com/.../prompt-caching · GPTCache, github.com/zilliztech/GPTCache

// 04

Observability & Key Governance: See What You Spend, Control Who Spends

Claim: an AI pipeline without traces is flying blind—you don't know where the money went, which app is burning it, or how long a key has been leaked.

Background & Principle

Observability: LLM-call observability differs from ordinary API monitoring—you want to see prompt/completion pairs, token usage, per-call cost, latency, and (for agents) the full call chain. Langfuse is the individual's go-to open-source option: self-hostable, natively understands LLM concepts (tokens, cost, traces, nested spans), and integrates with OpenTelemetry and the major SDKs. Once it's in, "which side project quietly burned $40 this month" goes from guesswork to a chart.

Key governance—three iron rules: (1) real keys never enter code or git, only env vars or a secret manager, and only at the gateway layer. (2) issue each app a virtual key instead of sharing a real one: LiteLLM's virtual keys support per-key budgets, rate limits, model allowlists, and expiry—an app that runs away only burns its own quota and can be revoked alone without affecting others. (3) start budgets small ($5–50), raise by real usage, and set a hard cap on every key—the individual's only backstop against "a 3 a.m. infinite loop blows up the bill."
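Rule (1) is easy to check mechanically. A minimal sketch, assuming a key shape of "sk-" followed by 20 or more URL-safe characters: provider key formats vary and change, so the pattern is a starting point rather than a guarantee, and find_hardcoded_keys is an illustrative name. It scans source text and reports line numbers with the match redacted.

```python
import re

# Assumed key shape: "sk-" followed by 20+ URL-safe characters. Provider key
# formats vary and change, so treat this pattern as a starting point only.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{20,}")


def find_hardcoded_keys(source: str) -> list[tuple[int, str]]:
    """Return (line_number, redacted_match) for every key-like literal."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in KEY_PATTERN.finditer(line):
            secret = match.group(0)
            findings.append((line_number, secret[:6] + "..."))  # never echo the full key
    return findings


# Made-up source text with one obviously fake key and one env-var lookup.
sample = '''
import os
client_a = make_client(api_key="sk-EXAMPLE0000EXAMPLE0000EXAMPLE")
client_b = make_client(api_key=os.environ["GATEWAY_VIRTUAL_KEY"])
'''
print(find_hardcoded_keys(sample))
```

Run over each repo before committing, a check like this flags the literal on line 3 and stays silent on the environment lookup.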

Hands-on

# Langfuse: drop-in OpenAI wrapper auto-traces each call, no other code changes
from langfuse.openai import OpenAI   # drop-in replacement, auto-records token/cost/latency
client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-my-virtual-key")

# LiteLLM virtual key: per-app budget + model allowlist + expiry
# POST /key/generate   (the # notes below are annotations; strip them from real JSON)
{
  "key_alias": "side-project-blog",
  "max_budget": 20,            # this key spends at most $20 per budget window
  "budget_duration": "30d",     # the window resets every 30 days
  "models": ["cheap"],          # cheap tier only, prevents misuse of opus
  "duration": "90d"             # the key itself expires in 90 days
}

Failure mode: (1) watching only "total spend" rather than "spend per app / per key": the bill rose but you can't pin down which project caused it, which is no monitoring at all. Traces must carry app/key dimension tags. (2) Sharing one real key across apps and hardcoding it into multiple repos: one leak exposes everything, and with no per-key separation you can't even tell which app leaked or how big the blast radius is.
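What "app/key dimension tags" buy you can be shown with a few rows. The records below are invented for illustration; in practice they would come from the gateway's spend logs or a trace export, and the field names (key_alias, model, cost_usd) are assumptions rather than any tool's schema.

```python
from collections import defaultdict

# Hypothetical call log; in practice these rows would come from your
# gateway's spend logs or your tracing tool's export.
calls = [
    {"key_alias": "side-project-blog", "model": "cheap", "cost_usd": 0.004},
    {"key_alias": "cron-digest",       "model": "smart", "cost_usd": 0.310},
    {"key_alias": "side-project-blog", "model": "cheap", "cost_usd": 0.006},
    {"key_alias": "cron-digest",       "model": "smart", "cost_usd": 0.290},
]


def spend_by(records, dimension: str) -> dict[str, float]:
    """Sum cost per value of one tag, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost_usd"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {name: round(total, 4) for name, total in ranked}


print(spend_by(calls, "key_alias"))
print(spend_by(calls, "model"))
```

The total here is $0.61 either way; only the per-key view shows that the cron job accounts for nearly all of it.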
Resources · Langfuse docs, langfuse.com/docs · Langfuse source, github.com/langfuse/langfuse · LiteLLM Virtual Keys, docs.litellm.ai/docs/proxy/virtual_keys

// CAPSTONE · Stand Up Your Private AI Pipeline in One Weekend

Wire the four layers into a runnable minimal infra that every app then shares:

  1. Run the gateway: one LiteLLM config.yaml mapping a smart/cheap pair, real keys via env vars. litellm --config config.yaml, endpoint at localhost:4000.
  2. Add routing + fallback: give smart a cross-tier fallback chain (opus→sonnet→gpt); new scripts default to cheap, promote to smart only when reasoning depth is confirmed.
  3. Turn on cache: tag all stable system prompts with cache_control; for FAQ-style apps add a semantic-cache layer, starting with a strict threshold around 0.95 and tuning from there.
  4. Wire Langfuse + issue virtual keys: one virtual key per app with max_budget and a model allowlist; self-host Langfuse, and tag every call with the app.
  5. Acceptance: point the base_url of your existing 2-3 scripts at localhost:4000 and run for a week. Open Langfuse on the weekend—for the first time you see at a glance what each app spent, the hit rate, and which provider hiccuped. That chart is the payoff of your private pipeline.

Once built, "switch models" goes from editing N repos to changing one config line; "how much did AI cost this month" goes from reverse-engineering a credit-card statement to a live dashboard. That's the watershed from "knows how to use AI tools" to "owns AI infrastructure."

// GLOSSARY

LLM Gateway / Proxy
A unified entry layer funneling many providers into one OpenAI-compatible endpoint. Example: LiteLLM Proxy.
OpenAI-compatible
Using the OpenAI ChatCompletions dialect as a universal protocol; any OpenAI SDK client connects by changing only its base_url and key.
Model Routing
Proactively picking a model tier by task difficulty/cost (e.g. cheap vs smart).
Fallback Chain
An ordered model chain where a primary error auto-falls-through to the next, resisting provider hiccups.
Prompt Caching
Provider-side caching of a stable prefix's KV computation; cache read ≈ 0.1× the base price.
cache_control
The Anthropic API field marking a cache boundary, {"type":"ephemeral"}.
Semantic Cache
Hits a past answer by query-embedding similarity; on a hit the model isn't called at all. Example: GPTCache.
Observability
Full-chain visibility into LLM-call trace/token/cost/latency. Example: Langfuse.
Virtual Key
A proxy key issued by the gateway, carrying per-key budget, rate limit, model allowlist, expiry.
Budget Window
A virtual key's periodic spend cap (e.g. $20 per 30 days), preventing runaway spend.

// DEEP THINKING

The gateway funnels every call, making itself a single point of failure plus an extra hop of latency. For an individual, does centralization's upside really beat the cost?
For an individual (low traffic, many apps) the upside is overwhelming. The extra hop is process-local (milliseconds), negligible against multi-second inference; the SPOF is solved by systemd auto-restart plus a direct-connect fallback for critical apps. In return: one place to switch models, one view of cost, one place to manage keys. The cost rises with traffic scale (enterprise needs HA and horizontal scaling), but an individual is nowhere near that inflection—here centralization is nearly pure profit.
Prompt cache saves on "repeated calls with the same prefix"; semantic cache saves on "similar queries." What workload do neither help?
The "different prefix every time + new query every time" workload: one-off large-document analysis, reviewing a different codebase each time, exploratory research. The prefix changes → no reusable prefix; the query is always new → semantic hit rate ≈ 0. Such workloads get cheaper only through model routing (right tier) and the batch API (non-real-time discount). The tell that caching is misapplied: the hit rate stays below 10% while you still pay for a 1h cache, so the ~2× write premium costs more than the reads save.
Cross-provider fallback sounds robust, but it hides a dangerous assumption. What is it, and how do you defend?
The assumption: "swap the model, the output quality is equivalent." In reality prompts are tuned to a specific model; cross-vendor fallback drifts behavior—format, tone, tool-call style may all change—and you stay oblivious because "at least it didn't error." Defenses: (1) force-log the model actually used in traces and review the fallback trigger rate; (2) for quality-sensitive paths fall back only within a vendor (opus→sonnet), reserving cross-vendor fallback for "something beats nothing" non-critical paths; (3) tag cross-vendor degraded outputs so downstream can recognize them.
A virtual key's per-key budget is a "hard cap." But an LLM call's cost is only known after the response—does the budget check happen before or after the call? What boundary problem does that create?
The budget is checked against recorded cumulative spend, before the call; but a single call's cost is only settled after the response. So there's an "overshoot tail": as cumulative spend nears the cap, several concurrent requests can all pass the pre-check and then together push the account past the limit. The impact is tiny for a single-threaded individual, but agent concurrency or batched cron triggers expose it. Pragmatic move: set the budget below what you can truly absorb, leaving a buffer, and pair it with a provider-side hard spend limit as a second line—don't treat the gateway budget as an absolutely impassable wall.
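The overshoot tail is easy to simulate. The sketch below is a simplified model of a "check recorded spend, bill after the response" budget, not LiteLLM's actual implementation. Amounts are in integer cents to avoid float noise, and both function names are illustrative.

```python
def run_batch(budget_cents: int, spent_cents: int, request_costs: list[int]):
    """Concurrent case: every request is pre-checked against the same
    recorded spend, because nothing is billed until responses come back."""
    admitted = [cost for cost in request_costs if spent_cents < budget_cents]
    final_spend = spent_cents + sum(admitted)
    return len(admitted), final_spend, max(0, final_spend - budget_cents)


def run_sequential(budget_cents: int, spent_cents: int, request_costs: list[int]):
    """Single-threaded case: each call is billed before the next pre-check."""
    admitted = 0
    for cost in request_costs:
        if spent_cents >= budget_cents:
            break
        spent_cents += cost
        admitted += 1
    return admitted, spent_cents, max(0, spent_cents - budget_cents)


# $20.00 cap, $19.50 already recorded, five calls at $0.30 each.
print(run_batch(2000, 1950, [30] * 5))        # (5, 2100, 100)
print(run_sequential(2000, 1950, [30] * 5))   # (2, 2010, 10)
```

Even the sequential case ends ten cents over, since the last admitted call is checked before its cost is known; concurrency multiplies that tail by the number of requests in flight, which is the size of the buffer to leave under the cap.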
This infra is the watershed from "uses AI" to "owns AI infra." But could it instead become complexity debt that locks you in? When should you tear it down?
It can, if over-engineered. The criterion is pain-driven: you have ≥3 apps, the monthly bill needs allocating, you've switched models ≥2 times—only when these three signals appear is it worth building; otherwise a single config.py centralizing keys is enough. Signs to simplify/tear down: a layer's config hasn't changed in six months (it's not solving a real problem); the time maintaining the gateway exceeds the time it saves. Infra's value is "build once, every app benefits"—once apps converge to one, or all migrate to a hosted IDE, this self-built pipeline should degrade back to a lightweight config file. Don't architect for architecture's sake.

// FURTHER READING