DAY 28 / PHASE 3 · FRONTIER

Local & Edge LLM

Quantization · Memory Math · Runtime Differences · Hybrid Routing

2026-06-11 · BigCat

A local model isn't "a cheaper Claude" — it's a component with its own engineering constraints.

// WHY THIS MATTERS

By 2026, a 32GB Mac can comfortably run open-weight models in the Qwen / Llama / gpt-oss tier. So plenty of people drop a local model into an agent as a "free Claude" — and hit a wall of mysterious regressions: long context suddenly amnesiac, tool calls breaking, the same model behaving differently in Ollama vs MLX. The problem isn't the model — it's that local inference is a different discipline from calling a cloud API: you personally carry the quantization decision, the memory budget, the runtime choice, the constrained decoding. This issue doesn't explain "what quantization is" or "how to install Ollama" (that's 101). It covers the four engineering judgments that make or break a local setup: what quantization actually loses, how to compute whether memory fits, where the four runtimes differ, and the real boundary of small-model agents plus hybrid routing. The goal: answer "local or cloud for this task?" with numbers behind it.

// 01

Quantization Is an Engineering Decision, Not a Free Lunch

Claim: barely-changed perplexity ≠ no loss. Quantization hurts reasoning / code / long context far more than PPL shows.

Background & Principle

First-principles memory formula: weight bytes = param count × bytes per param. FP16 is 2 bytes/param, so a 7B model is 14GB raw; quantize to 4-bit (roughly 0.5 to 0.6 byte/param once scale overhead is included) and it drops to ~4GB, which is the precondition for running locally at all. GGUF's K-quants (Q4_K_M, Q5_K_M) are today's de-facto standard: they use different precision for different tensors (higher bits for the more sensitive attention and feed-forward tensors), and Q4_K_M is widely treated as the size/quality sweet spot. Layer on imatrix (importance matrix): a calibration corpus measures "which weights, when nudged, move the output most," and quantization protects those first, driving perplexity loss lower still.

But the real trap is the metric. The community habitually measures quantization loss with perplexity, and PPL is the exponential of the average negative log-likelihood per token, so it is deeply insensitive to "occasionally picking one critical token wrong." Yet exactly that occasional error is what fails a compile or snaps a reasoning chain. So you'll see "Q4's PPL only rose 1%" while real coding / math tasks drop a whole tier. The sensitivity of multi-step reasoning and precise generation to quantization is systematically understated by an averaged PPL.
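A toy calculation makes the averaging effect concrete. The per-token probabilities below are invented for illustration (they are not measurements of any model): a 200-token completion where quantization wrecks the model's confidence on exactly one critical token.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical completion: the model assigns 0.9 to every correct token.
baseline = [0.9] * 200

# Same completion after quantization: one critical token (a variable name,
# a sign, a closing bracket) now gets only 0.2, so it will usually be wrong.
quantized = [0.9] * 199 + [0.2]

ppl_base = perplexity(baseline)
ppl_quant = perplexity(quantized)
rel_change = (ppl_quant - ppl_base) / ppl_base

print(f"baseline PPL  = {ppl_base:.4f}")
print(f"quantized PPL = {ppl_quant:.4f}")
print(f"relative rise = {rel_change:.2%}")
```

In this sketch the perplexity rises by less than 1%, while the chance of getting that one token right fell from 90% to 20%. A compile-or-fail task sees the second number, not the first.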

Hands-on

# Don't trust PPL alone — A/B with samples from YOUR task
# Pull two quant levels of the same model, run your real 20 evals
ollama pull qwen3:8b              # default is usually Q4_K_M
ollama pull qwen3:8b-q8_0         # high-precision control

# Quantize with your own imatrix (max fidelity):
# 1) compute importance matrix on domain-representative text
./llama-imatrix -m model-f16.gguf -f calib.txt -o imat.dat
# 2) attach --imatrix at quantize time; critical weights are protected
./llama-quantize --imatrix imat.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M

The discipline: pick the quant level by task, not by "it runs." Forgiving tasks like extraction / classification tolerate Q4 or lower; code / math / long-chain reasoning warrant Q5_K_M / Q8, or a smaller-but-full-precision model.
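A minimal sketch of that A/B, assuming you wrap your model call in a function `ask(model, prompt)` (hypothetical name; in practice it would call the local OpenAI-compatible endpoint) and pair each prompt with a checker for your own task. `fake_ask` is a stand-in so the sketch runs offline; the model tags are the ones pulled above.

```python
from typing import Callable, Iterable, Tuple

# One eval case: (prompt, checker that returns True if the answer is acceptable)
Case = Tuple[str, Callable[[str], bool]]

def pass_rate(ask: Callable[[str, str], str], model: str,
              cases: Iterable[Case]) -> float:
    """Fraction of cases whose answer passes its own checker."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for prompt, check in cases if check(ask(model, prompt)))
    return passed / len(cases)

def compare_quants(ask, cases, low="qwen3:8b", high="qwen3:8b-q8_0"):
    """Run the same cases against two quant levels of the same model."""
    cases = list(cases)
    return {low: pass_rate(ask, low, cases), high: pass_rate(ask, high, cases)}

def fake_ask(model: str, prompt: str) -> str:
    # Stand-in for a real call: pretends the lower quant drops a bracket.
    return "[1, 2, 3" if model == "qwen3:8b" else "[1, 2, 3]"

cases = [("Return the numbers 1 to 3 as a JSON list.",
          lambda out: out.strip().endswith("]"))]
scores = compare_quants(fake_ask, cases)
print(scores)
```

With around 20 real cases the pass rates are noisy, so treat a small gap as "no evidence of loss" rather than proof of equality; a gap of several cases is the signal to move up a quant level.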

Failure modes: (1) Aggressive quantization on small models — a 3B has little redundancy to begin with, so Q4's relative damage far exceeds the same quant on a 70B; don't go below Q5 if you want quality from a small model. (2) Quantizing the KV cache to Q4 to save memory — long-context retrieval degrades noticeably, because KV cache precision directly governs the model's ability to "look back" at early tokens. Q8 is the safe line for KV cache; use Q4 with caution.
Going deeper · llama.cpp quantize README, github.com/ggml-org/llama.cpp/.../quantize · imatrix discussion, llama.cpp discussion #5006
// 02

Memory Math, and Judging "Good Enough Locally"

Claim: can-it-run = weights + KV cache + overhead ≤ your memory. Most OOMs die on the ignored KV cache.

Background & Principle

"Will this model run on my machine" isn't a lookup, it's arithmetic. The budget has three parts:

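  1. Weights: param count × bytes per param at your quant level.
  2. KV cache: 2 (K and V) × layers × KV heads × head dim × context length × bytes per element.
  3. Overhead: runtime buffers and activations; budget roughly 1.5GB of headroom.

Sum these three against your VRAM (or Apple Silicon unified memory) to know "does it fit." More important is the second judgment: fits ≠ should use. Local models have a clear "good-enough zone" and a "don't-touch zone":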

Good-enough locally ──────────▶ Go cloud

classify / extract / route / PII        frontier reasoning / hard math
fixed-format rewrite / summarize        long context (beyond local stability)
offline / strict-privacy contexts       multi-step agentic (needs reliability)
high-volume low-cost grunt work         one-off, low-frequency, best quality

Test: task tolerance × privacy need × call volume
  ↑tolerant ↑private ↑frequent → local pays off
  ↑needs-best ↑rare → just go cloud

Hands-on

# 30-second estimate: does the model fit?
weights_GB = params_B * bytes_per_param   # Q4_K_M≈0.6, Q8_0≈1.06, FP16=2.0
kv_GB      = 2 * layers * kv_heads * head_dim * ctx * 2 / 1e9  # K and V, 2 bytes each (FP16 KV)
need_GB    = weights_GB + kv_GB + 1.5     # +overhead headroom

# e.g. Llama-70B Q4, 32K context, GQA(kv_heads=8)
#  weights ≈ 40GB, kv ≈ 10GB+ → need ≈ 52GB
#  → a 48GB machine "that runs 70B" can't actually open 32K context
Failure mode: choosing a machine by weight size only, forgetting KV cache. "48GB runs 70B Q4 (40GB)" holds at 2K context, but open long context and it OOMs or gets silently truncated by the framework. When buying hardware / picking a model, budget the KV cache at your real context length — don't look at the model file size alone.
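
The estimate above can be turned into a small function and run at two context lengths. This is a sketch under stated assumptions: 0.6 bytes/param as a Q4_K_M-style figure, the published Llama 70B shape (80 layers, 8 KV heads, head dim 128), FP16 KV cache, and the same 1.5GB headroom as above.

```python
def estimate_memory_gb(params_b, bytes_per_param, layers, kv_heads,
                       head_dim, ctx, kv_bytes=2.0, overhead_gb=1.5):
    """Rough fit check: weights + KV cache + fixed overhead, in decimal GB."""
    weights_gb = params_b * bytes_per_param
    # Leading 2 = one K and one V tensor per layer.
    kv_gb = 2 * layers * kv_heads * head_dim * ctx * kv_bytes / 1e9
    return {
        "weights_gb": weights_gb,
        "kv_gb": kv_gb,
        "total_gb": weights_gb + kv_gb + overhead_gb,
    }

# Assumed shape for a Llama-70B-class model with GQA.
llama70b = dict(params_b=70, bytes_per_param=0.6,
                layers=80, kv_heads=8, head_dim=128)

short_ctx = estimate_memory_gb(ctx=2048, **llama70b)
long_ctx = estimate_memory_gb(ctx=32768, **llama70b)

for name, est in (("2K", short_ctx), ("32K", long_ctx)):
    fits = "fits" if est["total_gb"] <= 48 else "does NOT fit"
    print(f"{name}: kv={est['kv_gb']:.1f}GB total={est['total_gb']:.1f}GB "
          f"-> {fits} in 48GB")
```

The weights are identical in both runs; only the KV cache moves, from under 1GB at 2K context to over 10GB at 32K, and that alone pushes the total past 48GB.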
Going deeper · Simon Willison How to run an LLM on your laptop, simonwillison.net/2025/Jul/18 · Apple Silicon runtime study, arXiv:2511.05502
// 03

Runtimes Aren't Equal: llama.cpp / Ollama / MLX / LM Studio

Claim: the same weights behave differently across runtimes. The worst trap isn't speed — it's Ollama silently truncating context.

Background & Principle

"Running a model locally" is really a layered stack, each layer solving something different:

┌───────────────────────────────────────────────┐
│ LM Studio   GUI + model store                 │
│             (backend: llama.cpp / MLX)        │
├───────────────────────────────────────────────┤
│ Ollama      model registry + Modelfile        │
│             + OpenAI-compatible server        │
│             (wraps llama.cpp)                 │
├───────────────────────┬───────────────────────┤
│ llama.cpp             │ MLX / mlx-lm          │
│ cross-platform C++    │ Apple Silicon native  │
│ GGUF · GBNF grammar   │ unified-mem zero-copy │
└───────────────────────┴───────────────────────┘

Hands-on

Ollama's most insidious trap: a small default num_ctx (historically often 2048/4096). You think you're using the model's 128K window, but it only sees a few thousand tokens and silently drops the rest — "unexplained amnesia" on long-context tasks is this nine times out of ten. Raise it explicitly via the Modelfile:

# Modelfile: set context explicitly, don't eat the silent default
FROM qwen3:8b
PARAMETER num_ctx 32768

# create the model; the running Ollama server then exposes it
ollama create qwen3-32k -f Modelfile

# then point the OpenAI SDK straight at local — almost no code change
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(model="qwen3-32k",
        messages=[{"role":"user","content":"..."}])
Failure modes: (1) Inferring local behavior from cloud benchmarks — quant level, runtime, and KV precision all differ; you must re-measure on your runtime. (2) Using a generic GGUF route on a Mac and complaining it's slow — switching to MLX on the same machine often lifts throughput directly; picking the wrong runtime costs more than picking the wrong model. (3) Forgetting num_ctx and blaming "the model's long context is bad."
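
If you would rather set the window per request than bake it into a Modelfile, Ollama's native `/api/chat` endpoint accepts an `options` object with `num_ctx`. The sketch below uses only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, and the host is the default local address used earlier.

```python
import json
import urllib.request

def build_chat_payload(model, messages, num_ctx):
    """Request body for Ollama's native chat endpoint with an explicit window."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,                   # one JSON response instead of chunks
        "options": {"num_ctx": num_ctx},   # context window for this request
    }

def chat(payload, host="http://localhost:11434"):
    """POST the payload to a running Ollama server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_payload(
    "qwen3:8b",
    [{"role": "user", "content": "Summarize the document above."}],
    num_ctx=32768,
)
# reply = chat(payload)   # requires a running Ollama server
# print(reply["message"]["content"])
```

When debugging suspected truncation, the `prompt_eval_count` field in the response is one place to look: it reports how many prompt tokens were processed, which you can compare against what you believe you sent.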
Going deeper · Ollama OpenAI compatibility, docs.ollama.com/api/openai-compatibility · Apple MLX, github.com/ml-explore/mlx
// 04

The Boundary of Small-Model Agents, and Hybrid Routing

Claim: shaky local tool-use is an engineering problem, not a capability one — constrained decoding guarantees syntax, hybrid routing guarantees quality.

Background & Principle

Drop an 8B local model into an agentic loop and the usual collapse is unstable tool-call format: a missing bracket, truncated JSON, a misspelled field. One parse error kills the whole loop. Cloud models do this so well you forget how hard it is; small models don't have that luxury. The fix isn't "ask the model to behave," it's constrained decoding: llama.cpp's GBNF grammar / JSON Schema support zeroes out the probability of any grammar-violating token at the sampling layer, so every emitted token keeps the output on a valid-JSON path (provided generation isn't cut off by the token limit before the object closes). That's far more reliable than writing "please output JSON" in the prompt.
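To see what "zeroing out at the sampling layer" means, here is a toy version of one sampling step. It is an illustration of the masking idea only, not llama.cpp's implementation: the four-token vocabulary and the logits are made up, and the "grammar" is reduced to a set of tokens allowed at the current position.

```python
import math
import random

def constrained_sample(logits, allowed, rng):
    """Sample one token after dropping every token the grammar disallows."""
    # Grammar-violating tokens are removed, i.e. given probability zero.
    masked = {tok: lg for tok, lg in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("grammar allows no token in this vocabulary")
    # Softmax over the survivors (shifted by the max for numerical stability).
    top = max(masked.values())
    tokens = list(masked)
    weights = [math.exp(masked[tok] - top) for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up logits: the model "wants" to start with chatty prose.
logits = {"Sure": 3.0, "Here": 2.0, "{": 1.0, "[": 0.5}
# At the first position of a JSON object, only "{" is grammatical.
allowed = {"{"}

rng = random.Random(0)
samples = [constrained_sample(logits, allowed, rng) for _ in range(1000)]
print(set(samples))
```

Unconstrained, "{" would be sampled only a small fraction of the time here; constrained, it is the only possible outcome, regardless of what the prompt said.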

But mind the boundary: it guarantees syntax, not semantics. The model can still fill in a syntactically valid but factually wrong value. And in llama.cpp's implementation, the schema only constrains output — it is not injected into the prompt; the model can't "see" the schema, so if you want it to understand field meanings, you still describe them in the prompt.
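Since the model cannot see the schema, one workable pattern is to generate the prompt-side description from the same schema object, so the two never drift apart. A minimal sketch: `describe_schema` and the `notes` text are hypothetical, and the schema mirrors the intent/urgent example used in this section.

```python
def describe_schema(schema, notes):
    """Render a JSON Schema's top-level fields as plain-language instructions."""
    lines = ["Reply with one JSON object using exactly these fields:"]
    for name, spec in schema["properties"].items():
        if "enum" in spec:
            kind = "one of " + ", ".join(str(v) for v in spec["enum"])
        else:
            kind = spec.get("type", "any")
        line = f"- {name} ({kind})"
        if name in notes:
            line += f": {notes[name]}"   # the meaning the constraint cannot convey
        lines.append(line)
    return "\n".join(lines)

schema = {
    "type": "object",
    "properties": {
        "intent": {"enum": ["search", "refund", "other"]},
        "urgent": {"type": "boolean"},
    },
    "required": ["intent", "urgent"],
}
notes = {
    "intent": "what the customer is asking for",
    "urgent": "true only if the customer reports being blocked right now",
}

system_prompt = describe_schema(schema, notes)
print(system_prompt)
```

The grammar then enforces the shape while the system prompt carries the meaning; neither replaces the other.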

The real engineering answer is usually hybrid: the local model does the high-volume, forgiving, privacy-sensitive grunt work (routing classification, PII redaction, first-pass filtering, drafting), and the hard, low-frequency, must-be-best steps go to a cloud frontier model. Local serves as router and fallback, not the full-time lead.

                            ┌── simple/frequent/private ────▶ local 8B (constrained)
request ─▶ local classifier ┤
                            └── complex/rare/best-quality ──▶ cloud frontier
                                (local can also draft, then cloud verifies)
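The routing decision in the diagram can be sketched as a small function over the classifier's JSON label. The policy here is an assumption for illustration: only `search` stays local, and anything else (including a malformed or unexpected label) escalates to the cloud, so classifier failures degrade toward quality rather than toward silence.

```python
import json

LOCAL_INTENTS = {"search"}   # hypothetical policy: what is safe to keep local

def route(classifier_output: str) -> str:
    """Return 'local' or 'cloud' for one request, given the classifier's JSON."""
    try:
        label = json.loads(classifier_output)
        intent = label["intent"]
        urgent = label["urgent"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "cloud"            # unparseable label: fall back to the cloud
    if not isinstance(urgent, bool):
        return "cloud"            # wrong type: treat as a failed classification
    if intent in LOCAL_INTENTS and not urgent:
        return "local"
    return "cloud"                # refunds, urgent tickets, and everything else

print(route('{"intent": "search", "urgent": false}'))
print(route('{"intent": "refund", "urgent": true}'))
print(route('{"intent": "sea'))
```

Defaulting to the cloud on any doubt costs money but not correctness; flip the default only after your logs show the classifier is trustworthy on that class of request.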

Hands-on

# Constrain local output with JSON Schema (llama.cpp server / llama-cpp-python)
# the sampling layer forces valid JSON — even small models can't emit garbage
schema = {"type":"object",
  "properties":{"intent":{"enum":["search","refund","other"]},
                "urgent":{"type":"boolean"}},
  "required":["intent","urgent"]}

# client: the OpenAI client pointed at the local server (see §03); ticket: the request text
r = client.chat.completions.create(model="qwen3-32k",
      messages=[{"role":"user","content": ticket}],
      extra_body={"response_format":{"type":"json_schema",
                  "json_schema":{"schema": schema}}})
# intent=search → handle locally; intent=refund & urgent → escalate to cloud
Failure modes: (1) Assuming constrained = correct — it guarantees {"urgent":true} is valid JSON, not that true is the right judgment; semantic errors still happen. (2) Expecting an 8B to run a full multi-tool agentic loop — as steps pile up, accumulated error rate drags success to unusable; those belong in the cloud. (3) A hybrid whose local classifier is itself inaccurate — wrong model at the routing layer poisons everything downstream; the classifier must be accurate enough, or itself a cheap cloud tier.
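
The compounding in failure mode (2) is simple arithmetic if you assume each step succeeds independently with the same probability. The per-step rates below are illustrative, not measurements of any model.

```python
def loop_success(per_step: float, steps: int) -> float:
    """Chance that every step of an n-step loop succeeds (independent steps)."""
    return per_step ** steps

for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"{n} steps: {loop_success(p, n):.0%}" for n in (1, 5, 10, 20))
    print(f"per-step {p:.0%} -> {row}")
```

A step that succeeds 95% of the time completes a 20-step loop only about 36% of the time. Under this model, shortening the loop or raising per-step reliability both help, and the hybrid design does the former by keeping the local model on one-step jobs.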
Going deeper · llama.cpp GBNF grammars, github.com/ggml-org/llama.cpp/.../grammars · Simon Willison llama.cpp grammars to generate JSON, til.simonwillison.net

// Capstone · Build Yourself a Local-First "Triage" Layer

String the four points into a weekend project that saves money and protects privacy: add a local triage layer in front of one of your existing cloud calls.

  1. Pick quant & model (§1): classification is forgiving, so start with an 8B Q4_K_M; A/B Q4 vs Q8 on your 20 real samples first, confirm Q4 doesn't drop quality, then commit.
  2. Compute the memory budget (§2): include the KV cache at the context length you need; confirm it runs stably — don't discover the OOM in production.
  3. Choose a runtime (§3): try MLX first on a Mac; otherwise spin up Ollama's OpenAI-compatible server with an explicit, large enough num_ctx in the Modelfile.
  4. Constrain + route (§4): have the local classifier output {intent, urgent} under JSON Schema; handle the simple/private cases locally, escalate the complex/high-value ones to a cloud frontier model.
  5. Quantify the gain: run it for a week, log "share handled locally × token cost saved per call" and "accuracy of escalated cloud cases." Most classification/routing setups find 60%+ of requests never need the cloud — costs halved, privacy protected as a bonus.
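
Step 5 can be as small as a summary over a per-call log. A sketch with a hypothetical log record: `cloud_cost_usd` is what the call costs (or would have cost) on the cloud model, and `escalation_correct` is a label you assign when reviewing escalated calls. It ignores the cost of running the local model, so read the savings figure as an upper bound.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallLog:
    handled_locally: bool
    cloud_cost_usd: float                      # actual or avoided cloud cost
    escalation_correct: Optional[bool] = None  # reviewed label, escalated calls only

def summarize(logs):
    """Local share, cloud spend avoided, and accuracy of reviewed escalations."""
    if not logs:
        raise ValueError("no calls logged")
    local = [c for c in logs if c.handled_locally]
    reviewed = [c for c in logs
                if not c.handled_locally and c.escalation_correct is not None]
    accuracy = (sum(c.escalation_correct for c in reviewed) / len(reviewed)
                if reviewed else None)
    return {
        "local_share": len(local) / len(logs),
        "cost_saved_usd": sum(c.cloud_cost_usd for c in local),
        "escalation_accuracy": accuracy,
    }

# Made-up week of four calls, for illustration only.
logs = [
    CallLog(True, 0.01),
    CallLog(True, 0.01),
    CallLog(True, 0.01),
    CallLog(False, 0.04, escalation_correct=True),
]
report = summarize(logs)
print(report)
```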

Once you've built this, "local vs cloud" stops being a feeling and becomes a decision line with numbers behind it — exactly the cost intuition a super-individual should carry.

// DEEP THINKING

If perplexity systematically understates quantization's harm to reasoning, why does the community still compare quant levels with PPL?
Because PPL is cheap, deterministic, and reproducible — one fixed corpus, one pass, a number; no task design, no judge noise. Task-level eval is expensive and subjective. It's the classic compromise between "what's measurable" and "what you want to measure." The right move: use PPL as a coarse filter (a PPL spike eliminates a candidate), but decide the final level via A/B on your own task samples. A metric's convenience shouldn't dictate your engineering decision — a universal rule across all eval work.
Unified memory lets a Mac run big models with "lots of memory," so why does it still lose to a discrete GPU of equal memory under heavy load?
The bottleneck is memory bandwidth and compute, not just capacity. A discrete GPU's GDDR/HBM bandwidth typically exceeds that of Mac unified memory (by a wide margin at the datacenter tier), and it brings stronger parallel compute. Unified memory wins on "large capacity, zero copy, energy efficiency": it fits models others can't and feels great for single-stream low latency, but high-concurrency batch throughput and compute-dense very-large-model work still favor the discrete GPU. Choose by whether you're "single-user low-latency" or "multi-stream high-throughput."
Constrained decoding can 100% guarantee valid JSON — could it actually lower output quality?
Yes, and it's a real trade-off. Forcibly masking tokens can cut off the high-probability path the model "wanted" to take, pushing it into a sub-optimal branch and slightly lowering quality on some tasks. Subtler still: if the schema conflicts with the model's natural tendency (it wants to explain then conclude, but you only allow pure JSON), the constraint amplifies the awkwardness. Mitigations: make the schema fit the model's natural output, add a reasoning field to let it "think" first when needed, or split complex structures into two steps (free generation, then extraction).
Hybrid routing sounds great, but adding a local classifier adds a failure point and latency. When is hybrid actually over-engineering?
When call volume is too low. Hybrid's payoff comes from "a large share of high-frequency requests are actually simple" — at low volume the cost saved can't cover the complexity of maintaining a local stack (deploy, monitor, the classifier's own errors). Low-frequency cases are simpler all-cloud. The test rhymes with Day 3's "workflow vs agent": start with the simplest (all-cloud), add hybrid only when cost/privacy genuinely becomes the bottleneck. Don't go hybrid just to "use a local model."
The "good-enough zone" keeps expanding as models get stronger. Will it eventually squeeze cloud frontier models into a narrow corner?
The zone is indeed expanding — what an 8B does today took GPT-3.5 three years ago. But frontier capability rises too; the two are a race, not a convergence. The likelier steady state: local absorbs the "mature, standardizable, not-quality-sensitive" long tail (classify, extract, rewrite), while cloud holds the frontier needing freshest knowledge, strongest reasoning, longest context. The boundary keeps shifting right but won't vanish — just as local databases never killed cloud databases, they just each found their place. The real leverage is judgment: knowing which side a task falls on right now.

// FURTHER READING