DAY 30 / PHASE 3 · FRONTIER

Open Source Models in Practice

Llama · Qwen · DeepSeek · Mistral — Selection / Quantization / Porting / MoE

2026-06-13 · BigCat

Open-weight models aren't a "budget Claude" — they're a different runtime with their own engineering discipline.

// WHY THIS MATTERS

By 2026, open weights are no longer toys: DeepSeek-V3 / R1, Qwen3, Llama, and Mistral approach closed-source SOTA on many tasks. But "close on a leaderboard" and "able to replace Claude in your production pipeline" are separated by a whole layer of engineering. Most people trip on three illusions: that open = cheap (it actually buys you control, privacy, and the ability to fine-tune, while saddling you with ops and quality risk); that downloading means running (you're actually stuck on VRAM math and quantization loss); and that you can paste your Claude prompt straight over (chat templates, tool use, and structured output all need redoing). This issue does not cover "what Llama is"; that's ai-ml-daily's job. We cover four engineering matters: which tasks open truly handles and how License acts as a hidden constraint; the real ledger of quantization and VRAM; the fragility of prompt/tool porting; and the build reality of MoE and reasoning models.

// 01

Selection & License: Where Open Truly Replaces, and License as Hidden Constraint

Claim: choosing open is a "control / privacy / fine-tunability" decision, not a "save money" one — and there's a sharp task cliff on capability.

Background & Principle

Split "should we go open" into two independent questions. Capability: open is already good enough on "short, closed, verifiable" tasks — classification, extraction, fixed-domain RAG Q&A, translation, rewriting, mid-difficulty code completion. It still has a clear cliff on "long, open, sequential-decision" tasks — multi-step agentic loops, complex tool orchestration, frontier reasoning. The cause isn't a single weak skill, it's error accumulation: if closed is 95% right per step and open is 88%, after 20 steps that 7pp gap balloons into disaster.
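To make the compounding concrete, here is a minimal sketch. It assumes every step succeeds independently with the same probability, which is a simplification (real steps are correlated). `chain_success` is an illustrative helper name, not part of any library.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps

for label, p in [("closed", 0.95), ("open", 0.88)]:
    print(f"{label}: {chain_success(p, 20):.1%} end-to-end over 20 steps")
```

Under these assumptions a 7pp per-step gap becomes roughly 36% versus 8% end to end, which is the cliff described above.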

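The License question matters more than most assume. Llama ships under Meta's own community license, not an OSI open-source one. It has a 700M monthly-active-user cap (above which you need a separate grant from Meta), and its terms on model outputs vary by version: the Llama 2 and Llama 3 licenses forbid using outputs to improve any non-Llama model, while Llama 3.1 and later permit it but require any distributed model trained on those outputs to carry "Llama" at the start of its name. Either way, the "use Llama to generate synthetic data for my own model" path comes with strings attached. By contrast, Qwen and Mistral are mostly Apache 2.0 and DeepSeek models are mostly MIT, which is far cleaner, though individual checkpoints differ, so read the license file of the exact one you deploy. If your plan involves distillation, synthetic data, or possibly going big, License selection belongs ahead of capability benchmarking.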

Hands-on

# 30-second selection (pin it in your project)
# Go open iff ≥1 holds:
- Data can't leave premises / strict compliance   # privacy = the hardest reason
- Task is fixed and high-frequency, worth a fine-tuned specialist
- Volume so high the closed API cost structure is untenable
- Need determinism/reproducibility/version-locking (closed swaps models silently)

# Task capability cliff (heuristic, not absolute):
fine:    classification / extraction / translation / rewrite / fixed RAG / single-step code
careful: 3-8 step workflows (use workflow, not agent, as a net)
avoid:   long agentic loops / complex multi-tool orchestration / frontier math reasoning

# License before capability:
distill/synthetic-data/may-scale → avoid Llama, pick Qwen(Apache) / DeepSeek(MIT)

Failure mode: forcing an open model through a long agentic loop, then concluding "open can't do it." The real fix is often to lower the architecture — fold the agent back into a deterministic workflow (Days 3 / 5), let the open model only do "short tasks" at each node, and pull control flow back into code. Open model + workflow often beats closed model + bare agent.
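A minimal sketch of what "pull control flow back into code" can look like. Everything here is hypothetical: `call_model` stands for whatever function sends a prompt to your model and returns its text, and the labels and routes are invented for illustration.

```python
import json
from typing import Callable

def run_node(call_model: Callable[[str], str], prompt: str,
             allowed: set[str], max_tries: int = 3) -> str:
    """One workflow node: a short task, validated and retried by code."""
    for _ in range(max_tries):
        raw = call_model(prompt)
        try:
            label = json.loads(raw)["label"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed output: retry instead of propagating it
        if isinstance(label, str) and label in allowed:
            return label
    return "needs_review"  # deterministic fallback chosen by code, not the model

def triage(ticket: str, call_model: Callable[[str], str]) -> str:
    """Control flow lives here, in plain code; the model only labels."""
    kind = run_node(call_model, f"Classify: {ticket}", {"bug", "billing", "other"})
    if kind == "bug":
        return "route_to_engineering"
    if kind == "billing":
        return "route_to_finance"
    return "route_to_human"
```

The model never chooses the next step; it only fills in one validated field per node, so a bad output costs a retry rather than a wrong branch.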

Going deeper · Llama 3.3 Community License, llama.com/.../license · DeepSeek-V3 Technical Report, arXiv:2412.19437

// 02

Quantization & VRAM: The Real Ledger of Running It

Claim: the real barrier to open models isn't downloading — it's VRAM math and the quantization-loss curve. Pick the wrong bit-width or runtime and capability evaporates.

Background & Principle

The VRAM budget has three parts: weights (params × bytes/param) + KV cache (grows linearly with context length and concurrency) + activation/framework overhead. A 32B model in FP16 is 64GB of weights alone: far beyond any consumer card, and on an 80GB data-center card it leaves little room for KV cache. Quantized to 4-bit, weights drop to about 16GB, within reach of a consumer dual-card setup or even a single 24G card. So quantization isn't an "optional optimization"; it's the precondition for open to be deployable at all.
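The KV-cache term can be sketched from a model's attention shape. This assumes a standard grouped-query attention layout with an FP16 cache; architectures that compress the cache (such as MLA) will not follow it. The config values below are illustrative, so read the real ones from the model's `config.json`.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 ctx_tokens: int, batch: int = 1, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: one key and one value vector per layer, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * ctx_tokens * batch / 1024**3

# Hypothetical GQA config: 64 layers, 8 KV heads, head_dim 128, FP16 cache
print(kv_cache_gib(64, 8, 128, ctx_tokens=8192))            # 2.0
print(kv_cache_gib(64, 8, 128, ctx_tokens=8192, batch=16))  # 32.0
```

Sixteen concurrent 8K-token requests need sixteen times the cache of one, which is why this term, not the weights, usually decides how much concurrency a card can serve.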

Quant formats split into two lines. Weight-only low-bit: GGUF (the llama.cpp ecosystem, top pick for CPU/Mac/hybrid inference) and AWQ / GPTQ (4-bit, GPU inference; AWQ's idea is that protecting just ~1% of salient weights greatly reduces error). FP8: minimal precision loss, but needs newer GPUs. The empirical sweet spot is 4-bit (e.g. GGUF's Q4_K_M): quality loss is usually negligible; below 4-bit (Q3/Q2) degradation gets steep, hitting reasoning and long outputs hardest. Runtime also splits by scenario: llama.cpp / Ollama are flexible, fast to set up, and good for local and single-user work; vLLM pairs PagedAttention (paged KV-cache memory management) with continuous batching for high throughput, making it the default for production concurrency. Don't mix the two scenarios up.

┌──────────────── VRAM × Quantization Decision Map ────────────────┐
│                                                                  │
│  bit  ▏    7B    32B    70B  ▏ quality loss                      │
│  ─────┼──────────────────────┼─────────────────────────────────  │
│  FP16 ▏   14G    64G   140G  ▏ baseline (lossless)               │
│  8bit ▏    7G    32G    70G  ▏ near-lossless                     │
│  4bit ▏   ~4G   ~16G   ~35G  ▏ negligible ◀ sweet spot           │
│  3bit ▏    3G    12G    26G  ▏ visible degradation               │
│  2bit ▏    2G     8G    18G  ▏ collapses ✗                       │
│                                                                  │
│  + KV cache: grows linearly with ctx×concurrency, don't skip it  │
│                                                                  │
│  runtime: local/Mac/single-user       → llama.cpp/Ollama         │
│           production/high-concurrency → vLLM(PagedAttn)          │
└──────────────────────────────────────────────────────────────────┘
(weight VRAM is approximate: params × bits/8; framework overhead comes on top)

Hands-on

# Rough VRAM estimate: will it fit my card
def vram_gb(params_b, bits=4, ctx_k=8, batch=1):
    weights = params_b * bits / 8                 # weights
    kv      = 0.5 * ctx_k * batch                 # KV cache rough(GB), varies by model
    return round((weights + kv) * 1.2, 1)         # ×1.2 framework overhead

vram_gb(32, bits=4, ctx_k=8)   # → 24.0G: right at the limit of a 24G card; shorten ctx for headroom
vram_gb(70, bits=4, ctx_k=32)  # → 61.2G: needs multiple cards or an 80G A100

# For production, use vLLM not Ollama for concurrency
# vllm serve Qwen/Qwen3-32B-AWQ --quantization awq --max-model-len 16384

Failure mode: (1) cramming into a small card with Q2/Q3 to run reasoning tasks — the chain-of-thought drifts worse the longer it reasons, and the saved VRAM buys unusable output. (2) Using Ollama for production concurrency — it shines at single-user interaction; multi-stream throughput lags vLLM badly. Conversely, burning hours forcing vLLM onto a Mac is also the wrong tool. (3) Counting only weights and forgetting KV cache, then OOM-ing the moment a long context arrives.

Going deeper · AWQ (Lin et al., MLSys 2024), arXiv:2306.00978 · vLLM / PagedAttention (Kwon et al., SOSP 2023), arXiv:2309.06180 · llama.cpp, github.com/ggml-org/llama.cpp

// 03

Porting Fragility: You Can't Paste Claude's Prompt Over

Claim: copying a closed-source prompt onto an open model silently degrades it; chat templates and constrained decoding are required coursework for open deployment.

Background & Principle

Three most-underrated pitfalls. ① Chat template: each open model has its own conversation special tokens (system/user/assistant boundaries, tool-block format). Use the wrong template — another model's format, or hand-assembled strings — and the model doesn't error; it just quietly gets dumber. Always use the official tokenizer's apply_chat_template; never hand-write it. ② Tool use / function calling: open models' tool calling is far more brittle than Claude's — JSON often drops brackets, drifts field names, or mixes in explanatory prose. Prompting "output only JSON" is nowhere near enough. ③ Few-shot weight: instructions a closed model handles zero-shot often need 2-3 examples to be stable on open — Examples > Rules (Day 9) is even more extreme here.

The engineering fix for ② is constrained decoding: at sampling time, zero out the probabilities of illegal tokens, mechanically guaranteeing valid JSON / conformance to a given grammar. vLLM has built-in guided decoding, often backed by engines like XGrammar that achieve near-zero overhead. This turns "pray the model keeps format" into "physically impossible to break format" — the same idea as the permission gate in Day 3: don't leave capability to the model's discretion, enforce it in the runtime.
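The masking step itself is small. The toy sketch below shows it for a single decoding step over a four-token vocabulary. It is not how vLLM or XGrammar are implemented (they compile a grammar and compute the allowed set per step); `constrained_softmax` and the example logits are made up for illustration.

```python
import math

def constrained_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Softmax over the vocabulary with disallowed tokens forced to probability 0."""
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    peak = max(masked.values())
    if peak == -math.inf:
        raise ValueError("grammar allows no token from this vocabulary")
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy step: after emitting `{"sentiment": ` the grammar only allows a quote
logits = {'"': 1.0, "Sure": 3.0, "{": 0.5, "The": 2.0}
probs = constrained_softmax(logits, allowed={'"'})
print(probs)  # the quote gets probability 1.0; every other token gets 0.0
```

Note that the unconstrained favorite here was "Sure": the mask does not make the model want valid JSON, it only removes every other option.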

Hands-on

from openai import OpenAI  # vLLM exposes an OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

# ① Always use the official chat template (transformers)
# prompt = tokenizer.apply_chat_template(msgs, tokenize=False,
#            add_generation_prompt=True)  # don't hand-assemble <|im_start|> tokens

# ② Constrained decoding: guarantee valid JSON mechanically, not by prompting
schema = {"type":"object",
  "properties":{"sentiment":{"enum":["pos","neg","neutral"]},
               "score":{"type":"number"}},
  "required":["sentiment","score"]}

r = client.chat.completions.create(
    model="Qwen/Qwen3-32B-AWQ",
    messages=[{"role":"user","content":"Service here was way too slow"}],
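    extra_body={"guided_json": schema},   # ← vLLM constrained decoding; the field name is version-dependent (newer releases use a `structured_outputs` field), so check your version's docs
)  # output conforms to the schema unless generation is cut off by max_tokens; no regex fallback needed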

Failure mode: (1) copying a prompt template across models and dragging the other model's special tokens along — silent degradation you can't diagnose. (2) Feeding open models the closed-source "polite long system prompt" — many open models follow over-long system instructions poorly; key constraints must be carried by few-shot and constrained decoding, not by piling on words. (3) Parsing tool use with only regex and no grammar constraint, then getting repeatedly broken by malformed JSON in production.
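Failure mode (1) can be caught with a cheap lint before a ported prompt reaches the model. This is a sketch only: the marker lists are partial and illustrative, and `foreign_markers` is a hypothetical helper. The authoritative source for each model's special tokens is its tokenizer config.

```python
# Partial, illustrative marker lists; extend them from each model's tokenizer config.
FAMILY_MARKERS = {
    "chatml": ["<|im_start|>", "<|im_end|>"],  # e.g. Qwen
    "llama3": ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "mistral": ["[INST]", "[/INST]"],
}

def foreign_markers(prompt: str, family: str) -> list[str]:
    """Return special-token markers in `prompt` that belong to another family."""
    return [marker
            for other, markers in FAMILY_MARKERS.items() if other != family
            for marker in markers if marker in prompt]

ported = "[INST] Classify the review below. [/INST]"
print(foreign_markers(ported, family="chatml"))  # ['[INST]', '[/INST]']
```

Running this over stored prompt templates in CI turns a silent degradation into a visible test failure.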

Going deeper · XGrammar (Dong et al. 2024), arXiv:2411.15100 · vLLM Structured Outputs docs, docs.vllm.ai/.../structured_outputs

// 04

MoE & Reasoning: The Two Ledgers DeepSeek Teaches

Claim: MoE lets you run "total-param" capability at "active-param" compute — but you pay VRAM by total params; before self-hosting a reasoning model, do the token math — most cases should distill, not self-host.

Background & Principle

Keep MoE's two ledgers separate. DeepSeek-V3 has 671B total params but activates only 37B per token (using MLA + auxiliary-loss-free load balancing, per the technical report). Meaning: compute/throughput is reckoned on the active 37B, which is cheap; but VRAM must hold all 671B parameters, because you can't predict which experts a token gets routed to. So MoE is "data-center friendly, single-machine unfriendly": it optimizes throughput at scale rather than letting you run a huge model on one card. Trying to run a big MoE locally is basically a category error.
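The two ledgers can be put side by side with back-of-envelope arithmetic. The sketch assumes weight memory is simply params × bits / 8 (no KV cache or overhead) and uses the common rule of thumb of about 2 FLOPs per active parameter per generated token; both helper names are illustrative.

```python
def weights_gb(params_b: float, bits: int) -> float:
    """Weight memory in GB: parameters (billions) × bits per parameter / 8."""
    return params_b * bits / 8

def forward_gflops_per_token(active_params_b: float) -> float:
    """Rule of thumb: about 2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params_b

TOTAL_B, ACTIVE_B = 671, 37  # DeepSeek-V3 figures quoted above
print(weights_gb(TOTAL_B, bits=8))         # 671.0 GB must be resident
print(weights_gb(TOTAL_B, bits=4))         # 335.5 GB even at 4-bit
print(weights_gb(ACTIVE_B, bits=4))        # 18.5 GB: the tempting but wrong estimate
print(forward_gflops_per_token(ACTIVE_B))  # 74.0 GFLOPs per token
```

The memory bill is set by the first two lines and the compute bill by the last one; confusing them is how an 18.5 GB estimate ends up attached to a model that needs hundreds of GB.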

The reasoning model's token ledger. DeepSeek-R1 (arXiv:2501.12948) shows that RL (pure RL, in its R1-Zero variant) can elicit high-quality reasoning behaviors like self-reflection and verification, but at the cost of a chain-of-thought token explosion: a simple question may emit thousands of reasoning tokens first. Self-host an R1-class model and you pay full compute for that long thinking on every call. The engineering reality: rather than self-hosting a full reasoning model, use the officially released distilled versions (R1 distilled into smaller Qwen / Llama models), which compress the reasoning ability into a size that fits your card; or control the reasoning budget at the prompt layer, turning thinking off for simple tasks (Qwen3's hybrid thinking/non-thinking is designed exactly for this).
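A rough latency ledger for that token explosion, as a sketch. The decode speed and token counts are hypothetical, and prompt prefill and queueing are ignored.

```python
def decode_seconds(thinking_tokens: int, answer_tokens: int, tokens_per_s: float) -> float:
    """Time spent generating output, ignoring prompt prefill and queueing."""
    return (thinking_tokens + answer_tokens) / tokens_per_s

# Hypothetical numbers: 40 tok/s decode, 50-token answer
print(decode_seconds(0, 50, 40.0))     # 1.25
print(decode_seconds(3000, 50, 40.0))  # 76.25
```

With these assumed numbers the same 50-token answer takes about a minute longer once 3,000 reasoning tokens precede it, and the GPU is occupied for that whole time.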

Hands-on

# MoE: compute the two ledgers separately
total_params = 671   # → decides VRAM (must hold all experts)
active_params = 37    # → decides per-token compute/throughput
# Takeaway: MoE suits data-center scale, not single-machine self-hosting of big sizes

# Reasoning: ask "is this task worth thinking about" before enabling CoT
# simple classification/extraction → thinking off, skips the reasoning tokens (often most of the output)
# complex math/debugging          → thinking on, but cap it to prevent runaway

# Build-path priority:
# distilled small model (fits) > on-demand cloud reasoning API > full self-host (expensive)
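The on/off rule above can be sketched as a per-request router. It assumes a Qwen3-style chat template that reads an `enable_thinking` flag, passed through vLLM's `chat_template_kwargs`; verify both against your model card and server version. `HEAVY_TASKS`, `request_options`, and the token limits are illustrative.

```python
HEAVY_TASKS = {"math", "debugging"}  # illustrative split, tune per workload

def request_options(task_type: str, thinking_cap: int = 4096) -> dict:
    """Per-request options: thinking off for simple tasks, capped when on."""
    thinking = task_type in HEAVY_TASKS
    return {
        # max_tokens bounds reasoning plus answer together
        "max_tokens": thinking_cap if thinking else 256,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }

print(request_options("classification"))
# usage sketch: client.chat.completions.create(model=..., messages=..., **request_options("math"))
```

Because `max_tokens` covers reasoning and answer together, a cap that is too tight can cut generation off mid-thought with no answer at all, so size it from measured traces rather than guessing.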

Failure mode: (1) seeing "37B active" and assuming DeepSeek-V3 runs on a 40G card — VRAM is reckoned on 671B, an order of magnitude off. (2) Using a reasoning model as a general model, leaving thinking maxed even on simple tasks — token cost and latency both explode. (3) Blindly self-hosting the full model chasing "all local," ignoring that a distilled version may already be good enough on your specific task at an order-of-magnitude lower cost.

Going deeper · DeepSeek-R1 (arXiv:2501.12948), arxiv.org/abs/2501.12948 · DeepSeek-V3 (arXiv:2412.19437), arxiv.org/abs/2412.19437 · Qwen3 Technical Report, github.com/QwenLM/Qwen3

// Integrated Practice · Port One Claude Task to Open Source

Pick a "short, closed, verifiable" task you already run on Claude (e.g. review sentiment classification, invoice field extraction), port it end to end, and feel all four pitfalls firsthand:

Selection + License: fixed, high-frequency, worth a specialist? Pick Qwen3 or Mistral (Apache, keeping the distillation/synthetic-data door open), not Llama.
Quantization + runtime: estimate size with §2's vram_gb(); validate locally by pulling Q4_K_M via Ollama, then move to vLLM + AWQ for a concurrency benchmark.
Porting: use the official apply_chat_template, trim Claude's system prompt, add 2-3 few-shot examples, and lock output format with guided_json constrained decoding.
Eval comparison: using Days 6 / 29 methods, run "Claude" vs "open quantized" on the same golden set, comparing accuracy, p95 latency, and unit cost.
Decision: on short, closed tasks you'll often find that the open quantized version matches accuracy, costs an order of magnitude less, and keeps data on premises. Only when all three hold is the migration truly worth it.
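For the eval comparison step, a minimal sketch of the three numbers to compute per system. The `Run` record and its field names are hypothetical; p95 uses the nearest-rank method, and cost is whatever per-request figure you can attribute (API bill or amortized GPU time).

```python
from dataclasses import dataclass

@dataclass
class Run:
    correct: bool      # did the output match the golden label
    latency_s: float   # wall-clock seconds for the request
    cost_usd: float    # attributed cost of the request

def summarize(runs: list[Run]) -> dict[str, float]:
    """Accuracy, nearest-rank p95 latency, and mean cost per request."""
    n = len(runs)
    latencies = sorted(r.latency_s for r in runs)
    p95_index = (95 * n + 99) // 100 - 1  # ceil(0.95 * n) - 1, in integer math
    return {
        "accuracy": sum(r.correct for r in runs) / n,
        "p95_latency_s": latencies[p95_index],
        "unit_cost_usd": sum(r.cost_usd for r in runs) / n,
    }
```

Run `summarize` once on the Claude results and once on the open quantized results over the same golden set, then compare the two dictionaries field by field.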

Once you've done this, your judgment on "can open replace closed" shifts from gut to data: not a faith war, but a three-axis tradeoff of task-cliff + cost + compliance on each concrete task.

// Deep Thinking

Why is the "open 88% vs closed 95% per step" error accumulation a disaster for agents but not for workflows?

Because an agent's control flow is decided by the model — each step's error reshapes the subsequent path. Over 20 independent steps, success drops from 0.95²⁰≈36% to 0.88²⁰≈8%, and once it takes a wrong branch it can't self-correct back. A workflow's control flow lives in code: each node is an independent short task, and a single failure can be isolated by code-layer retry/validation/rollback without polluting the whole. That's why "open + workflow" often beats "open + bare agent" — you swap the multiplicative structure of error accumulation for an additive one that can be cut off at each node.
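The effect of code-layer retries can be put in numbers. This sketch makes two optimistic assumptions: validation catches every failed node output, and retries succeed independently of one another. Real validators miss some errors, so treat the result as an upper bound.

```python
def node_success(p: float, tries: int) -> float:
    """Success of one node when code detects failures and retries up to `tries` times."""
    return 1 - (1 - p) ** tries

def workflow_success(p: float, nodes: int, tries: int) -> float:
    """Success of a workflow whose nodes each get `tries` validated attempts."""
    return node_success(p, tries) ** nodes

print(f"{workflow_success(0.88, 20, 1):.1%}")  # 7.8%, same as the bare chain
print(f"{workflow_success(0.88, 20, 3):.1%}")  # 96.6%
```

The per-step model quality is identical in both lines; only the ability to detect and redo a failed step changes, which is exactly what an agent on a wrong branch lacks.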

Is quantization a free lunch? With roughly the same ~16GB weight budget, which is better: a 4-bit 32B or an FP16 7B?

Not a free lunch, but 4-bit's loss is usually far smaller than the gain from more parameters. Empirically "big model, heavy quant" generally beats "small model, light quant": a 4-bit 32B clearly outperforms an FP16 7B on most tasks, because parameter count sets the capability ceiling and 4-bit only shaves marginal precision. There are exceptions — tasks sensitive to numerical precision (long-chain math, exact code) degrade more visibly at low bits, and the curve steepens below 4-bit. So default to "big model + 4-bit," but measure reasoning-heavy tasks rather than assuming.

Constrained decoding guarantees valid JSON, but can it guarantee correct content? Where's the boundary?

No, and it easily misleads. Constrained decoding only constrains form (valid JSON, complete fields, enum in range), not semantics (whether the value is truly correct). More subtly: a hard constraint forces the model to fill a valid value even when it should say "I don't know," potentially amplifying hallucination — it converts uncertainty from a visible signal (malformed format) into a hidden one (valid but wrong). So pair constrained decoding with content eval, and where needed leave an "uncertain" enum or confidence field as an escape hatch — don't pass format-validity off as correctness.
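One way to sketch the escape hatch: extend the sentiment schema with an "uncertain" value and a confidence field, then let code decide what is accepted automatically. The threshold and the `accept` helper are illustrative, and a model's self-reported confidence is not calibrated, so this complements content eval rather than replacing it.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"enum": ["pos", "neg", "neutral", "uncertain"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

def accept(result: dict, min_confidence: float = 0.7) -> bool:
    """True if the result can be used automatically; otherwise send it to review."""
    return result["sentiment"] != "uncertain" and result["confidence"] >= min_confidence

print(accept({"sentiment": "pos", "confidence": 0.92}))        # True
print(accept({"sentiment": "uncertain", "confidence": 0.92}))  # False
print(accept({"sentiment": "neg", "confidence": 0.40}))        # False
```

The "uncertain" value gives the model a valid way to decline, so uncertainty stays visible even though the format can no longer break.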

MoE's "compute by active, VRAM by total" asymmetry — what strategic choice does it imply for an individual self-hoster?

It means you basically can't reap MoE's dividend. MoE optimizes the cloud-vendor scenario: serving the full set of experts in shared VRAM to massive concurrent requests, amortizing the low active-compute into high throughput. As a single-user self-hoster you pay the VRAM cost of all experts without the throughput benefit. So the individual path should invert: pick dense small models (7B-32B dense) or distilled dense versions of MoE — controllable VRAM, single-machine friendly. Leave big MoE to "call a cloud API," and self-host only dense models that fit your card — a rational choice forced by the architectural asymmetry.

If an open model is already good enough on your specific task, why can't many teams quit closed source? Beyond capability, what are the hidden costs?

The hidden costs are mostly on the "non-model" ops surface: (1) you carry availability, autoscaling, GPU ops, security patches yourself — the closed API internalizes all of these; (2) model iteration — closed source keeps upgrading; self-hosting means chasing new releases, retesting, requantizing; (3) multimodal/tools/caching and other peripherals come bundled on a closed platform, while self-build means stitching each one. So "good enough on one task" often isn't "lower total cost of ownership." Rational conclusion: use open for narrow, stable, privacy-sensitive, high-frequency fixed tasks, and closed for broad, changing, frontier, low-frequency complex tasks — most mature teams run a hybrid stack, not an either/or.

// Further Reading

DeepSeek-V3 Technical Report — 671B/37B MoE, MLA, auxiliary-loss-free load balancing
DeepSeek-R1 — eliciting reasoning via pure RL, and distillation into small models
vLLM · PagedAttention (SOSP 2023) — the engineering bedrock of production inference throughput
AWQ (MLSys 2024) — 4-bit quantization protecting 1% salient weights
XGrammar — near-zero-overhead constrained decoding engine