Open-weight models aren't a "budget Claude" — they're a different runtime with their own engineering discipline.
By 2026, open weights are no longer toys: DeepSeek-V3 / R1, Qwen3, Llama, and Mistral approach closed-source SOTA on many tasks. But "close on a leaderboard" and "able to replace Claude in your production pipeline" are separated by a whole layer of engineering. Most people trip on three illusions: that open = cheap (it actually buys you control, privacy, and the ability to fine-tune, while saddling you with ops and quality risk); that downloading means running (you're actually stuck on VRAM math and quantization loss); and that you can paste your Claude prompt straight over (chat templates, tool use, and structured output all need redoing). This issue does not cover "what Llama is"; that's ai-ml-daily's job. We cover five engineering matters: which tasks open truly handles, License as a hidden constraint, the real ledger of quantization and VRAM, the fragility of prompt/tool porting, and the build reality of MoE and reasoning models.
Split "should we go open" into two independent questions: capability and License. On capability, open is already good enough on "short, closed, verifiable" tasks: classification, extraction, fixed-domain RAG Q&A, translation, rewriting, mid-difficulty code completion. It still has a clear cliff on "long, open, sequential-decision" tasks: multi-step agentic loops, complex tool orchestration, frontier reasoning. The cause isn't a single weak skill, it's error accumulation. If closed is 95% right per step and open is 88%, and steps fail independently, then after 20 steps the chance of a fully clean run is about 36% for closed (0.95^20) but only about 8% for open (0.88^20): a 7pp per-step gap becomes a more than fourfold gap end to end.
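The compounding can be checked directly. This is a minimal sketch that assumes every step succeeds or fails independently, which real agent loops only approximate; the function name is illustrative.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_accuracy ** steps

# Compare the two per-step accuracies from the text at several chain lengths
for steps in (1, 5, 20):
    closed = chain_success(0.95, steps)
    open_ = chain_success(0.88, steps)
    print(f"{steps:>2} steps: closed {closed:.1%}  open {open_:.1%}")
```

The same function is a quick way to decide how many steps a workflow can have before a given per-step accuracy stops being acceptable.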
The License question matters more than most assume. Llama ships under Meta's own community license, not an OSI open-source one. It has a 700M monthly-active-user cap (above which you need a separate grant from Meta), and it restricts what you can do with outputs: the Llama 2 and Llama 3 licenses forbid using Llama outputs to improve any non-Llama model, while Llama 3.1 and later allow it only if any model you distribute from it carries "Llama" at the start of its name. Either way, the "use Llama to generate synthetic data for my own model" path is blocked or encumbered, so read the license of the exact version you plan to ship. By contrast Qwen and Mistral are mostly Apache 2.0, and DeepSeek models are mostly MIT, which is far cleaner. If your plan involves distillation, synthetic data, or possibly going big, License selection belongs ahead of capability benchmarking.
# 30-second selection (pin it in your project)
# Go open iff ≥1 holds:
- Data can't leave premises / strict compliance # privacy = the hardest reason
- Task is fixed and high-frequency, worth a fine-tuned specialist
- Volume so high the closed API cost structure is untenable
- Need determinism/reproducibility/version-locking (closed swaps models silently)
# Task capability cliff (heuristic, not absolute):
fine: classification / extraction / translation / rewrite / fixed RAG / single-step code
careful: 3-8 step workflows (use workflow, not agent, as a net)
avoid: long agentic loops / complex multi-tool orchestration / frontier math reasoning
# License before capability:
distill/synthetic-data/may-scale → avoid Llama, pick Qwen(Apache) / DeepSeek(MIT)
The VRAM budget has three parts: weights (params × bytes/param) + KV cache (grows linearly with context length and concurrency) + activation/framework overhead. A 32B model in FP16 is 64GB of weights alone: it won't fit any consumer card, and even an 80GB data-center card has little room left for KV cache. Quantized to 4-bit, weights drop to about 16GB, within reach of a consumer dual-card setup or, with a short context, a single 24GB card. So quantization isn't an "optional optimization"; it's the precondition for open to be deployable at all.
Quant formats split into two lines. Weight-only low-bit: GGUF (the llama.cpp ecosystem, top pick for CPU/Mac/hybrid inference), AWQ / GPTQ (4-bit, GPU inference; AWQ's idea is that protecting just ~1% of salient weights greatly reduces error). FP8: minimal precision loss, but needs newer GPUs. The empirical sweet spot is 4-bit (e.g. GGUF's Q4_K_M): quality loss is usually negligible; below 4-bit (Q3/Q2) degradation gets steep, hitting reasoning and long outputs hardest. Runtime also splits by scenario: llama.cpp / Ollama are flexible, fast to set up, good for local and single-user; vLLM uses PagedAttention for high-throughput continuous batching — the default for production concurrency. Don't mix the two scenarios up.
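To get a feel for why fewer bits cost precision, here is a toy sketch of naive per-tensor round-to-nearest quantization on synthetic Gaussian weights. It is not how GGUF K-quants, AWQ, or GPTQ work (those use group-wise scales and smarter weight selection, and do noticeably better), and the weights and function names are made up for illustration.

```python
import random

def quantize_roundtrip(weights, bits):
    """Symmetric round-to-nearest quantization to `bits` bits, then dequantize."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 positive levels at 4-bit
    scale = max(abs(w) for w in weights) / levels     # one scale for the whole tensor
    return [round(w / scale) * scale for w in weights]

def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(original, restored)) / len(original)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic, not real model weights

errors = {bits: rms_error(weights, quantize_roundtrip(weights, bits)) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.5f}")
```

The rounding error grows quickly as bits are removed. Treat the numbers only as intuition for the trend; actual quality loss has to be measured on your own task.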
# Rough VRAM estimate: will it fit my card
def vram_gb(params_b, bits=4, ctx_k=8, batch=1):
    weights = params_b * bits / 8          # weights in GB (params in billions)
    kv = 0.5 * ctx_k * batch               # KV cache, rough GB; varies by model
    return round((weights + kv) * 1.2, 1)  # ×1.2 framework overhead

vram_gb(32, bits=4, ctx_k=8)   # → 24.0: right at the limit of a 24GB card; trim context to fit
vram_gb(70, bits=4, ctx_k=32)  # → 61.2: needs an 80GB card (A100/H100) or several smaller ones
# For production, use vLLM not Ollama for concurrency
# vllm serve Qwen/Qwen3-32B-AWQ --quantization awq --max-model-len 16384
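The flat 0.5 GB per 1K tokens in the rough estimate is a placeholder. For standard multi-head or grouped-query attention, the KV cache can be estimated from the model config instead. The sketch below uses a hypothetical config (64 layers, 8 KV heads, head dimension 128, FP16 cache); read the real values from your model's config file, and note that architectures such as MLA store the cache differently.

```python
def kv_cache_gb(layers, kv_heads, head_dim, ctx_tokens, batch=1, bytes_per_value=2):
    """KV cache size in GB: K and V tensors for every layer, token, and sequence."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token_bytes * ctx_tokens * batch / 1e9

# Hypothetical GQA config, not taken from any specific model card
single = kv_cache_gb(layers=64, kv_heads=8, head_dim=128, ctx_tokens=8192)
batch16 = kv_cache_gb(layers=64, kv_heads=8, head_dim=128, ctx_tokens=8192, batch=16)
print(f"1 sequence: {single:.2f} GB, 16 concurrent sequences: {batch16:.2f} GB")
```

The batch term is why a model that fits comfortably for one user can run out of memory under production concurrency.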
Three most-underrated pitfalls. ① Chat template: each open model has its own conversation special tokens (system/user/assistant boundaries, tool-block format). Use the wrong template — another model's format, or hand-assembled strings — and the model doesn't error; it just quietly gets dumber. Always use the official tokenizer's apply_chat_template; never hand-write it. ② Tool use / function calling: open models' tool calling is far more brittle than Claude's — JSON often drops brackets, drifts field names, or mixes in explanatory prose. Prompting "output only JSON" is nowhere near enough. ③ Few-shot weight: instructions a closed model handles zero-shot often need 2-3 examples to be stable on open — Examples > Rules (Day 9) is even more extreme here.
The engineering fix for ② is constrained decoding: at sampling time, zero out the probabilities of illegal tokens, mechanically guaranteeing valid JSON / conformance to a given grammar. vLLM has built-in guided decoding, often backed by engines like XGrammar that achieve near-zero overhead. This turns "pray the model keeps format" into "physically impossible to break format" — the same idea as the permission gate in Day 3: don't leave capability to the model's discretion, enforce it in the runtime.
from openai import OpenAI  # vLLM exposes an OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

# ① Always use the official chat template (transformers)
# prompt = tokenizer.apply_chat_template(msgs, tokenize=False,
#     add_generation_prompt=True)  # don't hand-assemble <|im_start|> tokens

# ② Constrained decoding: guarantee valid JSON mechanically, not by prompting
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"enum": ["pos", "neg", "neutral"]},
        "score": {"type": "number"},
    },
    "required": ["sentiment", "score"],
}
r = client.chat.completions.create(
    model="Qwen/Qwen3-32B-AWQ",
    messages=[{"role": "user", "content": "Service here was way too slow"}],
    # vLLM constrained decoding; some newer vLLM releases expose this under
    # structured_outputs instead, so check the docs for your version
    extra_body={"guided_json": schema},
)
# Output parses with json.loads as long as generation was not cut off by
# max_tokens (check finish_reason); no regex fallback needed
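The masking step behind constrained decoding can be sketched in a few lines. This toy version uses a made-up five-token vocabulary and hand-written logits, and it assumes the set of legal tokens is non-empty. Real engines compile the grammar into a per-step token mask over the full vocabulary, but the principle is the same: illegal tokens get probability zero before sampling.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over `logits` with every token outside `allowed` forced to probability 0."""
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    peak = max(masked.values())                       # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical next-token logits right after the model has emitted: {"sentiment":
logits = {'"pos"': 1.2, '"neg"': 2.0, '"neutral"': 0.3, "Sure": 2.5, "```": 1.0}
# The schema's enum says only these tokens are legal at this position
allowed = {'"pos"', '"neg"', '"neutral"'}

probs = masked_softmax(logits, allowed)
choice = max(probs, key=probs.get)                    # greedy pick among legal tokens
print(choice, probs["Sure"])
```

Without the mask, the highest-scoring token here would be the chatty "Sure"; with it, the model can only continue with a value the schema permits.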
Keep MoE's two ledgers separate. DeepSeek-V3 has 671B total params but activates only 37B per token (using MLA + auxiliary-loss-free load balancing, per the technical report). Meaning: compute/throughput is reckoned on the active 37B, which is cheap; but VRAM must hold all 671B params, because you can't predict which experts the next token will be routed to. So MoE is "data-center friendly, single-machine unfriendly": it buys throughput at scale, not the ability to run a huge model on one card. Trying to run a big MoE locally is basically a category error.
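A quick way to see the two ledgers side by side is to compute them. The sketch below uses the 671B/37B figures from the text, assumes 4-bit weights, and applies the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. It ignores KV cache and attention cost, so read it as an order-of-magnitude estimate.

```python
def moe_ledger(total_params_b, active_params_b, bits=8):
    """Return (weight memory in GB, approx forward GFLOPs per token) for a model."""
    weight_gb = total_params_b * bits / 8        # every expert must be resident in memory
    gflops_per_token = 2 * active_params_b       # rule of thumb: ~2 FLOPs per active param
    return weight_gb, gflops_per_token

moe_gb, moe_gflops = moe_ledger(671, 37, bits=4)
dense_gb, dense_gflops = moe_ledger(37, 37, bits=4)   # dense model with the same active size

print(f"MoE:   {moe_gb:.1f} GB weights, ~{moe_gflops} GFLOPs/token")
print(f"Dense: {dense_gb:.1f} GB weights, ~{dense_gflops} GFLOPs/token")
```

Per-token compute matches a 37B dense model, while the memory bill is that of a 671B one.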
The reasoning model's token ledger. DeepSeek-R1 (arXiv:2501.12948) shows that reinforcement learning, in the R1-Zero variant without any supervised fine-tuning, can elicit high-quality reasoning behaviors like self-reflection and verification. The cost is a chain-of-thought token explosion: a simple question may emit thousands of reasoning tokens first. Self-host an R1-class model and you pay full compute for that long thinking on every call. The engineering reality: rather than self-hosting a full reasoning model, either use the officially released distilled versions (R1 distilled into smaller Qwen / Llama models), which compress the reasoning ability into a size that fits your card, or control the reasoning budget at the prompt layer and turn thinking off for simple tasks (Qwen3's hybrid thinking/non-thinking is designed exactly for this).
# MoE: compute the two ledgers separately
total_params = 671 # → decides VRAM (must hold all experts)
active_params = 37 # → decides per-token compute/throughput
# Takeaway: MoE suits data-center scale, not single-machine self-hosting of big sizes
# Reasoning: ask "is this task worth thinking about" before enabling CoT
# simple classification/extraction → thinking off, saves most of the output tokens
# complex math/debugging → thinking on, but cap it to prevent runaway
# Build-path priority:
# distilled small model (fits) > full self-host (expensive) > on-demand cloud reasoning API
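The token ledger for reasoning translates directly into latency. This sketch assumes a fixed per-sequence decode speed of 40 tokens per second and made-up token counts (a 20-token answer, 2,000 reasoning tokens); substitute measurements from your own deployment.

```python
def decode_seconds(answer_tokens, reasoning_tokens=0, tokens_per_second=40.0):
    """Time spent decoding one response at a fixed per-sequence decode speed."""
    return (answer_tokens + reasoning_tokens) / tokens_per_second

# Hypothetical classification call: same 20-token answer, with and without thinking
thinking_off = decode_seconds(answer_tokens=20)
thinking_on = decode_seconds(answer_tokens=20, reasoning_tokens=2000)

print(f"thinking off: {thinking_off:.1f}s, thinking on: {thinking_on:.1f}s "
      f"({thinking_on / thinking_off:.0f}x)")
```

On a self-hosted GPU the extra seconds are also extra occupancy, so the same ratio applies to how many requests one card can serve.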
Pick a "short, closed, verifiable" task you already run on Claude (e.g. review sentiment classification, invoice field extraction), port it end to end, and feel all four pitfalls firsthand:
- Size the deployment with vram_gb(); validate locally by pulling Q4_K_M via Ollama, then move to vLLM + AWQ for a concurrency benchmark.
- Port the prompt through apply_chat_template, trim Claude's system prompt, add 2-3 few-shot examples, and lock output format with guided_json constrained decoding.
Once you've done this, your judgment on "can open replace closed" shifts from gut to data: not a faith war, but a three-axis tradeoff of task-cliff + cost + compliance on each concrete task.
Give the schema an "uncertain" enum value or a confidence field as an escape hatch; don't pass format-validity off as correctness.