DAY 12 / PHASE 1 · ENGINEERING

Fine-tuning vs Prompting

ROI Decision Tree · LoRA/QLoRA · Data Quality · Decoding Params

2026-05-27 · BigCat

It's 2026, and "should I fine-tune?" is still the decision most AI engineers make too early. This week: the real ROI threshold, true engineering cost of LoRA, why 50 examples beat 50K, and the counterintuitive behavior of temperature/top-p.

Prerequisite → ai-ml-daily Day 2 (Pretraining & Fine-tuning Mechanics)

// WHY THIS MATTERS

Fine-tuning in 2026 is nothing like 2023. LoRA is the default (a few lines of config in HuggingFace PEFT, Unsloth, or Axolotl), QLoRA brings models up to roughly 33B within reach of a single 24GB 4090 and 65-70B models within reach of a single 48GB card, and open-weight models (Llama 3.3, Qwen 3, DeepSeek V3) close in on Claude 4.x / GPT-5 on many tasks. Yet most teams fine-tune at the wrong moment: they train before RAG is even tuned, never dedupe their 50K noisy samples, never touch inference parameters, and only later discover that prompt + few-shot would have matched the result.

This issue assumes you know what fine-tuning is (ai-ml-daily Day 2 covers the mechanics) and skips definitions; we go straight to the 4 engineering layers that determine ROI: ① the when-to-FT decision tree → ② the real trade-offs of LoRA/QLoRA config → ③ why data quality beats data quantity by an order of magnitude → ④ counterintuitive decoding parameter behavior. The key insight: fine-tuning is not "making the model smarter"; it locks the output distribution into your subspace. Understanding this avoids 80% of common misuse.

ROI Ladder for AI Capability Acquisition (cheap → expensive)

Problem: model isn't good enough for my scenario
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│ ① Better prompt / system prompt          cost: 0        │
│   XML structure, few-shot, CoT, role     time: 1 hour   │
└─────────────────────────────────────────────────────────┘
    │ not enough
    ▼
┌─────────────────────────────────────────────────────────┐
│ ② Add RAG / tool use                     cost: $        │
│   grounding, external knowledge, tools   time: 1-3 d    │
└─────────────────────────────────────────────────────────┘
    │ still not enough
    ▼
┌─────────────────────────────────────────────────────────┐
│ ③ Use stronger model                     cost: $$       │
│   Haiku → Sonnet → Opus; mini → GPT-5    time: 5 min    │
└─────────────────────────────────────────────────────────┘
    │ already on top-tier model
    ▼
┌─────────────────────────────────────────────────────────┐
│ ④ Few-shot + structured output           cost: $        │
│   dynamic example selection, JSON schema time: 1 day    │
└─────────────────────────────────────────────────────────┘
    │ pattern is truly stable & out-of-prompt-range
    ▼
┌─────────────────────────────────────────────────────────┐
│ ⑤ Fine-tune (LoRA → full)                cost: $$$$     │
│   style/persona/format/distillation      time: 1-4 wk   │
└─────────────────────────────────────────────────────────┘

90% of "needs fine-tuning" is actually stuck at ①②③④.
⑤ truly fits only: style transfer / private data / on-device / latency hard limit

// 01

The decision framework: fine-tuning is the last step, not the first

Claim: 80% of fine-tuning is premature optimization—teams skip prompt + RAG + bigger model + few-shot and jump straight to training. The 5 cases where FT is genuinely irreplaceable: style/persona, strict format, private-domain knowledge compression, distillation to small model, hard latency constraints. Everything else should fall back upstream.

Background & Principle

Fine-tuning fundamentally locks the model's output distribution into the subspace of your samples—it does not "teach new knowledge" (factual memorization needs 100K+ samples to stabilize), nor "raise intelligence" (base capability is set in pretraining). Once you internalize this one sentence, most misuse evaporates: want the model to "answer specialized questions more accurately"? That's a retrieval problem, RAG beats FT; want "better reasoning"? That's a model-selection problem, swapping to Opus 4.7 / GPT-5 is 10x more effective than fine-tuning a Llama 3 8B; want a "specific speaking style", "strict JSON output", or "play a character"? That's the FT sweet spot.

OpenAI / Anthropic engineering docs (2024-2025) repeat the same hierarchy: prompt first, then RAG, then bigger model, then fine-tune. Anthropic's guidance boils down to preferring prompting over fine-tuning in almost every case. The reason is not that FT doesn't work, but that the marginal-return curve drops sharply: as a rule of thumb, one week of RAG tuning typically yields around +30% accuracy, while one week of FT typically yields +5-10%, and on top of that you pay for eval, data prep, version management, and model hosting.

The 5 scenarios where FT really earns its keep: (1) Style/persona—100-500 conversation samples to make a small model speak like a brand; prompting can't match; (2) Strict format—certain schemas must never break, FT generalizes better than constrained decoding; (3) Distillation—train a Llama 8B on Claude Opus outputs for that one task, cutting inference cost 50-100x; (4) Private-domain token distribution—medicine/law/internal-code sub-languages where FT lowers perplexity meaningfully; (5) Latency/privacy hard limits—must run on a local 7B, must not leave the company network. Almost everything else should drop back to prompt + RAG.

Which technique fits which problem (decision matrix)

Problem type                        Primary             Backup         FT?
───────────────────────────────────────────────────────────────────────────
Insufficient factual knowledge      RAG                 bigger model   ✗
Insufficient reasoning              bigger model        CoT            ✗
Strict format / JSON                prompt+CFG          FT             cond
Specific style / brand voice        few-shot            FT             ✓
Persona / character dialog          few-shot            FT             ✓
Ultra-low latency / local infer     distill-FT          quantize       ✓
Private domain (med/law/code)       RAG + FT            RAG            ✓
Reduce hallucination                RAG + cite          prompt         ✗
Cross-language                      bigger model        prompt         cond
Agent tool-call accuracy            prompt + few-shot                  ✗
Tone consistency / less verbosity   prompt              FT             cond
Ultra-long context (1M+)            long-ctx            RAG            ✗

→ FT is truly irreplaceable in only ~20% of engineering problems

Hands-on

A short "should-I-FT" sanity check to run before every training job (scores are on a 0-1 scale; `task` is any object exposing the attributes used below):

# pre_ft_check.py —— pre-training ROI sanity check
def should_finetune(task) -> str:
    checks = {
        "prompt_optimized":
            task.has_xml_structure and task.has_few_shot >= 3,
        "tried_bigger_model":
            task.tested_on_top_tier_model,  # Opus 4.7 / GPT-5
        "rag_attempted":
            # RAG is only a prerequisite for knowledge tasks (logical implication)
            (not task.is_knowledge_task) or task.has_rag,
        "have_eval_set":
            len(task.eval_examples) >= 50,
        "data_quality_audited":
            task.dataset_inspected_manually,
        "baseline_metric_known":
            task.prompt_baseline_score is not None,
    }
    failed = [k for k,v in checks.items() if not v]
    if failed:
        return f"❌ DO NOT TRAIN. Fix first: {failed}"

    # 5 scenarios where FT really fits
    valid_reasons = {"style_persona", "strict_format",
                     "distillation", "private_domain",
                     "latency_hard_constraint"}
    if task.motivation not in valid_reasons:
        return f"⚠️  Motivation '{task.motivation}' rarely benefits from FT. "\
               f"Re-evaluate via prompt/RAG first."

    # ROI estimate: require at least a 15-point gap between the eval target
    # and the prompt baseline before paying the FT ops cost
    expected_gain = task.eval_target - task.prompt_baseline_score
    if expected_gain < 0.15:
        return "⚠️  Expected gain < 15pp. FT ops cost likely outweighs benefit."
    return "✅ Proceed. Use LoRA r=16 baseline; full FT only if LoRA insufficient."

Failure modes: (1) "Our prompt is already long and complex, so it must be fully optimized." — Long ≠ good; cutting half of overly long prompts and adding 3 few-shots often raises accuracy; (2) "RAG is too slow, so FT." — RAG retrieval is typically 100ms, but the FT'd model may have higher inference latency itself; measure tokens, don't guess; (3) No eval set after training—you can't detect a 5% regression until users complain; (4) Scraping data from production without auditing—PII, wrong answers, and user complaints all get trained in; (5) Going straight to full fine-tune—skipping LoRA inflates engineering cost 20x for little marginal gain.

Deeper · Anthropic When to fine-tune Claude, docs.claude.com/.../prompt-engineering · OpenAI Fine-tuning best practices, platform.openai.com/docs/guides/fine-tuning

// 02

LoRA / QLoRA: the real trade-offs of rank, target_modules, alpha

Claim: LoRA's "mystery knobs" rank/alpha/target_modules aren't black magic. Rank is capacity, alpha is a learning-rate multiplier, target_modules is coverage—each has a counterintuitive sweet spot. Most tutorials' defaults (r=8, alpha=16, q_proj+v_proj only) are conservative for most real tasks.

Background & Principle

LoRA (Hu et al. 2021): freeze the base weights W, add a low-rank pair ΔW = B·A (A is r×d, B is d×r, with r ≪ d), and train only A and B. Parameter count drops from 100% to 0.1-2%, GPU memory from 80GB to 6-12GB. QLoRA (Dettmers et al. 2023) adds one more trick: store the frozen base model in 4-bit NF4 in VRAM and dequantize on the fly for each matmul. The paper reports fine-tuning a 65B model on a single 48GB GPU and a 33B model on a 24GB consumer card; a 70B model's 4-bit weights alone take about 35GB, so 70B does not fit on a single 24GB 4090. It's one of the most important engineering breakthroughs of 2023.
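To make the "0.1-2%" figure concrete, here is a minimal sketch of the parameter arithmetic for a single adapted weight matrix. The hidden size of 4096 is an assumption chosen for illustration (typical of 7B-8B class models), not a number from this issue, and the helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


def lora_fraction(d_in: int, d_out: int, r: int) -> float:
    """LoRA parameters as a fraction of the frozen d_out x d_in weight matrix."""
    return lora_param_count(d_in, d_out, r) / (d_in * d_out)


if __name__ == "__main__":
    d = 4096  # assumed hidden size, for illustration only
    for r in (8, 16, 64):
        n = lora_param_count(d, d, r)
        print(f"r={r:>3}: {n:,} trainable params, {lora_fraction(d, d, r):.2%} of W")
```

The fraction grows linearly with r, which is why rank is best read as a capacity knob that trades trainable parameters against memory.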

The real behavior of the three knobs:

rank (r): intuitively "capacity," but not monotonically better. Sebastian Raschka's LoRA ablations show diminishing returns from r=8 to r=64 for most tasks, and very high ranks (r>128) can start to overfit on small datasets. For style/persona, r=8-16 is enough; for capability boost / distillation, r=32-64; for domain adaptation + lots of data, consider r=128+. Default r=16 is a stable baseline.
alpha: often mistaken for "LoRA capacity," but it acts as a learning-rate multiplier: the update is scaled by (alpha/r)·B·A. HuggingFace PEFT's LoraConfig defaults to r=8, lora_alpha=8 (scaling factor 1), while many tutorials use alpha = 2·r as a rule of thumb. This issue's baseline is alpha = r; treat alpha as a hyperparameter to sweep together with the learning rate, since published ablations (Raschka's included) do not all agree on whether r or 2r works better. Pushing alpha well above 2r often lets the LoRA delta dominate the forward pass and degenerates into "noisy training."
target_modules: tutorials default to only q_proj + v_proj (the setting from the original LoRA paper), but the QLoRA paper itself (Dettmers et al. 2023) and later ablations show that adapting all linear layers (q/k/v/o + gate/up/down) is what closes the gap to full fine-tuning, typically worth +2-5% accuracy at small memory cost. q+v-only is a fallback for VRAM-constrained settings, not a baseline.
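The alpha/r scaling described above can be sketched in a few lines of plain Python. The tiny matrices are made-up illustration values, and `matmul` / `lora_delta` are hypothetical helper names, not part of PEFT or Unsloth.

```python
def matmul(X, Y):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(B, A, alpha: float, r: int):
    """Effective weight update: (alpha / r) * B @ A."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


r = 2
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # r x d_in
B = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]  # d_out x r

delta_alpha_r = lora_delta(B, A, alpha=r, r=r)       # scaling factor 1.0
delta_alpha_2r = lora_delta(B, A, alpha=2 * r, r=r)  # scaling factor 2.0

# Doubling alpha doubles every entry of the update; the trained A and B are unchanged.
for row_r, row_2r in zip(delta_alpha_r, delta_alpha_2r):
    print(row_r, "->", row_2r)
```

Because alpha only rescales the same B·A product, raising it has an effect similar to raising the learning rate on the adapter, which is why the two are best swept together.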

One easily missed QLoRA detail: the 4-bit quantization applies only to how the frozen base weights are stored. For every matmul, forward and backward, they are dequantized to bf16 on the fly, and gradients flow only into the bf16 LoRA matrices, so in practice loss curves track plain LoRA closely. At deploy time the guidance here is to merge the adapter and ship bf16 or 8-bit GPTQ/AWQ rather than serving the 4-bit training state directly, and to re-run the eval after merging, because any precision change between training and serving shifts the weights the adapter was tuned against.
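A back-of-envelope estimate of weight memory at different precisions helps when choosing a card. This sketch counts weights only; it ignores quantization constants, activations, the KV cache, and optimizer state, so real usage is higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8.0


for name, params in (("8B", 8), ("70B", 70)):
    for label, bits in (("bf16", 16), ("int8", 8), ("nf4", 4)):
        print(f"{name} @ {label}: {weight_memory_gb(params, bits):.1f} GB")
```

Compare the printed numbers against the VRAM of the target GPU and leave headroom for activations and adapter state before committing to a model size.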

LoRA sweet spots by task type

Task                          rank     alpha   target_modules      data size
─────────────────────────────────────────────────────────────────────────────
Style / persona               8-16     r       q,v or all_linear   100-1000
Format hardening              16       r       all_linear          500-5000
Distillation (Opus→8B)        32-64    r       all_linear          10K-100K
Domain adapt (med/law/code)   64-128   r       all_linear          10K-1M
Instruction tuning            32       r       all_linear          5K-50K

Counter-example (common bad default): r=8, alpha=16, target=[q_proj,v_proj], data=unaudited
→ Underfit on most tasks + 60% training time wasted on wrong layers

Hands-on

Unsloth (the highest-performance open LoRA/QLoRA stack in 2026, 2-5x faster than HF PEFT and 60% less VRAM) — minimum working config:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# —— Load 4-bit quantized base ——
# A 70B needs ~35GB for its 4-bit weights alone: use a 48GB+ GPU,
# or swap in an 8B checkpoint to train on a 24GB card.
model, tok = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length = 4096,
    load_in_4bit = True,           # QLoRA: 4-bit NF4 quantization
)

# —— LoRA adapter config: stable baseline ——
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                          # 32-64 for distill/capability tasks
    lora_alpha = 32,                 # alpha = r, not 2r
    target_modules = [               # all_linear, not just q,v
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout = 0.05,              # 0.05-0.1 for small data, 0 for large
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,         # see #03, quality > quantity
    tokenizer = tok,
    max_seq_length = 4096,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,   # effective batch=16
        warmup_steps = 10,
        num_train_epochs = 2,              # 1-3 epochs for most LoRA tasks
        learning_rate = 2e-4,              # 10x higher than full FT
        bf16 = True,
        logging_steps = 5,
        optim = "adamw_8bit",            # 40% more VRAM savings
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
    ),
)
trainer.train()

# —— Deploy: merge first, then ship bf16 / 8-bit, NOT 4-bit + adapter ——
model.save_pretrained_merged("./merged", tok, save_method = "merged_16bit")

Failure modes: (1) Treating alpha = 2r as a safe default: it is a common tutorial rule of thumb (PEFT's own LoraConfig default is r=8, lora_alpha=8), and at r ≥ 32 a large alpha can make the LoRA delta too big; the classic symptom is training loss still dropping after 1 epoch while eval collapses; start from alpha = r and sweep; (2) Only training q_proj, v_proj and concluding "LoRA doesn't work": the MLP layers (gate/up/down) hold 60%+ of model parameters; skipping them gives up most of the capacity; (3) Reusing the full-FT learning rate of 2e-5: LoRA typically needs 1e-4 to 5e-4 or it barely learns; (4) Training 4-bit, then deploying the 4-bit training state as-is: merge and deploy bf16 or 8-bit, then re-run eval, since a precision mismatch between training and serving can cost a few points; (5) QLoRA OOM at large batch sizes: the VRAM bottleneck is activations, not weights; use gradient checkpointing + small micro-batch + large accumulation.

Deeper · Hu et al. LoRA: Low-Rank Adaptation of LLMs (ICLR 2022), arxiv.org/abs/2106.09685 · Dettmers et al. QLoRA (NeurIPS 2023), arxiv.org/abs/2305.14314 · Raschka Practical Tips for Finetuning LLMs Using LoRA, magazine.sebastianraschka.com · Unsloth docs, docs.unsloth.ai

// 03

The dataset: 50 high-quality samples beat 50,000 noisy ones

Claim: the highest leverage in fine-tuning is not algorithm, not rank, not model—it's the dataset. LIMA / Zephyr / Tülu repeatedly show that 1,000 hand-picked samples beat 100,000 scraped ones; distillation data is cheaper and better than human labels. The dataset is 90% of fine-tuning, configuration is the other 10%.

Background & Principle

LIMA (Meta 2023) is the canonical demonstration: 1,000 hand-curated dialog samples were used to fine-tune Llama 65B, and in human evaluation its responses were equivalent or preferred to GPT-4's in 43% of cases, to Bard's in 58%, and to DaVinci003's in 65%. Compared to the prevailing approach (FLAN-style collections with millions of instructions), LIMA's data was roughly 1,000x smaller and still competitive. The reason: during instruction tuning, fine-tuning isn't teaching knowledge; it's unlocking already-present capability and locking the response format. The quality of your samples determines what quality of subspace you lock into. Feed noise = lock into noise.

2024 carried this line forward: Zephyr-7B used distillation (GPT-4 outputs as teacher) + DPO preference tuning to beat human-labeled models of the same size; Tülu 3 (Allen AI 2024) systematically ablated data filtering > data scale—deduping, removing hallucinated labels, sampling by difficulty all beat raw volume by a wide margin.

Three iron rules of data engineering:

(1) Always dedupe: MinHash + LSH or SemDeDup to remove semantic duplicates. Real production data follows a long-tail distribution—the top-10 FAQs typically account for 60% of samples. No dedup = training the same 10 answers 10,000 times each.
(2) Manual sample audit is non-optional: randomly inspect 50 samples by hand; if >5% are wrong, contain PII, or break formatting, throw the whole batch out. Most teams skip this step and then spend 2 weeks debugging after training, only to discover it was a data problem.
(3) Distillation is the de facto standard post-2024: use Claude Opus / GPT-5 as the teacher, generate 1-3 candidates per user query and pick the best with a ranker. 50-100x cheaper than human labeling, often higher quality (humans get tired, models don't). This is the shared secret behind Phi / Zephyr / Orca / Qwen-Distill.
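Rule (1) mentions MinHash; the following is a minimal sketch of the idea using only the standard library. It compares signatures pairwise instead of using LSH buckets, so it suits a few thousand samples rather than millions. The shingle size, number of hash functions, and 0.8 threshold are assumptions to tune, and all function names are hypothetical.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles; falls back to the whole text if it is short."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list:
    """One minimum hash value per seeded hash function over the text's shingles."""
    sh = shingles(text)
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in sh
        ))
    return sig


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe_minhash(texts: list, threshold: float = 0.8) -> list:
    """Greedy near-duplicate removal: keep a text only if no kept text is too similar."""
    kept, kept_sigs = [], []
    for t in texts:
        sig = minhash_signature(t)
        if all(estimated_jaccard(sig, ks) < threshold for ks in kept_sigs):
            kept.append(t)
            kept_sigs.append(sig)
    return kept
```

MinHash catches lexical near-duplicates (case changes, small edits); paraphrases with different wording need an embedding-based method such as SemDeDup.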

A counterintuitive consequence: the smaller your dataset, the higher the quality bar. Every LIMA sample was hand-vetted by the authors; Tülu 3 keeps noise rates <1%. If your 500 samples have a 5% error rate, that's 25 strong counterexamples directly poisoning training—relatively 100x more impactful than 25 counterexamples among 50,000. So small + strict is harder, not easier.

[Figure: Quality vs quantity curve (typical). Y-axis: eval accuracy (65% to 85%). X-axis: data size (100, 1K, 10K, 100K). Curves: "1K hand-picked (LIMA-style)", which climbs early and plateaus highest; "50K mixed (scraped)", which climbs later and plateaus lower.]
→ Same 80% accuracy: 1K curated ≈ 50K mixed (50x cost diff)
→ 100K mixed often loses to 1K curated (overfits to noise)

Hands-on

Minimum working pipeline for distillation + dedup + quality audit (Anthropic + sklearn):

import anthropic, json, hashlib, random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

client = anthropic.Anthropic()

# —— Step 1: distill high-quality targets with Claude Opus 4.7 ——
def distill(user_query: str) -> str:
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=800,
        system="You are answering as the production assistant. Be precise, "
               "non-redundant. Refuse politely if uncertain. Output only the reply.",
        messages=[{"role":"user","content":user_query}],
    )
    return msg.content[0].text

# —— Step 2: near-duplicate removal ——
# TF-IDF cosine is a cheap lexical stand-in for SemDeDup's embedding similarity
def dedupe(samples, threshold=0.92):
    texts = [s["prompt"] + " " + s["completion"] for s in samples]
    vec = TfidfVectorizer(max_features=20000).fit_transform(texts)
    sim = cosine_similarity(vec)
    keep, seen = [], set()
    for i in range(len(samples)):
        if i in seen: continue
        keep.append(samples[i])
        for j in range(i+1, len(samples)):
            if sim[i,j] > threshold: seen.add(j)
    return keep

# —— Step 3: Claude-as-judge quality audit (auto-reject low quality) ——
def audit(sample) -> bool:
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=10,
        messages=[{"role":"user", "content":
          f"Rate this Q&A pair on factual correctness, format integrity, "
          f"and helpfulness. Output ONLY one of: GOOD / BAD.\n\n"
          f"Q: {sample['prompt']}\nA: {sample['completion']}"}]
    )
    return "GOOD" in msg.content[0].text.upper()

# —— Step 4: human sample review (last gate that can't be automated) ——
def human_sample_review(dataset, n=50):
    """Dump 50 random samples to JSON. If <95% pass, abort training."""
    sample = random.sample(dataset, min(n, len(dataset)))
    with open("audit_sample.json", "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2, ensure_ascii=False)
    print("Open audit_sample.json. Pass rate < 95% → abort training.")

# —— Full pipeline ——
queries = load_real_user_queries(n=2000)              # placeholder: supply your own loader of real production queries
raw = [{"prompt":q, "completion":distill(q)} for q in queries]
deduped = dedupe(raw)                                  # ~1200 samples
clean = [s for s in deduped if audit(s)]          # ~900 samples
human_sample_review(clean)                             # you eyeball 50
# Final ~900 samples train better than a 50K scraped set

Failure modes: (1) Skipping dedup—top-10 FAQs become 60% of samples; the model learns only those 10; (2) Including the model's own bad answers—many teams scrape from "support logs," accidentally training the cases where users said "this answer is wrong"; filter by user-feedback / thumbs-up; (3) Distilling with an unstable teacher model—using Claude Sonnet one week, GPT-5 another week creates style drift in the dataset and the student can't lock onto a stable voice; (4) Synthetic data with no ground-truth check—LLM-distilled answers are 5-10% wrong, training them as labels teaches the model to fabricate; at minimum use an independent LLM judge; (5) Ignoring negative samples—training only correct answers means the model never learns to refuse; include 5-15% "I don't know / refuse" samples (the I-don't-know data).

Deeper · Zhou et al. LIMA: Less Is More for Alignment (NeurIPS 2023), arxiv.org/abs/2305.11206 · Lambert et al. Tülu 3 (2024), arxiv.org/abs/2411.15124 · Abbas et al. SemDeDup (2023), arxiv.org/abs/2303.09540 · Tunstall et al. Zephyr: Direct Distillation (2023), arxiv.org/abs/2310.16944

// 04

Decoding params: the counterintuitive behavior of temperature, top-p, min-p

Claim: when a fresh FT model feels "unstable," "occasionally hallucinates," or "rambles"—80% of the time it's unset decoding parameters, not the model. The interactions between temperature/top-p/min-p/repetition_penalty are counterintuitive; most API defaults (temp=1, top_p=1) are tuned for chat feel, not engineering stability, and production must override them.

Background & Principle

Decoding is the process of turning the model's logit distribution (over the vocab) into specific tokens. The LLM internally knows "the next token is 90% A, 5% B, 5% other"—it's the decoding strategy that decides what you actually get. The precise behavior of the three core parameters:

temperature: divide logits by a constant T before softmax. p_i = softmax(logits / T). T → 0 collapses to argmax (deterministic greedy); T = 1 is no rescaling; T > 1 flattens the distribution (more random). Counterintuitive: T does not control "creativity"—it only amplifies or compresses the logit gaps. Code / JSON / tool-call: T=0 or 0.2; Style / chat: T=0.7-0.9; Brainstorm: T=1.0+.
top-p (nucleus sampling): sort tokens by probability and keep the smallest set whose cumulative probability reaches p, then renormalize. top_p=0.9 means "cut off the lowest-probability 10% of the mass." Counterintuitive: top_p and temperature are not independent. Raising T fattens the tail, so top_p admits more noisy tokens. High T + high top_p is a runaway recipe. Production rule: either low T with default top_p (T=0.2, top_p=1), or default T with tight top_p (T=1, top_p=0.9), but not both open.
min-p (implemented in llama.cpp and vLLM; paper: Nguyen et al. 2024): keep tokens with probability ≥ min_p × p_top, i.e., prune by ratio to the top probability. More robust than top-p: when one token dominates but the remaining mass is smeared over a long tail, top-p has to reach into that tail to hit its cumulative target, whereas min-p drops every token below a fixed fraction of the top probability. Production recommendation: min_p = 0.05-0.1 instead of top_p.
repetition_penalty / presence_penalty / frequency_penalty: prevent stuck words / repeated paragraphs. Most FT'd small models get stuck on long generations—setting repetition_penalty to 1.05-1.1 fixes it immediately. But > 1.2 makes output weird (forcibly avoiding common tokens breaks grammar).
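The three truncation rules above can be sketched in plain Python as follows. This illustrates the standard definitions and is not the implementation used by vLLM or llama.cpp; the function names and the example logits are assumptions.

```python
import math


def renormalize(probs, kept):
    """Zero out dropped tokens and rescale the kept ones to sum to 1."""
    keep = set(kept)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def softmax_with_temperature(logits, temperature=1.0):
    """p_i = softmax(logits / T); T <= 0 is treated as greedy argmax."""
    if temperature <= 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return renormalize(probs, kept)


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    cutoff = min_p * max(probs)
    return renormalize(probs, [i for i, p in enumerate(probs) if p >= cutoff])


if __name__ == "__main__":
    logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # made-up example values
    for T in (0.2, 1.0, 1.5):
        probs = softmax_with_temperature(logits, T)
        print(T, [round(p, 3) for p in min_p_filter(probs, 0.05)])
```

Running the loop at several temperatures shows how the set of surviving tokens changes as the distribution flattens, which is the interaction the production rule guards against.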

Decoding params matter more after FT than for base models, because FT tightens the distribution. With a base model at T=1 the top-10 tokens might be [20%, 15%, 10%, ...]; after FT they could be [85%, 5%, 3%, ...]. In that regime top_p=0.9 keeps only the top one or two tokens (near-greedy), while at T=1 with no truncation the 5% token is still sampled about once every 20 steps, and a single off-pattern token can derail a strict format. T=0.7 shrinks that tail but does not remove it. After FT, the stable mode is lower temperature, min-p, and a light repetition_penalty.
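A small sketch of why a tight post-FT distribution still needs decoding control: even a modest per-step chance of leaving the top token compounds over a long generation. The constant per-step probability is a simplifying assumption, the example distributions are made up, and the function names are hypothetical.

```python
def prob_any_off_top(p_top: float, n_steps: int) -> float:
    """Chance that at least one of n_steps samples is not the top token,
    assuming the same top-token probability at every step (a simplification)."""
    return 1.0 - p_top ** n_steps


def tokens_kept_by_top_p(probs, top_p: float) -> int:
    """Size of the smallest high-probability prefix whose mass reaches top_p."""
    cum = 0.0
    for k, p in enumerate(sorted(probs, reverse=True), start=1):
        cum += p
        if cum >= top_p:
            return k
    return len(probs)


if __name__ == "__main__":
    # Illustrative distributions only, loosely shaped like the base vs post-FT case.
    base = [0.20, 0.15, 0.10] + [0.055] * 10
    tuned = [0.85, 0.06, 0.03] + [0.01] * 6
    print("tokens kept at top_p=0.9:",
          tokens_kept_by_top_p(base, 0.9), "vs", tokens_kept_by_top_p(tuned, 0.9))
    for n in (20, 100, 500):
        print(f"{n} steps at p_top=0.85: {prob_any_off_top(0.85, n):.3f}")
```

For strict formats, the takeaway is that per-token probabilities that look safe in isolation are not safe over hundreds of tokens unless sampling is truncated or made greedy.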

Decoding param guidance (post-FT models)

Scenario               temp   top_p / min_p     rep_penalty
─────────────────────────────────────────────────────────────
Code / JSON tool       0.0    top_p=1 (no-op)   1.0
Structured extract     0.0    top_p=1           1.0
RAG serious QA         0.2    min_p=0.1         1.05
Chat / persona         0.7    min_p=0.05        1.05-1.1
Creative / writing     0.9    min_p=0.05        1.1
Brainstorm / diverse   1.0    min_p=0.02        1.0

Counter-example (common prod default): temp=1.0, top_p=1.0, rep_penalty=1.0
→ effectively no control; small FT models drift, repeat, occasionally hallucinate

Hands-on

Stable decoding configs for a post-FT small model (vLLM / OpenAI-compatible API):

from openai import OpenAI

# vLLM serve endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

# —— Production: strict JSON output ——
resp = client.chat.completions.create(
    model = "merged-llama-3.3-8b-ft",
    messages = msgs,
    temperature = 0.0,            # fully deterministic, no randomness
    top_p = 1.0,                  # T=0 makes top_p moot; 1.0 for clarity
    max_tokens = 512,
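    # my_schema is defined elsewhere; the OpenAI-style shape is
    # {"name": "...", "schema": {...JSON Schema...}}
    response_format = {"type": "json_schema",
                       "json_schema": my_schema},  # structural constraint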
    extra_body = {                   # vLLM extensions
        "repetition_penalty": 1.0,
        "min_p": 0.0,
    },
)

# —— Production: persona / chat ——
resp = client.chat.completions.create(
    model = "merged-llama-3.3-8b-ft",
    messages = msgs,
    temperature = 0.7,
    top_p = 1.0,                  # disable top_p, let min_p take over
    max_tokens = 1024,
    extra_body = {
        "min_p": 0.05,              # more robust than top_p
        "repetition_penalty": 1.05, # anti-stuck, don't exceed 1.1
    },
)

# —— Monitoring: pull logprobs to locate where the model wobbled ——
resp = client.chat.completions.create(
    model = "merged-llama-3.3-8b-ft",
    messages = msgs,
    temperature = 0.2,
    logprobs = True,
    top_logprobs = 5,             # top-5 candidates + probs per step
    max_tokens = 200,
)
import math

for token in resp.choices[0].logprobs.content:
    # sampled token's probability below ~0.5: the model was "hesitating"
    # at this position, which makes it a candidate hallucination hotspot
    if token.logprob < -0.7:   # exp(-0.7) ≈ 0.5
        print(f"⚠️  uncertain at: {token.token}",
              [(t.token, f"{math.exp(t.logprob):.2f}")
               for t in token.top_logprobs])

Failure modes: (1) temperature=1 with top_p=1—the OpenAI/Anthropic SDK default, but practically synonymous with "unstable" for post-FT models; production must override at least one; (2) Using temperature to dial creativity—T is not a creativity knob, it's a sharpness knob; for real creativity, change the prompt (style direction, seed words), not just T; (3) repetition_penalty > 1.2—forcibly avoiding recently-used tokens breaks grammar on long outputs; 1.05-1.1 is the sweet spot; (4) Non-zero temperature for code/JSON—structured output relies on T=0 + grammar/JSON schema; enabling temperature is asking for bugs; (5) Unfixed seed—running A/B eval without a fixed seed means two identical inputs give two different outputs; vLLM/OpenAI both support seed, fix it.

Deeper · Holtzman et al. The Curious Case of Neural Text Degeneration (nucleus / top-p), arxiv.org/abs/1904.09751 · Nguyen et al. Min-p Sampling, arxiv.org/abs/2407.01082 · vLLM Sampling parameters, docs.vllm.ai/.../openai_compatible_server

// Combined playbook · A two-week prompt-to-FT decision path

Suppose you have a real production problem and the model isn't good enough. Two weeks, in ROI order, to decide whether to fine-tune:

Day 1 · Build the eval set (half-day, most critical step): write 50-100 cases with ground-truth answers. No eval = stop. Every downstream change is judged by eval numbers.
Day 1-2 · Max out prompting: XML-structured system prompt + 3-5 few-shots + CoT (if reasoning). Run eval, log baseline.
Day 3 · Try a stronger model: Sonnet → Opus; GPT-5-mini → GPT-5. Often +10-20pp accuracy, faster than a week of FT. If a model swap is enough, stop here.
Day 4-5 · Add RAG (if it's a knowledge task): install the hybrid + reranker stack from Day 10. Re-run eval.
Day 6 · Tune decoding params: temperature 1 → 0.2-0.7, add min_p=0.05, rep_penalty=1.05. Zero cost, frequently skipped.
Day 7 · Decision point: how far is current eval from target? Gap < 10pp → keep iterating prompt/RAG; gap > 15pp AND falls into the 5 valid FT scenarios → proceed to FT.
Day 8-10 · Dataset: distill 800-2,000 samples from Claude Opus 4.7 → dedupe → judge audit → manual review of 50. Takes longer than the actual training.
Day 11-12 · Training: Unsloth + QLoRA + r=32, alpha=32, all_linear, 2 epochs. Llama 3.1 8B (Llama 3.3 ships only as 70B) on a 4090: 4-8 hours per epoch.
Day 13 · Merge + deploy: merge LoRA back to bf16, serve with vLLM. Re-run the full eval and compare to baseline.
Day 14 · Long-tail observation: shadow-run on production traffic for 1 day to catch any regression. Only then route real traffic.
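The Day 7 decision point can be written down as a small function so the rule is applied the same way every time. This is a sketch under assumptions: scores are on a 0-1 scale, the motivation labels mirror the five valid FT scenarios from section 01, and the handling of the 10-15pp band and of invalid motivations is a guess, since the playbook does not specify them.

```python
VALID_FT_REASONS = {
    "style_persona", "strict_format", "distillation",
    "private_domain", "latency_hard_constraint",
}


def day7_decision(current: float, target: float, motivation: str) -> str:
    """Map the eval gap (in percentage points) and FT motivation to a next step."""
    gap_pp = round((target - current) * 100, 6)
    if gap_pp <= 0:
        return "target_met"
    if gap_pp < 10:
        return "iterate_prompt_rag"
    if gap_pp > 15:
        # A large gap alone is not enough; the motivation must be an FT sweet spot.
        return "fine_tune" if motivation in VALID_FT_REASONS else "revisit_upstream"
    return "judgment_call"  # 10-15pp band: not specified by the playbook


if __name__ == "__main__":
    print(day7_decision(0.70, 0.90, "style_persona"))
    print(day7_decision(0.85, 0.90, "style_persona"))
```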

Walking this path: FT is not "let's train first and see"—it's "I've maxed the upstream 4 layers, confirmed the gain curve is still steep, and now I run the shortest, cheapest LoRA cycle that gets me there." That's fine-tuning engineering for 2026. Most teams realize by Day 6 that they don't need FT at all.

// Further Reading

Hu et al. · LoRA: Low-Rank Adaptation of LLMs (ICLR 2022) — the original LoRA paper, start of the PEFT era
Dettmers et al. · QLoRA (NeurIPS 2023) — 4-bit base + LoRA; key paper enabling single-GPU 70B training
Zhou et al. · LIMA (NeurIPS 2023) — the canonical "1K curated > 100K scraped" experiment
Lambert et al. · Tülu 3 (2024) — systematic ablation: data filtering beats data scale
Raschka · Practical Tips for Finetuning LLMs Using LoRA — systematic ablation of rank/alpha/target
Unsloth Documentation — the highest-performance open LoRA/QLoRA stack in 2026
Nguyen et al. · Min-p Sampling (2024) — the robust replacement for top-p
Anthropic · Prompt Engineering vs Fine-tuning — Claude's official prompt-first stance

// Deep Thinking

If prompt + RAG + bigger model handle 80% of engineering problems, why is the open-source community still releasing fine-tuned models at full speed? What's the real driver?

Three forces. (1) Ownership / control: FT'd weights live on your hardware, with no API rate limits, no price hikes, no deprecations; for enterprise users this is a real need that prompt+RAG can't meet. (2) Distillation economics: compressing Claude Opus intelligence into Llama 8B on your business task drops unit cost from $15/M tokens to $0.2/M, a 75x reduction; once scale arrives, FT becomes mandatory. (3) Research value: FT is the open-source community's flagship way to demonstrate "we're keeping up"; model cards are themselves brand. DeepSeek, Qwen, and Mistral's influence comes largely from continuous FT releases. These three drivers haven't weakened in 2026, so "prompt-first" doesn't contradict "FT is still a key capability"; it just means who does the FT has shifted: from every team, to a small number of well-resourced teams + the open community, while everyone else uses their releases + prompting.
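The distillation economics in point (2) reduce to two numbers: the per-token price ratio and the volume at which the savings repay the one-off fine-tuning cost. A minimal sketch follows; the $5,000 fixed cost is a hypothetical placeholder, not a figure from this issue, and the function names are assumptions.

```python
def cost_ratio(teacher_per_m: float, student_per_m: float) -> float:
    """How many times cheaper the student is per million tokens."""
    return teacher_per_m / student_per_m


def breakeven_million_tokens(fixed_ft_cost: float,
                             teacher_per_m: float,
                             student_per_m: float) -> float:
    """Millions of tokens needed before per-token savings cover the fixed FT cost."""
    saving = teacher_per_m - student_per_m
    if saving <= 0:
        raise ValueError("student must be cheaper than teacher")
    return fixed_ft_cost / saving


if __name__ == "__main__":
    teacher, student = 15.0, 0.2   # $/M tokens, as quoted in the text
    fixed_cost = 5_000.0           # hypothetical: data prep + training + eval
    print(f"ratio: {cost_ratio(teacher, student):.0f}x")
    print(f"break-even: {breakeven_million_tokens(fixed_cost, teacher, student):.0f}M tokens")
```

Below the break-even volume the API is cheaper overall; above it, the distilled model wins, which is the sense in which scale makes FT mandatory.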

LIMA's "1,000 samples beat 100,000" — does it still hold in the GPT-5 / Claude 4.7 era, or only when base models were weaker?

It holds even more strongly. LIMA's core insight is "instruction tuning is unlock, not teach": the stronger the base model, the more there is to unlock, and the bigger the leverage of small high-quality samples. Zephyr, Phi-3, and Qwen-Distill all follow the curated-or-distilled-data playbook, and Tülu 3 (2024) put heavy weight on curating and decontaminating its training mix rather than on raw volume. The counterargument is that Llama 3.1/3.3/Qwen 3 use millions of samples for their instruct variants, but those aim to cover the entire long tail of tasks, which isn't a fair comparison to single-business FT. Bottom line: FT'ing your business model on 500-5,000 hand-curated samples + distillation is the 2026 best practice, and the stronger the base, the better this strategy looks.

QLoRA lets 70B train on a single GPU—but inference still needs multiple GPUs. What does this "trainable but unservable" asymmetry mean for indie developers?

This asymmetry is the most underappreciated developer-economics inflection point of 2024-2026. Train: QLoRA lets you fine-tune 70B on a single 48GB card (a 24GB 4090 or A10 tops out around 33B) in 6-12 hours per epoch, as a one-off rental or fully self-hosted. Serve: 70B inference needs roughly 140GB in bf16 and still about 40GB at 4-bit, so it ties up a 48-80GB datacenter card, billed hourly, for as long as the service runs. So the indie path is actually: train big, distill to small, deploy small. Concretely: QLoRA-train a 70B on your domain, then use this 70B as a teacher to distill a 7B/8B student; the 8B runs on a 16GB card or M-series Mac, and inference cost collapses. This "train big, serve small" route decouples training from inference compute, letting a single developer own a full customized model + deployable stack. Over the next three years this will be one of the most important moats an individual AI engineer can build.

Why haven't decoding params (temperature, top-p, min-p) converged on an "optimal default" the way LoRA did? Why do production teams still hand-tune?

Because optimal decoding depends on the task, not the model. LoRA's optimal rank/alpha is largely a function of architecture, so a community consensus default can emerge. Decoding isn't like that—the same FT'd Llama 8B wants T=0 for code, T=0.7 for chat, T=1 for brainstorm. The task space is infinite and parameter interactions are non-convex (T × top_p has nonlinear joint effects), so no single default covers everything. Model providers' API defaults (T=1, top_p=1) are chat-feel "least-wrong" options, but they're almost never optimal for production. Medium-term we'll see auto-tune decoding: pick params based on prompt type (similar to GPT-5's "reasoning effort" auto-dialing by problem hardness), but in 2026 it's still manual engineering. Knowing how to tune decoding is a basic skill for an LLM engineer—on par with "can write SQL indexes."

If by 2027 open-source 7B models match Claude 4.x on most business tasks, how will the "should I FT" decision tree change?

The decision tree flips. Today: prompt → RAG → bigger model → FT, because "bigger model" has the highest ROI. When a 7B open model is already near top-tier API, "bigger model" stops yielding gain—the only remaining levers are prompt and FT. At that point FT's relative ROI rises sharply because: (1) the base model is already strong enough that FT doesn't need to add base capability, only alignment/style/format lock—small data suffices; (2) marginal cost of private deployment approaches zero (open + quantized + edge GPU), removing the inference cost constraint; (3) data ownership becomes the core competitive moat—your FT dataset is something competitors can't replicate. Prediction: by 2027, "should I FT" will flip from "99% no" to "business moats require FT"—but still small high-quality data + LoRA, not a return to 2023's full-FT era. FT is always engineering, always ROI-driven—it's just that the ROI function itself is shifting.