AI/ML Explained: Model Compression

Day 31 · 2026-06-17 · Difficulty ★★★☆☆
For: engineers with coding experience, non-AI background
Engineering counterpart → super-individual D12: Fine-tuning (LoRA/QLoRA in practice and its trade-offs)
Through-line

A 70B-parameter model in FP16 needs 140 GB of VRAM for the weights alone, more than any consumer GPU can hold. Model compression answers one question from four angles: can we use a smaller model, fewer weights, a smaller change, or lower precision to fit this pile of parameters onto hardware you can afford? The four techniques are largely orthogonal and stackable: distillation (swap to a smaller model), pruning (delete weights), LoRA (train only the delta), quantization (lower precision). The focus here is not engineering tuning but the mechanism and math of why each one works.

Knowledge Distillation

transfer · teacher-student
One-line analogy

Distillation is like a senior engineer mentoring a junior in code review. A poor mentor just says "pick A" (hard label); a good one says "A is ~80% right, B is plausible with ~10%, C basically never" (soft label). The latter carries far more information—the junior learns not just the answer but the similarity structure between classes. Distillation makes a small model (student) imitate the full probability distribution output by a large model (teacher), not just the final answer.

Problem it solves + mechanism

Big models are accurate but expensive; small ones are fast but dumb. Can a small model "inherit" the big one's judgment? The key insight (Hinton 2015): in a big model's softmax output, the tiny probabilities on the wrong classes (cat=0.9, dog=0.08, car=0.0001) hold enormous "dark knowledge"—they tell you "cats resemble dogs, and not at all cars." Hard labels (cat=1, rest=0) throw this class-relationship away entirely.

The mechanism softens the softmax with a temperature, T. A plain softmax squashes the max toward 1 and the rest toward 0; dividing logits by T>1 before softmax flattens the distribution, amplifying and exposing those small probabilities:

$$p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$

Here $z_i$ is the raw score (logit) for class $i$, and $T$ is the temperature knob. $T=1$ is plain softmax; larger $T$ means a smoother distribution and clearer inter-class similarity. During training the student matches the teacher's soft distribution at the same high temperature (using KL divergence to measure the gap between two distributions), usually anchored by a bit of the true hard label too. The diagram below shows the information gap between soft and hard labels:

Supervision signal for one "cat" image

Hard label: cat 1.0 | dog 0 | tiger 0 | car 0 ← all relations lost

Soft label (T=4):
cat 0.65
dog 0.22 ← cat≈dog
tiger 0.11 ← both felines
car 0.02 ← nothing alike
↑ "dark knowledge" = the big model's similarity map of the world
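To make the effect of the temperature concrete, here is a minimal standard-library sketch of the softened softmax. The logits are made-up values for the four classes in the diagram, not the output of any real teacher, and the function name is chosen here for illustration.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes: cat, dog, tiger, car
logits = [8.0, 4.0, 3.0, -2.0]

for T in (1.0, 4.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T:>5}: {[round(p, 3) for p in probs]}")
```

At T=1 almost all mass sits on "cat"; at T=4 the runner-up classes become visible; at T=100 the distribution is close to uniform, which is the "too high a T" failure mode where the similarity signal is washed out.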
Code example
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # 1) Soft targets: both soften with temperature T, compare distributions (KL)
    s_soft = F.log_softmax(student_logits / T, dim=-1)
    t_soft = F.softmax(teacher_logits.detach() / T, dim=-1)  # teacher: detach so no gradient flows back
    kd = F.kl_div(s_soft, t_soft, reduction="batchmean") * (T * T)
    # ↑ multiply by T²: softening shrinks gradients, scale them back

    # 2) Hard target: plain cross-entropy with true labels, an anchor
    ce = F.cross_entropy(student_logits, labels)

    # 3) Weighted mix: alpha favors imitating the teacher, (1-alpha) the truth
    return alpha * kd + (1 - alpha) * ce
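The T² factor in the loss above can be motivated by looking at the gradient of the soft-target term, following the argument in Hinton et al. (2015). The symbols here are chosen for illustration: $q$ is the student's softened distribution, $p$ the teacher's, $z_i$ and $v_i$ are the student and teacher logits, and $N$ is the number of classes. The approximation assumes a large $T$ and zero-mean logits.

```latex
\frac{\partial \mathcal{L}_{\text{soft}}}{\partial z_i}
  = \frac{1}{T}\,(q_i - p_i),
\qquad
q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)},
\quad
p_i = \frac{\exp(v_i/T)}{\sum_j \exp(v_j/T)}

% For large T, \exp(z/T) \approx 1 + z/T, and with zero-mean logits:
q_i - p_i \approx \frac{z_i - v_i}{N\,T}
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}_{\text{soft}}}{\partial z_i}
  \approx \frac{z_i - v_i}{N\,T^{2}}
```

The soft-target gradient therefore shrinks roughly as $1/T^2$. Multiplying the KL term by $T^2$ keeps its contribution comparable to the hard-label cross-entropy when T is changed.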
Common misconception + use case
"Smaller student is always better, it just copies the teacher"—wrong. There's a lower bound on student capacity: too small and it can't even fit the soft distribution, so distillation underperforms training from scratch. Empirically the student is usually 1/2–1/10 of the teacher (e.g. DistilBERT is ~60% of BERT), not arbitrarily small. Another trap: too high a T flattens the distribution toward uniform, drowning the real signal.
📌 Super-individual scenario: run a GPT-5-class model on a narrow task (e.g. sorting email into "todo / reference / ignore") over a few thousand examples with probabilities, use it as teacher to distill a small model deployed locally—everyday classification at zero API cost, millisecond latency, and offline. That's the "big model as teacher, small model as employee" pattern for personal automation.
Takeaway + question
💡 Distillation transfers not the answer but the big model's similarity structure of the world—the "dark knowledge" in soft labels is the truly valuable part.
🤔 If soft labels carry more than hard labels, is the essence of "a good teacher" really "someone willing to expose their own uncertainty"? What does that imply for leading a team or teaching a child?

Pruning

sparsification · delete weights
One-line analogy

Pruning is dropping unused database indexes, or dead-code elimination. In a trained network, a huge fraction of weights have near-zero magnitude—they barely contribute to the output, like an index never hit by a query or a branch never reached. Zero them out (or physically remove them) and the model gets smaller and faster with almost no accuracy loss. The only real question: how do you tell which weights are "dead".

Problem it solves + mechanism

Neural networks are inherently over-parameterized—far more parameters than the task needs, which is the price of making training converge. After training, many parameters are redundant. The simplest and most effective criterion is magnitude pruning: the smaller a weight's absolute value, the less deleting it hurts.

if $|w_{ij}| < \tau$ (the threshold), set $w_{ij} = 0$

Intuition: w is connection strength; |w|≈0 means the connection passes almost no signal, so cutting it is like removing a useless wire. But pruning too hard at once collapses accuracy, so the standard recipe is iterative: "prune a little → fine-tune to recover → prune more" (train-prune-finetune), letting the network gradually adapt to the sparse structure.
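The gradual recipe can be sketched in plain Python. This is a toy illustration under stated assumptions: the weights are a flat list of random numbers, the threshold is chosen implicitly by a target sparsity, and `fine_tune` is a hypothetical placeholder for a real training step that keeps pruned weights at zero.

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest |w|."""
    n_prune = round(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first (already-pruned zeros come first)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def fine_tune(weights):
    """Hypothetical placeholder: a real step would train the surviving weights."""
    return weights

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Raise the sparsity target step by step instead of cutting 60% at once
for target in (0.2, 0.4, 0.6):
    weights = magnitude_prune(weights, target)
    weights = fine_tune(weights)
    actual = sum(w == 0.0 for w in weights) / len(weights)
    print(f"target {target:.0%} -> actual sparsity {actual:.0%}")
```

Because previously pruned weights have magnitude zero, each round re-selects them first and only removes the additional fraction needed to reach the new target.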

A deeper finding is the Lottery Ticket Hypothesis (Frankle & Carbin 2018): inside a large randomly-initialized network there already hides a small "winning" sub-network—pull it out alone, train it with the original init, and it reaches accuracy comparable to the full network. Pruning is, in a sense, "scratching the lottery ticket" to find that sub-network. Two pruning granularities have very different engineering implications:

  • Unstructured pruning—zero out arbitrary individual weights, highest compression, but you get a sparse matrix that ordinary GPUs may not accelerate (needs dedicated sparse kernels);
  • Structured pruning—remove whole rows/columns/attention heads, so the matrix shrinks regularly and runs faster on standard hardware—but at equal accuracy you can remove less.
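The difference between the two granularities shows up in the shape of the result. The sketch below uses a small hand-written matrix and two helper functions whose names are invented for this example; it is not tied to any particular pruning library.

```python
def unstructured_prune(matrix, threshold):
    """Zero individual entries with |w| < threshold; the shape is unchanged."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]

def structured_prune_rows(matrix, keep):
    """Keep only the `keep` rows with the largest L1 norm; the matrix shrinks."""
    norms = [sum(abs(w) for w in row) for row in matrix]
    top = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)[:keep]
    return [matrix[i] for i in sorted(top)]    # preserve the original row order

W = [
    [0.90, -0.02, 0.40, 0.01],
    [0.03, 0.01, -0.02, 0.04],    # a weak row, e.g. one output neuron
    [-0.70, 0.05, 0.60, -0.03],
    [0.02, -0.80, 0.01, 0.50],
]

sparse = unstructured_prune(W, threshold=0.1)
small = structured_prune_rows(W, keep=3)

print(len(sparse), "x", len(sparse[0]))   # same shape, zeros scattered inside
print(len(small), "x", len(small[0]))     # one row fewer: a smaller dense matrix
```

The unstructured result still needs a full-size dense multiply unless a sparse kernel is available, while the structured result is simply a smaller matrix that any dense kernel handles.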
Code example
import torch.nn.utils.prune as prune
import torch.nn as nn

layer = nn.Linear(1024, 1024)

# Magnitude pruning: zero the 40% smallest-magnitude weights in this layer
prune.l1_unstructured(layer, name="weight", amount=0.4)
print((layer.weight == 0).float().mean())  # ≈ 0.40 sparsity

# Key: after pruning, fine-tune a few epochs to recover (loop omitted)
# ... train(model) ...  # surviving weights compensate for the removed

# Once satisfied, finalize: remove the mask so zeros are permanent
prune.remove(layer, "weight")
Common misconception + use case
"Prune 50% of weights = 2x faster"—usually false. Unstructured pruning yields a sparse matrix, but consumer GPUs have no native acceleration for sparse ops, so it may run no faster—you only saved storage. For real inference speedup you typically need structured pruning or sparsity-supporting hardware/kernels. "Compression ratio" and "speedup" are two different things; don't conflate them.
📌 Super-individual scenario: structurally prune + fine-tune a local open-source small model to fit on your laptop or an edge device (a home mini-server), giving you an "always-on, zero-network" private assistant. What pruning buys you here is the ability to run on hardware you already own, not just saved cost.
Takeaway + question
💡 Over-parameterization is a necessary cost of training, but not a necessary burden for deployment—pruning is the bridge of "train big, deploy small."
🤔 The lottery ticket says "a small winner hides inside a big network." This eerily mirrors the human brain forming huge numbers of neural connections early in development and then pruning them en masse—is redundancy-then-pruning a universal learning pattern for intelligent systems?

Low-Rank Adaptation (LoRA / QLoRA)

parameter-efficient · incremental tuning
One-line analogy

Full fine-tuning is like forking and rewriting an entire repo—all 70B params updated, a 140GB copy stored per task, a disaster. LoRA is like a git diff / patch file: the original weights stay frozen (the base repo), and you train only a tiny incremental patch. Each task stores just that patch (a few to tens of MB), added onto the trunk at use time. One base, countless lightweight patches, swapped on demand.

Problem it solves + mechanism

Fully fine-tuning a large model means storing a whole weight set per downstream task—exploding storage and switching cost. The key assumption of LoRA (Hu et al. 2021): the weight change ΔW during fine-tuning is intrinsically "low-rank"—not that complex, approximable by the product of two skinny matrices. Instead of updating a d×k matrix, you decompose it into B (d×r) × A (r×k), where the rank r is tiny (often 8 or 16):

W' = W + ΔW ≈ W + B·A   (r ≪ min(d, k))

Parameter count drops from d×k to r×(d+k). Example, d=k=4096, r=8: full is 16.7M params, LoRA only 65K—~250× smaller. The frozen W provides "general ability," the tiny BA provides "task specialization." At inference BA merges into W with zero added latency. Dimension intuition below:

Approximate one big update with two skinny matrices

Full ΔW:             d × k   (4096 × 4096)   = 16.7M params
LoRA decomposition:  B (d × r, 4096 × 8)  ×  A (r × k, 8 × 4096)   → same d × k shape
↑ train only B and A = 65K params (~0.4%); W stays frozen
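The parameter arithmetic and the zero-latency merge can both be checked in a few lines of plain Python. This is a sketch with tiny random matrices; the helper names are made up here, and the `lora_alpha / r` scaling that real implementations apply to B·A is left out for brevity.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_param_counts(d, k, r):
    """Return (full, lora) trainable parameter counts for a d x k layer."""
    return d * k, r * (d + k)

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, round(full / lora))    # 16777216 65536 256

# Tiny numeric check of the merge: (W + B·A) x equals W x + B (A x)
random.seed(0)
d, k, r = 6, 5, 2
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]   # frozen base, d x k
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]   # trainable, d x r
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # trainable, r x k
x = [random.gauss(0, 1) for _ in range(k)]

BA = matmul(B, A)
merged = [[w + dw for w, dw in zip(w_row, ba_row)] for w_row, ba_row in zip(W, BA)]

y_merged = matvec(merged, x)                                     # single matrix after merging
y_split = [a + b for a, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]  # base path + adapter path
print(max(abs(a - b) for a, b in zip(y_merged, y_split)))        # difference at rounding-error level
```

The two outputs agree up to floating-point rounding, which is why a trained adapter can be folded into W once and then served with no extra matrix multiply.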

QLoRA (Dettmers et al. 2023) goes further: store the frozen base quantized to 4-bit in VRAM, keeping only the small LoRA adapter at high precision for training. This lets a single 48GB GPU fine-tune a 65B model. Its three key inventions: NF4 (4-bit NormalFloat, a data type information-theoretically optimal for normally-distributed weights), double quantization (quantize even the quantization constants for a bit more savings), and paged optimizers (trade VRAM for RAM to absorb memory spikes). In short: LoRA saves "params to train," QLoRA further saves "VRAM the frozen base occupies."
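A back-of-the-envelope calculation shows why the 4-bit base is what makes the 48GB figure reachable. The block sizes below (64 weights per quantization block, 256 constants per second-level block) are taken from the QLoRA paper's description as an assumption of this sketch; the estimate covers only the frozen weights and ignores activations, adapter weights, and optimizer state.

```python
def weight_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def constant_overhead_bits(block_size=64, second_block=256, double_quant=True):
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / block_size                  # one FP32 constant per block
    # 8-bit constants per block, plus one FP32 constant per group of constants
    return 8 / block_size + 32 / (block_size * second_block)

n = 65e9  # 65B parameters
rows = [
    ("FP16 base", 16),
    ("NF4, plain constants", 4 + constant_overhead_bits(double_quant=False)),
    ("NF4 + double quantization", 4 + constant_overhead_bits(double_quant=True)),
]
for label, bits in rows:
    print(f"{label:28s} {weight_gb(n, bits):6.1f} GB")
```

Under these assumptions the base drops from about 130 GB to roughly 36.6 GB, and double quantization trims it further to about 33.5 GB, leaving headroom on a 48GB card for the adapter and training state.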

Code example
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load base quantized to 4-bit (NF4 + double quantization)
bnb = BitsAndBytesConfig(load_in_4bit=True,
        bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B", quantization_config=bnb)

# Attach LoRA adapters only on attention q/v projections, rank r=8
cfg = LoraConfig(r=8, lora_alpha=16,
        target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(base, cfg)

model.print_trainable_parameters()
# → reports roughly 3.4M trainable params out of ~8B (≈0.04%); everything else is frozen
Common misconception + use case
"LoRA must underperform full fine-tuning"—not absolute. On most downstream adaptation tasks LoRA matches or nearly matches full tuning, but when the task requires the model to learn large amounts of brand-new knowledge (rather than adjusting existing ability), the low-rank assumption becomes a bottleneck—ΔW isn't actually low-rank and r=8 can't hold it. Rule of thumb: style/format/domain-tone adaptation → LoRA is enough; injecting an entirely new body of facts → may need larger r or full tuning.
📌 Super-individual scenario: use QLoRA to fine-tune a local model on your own writing samples, getting an adapter (tens of MB) that "writes in your voice." Train another for "code comments in your voice"—same base, different patches, hot-swapped by context. This turns "personal style" into a reusable, version-controllable small file.
Takeaway + question
💡 LoRA's core insight: adapting ≠ rewriting. Most "learning a new task" is really patching a low-rank delta onto a huge ability base.
🤔 "Ability is the frozen base, the task is a pluggable patch"—can this architecture describe human professional growth in reverse? Your general judgment is W, each new job a LoRA?
 Engineering counterpart → super-individual D12 (Fine-tuning in practice and trade-offs)

Quantization

numeric compression · lower precision
One-line analogy

Quantization is storing numbers in a smaller data type—compressing FP16 (16-bit float) down to INT8 or even INT4, like turning a PNG into a JPEG or changing a DOUBLE column to SMALLINT. You trade precision for space and speed: model weights don't need that many significant digits, so chopping the low bits halves the size, then halves again, with memory bandwidth and compute dropping in step. The cost is quantization noise; the trick is keeping it from wrecking the model.

Problem it solves + mechanism

The bottleneck in large-model inference is often not compute but the bandwidth of moving hundreds of GB of weights from VRAM into the compute units. The smaller the weights, the faster they move and the more of them fit. The mathematical core of quantization is a linear mapping: map a continuous float range [min, max] uniformly onto $2^b$ integer buckets.

$$\text{scale} = \frac{\max - \min}{2^b - 1}$$

$$x_q = \operatorname{round}\!\left(\frac{x - \min}{\text{scale}}\right) \;\xrightarrow{\text{store}}\; \hat{x} = x_q \cdot \text{scale} + \min$$

Intuition: scale is "how big a float span each integer bucket represents," and $b$ is the bit width (8 for INT8, 4 for INT4). On store, shift the float by min, divide by scale, and round to a small integer $x_q$; on use, multiply by scale and add min back to recover an approximate float $\hat{x}$. Symmetric schemes centre the range on zero and drop the offset, leaving $x_q = \operatorname{round}(x / \text{scale})$. The gap between $\hat{x}$ and the original $x$ is the quantization error. Lower bit width means fewer buckets and larger error: that's the essence of "precision for space." Below: cramming continuous values into 4 buckets (2-bit):

Continuous float → discrete integer buckets (2-bit = 4 buckets)

float axis ├──────┼──────┼──────┼──────┤
         min     ▼0.31→snaps to bucket 1
int value   0    1    2    3
↑ real values between buckets get "rounded" to the nearest point = quantization error
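The bucket picture can be reproduced with a short standard-library sketch. It uses the affine form of the mapping, which subtracts min before dividing by scale so that the codes land in 0 … 2^b − 1; the sample values are made up, with 0.31 included to match the diagram.

```python
def quantize(values, bits):
    """Map floats in [min, max] onto 2**bits integer levels (uniform, affine)."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]   # integers in [0, 2**bits - 1]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]

values = [0.0, 0.31, 0.5, 0.74, 1.0]

for bits in (2, 4, 8):
    codes, scale, lo = quantize(values, bits)
    restored = dequantize(codes, scale, lo)
    max_err = max(abs(v - r) for v, r in zip(values, restored))
    print(f"{bits}-bit: scale={scale:.4f}  codes={codes}  max error={max_err:.4f}")
```

With 2 bits, 0.31 snaps to bucket 1 as in the diagram. The worst-case error is at most half a bucket (scale / 2), so each extra bit roughly halves it.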

Two practical keys. First, outliers: LLM weights/activations occasionally have extreme values that stretch [min, max] wide, squeezing most normal values into a few buckets and destroying precision. Handling such outliers is central to LLM.int8(), which keeps them in higher precision, and limiting the resulting error is a recurring concern for methods such as GPTQ. Second, post-training vs quantization-aware:

  • Post-Training Quantization (PTQ)—quantize after training, zero extra training. GPTQ (Frantar et al. 2022) uses approximate second-order information to calibrate layer by layer, compressing 175B models to 3-4 bit with near-zero accuracy loss—the flagship of this line;
  • Quantization-Aware Training (QAT)—simulate quantization error during training so the model "adapts in advance." More accurate but requires retraining, higher cost.
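The outlier problem described above is easy to reproduce numerically. The sketch below is a simplified stand-in, not LLM.int8() or GPTQ themselves: it quantizes normally distributed values symmetrically to 4 bits, then adds one hypothetical extreme value, then clips the range so the outlier would have to be stored separately.

```python
import random

def symmetric_quant_error(values, bits=8, clip=None):
    """Mean |x - dequant(quant(x))| over the values that lie inside the range."""
    limit = clip if clip is not None else max(abs(v) for v in values)
    qmax = 2 ** (bits - 1) - 1
    scale = limit / qmax
    errors = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        if abs(v) <= limit:                 # values beyond the clip are not scored
            errors.append(abs(v - q * scale))
    return sum(errors) / len(errors)

random.seed(0)
normal = [random.gauss(0.0, 1.0) for _ in range(10_000)]
with_outlier = normal + [60.0]              # one hypothetical extreme value

err_clean = symmetric_quant_error(normal, bits=4)
err_outlier = symmetric_quant_error(with_outlier, bits=4)
err_clipped = symmetric_quant_error(with_outlier, bits=4, clip=4.0)

print(f"no outlier:        {err_clean:.3f}")
print(f"one outlier:       {err_outlier:.3f}")
print(f"outlier clipped:   {err_clipped:.3f}")
```

A single extreme value stretches the scale so far that nearly every ordinary value rounds to zero. Excluding it from the range restores the error to roughly its original level, at the cost of handling the outlier some other way (for example, in higher precision).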
Code example
import torch

def quantize_int8(w):
    # Symmetric quantization: scale from the max abs value, zero-point at 0
    scale = w.abs().max() / 127.0          # INT8 range [-127,127]
    w_q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_q, scale                       # store int8 weights + one fp scale

def dequantize(w_q, scale):
    return w_q.to(torch.float32) * scale     # recover approx float at use time

w = torch.randn(4096, 4096)              # fp32 weights: 64 MB
w_q, s = quantize_int8(w)               # int8: 16 MB, 1/4 the size
err = (w - dequantize(w_q, s)).abs().mean()
print(f"mean quantization error {err:.5f}")  # tiny noise, model barely notices
Common misconception + use case
"4-bit = half the size AND half the quality"—wrong. The accuracy loss is far from linear. INT8 being near-lossless is common knowledge; 4-bit with calibration methods like GPTQ/NF4 still stays close to the original on most tasks; the real cliff usually comes below 3-bit. "Half the bit width" and "half the quality" are nowhere near proportional—which is exactly why quantization is so popular.
📌 Super-individual scenario: to run a local LLM on your laptop, just pick the community's 4-bit quantized build (GGUF/GPTQ format). A 70B model needs 140GB in FP16 and simply won't run; quantized to 4-bit it's ~35-40GB and, with system RAM, runs on personal hardware—quantization is the step that takes "personally owning a strong model" from impossible to possible.
Takeaway + question
💡 Quantization reveals a counterintuitive fact: the effective information in big-model weights is far below the bits they occupy—most precision is redundant.
🤔 If 16-bit weights compress to 4-bit with almost no loss, does that mean the model's "knowledge" is intrinsically sparse / low information density? Is this the same principle as the human brain storing memory in a fuzzy, approximate way?

Deep Questions

1. Why can distillation, pruning, LoRA, and quantization "stack"? What dimension does each compress?
Because they're largely orthogonal: they act on different axes. Distillation swaps "the model itself" (big architecture → small, fewer layers or less width); pruning cuts "the number of weights" (delete redundant connections, making matrices sparse or smaller); LoRA cuts "params to train/store" (freeze the base, store only a low-rank delta; this is a training-side, not inference-side, compression); quantization cuts "bits per weight" (numeric precision only, leaving count and structure untouched). So real-world combos chain together: distill a small model → prune redundancy → QLoRA-adapt it to the task → quantize for deployment. Each step compresses a different dimension, and the factors multiply, which is how stacks reach tens-of-times total compression. The practical value of seeing this: when your bottleneck is VRAM, prioritize quantization and pruning; when it's "a copy per task is too expensive," use LoRA; when it's latency and you can swap architectures, distill. Diagnose the bottleneck first rather than blindly applying all four.
2. Quantization, pruning, and distillation all "drop information yet barely lose accuracy." Is there one deep reason behind this?
Largely yes: it traces back to the over-parameterization and redundancy of neural networks. Deep nets have far more parameters than the task needs, which helps high-dimensional optimization converge smoothly (a flatter loss landscape, easier to find good minima). But once training is done, that "redundancy prepared for trainability" becomes a deployment burden. The three techniques drain the same redundancy from different angles: pruning says "many weights are ≈0, delete them"; quantization says "many high-precision bits are noise, chop them"; distillation says "a small model's capacity is actually enough for this task's true complexity, it just can't be trained directly, so let the big model teach it." The lottery ticket hypothesis says it most sharply: the useful "sub-network" was there all along; the big model is easier-to-train scaffolding around it. This raises a deep question: if deployment needs only this much effective capacity, what does the enormous training-time scale actually buy, "ability" or "optimizability"? For a fixed task, the evidence from these techniques leans toward the latter: much of the scale makes training tractable rather than being fully used at inference.
3. QLoRA quantizes the base to 4-bit and still fine-tunes fine. Why doesn't the "noisy 4-bit base" wreck training?
Several mechanisms stack. First, gradients flow only to the high-precision LoRA adapter—the 4-bit base is frozen, receives no updates, so quantization error isn't amplified or accumulated in backprop; it merely provides a "good enough" reference for the forward pass. Second, the NF4 data type is information-theoretically optimized for normally-distributed weights—network weights are roughly normal, and NF4 packs its limited buckets more densely in high-probability regions, so error at 4-bit is far below uniform quantization. Third, double quantization further compresses the overhead of the quantization constants themselves with almost no added error. Fourth and most crucial: the high-precision LoRA adapter can "compensate" for the base's quantization noise—during training, BA automatically learns a delta that partly cancels the bias from quantization. So QLoRA isn't "training hard on a bad base," it's "freezing an adequate approximate base + a precise learnable patch that corrects it." This also explains why directly quantizing a model to 4-bit for inference sometimes underperforms a QLoRA-fine-tuned one—the latter has an adapter compensating.
4. "Dark knowledge in soft labels" and "low-rank ΔW" seem unrelated—but are they both saying the task's "true complexity" is low?
A beautiful connection. Distillation's soft labels reveal that a classification task's true signal isn't "the one-hot hard answer" but a low-dimensional similarity structure (cat≈dog≫car)—far simpler than "N independent categories," which is why a small model can learn it. LoRA's low-rank ΔW reveals that adapting a general model to a specific task requires a change that is intrinsically low-dimensional—you needn't rewrite the whole 16M-param matrix; a rank-8 patch suffices. Both, in different languages, state the same prior: the "intrinsic dimension" of real-world tasks is far below the model's "apparent dimension." There's dedicated research on this (e.g. Aghajanyan et al. on the intrinsic dimension of fine-tuning). For a complexity-science lens, it echoes a deep theme: a high-dimensional system's effective degrees of freedom often concentrate on a low-dimensional manifold—whether the order parameters of a physical system or the task representation of a neural net. Compression is possible because what "looks big" is, in information terms, not big at all.
5. Compression makes "owning a strong model on personal devices" possible. What does this mean for the power structure of "AI super-individuals"?
This is the political economy behind the technical question. Uncompressed frontier models only run in big companies' data centers, so ability is centralized in a few API providers—you rent intelligence but don't own it, and can be cut off, price-hiked, or censored at any time. Compression (especially quantization + LoRA) tips the balance back: 4-bit quantization fits a 70B model onto consumer hardware, LoRA lets you privatize a general model with a tens-of-MB patch, distillation lets you "download" a big model's ability into a small one you own. Together they point to a possible "sovereign AI"—intelligence running on your device, on your data, offline, private, not remotely shut-off-able. For those chasing "super-individual," this isn't just saving money but ability sovereignty: whether your core intelligence infrastructure is rented or owned sets the ceiling on your autonomy. The cost is taking on ops and alignment responsibility yourself—exactly the line between "super-individual" and "platform user." The maturation of compression is the key link that turns this path from ideal into feasible.