A 70B-parameter model in FP16 needs 140 GB of VRAM just for the weights, more than any consumer GPU can hold. Model compression answers one question from four angles: can we use a smaller model, fewer weights, a smaller change, or lower precision to fit this pile of parameters onto hardware you can afford? The four techniques are orthogonal and stackable: distillation (swap to a smaller model), pruning (delete weights), LoRA (train only the delta), quantization (lower precision). The focus here is not engineering tuning but the mechanism and math of why each technique works.
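The 140 GB figure is just parameters times bytes per parameter. A minimal sketch of that arithmetic, assuming weight-only storage (no activations, KV cache, or optimizer state) and 1 GB = 1e9 bytes; the helper name `weight_memory_gb` is made up for this example:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Footprint of a 70B-parameter model at each precision.
footprints = {dtype: weight_memory_gb(70e9, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gb in footprints.items():
    print(f"70B @ {dtype}: {gb:.0f} GB")
```

Each halving of the bit width halves the footprint, which is why lower precision is the most direct of the four levers.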
Distillation is like a senior engineer mentoring a junior in code review. A poor mentor just says "pick A" (hard label); a good one says "A is ~80% right, B is plausible with ~10%, C basically never" (soft label). The latter carries far more information—the junior learns not just the answer but the similarity structure between classes. Distillation makes a small model (student) imitate the full probability distribution output by a large model (teacher), not just the final answer.
Big models are accurate but expensive; small ones are fast but dumb. Can a small model "inherit" the big one's judgment? The key insight (Hinton 2015): in a big model's softmax output, the tiny probabilities on the wrong classes (cat=0.9, dog=0.08, car=0.0001) hold enormous "dark knowledge"—they tell you "cats resemble dogs, and not at all cars." Hard labels (cat=1, rest=0) throw this class-relationship away entirely.
The mechanism softens the softmax with a temperature, T. A plain softmax squashes the max toward 1 and the rest toward 0; dividing logits by T>1 before softmax flattens the distribution, amplifying and exposing those small probabilities:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Here $z_i$ is the raw score (logit) for class i, and T is the temperature knob. T=1 is plain softmax; larger T means a smoother distribution and clearer inter-class similarity. During training the student matches the teacher's soft distribution at the same high temperature (using KL divergence to measure the gap between two distributions), usually anchored by a bit of the true hard label too.
[Figure: the information gap between soft and hard labels]
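To see the softening in numbers, here is a minimal sketch in plain Python (standard library only). The logits are made up for illustration and do not come from a real teacher, and the function name `softmax_with_temperature` is just a label chosen here:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T (T=1 is the plain softmax)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (cat, dog, car).
teacher_logits = [9.0, 6.5, 0.0]

sharp = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)

for name, p1, p4 in zip(["cat", "dog", "car"], sharp, soft):
    print(f"{name}: T=1 -> {p1:.4f}   T=4 -> {p4:.4f}")
```

Raising T moves probability mass from the top class onto the runners-up while keeping their ranking, which is the signal the student is trained to match.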
```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # 1) Soft targets: both soften with temperature T, compare distributions (KL)
    s_soft = F.log_softmax(student_logits / T, dim=-1)
    t_soft = F.softmax(teacher_logits.detach() / T, dim=-1)  # teacher: no gradient
    # Multiply by T²: softening shrinks gradients, so scale them back
    kd = F.kl_div(s_soft, t_soft, reduction="batchmean") * (T * T)
    # 2) Hard target: plain cross-entropy with true labels, an anchor
    ce = F.cross_entropy(student_logits, labels)
    # 3) Weighted mix: alpha favors imitating the teacher, (1-alpha) the truth
    return alpha * kd + (1 - alpha) * ce
```
Pruning is like dropping unused database indexes, or dead-code elimination. In a trained network, a large fraction of weights have near-zero magnitude: they barely contribute to the output, like an index never hit by a query or a branch never reached. Zero them out (or physically remove them) and the model gets smaller and faster, often with little accuracy loss. The only real question: how do you tell which weights are "dead"?
Neural networks are inherently over-parameterized—far more parameters than the task needs, which is the price of making training converge. After training, many parameters are redundant. The simplest and most effective criterion is magnitude pruning: the smaller a weight's absolute value, the less deleting it hurts.
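Written as a formula, the magnitude criterion might look like the sketch below. The threshold $\tau$ and the mask $m$ are notation assumed here for illustration:

```latex
m_i =
\begin{cases}
1 & \text{if } |w_i| > \tau \\
0 & \text{otherwise}
\end{cases}
\qquad
\tilde{w}_i = m_i \, w_i
```

In practice $\tau$ is usually picked as a percentile of $|w|$, so that a chosen fraction of the weights is removed.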
Intuition: w is connection strength; |w|≈0 means the connection passes almost no signal, so cutting it is like removing a useless wire. But pruning too hard at once collapses accuracy, so the standard recipe is iterative: "prune a little → fine-tune to recover → prune more" (train-prune-finetune), letting the network gradually adapt to the sparse structure.
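The prune, fine-tune, prune-more loop can be sketched in plain Python. This is a toy under stated assumptions: the weights are a flat list of random numbers, and `fine_tune` is a hypothetical placeholder standing in for a real training loop:

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest absolute value."""
    n_prune = round(len(weights) * sparsity)
    # Indices sorted by |w|; the first n_prune are the weakest connections.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def fine_tune(weights):
    """Hypothetical placeholder: a real loop would retrain the surviving
    weights while keeping the pruned positions at zero."""
    return weights

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Raise sparsity in small steps instead of jumping straight to the target.
measured = []
for target in (0.2, 0.4, 0.6):
    weights = magnitude_prune(weights, target)
    weights = fine_tune(weights)
    measured.append(sum(w == 0.0 for w in weights) / len(weights))
    print(f"target {target:.0%} -> measured sparsity {measured[-1]:.0%}")
```

Already-zeroed weights have the smallest magnitude, so each round keeps the earlier cuts and only extends them.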
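A deeper finding is the Lottery Ticket Hypothesis (Frankle & Carbin 2018): inside a large randomly-initialized network there already hides a small "winning" sub-network; pull it out alone, train it from the original initialization, and it reaches accuracy comparable to the full network. Pruning is, in a sense, "scratching the lottery ticket" to find that sub-network. Two pruning granularities have very different engineering implications: unstructured pruning zeroes individual weights, which keeps the matrix shape and needs sparse-kernel or hardware support to turn into real speedups, while structured pruning removes whole rows, channels, or attention heads, which leaves smaller dense matrices that run faster on ordinary hardware.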
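The difference between granularities is easiest to see on a tiny matrix. The sketch below assumes the two granularities are unstructured (individual entries) and structured (whole rows, ranked by L1 norm); the function names and the 4×4 values are invented for illustration:

```python
def unstructured_prune(matrix, amount):
    """Zero the `amount` fraction of individual entries with the smallest |w|."""
    flat = sorted(abs(w) for row in matrix for w in row)
    n_prune = round(len(flat) * amount)
    if n_prune == 0:
        return [list(row) for row in matrix]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def structured_prune_rows(matrix, amount):
    """Drop the `amount` fraction of rows (output neurons) with the smallest L1 norm."""
    n_drop = round(len(matrix) * amount)
    ranked = sorted(range(len(matrix)), key=lambda r: sum(abs(w) for w in matrix[r]))
    keep = sorted(ranked[n_drop:])  # preserve the original row order
    return [list(matrix[r]) for r in keep]

W = [
    [0.9, -0.1, 0.4, 0.05],
    [0.02, 0.03, -0.01, 0.04],
    [-0.7, 0.6, 0.08, -0.5],
    [0.2, -0.3, 0.06, 0.07],
]

sparse_W = unstructured_prune(W, amount=0.5)     # same 4x4 shape, scattered zeros
small_W = structured_prune_rows(W, amount=0.5)   # physically smaller 2x4 matrix

print("unstructured:", sparse_W)
print("structured:  ", small_W)
```

The unstructured result still costs a full 4×4 dense matmul unless the kernel can skip zeros; the structured result is simply a smaller matrix.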
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Magnitude pruning: zero the 40% smallest-magnitude weights in this layer
prune.l1_unstructured(layer, name="weight", amount=0.4)
print((layer.weight == 0).float().mean())  # ≈ 0.40 sparsity

# Key: after pruning, fine-tune a few epochs to recover (loop omitted)
# ... train(model) ...  # surviving weights compensate for the removed ones

# Once satisfied, finalize: remove the mask so zeros are permanent
prune.remove(layer, "weight")
```
Full fine-tuning is like forking and rewriting an entire repo—all 70B params updated, a 140GB copy stored per task, a disaster. LoRA is like a git diff / patch file: the original weights stay frozen (the base repo), and you train only a tiny incremental patch. Each task stores just that patch (a few to tens of MB), added onto the trunk at use time. One base, countless lightweight patches, swapped on demand.
Fully fine-tuning a large model means storing a whole weight set per downstream task, which explodes storage and switching cost. The key assumption of LoRA (Hu et al. 2021): the weight change ΔW during fine-tuning is intrinsically "low-rank", that is, simple enough to be approximated by the product of two skinny matrices. Instead of updating a d×k matrix directly, you decompose the update into B (d×r) × A (r×k), where the rank r is tiny (often 8 or 16):
$$W' = W + \Delta W = W + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$
Parameter count drops from d×k to r×(d+k). Example, d=k=4096, r=8: full is 16.7M params (4096×4096 = 16,777,216), LoRA only 65K (8×8192 = 65,536), which is 256× smaller. The frozen W provides "general ability," the tiny BA provides "task specialization." At inference BA merges into W with zero added latency.
[Figure: dimension intuition for the LoRA matrices]
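Both claims, the parameter count and the zero-latency merge, can be checked with a few lines of plain Python. This is a sketch with arbitrary small matrices; it leaves out the `lora_alpha / r` scaling that real implementations apply to BA, and the helper names are made up for the example:

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n×m) and Y (m×p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_param_counts(d, k, r):
    """(full, lora) trainable-parameter counts for one d×k weight matrix."""
    return d * k, r * (d + k)

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)

# Tiny example with d=3, k=2, r=1 (values are arbitrary).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # frozen base, d×k
B = [[0.5], [-1.0], [2.0]]                 # trainable, d×r
A = [[1.0, -2.0]]                          # trainable, r×k
x = [[1.0], [1.0]]                         # input column vector, k×1

# Adapter kept separate: h = W x + B (A x)
h_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))
# Adapter merged into the base: h = (W + B A) x
h_merged = matmul(add(W, matmul(B, A)), x)
print(h_adapter, h_merged)
```

Because the two paths give the same output, a deployed model can fold BA into W once and serve requests with a single matmul, or keep the adapter separate to swap tasks on the fly.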
QLoRA (Dettmers et al. 2023) goes further: store the frozen base quantized to 4-bit in VRAM, keeping only the small LoRA adapter at high precision for training. This lets a single 48GB GPU fine-tune a 65B model. Its three key inventions: NF4 (4-bit NormalFloat, a data type information-theoretically optimal for normally-distributed weights), double quantization (quantize even the quantization constants for a bit more savings), and paged optimizers (page optimizer state out to CPU RAM when VRAM runs short, absorbing memory spikes). In short: LoRA saves "params to train," QLoRA further saves "VRAM the frozen base occupies."
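A rough back-of-the-envelope for why 65B fits in 48 GB, and what double quantization buys. The block sizes (64 weights per first-level block, 256 constants per second-level block) and the 32-bit and 8-bit constant widths are taken from the QLoRA paper's described defaults rather than from the text above, so treat them as assumptions; activations, adapters, and optimizer state are ignored:

```python
def quant_constant_overhead_bits(block=64, const_bits=32):
    """Extra bits per parameter for storing one scale constant per block."""
    return const_bits / block

def double_quant_overhead_bits(block=64, block2=256, const_bits=8, const2_bits=32):
    """Overhead when first-level constants are themselves quantized to 8 bits
    in groups of `block2`, each group carrying one 32-bit second-level constant."""
    return const_bits / block + const2_bits / (block * block2)

params = 65e9
plain_bits = quant_constant_overhead_bits()
double_bits = double_quant_overhead_bits()

base_gb = params * 4 / 8 / 1e9                        # 4-bit weights alone
total_plain_gb = base_gb + params * plain_bits / 8 / 1e9
total_double_gb = base_gb + params * double_bits / 8 / 1e9
saved_gb = total_plain_gb - total_double_gb

print(f"4-bit weights: {base_gb:.1f} GB")
print(f"with plain constants:  {total_plain_gb:.2f} GB")
print(f"with double quant:     {total_double_gb:.2f} GB (saves {saved_gb:.2f} GB)")
```

The saving per parameter is a fraction of a bit, but multiplied by 65 billion parameters it adds up to a few GB of headroom.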
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load base quantized to 4-bit (NF4 + double quantization)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb
)

# Attach LoRA adapters only on attention q/v projections, rank r=8
cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, cfg)
model.print_trainable_parameters()
# → ~3.4M trainable of ~8B total (≈0.04%), everything else frozen
```
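The trainable-parameter count can be estimated by hand from the layer shapes. The sketch below assumes the Llama-3-8B architecture (hidden size 4096, 32 layers, grouped-query attention with 8 KV heads of dimension 128, so `v_proj` maps 4096 to 1024); those shapes are not stated in the text, so check them against the model config before relying on the result:

```python
def lora_params(d_in, d_out, r):
    """Parameters in one LoRA pair: A is r×d_in, B is d_out×r."""
    return r * (d_in + d_out)

# Assumed Llama-3-8B shapes (see the note above).
hidden, kv_dim, layers, r = 4096, 1024, 32, 8

per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r)  # q_proj + v_proj
total = per_layer * layers
adapter_mb_fp16 = total * 2 / 1e6   # 2 bytes per parameter

print(f"{total:,} trainable parameters, about {adapter_mb_fp16:.1f} MB in FP16")
```

That puts one adapter in the single-digit megabyte range, which is what makes per-task patches cheap to store and swap.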
Quantization is storing numbers in a smaller data type—compressing FP16 (16-bit float) down to INT8 or even INT4, like turning a PNG into a JPEG or changing a DOUBLE column to SMALLINT. You trade precision for space and speed: model weights don't need that many significant digits, so chopping the low bits halves the size, then halves again, with memory bandwidth and compute dropping in step. The cost is quantization noise; the trick is keeping it from wrecking the model.
The bottleneck in large-model inference is often not compute but the bandwidth of moving hundreds of GB of weights from VRAM into the compute units. The smaller the weights, the faster they move and the more of them fit. The math core of quantization is a linear mapping: map a continuous float range [min, max] uniformly onto $2^b$ integer buckets.
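A common way to write this mapping is the affine form below. Treat it as a sketch: the offset by min and the symbol names ($\text{scale}$, $x_q$, $\hat{x}$) are assumptions chosen to line up with the surrounding description.

```latex
\text{scale} = \frac{\max - \min}{2^b - 1}, \qquad
x_q = \operatorname{round}\!\left(\frac{x - \min}{\text{scale}}\right), \qquad
\hat{x} = x_q \cdot \text{scale} + \min
```

In the symmetric case, where the range is centered on zero and $x_q$ is a signed integer, the offset drops out and what remains is a divide by scale on store and a multiply by scale on use.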
Intuition: scale is "how big a float span each integer bucket represents," and b is the bit width (8 for INT8, 4 for INT4). On store, divide the float by scale and round to a small integer $x_q$; on use, multiply by scale to recover an approximate float x̂. The gap between x̂ and the original x is the quantization error. Lower bit width means fewer buckets and larger error; that is the essence of "precision for space."
[Figure: continuous values crammed into 4 buckets (2-bit)]
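Here is a small plain-Python sketch of the bucket idea at 2, 4, and 8 bits. It uses the min/max (affine) variant of the mapping, which is an assumption made for this example, and the sample values are invented:

```python
def quantize(values, bits):
    """Affine (min/max) quantization of a list of floats to `bits`-bit integer codes."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1            # number of steps between the 2**bits buckets
    scale = (hi - lo) / levels
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floats."""
    return [c * scale + lo for c in codes]

values = [-1.0, -0.6, -0.1, 0.2, 0.45, 1.0]

results = {}
for bits in (2, 4, 8):
    codes, scale, lo = quantize(values, bits)
    recon = dequantize(codes, scale, lo)
    max_err = max(abs(v - r) for v, r in zip(values, recon))
    results[bits] = (codes, scale, max_err)
    print(f"{bits}-bit: codes={codes} max error={max_err:.4f}")
```

At 2 bits every value has to land on one of only four codes, so neighbours collapse onto the same bucket; each extra bit doubles the bucket count and roughly halves the worst-case error.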
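Two practical keys. First, outliers: LLM weights/activations occasionally have extreme values that stretch [min,max] wide, squeezing most normal values into a few buckets and destroying precision. The core of LLM.int8(), GPTQ and friends is handling outliers. Second, post-training quantization (PTQ) versus quantization-aware training (QAT): PTQ quantizes an already-trained model, typically with a small calibration set and no retraining, while QAT simulates quantization during training so the weights learn to tolerate the rounding.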
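The outlier problem is easy to reproduce. The sketch below quantizes 256 small random values to INT8, then repeats the experiment with one value replaced by an extreme outlier. Giving each block its own scale is one simple remedy picked for illustration; it is not the specific scheme used by LLM.int8() or GPTQ, and all names and values are invented for the example:

```python
import random

def absmax_quantize(values, bits=8):
    """Symmetric quantization: one scale from the largest |value|, zero-point at 0.
    Returns the dequantized approximation."""
    qmax = 2 ** (bits - 1) - 1                          # 127 for INT8
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]

def blockwise_quantize(values, block_size, bits=8):
    """Give each block of `block_size` values its own scale, so an outlier
    only degrades the block it lives in."""
    recon = []
    for start in range(0, len(values), block_size):
        recon.extend(absmax_quantize(values[start:start + block_size], bits))
    return recon

def mean_abs_error(values, recon):
    return sum(abs(v - r) for v, r in zip(values, recon)) / len(values)

random.seed(0)
normal = [random.gauss(0.0, 0.02) for _ in range(256)]
with_outlier = normal[:-1] + [8.0]        # one extreme value stretches the range

err_clean = mean_abs_error(normal, absmax_quantize(normal))
err_per_tensor = mean_abs_error(with_outlier, absmax_quantize(with_outlier))
err_blockwise = mean_abs_error(with_outlier, blockwise_quantize(with_outlier, 64))

print(f"no outlier, one scale:      {err_clean:.6f}")
print(f"outlier, one scale:         {err_per_tensor:.6f}")
print(f"outlier, per-block scales:  {err_blockwise:.6f}")
```

With a single scale, the outlier makes each bucket so wide that most of the small values round to zero; per-block scales confine that damage to the one block containing the outlier.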
```python
import torch

def quantize_int8(w):
    # Symmetric quantization: scale from the max abs value, zero-point at 0
    scale = w.abs().max() / 127.0  # INT8 range [-127, 127]
    w_q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_q, scale  # store int8 weights + one fp scale

def dequantize(w_q, scale):
    return w_q.to(torch.float32) * scale  # recover approx float at use time

w = torch.randn(4096, 4096)  # fp32 weights: 64 MB
w_q, s = quantize_int8(w)    # int8: 16 MB, 1/4 the size
err = (w - dequantize(w_q, s)).abs().mean()
print(f"mean quantization error {err:.5f}")  # tiny noise, model barely notices
```