AI/ML Explained: Tokenization Deep Dive

Day 11 · 2026-05-28 · Difficulty ★★★☆☆

For: engineers with coding experience, not from an AI background

Byte Pair EncodingBPE

subword algorithmcompression origin

One-line analogy

BPE is essentially a greedy dictionary-compression algorithm—same ancestor as gzip / LZ, which compress files by "replacing frequent byte sequences with short codes" (Gage's 1994 data-compression algorithm). The only difference is the objective: compression software makes files smaller, BPE makes the vocabulary big enough yet not explosive. Sennrich et al. (2016) carried it from compression into NLP.

What it solves + how it works

Both extremes fail. Word-level vocab (one token per word): English has hundreds of thousands of words plus spelling variants and neologisms—the vocab is unbounded, and any word unseen in training (OOV, out-of-vocabulary) just breaks. Character-level: vocab shrinks to a few dozen chars, but a sentence explodes into hundreds of tokens, and since attention is O(n²), long sequences become unaffordable. BPE takes the middle ground: common words stay one token, rare words split into subword pieces.

The mechanism is greedy merging: ① init vocab = all single characters; ② count the frequency of adjacent token pairs in the corpus; ③ merge the most frequent pair into a new token and add it to the vocab; ④ repeat ②③ until the vocab hits a target size (say 50K). The training output is an ordered list of merge rules, applied in the same order at inference:

corpus (split into chars): low low low lower newest widest

Round 1: top pair e + s → merge into es
Round 2: top pair es + t → merge into est
Round 3: top pair l + o → merge into lo
Round 4: lo + w → low (now "low" is a single token)

vocab = [chars...] + es + est + lo + low ...　"newest" → "new" + "est"

Code

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# 1) empty BPE, pre-split on whitespace (BPE assumes pre-tokenized)
tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# 2) train: target vocab 5000, repeatedly merge frequent pairs
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tok.train(["corpus.txt"], trainer)   # output = ordered merge-rule table

# 3) encode: apply the same merge rules to new text
out = tok.encode("tokenization")
print(out.tokens)   # maybe → ['token','ization'], maybe finer

Common pitfall + your scenario

"BPE splits words along linguistic roots"—wrong. BPE is purely frequency-based; "tokenization" → "token"+"ization" only because those segments are frequent, nothing to do with grammar/morphology. A different corpus may split it oddly. This is the root cause of why models sometimes "understand" rare words awkwardly.

📌 Parent scenario: when estimating prompt cost, count tokens with a tokenizer first—the same sentence phrased with more mainstream (frequent) wording often yields fewer tokens, both saving money and reducing the model's chance of "misreading."

Takeaway + question

💡 BPE is "frequency-driven compression," not "linguistic segmentation"—it knows nothing about meaning, only statistics.
🤔 If token boundaries are decided by frequency, not meaning, what systematic disadvantage does the model have with "low-frequency but semantically important" terms (your domain jargon)?

WordPieceWordPiece · BERT family

subword algorithmBERT

One-line analogy

WordPiece and BPE share "the same greedy-merge framework, different scoring function"—like two databases both using B-trees, but one caches by access frequency and the other by benefit/cost ratio. BPE looks at absolute frequency; WordPiece looks at a mutual-information-style gain. It originated in Schuster & Nakajima 2012 (Japanese/Korean voice search) and was popularized by BERT (Devlin et al. 2018).

What it solves + how it works

BPE's "merge the most frequent pair" has a flaw: two segments that are each frequent and merely happen to be adjacent often (like "the" followed by various words) get merged by mistake. WordPiece instead merges the pair that most increases the corpus likelihood, scored as:

score(a, b) = freq(ab) / ( freq(a) × freq(b) )

Intuition: the numerator is how often the pair appears together; the denominator is how often each appears on its own. If a and b are merely each frequent but randomly adjacent, the denominator is large, the score is low, no merge; only when a and b are tightly bound (the combo is far more common than the parts) does the score rise. This is exactly pointwise mutual information (PMI)—measuring whether two things co-occur more than by chance.

At inference it uses longest-match greedy splitting, and word-internal continuation subwords are marked with the ## prefix (playing → play + ##ing), enabling lossless reconstruction: ## attaches to the previous, no-## gets a leading space.

Code

from transformers import BertTokenizer
tok = BertTokenizer.from_pretrained("bert-base-uncased")

# ## = "attaches to previous subword, no space between", for lossless restore
print(tok.tokenize("playing tokenization"))
# illustrative → ['playing', 'token', '##ization']
# frequent words stay whole, rare words split into ## pieces

print(tok.tokenize("antidisestablishment"))
# the rarer the word, the more ## fragments (actual split depends on vocab)

Common pitfall + your scenario

"## is garbage in BERT's output"—wrong. ## marks "this is a word-internal continuation piece"; it is the key to lossless restoration, not noise. Strip it as garbage and your spaces will misalign on detokenize.

📌 Parent scenario: when using BERT / sentence-embedding models for semantic search, understanding ## splitting helps diagnose—why do certain proper nouns or code identifiers have poor embeddings? Because they get shredded into meaningless ## fragments, diluting the semantics.

Takeaway + question

💡 BPE compares "who appears most," WordPiece compares "who binds tightest"—the latter uses the PMI idea to suppress "frequent but unrelated" spurious merges.
🤔 The same corpus trains different vocabs under BPE vs WordPiece. How would that difference affect a downstream model's handling of "compound words / coined words"?

SentencePiece & UnigramSentencePiece · Unigram LM

frameworklanguage-agnostic

One-line analogy

BPE / WordPiece both assume you can pre-split on whitespace—a hidden assumption that breaks for Chinese / Japanese (no whitespace word boundaries). SentencePiece treats the whole sentence as a raw Unicode stream, with even the space as an ordinary character (shown as ▁). Analogy: BPE is a parser that "splits on a delimiter first, then parses"; SentencePiece is a schema-less parser running directly on the raw byte stream—it depends on no language's tokenization rules, and is therefore losslessly reversible (detokenize restores the original exactly).

What it solves + how it works

First, clear up a common confusion: SentencePiece is a framework / tool (open-sourced by Google) that internally runs either BPE or Unigram. Unigram is its default and its biggest contribution (Kudo 2018).

Unigram goes in exactly the opposite direction from BPE: BPE is bottom-up merging (building up from characters), Unigram is top-down pruning—

① start from a large candidate subword set;
② give each subword a probability, and score all possible segmentations of a sentence by probability;
③ use the EM algorithm to iterate, each round dropping the batch of subwords that contributes least to total likelihood;
④ until the vocab shrinks to the target size.

Bonus: each token carries a probability, and one word admits multiple valid segmentations—used for subword regularization (randomly sampling different splits at training time as data augmentation). T5, LLaMA, Gemma, etc. all use SentencePiece.

The pre-tokenization assumption

BPE / WordPiece: split on whitespace first → subword merge　(no spaces in Chinese → breaks)
SentencePiece:　raw stream + space as ▁ → subword　(unified across CJK/EN, reversible)

"机器学习 hello" → ['▁机','器','学','习','▁hello']　▁ records the space

Code

import sentencepiece as spm

# train unigram: eats raw text directly, no pre-tokenization
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m",
    vocab_size=8000, model_type="unigram")  # or "bpe"

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("机器学习 hello", out_type=str))
# space → ▁, CJK and English handled uniformly
print(sp.decode(sp.encode("机器学习 hello")))  # == original (lossless)

Common pitfall + your scenario

"SentencePiece is a tokenization algorithm"—inaccurate. It is a framework that hosts BPE or Unigram. Saying "this model uses SentencePiece" only tells half the story; you still need to ask "which algorithm."

📌 Parent scenario: interdisciplinary reading often involves multilingual terms (Latin, German philosophy words, Japanese Buddhist concepts). SentencePiece-based models (LLaMA / Gemma) usually have better token efficiency on non-English than English-trained BPE—worth weighing when picking a model for multilingual content.

Takeaway + question

💡 BPE "builds up" bottom-up, Unigram "trims" top-down; SentencePiece's real breakthrough is removing the "must pre-tokenize first" assumption, making the tokenizer language-agnostic and reversible.
🤔 Why does "lossless reversibility" matter? Think about code generation and JSON output—scenarios where not a single character can be wrong.

Byte-level BPE & UTF-8 BoundariesByte-level BPE · UTF-8 · Vocab trade-off

byte fallbackvocab trade-off

One-line analogy

Even the smartest subword algorithm will hit characters unseen in training (rare glyphs, emoji, new symbols). Byte-level BPE's solution is like programming's "always have a fallback to raw bytes": any character in UTF-8 ultimately becomes 1–4 bytes, and there are only 256 possible bytes—put those 256 in the base vocab and you never get OOV. GPT-2 has done this since.

What it solves + how it works

The cost is the UTF-8 boundary problem: one Chinese character = 3 UTF-8 bytes, an emoji can be 4, and a token boundary can land in the middle of a character's bytes. Two counterintuitive consequences:

UTF-8 boundary: a character gets shredded

"你好" → UTF-8 bytes: E4 BD A0｜E5 A5 BD (3 bytes each)
token split may land at: [E4 BD] [A0 E5 A5] [BD] ← char shredded

streaming output: � � → 你好 (half-bytes arrive first, shown once complete)

① Streaming garbage: when emitting token by token, half a character's bytes arrive first and render as �, only assembling into a full character once the rest arrives (those flickering boxes you sometimes see in streaming output);
② The model "can't see characters": the model operates on tokens / bytes, not human "letters." So "how many r's in strawberry" or "spell this word backwards" often trip it up—it has no "character" representation layer at all.

The vocab-size trade-off (a core trade-off, pure mechanism not tuning):

Large vocab: shorter sequences (fewer tokens) → faster inference, more fits in context, lower cost; but the embedding matrix = V × d_model grows linearly (4× the V means 4× these params), and rare tokens get few training samples and learn poorly;
Small vocab: fewer params, but the same text splits into more tokens, sequences lengthen, O(n²) cost rises.

The trend is ever-larger vocabs: Llama 2 = 32K, Llama 3 = 128K, GPT-4 (cl100k) ≈ 100K, GPT-4o (o200k) ≈ 200K. The main driver is multilingual efficiency—non-English often uses 2–4× more tokens (the "tokenization tax"), which a larger vocab mitigates.

Code

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's vocab

ids = enc.encode("你好🌍")
print(len(ids), ids)         # CJK/emoji often span several tokens

# a single Chinese char is 3 bytes in UTF-8
print("你".encode("utf-8"))  # b'\xe4\xbd\xa0' —— 3 bytes

# 256 byte tokens as fallback, so never OOV
for t in ids:
    print(t, enc.decode_single_token_bytes(t))

Common pitfall + your scenario

"1 token ≈ 1 word / 1 Chinese char"—wrong. English is roughly 1 token ≈ 0.75 words; Chinese is often 1 char ≈ 1–2 tokens (depending on vocab). Extrapolating Chinese cost from English token intuition badly underestimates it.

📌 Parent scenario: when estimating API cost and context budget, Chinese content usually has far more tokens than intuition suggests. Measure a representative sample with tiktoken / the SDK first, then extrapolate by ratio—don't eyeball it from English experience.

Takeaway + question

💡 Byte fallback guarantees "never OOV," at the cost of shredded characters—in the model's world there are no "letters," only tokens and bytes.
🤔 If the model can't see characters, how does it spell fluently and write correct code? (Hint: in massive data, "character-level patterns" get indirectly encoded into token co-occurrence statistics.)

Deep Questions

1. BPE, WordPiece, and Unigram all essentially "carve the corpus into the optimal set of units." What does this share with the database index design / cache-key design you know?

Core isomorphism: all choose a set of "reusable units" under a fixed budget to maximize overall efficiency. Index design picks "which columns to index"—too many indexes slow writes and waste space, too few slow queries; a tokenizer picks "which subwords go in the vocab"—too large blows up the embedding matrix and undertrains rare tokens, too small lengthens sequences. Cache-key design picks "cache granularity"—too fine means high hit-rate but heavy management, too coarse means low hit-rate; subword granularity is the same: char-level has high "hit-rate" (never OOV) but long sequences, word-level has short sequences but frequent OOV. The three algorithms are three selection strategies: BPE = greedy by frequency (like LFU caching), WordPiece = by mutual-information gain (like "benefit/cost" eviction), Unigram = build everything then trim by global probability (like building all indexes then pruning by query cost). Your distributed-systems intuitions about "hot/cold tiering" and "index selectivity" map almost one-to-one onto understanding tokenizers.

2. "The model can't see characters" makes it miscount the r's in strawberry. Is this a fundamental flaw of tokenization, or an engineering problem you can work around with data/training?

Both, but the root is tokenization. The model's input is token ids; the character "r" never enters the model as an independent unit—it's buried inside the holistic vectors of tokens like "straw" and "berry." So "counting letters" / "reversing a string"—character-level operations—are indirect inference for the model, not direct reads, and naturally error-prone. How far can you work around it? (a) Data: with lots of "spelling, letter games, character counting" examples in training, the model can indirectly encode character-level patterns into token co-occurrence statistics, so frequent words get spelled right; (b) but the long tail / rare / coined words still break, lacking enough co-occurrence signal; (c) the real architectural fix is character-level / byte-level models (e.g. ByT5, recent BLT) that eat bytes directly with no tokens—at the cost of long sequences and heavy compute. So the status quo is an engineering compromise: trade tokens for efficiency, patch character ability with massive data, and leave the long tail to "the model occasionally errs." A reminder, too: before outsourcing character-precise tasks (checksums, formatting) to a model, ask whether it can even "see" that granularity.

3. Vocabs keep growing (Llama 2's 32K → Llama 3's 128K). Will this trend continue forever? What forces push and what forces pull?

Pushes (toward larger): (a) multilingual—the "tokenization tax" on non-English is eased by a larger vocab, and covering more languages/scripts needs more subwords; (b) long-context efficiency—the same text becomes fewer tokens, effectively expanding context and lowering O(n²) cost; (c) code / structured text—folding common code snippets, indentation, and symbols into the vocab greatly shortens sequences. Pulls (toward smaller): (a) the embedding + output softmax matrix = V × d, so doubling V doubles both, a heavy memory burden especially for small models; (b) rare-token undertraining—the larger the vocab, the fewer training samples each long-tail token sees, learning poorly or becoming a "glitch token" (almost absent in training, weird at inference); (c) output-layer compute—each step computes a softmax over the whole vocab, slower as V grows. So it won't grow without bound; it converges to a sweet spot set by model scale, target language distribution, and deployment cost. Today's large models mostly land at 100K–250K. A more radical direction is to drop the fixed vocab altogether (byte-level / dynamic segmentation), turning this trade-off from "pick a vocab size" into "let the model learn to segment."

4. SentencePiece emphasizes "lossless reversibility" (detokenize restores the original perfectly). Why couldn't early tokenizers do this? For which downstream tasks is reversibility make-or-break?

Early tokenizers were irreversible because they did destructive preprocessing first: split on whitespace (losing "how many spaces" and "which whitespace char"), lowercasing, stripping punctuation, Unicode normalization—these steps permanently discard part of the original, so detokenize can only "guess" its way back (e.g. heuristics for "should there be a space before punctuation," which fail on edge cases). SentencePiece's key design: treat the space as an ordinary character (▁) and model it, doing no lossy preprocessing, making encode→decode a strict bijection. Tasks where reversibility is make-or-break: (a) code generation—one wrong space/indent/tab is a Python syntax error; (b) structured output (JSON / XML / YAML)—format characters can't be wrong; (c) diff / patch generation—must be character-exact; (d) rich text / markdown—whitespace and newlines carry meaning; (e) non-English + mixed scripts—"normalization" preprocessing often corrupts CJK / Arabic. In short: for any task where "the output gets re-parsed by a machine," a tokenizer's lossiness is a hidden bug source.

5. Tokenization is the translation layer between "human symbols" and "model numbers." Zoom out to a cognitive view—what does this layer's existence imply about whether the model "really understands" language?

A deep question that touches symbol grounding. The tokenizer compresses text into a string of integer ids; the model never touches "letters" or "sounds," only how those ids co-occur across massive text. Implications at several levels: (a) the model's "language" is statistical geometry, not symbolic semantics—it "knows" strawberry not by recognizing s-t-r-a-w..., but by where that token's vector sits relative to "fruit"/"red"/"sweet" in high-dimensional space; this is utterly unlike the human path of "sound first, then meaning"; (b) tokenization is a cognitive filter—it predetermines the smallest unit the model can perceive; whatever gets shredded (rare words, character-level structure, non-mainstream languages) is innately "myopic" to the model, not from lack of effort but from lost resolution at the input; (c) versus humans—humans hold characters, syllables, words, concepts at multiple granularities and switch freely (so you can play word games, decompose characters, hear-and-disambiguate), while the model is locked to the single granularity the tokenizer hands it. For anyone interested in consciousness and cognition, there's a striking parallel: when we say a model "understands" language, it largely does geometric operations in a discrete symbol space forcibly defined by the tokenizer. The "boundary" of understanding is, in a sense, drawn at the tokenization step. This is why some argue: to get closer to human language ability, the next architecture may first have to break the "fixed tokenization" ceiling.

Byte Pair EncodingBPE

WordPieceWordPiece · BERT family

SentencePiece & UnigramSentencePiece · Unigram LM

Byte-level BPE & UTF-8 BoundariesByte-level BPE · UTF-8 · Vocab trade-off

Further Reading

Deep Questions