BPE is essentially a greedy dictionary-compression algorithm—same ancestor as gzip / LZ, which compress files by "replacing frequent byte sequences with short codes" (Gage's 1994 data-compression algorithm). The only difference is the objective: compression software makes files smaller, BPE makes the vocabulary big enough yet not explosive. Sennrich et al. (2016) carried it from compression into NLP.
Both extremes fail. Word-level vocab (one token per word): English has hundreds of thousands of words plus spelling variants and neologisms—the vocab is unbounded, and any word unseen in training (OOV, out-of-vocabulary) just breaks. Character-level: vocab shrinks to a few dozen chars, but a sentence explodes into hundreds of tokens, and since attention is O(n²), long sequences become unaffordable. BPE takes the middle ground: common words stay one token, rare words split into subword pieces.
The mechanism is greedy merging: ① init vocab = all single characters; ② count the frequency of adjacent token pairs in the corpus; ③ merge the most frequent pair into a new token and add it to the vocab; ④ repeat ②③ until the vocab hits a target size (say 50K). The training output is an ordered list of merge rules, applied in the same order at inference:
from tokenizers import Tokenizer from tokenizers.models import BPE from tokenizers.trainers import BpeTrainer from tokenizers.pre_tokenizers import Whitespace # 1) empty BPE, pre-split on whitespace (BPE assumes pre-tokenized) tok = Tokenizer(BPE(unk_token="[UNK]")) tok.pre_tokenizer = Whitespace() # 2) train: target vocab 5000, repeatedly merge frequent pairs trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"]) tok.train(["corpus.txt"], trainer) # output = ordered merge-rule table # 3) encode: apply the same merge rules to new text out = tok.encode("tokenization") print(out.tokens) # maybe → ['token','ization'], maybe finer
WordPiece and BPE share "the same greedy-merge framework, different scoring function"—like two databases both using B-trees, but one caches by access frequency and the other by benefit/cost ratio. BPE looks at absolute frequency; WordPiece looks at a mutual-information-style gain. It originated in Schuster & Nakajima 2012 (Japanese/Korean voice search) and was popularized by BERT (Devlin et al. 2018).
BPE's "merge the most frequent pair" has a flaw: two segments that are each frequent and merely happen to be adjacent often (like "the" followed by various words) get merged by mistake. WordPiece instead merges the pair that most increases the corpus likelihood, scored as:
score(a, b) = freq(ab) / ( freq(a) × freq(b) )
Intuition: the numerator is how often the pair appears together; the denominator is how often each appears on its own. If a and b are merely each frequent but randomly adjacent, the denominator is large, the score is low, no merge; only when a and b are tightly bound (the combo is far more common than the parts) does the score rise. This is exactly pointwise mutual information (PMI)—measuring whether two things co-occur more than by chance.
At inference it uses longest-match greedy splitting, and word-internal continuation subwords are marked with the ## prefix (playing → play + ##ing), enabling lossless reconstruction: ## attaches to the previous, no-## gets a leading space.
from transformers import BertTokenizer tok = BertTokenizer.from_pretrained("bert-base-uncased") # ## = "attaches to previous subword, no space between", for lossless restore print(tok.tokenize("playing tokenization")) # illustrative → ['playing', 'token', '##ization'] # frequent words stay whole, rare words split into ## pieces print(tok.tokenize("antidisestablishment")) # the rarer the word, the more ## fragments (actual split depends on vocab)
BPE / WordPiece both assume you can pre-split on whitespace—a hidden assumption that breaks for Chinese / Japanese (no whitespace word boundaries). SentencePiece treats the whole sentence as a raw Unicode stream, with even the space as an ordinary character (shown as ▁). Analogy: BPE is a parser that "splits on a delimiter first, then parses"; SentencePiece is a schema-less parser running directly on the raw byte stream—it depends on no language's tokenization rules, and is therefore losslessly reversible (detokenize restores the original exactly).
First, clear up a common confusion: SentencePiece is a framework / tool (open-sourced by Google) that internally runs either BPE or Unigram. Unigram is its default and its biggest contribution (Kudo 2018).
Unigram goes in exactly the opposite direction from BPE: BPE is bottom-up merging (building up from characters), Unigram is top-down pruning—
Bonus: each token carries a probability, and one word admits multiple valid segmentations—used for subword regularization (randomly sampling different splits at training time as data augmentation). T5, LLaMA, Gemma, etc. all use SentencePiece.
import sentencepiece as spm # train unigram: eats raw text directly, no pre-tokenization spm.SentencePieceTrainer.train( input="corpus.txt", model_prefix="m", vocab_size=8000, model_type="unigram") # or "bpe" sp = spm.SentencePieceProcessor(model_file="m.model") print(sp.encode("机器学习 hello", out_type=str)) # space → ▁, CJK and English handled uniformly print(sp.decode(sp.encode("机器学习 hello"))) # == original (lossless)
Even the smartest subword algorithm will hit characters unseen in training (rare glyphs, emoji, new symbols). Byte-level BPE's solution is like programming's "always have a fallback to raw bytes": any character in UTF-8 ultimately becomes 1–4 bytes, and there are only 256 possible bytes—put those 256 in the base vocab and you never get OOV. GPT-2 has done this since.
The cost is the UTF-8 boundary problem: one Chinese character = 3 UTF-8 bytes, an emoji can be 4, and a token boundary can land in the middle of a character's bytes. Two counterintuitive consequences:
The vocab-size trade-off (a core trade-off, pure mechanism not tuning):
The trend is ever-larger vocabs: Llama 2 = 32K, Llama 3 = 128K, GPT-4 (cl100k) ≈ 100K, GPT-4o (o200k) ≈ 200K. The main driver is multilingual efficiency—non-English often uses 2–4× more tokens (the "tokenization tax"), which a larger vocab mitigates.
import tiktoken enc = tiktoken.get_encoding("cl100k_base") # GPT-4's vocab ids = enc.encode("你好🌍") print(len(ids), ids) # CJK/emoji often span several tokens # a single Chinese char is 3 bytes in UTF-8 print("你".encode("utf-8")) # b'\xe4\xbd\xa0' —— 3 bytes # 256 byte tokens as fallback, so never OOV for t in ids: print(t, enc.decode_single_token_bytes(t))