Why Tokens Are Enough
tokenization information-theory

Modern language models don’t train on text — a tokenizer chops raw text into chunks, and the model only ever sees those chunks. This indirection raises two natural questions. First: what does tokenization lose? A language model is a distribution over strings, but we’re learning a distribution over token sequences — does this restrict what we can express? Second: what does tokenization add? Even if nothing is lost, the token representation might introduce redundancy that the model must waste capacity on. We’ll show that with a lossless tokenizer, the answer to both questions is: nothing.

What is a lossless tokenizer?

A tokenizer chops strings into chunks called tokens. A lossless tokenizer is one where you can perfectly reconstruct the original string from those chunks — nothing is lost in translation.

To make this precise, we need a few objects. An alphabet $\Sigma$ is a finite set of characters. The set $\Sigma^*$ contains all finite strings over $\Sigma$, including the empty string. A vocabulary $V$ is a finite set of tokens, where each token is typically a short string from $\Sigma^*$. The set $V^*$ contains all finite token sequences over $V$.

A tokenizer $\tau : \Sigma^* \to V^*$ maps a string to a token sequence. A detokenizer $\delta : V^* \to \Sigma^*$ maps a token sequence back to a string, typically by concatenating the tokens together.

We say $\tau$ is lossless if there exists a detokenizer $\delta$ such that every string $s \in \Sigma^*$ satisfies the round-trip property $\delta(\tau(s)) = s$.

Example. Let $\Sigma = \{\text{h}, \text{e}, \text{l}, \text{o}\}$ and $V = \{\text{he}, \text{llo}, \text{hello}\}$. The tokenizer maps $\text{hello} \mapsto (\text{hello})$, and the detokenizer concatenates: $\delta(t_1, \ldots, t_n) = t_1 \cdots t_n$. The round-trip gives $\delta(\tau(\text{hello})) = \delta((\text{hello})) = \text{hello}$.

Losslessness forces $\tau$ to be injective — distinct strings must map to distinct token sequences. If two strings $s_1 \neq s_2$ mapped to the same token sequence $t = \tau(s_1) = \tau(s_2)$, then the round-trip property would require $\delta(t) = s_1$ and $\delta(t) = s_2$, contradicting $s_1 \neq s_2$.

Note. We do not require the reverse direction: multiple token sequences can detokenize to the same string. For instance, both $(\text{he}, \text{llo})$ and $(\text{hello})$ concatenate to “hello”. The tokenizer picks one; the others are unused sequences in $V^*$. Losslessness only requires the forward-then-back direction to work.
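These definitions are easy to make concrete. Below is a minimal Python sketch, assuming a greedy longest-match tokenizer over the toy vocabulary from the example; the matching strategy is an illustrative choice, not something the definitions require:

```python
# Toy lossless tokenizer: greedy longest-match over a fixed vocabulary.
# The vocabulary and matching strategy are illustrative assumptions.
VOCAB = {"he", "llo", "hello"}
MAX_LEN = max(len(v) for v in VOCAB)

def tokenize(s: str) -> tuple[str, ...]:
    """τ: match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(s):
        for n in range(min(MAX_LEN, len(s) - i), 0, -1):
            if s[i:i + n] in VOCAB:
                tokens.append(s[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no token matches at position {i} of {s!r}")
    return tuple(tokens)

def detokenize(tokens: tuple[str, ...]) -> str:
    """δ: plain concatenation."""
    return "".join(tokens)

# Round-trip property: δ(τ(s)) = s.
assert detokenize(tokenize("hello")) == "hello"
# Many-to-one in the other direction: an unused sequence also
# concatenates to "hello", which losslessness does not forbid.
assert detokenize(("he", "llo")) == "hello"
```

Greedy matching here returns $(\text{hello})$; a tokenizer that returned $(\text{he}, \text{llo})$ would be equally lossless, since only the round-trip direction matters.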

Modeling strings via tokens

A language model is a probability distribution over $\Sigma^*$:

$$p : \Sigma^* \to [0, 1], \qquad \sum_{s \in \Sigma^*} p(s) = 1.$$

When we train on token sequences instead, we’re learning a different object — a distribution over $V^*$:

$$q : V^* \to [0, 1], \qquad \sum_{t \in V^*} q(t) = 1.$$

How does a distribution over token sequences give us a distribution over strings? The natural answer: the probability of a string is the total probability of all the ways to produce it — sum over every token sequence that detokenizes to that string. This defines the induced language model $\tilde{p}$ as the pushforward of $q$ through $\delta$:

$$\tilde{p}(s) = \sum_{t \in \delta^{-1}(s)} q(t),$$

where $\delta^{-1}(s) = \{t \in V^* : \delta(t) = s\}$ is the preimage of $s$ — the set of all token sequences that detokenize to $s$.

Example. Two token sequences detokenize to $\text{hello}$: $(\text{he}, \text{llo})$ and $(\text{hello})$. The induced probability is $\tilde{p}(\text{hello}) = q(\text{he}, \text{llo}) + q(\text{hello})$.

To verify that $\tilde{p}$ sums to one, note that the preimage sets $\{\delta^{-1}(s)\}_{s \in \Sigma^*}$ partition $V^*$ — since $\delta$ is a function, every token sequence detokenizes to exactly one string, so each $t \in V^*$ belongs to exactly one $\delta^{-1}(s)$ — giving us:

$$\sum_{s \in \Sigma^*} \tilde{p}(s) = \sum_{s \in \Sigma^*} \sum_{t \in \delta^{-1}(s)} q(t) = \sum_{t \in V^*} q(t) = 1.$$

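In code, the pushforward is a grouping operation: sum $q$ over all token sequences that detokenize to the same string. A small sketch, with illustrative probabilities:

```python
from collections import defaultdict

def detokenize(tokens: tuple[str, ...]) -> str:
    return "".join(tokens)

# A hypothetical distribution q over token sequences (numbers illustrative).
q = {
    ("he", "llo"): 0.2,
    ("hello",): 0.5,
    ("hi",): 0.3,
}

def pushforward(q: dict) -> dict:
    """Induced string distribution: sum q over each preimage δ⁻¹(s)."""
    p = defaultdict(float)
    for tokens, prob in q.items():
        # Each sequence detokenizes to exactly one string, so the
        # preimages partition V* and no mass is counted twice.
        p[detokenize(tokens)] += prob
    return dict(p)

p_induced = pushforward(q)
assert abs(p_induced["hello"] - 0.7) < 1e-12   # q(he,llo) + q(hello)
assert abs(sum(p_induced.values()) - 1.0) < 1e-12
```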
What does tokenization lose?

The question is whether this inducing relationship is surjective: can every language model $p$ be induced by some $q$? If not, then modeling token sequences is strictly less expressive than modeling strings, and tokenization loses something.

For lossless tokenizers, the answer is yes. A lossless $\tau$ is injective, so every string maps to exactly one token sequence. This means we can transfer probability directly by setting $q(\tau(s)) = p(s)$ and zero elsewhere. We call this the canonical inducing distribution.

Claim. When $\tau$ is lossless, any desired $p$ can be exactly induced by some $q$.

Proof. Define $q(t) = p(s)$ for $t = \tau(s)$ and $q(t) = 0$ otherwise. This sums to one because $\tau$ is a bijection onto its image $\tau(\Sigma^*)$:

$$\sum_{t \in V^*} q(t) = \sum_{t \in \tau(\Sigma^*)} q(t) = \sum_{s \in \Sigma^*} p(s) = 1.$$

To check it induces $p$: since $q$ is zero outside $\tau(\Sigma^*)$, the only token sequence in $\delta^{-1}(s)$ that can carry positive mass is $\tau(s)$. So:

$$\tilde{p}(s) = \sum_{t \in \delta^{-1}(s)} q(t) = q(\tau(s)) = p(s),$$

where the last step uses $\delta(\tau(s)) = s$ — that is, losslessness — to place $\tau(s)$ inside $\delta^{-1}(s)$.
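The construction in the proof is one line of code once $\tau$ is fixed. The sketch below uses a character-level tokenizer as a stand-in lossless $\tau$, and an illustrative two-string distribution $p$:

```python
# Canonical inducing distribution: q(τ(s)) = p(s), zero elsewhere.
# The character-level τ and the distribution p are illustrative assumptions.
def tokenize(s: str) -> tuple[str, ...]:
    return tuple(s)  # one token per character: trivially lossless

def detokenize(tokens: tuple[str, ...]) -> str:
    return "".join(tokens)

p = {"hello": 0.7, "hi": 0.3}

# Transfer each string's mass to its unique token sequence.
q = {tokenize(s): p_s for s, p_s in p.items()}

# q is a valid distribution: τ is injective, so no mass collides.
assert abs(sum(q.values()) - 1.0) < 1e-12
# q induces p exactly: τ(s) carries all the positive mass in δ⁻¹(s).
for s, p_s in p.items():
    assert detokenize(tokenize(s)) == s   # losslessness
    assert q[tokenize(s)] == p_s
```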

Note. The canonical $q$ is not the only distribution that induces $p$. For instance, if $\delta^{-1}(\text{hello}) = \{(\text{he}, \text{llo}), (\text{hello})\}$, both $(\text{he}, \text{llo})$ and $(\text{hello})$ detokenize to $\text{hello}$, so any $q$ satisfying $q(\text{he}, \text{llo}) + q(\text{hello}) = p(\text{hello})$ induces the same probability. The canonical construction puts all mass on whichever sequence is $\tau(\text{hello})$.

This construction fails when $\tau$ is lossy. If $\tau(s_1) = \tau(s_2)$ for distinct strings $s_1 \neq s_2$, the construction assigns $q(\tau(s_1)) = p(s_1)$, which recovers the probability of only one of the merged strings. The probability mass of the other is lost entirely, so $\sum_{t} q(t) < 1$ — the constructed $q$ isn’t even a valid distribution.

Example. Suppose both “l” and “ł” map to the same token, so $\tau(\text{ll}) = \tau(\text{łł})$. The construction sets $q(\tau(\text{ll})) = p(\text{ll})$ and zero everywhere else. The total mass is $1 - p(\text{łł})$, not $1$ — the probability of “łł” has nowhere to go.
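The leak is easy to reproduce. The sketch below folds “ł” into “l” as a deliberately lossy tokenizer, with illustrative probabilities:

```python
# A deliberately lossy tokenizer: "ł" and "l" collapse to the same token.
def lossy_tokenize(s: str) -> tuple[str, ...]:
    return tuple("l" if ch == "ł" else ch for ch in s)

# Illustrative distribution over two strings that collide under τ.
p = {"łł": 0.4, "ll": 0.6}

# The canonical construction q(τ(s)) = p(s): since τ("ll") == τ("łł"),
# the second dict write overwrites the first. Here "ll" wins and the
# mass of "łł" is lost.
q = {}
for s, p_s in p.items():
    q[lossy_tokenize(s)] = p_s

assert lossy_tokenize("ll") == lossy_tokenize("łł") == ("l", "l")
assert len(q) == 1                         # two strings, one sequence
assert abs(sum(q.values()) - 0.6) < 1e-12  # only 1 - p("łł") survives
```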

What does tokenization add?

We showed that a lossless tokenizer doesn’t restrict what language models can express. But expressiveness isn’t the only concern — tokenization could introduce redundancy. Multiple token sequences can detokenize to the same string, so a model over must somehow distribute probability across these equivalent sequences. Does this force the model to waste capacity on a spurious choice?

We can make this precise using entropy. Recall that the entropy of a discrete random variable $X$ with distribution $p$ over finite sample space $\mathcal{X}$ is:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x).$$

Entropy measures average uncertainty — it is zero when all mass is on a single outcome and maximized under a uniform distribution. If tokenization adds redundancy, a model over token sequences should require strictly more entropy than the underlying string distribution.
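As a quick sanity check on these two properties, a minimal helper in Python (bits, i.e. base-2 logarithms):

```python
import math

def entropy(dist: dict) -> float:
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

assert entropy({"a": 1.0}) == 0.0            # all mass on one outcome
assert entropy({"a": 0.5, "b": 0.5}) == 1.0  # maximal for two outcomes
```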

For lossless tokenizers, it needn’t. The canonical $q$ from the previous section places all mass on $\tau(s)$ for each string $s$, so exactly one token sequence per string has positive probability — achieving $H(q) = H(p)$. Tokenization adds no redundancy. We can make this precise:

Claim. Any distribution $q$ on $V^*$ that induces $p$ satisfies $H(q) \geq H(p)$, with equality if and only if for every string $s$, at most one token sequence in $\delta^{-1}(s)$ has positive probability under $q$.

Proof. Let $T \sim q$ and $S = \delta(T)$. By the inducing property, $S \sim p$. Since $S$ is a deterministic function of $T$, the chain rule of entropy gives:

$$H(T) = H(S, T) = H(S) + H(T \mid S) \geq H(S).$$

The gap $H(T) - H(S) = H(T \mid S)$ is the residual uncertainty about which token sequence was used, given the string it represents. This is zero if and only if, conditioned on each string $s$, the distribution concentrates on a single token sequence in $\delta^{-1}(s)$ — i.e., for every $s$, at most one $t \in \delta^{-1}(s)$ has $q(t) > 0$.

Example. Consider $p(\text{hello}) = 1$. The canonical distribution puts all mass on $\tau(\text{hello})$, giving $H(q) = H(p) = 0$. A distribution that splits mass evenly between $(\text{he}, \text{llo})$ and $(\text{hello})$ would give $H(q) = 1$ bit — the extra bit encodes a meaningless choice of tokenization.
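The example can be checked numerically; the even split across two tokenizations of the same string is the illustrative case from the text:

```python
import math

def entropy(dist: dict) -> float:
    """Shannon entropy in bits of {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# String distribution: all mass on one string, so H(p) = 0.
p = {"hello": 1.0}

# Canonical q: one token sequence per string, entropy unchanged.
q_canonical = {("hello",): 1.0}

# Non-canonical q: mass split across two tokenizations of "hello".
q_split = {("hello",): 0.5, ("he", "llo"): 0.5}

assert entropy(q_canonical) == entropy(p) == 0.0
assert entropy(q_split) == 1.0  # one extra bit for the tokenization choice
```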

Does it matter?

Theory tells us that the model should assign all the weight to the canonical tokenization. Chatzi et al. (2025) make the case for why: they prove that canonical sampling — restricting generation to token sequences that the tokenizer would actually produce — yields a token-level distribution provably closer to the training distribution in KL-divergence than standard sampling does. The intuition is clean: since the model only ever saw canonical sequences during training, non-canonical sequences are out-of-distribution, and probability placed on them is probability placed where the model has no training signal.

How much probability leaks in practice? Chirkova et al. (2023) estimated the gap between the probability of the default tokenization alone, $q(\tau(s))$, and the true pushforward $\tilde{p}(s)$ using importance sampling over tokenizations of GPT-2 and BLOOM. For well-represented text (Wikipedia, news), the relative gap in bits-per-character was under 0.5%. It grew for out-of-distribution text — ~1.6% on Twitter, ~2% on transcribed speech — driven by rare words that split into long token sequences and leak probability onto non-default segmentations.

So is this redundancy of assigning mass to non-canonical tokenizations necessarily bad? BPE-Dropout (Provilkov et al., 2020) randomly drops merge operations during training, exposing the model to varied tokenizations of the same string. This acts as a regularizer: the model learns more robust subword representations, improving translation quality by up to 2.3 BLEU over standard BPE. So while the canonical $q$ is information-theoretically optimal, deliberately introducing some tokenization noise can help generalization — a case where a bit of redundancy pays for itself.

References

  • Chatzi, I., Corvelo Benz, N., Tsirtsis, S., & Gomez-Rodriguez, M. (2025). Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service. arXiv:2506.06446
  • Chirkova, N., Kruszewski, G., Rozen, J., & Dymetman, M. (2023). Should you marginalize over possible tokenizations? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). arXiv:2306.17757
  • Provilkov, I., Emelianenko, D., & Voita, E. (2020). BPE-Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. arXiv:1910.13267