Modern language models don’t train on text — a tokenizer chops raw text into chunks, and the model only ever sees those chunks. This indirection raises two natural questions. First: what does tokenization lose? A language model is a distribution over strings, but we’re learning a distribution over token sequences — does this restrict what we can express? Second: what does tokenization add? Even if nothing is lost, the token representation might introduce redundancy that the model must waste capacity on. We’ll show that with a lossless tokenizer, the answer to both questions is: nothing.
What is a lossless tokenizer?
A tokenizer chops strings into chunks called tokens. A lossless tokenizer is one where you can perfectly reconstruct the original string from those chunks — nothing is lost in translation.
To make this precise, we need a few objects. An alphabet $\Sigma$ is a finite set of characters, and $\Sigma^*$ is the set of all finite strings over it. Likewise, a vocabulary $V$ is a finite set of tokens, and $V^*$ is the set of all finite token sequences.
A tokenizer is a function $\tau : \Sigma^* \to V^*$ mapping strings to token sequences, paired with a detokenizer $\delta : V^* \to \Sigma^*$ mapping token sequences back to strings.
We say the tokenizer is lossless if $\delta(\tau(s)) = s$ for every $s \in \Sigma^*$: tokenize, detokenize, and you get back exactly the string you started with.
Example. Let $\Sigma = \{\text{h}, \text{e}, \text{l}, \text{o}\}$ and $V = \{\text{h}, \text{e}, \text{l}, \text{o}, \text{he}, \text{ll}\}$. The tokenizer maps $\text{hello} \mapsto (\text{he}, \text{ll}, \text{o})$, and the detokenizer concatenates: $\delta(\text{he}, \text{ll}, \text{o}) = \text{hello}$. The round-trip gives $\delta(\tau(\text{hello})) = \text{hello}$.
Losslessness forces $\tau$ to be injective: if $\tau(s_1) = \tau(s_2)$, then $s_1 = \delta(\tau(s_1)) = \delta(\tau(s_2)) = s_2$, so distinct strings must receive distinct token sequences.
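To make the round-trip concrete, here is a minimal Python sketch. The greedy longest-match rule and the toy vocabulary {h, e, l, o, he, ll} are illustrative assumptions, not a description of any production tokenizer:

```python
# Toy lossless tokenizer: greedy longest-match over a tiny vocabulary.
# The vocabulary and matching rule are illustrative assumptions.
V = {"h", "e", "l", "o", "he", "ll"}

def tokenize(s: str) -> tuple[str, ...]:
    """Greedily take the longest vocabulary token at each position."""
    tokens, i = [], 0
    while i < len(s):
        match = max((t for t in V if s.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tuple(tokens)

def detokenize(tokens: tuple[str, ...]) -> str:
    """Detokenization is just concatenation."""
    return "".join(tokens)

assert tokenize("hello") == ("he", "ll", "o")
assert detokenize(tokenize("hello")) == "hello"  # lossless round-trip
```

Because every single character of the alphabet is itself a token, greedy matching never gets stuck, and since detokenization concatenates, the round-trip restores every string exactly.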
Note. We do not require the reverse direction: multiple token sequences can detokenize to the same string. For instance, both $(\text{he}, \text{ll}, \text{o})$ and $(\text{h}, \text{e}, \text{l}, \text{l}, \text{o})$ concatenate to “hello”. The tokenizer picks one; the others are unused sequences in $V^*$. Losslessness only requires the forward-then-back direction to work.
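The unused sequences are easy to see by brute force. A small sketch, again assuming the toy vocabulary {h, e, l, o, he, ll}, enumerates every member of $V^*$ that concatenates to “hello”:

```python
# Enumerate every token sequence over a toy vocabulary (an illustrative
# assumption) that detokenizes, i.e. concatenates, to a given string.
V = {"h", "e", "l", "o", "he", "ll"}

def all_tokenizations(s: str) -> list[tuple[str, ...]]:
    """Recursively split s on every vocabulary token matching a prefix."""
    if s == "":
        return [()]
    return [
        (t, *rest)
        for t in V
        if s.startswith(t)
        for rest in all_tokenizations(s[len(t):])
    ]

print(sorted(all_tokenizations("hello")))
# four sequences reach "hello"; the tokenizer emits exactly one of them
```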
Modeling strings via tokens
A language model is a probability distribution over $\Sigma^*$: a function $p$ that assigns each string a probability, with $\sum_{s \in \Sigma^*} p(s) = 1$.
When we train on token sequences instead, we’re learning a different object — a distribution $q$ over $V^*$.
How does a distribution over token sequences give us a distribution over strings? The natural answer: the probability of a string is the total probability of all the ways to produce it — sum over every token sequence that detokenizes to that string. This defines the induced language model $q_\delta$:

$$q_\delta(s) = \sum_{t \in \delta^{-1}(s)} q(t),$$

where $\delta^{-1}(s) = \{\, t \in V^* : \delta(t) = s \,\}$ is the set of token sequences that detokenize to $s$.
Example. With the toy vocabulary $V = \{\text{h}, \text{e}, \text{l}, \text{o}, \text{he}, \text{ll}\}$, four token sequences detokenize to $\text{hello}$: $(\text{he}, \text{ll}, \text{o})$, $(\text{he}, \text{l}, \text{l}, \text{o})$, $(\text{h}, \text{e}, \text{ll}, \text{o})$, and $(\text{h}, \text{e}, \text{l}, \text{l}, \text{o})$. The induced probability is the sum

$$q_\delta(\text{hello}) = q(\text{he}, \text{ll}, \text{o}) + q(\text{he}, \text{l}, \text{l}, \text{o}) + q(\text{h}, \text{e}, \text{ll}, \text{o}) + q(\text{h}, \text{e}, \text{l}, \text{l}, \text{o}).$$
To verify that $q_\delta$ is a valid distribution, note that every token sequence detokenizes to exactly one string, so the sets $\delta^{-1}(s)$ partition $V^*$:

$$\sum_{s \in \Sigma^*} q_\delta(s) = \sum_{s \in \Sigma^*} \sum_{t \in \delta^{-1}(s)} q(t) = \sum_{t \in V^*} q(t) = 1.$$
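As a sanity check, here is the computation in miniature. The distribution q over token sequences is invented for illustration; grouping by concatenated string implements the sum over $\delta^{-1}(s)$:

```python
from collections import defaultdict

# Compute the induced language model q_delta from a toy distribution q
# over token sequences (the probabilities are made up for illustration).
# Detokenization is concatenation, so we group sequences by the string
# they produce — this is the sum over delta^{-1}(s).
q = {
    ("he", "ll", "o"): 0.5,
    ("h", "e", "l", "l", "o"): 0.25,
    ("he",): 0.25,
}

q_delta = defaultdict(float)
for tokens, prob in q.items():
    q_delta["".join(tokens)] += prob

print(dict(q_delta))  # {'hello': 0.75, 'he': 0.25}
assert sum(q_delta.values()) == 1.0  # still a valid distribution
```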
What does tokenization lose?
The question is whether this inducing relationship is surjective: can every language model $p$ over $\Sigma^*$ be induced by some distribution $q$ over $V^*$? If some $p$ cannot, then modeling tokens genuinely restricts which language models we can express.
For lossless tokenizers, the answer is yes. A lossless $\tau$ suggests a canonical way to build $q$ from $p$: put each string’s probability on its canonical tokenization $\tau(s)$.
Claim. When $\tau$ is lossless, every language model $p$ over $\Sigma^*$ is induced by some distribution $q$ over $V^*$.
Proof. Define

$$q(t) = \begin{cases} p(s) & \text{if } t = \tau(s) \text{ for some } s \in \Sigma^*, \\ 0 & \text{otherwise.} \end{cases}$$

This is well defined because $\tau$ is injective, and it sums to $1$ because each string contributes its probability exactly once.
To check it induces $p$, fix a string $s$. The only sequence in $\delta^{-1}(s)$ carrying mass is $\tau(s)$: any $t$ with $q(t) > 0$ equals $\tau(s')$ for some $s'$, and if such a $t$ detokenizes to $s$, then $s' = \delta(\tau(s')) = s$. Hence

$$q_\delta(s) = \sum_{t \in \delta^{-1}(s)} q(t) = q(\tau(s)) = p(s),$$

where the middle step is losslessness: $\tau(s)$ belongs to $\delta^{-1}(s)$ precisely because $\delta(\tau(s)) = s$. $\square$
Note. The canonical $q$ is not the only distribution that induces $p$. For instance, both $(\text{he}, \text{ll}, \text{o})$ and $(\text{h}, \text{e}, \text{l}, \text{l}, \text{o})$ detokenize to $\text{hello}$, so any $q$ that splits $p(\text{hello})$ across $\delta^{-1}(\text{hello})$ — say, $q(\text{he}, \text{ll}, \text{o}) + q(\text{h}, \text{e}, \text{l}, \text{l}, \text{o}) = p(\text{hello})$, with zero mass on the rest of $\delta^{-1}(\text{hello})$ — induces $p$. The canonical construction puts all mass on whichever sequence is $\tau(\text{hello})$.
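The canonical construction is one line of code once a tokenizer is fixed. The greedy toy tokenizer below is our own stand-in for $\tau$, and the distribution p is invented for illustration:

```python
# Canonical construction q(tau(s)) = p(s): each string pushes its
# probability onto its canonical tokenization. The toy vocabulary,
# greedy tokenizer, and distribution p are illustrative assumptions.
V = {"h", "e", "l", "o", "he", "ll"}

def tokenize(s: str) -> tuple[str, ...]:
    tokens, i = [], 0
    while i < len(s):
        match = max((t for t in V if s.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tuple(tokens)

p = {"hello": 0.75, "he": 0.25}                    # language model over strings
q = {tokenize(s): prob for s, prob in p.items()}   # canonical q over V*

# Inducing back: each canonical sequence detokenizes to its own string,
# so no probability collides and none is lost.
induced = {"".join(t): prob for t, prob in q.items()}
assert induced == p
```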
This construction fails when the tokenizer is lossy, because $\tau$ is no longer injective: two strings can collide on the same token sequence, and their probabilities cannot both fit there.
Example. Suppose both “l” and “ł” map to the same single-token sequence $(t)$, so $\tau(\text{l}) = \tau(\text{ł}) = (t)$, and let $p(\text{l}) = p(\text{ł}) = \tfrac{1}{2}$. The construction sets $q(t) = \tfrac{1}{2}$ and zero everywhere else. The total mass is $\tfrac{1}{2}$, not $1$ — the probability of “ł” has nowhere to go.
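The leak is easy to reproduce. Here "t_l" is a made-up token name standing in for the shared image of "l" and "ł":

```python
# Lossy failure mode: two strings tokenize to the same sequence, so the
# canonical construction collides. The token name "t_l" is a placeholder.
tau = {"l": ("t_l",), "ł": ("t_l",)}   # not injective => lossy
p = {"l": 0.5, "ł": 0.5}

q = {}
for s, prob in p.items():
    q[tau[s]] = prob                   # second assignment overwrites the first

print(sum(q.values()))  # 0.5 — half of p's mass has nowhere to go
```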
What does tokenization add?
We showed that a lossless tokenizer doesn’t restrict what language models can express. But expressiveness isn’t the only concern — tokenization could introduce redundancy. Multiple token sequences can detokenize to the same string, so a model over $V^*$ might waste capacity spreading probability across redundant tokenizations of the same string.
We can make this precise using entropy. Recall that the entropy of a discrete random variable $X$ with distribution $p$ is

$$H(X) = -\sum_{x} p(x) \log_2 p(x).$$
Entropy measures average uncertainty — it is zero when all mass is on a single outcome and maximized under a uniform distribution. If tokenization adds redundancy, a model over token sequences should require strictly more entropy than the underlying string distribution.
For lossless tokenizers, it needn’t. The canonical $q$ from the previous section adds no entropy at all, and no inducing distribution can use less.
Claim. Any distribution $q$ over $V^*$ that induces $p$ satisfies $H(q) \geq H(p)$, and the canonical $q$ achieves equality.
Proof. Let $T \sim q$ be a random token sequence and let $S = \delta(T)$ be the string it detokenizes to; since $q$ induces $p$, we have $S \sim p$. Because $S$ is a deterministic function of $T$, $H(S) \leq H(T)$, i.e. $H(p) \leq H(q)$. For the canonical $q$, every sequence with mass is of the form $\tau(s)$, so $T = \tau(S)$ is also a deterministic function of $S$, giving $H(T) \leq H(S)$ and hence $H(q) = H(p)$. $\square$
The gap is exactly the conditional entropy of the tokenization given the string:

$$H(q) - H(p) = H(T) - H(S) = H(T \mid S),$$

the average number of extra bits spent recording which tokenization was used.
Example. Consider $p(\text{hello}) = 1$. The canonical distribution puts all mass on $\tau(\text{hello}) = (\text{he}, \text{ll}, \text{o})$, giving $H(q) = 0 = H(p)$. A distribution that splits mass evenly between $(\text{he}, \text{ll}, \text{o})$ and $(\text{h}, \text{e}, \text{l}, \text{l}, \text{o})$ would give $H(q) = 1$ bit — the extra bit encodes a meaningless choice of tokenization.
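The arithmetic in this example can be checked directly; the two distributions below are the canonical and even-split ones from the text:

```python
import math

# Entropy in bits of the canonical q versus a mass-splitting q, for the
# degenerate language model p(hello) = 1.
def entropy_bits(dist: dict) -> float:
    return sum(-pr * math.log2(pr) for pr in dist.values() if pr > 0)

p = {"hello": 1.0}
q_canonical = {("he", "ll", "o"): 1.0}
q_split = {("he", "ll", "o"): 0.5, ("h", "e", "l", "l", "o"): 0.5}

print(entropy_bits(p))            # 0.0
print(entropy_bits(q_canonical))  # 0.0 — matches H(p)
print(entropy_bits(q_split))      # 1.0 — one wasted bit: H(T|S) = 1
```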
Does it matter?
Theory tells us that the model should assign all the weight to the canonical tokenization. Chatzi et al. (2025) make the case for why: they prove that canonical sampling — restricting generation to token sequences that the tokenizer would actually produce — yields a token-level distribution provably closer to the training distribution in KL-divergence than standard sampling does. The intuition is clean: since the model only ever saw canonical sequences during training, non-canonical sequences are out-of-distribution, and probability placed on them is probability placed where the model has no training signal.
How much probability leaks in practice? Chirkova et al. (2023) estimated the gap between the probability a trained model assigns to the canonical tokenization of a string and the marginal probability summed over all of its tokenizations. The gap turns out to be small: trained models concentrate nearly all of their mass on the canonical sequence, so marginalizing over tokenizations barely changes the likelihood.
So is assigning mass to non-canonical tokenizations necessarily bad? Not always. BPE-Dropout (Provilkov et al., 2020) randomly drops merge operations during training, deliberately exposing the model to varied tokenizations of the same string. This acts as a regularizer: the model learns more robust subword representations, improving translation quality by up to 2.3 BLEU over standard BPE. So while the canonical $q$ is the entropy-optimal representation of a given $p$, spending probability on alternative tokenizations during training can buy robustness: redundancy as regularization rather than pure waste.
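For intuition, the core of BPE-dropout can be sketched in a few lines. The merge list below is a toy assumption standing in for a trained BPE merge table; the skipping rule follows the paper's idea of dropping each candidate merge independently with some probability:

```python
import random

# Hedged sketch of the BPE-dropout idea (Provilkov et al., 2020): apply
# merges in priority order, but skip each candidate merge with
# probability `dropout`, producing varied tokenizations of one string.
# The merge list is a toy assumption, not a trained BPE model.
MERGES = [("h", "e"), ("l", "l")]

def bpe_dropout(s: str, dropout: float = 0.1, rng=random) -> tuple[str, ...]:
    tokens = list(s)                      # start from single characters
    for a, b in MERGES:
        i, out = 0, []
        while i < len(tokens):
            if (i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b)
                    and rng.random() >= dropout):
                out.append(a + b)         # merge applied
                i += 2
            else:
                out.append(tokens[i])     # merge dropped or no match
                i += 1
        tokens = out
    return tuple(tokens)

# With dropout=0.0 this reduces to plain deterministic merging; with
# dropout=1.0 every merge is skipped and the string stays character-level.
print(bpe_dropout("hello", dropout=0.0))  # ('he', 'll', 'o')
print(bpe_dropout("hello", dropout=1.0))  # ('h', 'e', 'l', 'l', 'o')
```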
References
- Chatzi, I., Corvelo Benz, N., Tsirtsis, S., & Gomez-Rodriguez, M. (2025). Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service. arXiv:2506.06446
- Chirkova, N., Kruszewski, G., Rozen, J., & Dymetman, M. (2023). Should you marginalize over possible tokenizations? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). arXiv:2306.17757
- Provilkov, I., Emelianenko, D., & Voita, E. (2020). BPE-Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. arXiv:1910.13267