Tokenization and the Noiseless Channel

Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. An optimal encoding according to Shannon entropy, however, assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high-frequency or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) low-frequency subwords may not appear often enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.78 in comparison to just −0.32 for compressed length.


Introduction
Tokenization, the practice of breaking up text into words or subword pieces, or, more generally, tokens, is often the first step in an NLP pipeline. A wide variety of tokenization functions have been proposed in the NLP literature (Mermer, 2010; Sennrich et al., 2016; Kudo, 2018). And, indeed, research on developing a good tokenization function continues because how one tokenizes may have a large impact on model performance in the downstream task. For instance, Gowda and May (2020) note that BLEU ranges from 28 to 37 just by changing the size of the vocabulary in their machine translation (MT) pipeline. A direct extrinsic evaluation of a tokenization function, however, is computationally intensive: One first has to retokenize the corpora (generally quick), but then retrain the NLP model (often computationally intensive) to evaluate the effect. For this reason, characterizing the intrinsic properties of a good tokenization function has practical benefits (Gallé, 2019; Gowda and May, 2020).
Our paper takes an information-theoretic approach to characterizing a good tokenization function. Following Gallé (2019), we contend that tokenization may be fruitfully viewed as determining a good dictionary code for a language. Fortunately, dictionary codes are equipped with a natural intrinsic metric of utility (expected code length), whereas there are many ways to extrinsically measure tokenization quality. For simplicity and in line with previous research, we choose a specific downstream task metric: BLEU (Papineni et al., 2002) in the domain of MT. We hypothesize that, ceteris paribus, downstream task metrics should correlate with the expected code length of the unigram token distribution. While not immediately intuitive, the motivation is that there is a theoretical connection between the expected code length (under an optimal encoder) of a token distribution and that distribution's Shannon entropy: The latter gives us a lower bound on the former. And given a fixed vocabulary size, higher-entropy token distributions are more desirable because they are more balanced, i.e., there are fewer tokens that occur too rarely or too frequently. This characteristic should in turn balance a model's ability to learn representations for the entire vocabulary, which requires exposure to enough instances of each token, while also penalizing the use of very frequent character sequences as tokens, which is often inefficient due to their lack of distinct meaning.
Yet when using Shannon entropy as our metric of a distribution's balance, the optimal token distribution may still include a large number of infrequent tokens. This behavior may be undesirable for a number of reasons that we subsequently discuss. Accordingly, we formulate the compression principle, which states that downstream task metrics, e.g., BLEU, should correlate with the expected code length subject to a penalty for long codewords (which correspond to infrequent tokens). Consequently, we introduce a more nuanced formulation of efficiency that employs Rényi entropy (Rényi, 1961), whose hyperparameter α allows us to penalize the use of long codes to varying degrees.
In the experimental portion of our paper, we predict the performance of MT models. We find that the Rényi efficiency with α = 2.5 yields a Pearson correlation of 0.78 with BLEU on German→English MT (1M parallel sentences from CommonCrawl). This stands in contrast to Shannon entropy or expected sequence length, which yield Pearson correlations of only 0.22 and −0.30, respectively.
We also provide an easy-to-use package to score tokenizations. See App. B for usage instructions.

Tokenization
Tokenization is generally defined informally as the breaking up of text into a sequence of tokens, which are then encoded into a machine-interpretable format. However, to proceed with our analysis, we require a more formal treatment. First, we assume that there exists an alphabet, a finite, non-empty set of characters Σ. We call a string of characters σ ∈ Σ* a text. In this formulation, we assume that the alphabet Σ includes all characters, including punctuation and a distinguished white-space character. Finally, an unordered multiset of texts {σ₁, . . ., σ_M} ⊂ Σ* is termed a corpus of size M. We denote the true distribution over all texts as p_{Σ*}. Every p_{Σ*} induces a marginal distribution over Σ, which we call the Σ-unigram distribution:

p_Σ(σ) ∝ Σ_{σ ∈ Σ*} p_{Σ*}(σ) · count(σ, σ)    (1)

where count(σ, σ) returns the number of times the character σ appears in the text σ. In general, we do not have access to p_{Σ*} but rather only to samples from p_{Σ*}, with which we can construct an empirical distribution. Our formal analysis, however, will consider p_{Σ*}. Let ∆ be a second alphabet, which we call the tokenization alphabet. We define a tokenization function t : Σ* → D ⊆ ∆* as a function mapping texts in alphabet Σ to sequences of tokens in D = t(Σ*). One popular choice in NLP is to have Σ be a set of Unicode characters and ∆ be a set of strings of Unicode characters. In this case, the tokenization function t segments the text σ into tokens corresponding to smaller chunks of text. There are many approaches for devising different t's; a brief overview of some of them is offered in App. E.
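To make Eq. (1) concrete, the following sketch computes the empirical ∆-unigram distribution induced by a tokenization function by pooling token counts over a corpus (a minimal illustration, assuming the empirical text distribution that weights each text equally; the whitespace tokenizer is just a stand-in for a real t):

```python
from collections import Counter
from typing import Callable, Dict, List

def unigram_distribution(
    corpus: List[str], tokenize: Callable[[str], List[str]]
) -> Dict[str, float]:
    """Empirical Delta-unigram distribution induced by a tokenization function.

    Pools token counts over the corpus, which corresponds to Eq. (1) under the
    empirical text distribution (each text weighted equally).
    """
    counts: Counter = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    total = sum(counts.values())
    return {token: c / total for token, c in counts.items()}

# Example with a whitespace "tokenizer" (a trivially invertible stand-in for t):
p_delta = unigram_distribution(["a rose is a rose", "is a rose"], str.split)
print(p_delta)  # {'a': 0.375, 'rose': 0.375, 'is': 0.25}
```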
Furthermore, for our purposes, it is useful to restrict tokenization functions to those that are invertible (bijections), i.e., rules where we can undo the tokenization. This way, the original text can be reconstructed and no information is lost during tokenization.
Because of our restriction to invertible tokenization functions, a change of variable lets us convert the distribution over texts in Σ* into one over token sequences δ in D in a straightforward manner: p_{∆*}(δ) = p_{Σ*}(t⁻¹(δ)). Note that the pushforward p_{∆*} induces a distribution over ∆* but with support limited to D.
In applied NLP, there is currently no widely accepted notion of the intrinsic quality of a tokenization function. Rather, practitioners are generally interested in its extrinsic performance, i.e., the performance of a model trained on a corpus tokenized using a certain tokenization function. Under such an evaluation, given two tokenization functions, the one that enables better performance on the downstream task is taken to be better. However, gauging the quality of a tokenization function in this manner is computationally expensive. Thus, we develop an information-theoretic intrinsic evaluation.

Communication in a Noiseless Channel
Our analysis of tokenization schemes relies on the following framing: Our ultimate goal when tokenizing a text σ ∼ p_{Σ*} is the transmission of this text across a hypothetical channel. To perform this feat, we first tokenize σ into a sequence in D ⊆ ∆*. We then encode each token in ∆ as a sequence of symbols from the set {1, . . ., b}, where b is determined by the channel. Our goal is to analyze the properties of tokenization schemes that lead to models with good downstream performance.
In the case of a noisy channel, we seek an encoding scheme that will help ensure that σ is resilient to noise in addition to efficiently encoding σ. However, in the noiseless case, we only care about efficiency. We can assume that we are working with a noiseless channel because, in the process of encoding data, no information is ever altered by a stochastic process. In this case, one can equivalently think of noiseless channel encoding as compression. Thus, our analysis proceeds by considering the efficiency of different tokenization functions as if our goal is to use them to communicate over a noiseless channel. To this end, we first discuss the conditions for building such an encoding and then discuss the concept of efficient channel usage.

Definition 3.1. A token-level encoder enc_∆ is a function enc_∆ : ∆ → {1, . . ., b}* that maps every token δ ∈ ∆ to a string of symbols in base b, which we call a codeword. We can naturally lift the token-level encoder to a sequence-level encoder using concatenation: enc_∆(δ) = enc_∆(δ₁) · enc_∆(δ₂) ⋯ enc_∆(δ_{|δ|}).

In order to be able to uniquely decode a string δ, we further require that enc_∆ produce prefix-free codes for all tokens in ∆. As an example, Huffman encoding provides a fast and nearly optimal (in a sense to be discussed in the subsequent section) method to construct prefix-free codes (Huffman, 1952).
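As an illustration, a minimal Huffman construction is sketched below; it builds a binary (b = 2) prefix-free code for a given unigram distribution and lifts it to sequences by concatenation (an illustrative sketch, not the encoder used in this paper's experiments):

```python
import heapq
from itertools import count
from typing import Dict, List

def huffman_code(p: Dict[str, float]) -> Dict[str, str]:
    """Build a binary prefix-free code for a distribution p: token -> probability."""
    ticket = count()  # tie-breaker so equal probabilities never compare the dicts
    heap = [(prob, next(ticket), {token: ""}) for token, prob in p.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        # prepend a distinguishing bit to every codeword in each subtree
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (p1 + p2, next(ticket), merged))
    return heap[0][2]

def encode_sequence(code: Dict[str, str], tokens: List[str]) -> str:
    """Lift the token-level encoder to the sequence level by concatenation."""
    return "".join(code[t] for t in tokens)

code = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(code)                                    # e.g., {'a': '0', 'b': '10', ...}
print(encode_sequence(code, ["a", "b", "a"]))  # e.g., '0100'
```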
Example 3.2 (One-hot encoding). Consider a tokenization alphabet ∆. In NLP, when b = 2, the most straightforward way of encoding the n-th element of ∆ is a vector of zeroes of length |∆| with a 1 in position n.
For the remainder of the paper, we will not be interested in any specific enc_∆, but rather in the optimal token-level encoder achievable, as measured by expected code length.
Definition 3.4. The expected code length L_{enc_∆} of a token-level encoder enc_∆ is defined as

L_{enc_∆}(p_∆) = Σ_{δ ∈ ∆} p_∆(δ) |enc_∆(δ)|.    (2)

A well-known result from information theory tells us that Eq. (2) is bounded by the Shannon entropy of W_∆, a ∆-valued random variable with law p_∆. To introduce the theorem, we first define Shannon entropy.
Definition 3.5. The Shannon entropy of W_∆ is defined as

H(W_∆) = − Σ_{δ ∈ ∆} p_∆(δ) log p_∆(δ).

For channels using b symbols for transmission, the logarithm is of base b. Traditionally in information theory, one takes b = 2.
Theorem 3.6. Let W_∆ be a ∆-valued random variable with law p_∆ and let enc_∆ be an encoder. Then, the optimal token-level encoder enc⋆_∆ satisfies

H(W_∆) ≤ L_{enc⋆_∆}(p_∆) ≤ H(W_∆) + 1.

This theorem tells us that if we wish to communicate tokens from the alphabet ∆ through a noiseless channel, the minimum expected code length for any possible encoding is lower-bounded by the Shannon entropy of the distribution p_∆. An optimal token-level encoder will produce codes with expected length within these exact bounds. We can prove a result very similar to Shannon's source coding theorem (Shannon, 1948) that tells us how well we can optimally encode a ∆*-valued source using only a token-level encoder.
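For a dyadic distribution, the bound in Theorem 3.6 is tight; the following check (a toy example with hand-specified Huffman code lengths, independent of any particular encoder implementation) illustrates it:

```python
import math

# A dyadic unigram distribution and the code lengths a Huffman code assigns to it.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code_length = {"a": 1, "b": 2, "c": 3, "d": 3}

shannon = -sum(q * math.log2(q) for q in p.values())  # H(W_Delta) = 1.75
expected = sum(p[t] * code_length[t] for t in p)      # L_enc(p_Delta) = 1.75

# H(W) <= L_enc*(p) <= H(W) + 1; for dyadic p the lower bound is met exactly.
assert shannon <= expected <= shannon + 1
print(shannon, expected)
```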
To this end, we first introduce the notions of expected sequence length and average per-token encoding length, and then offer a lower bound on the compression achievable using only a token-level encoder. We additionally define three random variables that will prove useful in our analysis; all of them are pushforwards of p_{∆*}.
Let L be a random variable whose values range over strings' lengths, i.e., L(δ) = |δ|. The expected token sequence length E[L] for sequences sampled according to p_{∆*} is then

E[L] = Σ_{δ ∈ D} p_{∆*}(δ) |δ|,

where, for notational simplicity, we leave the dependence of this expectation on p_{∆*} implicit, as it will always be clear from context. Let X_δ(δ) = count(δ, δ)/|δ| be the unigram random variable, i.e., a function that returns the proportion of δ that consists of a particular δ. Finally, define the random variable L_{enc_∆}(δ) = (1/|δ|) Σ_{n=1}^{|δ|} |enc_∆(δ_n)|, the average per-token encoding length of a sequence. We now turn to our first major theorem.
Theorem 3.7. Let p_{∆*} be a distribution over ∆*, and let p_∆ be the unigram distribution induced by p_{∆*} (Eq. (1)). Then, for an optimal token-level encoder enc⋆_∆ : ∆ → {1, . . ., b}* lifted to the sequence level, the following lower and upper bounds hold:

E[L] · H(W_∆) + Cov(L_{enc⋆_∆}, L) ≤ E_{δ∼p_{∆*}}[|enc⋆_∆(δ)|] ≤ E[L] · (H(W_∆) + 1) + Cov(L_{enc⋆_∆}, L).

Proof. The proof is given in App. C. ■

In the special case of Shannon entropy, we additionally arrive at the following stronger inequality:

E_{δ∼p_{∆*}}[|enc⋆(δ)|] ≤ E_{δ∼p_{∆*}}[|enc⋆_∆(δ)|].

This holds because enc⋆ is not constrained to token-level codes, and the unconstrained minimum over all codes is naturally lower than the constrained version. As a concrete example, even if two δ′, δ′′ ∈ ∆ always appear together in practice, enc⋆_∆ must assign both δ′ and δ′′ their own unique codes. Such a constraint does not apply to enc⋆. We foreshadow that this inequality does not generalize to Rényi entropy, as discussed in §4.
Theorem 3.7 tells us that the expected code length of a sequence-level encoder, based on a token-level encoder, is proportional to the expected code length of the unigram distribution up to an additive covariance factor. This allows us to determine both a lower bound for the expected code length of such an encoder and an upper bound for the expected code length of a sequence-level encoder based on an optimal token-level encoder.
We are now in a position to return to the main objective of this paper: assessing the quality of different tokenizers. One natural way of comparing tokenizers would be to compare properties of the distributions over tokens that they each induce. At first glance, Shannon entropy looks like the most obvious candidate for such a property. However, for distributions over ∆ of different sizes, it is not directly comparable. The efficiency of a tokenization function addresses this issue.
Definition 3.8. Let p_{Σ*} be a distribution over Σ*, let t : Σ* → ∆* be a tokenization function, and let p_{∆*} be the distribution over ∆* induced by t. The efficiency of t is defined as

eff(p_{Σ*}, t) = E_{δ∼p_{∆*}}[|enc⋆_∆(δ)|] / E_{δ∼p_{∆*}}[|enc^U_∆(δ)|],

where enc⋆_∆ is an optimal token-level encoder and enc^U_∆ a uniform encoder that assigns all tokens in ∆ codes of equal length ⌈log |∆|⌉.
Theorem 3.9. The efficiency of t is upper-bounded by

eff(p_{Σ*}, t) ≤ (⌈H(W_∆)⌉ + Cov(L_{enc⋆_∆}, L)/E[L]) / log |∆|    (10)

and lower-bounded by

eff(p_{Σ*}, t) ≥ (H(W_∆) + Cov(L_{enc⋆_∆}, L)/E[L]) / ⌈log |∆|⌉,

where W_∆ is a ∆-valued random variable with law p_∆, the unigram distribution induced by t.
Proof. The proof is given in App. C. ■

Note that the upper bound given in Eq. (10) tells us how efficient the best code could be, which is the more interesting bound for our purposes. Additionally, we note that, by introducing a normalization factor, efficiency provides a better solution than directly comparing distributions' entropies. We illustrate this in the following example.
Example 3.10. Consider a tokenization function t₁ with tokenization alphabet ∆₁, where |∆₁| = 6. We then introduce a second tokenization function t₂ with a tokenization alphabet ∆₂ defined to be ∆₁ plus an additional 6 tokens that occur very infrequently. The difference between these two distributions is illustrated in Fig. 1. If, for example, we relied solely on the Shannon entropy, which is higher for more uniformly spread-out distributions, we would judge the second distribution to be better (2.50 < 3.08). However, the efficiency tells the opposite story (0.97 > 0.86).
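The effect is easy to reproduce numerically. The sketch below uses two illustrative distributions (not the exact ones from Fig. 1) and approximates efficiency by the ratio H(W_∆)/log |∆|, i.e., ignoring the covariance and rounding terms from Theorem 3.9; adding rare tokens raises the entropy while lowering the efficiency:

```python
import math

def shannon_entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def efficiency(p):
    # Approximation of Definition 3.8: H(W_Delta) / log |Delta|
    return shannon_entropy(p) / math.log2(len(p))

p1 = [0.30, 0.20, 0.15, 0.15, 0.10, 0.10]  # |Delta_1| = 6
p2 = [0.97 * q for q in p1] + [0.005] * 6  # Delta_1 plus 6 rare tokens

print(shannon_entropy(p1), efficiency(p1))  # ~2.47 bits, efficiency ~0.96
print(shannon_entropy(p2), efficiency(p2))  # ~2.67 bits (higher), ~0.74 (lower)
```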
As Example 3.10 suggests, the measure provided by efficiency is in line with the idea of a more balanced distribution over ∆*. Informally, we do not want a tokenizer that induces a distribution with very low entropy, as this is indicative of an unbalanced distribution. The efficiency eff provides us with a notion of this imbalance. To relate efficiency back to our metaphor of the noiseless channel, we note that the quantity 1 − eff(p_{Σ*}, t) is known as relative redundancy and corresponds to the maximum data compression ratio, i.e., the percentage by which the data size can be reduced.

Rényi Efficiency
Definition 3.8, the standard definition of efficiency, is based on Shannon entropy. Upon closer inspection, we see that it linearly penalizes the use of long codes. To see why, consider a case where the distribution changes such that the entropy increases by one. Then, the upper bound for the expected code length provided by an optimal encoder also increases by one. However, in some cases, we may wish to assign a non-linear cost to code length, e.g., there may be a non-linearly higher cost for decoding longer codes. In the context of choosing the vocabulary for a model, this corresponds to our desire to avoid inducing tokens that occur very infrequently, because there may not be enough examples of them in the training data for the model to learn. To add an additional degree of freedom to accommodate such preferences, Campbell (1965) generalizes expected code length to the discounted expected code length with hyperparameter s (our notation differs slightly from Campbell's):

L^(s)_{enc_∆}(p_∆) = (1/s) log_b Σ_{δ ∈ ∆} p_∆(δ) b^{s·|enc_∆(δ)|}.    (12)

In the limit s → 0, this recovers the ordinary expected code length, i.e., lim_{s→0} L^(s)_{enc_∆}(p_∆) = L_{enc_∆}(p_∆), and, additionally, lim_{s→∞} L^(s)_{enc_∆}(p_∆) = max_{δ∈∆} |enc_∆(δ)|. Beyond the limiting cases, for s ∈ (−1, ∞) \ {0}, we further note that L^(s)_{enc_∆}(p_∆) is a monotonically increasing function of s. The larger our value of s, the more disproportionately L^(s)_{enc_∆}(p_∆) increases as a function of the longest codeword, which often corresponds to the encoding of a low-frequency token in a good code, because high-frequency tokens are assigned the shorter codes in order to minimize the expected code length. For large enough s, this has the effect of encouraging all codewords to be roughly of equal length. Campbell (1965) sought an analogue of Shannon's coding theorem for L^(s)_{enc_∆}(p_∆) where s ≠ 0. As it turns out, there is a deep connection with the Rényi entropy.
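A small numerical sketch of Eq. (12) (assuming the exponential-mean form above, which matches the stated limits) shows the monotonicity in s and how large s is dominated by the longest codeword:

```python
import math

def discounted_code_length(p, lengths, s, b=2):
    """Campbell's discounted expected code length L^(s) for code lengths |enc(d)|."""
    if s == 0:  # limiting case: ordinary expected code length
        return sum(p[t] * lengths[t] for t in p)
    return (1 / s) * math.log(sum(p[t] * b ** (s * lengths[t]) for t in p), b)

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {"a": 1, "b": 2, "c": 3, "d": 3}
for s in [-0.5, 0, 1, 4, 16]:
    print(s, round(discounted_code_length(p, lengths, s), 3))
# L^(s) increases with s and approaches max |enc(d)| = 3 as s grows.
```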
Definition 4.1. The Rényi entropy of order α > 0 is defined as

H_α(p) = (1/(1 − α)) log Σ_{δ ∈ ∆} p(δ)^α.

Prima facie, Rényi entropy bears some resemblance to L^(s)_{enc_∆}(p_∆). To see this, consider the limiting cases. At α = 0, we have

H_0(p) = log |{δ ∈ ∆ : p(δ) > 0}|,

the logarithm of the support size. And, at α = ∞, we have

H_∞(p) = − log max_{δ ∈ ∆} p(δ).

Finally, we have H_1(p) = H(p), i.e., α = 1 corresponds to Shannon entropy, a result which can also be shown by L'Hôpital's rule. These cases suggest the correspondence α = (1 + s)⁻¹, which fits the three cases considered, e.g., note that α = 0 when s → ∞. Moreover, this is exactly the intuition we argued for above: When α = 0, we encode tokens with codewords of the same length, which follows from minimizing the length of the longest codeword. On the other hand, when s = −1, we encourage shorter codes for high-probability tokens. This case corresponds to α = ∞. We now prove that, similarly to how H(W_∆) provides bounds for L_{enc⋆_∆}(p_∆), H_α(W_∆) provides bounds for L^(s)_{enc^s_∆}(p_∆), where enc^s_∆ is an encoder optimal with respect to a given s = α⁻¹ − 1. We term such an encoder s-optimal.

Theorem 4.2 (Generalization of Campbell (1965)). Let H_α be the Rényi entropy of order α and let L^(s)_{enc_∆}(p_∆) (Eq. (12)) be the discounted expected code length for the encoder enc_∆, where s = α⁻¹ − 1. Moreover, let W_∆ be a ∆-valued random variable with law p_∆. Then for an s-optimal token-level encoder enc^s_∆, the following bound holds on the discounted expected code length:

H_α(W_∆) ≤ L^(s)_{enc^s_∆}(p_∆) ≤ H_α(W_∆) + 1.

Proof. Proof in App. C. ■

Note that we have further generalized Campbell's (1965) result by allowing some negative values for s, namely, s > −1. As a result, we can induce additional non-linear weight on too-short codes as opposed to only long codes. Now we generalize the efficiency with respect to Shannon entropy to Rényi entropy. Let enc^s_∆ be an s-optimal token-level encoder over token alphabet ∆. Note that several terms from our prior notation can now be expressed in terms of enc^s_∆, i.e., enc⋆_∆ = enc⁰_∆ and enc^U_∆ = enc^∞_∆.

Theorem 4.3. Let α = (1 + s)⁻¹, let p_{∆*} be a distribution over ∆*, and let p_∆ be the unigram distribution induced by p_{∆*} (Eq. (1)). Then, the following inequality holds:

E[L] · H_α(W_∆) + Cov(L_{enc^s_∆}, L) ≤ E_{δ∼p_{∆*}}[|enc^s_∆(δ)|] ≤ E[L] · (H_α(W_∆) + 1) + Cov(L_{enc^s_∆}, L),

where s = α⁻¹ − 1.
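A direct implementation of Definition 4.1, with the limiting cases handled explicitly (a small utility sketch; probabilities are passed as a plain list):

```python
import math

def renyi_entropy(p, alpha, base=2):
    """Rényi entropy H_alpha of a distribution p, given as probabilities."""
    probs = [q for q in p if q > 0]
    if alpha == 1:          # limiting case: Shannon entropy (via L'Hopital)
        return -sum(q * math.log(q, base) for q in probs)
    if math.isinf(alpha):   # limiting case: min-entropy, -log max p
        return -math.log(max(probs), base)
    # alpha = 0 needs no special case: sum(q**0) = |support|, so H_0 = log |support|
    return (1 / (1 - alpha)) * math.log(sum(q ** alpha for q in probs), base)

p = [0.5, 0.25, 0.125, 0.125]
for a in [0, 1, 2.5, math.inf]:
    print(a, round(renyi_entropy(p, a), 3))  # H_alpha decreases as alpha grows
```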
Definition 4.4. The Rényi efficiency of t at α is defined analogously to Definition 3.8, with the s-optimal encoder in place of the optimal one:

eff_α(p_{Σ*}, t) = E_{δ∼p_{∆*}}[|enc^s_∆(δ)|] / E_{δ∼p_{∆*}}[|enc^U_∆(δ)|],

where s = α⁻¹ − 1. The Rényi efficiency can be easily upper-bounded in a manner similar to the Shannon efficiency.
Theorem 4.5. Let p_{Σ*} be a distribution over Σ*, let t : Σ* → ∆* be a tokenization function, and let p_{∆*} be the distribution over ∆* induced by t. Then, for an s-optimal token-level encoder enc^s_∆ lifted to the sequence level, the Rényi efficiency of t at α is upper-bounded by

eff_α(p_{Σ*}, t) ≤ (⌈H_α(W_∆)⌉ + Cov(L_{enc^s_∆}, L)/E[L]) / log |∆|,

where W_∆ is a ∆-valued random variable with law p_∆, the unigram distribution induced by t.

Proof. Proof in App. C. ■
To provide more intuition for why the non-linear penalization in Rényi efficiency makes it a good measure of distribution balance, we offer a worked example in Example E.1.

The Compression Principle
In the previous sections, we discussed how different tokenizers lead to token distributions with varying properties. Now, we add the last piece necessary to link the downstream performance of a system with the choice of a tokenizer.
Hypothesis 5.1 (Compression Principle). Let p_{Σ*} be a distribution over texts with characters from alphabet Σ and t be a tokenization function from Σ* to ∆* for some token alphabet ∆. Let p_∆ be the ∆-unigram distribution induced by t. Finally, let PERFORMANCE_M(t) be some measure of performance of a system M which uses tokenization t. Then, for some α dependent on M, eff_α(p_{Σ*}, t) is a good predictor of PERFORMANCE_M(t).
In words, we hypothesize that the efficiency of the tokenization function t is highly correlated with the downstream performance.We will verify this claim experimentally in §6.
Rényi Entropy α. The choice of α for H_α determines the extent to which longer codewords are penalized. On the one hand, if we observe that Rényi efficiency with low α correlates best with performance, we can conclude that longer codewords, i.e., low-frequency tokens, matter most for downstream performance. On the other hand, if Rényi efficiency with high α correlates best, then very short codewords, i.e., very high-frequency tokens, matter most.

Learnability. The most intuitive explanation for why some tokenization functions enable better downstream results than others is that having many low-frequency tokens will prevent the model from learning their distributional properties. This hypothesis can be related back to the sample complexity of the learning algorithm, i.e., the number of training samples needed by the model in the given setting to learn the function of interest. If we accept that part of the MT task is learning the meaning of all individual vocabulary tokens, then sample complexity could (at least partially) be expressed in terms of the number of instances of each token. This argument is made by Gowda and May (2020), who are concerned with what proportion of δ ∈ ∆ appears at least 100 times in the corpus for the downstream task at hand. Nevertheless, we will see shortly that the best predictor with Rényi efficiency is for α > 1, meaning that higher weight is given to the codewords of more frequent tokens. We therefore hypothesize that very high-frequency tokens have the most impact on downstream performance.

Experiments
We now seek empirical evidence for Hyp. 5.1. We focus on MT, where a standard automatic evaluation metric is BLEU (Papineni et al., 2002). We use the English→German CommonCrawl dataset in all experiments. The specifics of the MT system, data, and evaluation are described in App. D. We consider two different experimental manipulations. First, we experiment with various modifications of the popular byte-pair encoding (BPE) tokenizer (Sennrich et al., 2016) to control its compression rate. The details are discussed in §6.1. Second, we experiment with a variety of tokenization schemes: Unigram (Kudo, 2018), WordPiece (Devlin et al., 2019), Lempel-Ziv-Welch (Ziv and Lempel, 1977; Welch, 1984), and Morfessor (Creutz and Lagus, 2007; Virpioja et al., 2013; Smit et al., 2014). The details are discussed in §6.2.
Note that throughout our experiments, we make the simplifying assumption that Cov(L_{enc^s_∆}, L) = 0. This simplifies the upper bound on eff(p_{Σ*}, t) (from Theorem 3.9) to ⌈H(W_∆)⌉ / log |∆| and the upper bound on eff_α(p_{Σ*}, t) (from Theorem 4.5) to ⌈H_α(W_∆)⌉ / log |∆|. From our preliminary results, Cov(L_{enc^s_∆}, L) is negative and small. We leave its more accurate approximation, which requires a Rényi analogue of Huffman coding as in Jelinek (1968), to future work.
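Under this assumption, the predictor used in our experiments reduces to a few lines of code. The sketch below computes the approximate Rényi efficiency H_α(W_∆)/log |∆| directly from a tokenized corpus (a simplified stand-in for the released package mentioned in App. B, whose exact interface we do not reproduce here; it takes the observed token types as ∆):

```python
from collections import Counter
import math

def renyi_efficiency(tokenized_corpus, alpha=2.5):
    """Approximate Rényi efficiency: H_alpha(W_Delta) / log |Delta|.

    Assumes Cov(L_enc, L) = 0 and uses the observed token types as Delta.
    tokenized_corpus: iterable of token sequences (lists of strings).
    """
    counts = Counter(tok for seq in tokenized_corpus for tok in seq)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h_alpha = (1 / (1 - alpha)) * math.log2(sum(q ** alpha for q in probs))
    return h_alpha / math.log2(len(probs))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
print(renyi_efficiency(corpus, alpha=2.5))
```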

Experiment 1
In our first experiment, we analyze how predictive various quantitative attributes of a tokenization scheme are of downstream model performance. We consider BPE with 5 different vocabulary sizes: 2k, 4k, 8k, 16k, and 32k. For each vocabulary size, we create multiple tokenization schemes with varying compression rates. As discussed in App. E.1, BPE produces a vocabulary through a greedy compression algorithm. However, in order to achieve a variety of different compression rates, we inject random noise into the algorithm. We achieve this by sampling from a Boltzmann distribution over the pair frequencies with temperature parameter τ; see App. E.1 for details. We then treat each (vocabulary size, temperature) pair as a single data point in our analysis.
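The merge-selection step might look like the following sketch (an assumed parameterization: App. E.1 is not reproduced here, so the exact energy function, e.g., raw versus log frequencies, is our guess; τ → 0 recovers greedy BPE):

```python
import math
import random

def sample_merge(pair_freqs, tau):
    """Sample the next BPE merge from a Boltzmann distribution over pair frequencies.

    pair_freqs: dict mapping a candidate pair, e.g., ('t', 'h'), to its frequency.
    tau: temperature; small tau concentrates mass on the most frequent pair.
    """
    pairs = list(pair_freqs)
    logits = [pair_freqs[p] / tau for p in pairs]
    m = max(logits)  # subtract the max for numerical stability of exp()
    weights = [math.exp(l - m) for l in logits]
    return random.choices(pairs, weights=weights, k=1)[0]

freqs = {("t", "h"): 120, ("h", "e"): 90, ("a", "n"): 40}
print(sample_merge(freqs, tau=10.0))  # noisy choice
print(sample_merge(freqs, tau=0.01))  # near-deterministically ('t', 'h')
```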
Our main quantitative attribute of interest, i.e., predictor, is Rényi efficiency. Aside from Rényi efficiency, we further consider Shannon and Rényi entropies, Shannon efficiency, and average tokenized sequence length. Further, one popular heuristic for choosing the vocabulary size is given and justified by Gowda and May (2020). It can be summarized as: "Use the highest possible vocabulary size such that 95% of [tokens] occur at least 100 times in the data." While the constants seem arbitrary, this rule of thumb works well in practice (Gowda et al., 2022; Dramko et al., 2022; Kumar and Thawani, 2022). Nevertheless, it is stated in an algorithmic manner and not as a predictor of performance or learnability. We attempt to turn it into a regressive predictor so as to make it more comparable with the other quantities studied. Given p_∆, let f_n(p_∆) symbolize the frequency of the n-th percentile. We then define the quantity

F_{γ₁,γ₂}(p_∆) = Σ_{n=γ₁}^{γ₂} f_n(p_∆),

which, in words, is the sum of token frequencies from the γ₁-th to the γ₂-th percentile. The original work suggests examining the frequency of the 95th percentile, i.e., γ₁ = γ₂ = 0.95. In contrast, we add an additional degree of freedom, as we do not inspect a single percentile frequency but rather a sum across an interval. Later, we scan the whole space of γ₁ and γ₂ and show that there are better choices that lead to much higher correlations.
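A possible implementation of F_{γ₁,γ₂} (our reading of the definition; the exact percentile convention, e.g., rounding and ascending sorting of frequencies, is an assumption):

```python
def percentile_frequency(token_counts, gamma1, gamma2):
    """F_{gamma1,gamma2}: sum of relative token frequencies between two percentiles.

    token_counts: dict mapping token -> corpus count.
    gamma1, gamma2: percentiles in [0, 1], e.g., 0.03 and 0.83.
    """
    total = sum(token_counts.values())
    freqs = sorted(c / total for c in token_counts.values())  # ascending
    n = len(freqs)
    lo, hi = round(gamma1 * (n - 1)), round(gamma2 * (n - 1))
    return sum(freqs[lo : hi + 1])

counts = {"the": 500, "cat": 40, "sat": 35, "on": 200, "mat": 5}
print(percentile_frequency(counts, 0.03, 0.83))
```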
We use Pearson and Spearman correlations with downstream model performance (measured with BLEU) as our metrics of predictor quality. Recall that the Pearson correlation tells us the strength of a linear relationship between two variables. The Spearman correlation, on the other hand, quantifies the strength of a linear relationship between the variables' rankings, i.e., a monotonic relationship.
Results. In order to select α (for eff_α) as well as γ₁ and γ₂ (for F_{γ₁,γ₂}), we use half of the data to perform a grid search, selecting the hyperparameters that lead to the highest Pearson correlation. We show the results of this grid search in Fig. 3 (for α) and Fig. 4 (for γ₁ and γ₂). Unless otherwise stated, we use these values in subsequent experiments. We show the relationship between BLEU, sequence length, and Rényi efficiency as approximated by the upper bound (Theorem 4.5) in Fig. 2. A comprehensive comparison of all predictors is shown in Tab. 1. The visualization of the other predictors is in Fig. 6. From these analyses, we can see that the Rényi efficiency provides a significantly better explanation of downstream model performance than any of our other predictors.
When examining which α leads to the highest absolute correlation with BLEU, we conclude that tokenization schemes that result in fewer very high-frequency tokens are the best for downstream performance. This is evinced both by the relatively high value of α that leads to the best correlation with performance (Fig. 3, α* ≈ 2.5) and by Fig. 4, which shows that frequencies in the top percentile correlate negatively with performance. Importantly, this finding does not contradict Gowda and May's (2020) rule of thumb, which focuses on low-frequency tokens. While the very high and very low frequencies produced by a tokenization scheme are not independent, a tokenization scheme may feasibly produce both, neither, or only one.
Furthermore, the Pearson correlation between the efficiency (H_{2.5}/H_0) and the percentile frequency (F_{0.03,0.83}) is 0.96, which suggests that both predictors are capturing the same underlying effect.

Experiment 2
In this experiment, we evaluate whether there exist aspects of a tokenization scheme that influence BLEU beyond the Rényi efficiency. Following the results of Experiment 1, we focus on Rényi efficiency at α = 2.5. In contrast to the first experiment, we consider different tokenization schemes (BPE, Unigram, WordPiece, LZW, Morfessor). We manipulate their efficiency by lowering the amount of tokenizer training data (2k, 8k, 100k parallel lines) together with varying vocabulary sizes of 4k, 8k, and 16k tokens. We then treat each (tokenization scheme, training data size, vocabulary size) triple as a single data point in this analysis. We compare three different linear models (Gelman and Hill, 2006), where BLEU is always the dependent variable: (i) with the tokenization scheme as a random effect, (ii) with Rényi efficiency as a fixed effect, and (iii) with both. Importantly, we treat the tokenization scheme as a random effect because the set of tokenization algorithms that we consider does not encompass all possible methods, i.e., only a sample of all possible algorithms is observed.
To compare the ability of these different models to predict BLEU, we look at the average change in log-likelihood of held-out data points under a given model with respect to a baseline model: a model trained with only an intercept term. A larger value of ∆ log-likelihood indicates that the data point is more probable under the comparison model, i.e., the comparison model more closely fits the observed data. We use 10-fold cross-validation to estimate these differences: Our data is split randomly into 10 folds, where 9 of the folds are used to learn model coefficients and the 10th fold is held back for evaluation. The same process is repeated until we have a ∆ log-likelihood value for each data point.
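A sketch of this evaluation for the fixed-effect model (a Gaussian linear model fit by least squares; the random-effect variants would additionally need a mixed-model package, which we omit here, and the data below is hypothetical):

```python
import numpy as np

def heldout_delta_loglik(x, y, n_folds=10, seed=0):
    """10-fold CV estimate of the mean change in held-out log-likelihood of a
    Gaussian linear model y ~ 1 + x relative to an intercept-only baseline."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    deltas = np.empty(len(y))

    def gauss_loglik(obs, mean, sd):
        return -0.5 * np.log(2 * np.pi * sd**2) - (obs - mean) ** 2 / (2 * sd**2)

    for held in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held)
        X = np.column_stack([np.ones(len(train)), x[train]])
        beta, *_ = np.linalg.lstsq(X, y[train], rcond=None)
        sd_full = (y[train] - X @ beta).std()
        pred = beta[0] + beta[1] * x[held]
        base_mean, base_sd = y[train].mean(), y[train].std()
        deltas[held] = gauss_loglik(y[held], pred, sd_full) - gauss_loglik(
            y[held], base_mean, base_sd
        )
    return deltas.mean()

# Hypothetical data: per-tokenizer Rényi efficiency vs BLEU.
eff = np.array([0.40, 0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.57, 0.60, 0.62])
bleu = np.array([27.1, 27.8, 28.9, 29.3, 30.2, 30.8, 31.9, 32.1, 33.0, 33.4])
print(heldout_delta_loglik(eff, bleu))  # > 0: efficiency helps predict BLEU
```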
Results. In Fig. 5, we see that Rényi efficiency is a stronger predictor of MT performance than the tokenization scheme alone. Interestingly, though, the predictive power of these two predictors seems to be orthogonal, as evinced by the mean ∆ log-likelihood of a model with both predictors. This finding suggests that there are additional qualities of a good tokenization scheme that Rényi efficiency alone cannot capture. We leave the investigation of such qualities to future work.

Conclusion
Our paper presents a new information-theoretic approach to characterizing a good tokenization scheme. We contend that the Rényi efficiency of the unigram distribution that a tokenization scheme produces is a principled measure of tokenization quality. To test this claim, we evaluate a large set of tokenizations, with varying vocabulary sizes, produced by different tokenization schemes. We observe how the Rényi efficiency of these different tokenizations relates to the performance of a downstream MT model. We find that, for an appropriate choice of the parameter α, this new metric has a very strong Pearson correlation with BLEU: 0.78, in comparison to just −0.32 for the baseline sequence length. From a theoretical perspective, this property can be connected to a penalization of token distributions that are too unbalanced, in particular those with very high-frequency tokens. This finding is in line with the more general principle that compression is connected with learnability. Our framework also has practical benefits, as it allows for an intrinsic evaluation of tokenization functions.

Limitations
It is possible that there is a hidden effect caused by the language pair direction, model selection, or training data and its size. However, our results bear high statistical significance for cases where we desire high correlation and low statistical significance where we expect low correlation. Assured by this, and concerned by the large cost of training a large number of MT systems, we did not experiment with larger data or other language directions apart from the limited additional experiments in Tab. 2.

A Related Work
Prior to the widespread adoption of subword tokenization, large vocabulary sizes (e.g., 500k) were needed to allow for output expressivity and to avoid a high proportion of out-of-vocabulary tokens. Various tricks were devised to tackle the resulting computational issues (Jean et al., 2015; Mi et al., 2016; L'Hostis et al., 2016). On the other side of the spectrum, character-level NMT was also explored (Ling et al., 2015; Costa-jussà and Fonollosa, 2016), though issues arise with large sequence lengths. Mielke et al. (2021) provide an overview of the evolution of NLP tokenization and describe different types of tokenization approaches. They conclude that reasoning about tokenizer choices remains a vital part of modern pipeline preparation. In this context, our work quantifies, and hence also automates, some of this process by offering a framework to help guide the decision process and hyperparameter selection. Somewhat similar to our work, Ataman and Federico (2018) perform a comparison between BPE and Morfessor, though with only one specific vocabulary size (30k). Similarly to Gowda and May (2020), they suggest that homogeneity of token frequency is an important factor for MT model performance.

Proof. Let δ ∈ ∆*. Define the expected counts as follows. We start by manipulating the expected code length and then proceed with algebraic manipulation.

■
Theorem C.2 (Generalized Hölder's inequality; §2 in Aczél and Beckenbach (1980)). Let f, g, and h be vectors of positive values with coefficients p, q, and r such that all but one are negative and 1/p + 1/q + 1/r = 0. Further, if ∀i : f_i g_i h_i = 1, then

∥f∥_p ∥g∥_q ∥h∥_r ≤ 1.    (32)

As noted in Aczél and Beckenbach (1980), Theorem C.2 is in fact a simple special case of Theorem 12 in Hardy et al. (1934). We will use Theorem C.2 specifically with r = −1 and h_i = (f_i g_i)⁻¹. This simplifies the requirements to exactly one of p and q being negative and 1/p + 1/q = 1. Eq. (32) can then be restated as the following.

Figure 1: Examples of unigram distributions with efficient and inefficient channel usage.

Figure 3: Correlation of Rényi efficiency (H_α/H_0) with BLEU on the training data in Experiment 1. The maxima of the Pearson and Spearman correlations are marked with ⋆.

Figure 4: Results of the grid search over the percentile-frequency hyperparameters, maximizing the absolute Pearson correlation. The highest is for the 3rd to 83rd percentile, with ρ = 0.81.

Figure 5: Mean change in log-likelihood on held-out data under linear models using different predictors. Bars indicate the 95% confidence interval around the mean.

Table 2: Variance explained between predictors and MT performance (BLEU, CHRF, BLEURT, and COMET) in Experiment 1 (only 3 MT seeds, 5 temperatures, and 4 vocabulary sizes) with different language directions.

Theorem 3.9 (restated). Let p_{∆*} be a distribution over ∆*, and let p_∆ be the unigram distribution induced by p_{∆*} (Eq. (1)). The efficiency of t is upper-bounded by

eff(p_{Σ*}, t) ≤ (⌈H(W_∆)⌉ + Cov(L_{enc⋆_∆}, L)/E[L]) / log |∆|,

where W_∆ is a ∆-valued random variable with law p_∆, the unigram distribution induced by t.

Proof. Recall that eff(p_{Σ*}, t) =