A Statistical Extension of Byte-Pair Encoding

Sub-word segmentation is currently a standard tool for training neural machine translation (MT) systems and other NLP tasks. The goal is to split words (both in the source and target languages) into smaller units which then constitute the input and output vocabularies of the MT system. The aim of reducing the size of the input and output vocabularies is to increase the generalization capabilities of the translation model, enabling the system to translate and generate infrequent and new (unseen) words at inference time by combining previously seen sub-word units. Ideally, we would expect the created units to have some linguistic meaning, so that words are formed in a compositional way. However, the most popular word-splitting method, Byte-Pair Encoding (BPE), which originates from the data compression literature, includes no explicit criterion to favor linguistic splittings or to find the optimal sub-word granularity for the given training data. In this paper, we propose a statistically motivated extension of the BPE algorithm and an effective convergence criterion that avoids the costly experimentation cycle needed to select the best sub-word vocabulary size. Experimental results with morphologically rich languages show that our model achieves nearly-optimal BLEU scores and produces morphologically better word segmentations, allowing it to outperform BPE in the translation of sentences containing new words, as shown via human evaluation.


Introduction
Sub-word segmentation is currently a standard tool for machine translation systems (see e.g. the systems submitted to the WMT and IWSLT evaluations (Barrault et al., 2019; Niehues et al., 2019)), as well as for a wide variety of other NLP tasks (see e.g. Devlin et al. (2018) and derived works). The goal is to split words (both in the source and target language) into smaller units which then constitute the input and output of the machine translation system. This serves two purposes: On the one hand, sub-word splitting reduces the size of the input and output vocabularies. This is especially important when using neural models, as the size of the input layer is fixed and thus the vocabulary size cannot be dynamically adjusted. On the other hand, it tries to increase the generalization capabilities of the translation model, enabling the system to accept and/or generate new words at translation time by combining previously seen units. The most widespread method used for sub-word splitting in neural machine translation is Byte Pair Encoding (BPE), introduced by Sennrich et al. (2016). Since then, BPE has become a default preprocessing step for many NLP tasks.
The BPE extraction algorithm is an adaptation of the algorithm introduced by Gage (1994) for data compression. The main idea of this algorithm is to replace the most frequent pair of bytes found in the input data with a new, unseen byte. The process is repeated until no more byte pairs are repeated or until no free bytes are available. Sennrich et al. (2016) took this algorithm as a starting point, considering characters instead of bytes, and joining them using the same criterion to produce sub-word units (more details can be found in Section 3).
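The compression scheme can be sketched in a few lines of Python. This is an illustrative reconstruction under our own simplifying choices (function names and the early-exit conditions are ours), not Gage's original C implementation:

```python
from collections import Counter

def compress(data: bytes) -> tuple[bytes, list[tuple[bytes, int]]]:
    """Repeatedly replace the most frequent byte pair with an unused byte value."""
    rules = []
    free = [b for b in range(256) if b not in set(data)]  # byte values not in the input
    data = bytearray(data)
    while free:
        pairs = Counter(zip(data, data[1:]))  # counts of adjacent byte pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats: nothing left to gain
            break
        new = free.pop()
        rules.append((bytes([a, b]), new))
        # Greedy left-to-right substitution of the pair with the new byte.
        out, i = bytearray(), 0
        while i < len(data):
            if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
                out.append(new)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
    return bytes(data), rules
```

Decompression simply expands the recorded rules in reverse order, which is why the transformation is lossless.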
One potential problem with this approach is that the objective of the original BPE algorithm differs from the goals for which it is being used for translation, as detailed above. While it is certainly effective for the first objective (reducing the vocabulary size), it is arguable whether it is appropriate for the goal of generating new words (Ataman et al., 2017;Huck et al., 2017;Banerjee and Bhattacharyya, 2018).
Intuitively, in order to generate new words, we would expect the sub-word units to have some linguistic meaning, so that a new word can be created in a compositional way. Being purely frequency driven, BPE does not take this intuition into consideration, as illustrated by the two German word examples in Table 1, taken from the WMT'19 training data. For the first word, the split "be@@ klagen" would be more satisfactory, as the word is derived from "klagen" (complain); the second is a compound word, with the splits "bewertungs@@ instrumente" (assessment instruments), separating the two words, and "bewert@@ ung@@ s@@ instrument@@ e" being morphologically more informed alternatives.

beklagen ↓ bek@@ lagen
bewertungsinstrumente ↓ bewer@@ t@@ ungsin@@ stru@@ mente

Table 1: Examples of unsatisfactory BPE splitting of German words. The two words are segmented by breaking the underlying morphological structure.
The BPE algorithm also introduces an additional practical problem. The original formulation does not specify a criterion for stopping the creation of new symbols. If the algorithm runs for an unlimited time, it will merge all sub-words back into the original input vocabulary, which is clearly undesired. In practice, one specifies a fixed number of merges to be carried out, or a frequency threshold: when the frequency of the considered symbol pairs falls below this value, the algorithm is stopped. It is however not clear how to set these hyperparameters, although they can have a drastic effect on translation quality depending on the translation direction, task and amount of data (Denkowski and Neubig, 2017; Sennrich and Zhang, 2019). Furthermore, these hyperparameters are rarely optimized, as evaluating each candidate value constitutes a full training-evaluation cycle, which is notoriously costly.
In this paper we introduce a new criterion for defining sub-word units that tries to address these shortcomings. We introduce a probability distribution over the units which in turn induces a likelihood function over the corpus which we can optimize. We will show how this statistical approach can guide the extraction process towards more linguistically satisfying units, while still remaining a purely data-driven approach. Having a well-founded optimization criterion also allows us to define a data-driven stopping criterion. Our proposed criterion allows us to select a nearly optimal number of units using only an intrinsic measure on the training corpus, thus dramatically reducing experimentation costs.

Related work
As stated in the introduction, our starting point is the BPE algorithm introduced in (Sennrich et al., 2016). In this work, the authors adapt the data compression algorithm by Gage (1994) to the task of sub-word unit generation.
Some authors have tried to expand the extraction of sub-word units by leveraging linguistic information. Sánchez-Cartagena and Toral (2016) use morphological segmentation for Finnish and compare the effectiveness of these sub-word units for the WMT evaluation. The system using this segmentation approach together with other extensions performed best in human evaluation. Huck et al. (2017) follow a similar approach with the addition of compound splitting for translation into German, achieving improvements of around 0.5 BLEU points on WMT data. Ataman and Federico (2018) propose to replace BPE with unsupervised morphological segmentation which also takes morphological coherence into consideration during prediction of the sub-words. Experiments run under small-data conditions on TED Talks in five directions, all from/to English, show systematic improvements on Arabic, Turkish and Czech, but not on Italian and German. Banerjee and Bhattacharyya (2018) also use unsupervised morphological units generated by Morfessor (Virpioja et al., 2013) as input for a neural machine translation system and report improvements for low-resource conditions. Macháček et al. (2018) follow a similar approach for translation into Czech on WMT data, but were not able to obtain improvements over the standard BPE approach.
An alternative model to BPE which is also widely used was presented by Kudo (2018), and can be considered an extension of (Schuster and Nakajima, 2012). They show that using a purely statistical approach, they are able to produce sub-word units that are more linguistically motivated. Similar to our approach, a probability distribution over the sub-word units is defined with the goal of improving the likelihood over the training data. The strategies for defining the sub-word units differ, however. While we start with single characters and expand the units, Kudo (2018) starts with a large set of sub-word units and prunes it iteratively until reaching a desired vocabulary size. Segmentation probabilities are modeled with a multinomial distribution trained via expectation maximization.
In order to improve generalization of the segmentation model (i.e. performance on new words), different regularization approaches have been proposed. Kudo (2018) applies different segmentations at training time. For each parameter update, segmentations for each word are sampled from a smoothed posterior distribution computed from the multinomial distribution. Along the same line, Provilkov et al. (2019) proposed to generate alternative segmentations directly with BPE, by randomly dropping out merging rules. These approaches, as noted by Kudo (2018), can be seen as variants of the ensemble training principle, where many different models are trained (and finally combined) on different subsets of the training data. Our work differs from (Kudo, 2018) in that we train an observable model in a stepwise fashion, like BPE, by maximizing the likelihood of the training data. Thus, we expect our approach to be more efficient than that of Kudo (2018). Differently from Kudo (2018) and Provilkov et al. (2019), we do not apply regularization; however, nothing prevents applying the dropout method to our merging rules as well, although we expect that our model has already learned more general segmentation rules than BPE.
To the best of our knowledge, there has been little previous work on automatically determining the number of sub-word units to produce by segmentation algorithms. Kreutzer and Sokolov (2018) integrate segmentation into the NMT system and find that the system favors character-based translation over sub-word segmentation. Henderson (2020) pointed out that determining vocabulary sizes for NLP tasks is one of the few aspects that is still done manually, and suggests it as one possible direction for future improvement of NLP models.

The Byte Pair Encoding (BPE) algorithm
The BPE training algorithm as presented in (Sennrich et al., 2016) is shown in Algorithm 1. It closely follows the original BPE data compression algorithm by Gage (1994). The algorithm receives as input a text as a sequence of words, which in turn are represented as sequences of characters. The single characters constitute the initial set of symbols. At each iteration the pair of symbols (occurring inside words) with highest frequency is selected and substituted with a new symbol. This substitution is recorded as a new rule. This merging operation is repeated for a fixed number of steps. The algorithm returns the sorted list of merging rules.
Algorithm 1: BPE training algorithm.
Input: training corpus S of words split into character sequences; number N of rules
Output: ordered list R of merge rules
1 R := [ ]
2 for n := 1 to N do
3     (x, y) := pair of adjacent symbols with highest frequency in S
4     replace all occurrences of (x, y) in S with a new symbol xy
5     append rule (x, y) → xy to R
6 return R
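Since Algorithm 1 is only summarized in prose above, the following Python sketch may help make the procedure concrete. It is a simplified, quadratic-time reconstruction that recounts pairs at every iteration; production implementations such as subword-nmt use incremental count updates instead:

```python
from collections import Counter

def bpe_train(corpus: list[str], num_rules: int) -> list[tuple[str, str]]:
    """Learn up to `num_rules` BPE merge rules from a list of words."""
    # Each word starts as a tuple of single characters; keep word frequencies.
    words = Counter(tuple(w) for w in corpus)
    rules = []
    for _ in range(num_rules):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        rules.append(best)
        merged = best[0] + best[1]
        # Apply the new rule to every word, left to right.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return rules
```

On the classic toy corpus of "low", "lower", "newest" and "widest", the first learned rules merge the frequent suffix fragments such as ('e', 's') and ('es', 't').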

Algorithm 2: BPE inference algorithm
Algorithm 2: BPE inference algorithm
Input: list R of merge rules; word w split into characters
Output: segmented word
1 foreach rule ∈ R do
2     if matches(rule, w) then
3         w := apply(rule, w)
4         continue
5 return w

Algorithm 2 shows how to apply the set of rules extracted by Algorithm 1 to a new text. It traverses the ordered list of rules and applies as many of them as possible.
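A minimal Python counterpart of Algorithm 2, assuming rules are given as symbol pairs in the order they were learned:

```python
def bpe_segment(rules: list[tuple[str, str]], word: str) -> list[str]:
    """Apply merge rules to a word, in the order the rules were learned."""
    symbols = list(word)
    for a, b in rules:
        # Apply each rule exhaustively before moving on to the next one.
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge in place, re-check position i
            else:
                i += 1
    return symbols
```

For example, with the rules ('e', 's') and ('es', 't'), the word "newest" is segmented as ['n', 'e', 'w', 'est'].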

The statistical BPE (S-BPE) algorithm
We can generalize the criterion for BPE unit selection by adjusting line 3 of Algorithm 1. Specifically, we define a probability distribution over the BPE units and define a maximum likelihood optimization criterion.
Let S be a corpus of words w from a vocabulary V, and let each word be decomposed as a sequence of symbols (initially characters) s from an alphabet Σ. The log-likelihood of S can be written as:

    L(S, Σ) = Σ_{s ∈ Σ} C_{S,Σ}(s) log p(s)    (1)

where C_{S,Σ}(s) is the count of symbol s in corpus S, in which words are segmented according to Σ, i.e.:

    C_{S,Σ}(s) = Σ_{w ∈ S} (number of occurrences of s in the segmentation of w under Σ)    (2)

Algorithm 1 initializes Σ with single characters (Σ_0). Then, at each step n of training, it selects the pair of symbols with the highest frequency or, equivalently, joint probability:

    (x, y) = argmax_{(x,y) ∈ Σ_{n−1} × Σ_{n−1}} p_{n−1}(x, y)    (3)

thus defining the new alphabet

    Σ_n = Σ_{n−1} ∪ {xy}    (4)

where the probability distribution p_{n−1} is defined over the elements of the alphabet Σ_{n−1}. From a statistical modeling perspective, however, we would be more interested in rules for which the training data likelihood increases, i.e.:

    L(S, Σ_n) > L(S, Σ_{n−1})    (5)

It can be shown (see the Appendix for a derivation) that for any pair of symbols x, y ∈ Σ_{n−1}, the following inequality holds, which provides a lower bound for the increase in likelihood ∆L_n(S) = L(S, Σ_n) − L(S, Σ_{n−1}):

    ∆L_n(S) > Σ_{s ∈ {x,y}} [ C_{S,Σ_n}(s) log p_n(s) − C_{S,Σ_{n−1}}(s) log p_{n−1}(s) ] + C_{S,Σ_n}(xy) log p_n(xy)    (6)

where as usual Σ_n includes xy as given in Equation 4. Intuitively we can interpret the rightmost term as the likelihood of each word that contains the bigram xy being increased by merging the two symbols. This also provides a good tie-in to our linguistic intuition about sub-word units: if two units appear only in combination with each other, they probably do not have linguistic meaning on their own. In that case the probability mass will shift to the joint symbol, and the probability of the single elements will be greatly reduced. On the other hand, if x or y do have linguistic meaning on their own, e.g. verb suffixes, they are likely to have a high probability of appearing in the text, and thus the gain from joining them is not as big.
The above inequality thus suggests the new update rule:

    (x, y) = argmax_{(x,y) ∈ Σ_{n−1} × Σ_{n−1}} { Σ_{s ∈ {x,y}} [ C_{S,Σ_n}(s) log p_n(s) − C_{S,Σ_{n−1}}(s) log p_{n−1}(s) ] + C_{S,Σ_n}(xy) log p_n(xy) }    (7)

Note an important difference between Equations (3) and (7): in (3) we use a bigram probability p_{n−1}(x, y) computed on Σ_{n−1} × Σ_{n−1}, while in (7) we use a unigram probability p_n(xy) computed on Σ_n. The two probabilities are expected to be close, but not the same.
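As an illustration, the selection score for a candidate pair can be sketched as follows. This is our own reading of the lower bound: the merged symbol's new log-likelihood contribution plus the change in the contributions of x and y, with Laplace smoothing to avoid log 0; all names are ours, and overlapping pairs (x == y) are handled only approximately:

```python
import math
from collections import Counter

def sbpe_gain(unigrams: Counter, bigrams: Counter, x: str, y: str,
              alpha: float = 1.0) -> float:
    """Approximate likelihood gain (lower bound) of merging (x, y) into xy.

    `unigrams` holds symbol counts C_{n-1}; `bigrams` holds within-word
    adjacent-pair counts. `alpha` is the Laplace smoothing constant.
    """
    c_xy = bigrams[(x, y)]
    total_prev = sum(unigrams.values())
    vocab_prev = len(unigrams)

    def log_p(count, total, vocab):
        # Laplace-smoothed log-probability.
        return math.log((count + alpha) / (total + alpha * vocab))

    # Each merged occurrence removes one x and one y and adds one xy,
    # so the total symbol count shrinks by c_xy.
    total_new = total_prev - c_xy
    vocab_new = vocab_prev + 1
    gain = c_xy * log_p(c_xy, total_new, vocab_new)  # new symbol's contribution
    for s in {x, y}:
        c_prev = unigrams[s]
        c_new = c_prev - c_xy if x != y else c_prev - 2 * c_xy
        gain += c_new * log_p(c_new, total_new, vocab_new)
        gain -= c_prev * log_p(c_prev, total_prev, vocab_prev)
    return gain
```

With this score, a pair whose elements almost always occur together receives a much higher gain than a pair whose elements mostly occur independently, matching the intuition above.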
Note that in practice, in the course of the algorithm the count for a unit may drop to 0 (due to all the occurrences being combined with another unit to form a new pair), thus producing a probability of 0. In order to avoid computation of log 0 in Equation (7) we use Laplace smoothing for the computation of all probabilities.

Stopping criterion
One open question when defining BPE units is how many operations to carry out. As shown in Algorithm 1, this number is a parameter of the extraction algorithm, and there is no defined way to select it. The number of units has an important effect on the quality of the translation system (see Section 5), but selecting the optimal number involves training and testing a translation system for each candidate, at a high computational cost. Thus, normally system builders resort to previous experience and select a number of units that has worked well on previous tasks, although the performance can be very task dependent.
With the statistical formulation of BPE, for each operation we can compute a corresponding (approximate) increase in likelihood on the training corpus through Equation 6. Looking at the evolution of the likelihood, we can define a criterion for when to stop defining new units. Specifically, let us define δ_i as the (approximate) increase in likelihood when defining the i-th BPE unit. We will stop the algorithm, and thus define the number of units N, when δ_N ≤ k·δ_1, with k < 1. In order to improve the robustness of the criterion, in practice it is better to average each δ_i with the previous M values. Of course, one could argue that we just substituted one parameter of the algorithm with another, which also has to be selected externally. However, as we will show in Section 5, the same value obtains nearly optimal results for most language arcs.
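The stopping rule can be sketched as below, assuming the per-merge gains δ_i have already been collected in a list; the function name and the synthetic decay used in the example are our own:

```python
def stop_index(deltas: list[float], k: float = 0.002, m: int = 5) -> int:
    """Return the number of merges to keep: stop when the moving average
    of the likelihood gains falls to k times the first gain."""
    threshold = k * deltas[0]
    for n in range(len(deltas)):
        # Average the current gain with up to the previous m - 1 values.
        window = deltas[max(0, n - m + 1): n + 1]
        if sum(window) / len(window) <= threshold:
            return n  # keep the first n merges
    return len(deltas)  # threshold never reached
```

For instance, with gains decaying as 1000 / i², a threshold factor k = 0.002 and a window of 5, the criterion fires after a couple dozen merges rather than requiring an externally fixed merge count.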
Another possibility that could be considered for defining the number of operations is to measure the evolution of the likelihood on an external development corpus, and stop the iterations when the likelihood decreases. We implemented this approach, but found that the likelihood on the development corpus increases monotonically for each new unit extracted (up to the maximum number we allowed for the experiments), and thus it does not provide a useful stopping criterion for the algorithm.

Experimental results
We conducted experiments for machine translation in a variety of languages, focusing on morphologically rich ones, using the data available from the latest WMT evaluation campaign where the language pair was used. The languages include Finnish (Fi), German (De) and Turkish (Tr); the corpora statistics are shown in Table 2. It can be seen that we experiment with a wide variety of corpus sizes, ranging from 200K sentences up to nearly 6 million.
For BPE training, the corpora were subsampled to 1M sentences, and a common BPE model was trained for the source and target languages (which also share the same embedding matrix). Experiments were carried out using Sockeye (Hieber et al., 2017) with mostly the default settings, except for a transformer architecture consisting of 20 encoder layers and 2 decoder layers (Hieber et al., 2020). The corpora were tokenized using the Moses tokenizer.

Analysis of BPE segmentation
We will start by focusing on the analysis of the produced sub-word units. Table 3 shows some differences between the statistical approach and the standard approach on words found in the German training data. The first example clearly shows how BPE does not use any linguistic information, even splitting the character pair 'ue', which is an alternative form of the letter 'ü'. In contrast, S-BPE produces a much more morphologically motivated split by separating the 's' at the end, which denotes genitive case. In the next two examples, S-BPE splits the words as derived forms of other words ('stehenden' and 'laeufige', respectively). In the last two examples, S-BPE correctly splits compound words into their individual components. For none of these cases does standard BPE find a linguistically satisfying sub-word decomposition. Note, however, that although S-BPE improves over BPE, a more refined morphological splitting would still be possible for the last two examples.
Revisiting the examples of Table 1, we see that "beklagen" is now split into "be@@ kla@@ gen", and "bewertungsinstrumente" into "bewer@@ tungs@@ instrumente", which do not exactly correspond to the splitting points suggested in Section 1, but are more satisfactory than the BPE segmentation.
In order to quantify these improvements we use the data provided by the Morpho Challenge 2010 shared task (Kurimo et al., 2010). As part of the data of this evaluation, a morphological segmentation of words was provided for English, Finnish and Turkish. We applied the BPE and S-BPE models to the development dataset, and computed the F1-score of the produced segmentations, using the morphological segmentation as reference. For BPE segmentation, we selected the optimal segmentation as measured by the BLEU score on the translations of the WMT test data (see also Section 5.3). The results are shown in Table 4. As English is a common language for all investigated language arcs, we provide results for the different language pairs. It can be seen that S-BPE produces more linguistically motivated splits of English words in five out of six cases. For Finnish, S-BPE also produces better linguistic units, while for Turkish the F1 score is nearly identical. In light of these results we can affirm that in most cases S-BPE produces more linguistically motivated units than standard BPE.
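For concreteness, one common way to score a hypothesized segmentation against a reference is an F1 over split points, as sketched below; note this is a generic illustration, and the Morpho Challenge evaluation defines its own official scoring scheme:

```python
def boundary_f1(hyp: list[str], ref: list[str]) -> float:
    """F1 of internal split points of a hypothesis segmentation vs. a reference.

    Segmentations are lists of sub-word units; a split point is the character
    offset where one unit ends and the next begins.
    """
    def splits(seg):
        points, pos = set(), 0
        for unit in seg[:-1]:  # the end of the last unit is not a split point
            pos += len(unit)
            points.add(pos)
        return points

    h, r = splits(hyp), splits(ref)
    if not h and not r:
        return 1.0  # both segmentations leave the word whole
    tp = len(h & r)
    prec = tp / len(h) if h else 0.0
    rec = tp / len(r) if r else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, scoring the hypothesis "be@@ klag@@ en" against the reference "be@@ klagen" yields a precision of 1/2 and a recall of 1, i.e. an F1 of 2/3.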

Human evaluation
In the previous section we showed how S-BPE produces more linguistically motivated units. Of course, the main question is whether these units help the system produce better translations. We hypothesize that S-BPE mainly affects single words, especially unknown words or words rarely seen in training (e.g. morphological variations of known words), and that this effect is hardly captured by BLEU. Therefore we focus on human evaluation first and will present results with BLEU in the next section. We carried out a human evaluation on English-German and English-Turkish (both directions) with a subset of test sentences where at least one unknown word was found. BLEU did not show significant differences between BPE and S-BPE on this subset of sentences. A blind test was carried out with 7 members of our department, all native speakers of Turkish (1) or German (6) and experts in NLP.
The evaluators were shown a source sentence, together with a highlighted word, and the output of the BPE and S-BPE systems. They had to answer two questions: which system produced a better translation of the highlighted word? And which system produced a better translation of the sentence overall? Table 5 shows examples of the German-to-English test sentences, highlighting the translations of the unknown German word inside the translations of the sentence as produced with BPE and S-BPE. (For completeness we also show the segmentation of the unknown German word.) The results of the human evaluation are shown in Table 6. It can be seen that when BPE and S-BPE produce different translations for the words being evaluated, in the majority of cases human graders prefer the translations produced with S-BPE. In particular, for language arcs involving German, the percentage of sentences for which translations based on S-BPE are preferred over translations based on BPE is 41.3% vs. 23.3% and 41.5% vs. 29.3%. These results are statistically significant (using a paired proportion test, with p < 0.01). German is known for its high lexical productivity, with a large number of morphological variations as well as compound words. In fact, out of 2 000 sentences of the De→En test set, 736 (36.8%) contain unknown words. These results confirm the superior generalization of S-BPE over BPE, both at the word and sentence levels.
For Turkish we also observe a preference for the S-BPE translations of unknown words, as well as a general preference for S-BPE sentences in English-to-Turkish translation, with no clear winner for the reverse direction. The statistical significance of these results is lower than for German, likely due to the smaller number of evaluated sentences.

Translation results
In this section we present global translation results, evaluated using BLEU scores.

Table 5: Examples of De→En translations of sentences containing an unknown German word. For each example we show the reference and, for BPE and S-BPE, the segmentation of the unknown word together with the translation of the sentence.

Reference: After conversion to the new emissions and consumption standard WLTP, there were production losses at Audi, Schot told the 'Heilbronner Stimme'.
BPE (verbrau@@ ch@@ spru@@ ef@@ standard): Due to the changeover to the new exhaust and exhaust test standard WLTP there were production downs at Audi, said the "Heilbronner Stimme".
S-BPE (verbrauch@@ spruef@@ standard): Due to the changeover to the new WLTP exhaust and consumption testing standard, production was lost at Audi, Schot said "Heilbronner Voice".

Reference: There is no turning lane on Haaße Hügel.
BPE (ab@@ bi@@ e@@ ges@@ pur): There is no bending on the Haasse Hill.
S-BPE (ab@@ bie@@ ge@@ spur): There is no turning lane on the Haasse hill.

Reference: In the household goods department on the upper floor, the winged man tips down a cocktail made with espresso called "Golden Eye", which is suited to the festival award.
BPE (haushalt@@ war@@ enab@@ teilung): In the household section on the upper floor, the poultry tick a cocktail prepared with espresso called "Golden Eye", in line with the festival award.
S-BPE (haushalt@@ waren@@ abteilung): In the household goods department on the upper floor the poultry tilts a cocktail prepared with espresso called "Golden Eye", matching the festival award.

Reference: The 46 year old driver of the ambulance ran a red light on Saturday afternoon with the blue lights flashing and siren sounding.

Table 7 compares the BLEU scores obtained with standard BPE for different numbers of merge operations against those of S-BPE, with the stopping parameter set to k = 0.002 and averaging over the last 5 iterations. These values were obtained empirically by doing a grid search over a small set of values and languages. It can be seen that the results obtained for most translation directions are in the range of the optimal result obtained by BPE, with many results not being statistically significantly different, as computed with the bootstrap method (Koehn, 2004) with a 99% confidence interval. One can also consider that there is additional variability due to the random initialization of the NMT optimization algorithm, in our experience in the range of ±0.4 BLEU; we also mark the systems within this range in the table. It is also worth noting that for the language arcs where the stopping criterion is outperformed by the optimized baseline BPE extraction, the difference in performance is smaller than the difference due to choosing an incorrect number of operations with the standard BPE approach.
In conclusion, we do not see a clear difference in BLEU scores between S-BPE and the standard BPE approach using the optimal number of operations. However, as Sections 5.1 and 5.2 show, we obtain focused improvements on single words, which improves the translation quality as perceived by human judges. (We did not perform an extensive search over random initializations for these investigations due to the high number of experiments involved.)

Conclusions and future work
We introduced a statistical extension of BPE extraction. It provides a well-founded objective for unit selection, which also allows the definition of a statistically motivated stopping criterion. Using this approach we achieve nearly optimal machine translation performance as measured with BLEU, while at the same time producing more linguistically motivated units. This leads to better translations of single words, which increases the translation quality as perceived by human judges, especially for sentences containing unseen words. Using the stopping criterion, we approximate the optimal number of units without the costly optimization required by BPE, which involves a full training-evaluation cycle for each tested number of operations.
Regarding future work, we observe that the probability distributions defined for our approach are closely related to those used for n-gram language models. Thus, smoothing methods can be applied to enhance the robustness of the method for unseen events, which opens a wide variety of possible extensions of this work.

Appendix A Full derivation of likelihood increase
Lemma. Given a, b, c such that a > 0, b > a and 0 < c < b, we have:

    (a − c) / (b − c) < a / b    (1)

Proof. By assumption both denominators are positive, hence we can rearrange (1) as b(a − c) < a(b − c). Expanding both sides gives ab − bc < ab − ac, i.e. ac < bc, which holds because c > 0 and a < b.
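The lemma is easy to spot-check numerically; the sampling ranges below are arbitrary:

```python
import random

# Spot-check the lemma: for a > 0, b > a and 0 < c < b,
# (a - c) / (b - c) < a / b.
random.seed(0)
for _ in range(10_000):
    b = random.uniform(1.0, 100.0)
    a = random.uniform(0.001, b * 0.999)  # 0 < a < b
    c = random.uniform(0.001, b * 0.999)  # 0 < c < b
    assert (a - c) / (b - c) < a / b
```

Note that the bound also holds when a − c is negative, since the left-hand side is then negative while a / b is positive.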
Define the count of a sub-word unit s ∈ Σ for a corpus S and a sub-word vocabulary Σ as C_{S,Σ}(s). The likelihood function is then defined as

    L(S, Σ) = Σ_{s ∈ Σ} C_{S,Σ}(s) log p(s)

We are interested in the increase in likelihood at step n:

    ∆L_n(S) = L(S, Σ_n) − L(S, Σ_{n−1})
When adding a new rule (x, y) → xy at step n of the algorithm, thus defining Σ_n, we can express the likelihood increase as

    ∆L_n(S) = Σ_{s ∈ Σ_{n−1}\{x,y}} [ C_{S,Σ_n}(s) log p_n(s) − C_{S,Σ_{n−1}}(s) log p_{n−1}(s) ]
            + Σ_{s ∈ {x,y}} [ C_{S,Σ_n}(s) log p_n(s) − C_{S,Σ_{n−1}}(s) log p_{n−1}(s) ]
            + C_{S,Σ_n}(xy) log p_n(xy)

(as some counts may decrease to 0 when defining a new pair, we use the convention 0 log 0 = 0). We note that for s ∈ Σ_{n−1} \ {x, y} we have C_{S,Σ_n}(s) = C_{S,Σ_{n−1}}(s) and p_n(s) > p_{n−1}(s), as the total number of observations (the denominator of p_n) shrinks after combining two symbols. Thus, for the first term above we have

    Σ_{s ∈ Σ_{n−1}\{x,y}} [ C_{S,Σ_n}(s) log p_n(s) − C_{S,Σ_{n−1}}(s) log p_{n−1}(s) ] > 0 .
This quantity is expected to be small, especially as the number of produced symbols increases. Next, note that the counts of the units involved in the new rule satisfy

    C_{S,Σ_n}(x) = C_{S,Σ_{n−1}}(x) − C_{S,Σ_n}(xy)
    C_{S,Σ_n}(y) = C_{S,Σ_{n−1}}(y) − C_{S,Σ_n}(xy)

since every occurrence of the pair removes one occurrence of x and one of y, while adding one occurrence of the new unit xy to the total.
For the probabilities of x and y, we reduce both the occurrence counts and the total number of events by the same positive amount, which is smaller than the sample size. Hence, by the previous Lemma:

    p_n(x) = C_{S,Σ_n}(x) / C_{S,Σ_n}(·)
           = (C_{S,Σ_{n−1}}(x) − C_{S,Σ_n}(xy)) / (C_{S,Σ_{n−1}}(·) − C_{S,Σ_n}(xy))
           < C_{S,Σ_{n−1}}(x) / C_{S,Σ_{n−1}}(·)
           = p_{n−1}(x)

and similarly for y.