Vocabulary Learning via Optimal Transport for Neural Machine Translation

The choice of token vocabulary affects the performance of machine translation. This paper aims to figure out what makes a good vocabulary and whether we can find the optimal vocabulary without trial training. To answer these questions, we first provide an alternative understanding of vocabulary from the perspective of information theory. It motivates us to formulate the quest of vocabularization (finding the best token dictionary with a proper size) as an optimal transport (OT) problem. We propose VOLT, a simple and efficient solution without trial training. Empirical results show that VOLT outperforms widely used vocabularies in diverse scenarios, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation. For example, VOLT achieves a 70% vocabulary size reduction and a 0.5 BLEU gain on English-German translation. Also, compared to BPE-search, VOLT reduces the search time from 384 GPU hours to 30 GPU hours on English-German translation. Code is available at https://github.com/Jingjing-NLP/VOLT.

The key idea of these approaches is selecting the most frequent sub-words (or word pieces with higher probabilities) as the vocabulary tokens. In information theory, these frequency-based approaches are simple forms of data compression to reduce entropy (Gage, 1994), which makes the resulting corpus easy to learn and predict (Martin and England, 2011;Bentz and Alikaniotis, 2016).
However, the effects of vocabulary size are not sufficiently taken into account, since current approaches only consider frequency (or entropy) as the main criterion. Many previous studies (Sennrich and Zhang, 2019; Ding et al., 2019; Provilkov et al., 2020; Salesky et al., 2020) show that vocabulary size also affects downstream performance, especially on low-resource tasks. Due to the lack of an appropriate inductive bias about size, trial training (namely, traversing all possible sizes) is usually required to search for the optimal size, which incurs high computation costs. For convenience, most existing studies simply adopt widely used settings. For example, 30K-40K is the most popular size setting in all 42 papers of the Conference on Machine Translation (WMT) through 2017 and 2018 (Ding et al., 2019).
In this paper, we propose to explore automatic vocabularization by simultaneously considering entropy and vocabulary size without expensive trial training. Designing such a vocabularization approach is non-trivial for two main reasons. First, it is challenging to find an appropriate objective function to optimize them at the same time. Roughly speaking, the corpus entropy decreases as the vocabulary size increases, which benefits model learning (Martin and England, 2011). On the other hand, too many tokens cause token sparsity, which hurts model learning (Allison et al., 2006). Second, supposing that an appropriate measurement is given, it is still challenging to solve such a discrete optimization problem due to the exponential search space.
To address the above problems, we propose a VOcabulary Learning approach via optimal Transport, VOLT for short. It can give an appropriate vocabulary in polynomial time by considering corpus entropy and vocabulary size. Specifically, given the above insight into the tension between entropy and size, we first borrow the concept of Marginal Utility in economics (Samuelson, 1937) and propose to use Marginal Utility of Vocabularization (MUV) as the measurement. The insight is quite simple: in economics, marginal utility is used to balance the benefit and the cost, and we use MUV to balance the entropy (benefit) and vocabulary size (cost). Higher MUV is expected for Pareto optimality. Formally, MUV is defined as the negative derivative of entropy with respect to vocabulary size. Figure 1 gives an example of marginal utility. Preliminary results verify that MUV correlates with downstream performance on two-thirds of tasks (see Figure 2). Our goal then becomes maximizing MUV in tractable time complexity. We reformulate our discrete optimization objective into an optimal transport problem (Cuturi, 2013) that can be solved in polynomial time by linear programming.
Intuitively, the vocabularization process can be regarded as finding the optimal transport matrix from the character distribution to the vocabulary token distribution. Finally, our proposed VOLT will yield a vocabulary from the optimal transport matrix.
We evaluate our approach on multiple machine translation tasks, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation. Empirical results show that VOLT beats widely-used vocabularies in diverse scenarios. Furthermore, VOLT is a lightweight solution and does not require expensive computation resources. On English-German translation, VOLT only takes 30 GPU hours to find vocabularies, while the traditional BPE-Search solution takes 384 GPU hours.

Related Work
Initially, most neural models were built upon word-level vocabularies (Costa-jussà and Fonollosa, 2016; Vaswani et al., 2017; Zhao et al., 2019). While achieving promising results, word-level vocabularies commonly fail to handle rare words under limited vocabulary sizes.
Researchers have recently proposed several advanced vocabularization approaches, like byte-level approaches (Wang et al., 2020), character-level approaches (Costa-jussà and Fonollosa, 2016; Lee et al., 2017; Al-Rfou et al., 2019), and sub-word approaches (Sennrich et al., 2016; Kudo and Richardson, 2018). Byte-Pair Encoding (BPE) (Sennrich et al., 2016) is proposed to get subword-level vocabularies. The general idea is to merge pairs of frequent character sequences to create sub-word units. Sub-word vocabularies can be regarded as a trade-off between character-level vocabularies and word-level vocabularies. Compared to word-level vocabularies, they decrease the sparsity of tokens and increase the shared features between similar words, which probably have similar semantic meanings, like "happy" and "happier". Compared to character-level vocabularies, they yield shorter sentence lengths without rare words. Following BPE, some variants have recently been proposed, like BPE-dropout (Provilkov et al., 2020), SentencePiece (Kudo and Richardson, 2018), and so on.
Despite promising results, most existing sub-word approaches only consider frequency, while the effects of vocabulary size are neglected. Thus, trial training is required to find the optimal size, which brings high computation costs. More recently, some studies have noticed this problem and proposed practical solutions (Kreutzer and Sokolov, 2018; Cherry et al., 2018; Salesky et al., 2020).

Marginal Utility of Vocabularization
In this section, we propose to find a good vocabulary measurement by considering entropy and size. As introduced in Section 1, it is non-trivial to find an appropriate objective function to optimize them simultaneously. On one side, with the increase of vocabulary size, the corpus entropy is decreased, which benefits model learning (Bentz and Alikaniotis, 2016). On the other side, a large vocabulary causes parameter explosion and token sparsity problems, which hurts model learning (Allison et al., 2006). To address this problem, we borrow the concept of Marginal Utility in economics (Samuelson, 1937) and propose to use Marginal Utility of Vocabularization (MUV) as the optimization objective. MUV evaluates the benefits (entropy) a corpus can get from an increase of cost (size). Higher MUV is expected for higher benefit-cost ratio. Preliminary results verify that MUV correlates with downstream performances on two-thirds of translation tasks (See Figure 2). According to this feature, our goal turns to maximize MUV in tractable time complexity.
Definition of MUV. Formally, MUV represents the negative derivative of entropy with respect to vocabulary size. For simplicity, we estimate MUV in implementation using a smaller vocabulary. Specifically, MUV is calculated as:
$$\mathcal{M}_{v(k+m)} = \frac{-\left(\mathcal{H}_{v(k+m)} - \mathcal{H}_{v(k)}\right)}{m}, \tag{1}$$
where $v(k)$ and $v(k+m)$ are two vocabularies with $k$ and $k+m$ tokens, respectively. $\mathcal{H}_v$ represents the corpus entropy with the vocabulary $v$, which is defined by the sum of token entropy. To avoid the effects of token length, we normalize entropy by the average length of tokens, and the final entropy is defined as:
$$\mathcal{H}_v = -\frac{1}{l_v} \sum_{i \in v} P(i) \log P(i),$$
where $P(i)$ is the relative frequency of token $i$ in the training corpus and $l_v$ is the average length of tokens in vocabulary $v$.
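As a minimal sketch of the two quantities defined above, the following Python functions compute the length-normalized corpus entropy and the MUV between a smaller and a larger vocabulary. The function names and the toy token counts are hypothetical and only serve as an illustration; token length is assumed to be measured in characters, and the average length is taken over vocabulary entries.

```python
import math


def normalized_entropy(token_counts):
    """Corpus entropy divided by the average token length (in characters).

    token_counts maps each vocabulary token to its count in the segmented corpus.
    """
    total = sum(token_counts.values())
    avg_len = sum(len(tok) for tok in token_counts) / len(token_counts)
    entropy = -sum((c / total) * math.log(c / total) for c in token_counts.values())
    return entropy / avg_len


def muv(counts_small, counts_large):
    """Negative change in normalized entropy per added token."""
    m = len(counts_large) - len(counts_small)
    if m <= 0:
        raise ValueError("the second vocabulary must be strictly larger")
    return -(normalized_entropy(counts_large) - normalized_entropy(counts_small)) / m


# Hypothetical toy counts for the same corpus under two segmentations.
small = {"a": 2, "b": 2}
large = {"a": 1, "b": 1, "ab": 1}
print(muv(small, large))
```

In practice the two count dictionaries would come from segmenting the same training corpus with vocabularies of size k and k + m.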

Preliminary Results
To verify the effectiveness of MUV as the vocabulary measurement, we conduct experiments on 45 language pairs from TED and calculate the Spearman correlation score between MUV and BLEU scores. We adopt the same widely used settings for all pairs to avoid the effects of other attributes on BLEU scores, such as model hyper-parameters and training hyper-parameters.
We generate a sequence of vocabularies with incremental sizes via BPE. All experiments use the same hyper-parameters. Two-thirds of pairs show positive correlations, as shown in Figure 2. The median Spearman score is 0.4. We believe that this is a good signal that MUV matters. Please refer to Section 5 for more dataset details and Appendix A for more implementation details.
Given MUV, we have two natural choices to get the final vocabulary: search and learning. In the search-based direction, we can combine MUV with widely used vocabularization solutions. For example, the optimal vocabulary can be obtained by enumerating all candidate vocabularies generated by BPE. While simple and effective, this is not a self-sufficient approach. Furthermore, it still requires a lot of time to generate vocabularies and calculate MUV. To address these problems, we further explore a learning-based solution, VOLT, which covers more vocabulary possibilities. We empirically compare MUV-Search and VOLT in Section 5.
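The search-based direction can be sketched as follows. This is only an illustration under stated assumptions: the function name `muv_search` and the (size, entropy) values are hypothetical, the entropies are assumed to be already length-normalized, and each candidate's MUV is computed against the next smaller candidate.

```python
def muv_search(candidates):
    """Pick the candidate vocabulary size with the highest MUV.

    candidates is a list of (vocab_size, normalized_entropy) pairs with
    distinct sizes. Returns (best_size, best_muv).
    """
    ordered = sorted(candidates)
    best_size, best_muv = None, float("-inf")
    for (k_prev, h_prev), (k, h) in zip(ordered, ordered[1:]):
        score = -(h - h_prev) / (k - k_prev)  # MUV relative to the previous size
        if score > best_muv:
            best_size, best_muv = k, score
    return best_size, best_muv


# Hypothetical entropies for BPE vocabularies of increasing size.
print(muv_search([(1000, 4.0), (2000, 3.5), (3000, 3.4)]))
```

The cost noted in the text remains: every candidate vocabulary still has to be generated and its entropy measured before the comparison can be made.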

Maximizing MUV via Optimal Transport
This section describes the details of the proposed approach. We first show the general idea of VOLT in Section 4.1, then describe the optimal transport solution in Section 4.2, followed by the implementation details in Section 4.3.

Overview
We formulate vocabulary construction as a discrete optimization problem whose target is to find the vocabulary with the highest MUV according to Eq. 1. However, the vocabulary is discrete, and such a discrete search space is too large to traverse, which makes the discrete optimization intractable.
In this paper, we simplify the original discrete optimization problem by searching for the optimal vocabulary among vocabularies with fixed sizes. Intuitively, MUV is the negative first derivative of entropy with respect to the vocabulary size (Eq. 1), and we introduce an auxiliary variable S (an incremental integer sequence) to approximate the computation by only computing MUV between vocabulary sizes that are adjacent integers in S.
Formally, $S = \{i, 2i, \ldots, (t-1)i, \ldots\}$, where each timestep $t$ represents a set of vocabularies with size up to $S[t]$. For any vocabulary, its MUV score can be calculated based on a vocabulary from its previous timestep. With sequence $S$, the target of finding the optimal vocabulary $v(t)$ with the highest MUV can be formulated as:
$$\arg\max_{v(t-1) \in \mathbb{V}_{S[t-1]},\; v(t) \in \mathbb{V}_{S[t]}} \mathcal{M}_{v(t)} = \arg\max_{v(t-1) \in \mathbb{V}_{S[t-1]},\; v(t) \in \mathbb{V}_{S[t]}} -\frac{1}{i}\left[\mathcal{H}_{v(t)} - \mathcal{H}_{v(t-1)}\right],$$
where $\mathbb{V}_{S[t-1]}$ and $\mathbb{V}_{S[t]}$ are two sets containing all vocabularies with size upper bounds $S[t-1]$ and $S[t]$. Due to the exponential search space, we propose to optimize its lower bound:
$$\arg\max_{t} \; -\frac{1}{i}\left[\max_{v(t) \in \mathbb{V}_{S[t]}} \mathcal{H}_{v(t)} - \max_{v(t-1) \in \mathbb{V}_{S[t-1]}} \mathcal{H}_{v(t-1)}\right], \tag{3}$$
where $i$ is the size difference between the vocabularies at timesteps $t-1$ and $t$. MUV requires the size difference as a denominator. Based on this equation, the whole solution is split into two steps: 1) searching for the optimal vocabulary with the highest entropy at each timestep $t$; 2) enumerating all timesteps and outputting the vocabulary corresponding to the timestep satisfying Eq. 3.
The first step of our approach is to search for the vocabulary with the highest entropy from $\mathbb{V}_{S[t]}$. Formally, the goal is to find a vocabulary $v(t)$ such that entropy is maximized,
$$\arg\max_{v(t) \in \mathbb{V}_{S[t]}} -\frac{1}{l_{v(t)}} \sum_{i \in v(t)} P(i) \log P(i), \tag{4}$$
where $l_{v(t)}$ is the average length of tokens in $v(t)$ and $P(i)$ is the probability of token $i$. However, notice that this problem is in general intractable due to the extensive vocabulary size. Therefore, we instead propose a relaxation in the formulation of discrete optimal transport, which can then be solved efficiently via the Sinkhorn algorithm (Cuturi, 2013).
Figure 3: An illustration of vocabulary construction from a transport view. Each transport matrix represents a vocabulary. The transport matrix decides how many chars are transported to token candidates. The tokens with zero chars will not be added into the vocabulary.
Intuitively, we can imagine vocabulary construction as a transport process that transports chars into token candidates with the number up to S[t]. As shown in Figure 3, the number of chars is fixed, and not all token candidates can get enough chars. Each transport matrix can build a vocabulary by collecting tokens with chars. Different transport matrices bring different transport costs. The target of optimal transport is to find a transport matrix to minimize the transfer cost, i.e., negative entropy in our setting.

Vocabularization via Optimal Transport
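Given a set of vocabularies $\mathbb{V}_{S[t]}$, we want to find the vocabulary with the highest entropy. Consequently, the objective function in Eq. 4 becomes
$$\min_{v \in \mathbb{V}_{S[t]}} \frac{1}{l_v} \sum_{i \in v} P(i) \log P(i), \quad \text{s.t.} \quad P(i) = \frac{\mathrm{Token}(i)}{\sum_{i \in v} \mathrm{Token}(i)}, \quad l_v = \frac{\sum_{i \in v} \mathrm{len}(i)}{|v|},$$
where $\mathrm{Token}(i)$ is the frequency of token $i$ in the vocabulary $v$ and $\mathrm{len}(i)$ represents the length of token $i$. Notice that both the distribution $P(i)$ and the average length $l_v$ depend on the choice of $v$.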
Objective Approximation. To obtain a tractable lower bound of entropy, it suffices to give a tractable upper bound of the above objective function. We adopt merging rules to segment raw text, similar to BPE, where two consecutive tokens are merged into one if the merged one is in the vocabulary. To this end, let $T \in \mathbb{V}_{S[t]}$ be the vocabulary containing the top $S[t]$ most frequent tokens, $C$ be the set of chars, and $|T|$, $|C|$ be their sizes, respectively. Since $T$ is an element of $\mathbb{V}_{S[t]}$, clearly, we have
$$\min_{v \in \mathbb{V}_{S[t]}} \frac{1}{l_v} \sum_{i \in v} P(i) \log P(i) \le \frac{1}{l_T} \sum_{i \in T} P(i) \log P(i).$$
Here we start from the upper bound of the above objective function, that is, $\frac{1}{l_T} \sum_{i \in T} P(i) \log P(i)$, and then search for a refined token set from $T$. In this way, we reduce the search space to the subsets of $T$. Let $P(i, j)$ be the joint probability distribution of the tokens and chars that we want to learn. Then we have
$$\sum_{i \in T} P(i) \log P(i) = \underbrace{\sum_{i \in T} \sum_{j \in C} P(i, j) \log P(i, j)}_{\mathcal{L}_1} + \underbrace{\sum_{i \in T} \sum_{j \in C} P(i, j) \left(-\log P(j|i)\right)}_{\mathcal{L}_2}. \tag{6}$$
The details of the proof can be found in Appendix C. Since $\mathcal{L}_1$ is nothing but the negative entropy of the joint probability distribution $P(i, j)$, we shall denote it as $-H(P)$.
Let $D$ be the $|C| \times |T|$ matrix whose $(i, j)$-th entry is given by $-\log P(j|i)$, and let $P$ be the joint probability matrix; then we can write
$$\mathcal{L}_2 = \langle P, D \rangle = \sum_{i} \sum_{j} P(i, j) D(i, j).$$
In this way, Eq. 6 can be reformulated as the following objective function, which has the same form as the objective function in optimal transport:
$$\min_{P} \; \langle P, D \rangle - H(P).$$
Setup of OT. From the perspective of optimal transport, $P$ can be regarded as the transport matrix, and $D$ can be regarded as the distance matrix. Intuitively, optimal transport is about finding the best way of transporting mass from the char distribution to the target token distribution with the minimum work defined by $\langle P, D \rangle$.
To ensure the validity of transport solutions, we add the following constraints. First, to avoid invalid transport between char $j$ and token $i$, we set the distance to $+\infty$ if the target token $i$ does not contain the char $j$. Otherwise, we use $\frac{1}{\mathrm{len}(i)}$ to estimate $P(j|i)$, where $\mathrm{len}(i)$ is the length of token $i$. Formally, the distance matrix is defined as
$$D(i, j) = \begin{cases} -\log P(j|i) = +\infty, & \text{if } j \notin i, \\ -\log P(j|i) = -\log \frac{1}{\mathrm{len}(i)}, & \text{otherwise.} \end{cases}$$
Furthermore, the number of chars is fixed, and we set the sum of each row in the transport matrix to the probability of char $j$. The upper bound of the char requirements for each token is fixed, and we set the sum of each column in the transport matrix to the probability of token $i$. Formally, the constraints are defined as:
$$\Big|\sum_{j} P(i, j) - P(i)\Big| \le \epsilon \quad \text{and} \quad \sum_{i} P(i, j) = P(j).$$
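The distance matrix described above can be built in a few lines. The sketch below is an illustration only: the function name `build_distance_matrix` and the toy inputs are hypothetical, rows are assumed to index chars and columns to index token candidates, and the entry for a valid pair uses the identity -log(1/len) = log(len).

```python
import math


def build_distance_matrix(chars, tokens):
    """Return D with D[j][i] = log(len(token_i)) if char_j occurs in token_i, else +inf.

    Rows index characters and columns index token candidates.
    """
    return [
        [math.log(len(tok)) if ch in tok else math.inf for tok in tokens]
        for ch in chars
    ]


# Hypothetical toy example with two chars and three token candidates.
D = build_distance_matrix(["a", "b"], ["a", "b", "ab"])
for row in D:
    print(row)
```

Single-character tokens get distance zero under this estimate, and any pair where the token does not contain the char is ruled out by the infinite distance.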
Given transport matrix $P$ and distance matrix $D$, the final objective can be formulated as:
$$\arg\min_{P} \; -H(P) + \langle P, D \rangle, \quad \text{s.t.} \quad \sum_{i} P(i, j) = P(j), \quad \Big|\sum_{j} P(i, j) - P(i)\Big| \le \epsilon,$$
with small $\epsilon > 0$. Figure 4 shows the details of the optimal transport solution. Strictly speaking, this is an unbalanced entropy-regularized optimal transport problem. Nonetheless, we can still use the generalized Sinkhorn algorithm to efficiently find the target vocabulary, as detailed in Section 4.6 of Peyré and Cuturi (2019). The algorithm details are shown in Algorithm 1. At each timestep $t$, we can generate a new vocabulary associated with entropy scores based on the transport matrix $P$. Finally, we collect these vocabularies associated with entropy scores, and output the vocabulary satisfying Eq. 3.
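To give a feel for how such a problem is solved, here is a minimal pure-Python sketch of the standard balanced Sinkhorn iterations for entropy-regularized optimal transport (Cuturi, 2013). It is a simplification, not the generalized unbalanced variant used by VOLT: both marginals are enforced as equalities, and the regularization weight `gamma`, the function name, and the toy marginals are assumptions made for illustration.

```python
import math


def sinkhorn(dist, row_marginal, col_marginal, gamma=1.0, n_iters=200):
    """Balanced entropy-regularized OT via Sinkhorn scaling.

    dist[r][c] may be math.inf, which forces the plan entry to zero.
    Returns the transport plan as a list of rows.
    """
    n_rows, n_cols = len(dist), len(dist[0])
    # Gibbs kernel; exp(-inf) evaluates to 0.0, which blocks invalid pairs.
    kernel = [[math.exp(-d / gamma) for d in row] for row in dist]
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        for r in range(n_rows):
            denom = sum(kernel[r][c] * v[c] for c in range(n_cols))
            u[r] = row_marginal[r] / denom if denom > 0 else 0.0
        for c in range(n_cols):
            denom = sum(kernel[r][c] * u[r] for r in range(n_rows))
            v[c] = col_marginal[c] / denom if denom > 0 else 0.0
    return [[u[r] * kernel[r][c] * v[c] for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical toy problem: chars (a, b) as rows, tokens (a, b, ab) as columns.
dist = [[0.0, math.inf, math.log(2)],
        [math.inf, 0.0, math.log(2)]]
plan = sinkhorn(dist, row_marginal=[0.5, 0.5], col_marginal=[0.3, 0.3, 0.4])
for row in plan:
    print([round(x, 4) for x in row])
```

Relaxing the token-side equality to the inequality constraint in the text requires the generalized scaling updates described by Peyré and Cuturi (2019), which this sketch does not implement.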

Implementation
Algorithm 1 lists the process of VOLT. First, we rank all token candidates according to their frequencies. For simplicity, we adopt BPE-generated tokens (e.g., BPE-100K) as the token candidates. It is important to note that any segmentation algorithm can be used to initialize token candidates. Experiments show that different initialization approaches lead to similar results. We simply adopt BPE-100K for bilingual translation and BPE-300K for multilingual translation in this work. All token candidates with their probabilities are then used to initialize L in Algorithm 1.
Figure 4: The details of optimal transport. The objective function is the sum of negative entropy and transport cost. Each element D(i, j) in the distance matrix is the negative log of 1/n, where n is the length of token i. It defines the distance between char j and token i. To avoid invalid transport between char j and token i, we set the distance to infinity if the target token i does not contain the char j.
The size of the incremental integer sequence S is a hyper-parameter and is set to (1K, ..., 10K) for bilingual translation and (40K, ..., 160K) for multilingual settings. At each timestep, we can get the vocabulary with the maximum entropy based on the transport matrix. Because the constraints are relaxed, illegal transport cases inevitably arise and must be handled. We remove tokens whose transported char mass is less than 0.001 times their token frequency. Finally, we enumerate all timesteps and select the vocabulary satisfying Eq. 3 as the final vocabulary.
After generating the vocabulary, VOLT uses a greedy strategy to encode text, similar to BPE. To encode text, it first splits sentences into character-level tokens. Then, it merges two consecutive tokens into one token if the merged one is in the vocabulary. This process keeps running until no tokens can be merged. Out-of-vocabulary tokens are split into smaller tokens.
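One way to read the token-removal rule is sketched below. This is an interpretation, not the released implementation: a token is kept only if the total char mass it receives in the transport matrix is positive and at least `ratio` times its own frequency. The function name, the argument layout (rows are chars, columns are tokens), and the toy numbers are all hypothetical.

```python
def select_tokens(tokens, token_freq, plan, ratio=0.001):
    """Keep tokens whose received transport mass is at least ratio * frequency.

    plan[j][i] is the mass transported from char j to token i.
    """
    kept = []
    for i, tok in enumerate(tokens):
        received = sum(row[i] for row in plan)
        if received > 0 and received >= ratio * token_freq[i]:
            kept.append(tok)
    return kept


# Hypothetical toy plan: the token "ab" receives almost no mass and is dropped.
tokens = ["a", "b", "ab"]
token_freq = [0.3, 0.3, 0.4]
plan = [[0.3, 0.0, 0.0001],
        [0.0, 0.3, 0.0001]]
print(select_tokens(tokens, token_freq, plan))
```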
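The greedy encoding step can be sketched as follows. The text does not specify the order in which candidate merges are tried, so this sketch assumes a simple left-to-right scan that is repeated until no merge applies; the function name and the toy vocabulary are hypothetical.

```python
def encode_word(word, vocab):
    """Greedily merge adjacent tokens while the merged string is in vocab."""
    tokens = list(word)  # start from character-level tokens
    merged = True
    while merged:
        merged = False
        i = 0
        while i < len(tokens) - 1:
            candidate = tokens[i] + tokens[i + 1]
            if candidate in vocab:
                tokens[i:i + 2] = [candidate]  # merge in place and retry at i
                merged = True
            else:
                i += 1
    return tokens


# Hypothetical vocabulary of multi-character tokens.
print(encode_word("lower", {"lo", "low", "er"}))
```

A word with no applicable merges simply stays as a sequence of characters, which matches the fallback of splitting out-of-vocabulary tokens into smaller tokens.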

Experiments
To evaluate the performance of VOLT, we conduct experiments on three datasets, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation.

Settings
We run experiments on the following machine translation datasets. See Appendix B for more model and training details.
2. TED bilingual dataset: We include two settings: X-to-English translation and English-to-X translation. We choose the 12 language pairs with the most training data. We use language codes according to the ISO-639-1 standard † . TED data is provided by Qi et al. (2018).
3. TED multilingual dataset: We conduct experiments with 52 language pairs in a many-to-English setting. The network is trained on all language pairs. We adopt the same preprocessing pipeline as in the WMT-14 En-De dataset.

Main Results
Vocabularies   BPE affects the model performance in low-resource settings. They conduct experiments on four language pairs and find that smaller vocabularies are more suitable for low-resource datasets. For Transformer architectures, the optimal vocabulary size is less than 4K, around up to 2K merge actions. We compare VOLT and BPE-1K on an X-to-English bilingual setting. The results are shown in Table 2. We can see that VOLT can find a good vocabulary on par with heuristically searched vocabularies in terms of BLEU scores. Note that BPE-1K is selected based on plenty of experiments. In contrast, VOLT only requires one trial for evaluation and only takes 0.5 CPU hours plus 30 GPU hours to find the optimal vocabulary.
VOLT Works Well on Multilingual MT Settings. We conduct a multilingual experiment. These languages come from multiple language families and have diverse characters. We compare VOLT with BPE-60K, the most popular setting in multilingual translation tasks. Table 3 lists the full results. The size of the searched vocabulary is around 110K. As we can see, VOLT achieves better BLEU scores on most pairs.
VOLT is a Green Vocabularization Solution. One advantage of VOLT lies in its low resource consumption. We compare VOLT with BPE-Search, a method to select the best one from a BPE-generated vocabulary set based on their BLEU scores. The results are shown in Table 4. In BPE-Search, we first define a vocabulary set including BPE-1K, BPE-2K, BPE-3K, BPE-4K, BPE-5K, BPE-6K, BPE-7K, BPE-8K, BPE-9K, BPE-10K, BPE-20K, BPE-30K. Then, we run full experiments to select the best vocabulary. Table 4 demonstrates that VOLT is a lightweight solution that can find a competitive vocabulary within 0.5 hours on a single CPU, compared to BPE-Search that takes hundreds of GPU hours. The cost of BPE-Search is the sum of the training time on all vocabularies. Furthermore, we also compare VOLT with MUV-Search as introduced in Section 3. MUV-Search is a method that combines MUV and popular approaches by selecting the vocabulary with the highest MUV as the final vocabulary.

Discussion
We conduct more experiments to answer the following questions: 1) can a baseline beat strong approaches with a better vocabulary; 2) can VOLT beat recent vocabulary solutions, like SentencePiece; 3) can VOLT work on diverse architectures?
A Simple Baseline with a VOLT-generated Vocabulary Reaches SOTA Results. We compare VOLT and several strong approaches on the En-De dataset. Table 5 shows surprisingly good results. Compared to the approaches in the top block, VOLT achieves almost the best performance with a much smaller vocabulary. These results demonstrate that a simple baseline can achieve good results with a well-defined vocabulary.
Table 5 (top block):
Approach                 BLEU   Parameters
(Vaswani et al., 2017)   28.4   210M
(Shaw et al., 2018)      29.2   213M
(Ott et al., 2018)       29.3   210M
(So et al., 2019)        29.8   218M
(Liu et al., 2020)       30.
VOLT Beats SentencePiece and WordPiece. SentencePiece and WordPiece are two variants of sub-word vocabularies. We also compare our approach with them on WMT-14 En-De translation to evaluate the effectiveness of VOLT. The middle block of Table 5 lists the results of SentencePiece and WordPiece. We implement these two approaches with the default settings. We can observe that VOLT outperforms SentencePiece and WordPiece by a large margin, with an improvement of over 1 BLEU.
VOLT Works on Various Architectures. This work mainly uses Transformer-big in experiments. We are curious about whether VOLT works on other architectures. We take WMT-14 En-De translation as an example and implement a Convolutional Seq2Seq model. The network uses the default settings from Fairseq ‡ . We set the maximum number of epochs to 100 and average the last five models as the final network for evaluation. Table 6 demonstrates that vocabularies searched by VOLT also work on Convolutional Seq2Seq, with competitive BLEU but a much smaller size. In this work, we verify the effectiveness of VOLT on architectures with standard sizes. Since model capacity is also an important factor for BLEU scores, we recommend larger vocabularies associated with more embedding parameters for small architectures.
VOLT can Bring Slight Speedup During Training. We evaluate the running time for VOLT vocabulary and BPE-30K on WMT En-De translation. The model with VOLT-searched vocabulary (11.6k tokens) can process 133 sentences per second, while the model with BPE-30K (33.6k tokens) only executes 101 sentences per second. All experiments run on the same environment (2 Tesla-V100-GPUs + 1 Gold-6130-CPU), with the same beam size for decoding. The speedup mainly comes from larger batch size with reduced embedding parameters. We also find that although VOLT reduces the Softmax computations, it does not significantly boost the Softmax running time due to optimized parallel computation in GPUs.
VOLT Vocabularies and BPE Vocabularies are Highly Overlapped. For simplicity, VOLT starts from BPE-segmented tokens. We take WMT En-De as an example to see the difference between the VOLT vocabulary and the BPE vocabulary. The size of the VOLT vocabulary is around 9K, and we adopt the BPE-9K vocabulary for comparison. We find that these two vocabularies are highly overlapped, especially for high-frequency words. They also have similar downstream performance. Therefore, from an empirical perspective, BPE with the VOLT size is also a good choice.
‡ https://github.com/pytorch/fairseq/tree/master/examples/translation

Conclusion
In this work, we propose a new vocabulary search approach without trial training. The whole framework starts from an information-theoretic understanding. According to this understanding, we formulate vocabularization as a two-step discrete optimization objective and propose a principled optimal transport solution, VOLT. Experiments show that VOLT can effectively find a well-performing vocabulary in diverse settings.
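Appendix C: Proofs for Eq. 6
$$\begin{aligned}
\sum_{i \in T} P(i) \log P(i) &= \sum_{i \in T} \sum_{j \in C} P(i, j) \log P(i) \\
&= \sum_{i \in T} \sum_{j \in C} P(i, j) \log \left( P(i, j) \cdot \frac{P(i)}{P(i, j)} \right) \\
&= \sum_{i \in T} \sum_{j \in C} P(i, j) \log P(i, j) + \sum_{i \in T} \sum_{j \in C} P(i, j) \log \frac{P(i)}{P(i, j)} \\
&= \underbrace{\sum_{i \in T} \sum_{j \in C} P(i, j) \log P(i, j)}_{\mathcal{L}_1} + \underbrace{\sum_{i \in T} \sum_{j \in C} P(i, j) \left(-\log P(j|i)\right)}_{\mathcal{L}_2}
\end{aligned}$$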
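The identity can also be checked numerically. The short sketch below evaluates both sides for a small joint distribution; the function name and the toy matrix are hypothetical, rows are assumed to index tokens and columns chars, and zero entries are skipped using the usual convention 0 log 0 = 0.

```python
import math


def decomposition_terms(joint):
    """Return (lhs, l1, l2) for a joint distribution joint[i][j] = P(i, j).

    lhs = sum_i P(i) log P(i)
    l1  = sum_ij P(i, j) log P(i, j)
    l2  = sum_ij P(i, j) * (-log P(j | i))
    """
    lhs = l1 = l2 = 0.0
    for row in joint:
        p_i = sum(row)  # marginal probability of token i
        if p_i > 0:
            lhs += p_i * math.log(p_i)
        for p_ij in row:
            if p_ij > 0:
                l1 += p_ij * math.log(p_ij)
                l2 += p_ij * -math.log(p_ij / p_i)
    return lhs, l1, l2


# Hypothetical joint distribution over three tokens (rows) and two chars (columns).
lhs, l1, l2 = decomposition_terms([[0.3, 0.0], [0.2, 0.2], [0.0, 0.3]])
print(lhs, l1 + l2)
```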