Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training

Learning a multilingual and multi-domain translation model is challenging because heterogeneous and imbalanced data make the model converge inconsistently across corpora in real-world settings. One common practice is to adjust the share of each corpus in training, so that the learning process is balanced and low-resource cases can benefit from high-resource ones. However, automatic balancing methods usually depend on intra- and inter-dataset characteristics, which are usually unknown or require human priors. In this work, we propose an approach, MultiUAT, that dynamically adjusts the training data usage based on the model's uncertainty on a small set of trusted clean data for multi-corpus machine translation. We experiment with two classes of uncertainty measures on multilingual (16 languages with 4 settings) and multi-domain settings (4 for in-domain and 2 for out-of-domain on English-German translation) and demonstrate that MultiUAT substantially outperforms its baselines, including both static and dynamic strategies. We analyze cross-domain transfer and show the deficiency of static and similarity-based methods.


Introduction
Text corpora are commonly collected from several different sources in different languages, raising the problem of learning an NLP system from heterogeneous corpora, such as multilingual models (Wu and Dredze, 2019; Arivazhagan et al., 2019; Aharoni et al., 2019; Freitag and Firat, 2020; Arthur et al., 2021) and multi-domain models (Daumé III, 2007; Li et al., 2019; Deng et al., 2020; Jiang et al., 2020). There is a strong demand to deploy a unified model for all the languages and domains.
* Work done during the internship at Huawei Noah's Ark Lab.
One common issue in training models across corpora is that data from a variety of corpora are both heterogeneous (different corpora reveal different linguistic properties) and imbalanced (the accessibility of training data varies across corpora). The standard practice to address this issue is to adjust the training data distribution heuristically by up-sampling the training data from low-resource languages/domains (LRLs/LRDs) (Arivazhagan et al., 2019; Conneau et al., 2020). Arivazhagan et al. (2019) rescale the training data distribution with a heuristic temperature term and demonstrate that the ideal temperature can substantially improve the overall performance. However, the optimal value for such heuristics is both hard to find and varies from one experimental setting to another (Wang and Neubig, 2019; Wang et al., 2020a,b). Wang et al. (2020a) and Wang et al. (2020b) hypothesize that training instances that are similar to the validation set are more beneficial to the evaluation performance and propose a general reinforcement-learning framework, Differentiable Data Selection (DDS), that automatically adjusts the importance of data points, whose reward is the cosine similarity between the gradients on a small set of trusted clean data and on the training data. They instantiate this framework on multilingual NMT, known as MULTIDDS, to dynamically weigh the importance of language pairs. Both the hypothesis and the proposed approach rely on the assumption that knowledge learned from one corpus can always be beneficial to the other corpora.
However, this assumption does not always hold. If the knowledge learned from one corpus cannot be transferred easily or is useless to the other corpora, this approach fails. Unlike cosine similarity, model uncertainty is free from the aforementioned assumption on cross-corpus transfer. From a Bayesian viewpoint, the model parameters can be considered as a random variable that describes the dataset. If one dataset is well described by the model parameters, its corresponding model uncertainty is low, and vice versa. This property makes the model uncertainty an ideal option to weigh the datasets.
In this work, we propose an approach, MULTIUAT, that leverages the model uncertainty as the reward to dynamically adjust the sampling probability distribution over multiple corpora. We consider the model parameters as a random variable that describes the multiple training corpora. If one corpus is well described by the model compared with the other corpora, we devote more training effort to the other, poorly described corpora. We conduct extensive experiments on multilingual NMT (16 languages with 4 settings) and multi-domain NMT (4 for in-domain and 2 for out-of-domain), comparing our approach with heuristic static strategies and a dynamic strategy. In multilingual NMT, we improve the overall performance by +0.83 to +1.52 BLEU across the 4 settings, compared with the best baseline. In multi-domain NMT, our approach improves the in-domain overall performance by +0.58 BLEU compared with the best baseline and achieves the second best out-of-domain overall performance. We also empirically illustrate the vulnerability of cosine similarity as the reward when training on multiple corpora.

Preliminaries
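Standard NMT A standard NMT model, parameterized by $\boldsymbol{\theta}$, is commonly trained on one language pair $D^{o}_{trn} = \{(\mathbf{x}, \mathbf{y})_i\}_{i=1}^{M}$ from one domain. The objective is to minimize the negative log-likelihood of the training data with respect to $\boldsymbol{\theta}$:
$$\mathcal{L}_s(D^{o}_{trn}; \boldsymbol{\theta}) = -\sum_{i=1}^{M} \log p(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\theta}). \tag{1}$$
Multi-corpus NMT Both multilingual NMT and multi-domain NMT can be summarized as multi-corpus NMT, which aims to build a unified translation system that maximizes the overall performance across all the language pairs or domains. Formally, let us assume we are given a number of datasets $D_{trn} = \{D^{j}_{trn}\}_{j=1}^{N}$ from $N$ language pairs or domains, in which $D^{j}_{trn} = \{(\mathbf{x}, \mathbf{y})^{j}_{i}\}_{i=1}^{M_j}$, where $M_j$ is the size of the $j$-th language/domain. Similar to Equation 1, a simple way of training a multi-corpus NMT model is to treat all instances equally:
$$\mathcal{L}(D_{trn}; \boldsymbol{\theta}) = -\sum_{j=1}^{N} \sum_{i=1}^{M_j} \log p(\mathbf{y}^{j}_{i} \mid \mathbf{x}^{j}_{i}; \boldsymbol{\theta}). \tag{2}$$
Heuristic strategy for multi-corpus training In practice, Equation 2 can be viewed as training using mini-batch sampling according to the proportion of these corpora, $q(n) = M_n / \sum_{i=1}^{N} M_i$, and thus we minimize:
$$\mathcal{L}(D_{trn}; \boldsymbol{\theta}, q(n)) = \mathbb{E}_{n \sim q(n)} \left[ \mathcal{L}_s(D^{n}_{trn}; \boldsymbol{\theta}) \right]. \tag{3}$$
However, this simple training method does not work well in real cases, where low-resource tasks are under-trained. A heuristic static strategy is to adjust the proportion exponentiated by a temperature term $\tau$ (Arivazhagan et al., 2019; Conneau et al., 2020):
$$q_{\tau}(n) = \frac{q(n)^{1/\tau}}{\sum_{i=1}^{N} q(i)^{1/\tau}}. \tag{4}$$
The loss function for multi-corpus training can then be re-formulated as:
$$\mathcal{L}(D_{trn}; \boldsymbol{\theta}, q_{\tau}(n)) = \mathbb{E}_{n \sim q_{\tau}(n)} \left[ \mathcal{L}_s(D^{n}_{trn}; \boldsymbol{\theta}) \right]. \tag{5}$$
Specifically, $\tau = 1$ or $\tau = \infty$ is equivalent to proportional (Equation 2) or uniform sampling, respectively.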
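The effect of the temperature term can be checked numerically. The following is a minimal sketch, assuming the common form $q_{\tau}(n) \propto q(n)^{1/\tau}$ so that $\tau = 1$ gives proportional sampling and $\tau = \infty$ gives uniform sampling; the function name and the corpus sizes are hypothetical and chosen only for illustration.

```python
def temperature_sampling(sizes, tau):
    """Return the temperature-scaled sampling distribution over corpora."""
    total = sum(sizes)
    q = [m / total for m in sizes]  # proportional distribution q(n)
    if tau == float("inf"):
        # Limit tau -> infinity: every corpus is sampled equally often.
        return [1.0 / len(sizes)] * len(sizes)
    scaled = [p ** (1.0 / tau) for p in q]
    norm = sum(scaled)
    return [s / norm for s in scaled]


# Hypothetical corpus sizes: one high-resource and two low-resource corpora.
sizes = [1_000_000, 100_000, 10_000]
proportional = temperature_sampling(sizes, tau=1)
smoothed = temperature_sampling(sizes, tau=5)
uniform = temperature_sampling(sizes, tau=float("inf"))
```

A larger temperature moves probability mass from the largest corpus to the smaller ones, which is why the best value depends on how imbalanced the corpora are.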
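Differentiable Data Selection (DDS) Wang et al. (2020a) propose a general framework that automatically weighs training instances to improve the performance while relying on an independent set of held-out data $D_{dev}$. Their framework consists of two major components, the model $\boldsymbol{\theta}$ and the scorer network $\boldsymbol{\psi}$. The scorer network $\boldsymbol{\psi}$ is trained to assign a sampling probability to each training instance, denoted as $p_{\boldsymbol{\psi}}(\mathbf{x}, \mathbf{y})$, based on its contribution to the validation performance. A training instance that contributes more to the validation performance is assigned a higher probability and is more likely to be used for updating the model $\boldsymbol{\theta}$. This strategy aims to maximize the overall performance over $D_{dev}$ and is expected to generalize well on the unseen $D_{tst}$ under the assumption that $D_{dev}$ and $D_{tst}$ are independent and identically distributed. Therefore, with $\mathcal{L}$ denoting the negative log-likelihood loss, the objective can be formulated as:
$$\boldsymbol{\psi}^{*} = \arg\min_{\boldsymbol{\psi}} \mathcal{L}(D_{dev}; \boldsymbol{\theta}^{*}(\boldsymbol{\psi})), \quad \boldsymbol{\theta}^{*}(\boldsymbol{\psi}) = \arg\min_{\boldsymbol{\theta}} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim p_{\boldsymbol{\psi}}(\mathbf{x}, \mathbf{y})} \left[ -\log p(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta}) \right], \tag{6}$$
and $\boldsymbol{\psi}$ and $\boldsymbol{\theta}$ are updated iteratively using bilevel optimization (Colson et al., 2007; von Stackelberg et al., 2011).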

Methodology
In this work, we leverage the idea of DDS under the multi-corpus scenarios. We utilize a differentiable domain/language scorer $\boldsymbol{\psi}$ to weigh the training corpora. To learn $\boldsymbol{\psi}$, we exploit the model uncertainty to measure the model's ability over the target corpus. Below, we elaborate on the details of our method.

Model Uncertainty
Model uncertainty can be a measure that indicates whether the model parameters $\boldsymbol{\theta}$ are able to describe the data distribution well (Kendall and Gal, 2017; Dong et al., 2018; Xiao and Wang, 2019). Bayesian neural networks, which model $\boldsymbol{\theta}$ as a probability distribution with constant input and output, can be used for quantifying the model uncertainty (Buntine and Weigend, 1991). From the Bayesian point of view, $\boldsymbol{\theta}$ is interpreted as a random variable with the prior $p(\boldsymbol{\theta})$. Given a dataset $D$, the posterior $p(\boldsymbol{\theta} \mid D)$ can be obtained via Bayes' rule. However, exact Bayesian inference is intractable for neural networks, so it is common to use an approximation $q(\boldsymbol{\theta})$ to the true posterior $p(\boldsymbol{\theta} \mid D)$. Several variational inference methods have been proposed (Graves, 2011; Blundell et al., 2015; Gal and Ghahramani, 2016).
In this work, we leverage Monte Carlo Dropout (Gal and Ghahramani, 2016) to obtain samples of the sentence-level translation probability. To quantify the model uncertainty when the model makes predictions, we treat the sentence-level translation probability as a random variable. We run $K$ forward passes with a random subset of the model parameters $\boldsymbol{\theta}$ deactivated, which is equivalent to drawing samples from the random variable, and average the samples as the estimate of the model uncertainty ($K$ is set to 30 in our work). Considering an ensemble of models $\{p_{\boldsymbol{\theta}_k}(\mathbf{y} \mid \mathbf{x})\}_{k=1}^{K}$ sampled from the approximate posterior $q(\boldsymbol{\theta})$, the predictive posterior can be obtained by taking the expectation over multiple inferences:
$$p(\mathbf{y} \mid \mathbf{x}, D) \approx \mathbb{E}_{\boldsymbol{\theta} \sim q(\boldsymbol{\theta})} \left[ p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \right] \approx \frac{1}{K} \sum_{k=1}^{K} p_{\boldsymbol{\theta}_k}(\mathbf{y} \mid \mathbf{x}). \tag{7}$$
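The Monte Carlo Dropout procedure described above can be sketched with a toy model. This is only an illustration under stated assumptions: `sentence_prob` is a hypothetical stand-in for the sentence-level translation probability of an NMT model (here a tiny logistic scorer), and the feature and weight values are made up. What the sketch shows is the mechanism: dropout stays active at inference time, K stochastic forward passes are run, and the samples are averaged.

```python
import math
import random


def sentence_prob(features, weights, drop_p, rng):
    """One stochastic forward pass of a toy scorer with dropout kept active."""
    logit = 0.0
    for f, w in zip(features, weights):
        if rng.random() >= drop_p:              # this unit survives dropout
            logit += f * w / (1.0 - drop_p)     # inverted-dropout rescaling
    return 1.0 / (1.0 + math.exp(-logit))       # squash into (0, 1)


def mc_dropout_estimate(features, weights, k=30, drop_p=0.1, seed=0):
    """Average K stochastic passes; also return the sample variance."""
    rng = random.Random(seed)
    samples = [sentence_prob(features, weights, drop_p, rng) for _ in range(k)]
    mean = sum(samples) / k
    var = sum((s - mean) ** 2 for s in samples) / k
    return mean, var


# Hypothetical inputs, chosen only for illustration.
mean_p, var_p = mc_dropout_estimate([0.5, -1.0, 2.0], [1.2, 0.3, 0.8])
```

With `drop_p=0.0` every pass is identical, so the variance collapses to zero; the spread across passes only appears once parameters are randomly deactivated.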

Uncertainty-Aware Training
To make the training more efficient and stable, MULTIUAT leverages the scorer network $\boldsymbol{\psi}$ to dynamically adjust the sampling probability distribution of domains/languages. We present the pseudo-code for training with MULTIUAT in Algorithm 1. MULTIUAT first parameterizes the initial sampling probability distribution for multi-corpus training with $\boldsymbol{\psi}$ as in Equation 4 with the warm-up temperature $\tau = 1$. For computational efficiency, the scorer network $\boldsymbol{\psi}$ is updated every $S$ steps. When updating $\boldsymbol{\psi}$, we randomly draw one mini-batch from each validation set $\{D^{i}_{dev}\}_{i=1}^{N}$ and compute the corresponding uncertainty measure as in Section 3.3 with Monte Carlo Dropout to approximate the model uncertainty towards this corpus, assuming the validation set is representative enough of its corresponding true distribution. A corpus associated with high uncertainty is considered to be relatively poorly described by the model $\boldsymbol{\theta}$, and its sampling probability will be increased. The model $\boldsymbol{\theta}$ is updated by mini-batch gradient descent between two updates of $\boldsymbol{\psi}$, like common gradient-based optimization, and hence the objective is formulated as follows:
$$\boldsymbol{\psi}^{*} = \arg\min_{\boldsymbol{\psi}} \mathcal{L}(D_{dev}; \boldsymbol{\theta}^{*}(\boldsymbol{\psi})), \quad \boldsymbol{\theta}^{*}(\boldsymbol{\psi}) = \arg\min_{\boldsymbol{\theta}} \mathbb{E}_{n \sim p_{\boldsymbol{\psi}}(n)} \left[ \mathcal{L}_s(D^{n}_{trn}; \boldsymbol{\theta}) \right]. \tag{8}$$
A notable problem here is that Equation 6 is not directly differentiable w.r.t. the scorer $\boldsymbol{\psi}$. To tackle this problem, reinforcement learning (RL) with suitable reward functions is required (Fang et al., 2017; Wang et al., 2020a):
$$\boldsymbol{\psi} \leftarrow \boldsymbol{\psi} + \sum_{n=1}^{N} R(n) \cdot \nabla_{\boldsymbol{\psi}} \log p_{\boldsymbol{\psi}}(n). \tag{9}$$
Details of the reward functions $R(n)$ are given in Section 3.3, and the update of $\boldsymbol{\psi}$ follows the REINFORCE algorithm (Williams, 1992).
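A REINFORCE-style update of the scorer can be sketched as follows. This is an illustration under assumptions that are not stated in the text: the scorer is taken to be a plain vector of logits with a softmax on top, the step size `lr` is hypothetical, and no baseline or optimizer state is used. For a softmax scorer, the gradient of $\log p_{\boldsymbol{\psi}}(n)$ with respect to logit $j$ is $\mathbb{1}[j = n] - p_{\boldsymbol{\psi}}(j)$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(psi, rewards, lr=0.1):
    """psi <- psi + lr * sum_n R(n) * grad_psi log p_psi(n)."""
    probs = softmax(psi)
    grad = [0.0] * len(psi)
    for n, r in enumerate(rewards):
        for j in range(len(psi)):
            # d log p(n) / d psi_j = 1[j == n] - p(j) for a softmax scorer
            grad[j] += r * ((1.0 if j == n else 0.0) - probs[j])
    return [p + lr * g for p, g in zip(psi, grad)]


# Hypothetical rewards: corpus 0 has the highest uncertainty.
psi0 = [0.0, 0.0, 0.0]
psi1 = reinforce_step(psi0, rewards=[0.9, 0.1, 0.1])
p1 = softmax(psi1)
```

Starting from a uniform distribution, the corpus with the largest reward gains sampling probability while the others lose it, which matches the intended behaviour of spending more training effort on poorly described corpora.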

Uncertainty Measures
We explore the utility of two groups of model uncertainty measures: probability-based and entropy-based measures at the sentence level (Fomicheva et al., 2020; Malinin and Gales, 2021).

Probability-Based Measures
We explore four probability-based uncertainty measures, following definitions from prior work. For the sampled model parameters $\boldsymbol{\theta}_k$, with teacher-forced decoding, we denote the predicted probability of the $t$-th position as:
$$p(\hat{y}_{n,t} \mid \mathbf{x}_n, \mathbf{y}_{n,<t}; \boldsymbol{\theta}_k) = \max_{y} \, p(y \mid \mathbf{x}_n, \mathbf{y}_{n,<t}; \boldsymbol{\theta}_k), \tag{10}$$
where we have used the ground-truth prefix $\mathbf{y}_{n,<t}$ in the conditioning context. We then define the reward function as the following uncertainty measures:
• Predicted Translation Probability (PRETP): The predicted probability of the sentence, based on $p(\hat{y}_{n,t} \mid \mathbf{x}_n, \mathbf{y}_{n,<t}; \boldsymbol{\theta}_k)$.

Entropy-Based Measures
Malinin and Gales (2021) consider the uncertainty estimation for autoregressive models at the token level and sequence level and treat the entropy of the posterior as the total uncertainty in the prediction of $\mathbf{y}$. Following their interpretation, we leverage the entropy as the measure of the model uncertainty.
Drawing a sentence pair $(\mathbf{x}, \mathbf{y})$ with $T$ target tokens from the $n$-th corpus $D_n$, the reward function is defined as the averaged entropy over all the positions:
$$R(n, \boldsymbol{\theta}_k) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{v=1}^{V} p(y_{n,t,v}) \log p(y_{n,t,v}), \tag{11}$$
where $V$ is the vocabulary size and $p(y_{n,t,v})$ stands for the predicted conditional probability $p(y_{n,t,v} \mid \mathbf{x}, \mathbf{y}_{n,<t}; \boldsymbol{\theta}_k)$ of the $v$-th word in the vocabulary.
In this work, we explore the utility of two entropy-based uncertainty measures as follows: • Entropy of the sentence (ENTSENT): The average entropy of the sentence as defined in Equation 11.
• Entropy of EOS (ENTEOS): The entropy of the symbol EOS in the sentence as defined in Equation 11 where t = T .
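The two entropy-based measures can be sketched on toy per-position distributions. The function names and the example distributions are hypothetical; the sketch assumes that each target position comes with a full probability distribution over the vocabulary from one stochastic forward pass, and that the last position is the one predicting EOS.

```python
import math


def token_entropy(dist):
    """Shannon entropy of one per-position distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def ent_sent(token_dists):
    """Average entropy across all T target positions (ENTSENT)."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def ent_eos(token_dists):
    """Entropy at the last position t = T, where EOS is predicted (ENTEOS)."""
    return token_entropy(token_dists[-1])


def mc_reward(per_pass_values):
    """Average a measure over the K Monte Carlo Dropout passes."""
    return sum(per_pass_values) / len(per_pass_values)


# Toy example: vocabulary of 4 words, sentence of 3 target positions.
dists = [
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain position
    [0.97, 0.01, 0.01, 0.01],   # confident position
    [1.0, 0.0, 0.0, 0.0],       # fully certain final (EOS) position
]
sentence_entropy = ent_sent(dists)
eos_entropy = ent_eos(dists)
```

In this toy case the sentence-level entropy is dominated by the first position, while the EOS entropy is zero because the final distribution is deterministic.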
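Following Equation 7, the final reward for each uncertainty measure is obtained by averaging over the $K$ sampled $\boldsymbol{\theta}_k$, where $R(n, \boldsymbol{\theta}_k)$ is the reward computed with the $k$-th sampled parameters:
$$R(n) = \frac{1}{K} \sum_{k=1}^{K} R(n, \boldsymbol{\theta}_k). \tag{12}$$

Experimental Setup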

MULTIDDS-S
We compare with MULTIDDS-S, the best model proposed by Wang et al. (2020b) for multilingual NMT tasks. Its reward for the n-th corpus is defined using the cosine similarity between the gradients computed on the training data and on the validation data (Equation 13).
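The quantity at the core of this reward is a cosine similarity between two gradient vectors. The sketch below only illustrates that computation on made-up, flattened gradient vectors; how the gradients are obtained from the model and how the per-corpus similarities are aggregated are not shown, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for a zero gradient: no usable signal
    return dot / (norm_u * norm_v)


# Made-up gradients: aligned, orthogonal, and opposed directions.
aligned = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposed = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

A corpus whose training gradient points in the same direction as the validation gradient receives a high reward, which is why a strongly self-correlated corpus can dominate this kind of signal.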

Multilingual Setup
We follow the same setup as Wang et al. (2020b) for multilingual NMT. The model is trained on two sets of language pairs based on language diversity.
We run many-to-one (M2O, translating 8 languages to English) and one-to-many (O2M, translating English to 8 languages) translations for both diverse and related setups; refer to Wang et al. (2020b).

Multi-Domain Setup
We run experiments on English-German translation and collect six corpora from WMT2014 (Bojar et al., 2014) and the Open Parallel Corpus (Tiedemann, 2012), 4 for in-domain and 2 for out-of-domain:
In-Domain (ID) (i) WMT, from the WMT2014 translation task (Bojar et al., 2014), with the concatenation of newstest2010 to newstest2013 for validation and newstest2014 for testing; (ii) Tanzil, a collection of Quran translations; (iii) EMEA, a parallel corpus from the European Medicines Agency; (iv) KDE, a parallel corpus of KDE4 localization files.
Out-Of-Domain (OOD) (i) QED, a collection of subtitles for educational videos and lectures (Abdelali et al., 2014); (ii) TED, a parallel corpus of TED talk subtitles. These two domains are only used for out-of-domain evaluation.
All these corpora are first tokenized by Moses (Koehn et al., 2007) and processed into sub-word units by BPE (Sennrich et al., 2016) with 32K merge operations. Sentence pairs that are duplicated or that violate a source-target ratio of 1.5 are removed. The validation sets and test sets are randomly sampled, except for WMT. The dataset statistics are listed in Table 1.
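The filtering step can be sketched as below. The exact ratio definition is not spelled out in the text, so this sketch makes two assumptions: lengths are counted in whitespace-separated tokens, and the ratio is symmetric (longer side divided by shorter side). The function name and sample pairs are hypothetical.

```python
def clean_pairs(pairs, max_ratio=1.5):
    """Drop exact duplicates and pairs whose length ratio exceeds max_ratio."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue                      # exact duplicate of an earlier pair
        seen.add((src, tgt))
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue                      # empty side: ratio is undefined
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue                      # lengths too different
        kept.append((src, tgt))
    return kept


# Made-up sentence pairs, chosen only to exercise each branch.
sample = [
    ("a b c", "x y z"),
    ("a b c", "x y z"),   # duplicate
    ("a", "x y z"),       # ratio 3.0
    ("a b", "x y z"),     # ratio exactly 1.5, kept
]
cleaned = clean_pairs(sample)
```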

Model Architecture
We believe all the approaches involved in this work, including the baseline approaches and MULTIUAT, are model-agnostic. To validate this idea, we experiment with two variants of the transformer (Vaswani et al., 2017). For multilingual NMT, the model architecture is a transformer with 4 attention heads and 6 layers. For multi-domain NMT models, we use the standard transformer-base with 8 attention heads and 6 layers. All the models in this work are implemented with fairseq (Ott et al., 2019).

Evaluation
We report detokenized BLEU (Papineni et al., 2002) using SacreBLEU (Post, 2018) with statistical significance given by Koehn (2004). $\mu_{\text{BLEU}}$ is the macro average of BLEU scores within the same setting, with the assumption that all the language pairs/domains are equally important.

Main Results
The summarized results for both multilingual and multi-domain NMT are presented in Table 2. The multilingual results of the heuristic approaches and MULTIDDS-S in Table 2 are those reported by Wang et al. (2020b). The complete results with statistical significance can be found in Appendix A.
Multilingual NMT Overall, dynamic strategies (MULTIDDS-S and MULTIUAT) demonstrate their superiority over heuristic static strategies. As shown in Table 2, the optimal τ of heuristic static strategies varies as the combination of corpora changes. For example, proportional sampling yields the best performance on M2O settings, yet achieves the worst performance on O2M settings among heuristic static strategies. Dynamic strategies do not require tuning τ to adjust the data usage. MULTIDDS-S marginally outperforms heuristic static strategies. MULTIUAT with various uncertainty measures reaches the best performance in all four settings. Based on the detailed results in Appendix A, we can observe that MULTIUAT appears to favor high-resource languages (HRLs).
Multi-domain NMT MULTIUAT outperforms all its baselines on in-domain evaluation and achieves the second best performance on out-of-domain evaluation. MULTIUAT with PRETP achieves the optimal balance on in-domain evaluation and the one with EXPTP achieves the second best performance on out-of-domain evaluation. However, MULTIDDS-S performs poorly on multi-domain NMT and is even outperformed by some heuristic static strategies.
Based on the detailed results in Appendix A, we can observe that a higher sampling probability for a certain domain is commonly, but not always, positively correlated with the corresponding in-domain performance. Uniformly sampling mini-batches from domains does not result in the best performance on low-resource domains (LRDs), because LRDs with too much up-sampling are not able to fully leverage the knowledge from the high-resource domains (HRDs). Wang et al. (2020b) conduct exhaustive analyses on multilingual NMT and most of our observations are consistent with theirs. Hence, we focus more on analyzing the results on multi-domain NMT.

Comparison of Uncertainty Measures
We explore the utility of different uncertainty measures and display the summarized results in Table 2. Different uncertainty measures deliver different results, and we do not observe one uncertainty measure that consistently outperforms the others. The probability-based uncertainty measures seem to be more sensitive to the intra- and inter-dataset characteristics, and perform well on either multilingual NMT or multi-domain NMT. MULTIUAT with the uncertainty measure of VARTP performs substantially worse than with other uncertainty measures in multi-domain NMT. In contrast to the probability-based uncertainty measures, the entropy-based uncertainty measures are more robust to the change of datasets and deliver relatively stable improvements. We also find that MULTIUAT with the uncertainty measures demonstrates better out-of-domain generalization in multi-domain NMT, compared with its baselines. Based on the detailed results in Appendix A, MULTIUAT with the entropy-based uncertainty measures demonstrates better robustness against the change of datasets. Therefore, we mainly compare MULTIUAT with the uncertainty measure of ENTEOS against the baselines in the following analyses, based on the macro-average results on both multilingual and multi-domain NMT.

Learned Distribution for Language Pairs/Domains
We visualize the change of the sampling distribution with respect to the training iterations in the multilingual O2M-diverse (Figure 1) and multi-domain (Figure 2) settings. In multilingual NMT, we observe that the distributions learned by both MULTIDDS-S and MULTIUAT converge from proportional sampling towards uniform sampling, with a mild trend to divergence in the one given by MULTIDDS-S. In multi-domain NMT, MULTIUAT shows an adjustment consistent with the trend in the multilingual O2M-diverse setting, but the learned distribution given by MULTIDDS-S is overwhelmed by Tanzil.
The model uncertainty focuses on how well the dataset is described by the model $\boldsymbol{\theta}$, instead of the interference among datasets, so MULTIUAT is free from the assumption on cross-corpus transfer and is not affected by Tanzil.

Why Cosine Similarity Fails?
A natural question is raised after seeing Figure 2: why does Tanzil overwhelm the sampling distribution by MULTIDDS-S in multi-domain NMT?
As in Equation 13, MULTIDDS-S computes pairwise cosine similarities for all the language pairs/domains using sampled mini-batches between $D_{trn}$ and $D_{dev}$ to update the sampling probability. We average all the cosine similarity matrices during the training and visualize the averaged matrix in Figure 3. As visualized, Tanzil is a highly self-correlated domain whose self-similarity is at least about twice as large as the other values in the matrix. This leads to a very high reward on Tanzil, and the sampling probability of Tanzil in MULTIDDS-S keeps increasing to more than 40% in Figure 2.
However, is Tanzil highly beneficial to the overall performance? To probe the cross-domain generalization, we train four single-domain NMT models, one on each in-domain corpus, and evaluate these models on all the in-domain test sets. The results are presented in Table 4. We can observe that the knowledge learned from WMT can be generalized to other domains, but the knowledge learned from Tanzil is of almost no benefit to other domains. Therefore, MULTIDDS-S with the data-dependent cosine similarity reward is vulnerable to the change of datasets and can be overwhelmed by a special dataset like Tanzil, since the cross-corpus transfer is intractable.

Effects of Sampling Priors
Both MULTIDDS-S and MULTIUAT initialize the sampling probability distribution to proportional distribution (line 1 in Algorithm 1). We investigate how the prior sampling distribution affects the performance and present the results in Table 3. We can observe that the prior sampling distribution can affect the overall performance. For both MULTIDDS-S and MULTIUAT, the overall results on both in-domain and out-of-domain evaluation are negatively correlated with the prior τ .
We also visualize the change of the sampling probability of KDE given by MULTIDDS-S and MULTIUAT with different prior sampling distributions in Figure 4. The sampling distribution learned by MULTIUAT always converges to the uniform distribution, regardless of the change of the prior sampling distribution. However, the change of priors significantly affects the sampling distribution learned by MULTIDDS-S.

Related Work
Multi-corpus NLP Multilingual training has been particularly prominent in recent advances, driven by the demand for training a unified model for all the languages (Dong et al., 2015; Plank et al., 2016; Johnson et al., 2017; Arivazhagan et al., 2019). Freitag and Firat (2020) extend current English-centric training to a many-to-many setup without sacrificing the performance on English-centric language pairs. Wang et al. (2021) improve the multilingual training by adjusting gradient directions based on gradient similarity. Existing works on multi-domain training commonly attempt to leverage architectural domain-specific components or auxiliary losses (Sajjad et al., 2017; Tars and Fishel, 2018; Zeng et al., 2018; Li et al., 2018; Deng et al., 2020; Jiang et al., 2020). These approaches commonly do not explore the training proportion across domains, are limited to in-domain prediction, and are less generalizable to unseen domains. Zaremoodi and Haffari (2019) dynamically balance the importance of tasks in multi-task NMT to improve the low-resource NMT performance. Vu et al. (2021) leverage a pre-trained language model to select useful monolingual data from either the source language or the target language to perform unsupervised domain adaptation for NMT models. Our work is directly related to Wang et al. (2020a) and Wang et al. (2020b), who leverage the cosine similarity of gradients as a reward to dynamically adjust the data usage in multilingual training.

Conclusion
In this work, we propose MULTIUAT, a general model-agnostic framework that learns to automatically balance the data usage based on model uncertainty to achieve better overall performance on multiple corpora. We run extensive experiments on both multilingual and multi-domain NMT, and empirically demonstrate the effectiveness of our approach, which substantially outperforms the baseline approaches. We also empirically point out the vulnerability of a comparable approach, MULTIDDS-S (Wang et al., 2020b). In this paper, we focus on the problem of dynamically balancing text corpora collected from heterogeneous sources. However, the heterogeneity of text corpora goes far beyond the languages and domains discussed in this work. For example, the quality of datasets is not covered. We leave the study of dataset quality to future work.

A Complete Results
We present the complete results of our own implementation in Table 5, Table 6, Table 7, Table 8 and Table 9. The multilingual results for heuristic static strategies and MULTIDDS-S are obtained with the hyperparameters provided by Wang et al. (2020b).

B Hyperparameters for Optimization
Multilingual NMT For MULTIUAT, the NMT model is optimized with Adam (Kingma and Ba, 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.98$. The model is optimized for 40 epochs with the learning rate $\alpha = 5 \times 10^{-4}$ and a batch size of 9600 tokens. The learning rate increases linearly in the first 4K steps to the peak and then declines proportionally to the inverse square root of the number of steps. $\boldsymbol{\psi}$ is updated every 2K steps with the learning rate $1 \times 10^{-4}$.
Multi-domain NMT The NMT model is optimized with Adam (Kingma and Ba, 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.98$. The model is optimized for 20 epochs with the learning rate $\alpha = 7 \times 10^{-4}$ and a batch size of 32K tokens. The learning rate increases linearly in the first 4K steps to the peak and then declines proportionally to the inverse square root of the number of steps. $\boldsymbol{\psi}$ for both MULTIDDS-S and MULTIUAT is updated every 1K steps with the learning rate $1 \times 10^{-4}$. All the hyperparameters are identical among all the approaches in the multi-domain setup.
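The learning-rate schedule described in this appendix (linear warm-up followed by inverse-square-root decay) can be sketched as a small function. The sketch assumes the warm-up starts from zero, which is not stated in the text, and the function name is hypothetical; the peak values and the 4K warm-up steps are taken from the settings above.

```python
def inverse_sqrt_lr(step, peak_lr, warmup_steps=4000):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step <= 0:
        return 0.0
    if step < warmup_steps:
        # Assumed: warm-up rises linearly from zero to the peak.
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


# Multilingual setting: peak 5e-4; multi-domain setting: peak 7e-4.
lr_mid_warmup = inverse_sqrt_lr(2000, peak_lr=5e-4)
lr_at_peak = inverse_sqrt_lr(4000, peak_lr=5e-4)
lr_after_decay = inverse_sqrt_lr(16000, peak_lr=5e-4)
lr_domain_peak = inverse_sqrt_lr(4000, peak_lr=7e-4)
```

Under this form, the rate at four times the warm-up length is half the peak, since the decay factor is the square root of the ratio between the warm-up length and the current step.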