Continual Quality Estimation with Online Bayesian Meta-Learning

Most current quality estimation (QE) models for machine translation are trained and evaluated in a static setting where training and test data are assumed to be from a fixed distribution. However, in real-life settings, the test data that a deployed QE model would be exposed to may differ from its training data. In particular, training samples are often labelled by one or a small set of annotators, whose perceptions of translation quality and needs may differ substantially from those of end-users, who will employ predictions in practice. To address this challenge, we propose an online Bayesian meta-learning framework for the continuous training of QE models that is able to adapt them to the needs of different users, while being robust to distributional shifts in training and test data. Experiments on data with varying number of users and language characteristics validate the effectiveness of the proposed approach.


Introduction
Quality Estimation (QE) models aim to evaluate the output of Machine Translation (MT) systems at run-time, when no reference translations are available (Blatz et al., 2004; Specia et al., 2009). QE models can be applied, for instance, to improve translation productivity by selecting high-quality translations amongst several candidates. A number of approaches have been proposed for this task (Specia et al., 2009, 2015; Kepler et al., 2019; Ranasinghe et al., 2020), and an annual shared task benchmarks the proposed approaches (Fonseca et al., 2019; Specia et al., 2020).
Different users of MT output have varying quality needs and standards, depending for instance on the downstream task at hand, their level of knowledge of the languages involved, and their training for the task. Thus, the perception of the quality of MT output can be subjective, and the quality estimates obtained from a model trained on data from one set of users may not serve the needs of a different set of users. However, most existing QE models are trained and evaluated in a static setting which assumes a fixed distribution of train and test data. This leads to suboptimal performance when the model is faced with test data from a different set of users in practice.
The few previous approaches to develop QE models that are able to learn from a continuous stream of data suffer from the following limitations: they do not have an explicit objective that encourages the model to exploit common structures shared among different users to continually adapt efficiently for new users (Turchi et al., 2014), or assume a fixed number of users, and that the identity of each user is known in advance (de Souza et al., 2015). In addition, these previous approaches do not explicitly account for the underlying uncertainties in the data in order to improve performance.
In contrast, we propose a continual meta-learning framework that makes none of the aforementioned assumptions, but instead considers each user as a task and explicitly meta-learns the common structure shared among different users. This approach further exploits the underlying uncertainties in the streaming data through Bayesian inference to improve performance. In addition, the proposed approach is applicable even in a setting where no user identities are available, for instance due to privacy concerns, but where we still want to learn and adapt as efficiently as possible from supervision data that arrives incrementally.

Continual Learning
Continual learning (Ring, 1994; Thrun, 1996; Zhao and Schmidhuber, 1996) aims to develop models that are capable of learning from a continuous stream of sequential tasks, $T_1, T_2, \ldots, T_T$, with each task $T_t$ having its associated train $D_t^{train}$, validation $D_t^{val}$ and test $D_t^{test}$ splits. A major challenge associated with learning in this setting is the issue of catastrophic forgetting, where a model forgets knowledge of how to perform previous tasks as new tasks are encountered. Most recent work in lifelong learning has focused on ways of mitigating catastrophic forgetting, and approaches proposed include replay-based methods (Rebuffi et al., 2017; Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019), which replay either stored or generated samples to remind the model of how to perform previous tasks; regularization-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017), which utilize an additional regularization term to enforce retaining knowledge learned from previous tasks; and parameter-isolation methods, which make use of dedicated parameters for each task to prevent interference among tasks (Rusu et al., 2016; Fernando et al., 2017). Lange et al. (2019) present an overview of recent continual learning methods. Research in continual learning is generally carried out in one of two settings: a task-incremental continual learning setting, where the learner is sequentially given access to all the data of each task and is allowed to make multiple passes over it, with task boundaries and identities known to the learner; or an online continual learning setting, where the learner is only allowed a single pass over the data of each task, with no task identities or boundaries known to the learner. In this work we conduct experiments in the online continual learning setting.

Meta-Learning
The goal of meta-learning, also known as learning to learn (Schmidhuber, 1987; Thrun and Pratt, 1998), is to develop models that can learn more efficiently over time, by generalizing from knowledge of how to solve related tasks from a given distribution of tasks. Given a learner model $f_w$, for instance a neural network parametrized by $w$, and a distribution $p(T)$ over tasks $T$, gradient-based meta-learning approaches such as MAML (Finn et al., 2017) seek to learn the parameters of the learner model which can be quickly adapted to new tasks sampled from the same distribution of tasks. In formal terms, these approaches seek parameters that optimize the meta-objective:
$$\min_{w} \; \mathbb{E}_{T \sim p(T)} \left[ \mathcal{L}_T\left( U_k(w, D_T) \right) \right] \qquad (1)$$
where $\mathcal{L}_T$ is the loss and $D_T$ is training data from task $T$, and $U_k$ denotes $k$ steps of a gradient descent learning rule such as SGD.
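As an illustrative sketch of the meta-objective in Equation 1 (not the implementation used in this work), the snippet below evaluates and minimizes it on a toy family of one-dimensional regression tasks. The task family, the scalar learner f_w(x) = w * x, the step sizes, and the use of a finite-difference meta-gradient in place of backpropagation through U_k are all assumptions made for brevity.

```python
import random

def loss(w, data):
    """Mean squared error of the 1-D linear learner f_w(x) = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """Analytic gradient of the MSE loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def adapt(w, data, k=3, inner_lr=0.1):
    """U_k: k steps of plain gradient descent on one task's data."""
    for _ in range(k):
        w = w - inner_lr * grad(w, data)
    return w

def meta_objective(w, tasks, k=3):
    """Average post-adaptation loss over tasks (Monte Carlo estimate of Eq. 1)."""
    return sum(loss(adapt(w, d, k), d) for d in tasks) / len(tasks)

def make_task(rng, n=8):
    """Hypothetical task: noiseless data from y = slope * x, slope in [2, 4]."""
    slope = rng.uniform(2.0, 4.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x) for x in xs]

rng = random.Random(0)
tasks = [make_task(rng) for _ in range(10)]

w = 0.0
before = meta_objective(w, tasks)
eps, outer_lr = 1e-4, 0.5
for _ in range(100):
    # A central finite difference stands in for differentiating through U_k.
    g = (meta_objective(w + eps, tasks) - meta_objective(w - eps, tasks)) / (2 * eps)
    w -= outer_lr * g
after = meta_objective(w, tasks)
```

The meta-learned w is not optimal for any single task; it is an initialization from which k gradient steps give low loss on average across the task distribution.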
In order to account for uncertainty and improve robustness, Bayesian approaches to meta-learning have also been proposed (Kim et al., 2018;Finn et al., 2018;Ravi and Beatson, 2019;Wang et al., 2020;Nguyen et al., 2020).

Meta-Learning for Continual Learning
Meta-learning for continual learning methods generally make use of the meta-learning objective one task at a time to ensure that learning on the current task does not lead to catastrophic forgetting on previous tasks. For instance, both Riemer et al. (2019) and Obamuyide and Vlachos (2019) propose to combine REPTILE (Nichol and Schulman, 2018), a first order meta-learning algorithm, together with experience replay to improve performance during continual learning. Javed and White (2019) proposed an online-aware meta-learning (OML) objective for learning representations that are less prone to catastrophic forgetting during continual learning. Holla et al. (2020) proposed to combine the OML objective together with experience replay for improved continual learning performance. Recently, Gupta et al. (2020) proposed Look-Ahead MAML (LA-MAML), which meta-learns per-parameter learning rates to help adapt to changing data distributions during continual learning.
These approaches have demonstrated that meta-learning can yield performance improvements for continual learning. Our work builds on these approaches and additionally demonstrates that the performance of meta-learning for continual learning can be further improved with the combination of an adaptive learning rate and Bayesian inference.

Bayesian Inference with Stein Variational Gradient Descent
Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016) is a Bayesian inference method which works by initializing a set of samples, also known as particles, from a simple distribution and iteratively updating the particles to match samples from a target distribution. Because its particle update rule is deterministic and differentiable, it can be used to perform Bayesian inference in the meta-learning inner loop, since the entire update process can still be differentiated through for gradient-based updates from the outer loop. In order to obtain $N$ samples from a posterior $p(w)$, SVGD maintains $N$ samples of model parameters, and iteratively transports the samples to match samples from the target distribution. Let the samples be represented by $W = \{w_n\}_{n=1}^{N}$. At each successive iteration $t$, SVGD updates each sample with the following update rule:
$$w_n^{t+1} = w_n^{t} + \alpha_t \, \phi(w_n^{t}) \qquad (2)$$
$$\phi(w) = \frac{1}{N} \sum_{j=1}^{N} \left[ k(w_j^{t}, w) \, \nabla_{w_j^{t}} \log p(w_j^{t}) + \nabla_{w_j^{t}} k(w_j^{t}, w) \right] \qquad (3)$$
where $\alpha_t$ is a step-size parameter and $k(\cdot, \cdot)$ is a positive-definite kernel, such as the RBF kernel. Intuitively, the first term in Equation 3 implies that a particle determines its update direction through a weighted aggregate of the gradients from the other particles, with the kernel distance between the particles serving as the weight. Thus, closer particles have more weight in the aggregate. The second term of the equation can be understood as a repulsive force that prevents the particles from collapsing to a single point. For the case when the number of particles is one, the SVGD update procedure reduces to standard gradient ascent on the objective $\log p(w)$ for any kernel with the property $\nabla_w k(w, w) = 0$, such as the RBF kernel. SVGD has been applied in a wide range of settings, including reinforcement learning (Liu et al., 2017; Haarnoja et al., 2017), uncertainty quantification (Zhu and Zabaras, 2018), and to improve performance in an offline meta-learning setup (Kim et al., 2018), which requires all tasks to be available ahead of training.
In this work we adapt SVGD to an online continual meta-learning setting for a natural language task.
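To make the particle update concrete, the following is a minimal sketch of SVGD for scalar particles, assuming a one-dimensional Gaussian target and an RBF kernel with a fixed bandwidth. The target, bandwidth, step size and particle count are illustrative choices, not settings from this work.

```python
import math
import random

def rbf(a, b, h=1.0):
    """RBF kernel k(a, b) with bandwidth h."""
    return math.exp(-((a - b) ** 2) / (2 * h * h))

def grad_rbf_wrt_first(a, b, h=1.0):
    """Derivative of k(a, b) with respect to its first argument a."""
    return -(a - b) / (h * h) * rbf(a, b, h)

def grad_log_p(w, mean=2.0, std=1.0):
    """Score of the assumed Gaussian target N(mean, std^2)."""
    return -(w - mean) / (std * std)

def svgd_step(particles, step=0.1):
    """One SVGD update applied to every particle."""
    n = len(particles)
    updated = []
    for wi in particles:
        phi = 0.0
        for wj in particles:
            phi += rbf(wj, wi) * grad_log_p(wj)  # kernel-weighted gradient term
            phi += grad_rbf_wrt_first(wj, wi)    # repulsive term
        updated.append(wi + step * phi / n)
    return updated

rng = random.Random(0)
particles = [rng.gauss(-3.0, 0.5) for _ in range(20)]  # start far from the target
for _ in range(500):
    particles = svgd_step(particles)

mean_est = sum(particles) / len(particles)
spread = max(particles) - min(particles)
```

With a single particle the repulsive term vanishes and the kernel weight is 1 for the RBF kernel, so `svgd_step([w])` moves `w` by exactly one gradient-ascent step on the log-density.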

Meta-Learning for Continual Learning with Adaptive SVGD
Learning continually from a stream of observations with varying underlying distributions involves dealing with various sources of uncertainty, which a model should properly account for in order to enhance its continual learning performance. One source of uncertainty is in the learning rate, that is, how fast learning should proceed on new data in order to both reduce catastrophic forgetting and enhance performance on the current task. Another source is the inherent uncertainty in the values of the model's parameters themselves. Learning an adaptive learning rate, for instance as proposed in Gupta et al. (2020), can help account for the first source of uncertainty, and Bayesian inference can be used to help a model account for the second. In order to properly model both sources of uncertainty during continual learning, we propose to both perform inference of model parameters with SVGD, and meta-learn an adaptive per-parameter learning rate for SVGD updates. Thus, the SVGD update in Equation 2 now becomes:
$$w_n^{t+1} = w_n^{t} + \alpha_t \cdot \phi(w_n^{t}) \qquad (4)$$
where $\phi$ is the SVGD update direction, $\alpha_t$ is a learnable parameter containing per-parameter learning rates, and $\cdot$ denotes element-wise multiplication. The aim is then to meta-learn both the parameters of the model and the per-parameter learning rates that enhance continual learning performance. The advantage of this approach is that it allows for greater flexibility to adapt to non-stationary data distributions during continual learning. In the experiments, we demonstrate that this change leads to improved performance for the task of continual quality estimation. The proposed approach is illustrated in Algorithm 1.
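The effect of a per-parameter learning rate on the SVGD update can be sketched as follows. This is a simplified illustration under stated assumptions: particles are short Python lists, the target is a toy Gaussian, the rates in `alpha` are fixed by hand rather than meta-learned, and the rate is assumed to scale each coordinate of the update direction separately.

```python
import math

def rbf(a, b, h=1.0):
    """RBF kernel between two parameter vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2 * h * h))

def svgd_direction(particles, i, grad_log_p, h=1.0):
    """SVGD direction for particle i: kernel-weighted gradients plus repulsion."""
    wi = particles[i]
    out = [0.0] * len(wi)
    for wj in particles:
        k = rbf(wj, wi, h)
        g = grad_log_p(wj)
        for d in range(len(wi)):
            # Second summand is the gradient of k(wj, wi) with respect to wj.
            out[d] += k * g[d] - (wj[d] - wi[d]) / (h * h) * k
    return [v / len(particles) for v in out]

def adaptive_svgd_step(particles, alpha, grad_log_p):
    """w <- w + alpha * phi(w), with one learning rate per parameter."""
    directions = [svgd_direction(particles, i, grad_log_p)
                  for i in range(len(particles))]
    return [[w_d + a_d * p_d for w_d, a_d, p_d in zip(w, alpha, p)]
            for w, p in zip(particles, directions)]

MU = [1.0, -1.0]  # mean of the assumed toy Gaussian target

def toy_grad_log_p(w):
    """Score of a unit-variance Gaussian centred at MU."""
    return [-(x - m) for x, m in zip(w, MU)]

particles = [[0.0, 0.0], [0.5, 0.5], [-0.5, 0.2]]
alpha = [0.2, 0.0]  # a zero rate freezes the second parameter entirely
stepped = adaptive_svgd_step(particles, alpha, toy_grad_log_p)
```

In the proposed approach `alpha` would itself be updated in the outer loop; here it is held fixed only to show how different rates gate different parameters.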
We first initialize the parameters of the QE model, and the learning rate (line 1). Then for each mini-batch in a task t that arrives, we store its training instances in the buffer with a probability p (lines 2-6). In the inner loop, we perform K SVGD updates (using Equation 4) starting from the initial model parameters W 0 (lines 7-9). In the outer loop, instances in the current mini-batch are augmented with instances sampled from the buffer (line 10). Finally, the augmented mini-batch is used to perform a meta-update on the learning rate (line 11), and on the parameters of the QE model (line 12). Because this approach can also be considered the online counterpart to the Bayesian Model Agnostic Meta-Learning approach of Kim et al. (2018), we refer to it as Continual Quality Estimation with Online Bayesian Meta-Learning (CQE-OBML).
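The buffer handling described above (probabilistic storage and augmentation of the current mini-batch) can be sketched as below. The class name, the per-instance interpretation of the storage probability p, and the uniform sampling of replayed instances are assumptions for illustration; the inner-loop SVGD updates and the meta-updates are omitted.

```python
import random

class ReplayBuffer:
    """Hypothetical episodic buffer for streaming QE training instances."""

    def __init__(self, write_prob, seed=0):
        self.write_prob = write_prob  # probability p of storing an instance
        self.items = []
        self.rng = random.Random(seed)

    def maybe_store(self, batch):
        """Store each instance of the incoming mini-batch with probability p."""
        for instance in batch:
            if self.rng.random() < self.write_prob:
                self.items.append(instance)

    def augment(self, batch, n_replay):
        """Return the mini-batch extended with instances sampled from the buffer."""
        k = min(n_replay, len(self.items))
        return list(batch) + self.rng.sample(self.items, k)

# Toy stream of (source, translation, hter) triples arriving in mini-batches.
stream = [[("src %d" % i, "mt %d" % i, 0.1 * (i % 5)) for i in range(b, b + 4)]
          for b in range(0, 20, 4)]

buffer = ReplayBuffer(write_prob=0.5)
for batch in stream:
    buffer.maybe_store(batch)
    outer_batch = buffer.augment(batch, n_replay=2)
    # The outer-loop meta-update would consume `outer_batch` here.
```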

Experiments and Results
Algorithm 1 (excerpt, lines 7-9):
7: for k = 1, ..., K do
8:   W_k = SVGD(W_{k−1}, α_0, X_t, Y_t)
9: end for
The QT21 Dataset We evaluate our approach with the publicly available QT21 (Specia et al., 2017), a large-scale dataset containing translations from both statistical (smt) and neural (nmt) machine translation systems in multiple language directions. This is the largest dataset with annotator information available. We use data from the English-Latvian (en-lv) and English-Czech (en-cs) language pairs. These language pairs were chosen as they contain the largest number of annotators. Each instance in the dataset is a tuple of source sentence, its machine translation, the corresponding post-edited translation by a professional translator (post-editor), a reference translation and other information such as the (anonymized) post-editor identifier. We construct a QE dataset from this corpus by computing the HTER (Snover et al., 2006) values between each machine translation and its post-edited translation. We thereafter split the data into train, dev and test splits for each post-editor. A breakdown of the number of train, dev and test instances per post-editor is available in Table 1.
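For intuition about the regression targets, the sketch below computes a simplified HTER-style score as the word-level edit distance between a machine translation and its post-edit, normalized by the post-edit length. It is an approximation under stated assumptions: whitespace tokenization is used, and the block-shift operation of TER (Snover et al., 2006) is omitted, so scores can be higher than those of a full TER implementation.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,              # delete a hypothesis word
                           cur[j - 1] + 1,           # insert a reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]

def approx_hter(mt, post_edit):
    """Edits needed to turn `mt` into `post_edit`, per post-edit word."""
    hyp, ref = mt.split(), post_edit.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(hyp, ref) / len(ref)

score = approx_hter("the cat sat on mat", "the cat sat on the mat")
```

Here one insertion is needed against a six-word post-edit, so `score` is 1/6.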
Benchmark Approaches SEQUENTIAL is a baseline trained sequentially over the streaming data of each task. In each round, the model parameters are initialized from those of the previous round; A-GEM (Chaudhry et al., 2019) is a continual learning method which utilizes the gradients of samples of previous tasks saved in a buffer as an optimization constraint to prevent catastrophic forgetting; OML-ER (Holla et al., 2020) augments the Online-Aware Meta-Learning approach of Javed and White (2019) with experience replay from a buffer; LA-MAML (Gupta et al., 2020) learns per-parameter learning rates using meta-learning; MTL-IID is trained on the concatenated and shuffled data from all users for multiple epochs in multi-task fashion. It assumes i.i.d. access to the data from all users, and thus serves as an upper bound on performance.
QE Model The quality estimation model used by all continual learning methods is based on multilingual DistilBERT (Sanh et al., 2019), a smaller version of multilingual BERT (Devlin et al., 2019) trained with knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015). It accepts as input the source sentence and its machine translation concatenated as a single text, separated by a '[SEP]' token and prepended with a '[CLS]' token. The representation of the '[CLS]' token is then passed to a linear layer to predict HTER (Snover et al., 2006) values as regression targets.
Evaluation We report Pearson's r correlation scores and Mean Absolute Error (MAE) between model output and gold labels, both standard evaluation metrics in QE. Each experiment is repeated across five (5) different orders of the tasks and five (5) different random seeds, and we report the average.
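For reference, the two evaluation metrics can be computed as in the following sketch. The function names and the toy predictions are illustrative; in practice an established library routine would typically be used.

```python
import math

def pearson_r(pred, gold):
    """Pearson correlation between predicted and gold scores."""
    n = len(pred)
    mean_p, mean_g = sum(pred) / n, sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    norm_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    return cov / (norm_p * norm_g)

def mean_absolute_error(pred, gold):
    """Average absolute difference between predicted and gold scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

# Toy HTER predictions against gold labels.
predicted = [0.10, 0.35, 0.50, 0.80]
gold = [0.05, 0.40, 0.45, 0.90]
r = pearson_r(predicted, gold)
mae = mean_absolute_error(predicted, gold)
```

The two metrics are complementary: Pearson's r is invariant to shifting and rescaling the predictions, whereas MAE penalizes any systematic offset from the gold HTER values.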

Comparison with Benchmark Approaches
The results of our approach in comparison with other benchmark approaches are presented in Table 2. We can observe that naively training sequentially on the data of each task as it arrives (SEQUENTIAL) leads to poor results. OML-ER outperforms both SEQUENTIAL and A-GEM, likely because of its combination of meta-learning and experience replay, which makes it better able to combat forgetting. LA-MAML slightly improves over the results of OML-ER, as a result of its meta-learned learning rate. We find that our approach, CQE-OBML, which combines a meta-learned adaptive learning rate together with Bayesian inference, outperforms previous approaches. This demonstrates the effectiveness of adequately modelling the various sources of uncertainty in continual meta-learning.

Analysis of Model Components
We investigate the effect of the various components of our approach through an ablation study. As shown in Table 3, our approach (CQE-OBML) without the adaptive learning rate (-LR (α)) has a drop in performance, especially for en-cs. Without inference with SVGD (-SVGD), we observe a larger reduction in performance on both datasets, demonstrating the usefulness of incorporating Bayesian inference into the continual meta-learning of quality estimation models.

Conclusions
We proposed a framework for the continual meta-learning of machine translation quality estimation models, which is able to learn continually from the streaming data of multiple quality estimation users. We further incorporate an adaptive learning rate together with online Bayesian inference for improved performance. In experiments on quality estimation data from two language directions, we demonstrate improved performance over recent state-of-the-art continual learning methods.

A Additional Results
We present additional results on the WPTP12 dataset (Koponen et al., 2012), which is a small English-Spanish (en-es) translation dataset consisting of documents from the news domain. It features translations from eight different machine translation systems. Each instance in the dataset includes the corresponding post-edited translation along with post-editing time and HTER scores computed between the translation and the corresponding post-edit. Statistics about the number of instances per post-editor are in Table 4. Table 5 contains the results obtained on this dataset. Because the dataset is small, all methods find it challenging, with reduced performance across the board. Despite reduced performance in terms of mean absolute error, our approach obtains better Pearson correlation than all previous methods.

B Additional Experimental Details
All models make use of the same values for hyperparameters such as learning rate and batch size, selected by manual search in initial experiments. These are provided in Table 6.