Bayesian Model-Agnostic Meta-Learning with Matrix-Valued Kernels for Quality Estimation

Most current quality estimation (QE) models for machine translation are trained and evaluated in a fully supervised setting that requires significant quantities of labelled training data. However, obtaining labelled data can be both expensive and time-consuming. In addition, the test data that a deployed QE model is exposed to may differ from its training data in significant ways. In particular, training samples are often labelled by one or a small set of annotators, whose perceptions of translation quality and needs may differ substantially from those of the end-users who will employ the predictions in practice. It is therefore desirable to be able to adapt QE models to new user data efficiently and with limited supervision. To address these challenges, we propose a Bayesian meta-learning approach for adapting QE models to the needs and preferences of each user with limited supervision. To enhance performance, we further extend a state-of-the-art Bayesian meta-learning approach to utilize a matrix-valued kernel for Bayesian meta-learning of quality estimation. Experiments on data with varying numbers of users and language characteristics demonstrate that the proposed Bayesian meta-learning approach delivers improved predictive performance in both limited and full supervision settings.


Introduction
Quality Estimation (QE) models aim to evaluate the output of Machine Translation (MT) systems at run-time, when no reference translations are available (Blatz et al., 2004; Specia et al., 2009). QE models can be applied, for instance, to improve translation productivity by selecting high-quality translations amongst several candidates. A number of approaches have been proposed for this task (Specia et al., 2009, 2015; Kepler et al., 2019; Ranasinghe et al., 2020), and a yearly shared task benchmarks proposed approaches (Fonseca et al., 2019; Specia et al., 2020).
Different users of MT output have varying quality needs and standards, depending for instance on the downstream task at hand, or on their level of knowledge of the languages involved. The perception of MT output quality is thus subjective, and the quality estimates obtained from a model trained on data from one set of users may not serve the needs of a different set of users. To make the most of these models, it is therefore desirable to adapt them efficiently to the needs and preferences of the end-user, with as little supervision as possible. However, most existing QE models are trained and evaluated in a fully supervised setting which assumes access to substantial quantities of labelled supervision data, which may not be available and can be expensive and time-consuming to obtain.
In order to endow QE models with the ability to adapt efficiently with limited supervision data, this work proposes a Bayesian meta-learning framework for the training and evaluation of QE models that can adapt to the needs of end-users. We further improve the performance of Bayesian meta-learning for quality estimation by extending the state-of-the-art Bayesian Model-Agnostic Meta-Learning (BMAML) approach of Kim et al. (2018) to utilize Stein Variational Gradient Descent (Liu and Wang, 2016) with matrix-valued kernels (Wang et al., 2019), and demonstrate that this leads to enhanced predictive performance in both limited and full supervision settings.

Model-Agnostic Meta-Learning
The goal of meta-learning, also known as learning to learn (Schmidhuber, 1987; Thrun and Pratt, 1998), is to develop models that can learn more efficiently over time, by generalizing from knowledge of how to solve related tasks from a given distribution of tasks. Given a learner model f_w, for instance a neural network parametrized by w ∈ R^d, and a distribution p(T) over tasks T, gradient-based model-agnostic meta-learning approaches such as MAML (Finn et al., 2017) seek to learn the parameters of the learner model which can be quickly adapted to new tasks sampled from the same distribution of tasks with limited supervision data.
In formal terms, these approaches seek parameters w that satisfy the meta-objective:

w* = argmin_w E_{T∼p(T)} [L_T(U^k(w, D_T))]   (1)

where L_T is the loss and D_T is training data from task T, and U^k denotes k steps of a gradient descent learning rule such as SGD.
Intuitively, the meta-objective explicitly encourages the model to learn model parameters that can be quickly adapted to achieve optimum predictive performance across all tasks using limited supervision data and with as few gradient descent steps as possible.
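To make the meta-objective concrete, here is a minimal NumPy sketch for a one-parameter linear learner, where both the inner SGD step U and the meta-gradient (which differentiates through that step) have closed forms. The task distribution, step sizes, and sample sizes are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task T is 1-D linear regression y = a*x with a task-specific slope a."""
    a = rng.uniform(0.5, 2.5)
    X = rng.normal(size=(20, 1))
    return X, a * X[:, 0]

def inner_update(w, X, y, alpha=0.1):
    """One SGD step U(w, D_T) on the squared loss of f_w(x) = w*x."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - alpha * grad

def meta_gradient(w, X_tr, y_tr, X_val, y_val, alpha=0.1):
    """Gradient of the validation loss at the adapted parameters,
    back-propagated through the inner update (exact for a linear model):
    dL_val/dw = (I - alpha * H_tr)^T grad_{w'} L_val(w'), H_tr = 2 X^T X / n."""
    w_adapted = inner_update(w, X_tr, y_tr, alpha)
    H_tr = 2 * X_tr.T @ X_tr / len(y_tr)
    g_val = 2 * X_val.T @ (X_val @ w_adapted - y_val) / len(y_val)
    return (np.eye(len(w)) - alpha * H_tr).T @ g_val

w = np.zeros(1)
for step in range(200):                       # outer (meta) loop
    X, y = sample_task()
    X_tr, y_tr, X_val, y_val = X[:10], y[:10], X[10:], y[10:]
    w -= 0.05 * meta_gradient(w, X_tr, y_tr, X_val, y_val)
```

After meta-training, a single inner step from w already fits a freshly sampled task well, which is exactly what the meta-objective optimizes for.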
In order to account for uncertainty and improve robustness, Bayesian approaches to meta-learning have also been proposed (Kim et al., 2018;Finn et al., 2018;Ravi and Beatson, 2019;Wang et al., 2020;Nguyen et al., 2020). In contrast to their non-Bayesian counterparts which learn point estimates of the parameters, Bayesian meta-learning approaches learn a distribution over the parameters to further improve robustness in limited supervision settings.

Stein Variational Gradient Descent
Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016) is a Bayesian inference method which works by initializing a set of samples, also known as particles, from a simple distribution and iteratively updating the particles to match samples from a target distribution. Because its particle update rule is deterministic and differentiable, it can be used to perform Bayesian inference in the meta-learning inner loop, since the entire update process can still be differentiated through for gradient-based updates from the outer loop, for instance as was done in Kim et al. (2018).
In order to obtain N samples from a posterior p(w), SVGD maintains N samples of model parameters, and iteratively transports them to match samples from the target distribution. Let the samples be represented by W = {w_n}_{n=1}^N. At each iteration t, SVGD updates each sample with the following update rule:

w_n ← w_n + α_t φ(w_n)   (2)

φ(w) = (1/N) Σ_{m=1}^{N} [k(w_m, w) ∇_{w_m} log p(w_m) + ∇_{w_m} k(w_m, w)]   (3)

where α_t is a step-size parameter and k : R^d × R^d → R is a scalar-valued positive-definite kernel such as the Radial Basis Function (RBF) kernel. Intuitively, the first term in Equation 3 implies that a particle determines its update direction through a weighted aggregate of the gradients from the other particles, with the kernel distance between the particles serving as the weight; closer particles thus carry more weight in the aggregate. The second term can be understood as a repulsive force that prevents the particles from collapsing to a single point. When the number of particles is one, the SVGD update procedure reduces to standard gradient ascent on log p(w) for any kernel with the property ∇_w k(w, w) = 0, such as the RBF kernel. SVGD has been applied in a wide range of settings, including reinforcement learning (Liu et al., 2017; Haarnoja et al., 2017), uncertainty quantification (Zhu and Zabaras, 2018), and online continual learning (Obamuyide et al., 2021).
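The update rule can be illustrated on a toy one-dimensional Gaussian target. This is a NumPy sketch of ours, not the authors' code; the target, bandwidth, and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 3.0, 0.5           # target posterior p(w) = N(mu, sigma2), d = 1

def grad_log_p(W):
    """Score of the Gaussian target, evaluated at every particle."""
    return -(W - mu) / sigma2

def svgd_step(W, alpha=0.1, h=1.0):
    """One SVGD update: each particle moves along a kernel-weighted average
    of all particles' scores (attraction) plus the kernel's own gradient
    (repulsion, preventing collapse)."""
    diff = W[:, None] - W[None, :]          # w_m - w_n
    K = np.exp(-diff ** 2 / h)              # RBF kernel k(w_m, w_n)
    grad_K = -2 * diff / h * K              # d k(w_m, w_n) / d w_m
    phi = (K * grad_log_p(W)[:, None] + grad_K).mean(axis=0)
    return W + alpha * phi

W = rng.normal(size=10)                     # N = 10 particles from a simple init
for _ in range(500):
    W = svgd_step(W)
```

After the loop the particle cloud is centred near the target mean while the repulsive term keeps it spread out, rather than collapsed onto the mode.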

Stein Variational Gradient Descent with Matrix-Valued Kernels
Let H_k denote a reproducing kernel Hilbert space (RKHS) with kernel k. Wang et al. (2019) observed that the original SVGD as proposed in Liu and Wang (2016) searches for the optimal update direction φ in the RKHS H_k^d = H_k × · · · × H_k, a product of d copies of an RKHS of scalar-valued functions, which does not allow the encoding of any potential correlations between different co-ordinates of φ. Wang et al. (2019) proposed Matrix-SVGD, which addresses this limitation by replacing H_k^d with a more general RKHS of vector-valued functions (also known as a vector-valued RKHS), which uses matrix-valued positive-definite kernels to specify rich correlation structures between the different co-ordinates. Concretely, Equation 3 as used in SVGD is replaced with Equation 4:

φ(w) = (1/N) Σ_{m=1}^{N} [K(w, w_m) ∇_{w_m} log p(w_m) + K(·, w_m)∇_{w_m}]   (4)

where K : R^d × R^d → R^{d×d} is now a matrix-valued kernel, and K(·, w)∇_w is formally defined as the product of matrix K(·, w) with vector ∇_w. The ℓ-th element of K(·, w)∇_w is computed as:

[K(·, w)∇_w]_ℓ = Σ_{m=1}^{d} K_{ℓ,m}(·, w) ∂/∂w_m   (5)

where K_{ℓ,m}(w, w') represents the (ℓ, m)-element of matrix K(w, w') and w_m the m-th element of w. Importantly, the advantage of Matrix-SVGD over the original SVGD algorithm is that it allows us to pre-condition SVGD by constructing a matrix kernel which incorporates the pre-conditioning information, in order to accelerate exploration and convergence.
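To make the change from scalar to matrix kernels concrete, the sketch below uses the simplest member of the matrix-kernel family in Wang et al. (2019): a constant matrix kernel K(w, w') = Q k_P(w, w'), where Q is a fixed preconditioner (here the inverse of an assumed curvature matrix P). This is an illustrative NumPy sketch on an anisotropic Gaussian target; all constants are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.diag([1.0, 100.0])        # assumed curvature (precision of the target)
Q = np.linalg.inv(P)             # preconditioner: K(w, w') = Q * k_P(w, w')
mu = np.array([1.0, -1.0])       # target p(w) = N(mu, P^{-1})

def grad_log_p(W):
    """Score of the Gaussian target at every particle (rows of W)."""
    return -(W - mu) @ P

def matrix_svgd_step(W, alpha=0.5, h=1.0):
    """Matrix-SVGD with the constant kernel K(w, w') = Q k_P(w, w'):
    both the driving term and the repulsive term end up premultiplied by Q."""
    diff = W[:, None, :] - W[None, :, :]              # w_m - w_n, shape (N, N, d)
    dist2 = np.einsum('mni,ij,mnj->mn', diff, P, diff)
    K = np.exp(-dist2 / h)                            # scalar part k_P(w_m, w_n)
    # d/dw_m k_P(w_m, w_n) = -(2/h) P (w_m - w_n) k_P
    grad_K = -2 * np.einsum('ij,mnj->mni', P, diff) / h * K[:, :, None]
    phi = (K[:, :, None] * grad_log_p(W)[:, None, :] + grad_K).mean(axis=0)
    return W + alpha * phi @ Q.T                      # premultiply by Q (symmetric)

W = rng.normal(size=(10, 2))                          # N = 10 particles in d = 2
for _ in range(300):
    W = matrix_svgd_step(W)
```

Because Q rescales the update per co-ordinate, the badly-conditioned second dimension (curvature 100) is handled with the same step size as the first, which is exactly the pre-conditioning benefit discussed above.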

Bayesian Model-Agnostic Meta-Learning
Kim et al. (2018) proposed the Bayesian Model-Agnostic Meta-Learning (BMAML) algorithm, which learns a distribution over parameters that, when given data from a new task, can be adapted quickly to a task-specific distribution using SVGD updates as defined in Equation 3. BMAML thus makes use of scalar-valued kernels for its SVGD updates, which (as discussed earlier) does not allow the encoding of potential correlations between different parameter co-ordinates for effective optimization, a limitation which we address next.

Bayesian Model-Agnostic Meta-Learning with Matrix-SVGD

In this work we propose to improve the predictive performance of BMAML for quality estimation by using Matrix-SVGD, which uses matrix-valued kernels for more effective parameter updates, in place of the original SVGD algorithm used in Kim et al. (2018). As pre-conditioning information, we use P, the average of the Fisher information matrices of the particles:

P = (1/N) Σ_{n=1}^{N} F(w_n)   (6)

where F(w_n) is the Fisher information matrix for particle w_n. The matrix-valued kernel is then computed as:

K(w, w') = P^{-1} exp(−(1/(2h)) ‖w − w'‖²_P)   (7)

where ‖w − w'‖²_P := (w − w')ᵀ P (w − w') and h is a bandwidth parameter.
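A small NumPy sketch of this kernel construction follows. The per-particle Fisher matrices are here approximated by the damped empirical Fisher (outer products of loss gradients), which is an assumption of ours for illustration; the function name is also ours.

```python
import numpy as np

def matrix_kernel(fishers, h=1.0):
    """Given per-particle Fisher matrices, build P (their average) and
    return the matrix-valued kernel
        K(w, w') = P^{-1} * exp(-||w - w'||_P^2 / (2h)),
    where ||w - w'||_P^2 = (w - w')^T P (w - w')."""
    P = np.mean(fishers, axis=0)          # average Fisher over particles
    P_inv = np.linalg.inv(P)

    def K(w, w_prime):
        d = w - w_prime
        return P_inv * np.exp(-(d @ P @ d) / (2 * h))
    return K

# Toy usage: two particles in d = 2; each Fisher is approximated by the
# empirical Fisher g g^T of that particle's loss gradient, plus damping.
rng = np.random.default_rng(0)
grads = rng.normal(size=(2, 2))
fishers = np.stack([np.outer(g, g) + 0.1 * np.eye(2) for g in grads])
K = matrix_kernel(fishers)
```

At w = w' the exponential is 1 and the kernel reduces to the preconditioner P^{-1} itself, which is what rescales the SVGD update per co-ordinate.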
The full algorithm, which we refer to as Matrix-BMAML, is outlined in Algorithm 1. We use machine translation quality estimation as a case study in this work, and so assume access to a distribution of quality estimation tasks p(T) (each QE task can be a QE user/annotator/post-editor with their corresponding data), and a quality estimation model f_W parameterized by W, though the approach can also be applied to other natural language processing or computer vision tasks.

[Algorithm 1: the Matrix-BMAML meta-training procedure]
We first initialize the parameters of the quality estimation model (line 1). Then in each iteration, we sample a batch of QE tasks (line 3), and for each QE task, we sample instances from its training and validation sets (lines 4-6). Thereafter, task-specific parameters are initialized from the model's parameters (line 7), and then updated with K steps of Matrix-SVGD (using Equations (2) and (4) to (7)) (lines 8-10). At the end of each iteration, a meta-update is performed on the model's parameters W.
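The overall loop structure just described can be sketched as follows. This is a heavily simplified stand-in of ours: each "QE task" is a toy linear labelling function, the inner Matrix-SVGD step is replaced by an independent gradient step per particle, and the meta-update is first-order (the full algorithm differentiates through the inner updates).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tasks(batch_size=4):
    """Each 'QE task' stands in for a post-editor: a toy linear labelling
    function with an editor-specific slope. Returns (train, val) splits."""
    tasks = []
    for _ in range(batch_size):
        a = rng.uniform(0.5, 2.5)
        X = rng.normal(size=(16, 1))
        tasks.append((X[:8], a * X[:8, 0], X[8:], a * X[8:, 0]))
    return tasks

def inner_step(W, X, y, alpha=0.1):
    """Stand-in for one Matrix-SVGD step: an independent squared-loss
    gradient step per particle (the matrix-kernel coupling is omitted)."""
    preds = X @ W.T                                   # (n, N): one column per particle
    grads = 2 * X.T @ (preds - y[:, None]) / len(y)
    return W - alpha * grads.T

W = rng.normal(size=(5, 1)) * 0.1                     # N = 5 particles, d = 1
for it in range(300):                                 # outer meta-training loop
    for X_tr, y_tr, X_val, y_val in sample_tasks():   # batch of QE tasks
        W_task = W.copy()                             # init from meta-particles
        for _ in range(3):                            # K inner adaptation steps
            W_task = inner_step(W_task, X_tr, y_tr)
        # first-order meta-update on the validation loss of the adapted particles
        g_val = 2 * X_val.T @ (X_val @ W_task.T - y_val[:, None]) / len(y_val)
        W -= 0.02 * g_val.T
```

The meta-particles drift toward an initialization from which a few inner steps fit any sampled task's validation data well, mirroring lines 1-10 of Algorithm 1.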

Experiments and Results
We conduct experiments in two settings: a limited supervision setting, where all models have access to only a limited number of training instances per QE task; and a full supervision setting, where the models have access to all available training instances for each QE task.

The QT21 Dataset We evaluate our approach on the publicly available QT21 dataset (Specia et al., 2017), a large-scale corpus containing translations from both statistical (smt) and neural (nmt) machine translation systems in multiple language directions (available at http://www.qt21.eu/resources/data/). This is the largest dataset with annotator information available. We make use of data from the English-Latvian (en-lv) and English-Czech (en-cs) language directions, chosen as they contain the largest numbers of annotators. Each instance in the dataset is a tuple of a source sentence, its machine translation, the corresponding post-edited translation by a professional translator (post-editor), a reference translation, and other information such as an (anonymized) post-editor identifier. We construct a QE dataset from this corpus by computing the HTER (Snover et al., 2006) value between each machine translation and its post-edited version. We thereafter split the data into train, dev and test splits for each post-editor, which constitutes a QE task. A breakdown of the number of train, dev and test instances per QE task/post-editor is available in Table 1.
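HTER can be approximated as the word-level edit distance between the MT output and its human post-edit, normalized by the post-edit length. The sketch below omits the block-shift operation that full TER also counts, and the function names are ours.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (insertions, deletions, substitutions),
    computed with a single rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def hter(mt, post_edit):
    """Approximate HTER: edits needed to turn the MT output into its
    post-edited version, per post-edit token (shift operations omitted)."""
    mt_tok, pe_tok = mt.split(), post_edit.split()
    return edit_distance(mt_tok, pe_tok) / len(pe_tok)
```

For example, `hter("the cat sat", "the dog sat")` gives 1/3: one substitution over three post-edit tokens.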

QE Model
The quality estimation model used by all methods is based on multi-lingual DistilBERT (Sanh et al., 2019), a smaller version of multi-lingual BERT (Devlin et al., 2019) trained with knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015). It accepts as input the source and machine translation outputs concatenated as a single text, separated by a '[SEP]' token and prepended with a '[CLS]' token. The representation of the '[CLS]' token is then passed to a linear layer to predict HTER (Snover et al., 2006) values as regression targets.
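The model's input/output wiring can be sketched as follows. This is a NumPy stand-in of ours: `encode` plays the role of the DistilBERT encoder (which in reality returns contextual token embeddings, not random vectors), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768                                   # DistilBERT hidden size

def build_input(source, mt_output):
    """Concatenate source and MT output into a single token sequence,
    prepended with '[CLS]' and separated by '[SEP]'."""
    return ['[CLS]'] + source.split() + ['[SEP]'] + mt_output.split()

def encode(tokens):
    """Stand-in for the DistilBERT encoder: one vector per token.
    (A real encoder returns contextual embeddings; this is random.)"""
    return rng.normal(size=(len(tokens), HIDDEN))

W_out = rng.normal(size=HIDDEN) * 0.01         # linear regression head

def predict_hter(source, mt_output):
    tokens = build_input(source, mt_output)
    cls_vec = encode(tokens)[0]                # representation of '[CLS]'
    return float(cls_vec @ W_out)              # scalar HTER prediction
```

During training, the scalar output is regressed against the gold HTER values with a standard loss such as mean squared error.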
Benchmark Approaches We compare the proposed approach with the following: MTL-PRETRAIN, a baseline trained in classic multi-task fashion for multiple epochs using data from all QE tasks, and thereafter fine-tuned on each QE task's training data before making predictions on its test set, in the same fashion as the meta-learning approaches; REPTILE (Nichol and Schulman, 2018); Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017); implicit Model-Agnostic Meta-Learning (iMAML) (Rajeswaran et al., 2019); Amortized Bayesian Meta-Learning (ABML) (Ravi and Beatson, 2019); and BMAML (Kim et al., 2018), a state-of-the-art Bayesian meta-learning method.
Evaluation We report Pearson's r correlation and Mean Absolute Error (MAE) between model outputs and gold labels, both standard evaluation metrics in QE. Each experiment is repeated across five different random seeds, and we report the average.

Limited Supervision Results
Results obtained in a setting where all approaches have access to only very limited training instances are presented in Figure 1. As expected, training with classic multi-task learning and then fine-tuning on the training data of each QE task (MTL-PRETRAIN) results in very poor performance on both datasets. This result is consistent with the findings of Finn et al. (2017), since classic multi-task learning has no explicit objective that encourages the model to learn how to learn with limited supervision data. In contrast, all meta-learning approaches obtain consistent improvements over the MTL-PRETRAIN baseline. We find that, in general, our approach (Matrix-BMAML) obtains marked performance improvements over the other Bayesian and non-Bayesian meta-learning approaches. This demonstrates the importance of incorporating pre-conditioning information through matrix-valued kernels for more effective SVGD updates in Bayesian model-agnostic meta-learning.

Full Supervision Results
Table 2 presents results obtained when the approaches are given access to all available training data for each QE task. We observe that Matrix-BMAML obtains the best MAE on the en-cs dataset, and the best Pearson's correlation on both datasets, which again demonstrates the effectiveness of our approach in this setting.

Conclusions
We proposed a Bayesian meta-learning framework for adapting machine translation quality estimation models to the quality needs and preferences of each user with limited supervision data. We further extend a state-of-the-art Bayesian meta-learning method with the use of matrix-valued kernels, which enables the incorporation of pre-conditioning information for more effective SVGD updates. Using data from two language directions, we demonstrate improved predictive performance in both limited and full supervision settings over recent state-of-the-art Bayesian and non-Bayesian meta-learning methods.

Hyper-parameter         Value
Learning rate           3e-5
Mini-batch size         16
Max. sequence length    100

Table 3: Hyper-parameter values for all compared approaches.

All compared approaches have a run time of about two hours on average. Each model was implemented as a linear layer on top of multi-lingual DistilBERT (Sanh et al., 2019), which has a total of 134M parameters. For the evaluation metrics, Pearson's r correlation and MAE, we use the open-source implementations available in the SciPy and scikit-learn libraries respectively.
All models make use of the same values for hyper-parameters such as learning rate and batch size, selected by manual search in initial experiments. These are provided in Table 3.