Learning Compact Metrics for MT

Recent developments in machine translation and multilingual text generation have led researchers to adopt trained metrics such as COMET or BLEURT, which treat evaluation as a regression problem and use representations from multilingual pre-trained models such as XLM-RoBERTa or mBERT. Yet studies on related tasks suggest that these models are most effective when they are large, which is costly and impractical for evaluation. We investigate the trade-off between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model, using data from the WMT Metrics Shared Task. We present a series of experiments showing that model size is indeed a bottleneck for cross-lingual transfer, then demonstrate how distillation can help address this bottleneck by leveraging synthetic data generation and by transferring knowledge from one teacher to multiple students trained on related languages. Our method yields up to a 10.5% improvement over vanilla fine-tuning and reaches 92.6% of RemBERT's performance using only a third of its parameters.


Introduction
Recent improvements in Machine Translation (MT) and multilingual Natural Language Generation (NLG) have led researchers to question the use of n-gram overlap metrics such as BLEU and ROUGE (Papineni et al., 2002; Lin, 2004). Since these metrics focus solely on surface-level aspects of the generated text, they correlate poorly with human evaluation, especially when models produce high-quality text (Belz and Reiter, 2006; Callison-Burch et al., 2006; Ma et al., 2019; Mathur et al., 2020a). This has led to a surge of interest in learned metrics that cast evaluation as a regression problem and leverage pre-trained multilingual models to capture the semantic similarity between references and generated text (Celikyilmaz et al., 2020). Popular examples of such metrics include COMET (Rei et al., 2020a) and BLEURT-EXTENDED (Sellam et al., 2020a), based on XLM-RoBERTa (Conneau and Lample, 2019; Conneau et al., 2020a) and mBERT (Devlin et al., 2019) respectively. These metrics deliver superior performance over those based on lexical overlap, outperforming even crowd-sourced annotations (Freitag et al., 2021; Mathur et al., 2020b).
Large pre-trained models benefit learned metrics in at least two ways. First, they allow for cross-task transfer: the contextual embeddings they produce help address the relative scarcity of training data that exists for the task, especially with large models such as BERT or XLNet (Zhang* et al., 2020; Devlin et al., 2019; Yang et al., 2019). Second, they allow for cross-lingual transfer: MT evaluation is often multilingual, yet few, if any, popular datasets cover more than 20 languages. Evidence suggests that training on many languages improves performance on languages for which there is little training data, including in the zero-shot setup, in which no fine-tuning data is available (Conneau and Lample, 2019; Sellam et al., 2020b; Conneau et al., 2018; Pires et al., 2019).
However, these accuracy gains only appear if the model is large enough. In the case of cross-lingual transfer, this phenomenon is known as the curse of multilinguality: to allow for positive transfer, the model must be scaled up with the number of languages (Conneau and Lample, 2019). Scaling up metric models is particularly problematic, since they must often run alongside an already large MT or NLG model and, therefore, share hardware resources (see Shu et al. (2021) for a recent use case). This contention may lead to impractical delays, increases the cost of running experiments, and prevents researchers with limited resources from engaging in shared tasks.
We first present a series of experiments that validate that previous findings on cross-lingual transfer and the curse of multilinguality apply to the metrics domain, using RemBERT (Rebalanced mBERT), a multilingual extension of BERT (Chung et al., 2021). We then investigate how a combination of multilingual data generation and distillation can help us reap the benefits of multiple languages while keeping the models compact. Distillation has been shown to successfully transfer knowledge from large models to smaller ones (Hinton et al., 2015), but it requires access to a large corpus of unlabelled data (Sanh et al., 2019; Turc et al., 2019), which does not exist for our task. Inspired by Sellam et al. (2020a), we introduce a data generation method based on random perturbations that allows us to synthesize arbitrary amounts of multilingual training data. We generate an 80M-sentence distillation corpus in 13 languages from Wikipedia, and show that we can improve a vanilla pre-trained distillation setup (Turc et al., 2019) by up to 12%. A second, less explored benefit of distillation is that it lets us partially bypass the curse of multilinguality. Once the teacher (i.e., larger) model has been trained, we can generate training data for any language, including the zero-shot ones. Thus, we are less reliant on cross-lingual transfer. We can lift the restriction that one model must carry all the languages, and train smaller models targeted towards specific language families. Doing so increases performance further by up to 4%. Combining these two methods, we match 92.6% of the largest RemBERT model's performance using only a third of its parameters.
A selection of code and models is available online at https://github.com/google-research/bleurt.

Multilinguality and Model Size
To motivate our work, we quantify the trade-off between multilinguality and model capacity using data from the WMT Shared Task 2020, the most recent benchmark for MT evaluation metrics. The phenomenon has been well-studied for tasks such as translation (Aharoni et al., 2019) and language inference (Conneau et al., 2020b), but it is less well understood in the context of evaluation metrics.
Task and Data In the WMT Metrics task, participants evaluate the quality of MT systems with automatic metrics for 18 language pairs (10 to-English, 8 from-English). The success criterion is correlation with human ratings. Following established approaches (Ma et al., 2018, 2019), we use the human ratings from the previous years' shared tasks for training. Our training set contains 479k triplets (reference translation, MT output, rating) in 12 languages, and it is heavily skewed towards English. It covers the target languages of the benchmark except Polish, Tamil, Japanese, and Inuktitut. We evaluate the first three in a zero-shot fashion and do not report results on Inuktitut because its alphabet is not covered by RemBERT.

Figure 1: Performance of our models. The dashed line represents the performance of BLEURT-extended (Sellam et al., 2020b). The metric is WMT Metrics DaRR (Mathur et al., 2020b), a robust variant of Kendall Tau; higher is better. We run each experiment with 5 random seeds and report the mean result with Normal-based 95% confidence intervals.
Models Like COMET (Rei et al., 2020a) and BLEURT (Sellam et al., 2020a), we treat evaluation as a regression problem where, given a reference translation x (typically produced by a human) and a predicted translation x̂ (produced by an MT system), the goal is to predict a real-valued human rating y. As is typical, we leverage pre-trained representations (Peters et al., 2018) to achieve strong performance. Specifically, we first embed the sentence pair into a fixed-width vector v = F(x, x̂) using a pre-trained model F, then feed this vector to a linear layer that produces the prediction ŷ = Wv + b, where the weights W and bias b are learned during fine-tuning.

Figure 2: Impact of the number of fine-tuning languages on zero-shot performance, using RemBERT-6 and RemBERT-32 on en-ja, en-pl, and en-ta.
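To make the setup concrete, the sketch below shows such a regression head on top of a pre-trained encoder. It is a minimal illustration rather than the released code: the public mBERT checkpoint (bert-base-multilingual-cased) and the Hugging Face transformers API stand in for the RemBERT models used in the paper.

```python
# Sketch of the regression head: embed the (reference, candidate) pair into a
# fixed-width vector v, then predict the rating with a linear layer y_hat = Wv + b.
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "bert-base-multilingual-cased"  # stand-in for RemBERT

class MetricModel(torch.nn.Module):
    def __init__(self, checkpoint=CHECKPOINT):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        v = out.last_hidden_state[:, 0]    # [CLS] vector: v = F(x, x_hat)
        return self.head(v).squeeze(-1)    # predicted rating y_hat

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = MetricModel()
batch = tokenizer(["The cat sat on the mat."],        # reference x
                  ["A cat is sitting on the mat."],   # candidate x_hat
                  return_tensors="pt", padding=True, truncation=True)
score = model(**batch)  # one real-valued rating per sentence pair
```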
Because RemBERT is massive (32 layers, 579M parameters during fine-tuning), we pre-trained three smaller variants, RemBERT-3, RemBERT-6, and RemBERT-12, using Wikipedia data in 104 languages. The models are respectively 95%, 92%, and 71% smaller, with only 3, 6, and 12 layers. We refer to RemBERT as RemBERT-32 for consistency. The details of the architecture, pre-training, and fine-tuning are in the appendix. Figure 1 presents the performance of the models. RemBERT-32 is on par with BLEURT-EXTENDED, a metric based on a similar model which performed well at WMT Metrics 2020. The figure also corroborates that, for a fixed number of languages, larger models perform better.
Cross-lingual transfer during fine-tuning Figure 2 displays the performance of RemBERT-6 and RemBERT-32 on the zero-shot languages as we increase the number of languages used for fine-tuning. We start with English, then add the languages cumulatively, in decreasing order of frequency (without adding data for any of the target languages). Cross-lingual transfer works: in all cases, adding languages improves performance. The effect is milder on RemBERT-6, which consistently starts higher but finishes lower. The appendix presents additional details and results.
Capacity bottleneck in pre-training To further understand the effect of multilinguality, we pre-trained the smaller models from scratch using the 18 languages of WMT instead of 104, and fine-tuned them on the whole dataset. Figure 3 presents the results: restricting pre-training to the WMT languages improves performance in most configurations, with the largest relative gains for RemBERT-3. This suggests that the models are at capacity and that the 100+ languages of the pre-training corpus compete with one another.

Figure 3: Improvement after removing 86 languages from pre-training. The y-axis shows the relative improvement in agreement with human ratings (DaRR) over a RemBERT of equal size pre-trained on 104 languages. Additional details in the appendix.
Takeaways Learned metrics are subject to conflicting requirements. On one hand, the opportunities offered by pre-training and cross-lingual transfer encourage researchers to use large, multilingual models. On the other hand, the limited hardware resources inherent to evaluation call for smaller models, which cannot easily keep up with massively multilingual pre-training. We address this conflict with distillation.

Addressing the Capacity Bottleneck
The main idea behind distillation is to train a small model (the student) on the output of a larger one (the teacher) (Hinton et al., 2015). This technique is believed to yield better results than training the smaller model directly on the end task because the teacher can provide pseudo-labels for an arbitrarily large collection of training examples. Additionally, Turc et al. (2019) have shown that pre-training the student on a language modelling task before distillation improves its accuracy (in the monolingual setting), a technique known as pre-trained distillation.
Since pre-trained distillation was shown to be simple and efficient, we use it as our base setup. Figure 4 summarizes the steps: we fine-tune RemBERT-32 on human ratings, run it on an unlabelled distillation corpus, and use its predictions to supervise RemBERT-3, 6, or 12. By default, we use the WMT corpus for distillation, i.e., we use the same sentence pairs for teacher fine-tuning and student distillation (but with different labels).

Improvement 1: data generation Distillation requires access to a large multilingual dataset of sentence pairs (reference, MT output) to be annotated by the teacher. Yet the WMT Metrics corpus is relatively small, and no larger corpus exists in the public domain. To address this challenge, we generate pseudo-translations by perturbing sentences from Wikipedia. We experiment with three types of perturbations: back-translation, word substitutions with mBERT, and random deletions. The motivation is to generate surface-level noise and paraphrases, exposing the student to the different types of perturbations that an MT system could introduce. In total, we generate 80 million sentence pairs in 13 languages. The approach is similar to Sellam et al. (2020a), who use perturbations to generate pre-training data in English. We present the details of the approach in the appendix.

Improvement 2: 1-to-N distillation Another benefit of distillation is that it allows us to lift the constraint that one model must carry all the languages. In a regular fine-tuning setup, it is necessary to pack as many languages as possible into the same model because training data is sparse or non-existent in most languages. In our distillation setup, we can generate vast amounts of data for any language of Wikipedia. It is thus possible to bypass the capacity constraint by training N specialized students, each focused on a smaller number of languages. For our experiments, we pre-train five versions of each RemBERT, which cover between 3 and 18 languages each. We tried to form clusters of languages that are geographically close or linguistically related (e.g., Germanic or Romance languages), such that each cluster covers at least one language of WMT. We list all the languages in the appendix.
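The following sketch illustrates this pipeline end to end, under stated assumptions: the score(references, candidates) interface, the helper names, and the cluster/corpus data layout are illustrative and do not come from the paper; both models are assumed to be regression models of the kind shown earlier.

```python
# Sketch of pre-trained distillation and 1-to-N distillation (illustrative only).
# Assumptions: the teacher and students expose score(references, candidates)
# returning a tensor of ratings; corpus[lang] is a list of (reference, candidate)
# string pairs; make_student(langs) builds a compact, pre-trained student.
import torch

def pseudo_label(teacher, pairs, batch_size=128):
    """Run the fine-tuned teacher on unlabelled (reference, candidate) pairs."""
    teacher.eval()
    labels = []
    with torch.no_grad():
        for i in range(0, len(pairs), batch_size):
            refs, cands = zip(*pairs[i:i + batch_size])
            labels.extend(float(s) for s in teacher.score(list(refs), list(cands)))
    return labels

def distill(student, pairs, labels, steps=500_000, batch_size=128, lr=1e-5):
    """Train a compact student to regress on the teacher's pseudo-labels (MSE loss)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(len(pairs), (batch_size,)).tolist()
        refs, cands = zip(*(pairs[j] for j in idx))
        target = torch.tensor([labels[j] for j in idx])
        pred = student.score(list(refs), list(cands))
        loss = torch.nn.functional.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()
    return student

def one_to_n(teacher, make_student, clusters, corpus):
    """1-to-N distillation: one teacher, one specialized student per language cluster."""
    students = {}
    for name, langs in clusters.items():
        pairs = [p for lang in langs for p in corpus[lang]]
        labels = pseudo_label(teacher, pairs)
        students[name] = distill(make_student(langs), pairs, labels)
    return students
```

Because the teacher labels each cluster's corpus independently, a student only needs enough capacity for its own languages, which is what lets the 1-to-N setup sidestep the capacity constraint.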
Results Table 1 presents performance results on WMT Metrics 2020. For each student model, we present the performance of a naive fine-tuning baseline, followed by vanilla pre-trained distillation on WMT data. We then introduce our synthetic data and 1-to-N distillation. We compare to COMET, PRISM, and BLEURT-EXTENDED, three SOTA metrics from WMT Metrics '20 (Mathur et al., 2020b).
On en-*, the improvements are cumulative: Distill WMT+Wiki outperforms Distill WMT (between 5 and 12% improvement), and it is itself outperformed by 1-to-N (up to 4%). Combining techniques improves the baselines in all cases, with up to a 10.5% improvement compared to fine-tuning. RemBERT-12 matches 92.6% of the teacher model's performance using only a third of its parameters, and it is competitive with current state-of-the-art models.
Runtime To validate the usefulness of our approach, we illustrate how to speed up RemBERT in Figure 5. We obtain a first 1.5-2X speedup over RemBERT-32 by applying length-based batching, a simple optimization that consists of batching examples with similar lengths and cropping the resulting tensor, as done in BERTScore (Zhang* et al., 2020). Doing so removes the padding tokens, which cause wasteful computation. We obtain a further 1.5X speedup by using the distilled version of the model, RemBERT-12. The final model processes 4.8 tuples per second without a GPU (86 with a GPU), a 2.5-3X improvement over RemBERT-32. Note that RemBERT-32 and COMET are both based on the Transformer architecture (we used a COMET checkpoint based on XLM-R Large), and RemBERT-32 is larger than COMET. We hypothesize that the performance gap comes from differences in implementation and model architecture; in particular, RemBERT-32 has an input sequence length of 128 while XLM-R operates on sequences of length 512.
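A minimal sketch of length-based batching follows, assuming a Hugging Face tokenizer (again with mBERT as a stand-in); the helper name and batch size are illustrative choices, not taken from the paper.

```python
# Sketch of length-based batching: sort sentence pairs by tokenized length so each
# batch is padded only to its own longest member rather than a global maximum.
from transformers import AutoTokenizer

def length_batched(tokenizer, pairs, batch_size=32, max_length=128):
    lengths = [len(tokenizer.encode(ref, cand)) for ref, cand in pairs]
    order = sorted(range(len(pairs)), key=lambda i: lengths[i])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        refs = [pairs[i][0] for i in idx]
        cands = [pairs[i][1] for i in idx]
        # padding="longest" crops the batch tensor, removing wasteful padding tokens.
        yield tokenizer(refs, cands, padding="longest", truncation=True,
                        max_length=max_length, return_tensors="pt")

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
pairs = [("A short reference.", "A short candidate."),
         ("A noticeably longer reference sentence for the example.",
          "A noticeably longer candidate sentence for the example.")]
for batch in length_batched(tok, pairs):
    pass  # feed each batch to the metric model
```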

Conclusion
We experimented with cross-lingual transfer in learned metrics, exposed the trade-off between multilinguality and model capacity, and addressed the problem with distillation on synthetic data. Further work includes generalizing the approach to other tasks and experimenting with complementary compression methods such as pruning and quantization (Kim et al., 2021; Sanh et al., 2020), as well as increasing linguistic coverage (Joshi et al., 2020).

A.1 RemBERT Pre-Training
RemBERT is an encoder-only architecture, similar to BERT but with an optimized parameter allocation (Chung et al., 2021). It has a reduced input embedding dimension, and the saved parameters are reinvested in wider and deeper Transformer layers, keeping the model size constant. In addition, the input and the output embeddings (the weights associated with the softmax layer) are decoupled during pre-training. Table 2 describes the architecture of the four RemBERT models, along with the number of parameters (note that we remove the output embedding layer during fine-tuning, which reduces the model size). We obtained the original RemBERT model from its authors, and we trained the smaller models for the purpose of this study with a modified version of the public BERT codebase. By default, all models are pre-trained on 104 languages using a masked language modelling objective (Devlin et al., 2019). The setup for the smaller models is similar to Chung et al. (2021), except that RemBERT uses mC4 (Xue et al., 2020) and Wikipedia while we use Wikipedia only. We train the custom RemBERT models for 2^17 steps using the Adam optimizer (Kingma and Ba, 2015), with learning rate 0.0002 (10,000 steps of linear warm-up followed by an inverse square root decay schedule) and batch size 512 on 16 TPU v3 chips. To reduce the size of the models further, we use a smaller SentencePiece model with 120k tokens instead of 250k. Large RemBERT was fine-tuned with sequence length 128, while the student models were fine-tuned with sequence length 512.
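As a concrete illustration, the snippet below implements one common reading of the schedule above (linear warm-up for 10,000 steps to a 0.0002 peak, then inverse square root decay); the exact decay formula used for training is not given here, so this is an assumption.

```python
# Sketch of the pre-training learning-rate schedule. Only the warm-up length, peak
# value, and schedule family come from the text; the precise formula is assumed.
def learning_rate(step, peak_lr=2e-4, warmup_steps=10_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warm-up
    return peak_lr * (warmup_steps / step) ** 0.5  # inverse square root decay

# The custom models train for 2**17 = 131,072 steps.
for step in (1_000, 10_000, 50_000, 131_072):
    print(step, learning_rate(step))
```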

A.2 Fine-Tuning for the WMT Metrics Shared Task
We fine-tune RemBERT on the WMT Metrics shared task following the methodology of Sellam et al. (2020b). We combine all the sentence pairs of WMT 2015 to 2019, and set aside 5% of the data for continuous evaluation. The data can be downloaded from the WMT website. The distribution of examples per language is shown in Figure 6. We sample the sentences randomly, then re-adjust the sample so that no reference translations leak between the two splits. We train the model with Adam for 5,000 steps with a batch size of 128, evaluating it on the held-out set every 250 steps. We keep the checkpoint that leads to the best performance. To determine the learning rate, we ran a parameter sweep on a previous year of the benchmark (using 2015 to 2018 for training and 2019 for testing) over the values [1e-6, 2e-6, 5e-6, 7e-6, 8e-6, 9e-6, 1e-5, 2e-5], and kept the learning rate that led to the best results (1e-6). We also experimented with language rebalancing, batch sizes, dropout, and training duration in preliminary experiments. The setup for RemBERT-3, 6, and 12 is similar, except that we used learning rate 1e-5 (obtained with a parameter sweep on a randomly held-out sample), 20,000 training steps, batch size 32, and evaluation every 1,000 steps. We train each model with 4 TPU v2 chips, and evaluate with an Nvidia Tesla V100 GPU.
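The sketch below shows one way to build such a leakage-free split: examples are grouped by reference translation before sampling, so that no reference appears on both sides. The function name and triplet layout are illustrative, not from the released codebase.

```python
# Sketch of a leakage-free train/eval split: group examples by reference translation,
# then hold out ~5% of the references so no reference appears in both splits.
import random

def split_by_reference(examples, eval_fraction=0.05, seed=0):
    """examples: list of (reference, mt_output, rating) triplets."""
    rng = random.Random(seed)
    references = sorted({ref for ref, _, _ in examples})
    rng.shuffle(references)
    n_eval = max(1, int(len(references) * eval_fraction))
    eval_refs = set(references[:n_eval])
    train = [ex for ex in examples if ex[0] not in eval_refs]
    held_out = [ex for ex in examples if ex[0] in eval_refs]
    return train, held_out

data = [("Ref A", "Hyp 1", 0.8), ("Ref A", "Hyp 2", 0.3), ("Ref B", "Hyp 3", 0.9)]
train, held_out = split_by_reference(data)
```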

B Additional Ablation Experiments on the WMT Metrics Shared Task 2020
We present the details of our ablation experiments, which expose the trade-off between model capacity and multilinguality in learned metrics. In Figure 7, we iteratively expand the number of fine-tuning languages, starting with only English and adding languages in decreasing order of frequency. We add the languages by bucket, such that each bucket contains about the same number of examples (Figure 6 shows the size of the training set for each language).
We start with the five languages for which we have training data. In all cases, introducing fine-tuning data for a particular language pair improves the metric's performance on that language. The effect of subsequent additions (that is, cross-lingual transfer) is mixed. For instance, the effect is mild to negative on *-en, while it is mostly positive on en-cs.
Adding data has a different effect on the zero-shot languages: in almost all cases, it brings improvements. The effect appears milder on the smaller models, especially RemBERT-3, for which we observe slight performance drops (en-ta and en-ja), consistent with the "curse of multilinguality" (Conneau and Lample, 2019). Figure 8 shows the limit of our smaller models: in 21 cases out of 24 (regardless of whether the language is zero-shot or not), the performance of the model improves when we remove 86 languages from pre-training. This is further evidence that the models are saturated.

C.1 Distillation Data Generation Method
We generate synthetic (reference translation, MT output) pairs by perturbing sentences from Wikipedia. A similar method has been shown to be useful for generating pre-training data in a monolingual context (Sellam et al., 2020a). We apply it to our multilingual setting with three types of perturbations (a short code sketch follows this list):
• Word substitution: we randomly mask up to 15 WordPiece tokens and fill in the masks with a multilingual model. We sample the number of tokens to be masked uniformly, and we run beam search with mBERT, using beam size 8. We used the official mBERT model (https://github.com/google-research/bert).
• Back-translation: we translate the Wikipedia sentences from the source language to English, then back into the source language with translation models. We used the Tensor2Tensor framework (https://github.com/tensorflow/tensor2tensor), with models trained on the corresponding WMT datasets.
• Word dropping: we duplicate 30% of the dataset and randomly drop words from the perturbations.
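As referenced above, here is a minimal sketch of the word-substitution and word-dropping perturbations. It is an approximation: a Hugging Face fill-mask pipeline with mBERT replaces the WordPiece-level masking with beam search described above, and all thresholds are illustrative.

```python
# Sketch of two of the perturbations: mask-and-fill word substitution and word dropping.
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

def substitute_words(sentence, max_masks=15):
    """Mask a random number of words and let mBERT propose replacements."""
    tokens = sentence.split()
    n_masks = random.randint(1, min(max_masks, max(1, len(tokens) // 3)))
    for i in random.sample(range(len(tokens)), n_masks):
        masked = tokens[:]
        masked[i] = fill.tokenizer.mask_token
        tokens[i] = fill(" ".join(masked), top_k=1)[0]["token_str"]
    return " ".join(tokens)

def drop_words(sentence, drop_prob=0.15):
    """Randomly delete words to simulate omissions."""
    kept = [w for w in sentence.split() if random.random() > drop_prob]
    return " ".join(kept) if kept else sentence

reference = "The committee approved the new budget on Tuesday."
candidate = drop_words(substitute_words(reference))  # pseudo MT output paired with `reference`
```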
We generate between 1.8M and 7.3M sentence pairs for each language, for a total of 84M unlabelled examples.

C.2 Languages Used in 1-to-N Distillation

Table 3 shows the five language clusters used for the 1-to-N distillation experiments. The groups were created by first joining languages based on their linguistic proximity (e.g., Romance or Germanic languages). Since that left several languages in clusters of their own, we then combined them based on geographic distance (e.g., Tamil is part of the otherwise Indo-Iranian cluster, and Japanese is part of a cluster of Sino-Tibetan languages).

C.3 Setup and Hyper-parameters
The hyper-parameters we use for distillation are similar to those of fine-tuning, except that we train the models for 500,000 batches of 128 examples, and thus we learn from 64M sentences instead of 640K. Doing so takes about 1.5 days for RemBERT-3 and 6, and 3.5 days for RemBERT-12. We train the models to completion (i.e., no early-stopping).

D Additional Details of Metrics Performance
We report system-level and segment-level performance of the compact metrics on the WMT Metrics shared task 2020, extending the performance analysis of the distilled models. We re-implemented the WMT Metrics benchmark using data provided by the organizers. The results are consistent with the published version (Mathur et al., 2020b) except for segment-level to-English pairs, marked with a dagger † in the tables. We ran BLEURT, BLEURT-Tiny, BLEURT-English WMT'20, and BLEURT-EXTENDED ourselves. The first two are available online (https://github.com/google-research/bleurt); the latter two were submitted to the WMT Metrics shared task 2020 and were obtained from the authors. We also report results for three state-of-the-art metrics: COMET (Rei et al., 2020a), PRISM (Thompson and Post, 2020), and YISI-1 (Lo, 2019), using the WMT Metrics report. We only report results for from-English pairs because the benchmark implementations are consistent for these. We also add the baseline N Fine-tuning, which describes the performance of fine-tuning the N models presented in Section C.2 directly on WMT data.
As observed in the past (Mathur et al., 2020a,b), system- and segment-level correlations present very different outcomes: the teacher RemBERT-32 is outperformed by several other metrics on both en-* and *-en, and the impact of the distillation improvements is mixed on to-English pairs. A possible explanation is that system-level evaluation involves small sample sizes and that the data is very noisy (Freitag et al., 2021). Another explanation is that system-level quality assessment is simply another task, which requires its own set of optimizations. In spite of these divergences, Table 7 shows that our contributions bring solid improvements on en-* (up to 20.2%), which validates our approach.

Table 7: System-level agreement with human ratings on from-English language pairs, excluding outliers where they are available. The metric is Pearson correlation (Mathur et al., 2020b); higher is better. The dagger † indicates that the results were obtained from the WMT report.