An Exploratory Analysis of Multilingual Word-Level Quality Estimation with Cross-Lingual Transformers

Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that multilingual QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.


Introduction
Quality Estimation (QE) is the task of assessing the quality of a translation without having access to a reference translation (Specia et al., 2009). Translation quality can be estimated at different levels of granularity: word, sentence and document level (Ive et al., 2018). So far the most popular task has been sentence-level QE, in which QE models provide a score for each pair of source and target sentences. A more challenging task, which is currently receiving a lot of attention from the research community, is word-level quality estimation. This task provides more fine-grained information about the quality of a translation, indicating which words from the source have been incorrectly translated in the target, and whether the words inserted between these words are correct (good vs bad gaps). This information can be useful for post-editors, as it indicates the parts of a sentence on which they have to focus more.
Word-level QE is generally framed as a supervised ML problem (Kepler et al., 2019; Lee, 2020) trained on data in which the correctness of the translation is labelled at word level (i.e. good, bad, gap). The publicly available training data for building word-level QE models is limited to very few language pairs, which makes it difficult to build QE models for many languages. From an application perspective, even for the languages with resources, it is difficult to maintain separate QE models for each language since the state-of-the-art neural QE models are large in size (Ranasinghe et al., 2020b).
In our paper, we address this problem by developing multilingual word-level QE models which perform competitively in different domains, MT types and language pairs. In addition, for the first time, we propose word-level QE as a zero-shot cross-lingual transfer task, enabling new avenues of research in which multilingual models can be trained once and then serve a multitude of languages and domains. The main contributions of this paper are the following:
i We introduce a simple architecture to perform word-level quality estimation that predicts the quality of the words in the source sentence, target sentence and the gaps in the target sentence.
ii We explore multilingual, word-level quality estimation with the proposed architecture. We show that multilingual models are competitive with bilingual models.
iii We inspect few-shot and zero-shot word-level quality estimation with the bilingual and multilingual models. We report how the source-target direction, domain and MT type affect the predictions for a new language pair.
iv We release the code and the pre-trained models as part of an open-source framework.
Figure 1: Model Architecture

Related Work
Quality Estimation. Early approaches in word-level QE were based on features fed into a traditional machine learning algorithm. Systems like QuEst++ (Specia et al., 2015) and MARMOT (Logacheva et al., 2016) were based on features used with Conditional Random Fields to perform word-level QE. With deep learning models becoming popular, the next generation of word-level QE algorithms were based on bilingual word embeddings fed into deep neural networks. Such approaches can be found in OpenKiwi (Kepler et al., 2019). However, the current state of the art in word-level QE is based on transformers like BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) where a simple linear layer is added on top of the transformer model to obtain the predictions (Lee, 2020). All of these approaches consider quality estimation as a language-specific task and build a different model for each language pair. This approach has many drawbacks in real-world applications, some of which are discussed in Section 1.
Multilinguality. Multilinguality allows training a single model to perform a task from and/or to multiple languages. Even though this has been applied to many tasks Zampieri, 2020, 2021) including NMT (Nguyen and Chiang, 2017; Aharoni et al., 2019), multilingual approaches have rarely been used in QE. Shah and Specia (2016) explore QE models for more than one language, using multi-task learning with annotators or languages as multiple tasks. They show that multilingual models lead to marginal improvements over bilingual ones with a traditional black-box, feature-based approach. In a recent study, Ranasinghe et al. (2020b) show that multilingual QE models based on transformers trained on high-resource languages can be used for zero-shot, sentence-level QE in low-resource languages. With a similar architecture, but with multi-task learning, multilingual QE models have been reported to outperform bilingual models, particularly in less balanced quality label distributions and low-resource settings. However, these two papers focus on sentence-level QE and, to the best of our knowledge, no prior work has been done on multilingual, word-level QE models.

Architecture
Our architecture relies on the XLM-R transformer model (Conneau et al., 2020) to derive the representations of the input sentences. XLM-R has been trained on a large-scale multilingual dataset in 100 languages, totalling 2.5TB, extracted from the CommonCrawl datasets. It is trained using only RoBERTa's (Liu et al., 2019) masked language modelling (MLM) objective. XLM-R was used by the winning systems in the recent WMT 2020 shared task on sentence-level QE (Ranasinghe et al., 2020a; Lee, 2020). This motivated us to use a similar approach for word-level QE. Our architecture adds a new token to the XLM-R tokeniser called <GAP>, which is inserted between the words in the target. We then concatenate the source and the target with a [SEP] token and feed them into XLM-R. A simple linear layer is added on top of the word and <GAP> embeddings to predict whether each word or gap is "Good" or "Bad", as shown in Figure 1. The training configurations and the system specifications are presented in the supplementary material.
We used several language pairs for which word-level QE annotations were available: English-Chinese (En-Zh), English-Czech (En-Cs), English-German (En-De), English-Russian (En-Ru), English-Latvian (En-Lv) and German-English (De-En). The texts are from a variety of domains and the translations were produced using both neural and statistical machine translation systems. More details about these datasets can be found in Table 1 and in Fonseca et al. (2019).
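The input construction described for the architecture can be illustrated with a minimal sketch. It operates on whitespace-separated words only and does not call any tokeniser or model; the function name `build_input` and the `edge_gaps` flag are hypothetical. Whether a <GAP> is also placed before the first and after the last target word is an assumption here (it follows the common convention of N+1 gaps for N target words), and the separator string used by the actual tokeniser may differ from the [SEP] named in the text.

```python
from typing import List

GAP = "<GAP>"  # extra token added to the tokeniser vocabulary, as described in the text
SEP = "[SEP]"  # separator as named in the text; the real tokeniser's string may differ


def build_input(source: List[str], target: List[str], edge_gaps: bool = True) -> List[str]:
    """Interleave <GAP> tokens with the target words and prepend the source.

    edge_gaps=True also adds a gap before the first and after the last
    target word (an assumption, not confirmed by the text).
    """
    with_gaps: List[str] = []
    for i, word in enumerate(target):
        # A gap precedes every word; the leading one is optional.
        if i > 0 or edge_gaps:
            with_gaps.append(GAP)
        with_gaps.append(word)
    if edge_gaps and target:
        with_gaps.append(GAP)  # trailing gap after the last word
    return source + [SEP] + with_gaps


tokens = build_input(["the", "cat"], ["die", "Katze"])
print(tokens)
```

In a full system each position of this sequence (source word, target word or gap) would receive its own binary prediction from the linear layer; the sketch stops at the sequence layout.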

Evaluation Criteria
For evaluation, we used the approach proposed in the WMT shared tasks, in which classification performance is calculated as the product of the F1-scores for the 'OK' and 'BAD' classes against the true labels. This score is computed independently for three categories: words in the target ('OK' for correct words, 'BAD' for incorrect words), gaps in the target ('OK' for genuine gaps, 'BAD' for gaps indicating missing words) and source words ('BAD' for words that lead to errors in the target, 'OK' for other words). In recent WMT shared tasks, the most popular category was predicting quality for words in the target. Therefore, in Section 5 we only report the F1-score for words in the target. Other results are presented in the supplementary material. Prior to WMT 2019, organisers provided separate scores for gaps and words in the target, while after WMT 2019 they produce a single result for target gaps and words. We follow this latter approach.
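The metric described above can be written down in a few lines. The following is a minimal sketch in plain Python, not the official WMT evaluation script; the function names are hypothetical, and returning 0.0 when a class has no true positives is a convention chosen here.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1-score of one class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # convention chosen for this sketch
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_multi(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Product of the F1-scores of the 'OK' and 'BAD' classes."""
    return f1_for_class(gold, pred, "OK") * f1_for_class(gold, pred, "BAD")


gold = ["OK", "OK", "BAD", "OK", "BAD"]
pred = ["OK", "BAD", "BAD", "OK", "OK"]
print(f1_multi(gold, pred))
```

Because the two F1-scores are multiplied, a model that labels every token 'OK' scores zero regardless of how frequent the 'OK' class is, which is why the product is preferred over accuracy for such imbalanced labels.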

Results
The values displayed diagonally across section I of Table 2 show the results for supervised, bilingual, word-level QE models where the model was trained on the training set of a particular language pair and tested on the test set of the same language pair. As can be seen in section V, the architecture outperforms the baselines in all the language pairs and also outperforms the majority of the best systems from previous competitions. In addition to the target word F1-score, our architecture outperforms the baselines and best systems in target gaps F1-score and source words F1-score too as shown in Tables 5 and 6. In the following sections we explore its behaviour in different multilingual settings.

Multilingual QE
We combined instances from all the language pairs and built a single word-level QE model. Our results, displayed in section II ("All") of Table 2, show that multilingual models perform on par with bilingual models or even better for some language pairs. We also investigate whether combining language pairs that share either the same domain or MT type can be more beneficial, since it is possible that the learning process is better when language pairs share certain characteristics. However, as shown in sections III and IV of Table 2, for the majority of the language pairs, specialised multilingual models built on certain domains or MT types do not perform better than multilingual models which contain all the data.

Table 2: Target F1-Multi between the algorithm predictions and human annotations. Best results for each language by any method are marked in bold. Sections I, II and III indicate the different evaluation settings. Section IV shows the results of the state-of-the-art methods and the best system submitted for the language pair in that competition. NR implies that a particular result was not reported by the organisers. Zero-shot results are coloured in grey and the value shows the difference between the best result in that section for that language pair and itself.

Zero-shot QE
To test whether a QE model trained on a particular language pair can be generalised to other language pairs, different domains and MT types, we performed zero-shot quality estimation. We used the QE model trained on a particular language pair and evaluated it on the test sets of the other language pairs. The non-diagonal values of section I in Table 2 show how each QE model performed on other language pairs. For better visualisation, these values are reported as the change in score when the zero-shot QE model is used instead of the bilingual QE model. As can be seen, the scores decrease, but this decrease is negligible and is to be expected. For most pairs, the QE model that did not see any training instances of that particular language pair outperforms the baselines that were trained extensively on that particular language pair. Further analysing the results, we can see that zero-shot QE performs better when the language pair shares some properties such as domain, MT type or language direction. For example, En-De SMT ⇒ En-Cs SMT is better than En-De NMT ⇒ En-Cs SMT and En-De SMT ⇒ En-De NMT is better than En-Cs SMT ⇒ En-De NMT.
We also experimented with zero-shot QE with multilingual QE models. We trained the QE model on all the pairs except one and performed prediction on the test set of the language pair left out. In section II ("All-1"), we show its difference to the multilingual QE model. This also provides competitive results for the majority of the languages, indicating that it is possible to train a single multilingual QE model and extend it to a multitude of languages and domains. This approach provides better results than performing transfer learning from a bilingual model.
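The "All-1" setting amounts to a leave-one-out split over language pairs. A minimal sketch of how the pooled training set could be assembled is given below; the function name `all_minus_one` and the dictionary-based instance format are hypothetical, and model training and evaluation are deliberately left out.

```python
from typing import Dict, List


def all_minus_one(train_sets: Dict[str, List[dict]], held_out: str) -> List[dict]:
    """Pool the training instances of every language pair except `held_out`."""
    if held_out not in train_sets:
        raise KeyError(f"unknown language pair: {held_out}")
    pooled: List[dict] = []
    for pair, instances in train_sets.items():
        if pair != held_out:
            pooled.extend(instances)
    return pooled


# Toy data: each instance only records the pair it came from.
train_sets = {
    "En-De": [{"pair": "En-De", "id": 0}, {"pair": "En-De", "id": 1}],
    "En-Cs": [{"pair": "En-Cs", "id": 0}],
    "De-En": [{"pair": "De-En", "id": 0}],
}

pooled = all_minus_one(train_sets, "De-En")
print(len(pooled), sorted({inst["pair"] for inst in pooled}))
```

A model trained on `pooled` would then be evaluated on the test set of the held-out pair only, so that no instance of that pair is seen during training.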
One limitation of the zero-shot QE is its inability to perform when the language direction changes. In the scenario where we performed zero-shot learning from De-En to other language pairs, results degraded considerably from the bilingual result. Similarly, the performance is rather poor when we test on De-En for the multilingual zero-shot experiment as the direction of all the other pairs used for training is different. This is in line with results reported by Ranasinghe et al. (2020b) for sentence level.

Few-shot QE
We also evaluated how the QE models behave with a limited number of training instances. For each language pair, we initialised the weights of the bilingual model with those of the relevant All-1 QE model and trained it on 100, 200, 300 and up to 1000 training instances. We compared the results with those obtained by training the QE model from scratch for that language pair. The results in Figure 2 show that the All-1 (multilingual) model performs well above the QE model trained from scratch (Bilingual) when only a limited number of training instances is available. Even for the De-En language pair, for which we had comparatively poor zero-shot results, the multilingual model provided better results with a few training instances. It seems that having the model weights already fine-tuned in the multilingual model provides an additional boost to the training process, which is advantageous in a few-shot scenario.
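The few-shot training sets of increasing size can be drawn as in the following sketch. The text does not state how the instances were sampled, so the random shuffle, the fixed seed and the nesting of the subsets (the 100-instance set is contained in the 200-instance set, and so on) are all assumptions, and `few_shot_subsets` is a hypothetical helper.

```python
import random
from typing import Dict, Iterable, List, Sequence


def few_shot_subsets(
    instances: Sequence,
    sizes: Iterable[int] = range(100, 1001, 100),
    seed: int = 0,
) -> Dict[int, List]:
    """Return nested random subsets of `instances`, one per requested size."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    shuffled = list(instances)
    rng.shuffle(shuffled)
    # Taking prefixes of one shuffle makes each smaller set a subset of the larger ones.
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}


subsets = few_shot_subsets(range(5000))
print(sorted(subsets))
```

Each subset would be used twice: once to continue training from the All-1 weights and once to train the bilingual model from scratch, so that the two curves are compared on identical data.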

Conclusions
In this paper, we explored multilingual, word-level QE with transformers. We introduced a new architecture based on transformers to perform word-level QE. The implementation of the architecture, which is based on Hugging Face (Wolf et al., 2020), has been integrated into the TransQuest framework (Ranasinghe et al., 2020b), which won the WMT 2020 QE task on sentence-level direct assessment (Ranasinghe et al., 2020a).
In our experiments, we observed that multilingual QE models deliver excellent results on the language pairs they were trained on. In addition, the multilingual QE models perform well in the majority of the zero-shot scenarios where the multilingual QE model is tested on an unseen language pair. Furthermore, multilingual models perform very well with few-shot learning on an unseen language pair when compared to training from scratch for that language pair, showing that multilingual QE models are effective even with a limited number of training instances. While we centred our analysis around the F1-score of the target words, these findings are consistent with the F1-score of the target gaps and the F1-score of the source words too. This suggests that we can train a single multilingual QE model on as many languages as possible and apply it to other language pairs as well. These findings can be beneficial for performing QE in low-resource languages, for which training data is scarce, and in settings where maintaining several QE models for different language pairs is arduous.