Rethinking Round-Trip Translation for Machine Translation Evaluation

Automatic evaluation of low-resource language translation suffers from a scarcity of parallel corpora. Round-trip translation could serve as a clever and straightforward technique to alleviate the requirement of a parallel evaluation corpus. However, in the era of statistical machine translation (SMT), the evaluation scores of forward and round-trip translations were observed to correlate only obscurely. In this paper, we report the surprising finding that round-trip translation can be used for automatic evaluation without references. Firstly, our revisit of round-trip translation in SMT evaluation unveils that its long-standing misunderstanding is essentially caused by the copying mechanism. After removing the copying mechanism from SMT, round-trip translation scores appropriately reflect forward translation performance. We then demonstrate that this rectification is overdue, as round-trip translation can benefit multiple machine translation evaluation tasks. More specifically, round-trip translation can be used i) to predict corresponding forward translation scores; ii) to improve the performance of a recently advanced quality estimation model; and iii) to identify adversarial competitors in shared tasks via cross-system verification.


Introduction
Thanks to the recent progress of neural machine translation (NMT) and large-scale multilingual corpora, machine translation (MT) systems have achieved remarkable performance on high- to medium-resource languages (Fan et al., 2021; Pan et al., 2021; Goyal et al., 2022a). However, the development of MT technology on low-resource language pairs still suffers from insufficient data for training and evaluation (Aji et al., 2022; Siddhant et al., 2022). Recent advances in multilingual pre-trained language models explore methods trained on monolingual data, using data augmentation and denoising auto-encoding (Xia et al., 2019; Liu et al., 2020). However, high-quality parallel corpora are still required for evaluating translation quality. Such a requirement is especially resource-consuming when working on i) hundreds of under-represented low-resource languages (Bird and Chiang, 2012; Joshi et al., 2019; Aji et al., 2022) and ii) translations for specific domains (Li et al., 2020; Müller et al., 2020).

Figure 1: Given a corpus D_A in Language A, we are able to acquire the round-trip translation (RTT) results D′_A and forward translation (FT) results D_B via machine translation. One question was raised and discussed by the machine translation community about two decades ago: "Can RTT results be used to estimate FT performance?" While some early studies showed the possibility (Rapp, 2009), some researchers tended to be against round-trip translation due to the poor correlations between FT and RTT scores. Our work gives a clear and positive answer on the usefulness of RTT, based on extensive experiments and analysis.
Standard MT evaluation requires parallel data which includes human translations as references, such that machine translations can be compared to the references with metrics such as BLEU or chrF. In contrast, round-trip translation (RTT), as illustrated in Figure 1, instead uses a translation system to back-translate the machine translation into the source language, after which this round-tripped text can be compared to the original source (using standard reference-based metrics). This approach is compelling in that it removes the requirement for parallel evaluation corpora; however, influential work showed little correlation between evaluation scores measured using RTT versus standard reference-based evaluation (Huang, 1990; Koehn, 2005; Somers, 2005; Zaanen and Zwarts, 2006), when applied to statistical machine translation (SMT) and rule-based machine translation (RMT). Consequently, the RTT method has seen little use, with a few notable exceptions in recent years, e.g., to improve quality estimation methods (Moon et al., 2020; Crone et al., 2021; Agrawal et al., 2022).
In this work, we revisit the dispute over the usefulness of RTT evaluation in the modern era of neural machine translation (NMT). We argue that the main reason for the negative findings was historical systems' use of reversible rules in translation, notably copying, whereby a system copies unrecognized source tokens into the target output; this is often penalized in FT evaluation but rewarded by RTT evaluation. We conduct extensive experiments to demonstrate the effect of the copying mechanism on SMT. We then illustrate strong correlations between FT-SCOREs and RTT-SCOREs on various MT systems, including NMT and SMT without a copying mechanism.
This finding sets the basis for using RTT-SCORE in MT evaluation. Three application scenarios in MT evaluation have been investigated to show the effectiveness of RTT-SCORE. Firstly, RTT-SCOREs can be used to predict FT-SCOREs by training a simple but effective linear regression model on several hundred language pairs. The prediction performance is robust when evaluating multiple MT systems in transferred domains and on unseen language pairs, including low-resource languages. Then, RTT-SCOREs prove effective in improving the performance of a recently advanced quality estimation model, which further supports the feasibility of RTT-SCORE. Finally, a cross-system check (X-Check) mechanism is introduced to RTT evaluation for real-world MT shared tasks. By leveraging the estimates from multiple translation systems, X-Check manages to identify adversarial competitors who know the mechanism of RTT evaluation and thus exploit the copying strategy as a shortcut to outperform other honest participants.

Related Work
Reference-based Machine Translation Evaluation Metrics. Designing high-quality automatic metrics for evaluating translation quality is one of the fundamental challenges in MT research. Most existing metrics largely rely on parallel corpora to provide aligned texts as references (Papineni et al., 2002; Lin, 2004). The performance of a translation is estimated by comparing the system outputs against ground-truth references. A classic school of reference-based evaluation is based on string-match methods, which calculate the matched ratio of word sequences as strings, such as BLEU (Papineni et al., 2002; Post, 2018), chrF (Popović, 2015) and TER (Snover et al., 2006). In addition, recent metrics utilize the semantic representations of texts from pre-trained language models to estimate their relevance, such as BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020). These methods are demonstrated to correlate better with human evaluation (Kocmi et al., 2021) than string-based metrics. Some other reference-based evaluation metrics require supervised training on contextual word embeddings to work well (Mathur et al., 2019; Rei et al., 2020). While these automatic evaluation metrics are widely applied in MT evaluation, they are generally not applicable to low-resource language translation or new translation domains (Mathur et al., 2020). Our work demonstrates that a reference-free MT metric (RTT-SCORE) can be used to estimate traditional reference-based metrics.
Reference-free Quality Estimation. In recent years, there has been a surge of interest in the task of directly predicting human judgment, namely quality estimation (QE), without access to parallel reference translations at run-time (Specia et al., 2010, 2013; Bojar et al., 2014; Zhao et al., 2020). The recent focus of QE is mainly based on human evaluation approaches, direct assessment (DA) and post-editing, where researchers train models on human judgment features to estimate MT quality. Among these recent QE metrics, learning-based models, YiSi-2 (Lo, 2019) and COMET-QE-MQM (Rei et al., 2021), to name a few, demonstrate their effectiveness on WMT shared tasks. Our work shows that RTT-SCORE improves a recently advanced QE model.

Revisiting Round-trip Translation

Evaluation on Round-trip Translation
Given machine translation systems T_{A→B} and T_{B→A} between two languages (L_A and L_B), and a monolingual corpus D_A, the forward translation (FT) D_B = T_{A→B}(D_A) and the back translation (BT) D′_A = T_{B→A}(D_B) constitute a round-trip translation (RTT). Under an automatic metric M, the round-trip quality is measured against the original source, RTT-SCORE^M_{A⟳B} = M(D′_A, D_A). On the other hand, traditional MT evaluation on a parallel corpus (D_A, D^ref_B) measures the forward translation against the references, FT-SCORE^M_{A→B} = M(D_B, D^ref_B). The main research question is whether FT-SCOREs are correlated with, and therefore could be predicted by, RTT-SCOREs.
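To make the two quantities concrete, the following is a minimal sketch of computing corpus-level FT-SCORE and RTT-SCORE with sacrebleu; the translate_a2b and translate_b2a functions are hypothetical stand-ins for T_{A→B} and T_{B→A}, and BLEU stands in for an arbitrary metric M.

```python
# Minimal sketch: FT-SCORE needs references in language B,
# while RTT-SCORE only needs the monolingual source corpus D_A.
from sacrebleu.metrics import BLEU

def ft_and_rtt_scores(src_a, refs_b, translate_a2b, translate_b2a):
    bleu = BLEU()
    hyp_b = [translate_a2b(s) for s in src_a]   # forward translation D_B
    hyp_a = [translate_b2a(h) for h in hyp_b]   # back translation D'_A
    ft = bleu.corpus_score(hyp_b, [refs_b]).score   # FT-SCORE = M(D_B, D_B^ref)
    rtt = bleu.corpus_score(hyp_a, [src_a]).score   # RTT-SCORE = M(D'_A, D_A)
    return ft, rtt
```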

RTT Evaluation on Statistical Machine Translation
Previous analyses of the automatic evaluation scores from RTT and FT showed that they are negatively correlated. Such a long-established understanding started in the era of RMT (Huang, 1990), lasted through SMT (Koehn, 2005; Somers, 2005), and prevented the use of RTT for MT evaluation. We argue that the negative observations are probably due to the selected SMT models involving some reversible transformation rules, e.g., copying unrecognized tokens in translation. In the example illustrated in Figure 2, MT System 1 works worse than its competing System 2, as System 1 fails to translate 'reclassified' and 'Biotech'. Instead, it copies the words in the source language (En) directly to the target outputs.
During BT, System 1 manages to translate them back perfectly without any difficulty. For System 2, although translating 'Biotechnologie' (De) to 'Biotechnology' (En) is adequate, it is not rewarded by the original reference in this case. Consequently, the rankings of these two MT systems are flipped between their FT and RTT scores.
A previous error analysis study on SMT (Vilar et al., 2006) also mentioned that the unknown-word copy strategy is one of the major causes of translation errors. We therefore argue that reversible transformations like word copy could have introduced significant bias into the previous experiments on SMT (and RMT). We then conduct experiments to replicate the negative conclusion. Interestingly, removing the copying mechanism almost perfectly resolves the negative correlation in our experiments.

Experiments and Analysis
We compare RTT and FT on SMT following the protocols of Somers (2005) and Koehn (2005).
Moses (Koehn and Hoang, 2009) is utilized to train phrase-based MT systems (Koehn et al., 2003), which were popular in the SMT era. We train SMT systems on News-Commentary v8 (Tiedemann, 2012), as suggested by the WMT organizers (Koehn and Monz, 2006). We test our systems on six language pairs (de-en, en-de, cs-en, en-cs, fr-en and en-fr) in the competition track of the WMT Shared Tasks (Barrault et al., 2020). RTT-SCOREs and FT-SCOREs are calculated based on BLEU in this section. Then, we use Kendall's τ and Pearson's r to verify the correlation between RTT-SCOREs and FT-SCOREs (Kendall, 1938; Benesty et al., 2009). We provide more detailed settings in Appendix C.
During translation inference, we consider two settings for comparison: one drops the unknown words and the other copies these tokens to the outputs. Hence, we end up with two groups of six outputs from the various SMT systems.
In Table 1, we examine the relevance between RTT-SCOREs and FT-SCOREs on six SMT systems. The agreement is measured by Kendall's τ and Pearson's r. The correlation is essentially decided by the copying mechanism: it turns out to be much stronger for the systems with copying disabled than for the systems with default word copy. Now, we discuss the rationality of using RTT evaluation for NMT systems by comparing the reliance on the copying mechanism in NMT and SMT. For NMT, we choose MBART50-M2M (Tang et al., 2020), which covers cross-lingual translation for 50 languages. Output words that exactly match input words are considered copied, although the system may not intrinsically intend to copy them. In Table 2, we observe that the copying frequency is about twice as high in SMT as in NMT. Although NMT systems may copy some words during translation, most of these copies are unavoidable; e.g., we observe that most of them are proper nouns whose translations are actually the same words in the target language. In contrast, the copied words in SMT are more diverse, and many of them could be common nouns.
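As a rough sketch of how such a copy rate can be computed (whitespace tokenization and exact string matching are simplifying assumptions; the exact tokenization used in the paper is not specified here):

```python
# Count a target token as "copied" if it exactly matches some source token,
# then average the per-sentence copy percentage over the corpus (cf. Table 2).
def avg_copy_percentage(sources, outputs):
    ratios = []
    for src, out in zip(sources, outputs):
        src_tokens = set(src.split())
        out_tokens = out.split()
        if out_tokens:
            copied = sum(tok in src_tokens for tok in out_tokens)
            ratios.append(100.0 * copied / len(out_tokens))
    return sum(ratios) / len(ratios)
```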

Lang. Pair    Avg. Copy (%)
              SMT      NMT
de-en         17.39    9.28
en-de         21.47    9.54

Table 2: Comparison of word-copy frequency between SMT and NMT on two language pairs. We calculate the average percentage of copied (Avg. Copy) tokens per sentence. The details of the selected Moses system are reported in Appendix C.

Predicting FT-SCORE using RTT-SCORE
In this section, we validate whether FT-SCOREs can be predicted by RTT-SCOREs. We then examine the robustness of the predictor on unseen language pairs and transferred MT models.

Regression on RTT-SCORE
Here, we construct a linear regressor f to predict the FT-SCORE of a target translation metric M from the corresponding RTT-SCOREs,

    FT-SCORE^M_{A→B} ≈ f(S^{M*}_{A⟳B}, S^{M*}_{B⟳A})    (3)

where M* indicates that multiple metrics are used to construct the input features. We utilize RTT-SCOREs from both sides of a language pair as our primary setting, as using more features usually provides better prediction performance (Xia et al., 2020). We use a linear regressor for predicting FT-SCORE,

    f(S^{M*}_{A⟳B}, S^{M*}_{B⟳A}) = W_1 · S^{M*}_{A⟳B} + W_2 · S^{M*}_{B⟳A} + β    (4)

where S^{M*}_{A⟳B} and S^{M*}_{B⟳A} are the RTT-SCORE features used as inputs of the regressor, and W_1, W_2 and β are the parameters of the prediction model optimized by supervised training. In addition, when organizing a new shared task, say WMT, collecting a parallel corpus in a low-resource language could be challenging and resource-intensive. Hence, we investigate another setting that utilizes merely the monolingual corpus in language A or B to predict FT-SCORE,

    f′(S^{M*}_{A⟳B}) = W · S^{M*}_{A⟳B} + β    (5)

We will compare and discuss this setting in our experiments on WMT.
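Since Appendix E reports that the regressor is the default Scikit-Learn linear regression, Equations 3-5 can be sketched as follows; the exact feature layout (one column per metric in M* and per direction) is an assumption.

```python
# Sketch of the FT-SCORE predictors in Eqs. 3-5 with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_predictor(rtt_ab, rtt_ba, ft_scores):
    """rtt_ab, rtt_ba: (n_pairs, n_metrics) RTT-SCORE features for the
    A->B->A and B->A->B directions; ft_scores: (n_pairs,) targets."""
    X = np.hstack([rtt_ab, rtt_ba])   # [S_{A⟳B}; S_{B⟳A}] -> learns W1, W2, beta
    return LinearRegression().fit(X, ft_scores)

def fit_predictor_mono(rtt_ab, ft_scores):
    """Monolingual-only variant f' (Eq. 5), using one direction only."""
    return LinearRegression().fit(rtt_ab, ft_scores)
```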

Datasets
We conduct experiments on the large-scale multilingual benchmark FLORES-101 and the WMT machine translation shared tasks. FLORES-AE33 is used for training, and for testing on unseen languages and transferred MT systems. WMT is used for testing on real-world shared tasks in new domains.

FLORES-AE33.
We extract FLORES-AE33, which contains parallel data among 33 languages covering 1,056 (33 × 32) language pairs, from a curated subset of FLORES-101 (Goyal et al., 2022a). We select these languages based on two criteria: i) we rank languages by the scale of their bi-text corpora; ii) we prioritize the languages covered by WMT2020-News and WMT2020-Bio. As a result, FLORES-AE33 includes 7 high-resource languages, 16 medium-resource languages and 10 low-resource languages. We show the construction pipeline in Figure 3, with more details in Appendix A.
WMT. We collect corpora from the translation tracks to evaluate multiple MT systems on the same test sets. We consider their ranking based on FT-SCORE with metric M as the ground truth. We choose the competition tracks in the WMT 2020 Translation Shared Tasks (Barrault et al., 2020), namely the news track WMT2020-News and the biomedical track WMT2020-Bio. We consider news and bio as new domains, compared to our training data FLORES-101, whose contents are mostly from Wikipedia.

Neural Machine Translation Systems
We experiment with five MT systems that support most of the languages appearing in FLORES-AE33 and WMT. Besides MBART50-M2M, we adopt M2M-100-BASE and M2M-100-LARGE (Fan et al., 2021), which are proposed to conduct many-to-many MT without explicit pivot languages, supporting 100 languages. GOOGLE-TRANS (Wu et al., 2016; Bapna et al., 2022) is a commercial translation API, which has been considered a baseline translation system in many previous competitions (Barrault et al., 2020). Meanwhile, we also include a family of bilingual MT models, OPUS-MT (Tiedemann and Thottingal, 2020), sharing the same model architecture, MARIAN-NMT (Junczys-Dowmunt et al., 2018). We provide more details about these MT systems in Appendix C.

Automatic MT Evaluation Metrics
We consider BLEU (Papineni et al., 2002), spBLEU (Goyal et al., 2022b), chrF (Popović, 2015) and BERTScore (Zhang et al., 2020) as the primary automatic evaluation metrics (Freitag et al., 2020). All these metrics are used and tested both as input features and as target FT-SCOREs. The first two metrics are differentiated by their tokenizers: BLEU uses Moses (Koehn and Hoang, 2010) and spBLEU uses SentencePiece (Kudo and Richardson, 2018). Both evaluation metrics were officially used in the WMT21 Large-Scale Multilingual Machine Translation Shared Task (Wenzek et al., 2021). While BLEU works for most language tokenizations, spBLEU shows superior effectiveness across various language tokenizations, especially on low-resource languages (Goyal et al., 2022a). More details on these metrics are described in Appendix B.
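A sketch of computing the four metrics with their public toolkits, assuming sacrebleu ≥ 2.x (where the "flores101" tokenizer yields spBLEU) and the bert-score package:

```python
# Scoring a toy hypothesis/reference pair with the four metrics.
from sacrebleu.metrics import BLEU, CHRF
from bert_score import score as bert_score

hyps = ["the cat sat on the mat"]
refs = ["the cat sits on the mat"]

bleu = BLEU().corpus_score(hyps, [refs]).score                        # default 13a (Moses-style) tokenizer
spbleu = BLEU(tokenize="flores101").corpus_score(hyps, [refs]).score  # SentencePiece tokenizer
chrf = CHRF(char_order=6).corpus_score(hyps, [refs]).score            # character 6-grams
P, R, F1 = bert_score(hyps, refs, lang="en")                          # F1 is used by default
```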

Experiments and Analysis
Following our discussion on SMT in the last section, we conduct similar experiments using our new multilingual NMT systems on the Type I test set of FLORES-AE33. We observe a highly positive correlation between FT-SCOREs and RTT-SCOREs, measured by Pearson's r (Benesty et al., 2009). Please refer to Appendix G.1 for more details. Then, we train regressors on RTT-SCOREs and conduct experiments to examine their performance in various challenging settings.

Transferability of Regressors
We first investigate the transferability of our regressors from two different aspects: transferred MT systems and unseen language pairs. We also evaluate the regressor on different scales of language resources.
Settings. We train our regressors on the Type I train set based on the translation scores from MBART50-M2M. In order to assess system transferability, we test three models on the Type I test set.
In terms of language transferability, we consider FT-SCOREs of MBART50-M2M (an MT system seen in training) and M2M-100-BASE (an MT system unseen in training) on Type II and Type III in FLORES-AE33. We further evaluate the transferability of our regressor across language resources on the Type I test set, with two MT systems, MBART50-M2M and M2M-100-BASE.
Discussion. In Table 3, we present the performance of the regressor across various translation systems and evaluation metrics. We first analyze the results on MBART50-M2M, which is seen in training. The absolute errors between predicted scores and ground-truth FT-SCOREs are relatively small in terms of MAE and RMSE. Meanwhile, the correlation between prediction and ground truth is strong, with all Pearson's r values at or above 0.88. This indicates that the rankings of predicted scores are rational. The results of M2M-100-BASE and GOOGLE-TRANS demonstrate the performance of the predictors on unseen systems. Although the overall errors are higher than those of MBART50-M2M without system transfer, the Pearson's r scores remain at a competitive level, indicating a similar ranking capability on unseen systems. Meanwhile, our model obtains adequate language-transferability results, as demonstrated in Table 4.
In Table 5, we provide the detailed performance of our regressor on language pairs of different resource categories on FLORES-AE33, with RTT-SCOREs of MBART50-M2M and M2M-100-BASE respectively. Specifically, we split the language pairs into three categories, high, medium and low, based on Table 9. The evaluated regressor is the same as the one tested in Sections 4.3.1 and 4.3.2. The results show that our regressor is able to predict FT-SCOREs with small errors and to reflect the relative order among FT-SCOREs, with high transferability across language pairs and MT systems.

Predicting FT-SCOREs on WMT
Building on the high transferability of the regressors, we conduct experiments on WMT shared tasks, namely WMT2020-News, which includes 10 language pairs. In this experiment, we study spBLEU metric scores.
Settings. We involve five MT systems. We are aware of cases where collecting corpora in target languages for competitions might be significantly complex, which means only a monolingual corpus is available for evaluation. Thus, we train predictors f′ using single RTT-SCOREs, as in Equation 5. Note that this experiment covers several challenging settings, such as transferred MT systems, language transferability, single-source features, and transferred application domains. Another set of results, on WMT2020-Bio, can be found in Appendix G.4.

Discussion. In Table 6, we display the results on WMT2020-News. Although MAE and RMSE vary among experiments for different language pairs, the overall correlation scores are favorable.
Pearson's r values on all language pairs are above 0.5, showing strong ranking correlations. While the prediction performance on A ⟳ B shows some variance across language pairs, the results of the experiments using B ⟳ A are competitive with those using both A ⟳ B and B ⟳ A features, showing the feasibility of predicting FT-SCORE using monolingual data. We conclude that our regression-based predictors can be practical for ranking MT systems in WMT-style shared tasks.

RTT-SCOREs for Quality Estimation
In this section, we demonstrate that the features acquired by round-trip translation benefit quality estimation (QE) models.
Dataset. QE was first introduced in WMT11 (Callison-Burch et al., 2011), focusing on automatic methods for estimating the quality of machine translation output at run-time. The estimated quality should align with human judgment at the word and sentence level, without accessing a reference in the target language. In this experiment, we perform sentence-level QE, which aims to predict human direct assessment (DA) scores. We use the DA dataset collected from 2015 to 2021 by the WMT News Translation shared task coordinators. More details are provided in Appendix D.
Settings. First, we extract the RTT features RTT-BLEU, RTT-spBLEU, RTT-chrF and RTT-BERTScore. Then, we examine whether QE scores can be predicted from these RTT features using linear regression models. We train the regressors using Equation 5 with only A ⟳ B features. Finally, a combination of COMET-QE-DA scores and RTT-SCOREs is investigated to acquire a more competitive QE scorer.
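A sketch of the combined scorer under these settings; the feature extraction itself is assumed done, and a plain linear regression (as in Appendix E) fuses COMET-QE-DA with the RTT features:

```python
# Fit a linear combination of COMET-QE-DA and segment-level RTT features
# against human DA z-scores (the COMET-QE-DA + RTT-ALL setting, cf. Table 7).
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_qe_scorer(comet_qe, rtt_features, da_zscores):
    """comet_qe: (n,) scores; rtt_features: (n, k) columns of RTT-BLEU /
    RTT-spBLEU / RTT-chrF (optionally RTT-BERTScore); da_zscores: (n,) targets."""
    X = np.hstack([np.asarray(comet_qe).reshape(-1, 1), rtt_features])
    return LinearRegression().fit(X, da_zscores)
```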
Discussion. Both Kendall's τ and Pearson's r provide consistent results in Table 7. Models merely using RTT-SCOREs can be used to predict DA scores. We also observe that RTT-SCOREs further boost the performance of COMET-QE-DA. We believe that RTT-SCORE advances QE research and urges more investigation in this direction.
RTT Evaluation with Cross-system Verification

RTT evaluation has a potential vulnerability in shared tasks: an adversarial system that simply copies the source text can preserve a large portion of words inside the original context via RTT, while its FT-SCOREs remain low.
To mitigate this vulnerability, we first validate RTT evaluation on WMT2020-News with the A ⟳ B direction. One advantage of RTT is that multiple MT systems can be used to verify the performance of other systems by checking the N × N combinatorial RTT results from these N systems, a mechanism we coin X-Check. Finally, we demonstrate that the predicted automatic evaluation scores can be further improved via X-Check when adversaries are included.

Cross-system Validation for Competitions
Given N forward translation (FT) systems {F_i}, M back-translation (BT) systems {B_j}, and a regression model f predicting the target metric, we can estimate the translation quality of the i-th FT system paired with the j-th BT system as S_{i,j}, where S = {S_{i,j}}_{N×M}. The estimated translation quality of F_i is the average score over its round trips,

    Q_i = (1/M) Σ_{j=1}^{M} S_{i,j}

Note that the same number of FT and BT systems is considered for simplicity, i.e., N = M.
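A sketch of X-Check under the definitions above; predict_quality is a hypothetical helper that round-trips the source corpus through one FT and one BT system and applies the trained regressor:

```python
# X-Check: score every FT system against every BT system, then average
# each FT system's predicted quality over all BT systems.
import numpy as np

def x_check(ft_systems, bt_systems, src_corpus, predict_quality):
    n, m = len(ft_systems), len(bt_systems)
    S = np.zeros((n, m))
    for i, ft in enumerate(ft_systems):
        for j, bt in enumerate(bt_systems):
            S[i, j] = predict_quality(src_corpus, ft, bt)  # S_{i,j}
    return S.mean(axis=1)   # Q_i: one averaged estimate per FT system
```

A single adversarial FT system can fool its own BT counterpart, but averaging over many independent BT systems dilutes the benefit of reversible tricks such as copying.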

Experiments and Analysis
Settings. We conduct experiments on WMT2020-News similar to Section 4.3.2. We rank the system-level translation quality via the regressor trained on RTT-SCORE spBLEU. We challenge the evaluation paradigm by introducing adversarial MT systems, e.g., SMT with the copying mechanism. Specifically, we set up basic competition scenarios with 3-5 competitors in the shared task, and we consider different numbers of adversarial systems, namely i) no adversary; ii) one adversarial SMT with word copy; and iii) two adversarial SMT systems with word copy. We provide details of the two SMT systems in Appendix G.5. The experiments with adversarial systems are conducted on four language pairs, cs-en, de-en, en-cs and en-de, as the corresponding adversarial systems were trained in Section 3.3.
Discussion. From Table 8, we observe that the overall system ranking can be severely affected by the adversarial systems, according to Pearson's r and Kendall's τ. The adversarial systems are stealthy among normal competitors, according to Hit@K and Avg. Rank. X-Check successfully identifies these adversarial systems in all our experiments and manages to improve the correlation scores significantly. Based on this empirical study, we find that X-Check makes RTT evaluation more robust.

Conclusion
This paper revisits the problem of estimating FT quality using RTT scores. The negative results in previous literature are essentially caused by the heavy reliance on the copying mechanism in traditional statistical machine translation systems. We then conduct comprehensive experiments to show that the corrected understanding of RTT benefits several relevant MT evaluation tasks, such as predicting FT metrics using RTT scores, enhancing state-of-the-art QE systems, and filtering out unreliable MT competitors in WMT shared tasks. We believe our work will inspire future research on reference-free evaluation for low-resource machine translation.

Table 8: Results of the competition with 3 to 5 honest competitors, combined with additional adversarial competing systems (No Adversary; One adversarial SMT (X = 0.1) w/ copy; Two adversarial SMTs (X = 0.1 and X = 0.5) w/ copy). We measure the identifiability of the adversarial MT systems by Hit@K, where K is decided by the number of adversarial systems. We also report the average ranking (Avg. Rank) of the adversarial systems, and the correlation scores Kendall's τ and Pearson's r.

Limitations
There are several limitations to this work. First, while we have observed positive correlations between FT-SCOREs and RTT-SCOREs and conducted experiments to predict FT-SCOREs from RTT-SCOREs, their relationship could be complicated and non-linear. We encourage future research to investigate further RTT-SCORE features and more complex machine learning models for better prediction. Second, we have examined the prediction models on low-resource languages in FLORES-101, but have not tested very low-resource languages outside these 101 languages. We suggest auditing FT-SCORE prediction models on a small validation dataset for any new low-resource language in future applications. Third, our assessment has been systematic and thorough, utilizing datasets such as FLORES-101, WMT2020-News and WMT2020-Bio. Despite this, the nature of our study is constrained by the timeline of the data utilized. The WMT data we used is from 2020, opening up the possibility that more recently proposed metrics could potentially outperform the ones employed in this work.
A FLORES-AE33

We provide the statistics of all languages covered by FLORES-AE33, categorized by different scales of resource (high, medium and low) and usage purpose (Seen and Unseen), in Table 9. Scale is counted by the amount of bi-text data to English in FLORES-101 (Goyal et al., 2022a).
To construct FLORES-AE33, we partition the 33 languages into two sets: i) the languages utilized in training our models (Seen) and ii) the remaining languages, which are not used for training the predictors and are employed for test purposes only (Unseen). We include 20 languages in Seen, with 7 high-resource, 7 medium-resource and 6 low-resource. The remaining 13 languages fall into Unseen, with 9 medium-resource and 4 low-resource. Combining these two categories of languages, we obtain three types of language pairs in FLORES-AE33.
Type I contains pairs of Seen languages, for which a train set and a test set are collected and utilized independently. For each language pair, we collect 997 training samples and 1,012 test samples. The test set of Type II is more challenging than that of Type I: the language pairs in this set are composed of one language from the Seen set and the other from the Unseen set. Type III's test set is the most challenging one, as all its language pairs are derived from Unseen languages. Type II and Type III sets are designed for test purposes only, and they are not used for training predictors. Overall, the Type I, Type II and Type III sets contain 380, 520 and 156 language pairs, respectively.
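The mapping from the Seen/Unseen split to the three pair types can be sketched as follows (the language codes are placeholders, not the actual 33-language split):

```python
# Derive Type I/II/III language pairs from a Seen/Unseen partition.
from itertools import permutations

seen = {"en", "de", "fr"}    # stands in for the 20 Seen languages
unseen = {"ps", "km"}        # stands in for the 13 Unseen languages

def pair_type(src, tgt):
    if src in seen and tgt in seen:
        return "I"           # Seen -> Seen: used for train and test
    if src in unseen and tgt in unseen:
        return "III"         # Unseen -> Unseen: test only
    return "II"              # mixed: test only

pairs = [(a, b, pair_type(a, b)) for a, b in permutations(seen | unseen, 2)]
```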

C Machine Translation Systems
Moses SMT. We train five Moses (Koehn and Hoang, 2009) statistical machine translation systems using different phrase dictionaries, varying the phrase probability threshold from 0.00005 to 0.5. A higher threshold yields a smaller phrase table and hence a higher chance of the corresponding MT system encountering unknown words. In Table 2, we use Moses with a phrase probability threshold of 0.4 for SMT.
MBART50-M2M. MBART50-M2M (Tang et al., 2020) is a multilingual translation model with many-to-many encoders and decoders. The model is trained on 50 publicly available language corpora with English as a pivot language.
M2M-100-BASE & M2M-100-LARGE. These two models are among the first non-English-centric multilingual machine translation systems, trained on 100 languages ranging from high-resource to low-resource. Different from MBART50-M2M, M2M-100-BASE and M2M-100-LARGE (Fan et al., 2021) are trained on parallel multilingual corpora without an explicit centering language.
OPUS-MT. OPUS-MT (Tiedemann and Thottingal, 2020) is a collection of one-to-one machine translation models trained on the corresponding parallel data from OPUS, using MARIAN-NMT as the backbone (Junczys-Dowmunt et al., 2018). The collection of MT models supports 186 languages.

GOOGLE-TRANS. GOOGLE-TRANS (Wu et al., 2016; Bapna et al., 2022) is an online translation service provided by the Google Translation API, which supports 133 languages. The system is frequently involved as a baseline system in WMT shared tasks (Barrault et al., 2020).

D Quality Estimation Dataset
The direct assessment (DA) train set contains 33 diverse language pairs and a total of 574,186 tuples with source, hypothesis, reference and direct assessment z-score. We construct the test set by collecting DA scores on zh-en (82,692 segments) and en-de (65,045 segments) as two unseen language pairs.

E Implementation Details
Regressor. We use the linear regression model from Scikit-Learn with the default settings of the API.
Computational Resources and Time. In our experiments, we collect the translation results and compute their FT-SCOREs and RTT-SCOREs on multiple single-GPU servers with Nvidia A40 GPUs. Overall, it cost about three GPU-months to collect the translation results of all the aforementioned MT systems.

F Measurement
We evaluate the performance of our predictive models via the following measurements. Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, indicating the accuracy for continuous variables.
Root Mean Square Error (RMSE) also measures the average magnitude of the error; compared to MAE, RMSE gives relatively higher weight to larger errors.
Pearson's r correlation (Benesty et al., 2009) is officially used in WMT to evaluate the agreement between automatic evaluation metrics and human judgment, emphasizing translation consistency. In our paper, this metric evaluates the agreement between the predicted automatic evaluation scores and the ground truth.
Kendall's τ correlation (Kendall, 1938) is another metric to evaluate the ordinal association between two measured quantities. Note that our experiments go beyond English-centric evaluation, as all languages are permuted and equally considered.
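All four measurements are standard; a compact sketch with numpy and scipy:

```python
# MAE, RMSE, Pearson's r and Kendall's tau between predictions and gold.
import numpy as np
from scipy.stats import kendalltau, pearsonr

def evaluate_predictions(pred, gold):
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    mae = np.abs(pred - gold).mean()
    rmse = np.sqrt(((pred - gold) ** 2).mean())
    r, _ = pearsonr(pred, gold)      # linear agreement
    tau, _ = kendalltau(pred, gold)  # ordinal (ranking) agreement
    return {"MAE": mae, "RMSE": rmse, "Pearson r": r, "Kendall tau": tau}
```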

G Supplementary Experiments
G.1 Correlations between RTT-SCOREs and FT-SCOREs

Discussion. The overall correlation scores are reported in Table 10. Our results indicate at least moderately positive correlations between all pairs of RTT-SCOREs and FT-SCOREs. Moreover, we observe that RTT-SCORE^M_{B⟳A} is generally more correlated with FT-SCORE than RTT-SCORE^M_{A⟳B}, leading to strongly positive correlation scores. We attribute this advantage to the fact that T_{A→B} serves as the last translation step in RTT-SCORE^M_{B⟳A}. We visualize more detailed results of the correlation between FT-SCOREs and RTT-SCOREs on Type I language pairs of FLORES-101 in Figures 4 and 5.

Table 11: The results of using auxiliary features in addition to spBLEU for training predictors. We test the performance of MBART50-M2M and M2M-100-BASE across language pairs in Type I, Type II and Type III of FLORES-AE33.

G.2 Improve Prediction Performance Using More Features
Settings. We introduce two extra features, MAX-4 COUNT and REF LENGTH ("counts" and "ref_len" in https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/bleu.py), to enhance the prediction of spBLEU. MAX-4 COUNT is the count of correct 4-grams and REF LENGTH is the cumulative reference length. We follow a procedure similar to RQ2, using the same measurements to evaluate the predictor performance on MBART50-M2M and M2M-100-BASE across the three types of test sets in FLORES-AE33.
Results. Table 11 shows the results of the models with additional features. Both features consistently improve our basic models, and the performance can be further boosted by incorporating both. We believe that more carefully designed features and regression models could further boost the performance of our predictors.

G.3 WMT2020-News with Synthetic Competitors
We increase the scale of competitors in WMT2020-News by introducing pseudo-competitors.
To mimic the number of entries in a conventional WMT task, we create 17 forward translation systems by randomly dropping 0% to 80% (with a step of 5%) of the tokens from the outputs of GOOGLE-TRANS. Then, we utilize the vanilla GOOGLE-TRANS to translate these synthetic forward translation results back into the source language. We conduct experiments on de-fr, en-ta and zh-en, representing non-En to non-En, En to non-En and non-En to En language pairs. The results in Table 12 demonstrate the predictors' performance in ranking the pseudo-competitors on WMT2020-News based on spBLEU features. The overall ranking errors on the 17 MT systems are small on all three selected language pairs.
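A sketch of the pseudo-competitor construction (the random seed and whitespace tokenization are assumptions):

```python
# Build 17 degraded FT systems by dropping tokens at rates 0%..80%
# (step 5%) from a strong system's outputs.
import random

def make_pseudo_systems(outputs, seed=0):
    rng = random.Random(seed)
    systems = []
    for rate in [i / 100 for i in range(0, 81, 5)]:   # 17 drop rates
        degraded = [" ".join(t for t in sent.split() if rng.random() >= rate)
                    for sent in outputs]
        systems.append(degraded)
    return systems
```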

G.4 Ranking Experiments on WMT2020-Bio
We display the experimental results on WMT2020-Bio in Table 13. The overall performance is positive, though relatively worse than the WMT2020-News results reported in Table 6. We attribute this to the fact that the metrics M used on WMT2020-Bio are calculated on documents, while our regression models rely on sentence-level translation metrics in training. This large difference in text granularity may result in a distribution shift.
SMT (X = 0.1). We train the SMT system on News-Commentary v8 with a maximum phrase length of 4 and a phrase-table probability threshold of 0.1.

Figure 2: A comparison of the forward translation (FT) and round-trip translation (RTT) performance of two translation systems. System 1 and System 2 are based on Statistical Machine Translation (SMT) and Neural Machine Translation (NMT), respectively. The conflicting conclusions drawn from the FT scores (System 1 < System 2) and the RTT scores (System 1 > System 2) are attributed to the translation of the underlined words, 'reclassified' and 'Biotech'.

Figure 3: The 33 languages in FLORES-AE33 are separated into two categories: Seen includes the languages used in both training and testing, and Unseen is composed of the languages only used in testing. Seen contains 7 High-resource (H.), 7 Medium-resource (M.) and 6 Low-resource (L.) languages, while Unseen involves 9 Medium-resource (M.) and 4 Low-resource (L.) languages. These two sets are used to construct three types of language pairs for testing. Type I and Type III target translation among Seen and Unseen language pairs, respectively. Type II targets translation between Seen and Unseen. A test setting with more Unseen languages is usually more challenging, i.e., Type I < Type II < Type III.

Figure 4: The first row shows the correlations between RTT-SCORE^M_{A⟳B} and FT-SCORE^M_{A→B} on MBART50-M2M using (a) BLEU, (b) spBLEU, (c) chrF and (d) BERTScore. The second row shows the correlations between RTT-SCORE^M_{B⟳A} and FT-SCORE^M_{A→B} on MBART50-M2M using (e) BLEU, (f) spBLEU, (g) chrF and (h) BERTScore. All experiments are reported with overall Pearson's r.

Figure 5: The first row shows the correlations between RTT-SCORE^M_{A⟳B} and FT-SCORE^M_{A→B} on M2M-100-BASE using (a) BLEU, (b) spBLEU, (c) chrF and (d) BERTScore. The second row shows the correlations between RTT-SCORE^M_{B⟳A} and FT-SCORE^M_{A→B} on M2M-100-BASE using (e) BLEU, (f) spBLEU, (g) chrF and (h) BERTScore. All experiments are reported with overall Pearson's r.

Table 3: The results of predicted FT-SCOREs of MBART50-M2M, M2M-100-BASE and GOOGLE-TRANS on the Type I test set based on different translation evaluation metrics (Trans. Metric). Note that MAE: Mean Absolute Error, RMSE: Root Mean Square Error, P. r: Pearson's r.

Table 4: The results of predicted FT-SCOREs of MBART50-M2M (a seen MT system) and M2M-100-BASE (an unseen MT system) on Type II and Type III (with unseen languages) test sets based on different translation evaluation metrics (Trans. Metric).

Table 5: The results of predicted FT-SCOREs of MBART50-M2M and M2M-100-BASE on nine sets of language pairs, categorized by different scales of resources: High (H.), Medium (M.) and Low (L.). The three categories in rows are source languages, and those in columns are target languages. We report Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Pearson's r.

Table 6: The results of our predictors on ranking the selected MT systems on the WMT2020-News shared task.

Table 7: Comparisons of RTT-SCOREs for QE. RTT-ALL refers to the combination of RTT-BLEU, RTT-spBLEU and RTT-chrF. COMET-QE-DA + RTT-ALL incorporates both COMET-QE-DA and all RTT-SCOREs.

Table 9: The statistics of FLORES-AE33. 20 languages are used in both training and testing (Seen); the other 13 languages are used in testing only (Unseen).

The backbone pre-trained language model for BERTScore is the toolkit's default, as it is reported to have a satisfactory correlation with human evaluation in WMT16. While BLEU, spBLEU and chrF are string-based metrics, BERTScore is model-based. These metrics are selected on the basis that they should directly reflect translation quality. We calculate the scores via the open-source toolboxes EASYNMT, SACREBLEU-TOOLKIT and BERTSCORE. We use word-level 4-grams for BLEU and spBLEU, character-level 6-grams for chrF, and the F1 score for BERTScore by default.

Table 10: Pearson's r between FT-SCORE^M_{A→B} and RTT-SCORE^M (both A ⟳ B and B ⟳ A) using different automatic evaluation metrics M on the Type I test set of FLORES-AE33.

Table 12: Results of prediction and ranking on the translation quality of WMT2020-News synthetic data for three language pairs.