UScore: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation

The vast majority of evaluation metrics for machine translation are supervised, i.e., (i) are trained on human scores, (ii) assume the existence of reference translations, or (iii) leverage parallel data. This hinders their applicability to cases where such supervision signals are not available. In this work, we develop fully unsupervised evaluation metrics. To do so, we leverage similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems. In particular, we use an unsupervised evaluation metric to mine pseudo-parallel data, which we use to remap deficient underlying vector spaces (in an iterative manner) and to induce an unsupervised MT system, which then provides pseudo-references as an additional component in the metric. Finally, we also induce unsupervised multilingual sentence embeddings from pseudo-parallel data. We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets. We make our code publicly available.¹


Introduction
Evaluation metrics are essential for judging progress in natural language generation (NLG) tasks such as machine translation (MT) and summarization, as they identify the state of the art in a key NLP technology. Despite their wide dissemination, classical lexical overlap evaluation metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) have difficulties judging the quality of modern NLG systems (Mathur et al., 2020a; Marie et al., 2021), necessitating novel metrics that correlate better with humans. Lately, this has been a very active research area (Zhang et al., 2020; Zhao et al., 2019, 2022; Colombo et al., 2021; Yuan et al., 2021).²

¹github.com/potamides/unsupervised-metrics
²Of course, the search for high-quality metrics dates back at least to the invention of BLEU and its predecessors.

Figure 1: Relationship between metrics m, vector spaces, parallel data, and MT systems: Metrics build on (potentially deficient) multilingual vector spaces (a), and can be used to mine (pseudo-)parallel sentences (b), which in turn can be used to improve deficient vector spaces (c). (Pseudo-)parallel data can also be used to train MT systems (d), which can generate pseudo-references (e). Conversely, metrics can also be optimization criteria for MT systems, which in turn can generate additional pseudo-parallel data through translation (f & g; not explored in this work).

Recently, more and more supervised metrics are being proposed. E.g., BLEURT (Sellam et al., 2020) trains on human-annotated datasets ranging from 5k to 150k pairs, the COMET (Rei et al., 2020) models regress on 12k-370k data points, and UniTE (Wan et al., 2022), before fine-tuning on the same data as COMET, pre-trains on 5m-10m parallel sentences. Of course, training on larger amounts of data leads to better metrics (measured on in-domain data), but it also increases the risk of learning biases from the data (Poliak et al., 2018) and limits the applicability to domains and language pairs where supervision is available. Here, we go the opposite route and try to minimize the amount of supervision as much as possible.
We classify existing metrics making use of different types of supervision as follows (cf. Table 1). TYPE-1 metrics are trained on human assessments such as Direct Assessment (DA) or Post-Editing (PE) scores, and compare system outputs to either human references (reference-based; Sellam et al., 2020; Rei et al., 2020) or directly to source texts (reference-free; Ranasinghe et al., 2021). TYPE-2 metrics, by comparison, do not use human assessments for training but still require human references, i.e., are untrained and reference-based (Yuan et al., 2021; Zhao et al., 2019; Zhang et al., 2020). Finally, TYPE-3 metrics are untrained (unlike TYPE-1) and reference-free (unlike TYPE-2), i.e., do not use supervision as in TYPE-1 or 2. However, to work well, they still rely on parallel data (Zhao et al., 2020; Song et al., 2021), which is considered a form of supervision, e.g., in the MT community (Artetxe et al., 2018; Lample et al., 2018). In contrast, we aim for fully unsupervised evaluation metrics (for MT) that do not use any form of supervision (cf. Table 1). In addition, subject to the constraint that no supervision is allowed, our metrics should be of maximally high quality, i.e., correlation with human assessments. We have two use cases in mind: (a) Such sample efficiency³ is a prerequisite for the wide applicability of the metrics. This is especially important when we want to overcome the current English-centricity (Anastasopoulos and Neubig, 2020) of MT systems and evaluation metrics and also cover low-resource languages like Nepali or Sinhala (Fomicheva et al., 2021) and low-resource pairs like Yoruba-German.⁴ (b) Our fully unsupervised evaluation metrics should be considered strong lower bounds for any future work that uses (mild) forms of supervision for metric induction, i.e., we want to push the lower bounds for newly developed TYPE-k metrics.
To achieve our goals, we employ self-learning (He et al., 2020; Wei et al., 2021) and, in particular, leverage the following dualities to make our metrics maximally effective, cf. Figure 1:

³We use the term sample efficiency in a generalized sense to denote the amount of supervision required.
⁴Neither Yoruba (a language spoken in Nigeria) nor German are classical low-resource languages. For German, this is clear, and Yoruba is even included in mBERT, i.e., belongs to the languages with the 100+ largest Wikipedias. Nonetheless, from our own experience, we find it inherently difficult to obtain high-quality annotations for the language pair, as a result of few competent parallel speakers as well as technical difficulties (e.g., lack of adequate compute infrastructure in Nigeria).
(1) Evaluation metrics and NLG systems are closely related; e.g., a metric can be an optimization criterion for an NLG system (Böhm et al., 2019), and a system can conversely generate pseudo-references (among other things) from which to improve a metric. (2) Evaluation metrics and parallel corpus mining (Artetxe and Schwenk, 2019) are closely related; e.g., a metric can be used to mine parallel data, which in turn can be used to improve the metric (Zhao et al., 2020), e.g., by remapping deficient embedding spaces.
Our contributions are: (i) We show that effective unsupervised evaluation metrics can be obtained by exploiting relationships with parallel corpus mining approaches and MT system induction; (ii) to do so, we explore ways to (a) make parallel corpus mining efficient (e.g., overcome cubic runtime complexity) and (b) induce unsupervised multilingual sentence embeddings from pseudo-parallel data; (iii) we show that pseudo-parallel data can rectify deficient vector spaces such as mBERT; (iv) we show that our metrics beat three state-of-the-art supervised metrics on four of five datasets we evaluate on.

XMoverScore
Central to XMoverScore is the use of Word Mover's Distance (WMD) as a similarity measure between two sentences (Zhao et al., 2020). WMD and further enhancements are discussed below.
WMD WMD is a distance function that compares sentences at the token level (Kusner et al., 2015), leveraging word embeddings which in XMoverScore's case come from mBERT (Devlin et al., 2019). From a source sentence $x$ and an MT hypothesis $y$, WMD constructs a distance matrix $\mathbf{C} \in \mathbb{R}^{|x| \times |y|}$, where $\mathbf{C}_{ij}$ is the distance between two word embeddings, $\mathbf{C}_{ij} = \lVert E(x_i) - E(y_j) \rVert$; $i, j$ index respective words in $x, y$. WMD uses this distance matrix to compute the similarity of the two sentences. This can be defined as the linear programming problem

$$\mathrm{WMD}(x, y) = \min_{\mathbf{F} \geq 0} \sum_{i,j} \mathbf{C}_{ij} \mathbf{F}_{ij}$$

where $\mathbf{F} \in \mathbb{R}^{|x| \times |y|}$ is an alignment matrix with $\mathbf{F}_{ij}$ denoting how much of word $x_i$ travels to word $y_j$. Additional constraints prevent it from becoming a zero matrix.
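For concreteness, the following is a minimal sketch of this computation using the POT (Python Optimal Transport) library, assuming token embeddings have already been extracted (e.g., from mBERT); the uniform word weights are a simplifying assumption, not necessarily XMoverScore's exact weighting:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wmd(src_emb: np.ndarray, hyp_emb: np.ndarray) -> float:
    """Word Mover's Distance between two sentences, given their
    token embeddings of shape [num_tokens, dim]."""
    # Cost matrix C_ij = Euclidean distance between embeddings.
    C = ot.dist(src_emb, hyp_emb, metric="euclidean")
    # Uniform word-frequency distributions (a simplifying assumption).
    a = np.full(len(src_emb), 1.0 / len(src_emb))
    b = np.full(len(hyp_emb), 1.0 / len(hyp_emb))
    # Solve the transportation problem; ot.emd2 returns the optimal cost.
    return ot.emd2(a, b, C)
```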
Vector space remapping Zhao et al. (2020), like earlier and subsequent work (Cao et al., 2020; Schuster et al., 2019), argue that the monolingual subspaces of mBERT are not well aligned. As a remedy, they investigate linear projection methods that post-hoc improve cross-lingual alignments.
We refer to this approach as vector space remapping. XMoverScore explores two different remapping approaches, CLP and UMD. Both leverage sentence-level parallel data from which they extract word-level alignments using fast-align, which are then used for remapping. We give more details in Appendix A.
Language Model XMoverScore linearly combines WMD with the perplexity of a GPT-2 language model (Radford et al., 2019). Allegedly, this penalizes ungrammatical translations. This updates XMoverScore's scoring function to

$$\mathrm{XMoverScore}(x, y) = w_{\mathrm{xlng}} \cdot \mathrm{WMD}(x, y) + w_{\mathrm{lm}} \cdot \mathrm{LM}(y)$$

Here, $w_{\mathrm{xlng}}$ and $w_{\mathrm{lm}}$ are weights for the cross-lingual WMD and LM components of XMoverScore.
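A sketch of such an LM component, using Hugging Face transformers to score a hypothesis with GPT-2; negating the loss (so that higher means more fluent) is our assumption about the sign convention, not necessarily XMoverScore's exact scoring:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def lm_score(hypothesis: str) -> float:
    """Fluency score of a hypothesis under GPT-2: the negated mean
    token negative log-likelihood (higher = more fluent)."""
    ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    # Passing the inputs as labels makes the model return the mean NLL.
    loss = model(ids, labels=ids).loss
    return -loss.item()
```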

DistilScore
Reimers and Gurevych (2020) show that the cosine similarity between multilingual sentence embeddings captures semantic similarity and can be used to assess cross-lingual semantic textual similarity. Their approach to inducing embedding models is based on multilingual knowledge distillation. We refer to this metric as DistilScore. Their approach requires supervision at multiple levels: first, parallel sentences are needed to induce multilingual models, and second, NLI and STS corpora are required to induce teacher embeddings in the source language.

SentSim
A key difference between XMoverScore and DistilScore is that the former is based on word embeddings and the latter on sentence embeddings. Song et al. (2021) and Kaster et al. (2021) show that combining approaches based on word-level and sentence-level representations can substantially improve metrics. The metric of Song et al. (2021), which is called SentSim, combines supervised DistilScore with one of two word embedding-based metrics. The first one is quite similar to XMoverScore, as it is also based on WMD. The other one is a multilingual variant of BERTScore (Zhang et al., 2020).

Methods
In this section, we introduce our fully unsupervised metric UScore. UScore builds upon the existing metrics XMoverScore, DistilScore, and SentSim, but eliminates all supervision signals and instead leverages the dualities shown in Figure 1. In particular, we mine pseudo-parallel data with unsupervised metrics, which we use (iteratively) to (a) rectify deficient vector spaces (for XMoverScore) and (b) train unsupervised MT systems which can generate pseudo-references (as pseudo-references are in the same language as the hypothesis, this eliminates problems of cross-lingual deficiency). Furthermore, we use pseudo-parallel data to (c) induce an unsupervised sentence embedding model analogous to DistilScore, which we can then (d) integrate with the unsupervised word-based model, analogous to SentSim. We now give details.

UScore wrd
XMoverScore uses sentence-parallel data to extract word pairs for vector space remapping. (i) We replace this parallel data with pseudo-parallel data.⁵ (ii) In addition, we use pseudo-references to address the issue of deficient vector spaces. We now give details on (i) and (ii) below.

Efficient WMD Pseudo-Parallel Data Mining
Metrics such as XMoverScore could in principle be used for pseudo-parallel corpus mining, since they can compare arbitrary sentences. However, when WMD-based metrics are scaled to corpus mining, algorithmic efficiency problems arise: (a) the computational complexity of WMD scales cubically with sentence length (Kusner et al., 2015); (b) comparing $n$ source to $n$ target sentences requires $n^2$ WMD invocations, which quickly becomes intractable. Thus, we explore ways to improve the performance of WMD to mine efficiently. In particular, Kusner et al. (2015) define a linear approximation of WMD called word centroid distance (WCD) and a mining algorithm that first sorts all target samples according to their WCD to a given query and then computes exact WMD only for the nearest neighbors. We use this algorithm for efficient WMD-based pseudo-parallel data mining.

⁵To extract the word pairs from the sentence-parallel data, XMoverScore uses fast-align (Dyer et al., 2013), but since this depends directly on how well sentences are aligned, we first replace it with unsupervised awesome-align (Dou and Neubig, 2021), which only relies on pre-trained language models.
In our work, we apply this approach iteratively (cf. Figure 1): we start out with an initial WMD metric (based on mBERT), obtain sentence-level pseudo-parallel data with it via the efficient approximation algorithm described above, and extract a dictionary of word pairs from the pseudo-parallel sentences using unsupervised awesome-align. We use these pseudo-parallel word pairs with UMD and CLP to remap mBERT. From this, we obtain a better WMD metric; then we iterate.
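A sketch of the WCD prefetch-and-prune mining step described above; the helper names and the value of k are illustrative, not the paper's exact implementation:

```python
import numpy as np

def mine_pseudo_parallel(src_embs, tgt_embs, wmd_fn, k=20):
    """Approximate WMD-based mining. src_embs/tgt_embs: lists of
    [num_tokens, dim] token-embedding arrays; wmd_fn: exact WMD
    between two such arrays; k: number of WCD nearest neighbors
    to re-rank with exact WMD."""
    # Word centroid distance operates on mean token embeddings,
    # a cheap proxy for WMD.
    src_centroids = np.stack([e.mean(axis=0) for e in src_embs])
    tgt_centroids = np.stack([e.mean(axis=0) for e in tgt_embs])

    pairs = []
    for i, c in enumerate(src_centroids):
        wcd = np.linalg.norm(tgt_centroids - c, axis=1)
        candidates = np.argsort(wcd)[:k]            # cheap pre-filter
        scores = [wmd_fn(src_embs[i], tgt_embs[j]) for j in candidates]
        best = candidates[int(np.argmin(scores))]   # exact WMD re-rank
        pairs.append((i, int(best), min(scores)))
    return pairs
```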
Pseudo References Apart from remapping, pseudo-parallel data could be used to overcome problems of deficient vector spaces in other ways. Specifically, we want to mine enough pseudo-parallel data to train an unsupervised MT system to translate source sentences into the target language to create pseudo references (Albrecht and Hwa, 2007; Gao et al., 2020; Fomicheva et al., 2020b). This would allow for a comparison with the hypothesis in the target language, similar to reference-based metrics, circumventing alignment problems in multilingual embeddings. This approach updates UScore_wrd to

$$\mathrm{UScore}_{\mathrm{wrd}}(x, y, y') = w_{\mathrm{xlng}} \cdot \mathrm{WMD}^{(k)}(x, y) + w_{\mathrm{lm}} \cdot \mathrm{LM}(y) + w_{\mathrm{pseudo}} \cdot \mathrm{WMD}(y', y)$$

where $k$ denotes the number of remapping iterations, and $w_{\mathrm{pseudo}}$ is a new weight to control the influence of the pseudo reference $y'$. All components of UScore_wrd are illustrated in Figure 2.
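Putting the components together, a minimal sketch of the resulting scoring function; treating the WMD terms as similarity-style scores (higher = better) is our assumption about the sign convention:

```python
def uscore_wrd(src, hyp, pseudo_ref, wmd_sim, lm_score,
               w_xlng=0.5, w_lm=0.1, w_pseudo=0.4):
    """Combine cross-lingual WMD, LM fluency, and pseudo-reference
    WMD into one score. wmd_sim and lm_score are callables that
    return similarity-style scores (higher = better)."""
    return (w_xlng * wmd_sim(src, hyp)               # source vs. hypothesis
            + w_lm * lm_score(hyp)                   # target-side fluency
            + w_pseudo * wmd_sim(pseudo_ref, hyp))   # pseudo ref vs. hypothesis
```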

UScore snt
Besides a word-based metric, we use pseudo-parallel data to induce an unsupervised sentence-level metric, $\mathrm{UScore}_{\mathrm{snt}}(x, y) = \cos(s_x, s_y)$, based on the cosine similarity between sentence embeddings $s_x$ and $s_y$. One could, similarly to DistilScore, perform knowledge distillation, but since our initial experiments showed that this does not work well with pseudo-parallel data, we chose another approach.

Contrastive Learning
We explore contrastive learning for unsupervised multilingual sentence embedding induction, which has recently been used successfully to train unsupervised monolingual sentence embeddings (Gao et al., 2021). In our context, the basic idea is to pull semantically close sentences together and to push distant sentences apart in the embedding space. Let $h_i$ and $h_i^+$ be the embeddings of two sentences that are semantically related and $N$ an arbitrary batch size. The training objective for this pair can be formulated as

$$\ell_i = -\log \frac{e^{\cos(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\cos(h_i, h_j^+)/\tau}}$$

where $\tau$ is a temperature hyperparameter that can be used to either amplify or dampen the assessed distances. For each sentence $h_i$, all remaining sentences $h_j^+$ with $j \neq i$ in the current batch should be pushed apart in the embedding space. For positive sentences that should be pulled together, we again use pseudo-parallel sentence pairs. Since noisy data is beneficial for contrastive learning (Gao et al., 2021), we expect this paradigm to work well with pseudo-parallel data. We use pooled XLM-R embeddings as sentence representations, and, as with unsupervised remapping, we experiment with multiple iterations of successive mining and sentence embedding induction operations.
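A minimal PyTorch sketch of this in-batch contrastive objective (SimCSE-style; a sketch under our assumptions, not the exact training code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h: torch.Tensor, h_pos: torch.Tensor,
                     tau: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of pseudo-parallel pairs.
    h, h_pos: [batch, dim] embeddings; row i of h_pos is the
    positive for row i of h, all other rows act as negatives."""
    # sim[i, j] = cos(h_i, h_j^+) / tau
    sim = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau
    # The positives sit on the diagonal of the similarity matrix.
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(sim, labels)
```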

Ratio Margin Pseudo-Parallel Data Mining
As UScore_snt is based on sentence embeddings, we cannot use the WMD-based mining algorithm to obtain pseudo-parallel sentences, since it requires access to word-level representations. An alternative would be to just use cosine similarity for mining, but that approach is susceptible to noise in the data (Artetxe and Schwenk, 2019). Instead, we follow Artetxe and Schwenk (2019) and use a ratio margin function defined as

$$\mathrm{ratio}(x, y) = \frac{\cos(x, y)}{\sum_{z \in \mathrm{NN}_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in \mathrm{NN}_k(y)} \frac{\cos(y, z)}{2k}}$$

where $\mathrm{NN}_k(x)$ and $\mathrm{NN}_k(y)$ are the $k$ nearest neighbors of sentence embeddings $x$ and $y$ in the respective other language. Informally, this ratio margin function divides the cosine similarity of the nearest neighbor by the average similarities of the neighborhood.
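A sketch of this scoring function; in practice, nearest neighbors would be retrieved with an approximate-search index (e.g., FAISS), which we omit here:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ratio_margin(x: np.ndarray, y: np.ndarray,
                 nn_x: np.ndarray, nn_y: np.ndarray) -> float:
    """Ratio margin score. x, y: candidate sentence embeddings;
    nn_x, nn_y: [k, dim] arrays holding the k nearest neighbors
    of x (resp. y) in the other language."""
    k = len(nn_x)
    # Average similarity within the two neighborhoods.
    denom = (sum(cos(x, z) for z in nn_x) / (2 * k)
             + sum(cos(y, z) for z in nn_y) / (2 * k))
    return cos(x, y) / denom
```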

UScore wrd ⊕ snt
Inspired by SentSim, which combines word and sentence embeddings, we similarly ensemble UScore_wrd and UScore_snt. We refer to this final metric as UScore = UScore_wrd⊕snt, with two new weights $w_{\mathrm{wrd}}$ and $w_{\mathrm{snt}} = 1 - w_{\mathrm{wrd}}$:

$$\mathrm{UScore}(x, y, y') = w_{\mathrm{wrd}} \cdot \mathrm{UScore}_{\mathrm{wrd}}(x, y, y') + w_{\mathrm{snt}} \cdot \mathrm{UScore}_{\mathrm{snt}}(x, y)$$

Experiments
In this section, we evaluate all UScore variants at the segment level⁶ and compare them to TYPE-1/2/3 upper bounds. We detail additional hyperparameters in Appendix D.

Datasets
We use various datasets to assess the performance of our metrics on MT evaluation, i.e., computing the correlation with human assessments using Pearson's r, and on parallel sentence matching, a standard evaluation measure in the corpus mining field, where a set of shuffled parallel sentences is searched to recover correct translation pairs (Guo et al., 2018; Kvapilíková et al., 2020). For the latter, we report precision at N (P@N).
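To make the matching evaluation concrete, a small sketch of P@N given a full score matrix (the helper is ours, not from the paper's code):

```python
import numpy as np

def precision_at_n(scores: np.ndarray, n: int = 1) -> float:
    """P@N for parallel sentence matching. scores[i, j] is the
    metric score of source i against target j; target i is assumed
    to be the true translation of source i (shuffled data is
    re-indexed accordingly)."""
    # Indices of the top-n scoring targets per source sentence.
    top_n = np.argsort(-scores, axis=1)[:, :n]
    hits = sum(i in top_n[i] for i in range(scores.shape[0]))
    return hits / scores.shape[0]
```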

MT evaluation
In WMT-16 and WMT-17, each language pair consists of tuples of source sentences, hypotheses, and references. Each tuple was annotated with a direct assessment (DA) score, which quantifies the adequacy of the hypothesis given the reference translation. Following Zhao et al. (2020) and Song et al. (2021), we use these DA scores to assess the adequacy of the hypothesis given the source. MLQE-PE has been used in the WMT 2020 Shared Task on Quality Estimation (Specia et al., 2020) and only provides source sentences and hypotheses for its language pairs, with no references. Each source sentence and hypothesis pair was annotated with cross-lingual direct assessment (CLDA) scores. In terms of annotation, Eval4NLP is very similar to MLQE-PE but focuses on non-English-centric language directions, especially de-zh and ru-de. WMT-MQM uses fine-grained error annotations from the Multidimensional Quality Metrics (MQM) framework.

⁶We do not evaluate at the system level since metrics there often perform very similarly, making it difficult to determine the best metric (Mathur et al., 2020b; Freitag et al., 2021b).
Parallel sentence matching To evaluate on parallel sentence matching, we use the News Commentary⁷ dataset. It consists of parallel sentences crawled from economic and political commentary.

⁷data.statmt.org/news-commentary

Fine-grained analysis on de-en
To gain an understanding of the properties of the iterative techniques and the influence of individual parameters and components, we conduct a fine-grained analysis on the de-en language direction of WMT-16 (for MT evaluation) and News Commentary v15 (for parallel sentence matching). We list examples of pseudo-parallel data used during training in Table 3 in the appendix. The mined sentences are often semantically similar, but contain factuality errors (e.g., have wrong places or numbers in hypotheses).

Vector space remapping
We explore whether remapping works with pseudo-parallel data. We use News Crawl for mining. We randomly extract 40k monolingual sentences per language and select the top 5% of sentence pairs with the highest metric scores for remapping. This gives us the same number of sentences (2k pairs) as were used for remapping XMoverScore.
The results for UMD and CLP-based remapping on de-en can be seen in Figure 3 (top). The figure contains two graphs, one for correlation with human judgments and one for precision on parallel sentence matching. Each graph illustrates model performance before remapping (Iteration 0) and after remapping one to five times. After remapping once, both UMD and CLP improve substantially in Pearson's r correlation. The improvement of CLP, however, is noticeably larger. For subsequent iterations, UMD seems to continue to improve slightly, but the correlations of CLP seem to drop. This can be explained by the results for precision, where the P@1 of CLP drops each iteration, meaning the remapping capabilities of the metric decrease. UMD does not exhibit this problem. Thus, UMD could be a more robust choice for metrics that should perform reasonably well on both tasks.

Pseudo References & Language Model
Next, we add a language model to the metric and investigate pseudo-parallel corpus mining to train an MT system for pseudo references. Tran et al. (2020) show that fine-tuning mBART on pseudo-parallel data leads to very promising results, so we use mBART for our own experiments as well. Since fine-tuning for MT is a very resource-intensive undertaking requiring many parallel sentence pairs (Barrault et al., 2020), especially compared to our vector space remapping experiments, we need considerably more training data. On average, Tran et al. (2020) use around 200k pseudo-parallel sentence pairs for training. To obtain the same amount with our extraction rate of 5%, we now use a pool of 4m sentences per language for mining. Our results on the de-en data of WMT-16 are reported in Figure 4, which is similar to an ablation study. On the x-axis, we vary the weight $w_{\mathrm{pseudo}} \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$ for UScore_wrd with pseudo references, and on the y-axis, we explore different weights $w_{\mathrm{lm}} \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$ for the language model. We set $w_{\mathrm{xlng}} = 1 - w_{\mathrm{pseudo}} - w_{\mathrm{lm}}$. The best correlation uses $w_{\mathrm{pseudo}} = 0.4$, $w_{\mathrm{lm}} = 0.1$, and $w_{\mathrm{xlng}} = 0.5$. The improvement when pseudo references and a language model are included is substantial (over only using WMD): e.g., we improve from 28% correlation with humans ($w_{\mathrm{pseudo}} = w_{\mathrm{lm}} = 0$) to 49% with the best weight combination, a relative improvement of 75%.
Contrastive Learning For UScore snt , we also use 4m monolingual sentences per language for mining but only retain the top 100k sentence pairs, as for the contrastive training objective we additionally have to filter out duplicate sentences. The results of UScore snt are shown in Figure 3 (bottom). The P@1 scores seem to steadily improve every two training iterations. Beginning with the sixth iteration, the precision seems to converge.

Other Languages: Results & Analysis
We now test our metrics on other languages and datasets. For UScore_snt, we train its sentence embedding model for six iterations. For UScore_wrd, we remap mBERT once with UMD and make use of a language model and pseudo references obtained from an MT system. Based on Section 4.2, we set $w_{\mathrm{pseudo}} = 0.4$, $w_{\mathrm{lm}} = 0.1$, and $w_{\mathrm{xlng}} = 0.5$. Additionally, based on analogous, unreported experiments, we set $w_{\mathrm{wrd}} = 0.6$ and $w_{\mathrm{snt}} = 0.4$ for ensembling. Since determining the weights for UScore_wrd and UScore_wrd⊕snt this way constitutes a form of supervision, we also evaluate weights chosen independently from our conducted experiments. Namely, we also evaluate UScore+ with $w_{\mathrm{wrd}} = w_{\mathrm{snt}} = 0.5$. For $w_{\mathrm{lm}}$, we follow XMoverScore and set it to 0.1 (setting $w_{\mathrm{lm}}$ lower makes sense because the LM only addresses the hypothesis without considering the source); accordingly, we set $w_{\mathrm{xlng}} = w_{\mathrm{pseudo}} = 0.45$. Since $w_{\mathrm{lm}} = 0.1$ coincides with our findings in Section 4.2, we also evaluate UScore++, where each component uses entirely uniform weights, i.e., $w_{\mathrm{wrd}} = w_{\mathrm{snt}} = \frac{1}{2}$ and $w_{\mathrm{lm}} = w_{\mathrm{xlng}} = w_{\mathrm{pseudo}} = \frac{1}{3}$. Correlations with human judgments averaged over language pairs are shown in Table 2 (individual results are in the appendix). We also present the results of the popular TYPE-2 metric BLEU, where possible, and the recent TYPE-1 metrics MonoTransQuest (Ranasinghe et al., 2020b,a) and COMET-QE (Rei et al., 2021). Finally, as more direct competitors, we compare to the TYPE-3 metrics XMoverScore, SentSim, and DistilScore. We compute all reported scores ourselves.
Overall, the tuned weights of UScore perform marginally better than UScore+ on most datasets, but UScore+ is usually a very close second. UScore++ performs worse, however, and is only competitive on two of the five datasets. This indicates that the language model weight should be set to a lower value, a choice that makes intuitive sense.
Expectedly, DistilScore, which uses parallel data, is always better than UScore_snt, which uses pseudo-parallel data. In contrast, UScore_wrd is generally on par with XMoverScore, even though XMoverScore uses real parallel data; the difference is that UScore_wrd also leverages pseudo-references, which XMoverScore does not. Indeed, from Figure 4, we observe that the pseudo-references yield an improvement of 1 to 11 points in correlation (comparing the column labeled $w_{\mathrm{pseudo}} = 0$ to the columns with $w_{\mathrm{pseudo}} > 0$). Our metrics beat reference-based TYPE-2 BLEU across the board. TYPE-1 metrics, which are fine-tuned on human scores, are generally the best. Intriguingly, the only two language pairs where our metrics are on par with them are the non-English de-zh and ru-de from Eval4NLP. These language pairs are outside the training scope of the current TYPE-1 metrics and thus test their generalization abilities. For example, on ru-de our best metric outperforms MonoTransQuest by 5 points correlation and COMET-QE by 9 points (Table 7 in the appendix).
UScore and UScore+ also outperform the TYPE-3 upper bounds on four of five datasets. On WMT-16, WMT-17, and Eval4NLP, they have the best overall results. On WMT-MQM, UScore_wrd alone is best. The drop in performance for the combined metric is caused by UScore_snt, which on its own performs very badly. As supervised DistilScore exhibits the same issues, this could be a general problem for sentence embedding-based metrics on this dataset. We identify further reasons in Appendix E.
For MLQE-PE, the SentSim metrics perform best on average among TYPE-3 and our metrics, although our reproduced scores for this dataset differ noticeably from the authors' results, due to issues in their original code (Chen et al., 2022). Among our self-learned metrics, the combined variant again performs best on average, but still lies 3-5 points below SentSim and DistilScore, even though it outperforms both XMoverScore variants by over 6 points. Interestingly, UScore_snt works better than UScore_wrd, unlike for the other datasets. Similarly, DistilScore clearly outperforms XMoverScore. This could be because MLQE-PE contains Sinhala, a language mBERT was not trained on. Another explanation is the data collection scheme for ru-en, which uses different sources of parallel sentences, mainly colloquial data and Russian proverbs with rather unconventional grammar (Fomicheva et al., 2022). This apparently confuses the language model and MT system, which have been trained on data from other domains. When we exclude si-en and ru-en from MLQE-PE, UScore_wrd⊕snt performs best, with an average Pearson's r of 44.22 for tuned weights and 44.45 for default weights vs. 43.82 for SentSim (BERTScore). In Appendix C, we show that incorporating real parallel data (in addition to pseudo-parallel data) in quantities an order of magnitude smaller than SentSim uses allows us to outperform SentSim on MLQE-PE as well.

Discussion
Throughout, we have presented a mix of results and analysis, which we now summarize and discuss. In Figure 4, we conducted an ablation study on the individual components of UScore_wrd (on de-en). This showed that all three components (pseudo-references, language model, WMD) matter; by itself, the LM is more important than the pseudo-references, which are more important than WMD. However, in combination, the LM is least important. We also showed that pseudo-parallel data can successfully rectify deficient multilingual vector spaces, similar to real parallel data; see Figure 3. We note, however, that pseudo-parallel data may introduce an important bias in our data sampling: namely, it may mine factually incorrect parallel sentences (see Table 3 in the appendix), which may amplify issues of adversarial robustness; see our discussion in Section 8. We remark that, depending on the annotation scheme, better correlations with human judgments do not necessarily entail better metrics (Freitag et al., 2021a). The datasets in this work were annotated using either DA, CLDA, or MQM scores, with MQM explicitly addressing this problem. Since our metrics are consistent regardless of the annotation scheme, they are unlikely to overfit a particular one.
We finally note that combining word- and sentence-level models is meaningful, because they offer complementary views (Song et al., 2021). Kaster et al. (2021) also show that they capture orthogonal linguistic factors to varying degrees. Such complementarity may also stem from the different underlying vector spaces, i.e., mBERT vs. XLM-R, that we use in our word- and sentence-level metrics.

Related Work
All metrics presented so far in this work treated the MT model generating the hypotheses as a black box. There also exists a recent line of work on so-called glass-box metrics, which actively incorporate the MT model under test into the scoring process (Fomicheva et al., 2020a,b). The approaches of Fomicheva et al. (2020b), however, are all trained on parallel data, which makes them supervised metrics in our sense.
Other recent metrics that leverage the relationship between metrics and (MT) systems are Prism (Thompson and Post, 2020) and BARTScore (Yuan et al., 2021). We do not classify them as unsupervised, however, as Prism is trained from scratch on parallel data and BARTScore uses a BART model fine-tuned on labeled summarization or paraphrasing datasets.
There are also multilingual sentence embedding models which are highly relevant in our context. Kvapilíková et al. (2020), for example, fine-tune XLM-R on synthetic data translated with an unsupervised MT system. Similar to our contrastive learning approach, the resulting embedding model is completely unsupervised. Important differences are that our sentence embedding model can be improved iteratively and does not rely on an MT system. We leave a comparison to future work.
Finally, the idea of fully unsupervised text generation systems originated in the MT community (Artetxe et al., 2018; Lample et al., 2018; Artetxe et al., 2019). Given the similarity of MT systems and evaluation metrics, designing fully unsupervised evaluation metrics is an apparent next step, which we take in this work.

Conclusion
In this work, we aimed for sample-efficient evaluation metrics that do not use any form of supervision.
In addition, our novel metrics should be maximally effective, i.e., of high quality. To achieve this, we leveraged pseudo-parallel data obtained from fully unsupervised evaluation metrics in three ways: we (i) remapped deficient vector spaces using the pseudo-parallel data, (ii) trained an unsupervised MT system from it (yielding pseudo references), and (iii) induced unsupervised multilingual sentence embeddings. To enable our approach, we also explored efficient pseudo-parallel corpus mining algorithms based on our metrics as an orthogonal contribution. Finally, we showed that our approach is effective and can outperform three supervised upper bounds (making use of parallel data) on 4 out of 5 datasets we included in our comparison.
In future work, we want to aim for algorithmic efficiency, include pseudo source texts as additional components (using the MT system in backward translation), and address the missing dualities discussed in Figure 1 (i.e., the use of metrics as optimization criteria and of MT systems to generate additional pseudo-parallel data). Further, our approach has substantial room for improvement, given that we selected hyperparameters either completely unsupervised or based on one high-resource language pair (de-en). Thus, it will be particularly intriguing to explore weakly-supervised approaches that leverage minimal forms of supervision.
Limitations

(1) Some of the components of UScore_wrd (mainly the MT system) have high computational costs. For example, XMoverScore and SentSim (BERTScore) take less than 30 seconds to score 1000 hypotheses on an Nvidia V100 GPU; UScore_wrd, on the other hand, takes over 2.5 minutes. This algorithmic inefficiency trades off against our sample efficiency, by which we did not use any supervision signals. In future work, we aim to experiment with efficient MT architectures to reduce computational costs (Kamal Eddine et al., 2022; Grünwald et al., 2022).
(2) Similarly to XMoverScore, MonoTransQuest, or SentSim, our metrics use high-quality encoders such as mBERT, which are not only memory- and inference-inefficient but also leverage large monolingual resources. Future work should thus not only investigate using smaller mBERT models but also models that leverage smaller amounts of monolingual resources. Wang et al. (2020), for example, propose a competitive LSTM-based approach that completely forgoes monolingual resources and instead uses small parallel corpora (i.e., a few hundred parallel sentences as a weak supervision signal). Similarly, we give a recipe for improving mBERT for unseen languages using limited amounts of parallel data in Appendix C.
(3) Using unsupervised MT approaches, as we do via pseudo references, may be less effective for truly low-resource languages (Marchisio et al., 2020). However, this remains a very active research field with a constant influx of more powerful solutions (Ranathunga et al., 2022). (4) As indicated in Sections 4.2 and 5, our mined pseudo-parallel data tends to contain factual inconsistencies such as "Uruguay was seventh" vs. (a translation of) "Russia was second". As a consequence, our induced metrics may be less robust than existing metrics (Chen and Eger, 2022; Rony et al., 2022). An approach to addressing this inconsistency would be to retain only high-probability aligned words in parallel sentences (recall that we infer word-level parallel data from sentence-level parallel data).

Acknowledgments
We thank all reviewers for their valuable feedback, hard work, and time. The last author was supported by DFG grant EG 375/5-1.

A Vector Space Remapping

Zhao et al. (2020) explore two different remapping approaches for XMoverScore, which are defined as follows:

Procrustes alignment Mikolov et al. (2013) propose to compute a linear transformation matrix $\mathbf{W}$ which can be used to map a vector $x$ of a source word into the target language subspace by computing $\mathbf{W}x$. The transformation can be computed by solving the problem

$$\mathbf{W}^{*} = \operatorname*{arg\,min}_{\mathbf{W}} \lVert \mathbf{W}\mathbf{X} - \mathbf{Y} \rVert_{F}$$

Here $\mathbf{X}$, $\mathbf{Y}$ are matrices with embeddings of source and target words, respectively, where the tuples $(x, y) \in \mathbf{X} \times \mathbf{Y}$ come from parallel word pairs. XMoverScore constrains $\mathbf{W}$ to be an orthogonal matrix such that $\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$, since this can lead to further improvements (Xing et al., 2015). Zhao et al. (2020) call this remapping Linear Cross-Lingual Projection remapping (CLP).
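A minimal sketch of the orthogonally constrained solution (the classical Procrustes closed form via SVD), assuming the rows of X and Y are aligned word embeddings:

```python
import numpy as np

def clp_remap_matrix(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Solve the orthogonal Procrustes problem for CLP remapping.
    X, Y: [n_pairs, dim] row matrices of aligned source/target word
    embeddings. Returns an orthogonal A minimizing ||X @ A - Y||_F;
    a source embedding x is then mapped as x @ A."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage sketch: remap all source-language embeddings at once.
# E_src_remapped = E_src @ clp_remap_matrix(X, Y)
```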

De-biasing
The second remapping method of XMoverScore is rooted in the removal of biases from word embeddings. Dev and Phillips (2019) explore a bias attenuation technique called Universal Language Mismatch-Direction (UMD). It involves a bias vector $v_B$, which is supposed to capture the bias direction. For each word embedding $v$, an updated word embedding $v'$ is computed by subtracting its projection onto $v_B$, as in

$$v' = v - (v \cdot v_B)\, v_B$$

where $\cdot$ is the dot product. To obtain the bias vector $v_B$, Dev and Phillips (2019) use a set $E$ of word pairs that should be de-biased (e.g., man and woman). The differences of the embeddings of the words in each pair are stacked to form a matrix $\mathbf{Q}$, and the bias vector is its top left singular vector. Zhao et al. (2020) use the same approach for XMoverScore, but $E$ instead consists of parallel word pairs. Zhao et al. (2020) show that these remapping methods lead to substantial improvements of their XMoverScore metric (on average, up to 10 points in correlation). The required parallel word pairs were extracted from sentences of the EuroParl corpus (Koehn, 2005) using the fast-align (Dyer et al., 2013) word alignment tool. The best results were obtained when remapping on 2k parallel sentences.
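A sketch of the UMD update; the orientation of Q (pair differences as rows, hence the first right singular vector of this matrix) is our assumption about the implementation:

```python
import numpy as np

def umd_remap(E: np.ndarray, pairs) -> np.ndarray:
    """UMD-style de-biasing. E: [vocab, dim] embeddings to update;
    pairs: iterable of (u, v) embedding vectors of aligned words."""
    # Stack the difference vectors of the aligned pairs.
    Q = np.stack([u - v for u, v in pairs])
    # Dominant mismatch direction = top singular vector of Q
    # (unit-norm row of Vt in this row-wise orientation).
    _, _, Vt = np.linalg.svd(Q, full_matrices=False)
    b = Vt[0]
    # Subtract each embedding's projection onto the bias direction.
    return E - np.outer(E @ b, b)
```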

B Filtering
Since large corpora tend to include low-quality data points, we follow Artetxe and Schwenk (2019) and Keung et al. (2021) and apply three simple filtering techniques. We first remove all sentences from each monolingual corpus for which the fastText language identification tool (Joulin et al., 2017) predicts a different language. We then filter out all sentences which are shorter than 3 tokens or longer than 30 tokens. As a last step, we discard sentence pairs sharing substantial lexical overlap, which prevents degenerate alignments of, e.g., proper names: we remove all sentence pairs for which the Levenshtein distance detects an overlap of over 50%.
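A sketch of these three filters; the fastText model file name and the python-Levenshtein dependency are assumptions about the tooling, and we apply all filters at the pair level for brevity (the first two operate per monolingual corpus in our pipeline):

```python
import fasttext
import Levenshtein

# Pre-trained fastText language-ID model (assumed file name).
lid = fasttext.load_model("lid.176.bin")

def keep_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> bool:
    # 1) Language identification must match the expected languages.
    if lid.predict(src)[0][0] != f"__label__{src_lang}":
        return False
    if lid.predict(tgt)[0][0] != f"__label__{tgt_lang}":
        return False
    # 2) Length filter: between 3 and 30 tokens.
    if not (3 <= len(src.split()) <= 30 and 3 <= len(tgt.split()) <= 30):
        return False
    # 3) Lexical-overlap filter: drop pairs with > 50% overlap.
    return Levenshtein.ratio(src, tgt) <= 0.5
```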

C Fine-Tuning on Parallel Data
To examine whether and by how much we can further improve our metrics using forms of supervision, we experiment with a fine-tuning step on parallel sentences and treat self-learning on pseudo-parallel data as pre-training (He et al., 2020). We use the parallel data to fine-tune the contrastive sentence embeddings of UScore snt and the MT system of UScore wrd , which is responsible for generating pseudo references. Further, we also compute new remapping matrices for UScore wrd . Since CLP is superior to UMD when parallel data is used (see Section 4.2), we compute these remapping matrices using CLP instead of UMD. To assess how different amounts of parallel sentences affect performance, we fine-tune our metrics on 10k, 20k, 30k, and 200k parallel sentences. We use WikiMatrix (Schwenk et al., 2021) and the Nepali Translation Parallel Corpus (Duwal et al., 2019) to obtain parallel sentences.
Pearson's r correlations with human judgments for individual and averaged language pairs are shown in Figure 5; we focus on MLQE-PE, where our metrics performed worst. Overall, introducing parallel data into the training process consistently improves performance for the majority of language directions; more parallel data leads to better results. The relatively biggest improvements are achieved for the si-en language direction, which is in accordance with our discussion above. When fine-tuning with 30k parallel sentences, the performance of our metrics is roughly on par with the SentSim variants (see Table 2). With 200k parallel sentences, our metrics clearly outperform SentSim, which uses millions of parallel sentences and NLI data as supervision signals.

D Hyperparameters
For efficient WMD pseudo-parallel mining, we set $k = 20$ (the number of WCD nearest neighbors for which exact WMD is computed) for remapping and $k = 1$ for training mBART. For ratio-margin-based pseudo-parallel mining, we use $k = 5$. With regard to training UScore_snt, we follow Gao et al. (2021) and iteratively train XLM-R for one training epoch with a learning rate of 5e-5, a batch size of 256, and a temperature coefficient of $\tau = 0.05$, using the AdamW optimizer (Loshchilov and Hutter, 2019). Fine-tuning of mBART was performed for three epochs with a batch size of four, using the same learning rate of 5e-5 and the AdamW optimizer.
We decided to continue using mBERT in UScore_wrd for two reasons. Firstly, we want UScore_wrd to remain directly comparable to XMoverScore, which is based on mBERT. Secondly, in our own experience, vanilla mBERT is very robust in terms of layer choice, especially compared to vanilla XLM-R: intuitive choices like the first or last layer work very well for a lot of problems. This is an important property for unsupervised metrics, since we cannot easily justify a (supervised) hyperparameter search in an unsupervised setting to determine the best layer, or even a linear combination of layers. Table 3 shows examples of pseudo-parallel data obtained with UScore_wrd and UScore_snt. Tables 4, 5, 6, and 7 show segment-level Pearson's r correlations with human judgments on WMT-16, WMT-17, MLQE-PE, as well as WMT-MQM and Eval4NLP, respectively. Table 8 provides additional statistics for each dataset.

E Supplementary Data and Results
A surprising finding is the poor performance of UScore_snt and DistilScore on the German-English language pair in Table 7. It is well known that high-resource language directions, such as English-German, can be affected by a lack of low-quality translations (Fomicheva et al., 2022), and with only high-quality translations available, there is little variation in the scores, which makes a meaningful assessment of correlation difficult (Specia et al., 2020). Further, since both UScore_snt and DistilScore are based on sentence embeddings of XLM-R, and sentence embedding-based metrics are known to be somewhat worse on average than their word embedding-based counterparts, we believe that both aspects combined could be the root cause.

Type | Source | Target
Top-WRD | Uruguay belegt mit vier Punkten nur Platz Sieben. | Russia was second with four gold and 13 medals.
Top-WRD | Soweit lautet zumindest die Theorie. | That, at least, is the theory.
Rnd-WRD | Die USA stellen etwa 17.000 der insgesamt 47.000 ausländischen Soldaten in Afghanistan. | Currently, there are about 170,000 U.S. troops in Iraq and 26,000 in Afghanistan.
Rnd-WRD | "Das ist eine schwierige Situation", sagte Kaczynski. | "It seemed like a ridiculous situation," Vanderjagt said.
— | — | Parliamentary elections are to be held by January.
— | — | Those attending the Soil Forensics International Conference work in the fields of science, policing, forensic services as well as private industries.
Rnd-SNT | Frankfurt soll WM-Finale der Frauen ausrichten | The women's tournament gets underway on Sunday.

Table 3: Pseudo-parallel data obtained via UScore_wrd and UScore_snt; top and random sentence pairs. The mined sentences are semantically similar, but contain factuality errors (e.g., have wrong places or numbers in hypotheses).
Table 6: Segment-level Pearson's r correlations with human judgments on the MLQE-PE dataset.

Table 7: Segment-level Pearson's r correlations with human judgments on WMT-MQM (en-de, zh-en) and Eval4NLP (de-zh, ru-de).

Table 8: Statistics of our used datasets averaged over language pairs. For each dataset, we report the number of sentences we evaluated on, the number of tokens in these, and the type of human annotation. The number of tokens refers to source sentences.