Automatic Text Evaluation through the Lens of Wasserstein Barycenters

A new metric, BaryScore, to evaluate text generation based on deep contextualized embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new framework relying on optimal transport tools, i.e., the Wasserstein distance and barycenter. By modelling the layer outputs of deep contextualized embeddings as probability distributions rather than as vector embeddings, this framework provides a natural way to aggregate the different outputs through the Wasserstein space topology. In addition, it provides theoretical grounds for our metric and offers an alternative to available solutions (e.g., MoverScore and BertScore). Numerical evaluation is performed on four different tasks: machine translation, summarization, data2text generation and image captioning. Our results show that BaryScore outperforms other BERT-based metrics and exhibits more consistent behaviour, in particular for text summarization.


Introduction
Automatic Evaluation (AE) of Natural Language Generation (NLG) is a key problem towards building better systems (Specia et al., 2010). It makes it possible to assess the quality of generated text without relying on human evaluation campaigns, which are expensive and time consuming (Belz and Reiter, 2006; Sai et al., 2020). In particular, it becomes crucial to design automatic and effective metrics with two simultaneous goals: (i) to compare, control and debug systems without relying on human annotators (Peyrard, 2019a,b); and (ii) to improve the learning phase of models by deriving losses that are better surrogates of human judgment than the widely used cross-entropy loss (Clark et al., 2019).
A plethora of automatic metrics has been introduced in the last few years and may be grouped into two general classes: trained (Ma et al., 2017; Shimanaka et al., 2018; Lowe et al., 2016; Lita et al., 2005) and untrained metrics (Doddington, 2002; Popović, 2015). In this paper, we mainly focus on untrained metrics, which can be further split into three subgroups: string matching (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005; Doddington, 2002; Popović, 2015), edit based (Leusch et al., 2006; Snover et al., 2006; Wang et al., 2016) and embedding based metrics (Chow et al., 2019; Kusner et al., 2015; Lo and Wu, 2011; Lo, 2019). Both string matching and edit based metrics fail to assign reliable scores when reference and candidate convey the same meaning with distinct surface forms (Reiter and Belz, 2009), e.g., in the case of synonyms and paraphrases. These shortcomings have been addressed by metrics based on continuous representations. Recently, they have benefited from contextual embeddings such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018). Perhaps the best known are BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019b) and Sentence Mover (Clark et al., 2019), which build on the Word Mover Distance (WMD) (Kusner et al., 2015), a particular instance of the optimal transport (OT) problem. Originally introduced by Kusner et al. (2015), the WMD computes the Wasserstein distance between text documents relying on a single-layer embedding such as GloVe (Pennington et al., 2014) or Word2Vec (Mikolov et al., 2013). To apply the WMD to multi-layer embeddings, several recipes have been proposed. BertScore selects the best layer based on a validation set. However, the selection of the validation set is arbitrary and, moreover, selecting a single layer does not exploit the information available in the other layers (Voita et al., 2019; Hewitt and Liang, 2019; Liu et al., 2019a).
MoverScore and Sentence Mover attempt to leverage the information available in the other layers by aggregating the layers using a power mean (Rücklé et al., 2018b). In addition to adding extra hyper-parameters, this aggregation method relies on a Euclidean topology, which induces a geometrical discrepancy since the final cost is computed using a Wasserstein distance. Our contributions. We introduce BaryScore, a novel metric which addresses the aforementioned aggregation pitfalls by relying on Wasserstein barycenters, and evaluate its performance on four different tasks: neural machine translation, text summarization, image captioning and data2text generation. Our main contributions can be summarized as follows: 1. A novel metric to measure the semantic equivalence between two texts. This metric relies on the geometry that the embedding layers induce in Wasserstein spaces. In order to overcome the geometric distortion generated by the aggregation techniques used to compute the WMD with deep embeddings (e.g., BERT, ELMo and RoBERTa), we aggregate layer information using the Wasserstein barycenter. This new formulation offers a topological advantage, i.e., using barycenters gives meaning to the subsequent use of an OT-based distance, and is parameter-free, i.e., it avoids choosing the best layer by hand (as for BertScore) or selecting the exponent in the power means (as for MoverScore). Our formulation provides an alternative to, and a generalization of, the WMD formulation (Kusner et al., 2015) (originally introduced for Word2Vec) when applied to embeddings coming from multi-layer neural networks, and thus it provides theoretical motivation for BaryScore (Bary stands for barycenter), a new metric that aggregates deep contextualized embeddings using Wasserstein barycenters.
2. Applications and numerical results. We demonstrate that BaryScore provides better results than a large variety of state-of-the-art untrained metrics on four text generation tasks, namely NMT, summarization, image captioning and data2text generation, suggesting that Wasserstein barycenters offer a promising direction moving forward.

Related Work
The goal of NLG is to generate coherent, readable and informative text from some input data (e.g., texts, images and tables). However, the exact definition of each of these three criteria remains task-dependent, which makes it hard to provide a unique metric for all tasks. As an example, NMT focuses on fluency, fidelity and adequacy (Hovy, 1999; White et al., 1994), in contrast to summarization, where annotators have to focus on coherence, content, readability, grammaticality and conciseness (Mani, 2001). In the following, we describe, for each of the four considered tasks (i.e., NMT, text summarization, image captioning and data2text generation), the most used metrics. Metrics for NMT. Most of the metrics commonly used in NMT rely on comparing surface forms (e.g., word, subword or n-gram overlap and edit based distances (Levenshtein, 1966)) between reference and candidate texts. Perhaps the most popular metrics are the ones used for the WMT shared tasks (Mathur et al., 2020; Ma et al., 2019, 2018; Bojar et al., 2017b), which include SENTBLEU, BLEU (Papineni et al., 2002), CHARACTER (Wang et al., 2016), COMET (Rei et al., 2020), YISI (Lo et al., 2018), MEE (Mukherjee et al., 2020), EED (Stanchev et al., 2019), CHRF (Popović, 2015, 2017), ESIM (Chen et al., 2016) and PRISM (Thompson and Post, 2020), to mention only a few. A new family of metrics based on pretrained transformers (i.e., BertScore, MoverScore) has recently emerged with very good performance in NMT, incorporating deeper semantic information through contextualized representations. Metrics for summarization. Designing better summarization metrics is an active area of research (Scialom et al., 2021), and many of these metrics can be further optimized to produce better summaries (Böhm et al., 2019).
Popular metrics include machine translation metrics (i.e., CHRF, BLEU, METEOR (Banerjee and Lavie, 2005; Guo and Hu, 2019; Denkowski and Lavie, 2014), BertScore, MoverScore or SentenceMover (Clark et al., 2019)), ROUGE (Lin, 2004; Ganesan, 2018) and data statistics (e.g., density and compression ratio) (Grusky et al., 2018). Metrics for data2text. Data2text generation aims at generating text from structured data (Kim and Mooney, 2010; Chen and Mooney, 2008; Wiseman et al., 2017). In the present work, we focus on the WebNLG 2020 challenge (Perez-Beltrachini et al., 2016; Gardent et al., 2017), which ranks systems using five automatic metrics: BLEU, METEOR, BERTScore, TER and CHRF++. Metrics for image captioning. Task-specific metrics for image captioning include CIDEr (Vedantam et al., 2015), which relies on n-grams, LEIC (Cui et al., 2018), which uses scene graph similarity, and pretrained metrics such as SPICE (Anderson et al., 2016). In recent work (Zhang et al., 2019; Zhao et al., 2019b), these metrics are compared with NMT-specific metrics (e.g., BLEU and METEOR).

Background on Optimal Transport
The Wasserstein distance (i.e., Earth Mover Distance), which arises from the idea of optimal transport, provides a way to measure dissimilarities between two probability distributions. Due to its appealing geometric properties, it has found many applications in machine learning, such as generative models (Arjovsky et al., 2017; Tolstikhin et al., 2018; Gulrajani et al., 2017), domain adaptation (Courty et al., 2017), clustering (Ho et al., 2017; Ye et al., 2017), adversarial examples (Wong et al., 2019), robustness (Staerman et al., 2021) or NLP (Kusner et al., 2015; Zhao et al., 2019b; Singh et al., 2020). First designed as an optimal transport optimization problem, it relies on minimizing a transport cost over all possible coupling measures. Its ability to take into account the underlying geometry of the space, as well as to capture information from distributions with non-overlapping supports, makes it a powerful alternative to several dissimilarity measures such as the family of f-divergences. Wasserstein distance. Let $\mathcal{M}^1_+(\mathbb{R}^d)$ denote the space of all probability distributions defined on $\mathbb{R}^d$ with $d \in \mathbb{N}^*$. The Wasserstein distance between two arbitrary measures $\mu \in \mathcal{M}^1_+(\mathcal{X})$ and $\nu \in \mathcal{M}^1_+(\mathcal{Y})$ is defined through the resolution of the Monge-Kantorovich mass transportation problem (Villani, 2003; Peyré and Cuturi, 2019):

$W_p(\mu, \nu) = \Big( \inf_{\pi \in \mathcal{U}(\mu, \nu)} \int \| x - y \|^p \, \mathrm{d}\pi(x, y) \Big)^{1/p}$,   (1)

where $\mathcal{U}(\mu, \nu) = \{ \pi \in \mathcal{M}^1_+(\mathcal{X} \times \mathcal{Y}) : \int \pi(x, y) \, \mathrm{d}y = \mu(x); \int \pi(x, y) \, \mathrm{d}x = \nu(y) \}$ is the set of joint probability distributions with marginals $\mu$ and $\nu$. In the remainder of this paper, we focus on the Wasserstein distance associated with the quadratic cost, i.e., $p = 2$. Thus, the Wasserstein distance aims to find the best possible way to transfer the probability mass from $\mu$ to $\nu$ while minimizing the transportation cost defined by the Euclidean distance. Wasserstein barycenters. Because optimal transport is based on mass displacement, it also defines an interesting way to interpolate between several input measures.
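To make Eq. (1) concrete in the discrete case used throughout the paper, the transport problem can be solved exactly as a small linear program over couplings. The following is a minimal sketch with numpy and scipy; the authors' implementation instead relies on the POT library, so this is only an illustrative stand-in:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2(X, Y, a, b):
    # Exact 2-Wasserstein distance between the discrete measures
    # sum_i a_i * delta_{x_i} and sum_j b_j * delta_{y_j}
    # (Eq. (1) with p = 2), solved as a linear program over couplings.
    n, m = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # ||x_i - y_j||^2
    # Marginal constraints: rows of the coupling sum to a, columns to b.
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([a, b]), bounds=(0, None),
                  method="highs")
    return np.sqrt(max(res.fun, 0.0))
```

Shifting a point cloud by a vector t moves its measure by exactly ||t|| in W2, which makes the behaviour easy to sanity-check.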
The Wasserstein barycenter, first introduced and studied in (Agueh and Carlier, 2011), defines an interpolation measure between several probability distributions. The main asset of Wasserstein barycenters is to take into account the geometry of the space in which the input measures live (cf. Fig. 1). Given $N$ probability distributions $\mu_1, \ldots, \mu_N \in \mathcal{M}^1_+(\mathbb{R}^d)$ and weights $(\alpha_1, \ldots, \alpha_N) \in \mathbb{R}_+^N$, the Wasserstein barycenter optimization problem for these distributions w.r.t. the weights is defined as:

$\mu^* \in \operatorname{argmin}_{\mu \in \mathcal{M}^1_+(\mathbb{R}^d)} \sum_{n=1}^{N} \alpha_n W_2^2(\mu, \mu_n)$,   (2)

where the support of $\mu$ may be unknown. Equation 2 defines a weighted average in the Wasserstein space. To make it computationally tractable, the measure $\mu$ is often constrained to be a discrete measure with free (Cuturi and Doucet, 2014; Álvarez Esteban et al., 2016; Cuturi and Peyré, 2016; Luise et al., 2019) or fixed support (Benamou et al., 2015; Dvurechenskii et al., 2018; Lin et al., 2020; Janati et al., 2020). For the purpose of our approach, we focus on free-support barycenters with fixed weights in the following.
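In one dimension, Eq. (2) admits a closed form that makes the "geometry-aware averaging" interpretation tangible: for empirical measures of equal size with uniform weights, the 2-Wasserstein barycenter is obtained by averaging sorted samples (quantile averaging). A minimal sketch, valid only under those assumptions:

```python
import numpy as np

def barycenter_1d(samples_list):
    # 2-Wasserstein barycenter of N one-dimensional empirical measures
    # with equal sizes and uniform weights: average of sorted samples
    # (quantile averaging), a special case where Eq. (2) has a closed form.
    sorted_stack = np.stack([np.sort(s) for s in samples_list])
    return sorted_stack.mean(axis=0)
```

For instance, the barycenter of the point sets {1, 2, 3} and {3, 4, 5} is {2, 3, 4}: each quantile moves halfway between its two counterparts, exactly the mass-displacement interpolation described above.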

BaryScore Metric
The construction of automatic metrics usually relies on two paradigms depending on the availability of a reference sentence for each candidate (Specia et al., 2010). Here and throughout the paper, we assume that at least one reference is available for each candidate. Denote by $C = \{\omega_1^c, \ldots, \omega_{n_c}^c\}$ the candidate and by $R = \{\omega_1^r, \ldots, \omega_{n_r}^r\}$ the reference, composed of $n_c$ and $n_r$ words, respectively. Our goal is to design a metric $m : (C, R) \mapsto m(C, R) \in \mathbb{R}_+$ such that the closer $m(C, R)$ is to zero, the better the candidate. Algorithm. Our metric $m$, named BaryScore, can be summarized in two steps: (i) find the Wasserstein barycentric distributions of the contextual encoder layers for $C$ and $R$; (ii) evaluate these barycentric distributions using the Wasserstein distance. Wasserstein barycenters. Assume that a contextual encoder (e.g., BERT or ELMo) is composed of $L$ layers, i.e., functions $\phi_1, \ldots, \phi_L$ that map a candidate text $C$ and a reference text $R$ into $\phi_\ell(C) \in \mathbb{R}^{n_c \times d}$ and $\phi_\ell(R) \in \mathbb{R}^{n_r \times d}$, for every $1 \le \ell \le L$. In our approach, we consider the discrete probability distributions induced by $\phi_\ell(C)$ and $\phi_\ell(R)$, where $\phi_\ell(C)_i$ and $\phi_\ell(R)_j$ represent the embeddings of the $i$-th and $j$-th tokens of the candidate and reference texts, respectively. Precisely, $2L$ empirical measures are constructed from these layer functions:

$\mu_{C,\ell} = \sum_{i=1}^{n_c} \alpha_i \, \delta_{\phi_\ell(C)_i}$ and $\mu_{R,\ell} = \sum_{j=1}^{n_r} \beta_j \, \delta_{\phi_\ell(R)_j}$, for $1 \le \ell \le L$,

where $\alpha = \{\alpha_1, \ldots, \alpha_{n_c}\}$ and $\beta = \{\beta_1, \ldots, \beta_{n_r}\}$ are the vectors of inverse document frequencies of the words $\omega_i$ of $C$ and $R$, respectively, and $\delta_x$ is the Dirac mass at point $x$. Further, Wasserstein barycenters (see Eq. (2)) are computed for the candidate $C$ and the reference $R$, leading to two barycentric embedding measures $\hat{\mu}_C$ and $\hat{\mu}_R$ with fixed sizes $n_c$ and $n_r$, respectively. Considering the weights of the barycentric measures as uniform, as for the layer weights, the optimization problem is equivalent to finding locations such that:

$\{x_1^c, \ldots, x_{n_c}^c\} \in \operatorname{argmin}_{x_1, \ldots, x_{n_c}} \sum_{\ell=1}^{L} W_2^2\Big( \frac{1}{n_c} \sum_{k=1}^{n_c} \delta_{x_k}, \, \mu_{C,\ell} \Big)$.   (3)

The formulation is similar for the reference text, replacing $C$ by $R$ and $\alpha$ by $\beta$ in the notations.
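The IDF weights α and β above can be computed, for instance, as follows. The exact IDF variant and smoothing used in the paper are not spelled out here, so both are assumptions in this sketch; the weights are normalized so that each measure is a probability distribution:

```python
import math
from collections import Counter

def idf_weights(tokens, documents):
    # Inverse document frequency weights for the tokens of one text,
    # normalized to sum to 1 so they define a probability distribution.
    # Smoothed IDF variant (add-one in numerator and denominator): an
    # assumption, since the paper does not specify the exact formula.
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # document frequency of each token
    w = [math.log((n_docs + 1) / (df[t] + 1)) for t in tokens]
    s = sum(w)
    return [wi / s for wi in w] if s > 0 else [1.0 / len(w)] * len(w)
```

Tokens that appear in every document receive (near-)zero weight, so rare, content-bearing words dominate the transport problem.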
The final embedding, denoted by $\Phi$, is then given by the locations of the barycentric measures, i.e., $\Phi(C) = \{x_1^c, \ldots, x_{n_c}^c\}$ and $\Phi(R) = \{x_1^r, \ldots, x_{n_r}^r\}$. Computing the Wasserstein distance. The last step of our approach is to evaluate the discrete measures induced by the final embeddings, i.e., the candidate and reference barycentric measures $\hat{\mu}_C$ and $\hat{\mu}_R$, using the Wasserstein distance, leading to the BaryScore given by $m(C, R) = W_2(\hat{\mu}_C, \hat{\mu}_R)$. This step boils down to computing the WMD and is similar to the final step in MoverScore or SentenceMover. The entire procedure is summarized in Algorithm 1.
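The two steps can be sketched end to end. This toy version makes several simplifying assumptions relative to the paper: uniform weights instead of IDF weights, equal token counts so each transport plan reduces to an assignment, and plain numpy arrays standing in for BERT layer outputs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def _w2_uniform(X, Y):
    # W2 between two uniform discrete measures with the same number of
    # points: the optimal coupling is then a permutation (assignment).
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(C)
    return np.sqrt(C[rows, cols].mean())

def _barycenter(layers, n_iter=30):
    # Free-support barycenter of the L per-layer token clouds,
    # via fixed-point iteration over optimal assignments.
    X = layers[0].copy()
    for _ in range(n_iter):
        tgt = np.zeros_like(X)
        for Y in layers:
            C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            _, cols = linear_sum_assignment(C)
            tgt += Y[cols]
        X_new = tgt / len(layers)
        if np.allclose(X_new, X):
            break
        X = X_new
    return X

def bary_score(cand_layers, ref_layers):
    # Step (i): one barycenter per text; step (ii): W2 between them.
    return _w2_uniform(_barycenter(cand_layers), _barycenter(ref_layers))
```

With a real encoder, `cand_layers` would be the L arrays φ_ℓ(C) of shape (n_c, d) extracted from the model's hidden states (and likewise for the reference).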
Algorithm 1 (BaryScore): (i) compute the measures $\{\mu_{C,\ell}, \mu_{R,\ell}\}_{\ell=1}^{L}$; (ii) compute the Wasserstein barycenters $\hat{\mu}_C$ and $\hat{\mu}_R$; (iii) return $m(C, R) = W_2(\hat{\mu}_C, \hat{\mu}_R)$. Parameters of the BaryScore metric. Our metric depends on the choice of the continuous representations (e.g., BERT, ELMo and RoBERTa), and its performance is therefore influenced by the choice of the model. As is common in related work (Zhang et al., 2019; Zhao et al., 2019b; Clark et al., 2019), all the results in the paper are obtained with a single model, namely BERT-base-uncased. Additionally, we report results obtained with the BERT model fine-tuned on NLI released in (Zhao et al., 2019b); this model is referred to as BaryScore+ in the following. In contrast to previous work (e.g., SentenceMover), which integrates a preprocessing step removing stopwords based on a static list, we keep all words. Also, our framework provides a natural way to exploit all available layers of the model, while previous work relies on a specific subset of them (e.g., MoverScore and BertScore). We believe this strengthens the robustness of our approach. Comparison with MoverScore. Following the footsteps of (Zhang et al., 2019), MoverScore (Zhao et al., 2019b) applied optimal transport to the output of contextualized encoders (CE) such as BERT or ELMo. Precisely, assuming that a CE is composed of $L$ layers, the MoverScore context representation of each word $\omega_j$ is given by $\Phi(\omega_j) = T(\phi_1(\omega_j), \ldots, \phi_L(\omega_j))$, where the transformation $T$ is either a power mean (Rücklé et al., 2018a) or one of the aggregation routines described in (Zhao et al., 2019a). The score is then defined by the Wasserstein distance between the empirical distributions given by $\Phi(C)$ and $\Phi(R)$.
The main weakness of this approach is the aggregation step. Taking into account the role of the underlying geometry of the probability distribution, as well as the interpretability of the transportation flow, are key benefits of optimal transport. However, computing a Wasserstein distance after applying a power mean, i.e., an aggregation in a Euclidean space (see, e.g., Figure 1), does not allow a proper evaluation of the geometry induced by the CE layers in the Wasserstein space. Indeed, MoverScore evaluates a distorted geometry, leading to a misleading interpretation of the transportation flow. The advantage of the Wasserstein barycenter over Euclidean aggregation lies in restoring this geometry, as shown in Section 6.
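For contrast, the Euclidean aggregation discussed above can be sketched as a per-coordinate power mean over layers. This is a sketch of the Rücklé et al. (2018) operator, not necessarily MoverScore's exact routine, and fractional exponents additionally assume non-negative activations:

```python
import numpy as np

def power_mean(layers, p=1.0):
    # Aggregate L layer embeddings of shape (n_tokens, d) coordinate-wise:
    # ((1/L) * sum_l x_l^p)^(1/p). p = 1 recovers the arithmetic mean.
    stack = np.stack(layers)  # shape (L, n_tokens, d)
    return np.mean(stack ** p, axis=0) ** (1.0 / p)
```

The point of the comparison: this collapses the L layer point clouds into a single cloud before any transport is computed, whereas the barycenter aggregation keeps the averaging itself inside the Wasserstein geometry.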

Experimental Settings
In this section, we present our evaluation methods as well as the various datasets used to benchmark our metric. Extension of notations. In the previous section, we only considered one candidate and one reference sentence. In order to evaluate and compare different metrics, we need to extend the previous notations to include the system that generates each sentence. To this end, we assume that $C_i^j$ is the $i$-th text generated by the $j$-th system, $h(C_i^j)$ is the human score assigned to $C_i^j$ and $R_i$ is the reference text associated with $C_i^j$; $N$ is the number of available texts and $S$ the number of different systems.

Evaluating automatic evaluation of NLG
The quality of an evaluation metric is measured by its correlation with human judgment (Chatzikoumi, 2020; Specia et al., 2010; Koehn, 2009; Banerjee and Lavie, 2005). Three correlation measures can be considered: Pearson (Leusch et al., 2003), Spearman (Melamed et al., 2003) or Kendall (Kendall, 1938). In addition, two different levels of granularity are considered to compute these correlation coefficients. System level correlation. This is considered when assessing the discrimination capability between two systems. This level of correlation tries to answer the question: "Can the metric be used to compare the performance of two systems?". Formally, the system level correlation $K_{\text{sys}}$ of a metric $m$ is defined as:

$K_{\text{sys}}(m) = K\Big( \big[ \tfrac{1}{N} \sum_{i=1}^{N} m(C_i^j, R_i) \big]_{j=1}^{S}, \; \big[ \tfrac{1}{N} \sum_{i=1}^{N} h(C_i^j) \big]_{j=1}^{S} \Big)$,   (4)

where $K$ is the considered correlation coefficient.
Text level correlation. This is computed to evaluate the ability of a metric to measure the semantic equivalence between a candidate and a reference sentence. Such a level of correlation aims at answering the question: "Can the metric be used as a loss or reward of a system?". With notations similar to those of Equation 4, we obtain the text level correlation $K_{\text{text}}$:

$K_{\text{text}}(m) = \frac{1}{S} \sum_{j=1}^{S} K\Big( \big[ m(C_i^j, R_i) \big]_{i=1}^{N}, \; \big[ h(C_i^j) \big]_{i=1}^{N} \Big)$.   (5)

Significance testing. To ensure that an observed improvement is statistically significant, we follow the common consensus in NMT (Deutsch et al., 2021; Graham, 2015; Graham et al., 2015; Graham and Baldwin, 2014) and rely on the Williams test (Steiger, 1980), as the considered observations are correlated.
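Both granularities can be sketched in a few lines. Since Eqs. (4) and (5) were partially lost in extraction, the sketch below encodes one standard convention, per-system averaging of scores for the system level and averaging of per-system correlations for the text level, and is an illustration rather than necessarily the paper's exact aggregation:

```python
import numpy as np
from scipy.stats import kendalltau

def system_level_corr(metric_scores, human_scores):
    # K_sys: correlate per-system averages of metric and human scores.
    # Both inputs have shape (S systems, N texts).
    tau, _ = kendalltau(metric_scores.mean(axis=1), human_scores.mean(axis=1))
    return tau

def text_level_corr(metric_scores, human_scores):
    # K_text: average over systems of the per-text Kendall correlation.
    taus = [kendalltau(m, h)[0] for m, h in zip(metric_scores, human_scores)]
    return float(np.mean(taus))
```

A metric can score highly at one level and poorly at the other (as observed for some BERT-based metrics on extractive summarization later in the paper), which is why both granularities are reported.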

Choice of datasets
We motivate our choice of datasets for each task.
Translation. Multiple translation datasets are available from the WMT translation shared tasks (Bojar et al., 2014, 2015, 2016, 2017a). Keeping in mind the work of Card et al. (2020) stressing the importance of dataset size, we follow Zhang et al. (2019) and choose to work with WMT16; we additionally report results on WMT15. Both offer over 500 annotated sentences per language, in contrast to more recent editions (Barrault et al., 2019, 2020) that rely on a lower number (around 50) of annotated texts. Data2text. Systems in the WebNLG 2020 challenge are assessed by human annotators along criteria including: (1) data coverage, which measures whether the generated text contains all the information present in the input data, and (2) relevance, which characterizes whether the generated text is solely composed of information available in the input.

Numerical Results
In this section, we study the performance of BaryScore on the four aforementioned tasks.

Translation
Overall results. Table 1 and Table 2 gather correlations with human judgments on WMT15 and WMT16. We conduct a statistical analysis to ensure that the observed improvements are statistically meaningful (see Figure 3). We observe that BaryScore+ is the best performing metric on both datasets for all languages. Similarly to (Zhao et al., 2019b), we observe an improvement when using their version of BERT pretrained on MNLI (Wang et al., 2018). Comparing the best performance achieved by BaryScore with that of MoverScore, we hypothesize that the Wasserstein barycenter preserves more of the geometric properties of the information learnt by BERT. Correlation analysis. Figure 4 reports the inter-correlation across metrics according to the Kendall τ. We observe that the metrics based on BERT (e.g., BertScore, MoverScore and BaryScore) obtain medium to high correlations, demonstrating that both the aggregation mechanism (e.g., single layer selection, power mean or Wasserstein barycenter) and the choice of similarity metric (e.g., cosine similarity, Wasserstein distance) affect the ranking of the predictions. Takeaways: Overall, BaryScore is particularly suitable for comparing two examples and could thus be used as an alternative to the standard cross-entropy loss to train NMT systems. Moreover, our implementation, based on POT (Flamary and Courty, 2017), achieves a speed comparable with MoverScore's: we process over 180 sentence pairs per second with BaryScore, compared to 195 sentence pairs per second with MoverScore, on an NVIDIA V100 GPU. Figure 5 reports results on the summarization task.

Summarization
We are able to reproduce the performance reported in the original paper (Bhandari et al., 2020). In contrast to MT, we observe that no metric outperforms all others on all correlation measurements. We also notice that the improvement induced by the BERT model fine-tuned on MNLI is not observed on this dataset, for either BaryScore or MoverScore.

Figure 7: Correlation at the system level with human judgment along five different axes: correctness, data coverage, fluency, relevance and text structure for the WebNLG task. The overall best result is bolded.
Consistency and robustness of BaryScore.
In contrast to what is observed for abstractive systems, we observe a strong inconsistency in the behavior of the previous BERT-based metrics for extractive systems. Indeed, at the text level, BertS-R, MoverScore and MoverScore+ achieve medium to high correlations, whereas at the system level the correlation collapses (scores below 20 points). BertS-F, on the contrary, under-performs at the text level for extractive systems but achieves performance competitive with ROUGE (the best performing metric) at the system level. We observe that the Wasserstein barycenter is a better way to aggregate the layers and provides better robustness, as it alleviates the aforementioned problem. Indeed, the performance achieved by BaryScore is competitive at both the text and system levels.
Takeaways: Overall, the two versions of BaryScore are among the best performing metrics: they outperform current BERT-based metrics on 3/4 configurations and achieve consistent performance on the fourth. The consistent behavior of BaryScore demonstrates the validity of our approach for summarization. Nevertheless, for 3/4 configurations, the ROUGE score remains a simpler and lighter alternative to BaryScore and other BERT-based metrics for comparing systems on summarization.

Data2Text
Figure 7 reports results on the data2text task using the WebNLG 2020 data. To the best of our knowledge, this is one of the first studies using this dataset. Figure 7 shows a strong correlation between the three evaluation dimensions, with correlations r and ρ higher than 90. We observe that BaryScore consistently outperforms metrics based on BERT and achieves the best results in 6/9 configurations for BaryScore and 2/9 for BaryScore+.

Image Captioning
We follow (Zhang et al., 2019; Zhao et al., 2019b) and report in Figure 6 the Pearson correlation coefficients between predictions and system level human judgments. Although we were unable to exactly reproduce the results of (Zhao et al., 2019b), we obtain comparable numbers and similar orderings.
Takeaways: BaryScore outperforms current metrics, except for LEIC, which relies on information extracted from both the image and the text. These results validate the use of BaryScore to compare the performance of image captioning systems.

Summary and Concluding Remarks
In this paper, we present a metric named BaryScore which relies on optimal transport and resolves the geometric discrepancies present in existing metrics that use contextualized embeddings with the WMD. The present work is carried out in the context of NLG, but it introduces a generic, theoretically grounded framework that could be extended to other NLP studies. In particular, it illustrates an application of Wasserstein barycenters to combine the different views offered by the layers of a deep neural network. Specifically, future work includes testing Wasserstein barycenters in a multimodal setting (Garcia et al., 2019; Colombo et al., 2021a), for classification (e.g., emotion (Witon et al., 2018), dialog act (Chapuis et al., 2020; Colombo et al., 2020; Chapuis et al., 2021) and stance (Dinkar et al., 2020)), and for controlling style in NLG (Jalalzai et al., 2020; Colombo et al., 2019, 2021b).

Acknowledgment
Pierre is funded by IBM. This work was also granted access to the HPC resources of IDRIS under the project 2021-101838 made by GENCI.

Appendix
We gather additional experimental results. In particular, we provide statistical analyses on WMT16 and WebNLG 2020.

Statistical analysis of WMT16
We report in Figure 11 the inter-correlation across metrics on WMT16.

Statistical analysis of data2text
We report in Figure 15 the results of the Williams test on data2text generation.