Reproducibility Issues for BERT-based Evaluation Metrics

Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) pointed out reproducibility problems of the then-dominant metric, BLEU. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code and (iii) reporting weaker results for the baseline metrics. (iv) In one case, the problem stems from correlating not with human scores but with a wrong column in the csv file, inflating scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study where we examine its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflectional languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover's Distance).


Introduction
Reproducibility is a core aspect of machine learning (ML) and natural language processing (NLP). It requires that claims and results of previous work can be independently reproduced and is a prerequisite for trustworthiness. The last few years have seen vivid interest in the topic and many issues of non-reproducibility have been pointed out, leading to claims of a "reproducibility crisis" in science (Baker, 2016). In the field of evaluation metrics for natural language generation (NLG), the seminal work of Post (2018) demonstrated how different preprocessing schemes can lead to substantially different results when using the dominant metric at the time, BLEU (Papineni et al., 2002). Thus, when researchers employ such different preprocessing steps (a seemingly innocuous decision), this can directly lead to reproducibility failures of (conclusions regarding) metric performances.
Even though BLEU and similar lexical-overlap metrics still appear to dominate the landscape of NLG (particularly MT) evaluation (Marie et al., 2021), it is obvious that metrics which measure surface-level overlap are unsuitable for evaluation, especially for modern text generation systems with better paraphrasing capabilities (Mathur et al., 2020). As a remedy, multiple higher-quality metrics based on BERT and its extensions have been proposed in the last few years (Zhang et al., 2019; Zhao et al., 2019). In this work, we investigate whether these more recent metrics have better reproducibility properties, thus filling a gap for the newer paradigm of metrics. We have good reason to suspect that reproducibility will be better: (i) as a response to the identified problems, recent years have seen many efforts in the NLP and ML communities to improve reproducibility, e.g., by requiring authors to fill out specific checklists.1 (ii) Designers of novel evaluation metrics should be particularly aware of reproducibility issues, as reproducibility is a core concept of proper evaluation of NLP models (Gao et al., 2021).
Our results are disillusioning: out of the four metrics we tested, three exhibit (severe) reproducibility issues. The problems relate to (i) heavy use of (undocumented) preprocessing, (ii) missing code, (iii) reporting lower results for competitors, and (iv) correlating with the wrong columns in the evaluation csv file. Motivated by the findings on the role of preprocessing and following Post (2018), we then study its impact more closely in the second part of the paper (for those metrics making use of it), finding that it can indeed lead to substantial performance differences also for BERT-based metrics. The code for this work is available at https://github.com/cyr19/Reproducibility.

Related Work
Prior work relevant to this paper includes BERT-based evaluation metrics (Section 2.1) and reproducibility in NLP (Section 2.2).

BERT-based Evaluation Metrics
In recent years, many strong automatic evaluation metrics based on BERT (Devlin et al., 2018) or its variants have been proposed. It has been shown that those BERT-based evaluation metrics correlate much better with human judgements than traditional evaluation metrics such as BLEU (Papineni et al., 2002). Popular supervised BERT-based evaluation metrics include BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020), which are trained on segment-level human judgements such as DA scores in WMT datasets. Unsupervised BERT-based evaluation metrics such as BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019), BaryScore (Colombo et al., 2021) and XMoverScore (Zhao et al., 2020) do not use training signals, and thus potentially generalize better to unseen language pairs (Belouadi and Eger, 2022). MoverScore, BaryScore, and BERTScore are reference-based evaluation metrics. In contrast, reference-free evaluation metrics directly compare system outputs to source texts. For MT, popular such metrics are Yisi-2 (Lo, 2019), XMoverScore, and SentSim (Song et al., 2021).

Reproducibility in NLP
Cohen et al. (2018) define replicability as the ability to repeat the process of experiments and reproducibility as the ability to obtain the same results. They further categorize reproducibility along 3 dimensions: (1) reproducibility of a conclusion, (2) reproducibility of a finding, and (3) reproducibility of a value. In a more recent study, Belz et al. (2021) categorize reproduction studies according to the condition of the reproduction experiment: (1) reproduction under the same condition, i.e., reusing as similar as possible resources and mimicking the authors' experimental procedure as closely as possible; (2) reproduction under varied conditions, aiming to test whether the proposed methods can obtain similar results with some changes in the settings; (3) multi-test and multi-lab studies, i.e., reproducing multiple papers using uniform methods and multiple teams attempting to reproduce the same paper, respectively.
In the first part of this work, our reproductions follow the first type described by Belz et al. (2021), i.e., we adhere to the original experimental setup and re-use the resources provided by the authors whenever possible, aiming at exact reproduction. The second part falls into the second category of reproduction study described by Belz et al. (2021), i.e., we change some settings on purpose to see if comparable results can be obtained.
According to Fokkens et al. (2013) and Wieling et al. (2018), the main challenge for reproducibility is the unavailability of source code and data. Dakota and Kübler (2017) study reproducibility for text mining. They show that 80% of the failed reproduction attempts were due to the lack of information about the datasets. To investigate the availability of source data, Mieskes (2017) conducted quantitative analyses on the publications from five conferences. They found that though 40% of the papers reported having collected or changed existing data, only 65.2% of those papers provided links to download the data, and 18% of these links were invalid. Similarly, Wieling et al. (2018) assessed the availability of both source code and data of papers from two ACL conferences (2011 and 2016). When comparing 2016 to 2011, the availability of both data and code increased, suggesting a growing trend of sharing resources for reproduction. However, even using the same code and data, they could only recreate identical values for one paper. More recently, Belz et al. (2021) analyzed 34 reproduction studies under the same condition (re-using the original resources when possible) for NLP papers. They found that only a small portion (14.03%) of values could be exactly reproduced and that the majority (59.2%) of the reproduced values lead to worse results. Moreover, 1/4 of the deviations are >5%.
In NLG, Post (2018) attests to the non-comparability of BLEU (Papineni et al., 2002) scores across different papers. He argues that there are four causes. First, BLEU is a parameterized approach; he shows that on WMT17 (Bojar et al., 2017), the BLEU score for en-fi increases by roughly 3% Pearson from changing parameters regarding multiple references. The second issue, which is regarded as the most critical, is the use of different preprocessing schemes. Among these, tokenization of the references plays a key role. The third problem is that preprocessing details are often omitted in papers. The fourth problem is different versions of datasets, in his case a particular problem with the en-de language pair in WMT14 (Macháček and Bojar, 2014). The reproducibility issue of BLEU has also been verified by Belz et al. (2022), using their novel approach, which is designed to quantify the degree of reproducibility.
Metrics MoverScore measures the semantic similarity between reference and hypothesis by aligning semantically similar words and computing the distance between these words using the Word Mover's Distance (Kusner et al., 2015). BERTScore calculates the cosine similarity (of BERT representations) of each token in the reference with each token in the hypothesis, and uses greedy alignment to obtain the similarity scores between sentences. It has three variants: Recall, Precision, and F1. BaryScore computes the Wasserstein distance (i.e., Earth Mover's Distance (Rubner et al., 2000)) between the barycentric distribution (Agueh and Carlier, 2011) of the contextual representations of reference and hypothesis to measure the dissimilarity between them. SentSim has both reference-free and reference-based versions; we experiment with its reference-free version in this work, which combines sentence-level (based on Reimers and Gurevych (2020)) and word-level models (extending a.o. BERTScore to the multilingual case) to score a pair of source text and hypothesis.
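To make the difference between the aggregation mechanisms concrete, the following is a minimal sketch of BERTScore-style greedy matching over token embeddings. It is an illustration only, not the authors' implementation (which additionally handles tokenization, IDF-weighting and baseline rescaling); the function name and the toy embeddings are our own.

```python
import numpy as np

def greedy_bertscore(ref_emb: np.ndarray, hyp_emb: np.ndarray):
    """Illustrative BERTScore-style scoring over (num_tokens x dim) embedding matrices."""
    # Normalize rows so that dot products equal cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    hyp = hyp_emb / np.linalg.norm(hyp_emb, axis=1, keepdims=True)
    sim = hyp @ ref.T                       # pairwise cosine similarities
    precision = sim.max(axis=1).mean()      # each hypothesis token -> best reference token
    recall = sim.max(axis=0).mean()         # each reference token -> best hypothesis token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy usage with random "embeddings"; in practice these come from BERT.
rng = np.random.default_rng(0)
p, r, f1 = greedy_bertscore(rng.normal(size=(7, 768)), rng.normal(size=(5, 768)))
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

MoverScore and BaryScore replace the greedy max-matching step with a transport-based alignment (Word Mover's Distance and Wasserstein barycenters, respectively) over the same kind of token embeddings.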

Reproduction Attempts
Our main focus will be to reproduce the results on machine translation (MT) reported in Zhang et al. (2019), Zhao et al. (2019), Colombo et al. (2021) and Song et al. (2021).
We evaluate the three metrics with the same BERT model (BERT-base-uncased) on all MT datasets mentioned above, using the reproduction resources provided by the authors of each metric. We also evaluate MoverScore and BaryScore with a BERT model finetuned on NLI (Wang et al., 2018) (as in the original papers). The code and data for reproduction were released on the respective GitHub repositories.4 In our reproduction experiments, we use the metrics with the configurations found in their evaluation scripts or papers. Although Zhao et al. (2019) also reported results for BERTScore-F1, they did not provide information about the parameter settings used. Similarly, Colombo et al. (2021) evaluated the other two metrics on WMT15-16, but except for the model choice, all other settings are unclear. Moreover, except for Zhang et al. (2019), who explicitly state which results were obtained using IDF-weighting, the authors of the other two approaches did not mention this in their papers. For unclear metric configurations, we keep the default settings. The configurations used here are as follows (a sketch of a corresponding call is shown below):
• BERTScore We report the reproduced results for BERTScore-F1 using BERT-base-uncased, the default layer 9 of the BERT representation for this model, and IDF-weighting.
• MoverScore We report the reproduced results for unigram MoverScore (MoverScore-1) using BERT-base-uncased or its finetuned version on MNLI, the last five layers from BERT aggregated by power means (Rücklé et al., 2018), IDF-weighting, punctuation removal, and subword removal (keeping only the first subword of each word).
• BaryScore We report the reproduced results for BaryScore5 using BERT-base-uncased or its finetuned version on MNLI6, the last five layers aggregated using the Wasserstein barycenter, and IDF-weighting.
The metrics with finetuned models are marked with + in the following.
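To illustrate how such a configuration can be pinned down explicitly, a BERTScore-F1 call along these lines might look as follows. This is a sketch using the public bert-score package, not necessarily the exact command used in the original or our reproduction scripts.

```python
from bert_score import score

hyps = ["the cat sat on the mat"]
refs = ["a cat is sitting on the mat"]

# Make every setting that influences the score explicit: model, layer, IDF-weighting.
P, R, F1 = score(
    hyps,
    refs,
    model_type="bert-base-uncased",
    num_layers=9,   # representation layer used in our BERTScore-F1 reproduction
    idf=True,       # IDF weights computed from the reference corpus passed in
)
print(f"F1 = {F1.mean().item():.4f}")
```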

Results
As Table 1 shows, we do not obtain identical results for BERTScore-F1 with Zhang et al. (2019) on WMT18 to-English language pairs. The maximal deviation between the reported and reproduced results occurs for de-en - around 0.003 absolute Pearson's r. Most of the deviations are about 0.001. This might be because of tiny differences in rounding strategies, random seeds7, etc. Further, among the three evaluation metrics, BERTScore-F1 performs best, whereas BaryScore is worst.

Table 2 displays the reproduction results on WMT17 to-English language pairs, leveraging the resources from Zhao et al. (2019). For MoverScore-1+, 5 out of 7 values can be perfectly reproduced (excluding the average value). The unreproducible results on fi-en and lv-en are 0.012 and 0.031 lower than the reported ones, respectively. In personal communication, the authors told us that they changed the preprocessing for these settings, which is impossible to identify from the released paper or code. We obtain a comparable average value for BERTScore-F1 with Zhao et al. (2019) (0.718 vs. 0.719), but the results on individual language pairs differ. Except for fi-en, MoverScore-1+ correlates better with humans than BERTScore-F1, which is in line with the observation of Zhao et al. (2019). When applying the same BERT model, BaryScore performs slightly worse than the other two metrics, except for tr-en.
Table 3 shows the results of the reproduction attempts on WMT15-16 based on the code and data provided by Colombo et al. (2021). Colombo et al. (2021) reported Pearson, Spearman and Kendall correlation with human ratings; we relegate the reproduction results for Kendall and Spearman correlation, which are similar to those for Pearson correlation, to Section A.2. We are not able to reproduce identical values for any evaluation metric, even for BaryScore. However, the reproduced results for BaryScore and BaryScore+ are comparable with the reported ones - around 0.001 Pearson off the reported average values in 3 out of 4 cases. For BERTScore-F1, the reproduced average values are around 0.005 Pearson better than the reported ones, while for MoverScore/MoverScore+, they are about 0.05 Pearson better. Colombo et al. (2021) observed that BaryScore+ performs best on all language pairs in WMT15-16, which is inconsistent with our observation: MoverScore-1+ outperforms BaryScore+ on half the language pairs in these two datasets. With BERT-base-uncased, however, BaryScore performs best among the three evaluation metrics on these two datasets: it achieves the highest correlation on 6 out of 10 language pairs.

Summary We can rarely reconstruct identical values but obtained comparable results for the three discussed metrics, even when some of the metric configurations are missing. However, we can overall not reproduce the conclusions, for four main reasons: (i) authors report lower scores for competitor metrics; (ii) authors selectively evaluate on specific datasets (maybe omitting those for which their metrics do not perform well?); (iii) unlike the authors of BERTScore, the authors of BaryScore and MoverScore do not provide a unique hash, making reproduction of the original values more difficult; (iv) undocumented preprocessing is involved.
Following the three reproduction attempts, we cannot conclude that the newer approaches are better than the prior one (BERTScore), as Zhao et al. (2019) and Colombo et al. (2021) claim. We also point out that the three metrics perform very similarly when using the same underlying BERT model; using a BERT model fine-tuned on NLI seems to have a bigger impact. This casts some doubt on whether the more complicated word alignments (as used in BaryScore and MoverScore) really have a critical effect.
SentSim For reference-free evaluation, Song et al. (2021) use MLQE-PE as their primary evaluation dataset. They compare SentSim to so-called glass-box metrics, which actively incorporate the MT system under test into the scoring process (Fomicheva et al., 2020a).
Using the original model configuration, we were able to exactly reproduce the reported scores for all SentSim variants on MLQE-PE. However, we noticed that the provided code for loading the dataset does not retrieve human judgments but averaged log-likelihoods of the NMT model used to generate the hypotheses. Since computing correlations with model log-likelihoods is not meaningful and the z-standardized means of the human judgments that should have been used instead are in an adjacent column of the dataset, we assume that this is an off-by-one error.
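To illustrate the nature of the bug, the following is a minimal sketch with toy data; the column names are hypothetical stand-ins for the MLQE-PE fields, not the identifiers in the released files.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical stand-in for MLQE-PE segment-level annotations (column names are illustrative).
df = pd.DataFrame({
    "z_mean_human_score":   [0.8, -1.2, 0.1, 1.5, -0.4],    # what should be correlated against
    "model_log_likelihood": [-0.3, -2.1, -0.9, -0.2, -1.4], # what was accidentally used instead
})
metric_scores = [0.71, 0.32, 0.55, 0.83, 0.47]               # toy metric outputs

# Off-by-one column choice: one correlation is meaningless, the other is the intended one.
print("vs. log-likelihoods :", pearsonr(metric_scores, df["model_log_likelihood"])[0])
print("vs. human judgments :", pearsonr(metric_scores, df["z_mean_human_score"])[0])
```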
Table 4 shows how much fixing this error affects the achieved correlations of BERTScore- and WMD-based SentSim. The baselines were not affected by this, as Song et al. (2021) copied their scores from the original papers. Evaluation on human judgments leads to vast score differences on many language pairs. This is especially noticeable for English-German and English-Chinese language pairs, where the correlations achieved with our fixed implementation are substantially worse. This result is much more in line with the findings of related research, which also notes very poor performance for these languages on this dataset (Fomicheva et al., 2020b; Specia et al., 2020). We note that after fixing the error, SentSim falls below the baselines, which it had otherwise outperformed.

Reproduction for other tasks
In Section A.3, we reproduce results for other tasks, especially summarization, image captioning and data-to-text generation, with a focus on MoverScore. We find that we can only reproduce the reported results for summarization; our results are on average 0.1 Pearson's r (-12.8%) lower for IC and 0.06 Spearman's ρ (-27.8%) lower for D2T generation. A reason is that the authors of MoverScore did not release their evaluation scripts for these tasks and we can only speculate as to the preprocessing steps they employed. As long as these are not reported in the original papers or released code, claims regarding the performance of the metrics are hard to verify.8

Sensitivity Analysis
In the previous section, we have seen that preprocessing may play a vital role for obtaining state-of-the-art results (at least for some of the metrics). Similar to the case of BLEU (Post, 2018), we now examine this aspect in more detail.
According to the papers and evaluation scripts, MoverScore uses the following main preprocessing steps (besides those handled by BERT): (i) Subwords Removal: discard the BERT representations of all subwords except the first; (ii) Punctuation Removal: discard the BERT representations of punctuation; (iii) Stopwords Removal: discard the BERT representations of stopwords (only for summarization).9 The preprocessing steps for BERTScore and BaryScore only involve lowercasing and tokenization, both of which are handled by BERT. We observe that (i) MoverScore uses much more preprocessing than BERTScore and BaryScore on the WMT datasets; (ii) authors may take different preprocessing steps for different tasks, e.g., Zhao et al. (2019) remove stopwords for summarization but not for MT.
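As an illustration, the following sketch shows how such token filters could be applied to BERT wordpiece tokens before scoring. It is our own simplified rendering, not the released MoverScore code, and the stopword list is a toy placeholder.

```python
import string

TOY_STOPWORDS = {"the", "a", "is", "on"}   # placeholder; real experiments use full lists

def keep_token(token: str, remove_subwords=True, remove_punct=True, remove_stop=False) -> bool:
    """Decide whether a wordpiece token's BERT representation is kept."""
    if remove_subwords and token.startswith("##"):      # (i) keep only the first subword of a word
        return False
    if remove_punct and all(ch in string.punctuation for ch in token):  # (ii) drop punctuation
        return False
    if remove_stop and token.lower() in TOY_STOPWORDS:  # (iii) drop stopwords (summarization only)
        return False
    return True

tokens = ["the", "smart", "##er", "cat", ",", "sat", "."]
print([t for t in tokens if keep_token(t, remove_stop=True)])   # -> ['smart', 'cat', 'sat']
```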
Besides preprocessing in a narrower sense, all three considered evaluation metrics use parameters. This makes them more flexible, but also complicates reproduction: a difference in one parameter setting can lead to reproduction failure. We study the impact of the parameters related to IDF-weighting. IDF-weighting measures how critical a word is to a corpus; thus, it is corpus-dependent. The choice of corpus may lead to deviations of metric scores.
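As an example of this corpus dependence, a schematic IDF computation looks as follows; the actual implementations compute IDF over wordpiece tokens of the reference and/or hypothesis corpus, and smoothing details vary, so this is only an illustration.

```python
import math
from collections import Counter

def idf_weights(corpus_tokenized):
    """IDF per token type: rarer tokens in the corpus receive larger weights."""
    n_docs = len(corpus_tokenized)
    doc_freq = Counter(tok for doc in corpus_tokenized for tok in set(doc))
    return {tok: math.log((n_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}

corpus_a = [["the", "cat", "sat"], ["the", "dog", "ran"]]
corpus_b = [["stocks", "fell", "sharply"], ["the", "market", "fell"]]

# The same token receives different weights depending on the chosen IDF corpus.
print(idf_weights(corpus_a).get("the"), idf_weights(corpus_b).get("the"))
```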
MoverScore is the main object of study in the remainder. Compared to the other metrics, its authors took more preprocessing steps to achieve the results in their paper, suggesting that different users are more likely to obtain non-comparable scores when using MoverScore. We will also investigate the sensitivity of BERTScore to the factors discussed above; we omit BaryScore and SentSim from further consideration. Importantly, we move beyond the English-only evaluation reported in the original MoverScore paper. This will estimate how much uncertainty there is from preprocessing when a user applies MoverScore to a non-English language pair, which requires new IDF corpora and new stopword lists and may involve higher morphological complexity (which relates to the choice of subwords).
We use two statistics to quantify the sensitivity of the evaluation metrics. When there are only two compared values a, b, we compute the Relative Difference (RD) to reflect the relative performance variation regarding a certain parameter:

RD(a, b) = (a − b) / b.

When there are more than two compared values, we compute the Coefficient of Variation (CV) to reflect the extent of variability of the metric performance:

CV = σ / µ,

where σ is the standard deviation and µ is the mean of a set of values x. Larger absolute values of the statistics indicate higher sensitivity of the evaluation metrics.
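In code, the two statistics amount to the following (a direct transcription of the definitions above; the definition leaves open whether σ is the population or sample standard deviation, so the sketch uses the population form):

```python
import statistics

def relative_difference(a: float, b: float) -> float:
    """RD(a, b): relative change of a with respect to b; negative means b is larger."""
    return (a - b) / b

def coefficient_of_variation(values) -> float:
    """CV: standard deviation relative to the mean of a set of correlation values."""
    return statistics.pstdev(values) / statistics.mean(values)

print(relative_difference(0.68, 0.70))                  # e.g., disabling vs. original IDF-weighting
print(coefficient_of_variation([0.70, 0.68, 0.73, 0.71]))
```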
We only consider MT and summarization evaluation in this part. In each experiment, we only adjust the settings of the tested factors and keep the others at their defaults (given in Section A.5). In addition to English ("to-English"), we consider MT evaluation for 6 other languages ("from-English"), for which we use multilingual BERT: Chinese (zh), Turkish (tr), Finnish (fi), Czech (cs), German (de), and Russian (ru). Note that in these cases, we compare a Chinese reference to a Chinese hypothesis, and analogously for the other languages.

Stopwords Removal
In this experiment, we consider 4 stopword settings, including disabling stopwords removal and applying 3 different stopword lists for the examined languages. We obtain the stopword lists from the resources listed in Section A.6. We inspect the sensitivity of MoverScore-1, MoverScore-2 (MoverScore using bigrams) and BERTScore-F1 to the stopword settings, even though BERTScore does not originally employ stopword removal.
For English MT, we calculate the CV of the correlations with humans over the 4 stopword settings for each language pair in the datasets, then average the CVs over the language pairs in each dataset to obtain the average CV per dataset. For summarization, we calculate the CV of the correlations over the 4 stopword settings for each criterion on each dataset.10

Results On segment-level MT, as Figure 1 (top) shows, the sensitivity varies across datasets and languages. Most of the CV_STOP values are in the range of 2-4%. This leads to 6-11% absolute variation of the metric performance when the average correlation is, for example, 0.7 (95% confidence interval). For some datasets and languages, the variation is even more pronounced: for example, for Russian on WMT17, CV_STOP is above 10%.
Among the examined metrics, MoverScore-2 behaves slightly more sensitively than MoverScore-1, whereas BERTScore-F1 is much more sensitive than MoverScore-1 on Chinese and English. Compared to other tasks, stopwords removal has the largest (but negative) impact in segment-level MT evaluation (cf. Section A.7).

Figure 2: RD(dis,ori) (for IDF-weighting) and RD(dis,pr) (for punctuation removal). WMT17-19, segment-level evaluation, MoverScore-1. The top graphs in (a) and (b) show the results on to-English language pairs (where metrics operate on English texts), whereas the bottom ones show those on from-English language pairs (where metrics operate on texts in other languages).

IDF-weighting
In this test, we first disable IDF-weighting for the evaluation metrics (idf_dis) and compare the metric performance to that obtained with the original IDF-weighting11 (idf_ori) by calculating the RD between them. We denote this statistic as RD(dis,ori); negative values indicate that idf_ori works better, and vice versa. Next, to inspect the sensitivity to varying IDF-weighting corpora, we additionally apply IDF-weighting from four randomly generated corpora (idf_rand): each corpus consists of 2k English segments randomly selected from the concatenation of all tested datasets. The corresponding variability of the metric performance is quantified by the CV of the correlations with humans over the 5 IDF-weighting corpus selections (idf_ori + 4 idf_rand), marked as CV_IDF. We examine the sensitivity regarding IDF-weighting of MoverScore-1, MoverScore-2, and BERTScore-F1. Subsequently, we test IDF-weighting from large-scale corpora (idf_large). These corpora are obtained from Hugging Face Datasets.12

Results As seen in Figure 2(a), RD(dis,ori) is positive on only one to-English language pair (WMT19 kk-en), but on three from-English language pairs (WMT17 en-de, en-zh, and en-tr). Overall, IDF-weighting is thus beneficial. The maximal performance drops are on WMT19 de-en (>35%) and en-de (>10%), respectively. Most RD(dis,ori) values have absolute values <5%. This means that, if the correlation is 0.7, the performance can fall by around 0.035 because of disabling IDF-weighting.
Next, CV_IDF for segment-level MT is presented in Figure 1 (middle). In English evaluation, the maximal variation is also caused by the result for de-en in WMT19, where idf_ori yields a considerably better result than idf_rand (0.22 vs. 0.17 Kendall's τ). While en-de has CV values above 4.5%, most CV_IDF values are smaller than 1%.
BERTScore-F1 is less sensitive to IDF-weighting than both MoverScore variants. Among the evaluation tasks, the metrics are again most sensitive on segment-level MT, where, for English, idf_ori works best for MoverScore (even idf_large cannot improve its performance), while idf_rand and idf_ori are almost equally effective for BERTScore-F1 (cf. Section A.9).

Subwords & Punctuation
In this experiment, we evaluate the sensitivity to (i) subword selection and (ii) punctuation removal (PR). (i) In addition to the original two selections of subwords (keeping the first subword and keeping all subwords), we also average the embeddings of the subwords in a word to get word-level BERT representations. To quantify the sensitivity to subword selection, we calculate the CV of the correlations with humans over the 3 subword selections, denoted as CV_SUB. (ii) We measure the performance change from using to disabling PR by calculating the RD between them, which we denote as RD(dis,pr); negative values indicate that MoverScore with PR performs better, and vice versa. We inspect the corresponding sensitivity of MoverScore-1.
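The three subword selections can be sketched as follows for a single word split into several wordpieces (our own illustration; toy vectors stand in for actual BERT embeddings):

```python
import numpy as np

def word_representation(subword_embs: np.ndarray, mode: str) -> np.ndarray:
    """Collapse the (num_subwords x dim) embeddings of one word into word-level vectors."""
    if mode == "first":   # keep only the first subword (MoverScore's default)
        return subword_embs[:1]
    if mode == "all":     # keep every subword embedding
        return subword_embs
    if mode == "mean":    # average subword embeddings into a single vector
        return subword_embs.mean(axis=0, keepdims=True)
    raise ValueError(mode)

# Toy embeddings for a word split into three wordpieces, e.g. "un", "##happi", "##ness".
subwords = np.random.default_rng(1).normal(size=(3, 768))
for mode in ("first", "all", "mean"):
    print(mode, word_representation(subwords, mode).shape)
```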
Results Figure 2(b) shows that most RD(dis,pr) values have absolute values <1%, while both values for en-tr are >3%. Further, CV_SUB for segment-level MT is presented in Figure 1 (bottom). The average CV_SUB over all datasets is <2% for most languages, whereas highly inflectional languages such as Turkish and Russian are considerably more sensitive, with average values >4%.
As for stopwords and IDF-weighting, MoverScore-1 behaves most sensitively on segment-level MT, where the default configuration of PR and subwords, which uses the first subword and removes punctuation, works best for English. However, for other languages, only in 2 out of 16 cases is the default configuration the best choice (cf. Section A.10). As the authors of MoverScore only reported results on English data, they may thus have selected a preprocessing strategy that is optimal only for that case.

Discussion
We summarize the findings from the previous experiments along 4 dimensions.
Evaluation Tasks: Among the considered NLG tasks, BERT-based evaluation metrics are more likely to generate inconsistent scores in segment-level MT evaluation. Their sensitivity is less pronounced in system-level MT and summarization. In the latter two cases, average scores are considered, over the translations within one system or over the multiple references. Thus, some of the variation in metric scores will cancel out, leading to less fluctuating metric performance under varying preprocessing schemes.

Evaluation Metrics: Among the two variants of MoverScore, MoverScore-2 is more sensitive to parameter settings. BERTScore-F1 behaves less sensitively to IDF-weighting than MoverScore, while it behaves much more sensitively to stopwords in the evaluation of Chinese and English compared with MoverScore-1.

Languages: Overall, the considered evaluation metrics have different sensitivities for different languages. Furthermore, highly inflectional languages such as Turkish and Russian, as well as German, often become "outliers" or yield extreme values in our experiments.

Importance of Factors: Stopwords removal has the largest but mostly negative impact. IDF-weighting positively impacts evaluation metrics in English evaluation, but its contribution is much less stable in the evaluation of other languages. MoverScore benefits from subword and punctuation removal in segment-level MT evaluation for English, but on other tasks or for other languages, no configuration of PR and subword selection consistently performs best.

Conclusion
We investigated reproducibility for BERT-based evaluation metrics, finding several problematic aspects, including heavy undocumented preprocessing, reporting lower scores for competitors, selective evaluation on datasets, and copying correlation scores from wrong indices. Our findings cast some doubt on previously reported results and findings, i.e., whether the more complex alignment schemes are really more effective than the greedy alignment of BERTScore. In terms of preprocessing, we found that it can have a large effect depending (a.o.) on the languages and tasks involved. For a fairer comparison between metrics, we recommend (1) additionally reporting results on the datasets that the competitors used, (2) checking whether the used versions of the competitor metrics obtain results comparable to those in the original papers, and (3) minimizing the role of preprocessing (ideally employing uniform preprocessing across metrics). On the positive side, as authors are nowadays much more willing to publish their resources, it is considerably easier to spot such problems, which may also be one reason why critique papers such as ours have become more popular in the last few years (Beese et al., 2022). In a wider context, our paper contributes to addressing the "cracked foundations" of evaluation for text generation (Gehrmann et al., 2022) and to better understanding their limitations (Leiter et al., 2022).
In the future, we would like to reproduce more recent BERT-based metrics - e.g., with other aggregation mechanisms (Chen et al., 2020), normalization schemes (Zhao et al., 2021), different design choices (Yuan et al., 2021; Chen and Eger, 2022), or metrics that use supervision (Rei et al., 2020; Sellam et al., 2020; Rony et al., 2022) - to obtain a broader assessment of reproducibility issues in this context. We would also like to quantify, at a larger scale, the bias in research induced from overestimating one's own model vis-à-vis competitor models.

Limitations
Limitations of our work include (1) a limited number of explored evaluation metrics, (2) a restricted focus on MT only, and (3) reliance on author-provided reproduction resources.
(1) Although we did point out very important issues, we only reproduced four metrics. Further, the sensitivity analysis only concerned two evaluation metrics. In the future, we would like to include more reproducibility studies on recent BERT-based evaluation metrics for a broader analysis. It is possible that our particular sample is representative of more severe underlying problems in the community or that it is particularly affected by reproducibility issues.
(2) Our reproduction attempts, with the exception of MoverScore, focused only on MT. For example, the authors of BaryScore also reported results on summarization, IC, and D2T generation, which (due to computational costs) we did not consider in this work. While we believe that our findings generalize from MT to other tasks, we did not confirm this expectation experimentally.
(3) Our reproduction attempts were mainly based on the author-provided resources, such as the code and datasets they released, with which we could obtain comparable results in most instances. Nevertheless, we did not investigate their legitimacy, e.g., whether the implementation of the approach is in accordance with the description in its paper or whether the datasets uploaded by the authors are the official ones, etc.

A.1.1 Machine Translation

[…] Bojar et al. (2017). On those datasets, DA always serves as the gold standard for system-level evaluation. For segment-level evaluation, WMT18 and WMT19 use DArr, WMT15 and WMT16 rely on DA, and WMT17 uses DA for all to-English and 2 from-English language pairs (en-ru and en-zh) and DArr for the remaining from-English language pairs.

A.1.2 Text Summarization
Each TAC dataset contains several clusters, each with 10 news articles. There are more than 50 system summaries and 4 reference summaries with fewer than 100 words for each article. Each system summary receives 4 human judgements according to two criteria: 1) Pyramid, which reflects the level of content coverage of the summaries; and 2) Responsiveness, which measures the overall linguistic and content quality of the summaries. The difference between these two datasets is that TAC2008 contains 48 clusters and summaries from 57 systems, while TAC2009 contains 44 clusters and summaries from 55 systems. Zhao et al. (2019) calculated Pearson and Spearman correlation with summary-level human judgments when evaluating MoverScore. In addition, we compute Kendall correlation as well, allowing for a comparison among the three correlations.
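Computing the three correlation coefficients for a set of summary-level scores is straightforward, e.g. with scipy; the values below are toy numbers, not the actual evaluation data.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.61, 0.72, 0.35, 0.80, 0.55]   # toy metric outputs per summary
human_scores  = [3.0, 4.0, 2.0, 5.0, 3.5]        # toy Pyramid/Responsiveness judgments

print("Pearson :", pearsonr(metric_scores, human_scores)[0])
print("Spearman:", spearmanr(metric_scores, human_scores)[0])
print("Kendall :", kendalltau(metric_scores, human_scores)[0])
```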
A.1.3 Image Captioning

In the reproduction experiment, following the experimental setup of Zhao et al. (2019), we calculate the Pearson correlation with M1 and M2 scores, which refer to the ratio of captions better than or equal to human captions and the ratio of captions indistinguishable from human captions, respectively.

A.1.4 Data-to-Text Generation
There are 202 Meaning Representation (MR) instances in the BAGEL dataset and 398 MR instances in the SFHOTEL dataset. Multiple references and about two system utterances exist for each MR instance. The datasets provide utterance-level human judgments according to 3 criteria: 1) informativeness, which measures how informative the utterance is; 2) naturalness, which refers to the degree of similarity between a system utterance and a native-speaker-generated utterance; 3) quality, which reflects the fluency and grammaticality of a system utterance (Novikova et al., 2017). In the reproduction experiment, we follow Zhao et al. (2019) and calculate Spearman correlation with utterance-level human judgements for these 3 criteria.

A.2 Reproduction on WMT15-16
Tables 5 and 6 display the reproduced Spearman and Kendall correlations on WMT15 and WMT16.
A.3 Reproduction of other tasks

Zhao et al. (2019) released the evaluation scripts for WMT17 and TAC2008/2009 and the corresponding datasets on GitHub13; we take these as the resources for reproduction. For IC and D2T generation evaluation, we write our own evaluation scripts and download the datasets ourselves. We obtained the MSCOCO, BAGEL, and SFHOTEL datasets from an open question14 on its GitHub page, where Zhao et al. (2019) provided the links to download them. Since Zhao et al. (2019) did not provide much information about how they evaluated on MSCOCO, we also inspect the BERTScore paper (Zhang et al., 2019), where the authors give details of the evaluation process. As each system caption in MSCOCO has multiple references, it […] values are higher than the original, which are the results for informativeness on the SFHOTEL dataset. Besides, the reproduced values also deviate least in the assessment of this criterion on both datasets. As for IC, as Table 8 shows, the correlations for MoverScore are down by over 0.1 across all evaluation setups. Nevertheless, BERTScore-Recall performs on average even 0.03 better in our evaluation. This kind of inconsistency between the reproduction results for these two evaluation metrics may suggest that Zhao et al. (2019) did more preprocessing in the evaluation of IC, which is impossible for others to identify if the authors neither document it nor share the relevant code. In contrast, although different preprocessing schemes were applied to MT and summarization evaluation, it is possible to reproduce most of the values because Zhao et al. (2019) released the evaluation scripts. All of the facts mentioned imply the importance of sharing code and data for reproducibility. However, even with the author-provided code and datasets, there is no guarantee that the results can be perfectly reproduced. The authors may omit some details of the evaluation setup or metric configurations.

A.4 Subwords, Stopwords, Punctuation
Subword Removal BERT leverages a subword-level tokenizer, which breaks a word into subwords when the full word is not included in its built-in vocabulary (e.g., smarter → smart, ##er). BERT automatically tags all subwords except the first one with ##, so we can easily remove them. There are two advantages to doing so. Firstly, it can speed up the system due to the smaller number of embeddings to process. Secondly, it is sometimes as effective as lemmatization or stemming; e.g., the suffix "er" of the word "smarter" can be removed this way. In some cases, however, it may keep a less informative part, e.g., the prefix "un" in the word "unhappy".

Stopwords Removal & Punctuation Removal
Both of these common preprocessing techniques aim to remove less relevant parts of the text data. A typical stopword list consists of function words such as prepositions, articles, and conjunctions. As an example, MoverScore achieves a higher correlation with human judgments when removing stopwords in text summarization.

A.5 Default configuration of evaluation metrics
• MoverScore For English evaluation, we use the released version of MoverScore, which makes use of 1) the BERT-base-uncased model finetuned on the MNLI dataset, 2) the embeddings of the last five layers aggregated by power means, 3) punctuation removal and keeping only the first subword, and 4) IDF-weighting computed from references and hypotheses separately. We disable stopwords removal throughout the experiments except in the stopword tests. For other languages, we replace the model with multilingual BERT-base-uncased, to keep in line with the English evaluation.
• BERTScore For English evaluation, we use BERTScore with the BERT-base-uncased model, the default layer 9, and IDF-weighting computed from the references. For other languages, similar to MoverScore, we replace the model with the multilingual BERT-base-uncased model.

A.6 Stopword lists
For English, the first stopword list is obtained from the GitHub repository of MoverScore15, which contains 153 words. Since users may first choose existing stopword lists from popular libraries, we consider the stopword lists from NLTK (Bird et al., 2009) and SpaCy (Honnibal and Montani, 2017), which consist of 179 and 326 words, respectively. We obtain the stopword lists for other languages from: I. NLTK (Bird et al., 2009) […]

Figure 4 illustrates the distribution of the best stopword settings for English. In segment-level MT evaluation (Figure 4(a)), there is only one case in which the best result is achieved by removing stopwords, which occurs for MoverScore-1. In contrast, the best stopword lists for system-level MT evaluation can be any of the settings, for all evaluation metrics (Figure 4(b)). However, in about 50% of the test cases, MoverScore still performs best when disabling stopwords removal. In Pyramid evaluation (Figure 4(c)), MoverScore-1 achieves the best results using the original stopword list in all test cases, whereas disabling stopwords removal is still the best choice for MoverScore-2 and BERTScore-F1. In the evaluation of Responsiveness (Figure 4(d)), there are two cases (33.3%) in which MoverScore-1 applying the original stopword list performs best; this happens only once for MoverScore-2 (16.7%). BERTScore-F1 never benefits from stopwords removal on any evaluation task.
Further, in Table 10, we present the best stopword setting for all examined languages in segment-level MT evaluation. Except for Finnish and Turkish, disabling stopwords removal is always the best choice for all other languages. For Finnish, only on one dataset does MoverScore-1 perform better using stopwords removal, whereas for Turkish, both evaluation metrics achieve the best performance applying the same stopword lists. The reason might be that both Turkish and Finnish are agglutinative languages, and those languages […]
Further, Table 13 presents the results for idf_large in English evaluation. First, the size of those corpora is much larger than that of the original corpora, but MoverScore still performs better with the original IDF-weighting. Secondly, the results for Wikipedia show that the metric performance does not improve with increasing size of the IDF corpora. Thirdly, although those corpora contain articles from many domains, they do not provide more suitable IDF-weighting either. In conclusion, no IDF-weighting from large-scale, broad-domain corpora works as well as the original IDF-weighting in segment-level MT evaluation for English, where MoverScore-1 behaves most sensitively to IDF.

Figure 13: RD(dis,ori). WMT17-19, system-level evaluation, BERTScore-F1.

Figure 1: From top to bottom: CV_STOP, CV_IDF, CV_SUB. WMT17-19, segment-level evaluation, MoverScore-1. x-en denotes the average results on all to-English language pairs (where metrics operate on English texts).

Figure 4: Distribution of the best stopword setting of each evaluation metric on each evaluation task for English. The rings from the inside to the outside represent MoverScore-1, MoverScore-2 and BERTScore-F1. For MT, each language pair in the WMT datasets is regarded as a test case, resulting in 21 test cases (3 datasets times 7 language pairs). For summarization tasks, each type of correlation is regarded as a test case for each criterion, resulting in 6 test cases (3 correlations times 2 datasets). The MoverScore (153) and SpaCy (179) stopword lists yield exactly the same results.

Table 2: Segment-level Pearson's r on WMT17 to-English language pairs using the evaluation script provided by Zhao et al. (2019). Reported results are cited from Zhao et al. (2019). + refers to using the BERT-base-uncased model finetuned on MNLI. Values in green/red denote that the reproduced results are better/worse than the reported ones. Bold values refer to the best results with the BERT-base-uncased model. Values with * denote the best reproduced/reported results.

Table 3: Segment-level Pearson's r on WMT15-16 using the evaluation script provided by Colombo et al. (2021). Reported values are cited from Colombo et al. (2021). + refers to using the BERT-base-uncased model fine-tuned on MNLI. Values in green/red denote that the reproduced results are better/worse than the reported ones. Bold values refer to the best results with the BERT-base-uncased model. Values with * denote the best reproduced/reported results.

Table 4: Correlations of SentSim on MLQE-PE with model log-likelihoods (Reported), as erroneously done in the official paper, and with human judgments (Fixed). The green and red highlighted results on human judgments indicate that they are better or worse than the corresponding results computed with log-likelihoods. We cite baseline scores from Fomicheva et al. (2020a).

Table 8: System-level Pearson correlations of MoverScore-1 and BERTScore-R on the MSCOCO dataset. Original results are cited from Zhao et al. (2019) and Zhang et al. (2019). Bold values refer to reproduced results that are better than the original.

Table 10: Distribution of the best stopword settings for all tested languages in segment-level MT evaluation. Values indicate the size of the stopword lists.