Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors

Evaluation metrics are a key ingredient for progress of text generation systems. In recent years, several BERT-based evaluation metrics have been proposed (including BERTScore, MoverScore, BLEURT, etc.) which correlate much better with human assessments of text generation quality than BLEU or ROUGE, invented two decades ago. However, little is known about what these metrics, which are based on black-box language model representations, actually capture (it is typically assumed they model semantic similarity). In this work, we use a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap. We show that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE. This exposes limitations of these newly proposed metrics, which we also highlight in an adversarial test scenario.


Introduction
Evaluation metrics are a key ingredient in assessing the quality of text generation systems, be it machine translation, summarization, or conversational AI models. Traditional evaluation metrics in machine translation and summarization, BLEU and ROUGE (Papineni et al., 2002a; Lin, 2004), have measured lexical n-gram overlap between system prediction and a human reference. While simple and easy to understand, limitations of such lexical overlap metrics were recognized early on (Callison-Burch et al., 2006), e.g., in that they can only measure surface-level similarity, and they are especially inadequate when it comes to assessing current high-quality text generation systems (Rei et al., 2020; Mathur et al., 2020; Marie et al., 2021).
Recently, a class of novel evaluation metrics based on BERT and its variants has been explored that correlates much better with human assessments of translation quality. For example, BERTScore (Zhang et al., 2020), MoverScore (Zhao et al., 2019), BLEURT (Sellam et al., 2020), XMoverScore (Zhao et al., 2020), and COMET (Rei et al., 2020) all use large-scale pretrained language models, but differ in whether they compare hypotheses to references, to source texts, or to both, on the one hand, and whether they use human scores for supervision or not, on the other. Since these metrics all leverage large-scale language models, which have pushed the state of the art in many areas of NLP, their success comes as little surprise.
Better understanding these novel metrics, which are based on black-box language representations, is a prerequisite for identifying their limitations, e.g., against adversarial inputs. For example, if an evaluation metric is sensitive to lexical overlap, it can be fooled by inputs that use the same words in a different order.
In this work, we fill this 'explainability gap' and introspect the linguistic properties encoded in BERT-based evaluation metrics. Although there is already considerable work on introspecting and understanding BERT (see Rogers et al. (2020) for an overview), e.g., via probing (Tenney et al., 2019), analyses by Hewitt and Liang (2019) and Ravichander et al. (2021) indicate that probing results (based on supervision) are not always trustworthy. More importantly, the modern evaluation metrics sketched above rely on at least two factors: BERT (or its variants) and different aggregation schemes on top of BERT, such as Earth Mover Distance (Kusner et al., 2015; Zhao et al., 2019) or greedy alignment (Zhang et al., 2020). Understanding BERT alone is thus not sufficient for explaining BERT-based evaluation metrics.
Here, we present a simple global explanation technique for BERT-based evaluation metrics which disentangles them along prominent linguistic factors, viz., syntax, semantics, morphology, and lexical overlap. We find that all metrics capture these linguistic aspects to certain (but differing) degrees, and that they are particularly sensitive to lexical overlap, which makes them prone to similar adversarial fooling (cf. Li et al., 2020; Keller et al., 2021) as BLEU-based lexical overlap metrics. Overall, our contributions are:
• We disentangle a multitude of current BERT-based evaluation metrics along four linguistic factors using linear regression.
• We show that all metrics are sensitive to all factors and especially to lexical overlap, as confirmed both by the linear regression and an adversarial experiment.
• Based on the insight that different metrics capture different linguistic factors to varying degrees, we ensemble metrics and identify an average improvement of between 8 and 13% for the most heterogeneous metrics.

Related work
Our work concerns reference-based and reference-free metrics, on the one hand, and model introspection (or 'explainability'), on the other.
Evaluation metrics for Natural Language Generation In the last few years, several strong-performing evaluation metrics have been proposed, the majority of which are based on BERT and similar high-quality text representations. They can be differentiated along two dimensions: (i) the input arguments they take, and (ii) whether they are supervised or unsupervised. Reference-based metrics compare human references to system predictions. Popular metrics are BERTScore (Zhang et al., 2020), MoverScore (Zhao et al., 2019), and BLEURT (Sellam et al., 2020). In contrast, reference-free metrics directly compare source texts to system predictions and are thus more resource-lean. Popular examples are XMoverScore (Zhao et al., 2020), Yisi-2 (Lo, 2019), KoBE (Gekhman et al., 2020), and SentSim (Song et al., 2021). Rei et al. (2020) use all three information signals: source texts, hypotheses, and human references. There are also reference-free metrics outside the field of machine translation, for example, SUPERT (Gao et al., 2020) for summarization. Supervised metrics train on human sentence-level scores, e.g., Direct Assessment (DA) scores or post-editing effort (HTER) for MT. These metrics include BLEURT and COMET (Rei et al., 2020). In MT, most metrics from the so-called 'Quality Estimation' (QE) tasks are also supervised reference-free metrics, e.g., TransQuest (Ranasinghe et al., 2020) and BERGAMOT-LATTE (Fomicheva et al., 2020b). Unsupervised metrics require no such supervisory signal (e.g., MoverScore, BERTScore, XMoverScore, SentSim).
Model introspection There has been a recent surge in interest in explaining deep learning models. The techniques for explainability differ in whether they provide justification or information for model outputs on individual instances (local explainability) or focus on a model as a whole and disclose its internal structure (global explainability) (Danilevsky et al., 2020). Popular examples for local explainability are LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017) that find features from the input (such as particular words) relevant to model outputs.
Concerning (global) interpretability of text representations, previous works (Adi et al., 2017; Conneau et al., 2018) introspect the properties encoded in vector representations through probing classifiers, which are trained on external data to perform a certain linguistic task, such as inducing the dependency tree depth from a text representation (of a sentence). Tenney et al. (2019) extend this idea by inspecting BERT representations layer by layer, and find that BERT captures more semantic information in its higher layers and more syntactic and morphological information in its lower layers. However, probing results are not always trustworthy due to their sensitivity to probing design choices, e.g., data size and classifier choices, and to data artefacts (Ravichander et al., 2021). More importantly, evaluation metrics use BERT differently: some are supervised and others are unsupervised, some fine-tune BERT on semantic similarity datasets, and they generally differ in how they aggregate and compare BERT representations. This means that understanding BERT alone does not suffice to understand these metrics.
In our work, we disentangle BERT-based evaluation metrics along linguistic factors as a form of global explainability of those metrics. This yields insights into which linguistic information signals specific metrics use, in general, and may expose their limitations.

Our approach
In our scenario, we consider different metrics m taking two arguments and assigning them a real-valued score, m : (x, y) → s_m ∈ R, where x and y are the source text and hypothesis text, respectively (for so-called reference-free metrics), or x and y are the reference and hypothesis text, respectively (for so-called reference-based metrics). The scores s_m that metrics assign to x and y can be considered the similarity between x and y or the adequacy of y given x. In our experiments in Section 4, we focus on machine translation (MT) as a use case; it is arguably the most popular and prominent text generation task. Thus, x is either a (sentence-level) source text in one language and y the corresponding MT output, or x is the human reference for the original source text.
To better understand evaluation metrics, we decompose their scores s m along multiple linguistic factors. An example is outlined in Table 1.
We follow a long line of research in the applied sciences and use a linear model to explain a target variable (in our case, the metric score), also called the response variable, in terms of multiple regressors, also called explanatory variables. That is, we estimate the linear regression

m(x, y) = α · sem(x, y) + β · syn(x, y) + γ · morph(x, y) + δ · lex(x, y) + ε

Here, sem(x, y), syn(x, y), morph(x, y) and lex(x, y) are scores which describe the semantic, syntactic, morphological, and lexical similarity between the two argument sentences. The real coefficients α, β, γ and δ are the regressors' weights, estimated from data. Finally, ε is an error term.
Linear regression assumes a linear relationship between the target variable and the regressors. It may fail when the relationship is non-linear, but its simple model structure provides interpretability: a larger positive coefficient means the respective regressor has higher positive impact on the target variable (fixing all other variables), a coefficient close to zero means no linear relationship, and a larger negative coefficient means an inverse linear relationship between regressor and target variable.
The coefficient of determination R² describes how well the regression model fits the data. It is defined as

R² = 1 − SSE/SST

where SSE denotes the sum of squared errors and SST denotes the total sum of squares:

SSE = Σ_i (ŷ_i − y_i)²,  SST = Σ_i (y_i − ȳ)²

Here, ŷ_i is the prediction of the model, y_i is the true score, and ȳ = (1/N) Σ_i y_i is the mean. R² is 1 for a perfect fit, 0 if the model always predicts the mean, and negative if the model is worse than this baseline.
To ensure comparability of the different regressions, we normalize the scores of our regressors and the response variable with the z-normalization, i.e., subtracting the mean and dividing by the standard deviation, per variable. In the following, we define our regressors.
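As a sketch, the disentangling regression can be reproduced with a few lines of scikit-learn. The factor scores below are synthetic stand-ins for the real SEM/SYN/MOR/LEX regressors and metric scores; with real data, the arrays would be filled from the datasets and metric outputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical per-sentence-pair factor scores (columns: sem, syn, morph, lex)
# and metric scores; in practice these come from the datasets and metrics.
n = 500
X = rng.normal(size=(n, 4))
true_coef = np.array([0.4, 0.1, 0.05, 0.5])   # illustrative weights only
y = X @ true_coef + 0.1 * rng.normal(size=n)  # metric score plus noise

def znorm(a):
    """z-normalize per column: subtract the mean, divide by the std."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

Xz = znorm(X)
yz = znorm(y.reshape(-1, 1)).ravel()

reg = LinearRegression().fit(Xz, yz)
print("coefficients (alpha, beta, gamma, delta):", reg.coef_.round(2))
print("R^2:", round(reg.score(Xz, yz), 2))
```

On this synthetic data the fitted coefficients recover the relative importance of the factors (LEX largest), and R² is close to 1 because the generating process is genuinely linear.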
Semantic score (SEM) The semantic scores are provided by the datasets and were annotated by humans who rated e.g. the translation quality. See Section 4.2 for details.
Syntactic score (SYN) To measure the syntactic similarity of the argument sentences x and y, we compare their dependency trees. Both sentences are parsed by the Stanford dependency parser (Chen and Manning, 2014). Then, the tree edit distance (TED) (Bille, 2005) between the resulting trees is calculated. As an extension of string edit distance, TED measures how many operations are necessary to transform one tree into the other. Only the structures of the trees are considered in the calculation; the actual words are ignored.
We normalize the TED to ensure comparability between sentences of different lengths (Zhang and Shasha, 1989). The final score is calculated as

syn(x, y) = 1 − TED(x, y) / (l_x + l_y)

where l_x and l_y are the lengths of the sentences. Figure 1 shows an example of the TED calculation for sentences in the same language (monolingual) and Figure 2 shows a cross-lingual example.
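Assuming the raw TED has already been computed (e.g., with a Zhang–Shasha implementation), the length normalization itself is a one-liner:

```python
def normalized_ted(ted: int, len_x: int, len_y: int) -> float:
    """Normalized tree edit distance: 1 means identical tree structures,
    values near 0 mean the trees share almost no structure."""
    return 1.0 - ted / (len_x + len_y)

# The monolingual example from Figure 1: TED = 2, sentence lengths 6 and 4.
print(normalized_ted(2, 6, 4))  # → 0.8
```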
Lexical overlap score (LEX) We measure the lexical overlap between x and y by the BLEU score (Papineni et al., 2002b): BLEU n calculates the precision based on how many n-grams of one sentence can be found in the other sentence. In the experiments below, we use BLEU 1 . Using unigrams assures that word order is ignored. The simple precision count is modified so that identical words are only counted once.
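A minimal sketch of this clipped unigram precision (not the official BLEU implementation; tokenization is naive whitespace splitting):

```python
from collections import Counter

def bleu_1(hypothesis: str, reference: str) -> float:
    """Clipped unigram precision: each hypothesis word counts at most as
    often as it appears in the reference; word order is ignored."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    clipped = sum(min(count, ref[word]) for word, count in hyp.items())
    return clipped / max(1, sum(hyp.values()))

# Full lexical overlap despite opposite meaning: BLEU_1 ignores word order.
print(bleu_1("man bites dog", "dog bites man"))  # → 1.0
```

This word-order blindness is deliberate here, since LEX is meant to isolate lexical choice from the syntactic factor.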
Figure 1: Monolingual tree edit distance example for the sentence pair "It is a boy , likes to sport , but it cannot do it because of their very." and "He is a boy, he likes sports but he can't take part because of his knee." The left-most tree is the first sentence of the sentence pair and the right-most tree is the second. To transform the left-most sentence into the right-most sentence, two leaves are removed. The unnormalized tree edit distance is thus 2. The normalized score is 1 − 2/(6+4) = 0.8.
For monolingual reference-based metrics, the BLEU score is calculated directly on the sentence pairs. To use BLEU for cross-lingual reference-free metrics, we translate the non-English sentences into English via Google Translate, as it remains unclear how else to define lexical overlap between sentences from different languages. We compute BLEU scores on the original English and translated sentence pairs.

Morphological score (MOR) We introduce a morphological score morph(x, y). To do so, we use static FastText word embeddings (Bojanowski et al., 2017) and increase the morphological information in the original embeddings: (a) first, we produce two morphological lexicons based on words from WMT and STS, each containing word pairs with identical UD morphological tags (Nivre et al., 2020) (see Table 2). (b) Then, we fine-tune/retrofit the embeddings on the morphological lexicons using the method described in Faruqui et al. (2015), so that words with the same morphological tags have more similar representations. The final morphological score for a sentence pair is the cosine similarity between the two averaged sentence embeddings over the refined word vectors of each sentence. Note that, while we refer to these embeddings as morphological, they actually capture multiple linguistic factors and can only be considered more morphological than standard static vector spaces.
If the overlap of morphological features between a language pair is very low, the morphological score will not be meaningful. We exclude the morphological score for such language pairs.
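The MOR computation itself can be sketched as follows; the toy 3-d vectors below stand in for the retrofitted FastText embeddings, and all tokens and numbers are hypothetical:

```python
import numpy as np

def sentence_embedding(tokens, emb):
    """Average the word vectors of a sentence (zero vector if all OOV)."""
    vecs = [emb[t] for t in tokens if t in emb]
    dim = next(iter(emb.values())).shape
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def morph_score(x_tokens, y_tokens, emb):
    """Cosine similarity between the averaged sentence embeddings."""
    a = sentence_embedding(x_tokens, emb)
    b = sentence_embedding(y_tokens, emb)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy "morphological" embeddings; real ones would be FastText vectors
# retrofitted on the morphological lexicons.
emb = {"walks": np.array([1.0, 0.2, 0.0]),
       "runs":  np.array([0.9, 0.3, 0.1]),
       "she":   np.array([0.0, 1.0, 0.2]),
       "he":    np.array([0.1, 0.9, 0.3])}
print(morph_score(["she", "walks"], ["he", "runs"], emb))
```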

Experiments
We analyze different evaluation metrics by calculating their score for sentence pairs. We use both reference-based metrics, which operate in a monolingual space, and reference-free metrics, which operate in a cross-lingual space.

Metrics
Reference-based metrics We consider the following reference-based metrics.
• BERTScore (Zhang et al., 2020) aggregates and compares BERT embeddings by determining a greedy alignment between words in two sentences and summing up the cosine similarities of representations of aligned words.
• MoverScore (Zhao et al., 2019) computes an optimal alignment between words in the two sentences using word mover distance (Kusner et al., 2015) on top of BERT representations.
• Sentence BERT (SBERT) (Reimers and Gurevych, 2019) fine-tunes Siamese BERT networks on NLI data and produces sentence embeddings by pooling on top of BERT representations. We compute the cosine similarity between SBERT representations.

Figure 2: Cross-lingual tree edit distance example for the sentence pair "it is a great mother" and "Sie ist eine großartige Mutter". To transform the left-most into the right-most sentence, two leaves (shown in red and blue) are moved to other locations. This takes 2 operations. The normalized score is 1 − 2/(5+5) = 0.8.
• SBERT-WK (Wang and Kuo, 2020) is a variant of SBERT which weighs different layers of BERT.
• In contrast to the others, BLEURT (Sellam et al., 2020) is a supervised metric and fine-tunes BERT on the WMT datasets with available human assessment of translation quality.
Reference-free metrics We consider the following reference-free metrics:
• Multilingual Sentence BERT (mSBERT) (Reimers and Gurevych, 2020) extends SBERT to the multilingual case.
• LASER (Artetxe and Schwenk, 2019) is a BiLSTM encoder trained on parallel corpora. It produces language-agnostic representations. The encoder-decoder architecture is trained jointly on different languages.
• XMoverScore (Zhao et al., 2020) extends MoverScore to operate in the cross-lingual setup and relies on re-aligned multilingual BERT representations. Note that we exclude the target-side language model integrated in XMoverScore to have a setup similar to that of MoverScore.
Except for XMoverScore, all metrics are based on calculating the cosine similarity between the source-language and target-language sentence embeddings. Except for MUSE and LASER, all metrics are based on BERT representations. Note that some multilingual reference-free metrics can also be used in the monolingual reference-based case, especially those based on calculating cosine similarity on top of sentence embeddings; thus we include them in both settings.

Datasets
We use the datasets of the WMT shared task and the Semantic Text Similarity Benchmark (STSB) in our experiments. In the appendix, Table 8 shows statistics for each dataset, and Table 9 shows example sentences from the datasets.
WMT The WMT datasets contain an input sentence in the source language, the hypothesis translation of an MT system, and a human reference sentence in the target language. Humans have rated the similarity between the human reference and the MT hypothesis using so-called 'direct assessment' (DA) scores, which are framed in terms of one sentence 'adequately expressing the meaning' of another. We use these ratings as semantic scores in our setup. For the reference-based case, we use the hypotheses and the references as sentence pairs. This data is collected over multiple language pairs of WMT15-WMT17 which have English as the target language (so both the human reference and the hypothesis are in English). For the reference-free scenario, we pair the source texts with the MT hypotheses and use the corresponding reference-to-hypothesis DA scores as similarity scores. For German, we take the data from WMT15 (Bojar et al., 2015), WMT16 (Bojar et al., 2016) and WMT17 (Bojar et al., 2017). Chinese is only available in WMT17.

STSB The Semantic Text Similarity Benchmark (Cer et al., 2017) consists of English sentence pairs and a semantic similarity score for each pair. The scores were annotated by humans. The score is used as the semantic score in the regression. In contrast to WMT, some sentence pairs in STSB are designed to have a different structure but a similar meaning.
While we use the sentence pairs directly for the monolingual case, we translate the sentences in the cross-lingual case using Google Translate, following Chidambaram et al. (2018).

Reference-based Metrics

Table 3 shows the results for the reference-based metrics. The R² values range from 0.43 to 0.76 on WMT, and from 0.58 to 0.91 on STS. This means we can reasonably well explain the metrics from our four explanatory variables using a linear model. mSBERT can best be explained, with an R² value of 0.91 on STS; however, since it has been trained on STS, this merely indicates overfitting. All metrics have positive coefficients for SEM, indicating that they all reflect semantic similarity and (semantic) 'adequacy' (as measured by DA), respectively: the weights range from 0.19 to 0.48 on WMT and from 0.12 to 0.76 on STS (ignoring mSBERT). The SYN coefficients are much lower and range from -0.05 to 0.16 on WMT and from -0.03 to 0.24 on STS. MoverScore and BERTScore are most affected by syntactic similarity (0.11 to 0.24), while the sentence-embedding-based techniques have coefficients around zero. The coefficients for MOR are low on STS, except for SBERT-WK, and moderate for WMT.
All metrics have comparatively large coefficients for lexical overlap, especially on WMT: the coefficient values range from 0.24 to 0.64 on WMT and from 0.14 to 0.67 on STS. LEX especially dominates for MoverScore and BERTScore, indicating that these two metrics are most sensitive to lexical adversaries, potentially making them vulnerable to inputs such as 'man bites dog' vs. 'dog bites man'.

Reference-free Metrics

Table 4 shows the results for ZH-EN in the reference-free setup (omitting the score for MOR as indicated earlier). Many SYN coefficients are now zero or negative; a negative coefficient means that a larger syntactic difference between the input arguments leads to a higher metric score, indicating that the metrics are sensitive to syntactic language differences. LEX is still significant in all cases. SEM has higher coefficient values than LEX in 6 out of 10 cases, and when it 'wins', it wins by a large margin. However, we note that the R² values are low: they range from 0.30 to 0.39 on WMT and from 0.24 to 0.59 on STS. This means we either cannot (well) explain the metrics with our current regressors, or the relationship is not well captured by a linear model. The results for DE-EN are similar; we provide them in Table 10 (appendix).
To explore why R 2 scores are now lower, we note that reference-free metrics based on BERT might contain a form of cross-lingual bias (CLB) in that they do not properly score mutual translations, as Cao et al. (2020) and many others have shown that the multilingual subspaces induced by BERT are mis-aligned. We thus include a factor CLB as a regressor to measure how significant this bias is in different metrics. Note that the metrics use either different BERT variants or other representations such as LASER, which points to different sources of CLB. Therefore, we realize CLB differently across metrics. For each metric regression, we use the same metric but take a parallel sentence, i.e., source text and Google translation (as we assume that Google Translate has very high quality in general), as input arguments, and take the metric score as a proxy of the CLB factor. If a metric does not contain cross-lingual bias, it should assign almost full scores to parallel sentences; this constant would then be meaningless in the regression.
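The CLB proxy amounts to scoring each source sentence against a presumably high-quality translation of itself (Google Translate in our setup). A sketch, with a hypothetical stand-in metric:

```python
def clb_scores(metric, sources, translations):
    """Score each source against its (assumed high-quality) translation.
    A metric without cross-lingual bias should give near-maximal scores;
    systematically lower scores expose the bias, and the per-pair score
    becomes an extra regressor in the disentangling regression."""
    return [metric(src, trans) for src, trans in zip(sources, translations)]

def toy_metric(src, trans):
    # Hypothetical biased metric: it caps its score well below the maximum
    # even for a perfect translation pair, which is what CLB captures.
    return 0.75

scores = clb_scores(toy_metric,
                    ["Sie ist eine großartige Mutter."],
                    ["She is a great mother."])
print(scores)  # → [0.75]
```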
In Table 5, we show that including the CLB factor in the regression improves the R². We substantially improve the R² for XMoverScore (from 0.39 to 0.68), but observe little improvement for the remaining metrics (especially on STS). This is because XMoverScore is more problematic than the other metrics in terms of properly scoring mutual translations, given that the other metrics use BERT variants (or LASER) that have been fine-tuned (or trained) on parallel sentences. Apart from CLB, both SEM and LEX are the dominating factors in the regression. The DE-EN results are similar; see Table 11 (appendix).
Table 5: Regression results for Chinese-English reference-free metrics. We add the CLB factor in the regression.

Limitations The R² scores for the WMT dataset are almost always lower than the corresponding STS scores. One should keep in mind that STS sentences are in a sense artificial sentences of the form 'a girl is playing a guitar', while WMT contains more realistic sentences as well as their (possibly faulty, non-grammatical) translations. The WMT datasets are furthermore inhomogeneous in that we used different years from 2015 to 2017, which have different participating MT systems as well as slightly different task definitions, corresponding to an aggregation of different domains. The STS dataset is monolingual and was translated by Google Translate for the cross-lingual scenario. The latter may lower the quality of the data, but the WMT data also contains translations.
The WMT scores measure the similarity between the reference and the hypothesis, but we compare the source with the hypothesis in the cross-lingual scenario, which reflects a mismatch. Fomicheva et al. (2020a) provide a dataset which gives human DA scores between source and hypothesis. We repeated the experiments with this dataset. The full results are shown in Table 12 in the appendix (omitting the CLB factor). The R² scores of 3 out of 5 metrics improve (slightly) compared to the WMT dataset for German-English, but all R² scores are lower for Chinese-English. This means that mismatched DA scores are apparently not the main reason for our low regression fits. With the new DA scores, all coefficients for SEM are considerably lower: they range from 0.06 to 0.14, compared to a maximum of 0.47 for WMT. In contrast, all SYN coefficients (0-0.12) are higher, especially for German. All MOR (0.16-0.33) and most LEX scores are higher, but they are still in a similar range as for the original DA scores.

In the following, we analyze two observations from our previous experiments in more depth: (i) the sensitivity of metrics to lexical overlap; (ii) the orthogonality of metrics in that they capture different linguistic signals.

Table 7 (example): anchor sentence A: "Kovacic did a quick give-and-go at midfield."; paraphrase B with little lexical overlap: "Kovacic managed a quick one-two in midfield."; non-paraphrase C with high lexical overlap: "Kovacic did a quick midfield at give-and-go."

Adversarial experiments
According to our results, all metrics rely on lexical overlap, which indicates that they may not be robust to adversarial examples. We check this with an additional experiment for reference-based metrics, in which we query their pairwise preferences over three sentences: sentence A is the anchor sentence; sentence B is a paraphrase of sentence A with little lexical overlap; sentence C is a non-paraphrase with high lexical overlap. A good metric m would have m(A, B) > m(A, C), but the high lexical overlap between A and C makes this task difficult.
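This pairwise check can be sketched as follows, using clipped unigram overlap as a stand-in for a lexical-overlap metric (function names hypothetical):

```python
from collections import Counter

def _tokens(s: str):
    return [t.strip(".,") for t in s.lower().split()]

def overlap_metric(x: str, y: str) -> float:
    """Stand-in lexical-overlap metric: clipped unigram precision."""
    hyp, ref = Counter(_tokens(y)), Counter(_tokens(x))
    return sum(min(c, ref[w]) for w, c in hyp.items()) / max(1, sum(hyp.values()))

def robust_fraction(triples, metric) -> float:
    """Fraction of (A, B, C) triples where the metric prefers the
    low-overlap paraphrase B over the high-overlap non-paraphrase C."""
    return sum(metric(a, b) > metric(a, c) for a, b, c in triples) / len(triples)

triples = [("Kovacic did a quick give-and-go at midfield.",
            "Kovacic managed a quick one-two in midfield.",    # paraphrase B
            "Kovacic did a quick midfield at give-and-go.")]   # scrambled C
print(robust_fraction(triples, overlap_metric))  # → 0.0: the overlap metric is fooled
```

A robust metric would score near 1.0 on this test; a purely lexical one scores 0.0.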
Freitag et al. Sentences A are taken as source sentences from WMT19. Freitag et al. (2020) provided alternative references for WMT19; they instructed professional human translators to paraphrase the references as much as possible in terms of lexical choice and sentence structure but keep the same semantics. We take these as sentences B. We produce sentences C from sentences A: for each sentence A, we detect the nouns within the sentence using the NLTK POS tagger, and then we randomly permute them to produce a sentence C. Since Freitag et al. (2020) provided sentences in German, we translate all sentences into English using Google Translate. We note that, by inspection, the translations are generally of high quality and satisfy our constraints of inducing paraphrases with low lexical overlap and non-paraphrases with high lexical overlap. Examples are shown in Table 7 and statistics in Table 6.
PAWS We complement the analysis with the native English PAWS dataset (Zhang et al., 2019), which consists of paraphrase and non-paraphrase pairs that have high lexical overlap. For each sentence in the dataset, there are multiple paraphrases and non-paraphrases. For a given sentence A, we use the paraphrase with the smallest amount of lexical overlap as sentence B, and sentence C is the non-paraphrase with the highest amount of lexical overlap with A. We note that PAWS is more problematic, as even the paraphrases B with the minimum amount of lexical overlap do have considerable lexical overlap. Therefore, we select the 100 sentence pairs with the smallest amount of lexical overlap between sentences A and B. Table 6 shows the lexical overlap and the size of the datasets. Indeed, for PAWS, sentences B have only a little less lexical overlap with A than sentences C, while the dataset of Freitag et al. (2020) has a much clearer separation between B and C in this respect. Figure 3 shows the distribution of m(A, B) and m(A, C) for selected metrics m. Overall, the adversarial results on translated and non-translated datasets point in a similar direction. We see that metrics clearly prefer the high-lexical-overlap sentences which are non-paraphrases (sentences C) in the translated dataset of Freitag et al. (2020). For non-translated PAWS, metrics are at least to some degree indifferent, but tend to prefer B on average, with MoverScore and BERTScore having a higher preference for C than mSBERT and SBERT, which confirms our linear regression results.
Overall, these experiments show that metrics are indeed not robust to lexical adversarial examples.

Ensemble of Models
Our experiments in Section 4 show that the different metrics use different information signals, even when they use the same underlying BERT representations. For example, BERTScore relies more on lexical overlap and mSBERT relies more on semantics; BERTScore and MoverScore both capture syntax, while the other metrics are less sensitive to it. This means that the metrics are to some degree orthogonal. Thus, we suspect that a combination of metrics yields a considerably better metric. We check this hypothesis through an extra experiment, combining in particular BERTScore and MoverScore with SBERT- and mSBERT-based metrics.
We evaluate the metrics on segment-level. In segment-level evaluation, each sentence pair gets a score from m. The Pearson correlation coefficient is then calculated between these scores and the human judgement, disregarding the systems which generated the translations. To combine metrics, we simply average their scores. In the evaluation, we use the best performance of the single metrics as baseline and compare it to our combined metrics.
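The ensembling and segment-level evaluation can be sketched as follows; the metric and human scores below are synthetic stand-ins, and z-normalizing before averaging keeps differently scaled metrics comparable:

```python
import numpy as np

def znorm(a):
    """z-normalize a score vector: subtract the mean, divide by the std."""
    return (a - a.mean()) / a.std()

def ensemble(*metric_scores):
    """Average z-normalized metric scores so that differently scaled
    metrics contribute equally to the combination."""
    return np.mean([znorm(np.asarray(s, dtype=float)) for s in metric_scores], axis=0)

def pearson(a, b):
    """Segment-level evaluation: Pearson correlation with human scores."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic illustration: two noisy metrics tracking the same human scores.
rng = np.random.default_rng(0)
human = rng.normal(size=1000)
m1 = human + rng.normal(scale=1.0, size=1000)  # e.g. an overlap-heavy metric
m2 = human + rng.normal(scale=1.0, size=1000)  # e.g. a semantics-heavy metric
combo = ensemble(m1, m2)
print(pearson(m1, human), pearson(m2, human), pearson(combo, human))
```

With independent errors, the averaged scores correlate better with the human scores than either metric alone (roughly 0.71 vs. 0.82 in expectation here), mirroring why heterogeneous metric ensembles help.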
Table 13 (appendix) shows the improvements for different language pairs. In the reference-free case, the two best ensembles combine XMoverScore with mSBERT and LaBSE; the latter two rely less on lexical overlap than the first. This leads to big improvements of 11-13% over the best individual metric. Combining metrics that rely on similar factors shows less improvement, and often even leads to worse results. In the reference-based case, we combine BERTScore with mSBERT and observe an improvement of 8%, more than for any other combination we tested. These results show that combining metrics relying on different factors can largely improve their performance.

Conclusions
We disentangled BERT-based evaluation metrics along four linguistic factors: semantics, syntax, morphology, and lexical overlap. The results indicate that (i) the different metrics capture these different aspects to different degrees, but (ii) they all rely on semantics and lexical overlap. The first observation indicates that combining metrics may be helpful, which we confirmed: simple parameter-free averaging of heterogeneous metric scores can improve correlations with humans by more than 13% in our experiments. The second observation shows that these metrics may be prone to adversarial fooling, just like BLEU and ROUGE, which we confirmed in an additional experiment in which we queried metric preferences over paraphrases with little lexical overlap and non-paraphrases with high lexical overlap. Future metrics should especially take this last aspect into account and improve their robustness to adversarial conditions. There is much scope for future research, e.g., in developing better global explanations for reference-free metrics (as we cannot yet well explain these metrics), better linguistic factors (e.g., a clearer conceptualization of the morphological similarity of two sentences), and local explainability techniques for evaluation metrics (Fomicheva et al., 2021a,b).

Appendix
The following tables contain remaining experimental results.