Evaluating the Efficacy of Summarization Evaluation across Languages

While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall). Based on this, we evaluate 19 summarization evaluation metrics, and find that using multilingual BERT within BERTScore performs well across all languages, at a level above that for English.

In proposing these metrics, the authors measured correlation with human judgments based on English datasets that are not representative of modern summarization systems. For instance, Lin (2004) uses DUC 2001-2003 for ROUGE (meaning summaries were generated with largely outdated extractive summarization systems); Zhao et al. (2019) use the TAC dataset for MoverScore (again, featuring summaries from largely defunct systems; see Peyrard (2019) and Rankel et al. (2013)); and Zhang et al. (2020b) developed BERTScore based on a machine translation corpus (WMT). In contemporaneous work, Bhandari et al. (2020) address this issue by annotating English CNN/DailyMail summaries produced by recent summarization models, and find disparities relative to results from TAC. Equally troublingly, ROUGE has become the default summarization evaluation metric for languages other than English (Hu et al., 2015; Scialom et al., 2020; Ladhak et al., 2020; Koto et al., 2020b), despite there being no systematic validation of its efficacy across other languages. The questions we ask in this study, therefore, are twofold: (1) How well do existing automatic metrics perform over languages other than English? and (2) Which automatic metric works best across different languages?
In this paper, we examine content-based summarization evaluation from the aspects of precision and recall, in the form of focus and coverage, to compare system-generated summaries to ground-truth summaries (see Figure 1). As advocated by Koto et al. (2020a), focus and coverage are more interpretable and fine-grained than the harmonic mean (F1 score) of ROUGE. This is also in line with the review by Hardy et al. (2019) of linguistic properties that have been manually evaluated in recent summarization research, which found precision and recall to be commonly used to complement ROUGE F1.
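To make the precision/recall framing concrete, the following is a minimal illustrative sketch (not our annotation protocol, which relies on human judgments) of how focus and coverage correspond to precision and recall against the reference summary, using simple unigram overlap as a stand-in for content match:

```python
from collections import Counter

def unigram_focus_coverage(system: str, reference: str):
    """Toy unigram-overlap proxies for focus (precision) and coverage (recall)."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((sys_counts & ref_counts).values())
    focus = overlap / max(sum(sys_counts.values()), 1)     # how much of the system summary appears in the reference
    coverage = overlap / max(sum(ref_counts.values()), 1)  # how much of the reference is covered by the system summary
    return focus, coverage

print(unigram_focus_coverage(
    "the cat sat on the mat",
    "the cat sat quietly on a mat near the door"))
```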
While it may seem more natural and reliable to evaluate focus and coverage based on the source document rather than the ground-truth summary, we use the ground-truth summary in this research for the following reasons. First, historically, validation of automatic evaluation metrics for summarization has been based primarily on ground-truth summaries (not source documents). Second, ROUGE (Lin, 2004) was initially motivated and assessed based on coverage over the DUC 2001-2003 datasets (Lin and Hovy, 2002) using annotations based on reference summaries (not source documents). Third, although it is certainly possible to generate different summaries for the same source document, we argue that the variance in content is actually not that great, especially for single-document summarization. Lastly, basing human evaluation (of focus and coverage) on the source article leads to more complicated annotation schemes, and has been shown to yield poor annotations (Nenkova and Passonneau, 2004; Fabbri et al., 2020).
In summary, this paper makes three contributions: (1) we carry out the first systematic attempt to quantify the efficacy of automatic summarization metrics over 8 linguistically-diverse languages, namely English (EN), Indonesian (ID), French (FR), Turkish (TR), Mandarin Chinese (ZH), Russian (RU), German (DE), and Spanish (ES); (2) we evaluate an extensive range of traditional and model-based metrics, and find BERTScore to be the best metric for evaluating both focus and coverage; and (3) we release a manually-annotated multilingual resource for summarization evaluation comprising 4,320 annotations. Data and code used in this paper are available at: https://github.com/fajri91/Multi_SummEval

Related Work
As with much of NLP, research on automatic summarization metrics has been highly English-centric. Graham (2015) comprehensively evaluated 192 ROUGE variations based on the DUC-2004 (English) dataset. Bhandari et al. (2020) released a new (English) evaluation dataset by annotating CNN/DailyMail using simplified Pyramid (Nenkova and Passonneau, 2004): semantic content units (SCUs) were first manually extracted from the reference, and crowd-workers were then asked to count the number of SCUs in the system summary. Their annotation procedure does not specifically consider focus, but is closely related to the coverage aspect of our work. Similarly, Fabbri et al. (2020) annotated the (English) CNN/DailyMail dataset for the four aspects of coherence, consistency, fluency, and relevance. While their work does not specifically study focus and coverage, relevance in their work can be interpreted as the harmonic mean of focus and coverage.
There is little work on summarization evaluation for languages other than English, and what work exists is primarily based on summaries generated by unsupervised extractive models dating back more than a decade, for a small handful of languages. Two years prior to ROUGE, Saggion et al. (2002) proposed a summarization metric using similarity measures for English and Chinese, based on cosine similarity, unit overlap, and the longest common subsequence ("LCS") between reference and system summaries. In other work, Saggion et al. (2010) investigated coverage, responsiveness, and pyramids for several extractive models in English, French, and Spanish.
To the best of our knowledge, we are the first to systematically quantify the panlinguistic efficacy of evaluation metrics for modern summarization systems.

Evaluation Metrics
We assess a total of 19 different evaluation metrics that are commonly used in summarization research (noting that lesser-used metrics such as FRESA (Saggion et al., 2010) and SERA (Cohan and Goharian, 2016) are omitted from this study).
ROUGE (Lin, 2004) measures the lexical overlap between the system and reference summary; based on the findings of Graham (2015), we consider 7 variants in this paper: ROUGE-1 (unigram), ROUGE-2 (bigram), ROUGE-3 (trigram), ROUGE-L (LCS), ROUGE-S (skip-bigram), ROUGE-SU (skip-bigram plus unigram), and ROUGE-W (weighted LCS).

METEOR (Lavie and Agarwal, 2007) performs word-to-word matching based on word alignment, and was originally developed for MT but has recently been used for summarization evaluation (See et al., 2017; Chen and Bansal, 2018; Falke and Gurevych, 2019; Amplayo and Lapata, 2020).

BLEU (Papineni et al., 2002) is a precision-based metric originally developed for MT, which measures the n-gram match between the reference and system summary. Based on the findings of Graham (2015), we use BLEU-4 according to the SacreBLEU implementation (Post, 2018).

MoverScore (Zhao et al., 2019) measures the Euclidean distance between two contextualized BERT representations, and relies on soft alignments of words learned by solving an optimisation problem. We use the default configuration (n-gram=1) over 5 different pre-trained models, as detailed below. Note that MoverScore is symmetric (i.e. MoverScore(x, y) = MoverScore(y, x)), and as such is not designed to separately evaluate precision and recall.
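For readers wishing to reproduce the precision/recall split of the lexical metrics above, the sketch below shows one plausible way to obtain ROUGE precision (focus) and recall (coverage) along with a BLEU score, using the off-the-shelf rouge-score and sacrebleu packages; these are assumptions about tooling, not necessarily the implementations used in this paper.

```python
from rouge_score import rouge_scorer
import sacrebleu

reference = "the government announced new climate targets on monday"
system = "new climate targets were announced by the government"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, system)   # args: (target, prediction)
focus_r1 = rouge["rouge1"].precision      # precision side -> focus
coverage_r1 = rouge["rouge1"].recall      # recall side -> coverage

bleu = sacrebleu.sentence_bleu(system, [reference]).score  # precision-based, 0-100 scale
print(focus_r1, coverage_r1, bleu)
```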
BERTScore (Zhang et al., 2020b) computes the similarity between BERT token embeddings of the system and reference summaries based on soft overlap, in the form of precision, recall, and F1 scores. Zhang et al. (2020b) found that layer selection (i.e. which layer to source the token embeddings from) is critical to performance. Since layer selection in the original paper was based on MT datasets, we perform our own layer selection using a similar methodology to the authors, specifically considering precision and recall for focus and coverage, respectively.
For both MoverScore and BERTScore, we experiment with two classes of BERT-style model: (1) multilingual models, in the form of cased and uncased multilingual BERT (Devlin et al., 2019), and base and large XLM-R (Conneau et al., 2020), for a total of 4 models; and (2) a monolingually-trained BERT for the given language, as listed in the Appendix. While we expect monolingual BERT models to perform better, we also focus on multilingual models, both to confirm whether this is the case, and to be able to draw findings for languages without monolingual models.
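As a minimal sketch, a summary pair can be scored with BERTScore over uncased multilingual BERT via the bert-score package as below; the num_layers value shown is a hypothetical placeholder, since we select the layer empirically rather than relying on the package default.

```python
from bert_score import score

system_summaries = ["new climate targets were announced by the government"]
reference_summaries = ["the government announced new climate targets on monday"]

# Precision maps to focus, recall to coverage.
P, R, F1 = score(
    system_summaries,
    reference_summaries,
    model_type="bert-base-multilingual-uncased",
    num_layers=9,  # hypothetical layer; in practice chosen by correlation with human judgments
)
focus, coverage = P[0].item(), R[0].item()
```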

Experimental Setup
For each language, we sample 135 documents from the test set of a pre-existing (single-document) summarization dataset, and obtain system summaries from two BERT-based summarization models, yielding two reference-system summary pairs per document. Summaries for all datasets except LCSTS were sourced from the authors of the respective dataset; for LCSTS, we trained the two models ourselves on the training data. The motivation for using BERT-based systems is that our study focuses on non-English summarization, where BERT-based models dominate; BERT-based summaries are representative of transformer-based models, and the ROUGE score gap over state-of-the-art models (Zhang et al., 2020a) for English is only ∼2 points. The total number of resulting annotations is: 8 languages × 135 documents × 2 models × 2 criteria (= focus and coverage) × 3 annotators = 12,960.
For annotation, we used Amazon Mechanical Turk with the customized Direct Assessment ("DA") method (Graham et al., 2015, 2017), which has become the de facto standard for MT evaluation at WMT. Crowd-sourced workers are given two texts and asked the question (in the local language): How much information contained in the second text can also be found in the first text? We combine focus and coverage annotation into one task, as the only thing that differentiates them is the ordering of the system and reference summaries, which is opaque to the annotators. Workers respond by scoring via a slider button (continuous scale of 1-100).
For each HIT (100 samples), DA incorporates 10 pre-annotated samples for quality control: 5 samples are random pairs (which should be scored 0), and the remaining 5 samples are repetitions of the same summary with minor edits (which should be scored 100). For each language, we asked a native speaker to translate all instructions and the annotation interface. For a single HIT, we paid USD$13, and set the HIT approval rate to 95% (with the number of HITs approved varying across languages). For a HIT to be included in the annotated data, a quality-control score of at least 7 out of 10 needed to be achieved; HITs below this threshold were re-run (ensuring they were not completed by a worker who had already completed that HIT), until three above-threshold annotations were obtained. We nevertheless approved all HITs with at least 30 minutes of working time and a minimum quality-control score of 5, irrespective of whether they passed the higher quality-control threshold required for inclusion in the ground truth. The annotation for English was restricted to US-based workers, and for the other languages except Chinese to countries where the language is an official language.
To obtain focus and coverage values, we follow standard practice in DA by z-scoring the scores from each annotator, and then averaging.
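The post-processing step described above (per-annotator z-scoring, then averaging) can be sketched roughly as follows; the column names ("worker", "item", "raw_score") are illustrative and do not necessarily match the released data schema.

```python
import pandas as pd

def da_scores(ratings: pd.DataFrame) -> pd.Series:
    """Standard DA post-processing sketch: z-score each worker's raw 1-100
    ratings to remove individual scoring biases, then average per item."""
    df = ratings.copy()
    stats = df.groupby("worker")["raw_score"].agg(["mean", "std"])
    df = df.join(stats, on="worker")
    df["z"] = (df["raw_score"] - df["mean"]) / df["std"]
    return df.groupby("item")["z"].mean()  # final focus or coverage value per summary pair
```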

Annotation Results
In Table 1, we present the results of the human annotation. We first normalize the ratings from each HIT into z-scores, and compute the one-vs-rest Pearson correlation (excluding quality-control items) to provide an estimate of human agreement/performance. For all languages, we observe that the average quality and human agreement is moderately high. However, the agreement does vary, and this affects the interpretation of the correlation scores when we assess the automatic metrics later. Note also that we obtain the lowest agreement for English, meaning the results for the non-English languages are actually more robust. Although focus and coverage are positively correlated in Table 1, the distribution of scores varies quite a bit between languages: English annotation variance is higher than for the other languages, and English has the lowest correlation between focus and coverage (r = 0.57); for French, Russian, and Spanish, summaries generally have low focus and coverage (for more details, see the focus-coverage scatterplots in Figure 2 of the Appendix).
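For clarity, the one-vs-rest agreement estimate can be sketched as below, assuming a matrix of z-scored ratings with one column per annotator (illustrative code, not the released implementation):

```python
import numpy as np
from scipy.stats import pearsonr

def one_vs_rest_agreement(z: np.ndarray) -> float:
    """z: array of shape (n_items, n_annotators) holding z-scored ratings."""
    corrs = []
    for a in range(z.shape[1]):
        rest = np.delete(z, a, axis=1).mean(axis=1)   # mean of the other annotators
        corrs.append(pearsonr(z[:, a], rest)[0])      # correlate annotator a against the rest
    return float(np.mean(corrs))
```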

Correlation with Automatic Evaluation
In Table 2 we present the Pearson correlation between the human annotations and the various automatic metrics, broken down across language and focus vs. coverage, and (naively) aggregated across languages in the form of the average correlation. We also include the one-vs-rest annotator correlation (Section 5.1) in the last row, as it can be interpreted as the average performance of a single annotator. Recognizing the sensitivity of Pearson's correlation to outliers (Mathur et al., 2020), we manually examined the distribution of scores for all language-system combinations for outliers (and present all scatterplots in Figure 2 of the Appendix).
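The per-language correlations and their naive average (the "Avg" column of Table 2) follow the straightforward computation sketched below; variable names are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

def metric_human_correlation(metric_scores: dict, human_scores: dict) -> float:
    """Both arguments map a language code to an array of per-summary scores."""
    per_language = [
        pearsonr(metric_scores[lang], human_scores[lang])[0]
        for lang in human_scores
    ]
    return float(np.mean(per_language))  # naive average across languages
```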
The general pattern is consistent across languages: BERTScore performs better than the other metrics in terms of both focus and coverage. This finding is consistent with that of Fabbri et al. (2020) with respect to expert annotations of relevance (which can be interpreted as the harmonic mean of our focus and coverage). ROUGE-1 and ROUGE-L are overall the best versions of ROUGE, while BLEU-4 performs the worst. For coverage, METEOR tends to be competitive with ROUGE-1, especially for EN, FR, DE, and ES, in large part because these languages are supported by the METEOR lemmatization package.
For some pre-trained models, MoverScore is competitive with BERTScore, although the average correlation is lower, especially for coverage.
We perform layer selection for BERTScore by selecting the layer that produces the highest correlation. For monolingual BERT the selection is based on the average correlation across the two summarization models, while for the multilingual models it is based on the overall result across the 8 languages × 2 models. We observe that BERTScore with monolingual BERT performs the best, at an average of 0.72 and 0.77 for focus and coverage, respectively, but only marginally above the best of the multilingual models, namely mBERT uncased (0.72 and 0.76, respectively). Given that layer selection here was performed universally across all languages (to ensure generalizability to other languages), our overall recommendation for the best metric to use is BERTScore with mBERT uncased.
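A rough sketch of this layer-selection loop is given below, assuming the bert-score package and pre-computed human annotations; it is an approximation of the procedure rather than our exact code.

```python
import numpy as np
from scipy.stats import pearsonr
from bert_score import score

def select_layer(system, reference, human, use_recall=False,
                 model="bert-base-multilingual-uncased", max_layer=12):
    """Pick the encoder layer whose BERTScore precision (focus) or recall
    (coverage) correlates best with the human annotations."""
    best_layer, best_r = None, -1.0
    for layer in range(1, max_layer + 1):
        P, R, _ = score(system, reference, model_type=model, num_layers=layer)
        metric = (R if use_recall else P).numpy()  # recall for coverage, precision for focus
        r = pearsonr(metric, human)[0]
        if r > best_r:
            best_layer, best_r = layer, r
    return best_layer, best_r
```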
When we compare the metric results to the one-vs-rest single-annotator performance from Table 1, we see a positive correspondence between the relative scores for annotator agreement and metric performance. We suspect this is largely an artefact of data quality (i.e. the metrics are assessed to perform better for languages with high agreement because the quality of the ground truth is higher), but further research is required to confirm this. Generally, the best metrics tend to outperform single-annotator performance substantially (>0.10), suggesting these metrics are more reliable than a single annotator.

Conclusion
In this work, we developed a novel dataset for assessing automatic evaluation metrics for focus and coverage across a broad range of languages and datasets. We found that BERTScore is the best metric for the vast majority of languages, and advocate that this metric be used for summarization evaluation across different languages in the future, supplanting ROUGE.

[Table 7: Pearson correlation (r) between automatic metrics and human judgments for coverage, per language. Recall is computed for ROUGE and BERTScore; BERTScore uses the optimized layer, and the other metrics use the default configuration of their original implementations.]