SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation

Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems, and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark to evaluate learnt metrics and as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE dataset and metrics publicly available for future research on multilingual and multifaceted summarization evaluation.


Introduction
Evaluating the quality of generated text is an increasingly difficult problem as large language models produce text of rapidly improving quality (Radford et al., 2019; Ouyang et al., 2022; Chowdhery et al., 2022). In spite of these improvements, such models often generate text that includes hallucinations and other subtle errors (Wiseman et al., 2017; Maynez et al., 2020; Parikh et al., 2020; Ji et al., 2023; Borji, 2023), making reliable evaluation essential for driving progress.
Common n-gram metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are often not well correlated with human judgments for many natural language generation (NLG) tasks such as machine translation (Kocmi et al., 2021; Freitag et al., 2021a), summarization (Kryscinski et al., 2020), and dialogue (Dziri et al., 2022). Consequently, human evaluation is often necessary to reliably evaluate NLG systems. However, designing human annotation pipelines and obtaining annotations is resource-intensive, time-consuming, and not easily reproducible. Developing more reliable automatic evaluation metrics would make model development faster and more efficient. With this in mind, much recent work has focused on learnt metrics, i.e., neural classification or regression models that aim to directly predict scores that evaluate the quality of generated text (Zhang* et al., 2020; Sellam et al., 2020; Rei et al., 2020; Liu et al., 2023), often trained with human ratings.

Data and metrics are available at https://goo.gle/seahorse.

Figure 1: Two summaries from the SEAHORSE dataset paired with human ratings for 6 dimensions of quality. In the second summary, the word in bold has a grammatical error in Russian; it uses the wrong aspect. The rater has noted this error, along with several others.
As a result, large-scale collections of human evaluations serve two critical roles in NLG metric development: (1) a source of training data for learnt metrics and (2) a meta-evaluation benchmark for the performance of these learnt metrics. The large potential of such datasets is exemplified by the WMT metrics shared task, which has enabled rapid development of learnt metrics for machine translation that exhibit considerably higher correlation with human judgment than BLEU (Bojar et al., 2016; Freitag et al., 2021b).
However, outside of machine translation, the existence of such collections of human judgments is limited. Human annotations collected in NLG evaluations are rarely released (Gehrmann et al., 2022), and even when they are, they tend to cover a single language (typically English) and are from a single dataset or task, limiting the robustness of models and metrics trained on these annotations. Moreover, such annotations are often based on the test split of existing datasets (e.g., Fabbri et al., 2021; Aharoni et al., 2023), which can be problematic for training learnt metrics. This is because the primary advantage of reliable automatic evaluation is to help model development, e.g., hyperparameter selection on the validation set; therefore a neural metric trained on test set annotations would, in general, lead to overfitting.
In this work, we propose SEAHORSE, a large-scale dataset for multilingual summarization evaluation. Our dataset consists of 96K summaries with ratings along 6 quality dimensions: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, in 6 languages, for 9 systems (8 models plus the human-authored reference summaries) across 4 summarization datasets (see examples in Figure 1). The training and validation splits of the dataset come from the validation sets of the original summarization corpora to prevent test set contamination when training metrics. This permits us to train a learnt metric for each quality dimension that can be used for offline model evaluation.
We evaluate the metrics learned from SEAHORSE on the SEAHORSE test set, as well as on other existing meta-evaluation benchmarks, such as mFACE (Aharoni et al., 2023) and TRUE (Honovich et al., 2022). Our experiments show that the metrics generalize across datasets, tasks, and languages. For example, we demonstrate that although SEAHORSE includes data in 6 languages, the resulting learnt metrics achieve strong performance on the mFACE benchmark, which covers 45 languages, exhibiting their zero-shot multilingual generalization potential. To summarize, the contributions of this paper are:
• We conduct a comprehensive, large-scale human evaluation for summarization across six languages, six quality facets, nine systems, and four datasets, resulting in over 96K human-rated summaries. To the best of our knowledge, this is the largest multilingual, multifaceted summarization evaluation resource.
• We train a learnt metric for each of the evaluated quality facets, and show that the metrics outperform strong baselines across our in-domain test set and previously published out-of-domain benchmarks, highlighting the quality of the human annotations we collect and the broad utility of our learnt metrics.
• We release our dataset and metrics to foster future work on multilingual, multifaceted summarization.

The SEAHORSE dataset
The SEAHORSE dataset consists of 96,645 summaries annotated with human ratings along 6 quality dimensions. In this section, we describe the SEAHORSE dataset, how we generated the summaries, and how we collected the annotations.

The summaries
The examples in SEAHORSE are in 6 languages: German (de), English (en), Spanish (es), Russian (ru), Turkish (tr), and Vietnamese (vi).We chose these languages by considering geographic and typological diversity and the availability of summarization datasets in those languages.
The summaries are based on articles from 4 different datasets in the GEM benchmark (Gehrmann et al., 2021):
• XSum (Narayan et al., 2018): An English dataset where the task is to generate a one-sentence summary of a BBC News article.
• XL-Sum (Hasan et al., 2021): Similar to XSum, the goal of this dataset is to generate a single-sentence summary of a BBC news article, but it covers 44 languages excluding English.
A breakdown of SEAHORSE across languages and datasets is in Table 1.
For each dataset, we randomly selected articles from their validation splits to comprise the SEAHORSE training and validation sets, and articles from the test splits to make up the SEAHORSE test set. This distinction is important when using the dataset for training evaluation metrics (discussed in §4), because learnt metrics are typically used for model development, and hyperparameter selection is done on the validation set. Using a metric that was trained on test data would lead to overfitting. Our dataset construction ensures that a learnt metric can be trained on SEAHORSE data without concerns of test set leakage.
Next, we generate summaries for each article in the dataset. The summaries come from a subset of 9 different systems, which we will denote as follows:
• reference: The human-authored summaries associated with each article from the original datasets.
• t5_base: The 220M-parameter version of the T5 model (Raffel et al., 2020). (This model is English-only, so we only use it to generate summaries with our en datasets.)
• t5_base_250: The t5_base model with an under-trained checkpoint, trained for only 250 steps (en only).
• mt5_small_250: The same mt5_small model but using the checkpoint after training 250 steps.
Our choice of systems covers a range of expected system performances in order to capture a large diversity of system outputs and model error types. For instance, an under-trained small model (mt5_small_250) would likely have different errors than a 1-shot large language model (palm_1shot). Details about how the summaries are generated from these models are in Appendix A.

Annotation methodology
For each summary, we collect annotations along 6 dimensions, also referred to as Q1-6:
Q1 comprehensible: The summary can be read and understood by the rater. (If "No," the rest of the questions will be skipped.)
Q2 repetition: The summary is free of unnecessarily repeated information.
Q3 grammar: The summary is grammatically correct.
Q4 attribution: All the information in the summary is fully attributable to the source article, as defined in Rashkin et al. (2021).
Q5 main ideas: The summary captures the main idea(s) of the source article.
Q6 conciseness: The summary concisely represents the information in the source article.
For the first 3 questions, annotators see only the summary. The article is revealed when the raters are answering questions 4-6. They can answer "Yes," "No," or "Unsure" to each question and have the option to leave comments or flag any issues they see in the article. The annotation interface is shown in Figure 2. Note that our annotation process is reference-less, i.e., the annotator never compares a model-generated summary with the reference summary. They evaluate each summary on its own. Given the subjectivity of summarization, we believe this approach allows us to adequately reward models that generate relevant summaries that may differ from the reference. Moreover, this enables us to train reference-less metrics in §4, which have the added benefit that they can be used at inference time for re-ranking.
The raters are paid, full-time annotators who were trained for this specific task and worked under the supervision of a project manager. For the non-English languages, the raters are bilingual, proficient in both the annotation language and English. They received a detailed set of instructions in English describing the 6 dimensions of quality, with positive and negative examples of each in the target language. We created a set of 109 summaries with gold ratings, which we used to train the raters. Each annotator rated 20-30 summaries from this gold set. If the rater performed well on this subset, they were qualified to move forward with the annotation task. Otherwise, the annotator received feedback and was asked to complete another 10-20 ratings. This training process was repeated as needed.
A small number of approved annotators were removed during the annotation process, due to issues flagged by the annotation team and the authors.The ratings from the removed annotators are not included in the dataset.

Dataset analysis
We first analyze the dataset's composition and the quality of the collected annotations. The 1-shot PaLM model is particularly likely to copy from the article as its output, obtaining the highest ROUGE-L (Lin, 2004) scores between the summary and the article. In 14% of cases, the beginning of the 1-shot summaries (the first 20% of the summary) exactly matched the beginning of the reference article.
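The copy-rate analysis above is straightforward to reproduce. The following is a minimal sketch, assuming each example is a dict with hypothetical "article" and "summary" fields; it computes ROUGE-L between the summary and its source article with the rouge_score package and checks whether the first 20% of the summary exactly matches the start of the article.

```python
# A minimal sketch of the copy-rate analysis; "article" and "summary" are
# hypothetical field names, not the released schema.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"])

def overlap_stats(example: dict, prefix_fraction: float = 0.2):
    article, summary = example["article"], example["summary"]
    # ROUGE-L between the summary and its source article (higher = more copying).
    rouge_l = scorer.score(article, summary)["rougeL"].fmeasure
    # Does the first 20% of the summary exactly match the start of the article?
    prefix_len = max(1, int(len(summary) * prefix_fraction))
    prefix_copied = article.startswith(summary[:prefix_len])
    return rouge_l, prefix_copied
```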
Table 3 shows the percentage of summaries from each summarization system that received a positive (i.e., "Yes") rating from annotators. While there is variation across models and datasets, most summaries are rated positively for questions 1-3 (comprehensibility, repetition, and grammar). The rate of positive responses drops for questions 4-6 (attribution, main ideas, and conciseness), indicating that these areas remain a challenge for summarization models. A more detailed breakdown of the positive response rates is in Appendix B.
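A breakdown like Table 3 can be tabulated directly from the raw annotations. Below is a rough sketch, assuming a pandas DataFrame with hypothetical columns "system", "question" (Q1-Q6), and "rating" ("Yes"/"No"/"Unsure"); the column names are illustrative, not the released schema.

```python
# Percent of "Yes" responses, broken down by system and question.
import pandas as pd

def positive_rates(df: pd.DataFrame) -> pd.DataFrame:
    yes = df.assign(is_yes=df["rating"].eq("Yes"))
    return (yes.groupby(["system", "question"])["is_yes"]
               .mean()
               .unstack("question") * 100)
```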
Note that the reference summaries do not always receive the highest rate of positive responses. The way in which reference texts are collected may limit their quality along some dimensions. For example, the text that was collected as a reference summary may not have been intended to be read as a standalone replacement for the article, and therefore may not be fully attributable to the article, as Rashkin et al. (2021) point out.
We can use the positive response rates to inspect the quality of the dataset by verifying the presence of 3 patterns we expect to see in the data: 1) higher positive response rates for better summarization models, 2) high correlation between the responses to Q4&6 and Q5&6, and 3) annotator agreement across the 6 dimensions.
Order of model quality Our first expectation is that summaries generated by better summarization models should receive more positive responses from raters. We have 3 types of model pairs where we can expect one model to generate better summaries than the other: 1) a larger model should outperform a smaller model (the xxl vs. the small model), 2) a fully trained model should outperform an under-trained model (the small vs. the small_250 model), and 3) a finetuned model should outperform a 1-shot prompted model (the finetuned vs. 1-shot PaLM models).
We compare how often these model pairs produce low-quality summaries, i.e., summaries that are unintelligible to readers. In Table 3, we see that mt5_xxl produces fewer incomprehensible (Q1) summaries than mt5_small, which produces fewer than mt5_small_250. The same holds true for the T5 models, and palm_finetuned produces fewer incomprehensible summaries than palm_1shot, reflecting the expected relationship in quality between model pairs. While these results are averaged over the entire dataset, we see the same result when controlling for the source article and only considering items that have summaries generated by all 9 systems (see Appendix B). This pattern generally holds across the other dimensions of quality as well. There is one notable exception: PaLM's performance on attribution (Q4). For 4 languages, palm_1shot is more often rated as being faithful to the input article than palm_finetuned, which is likely due to its tendency to copy the article directly.
Generally, however, the SEAHORSE ratings capture the relative differences in model quality we expect to see when evaluating two models with known differences.
Correlation between dimensions Conciseness (Q6) is related to two other dimensions in our annotation: attribution (Q4) and main ideas (Q5). A summary cannot be considered a "concise representation of the information in the article" if it has information that is not in the article (i.e., a "No" response for Q4) or if it does not represent the main points in the article (i.e., a "No" response for Q5), a detail that was pointed out to evaluators in the task instructions. Therefore, we expect Q6 to be positively correlated with both of these dimensions if the annotators understood the task and the relationship between the dimensions of quality.
In > 99% of cases when the annotator says a summary is not attributable (Q4) or says it lacks the main ideas from the article (Q5), they also say it is not concise (Q6). This is also reflected in Figure 3, which shows that the strongest correlations between questions are between questions 4 & 6 and questions 5 & 6. These results show the pattern we expect to see in the data given the task definition and instructions, and demonstrate the annotators' ability to understand and execute the annotation task.
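Both consistency checks (the pairwise correlations between questions and the conditional rate of "not concise" given a negative Q4 or Q5 rating) can be computed as in the sketch below, which assumes one row per rated summary with hypothetical binary columns Q1-Q6 (1 = "Yes", 0 = "No").

```python
# Consistency checks between quality dimensions; column names are assumed.
import pandas as pd

def dimension_correlations(df: pd.DataFrame) -> pd.DataFrame:
    # Pearson correlation between binary ratings (equivalent to phi).
    return df[[f"Q{i}" for i in range(1, 7)]].corr()

def not_concise_given(df: pd.DataFrame, question: str) -> float:
    # P(Q6 = "No" | question = "No"); expected to be close to 1 for Q4 and Q5.
    negatives = df[df[question] == 0]
    return (negatives["Q6"] == 0).mean()
```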

Table 5: Krippendorff's α by question. Q1: 0.49, Q2: 0.87, Q3: 0.35, Q4: 0.47, Q5: 0.40, Q6: 0.41.

Annotator agreement While most items in the dataset were annotated once, we collected 2 additional ratings for a subset of the data to compare annotators' scores. Out of 8,920 duplicated annotations, the overall pairwise agreement between raters was 82%. Table 4 breaks down the pairwise agreement across all languages and questions. Questions 1-3 have higher agreement, while questions 4-6 (which depend on more context and have a higher degree of subjectivity) have lower agreement. A similar trend is reflected in the Krippendorff's α values (Krippendorff, 1980; shown in Table 5), which correct for the probability of chance agreement, except that grammar (Q3) scores lowest.
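Both agreement statistics can be computed as in the following sketch, which assumes a hypothetical mapping from each multiply-annotated item to the list of binary ratings it received for a single question, and which relies on the third-party krippendorff package for the α computation.

```python
# Pairwise agreement and Krippendorff's alpha for one question; the input
# format is an assumption made for illustration.
from itertools import combinations
import numpy as np
import krippendorff

def pairwise_agreement(ratings: dict[str, list[int]]) -> float:
    pairs = [a == b
             for votes in ratings.values()
             for a, b in combinations(votes, 2)]
    return float(np.mean(pairs))

def alpha(ratings: dict[str, list[int]], max_raters: int = 3) -> float:
    # Rows = raters, columns = items; np.nan marks missing ratings.
    matrix = np.full((max_raters, len(ratings)), np.nan)
    for j, votes in enumerate(ratings.values()):
        matrix[:len(votes), j] = votes
    return krippendorff.alpha(reliability_data=matrix,
                              level_of_measurement="nominal")
```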
These patterns in the annotators' responses are positive indicators about the overall quality of the SEAHORSE ratings. However, the more important test of the dataset's quality is its usefulness for developing evaluation metrics, which we discuss in the next section.

SEAHORSE
The SEAHORSE dataset is meant to serve both as a source of training data for learnt metrics and as a meta-evaluation benchmark for these metrics. In this section, we evaluate SEAHORSE on these aspects by looking at how well metrics finetuned with our collected annotations can predict human ratings of generated summaries, both from the SEAHORSE test set and from other existing datasets. When training metrics, we use a filtered version of the dataset that removes all duplicates and all ratings other than "Yes" or "No" (88,280 total items). We divide the annotations into train/dev/test splits, where the summaries in the train and dev sets are based on articles from the original datasets' validation sets. The test set of SEAHORSE contains summaries of the articles in the original datasets' test sets.
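A minimal sketch of this filtering and splitting logic is shown below; the column names ("rating", "summary_id", "source_split") and the dev-set fraction are assumptions for illustration, not the released preprocessing.

```python
# Filter to definitive ratings and split so that the metric's test set only
# contains summaries of articles from the original corpora's test splits.
import pandas as pd

def prepare_metric_data(df: pd.DataFrame, dev_fraction: float = 0.1, seed: int = 0):
    df = df[df["rating"].isin(["Yes", "No"])]              # drop "Unsure" etc.
    df = df.drop_duplicates(subset="summary_id")           # drop repeated annotations
    df = df.assign(label=df["rating"].eq("Yes").astype(int))

    # Train/dev come from the original validation split; test from the test split.
    test = df[df["source_split"] == "test"]
    pool = df[df["source_split"] == "validation"].sample(frac=1.0, random_state=seed)
    n_dev = int(len(pool) * dev_fraction)
    return pool.iloc[n_dev:], pool.iloc[:n_dev], test
```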

Metrics
One way to train a metric using SEAHORSE is to finetune a text-to-text generation model, where the model is trained to take an article and summary as its input and to output the string '0' or '1' as a prediction of the human rating. We finetune mT5_xxl (Xue et al., 2021) with the SEAHORSE training set to do this task, finetuning a separate metric for each dimension of quality. We call this model mt5 SEAHORSE. More details are in Appendix A. Note that our goal is not to train a state-of-the-art metric but rather to evaluate the utility of SEAHORSE as a resource to train and evaluate such metrics.
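At scoring time, such a text-to-text metric can produce a continuous score by reading out the probability it assigns to the token '1'. The sketch below illustrates this pattern with Hugging Face transformers; the checkpoint name, input template, and sequence length are placeholders rather than the exact released configuration.

```python
# Sketch of scoring an (article, summary) pair with a '0'/'1' text-to-text metric.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-xxl")            # placeholder checkpoint
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-xxl")  # placeholder checkpoint
model.eval()

def quality_score(article: str, summary: str) -> float:
    # Assumed input template; the released metric may format inputs differently.
    text = f"premise: {article} hypothesis: {summary}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, 0]
    one_id = tokenizer("1", add_special_tokens=False).input_ids[0]
    zero_id = tokenizer("0", add_special_tokens=False).input_ids[0]
    # Score = probability mass on "1" relative to "0".
    probs = torch.softmax(logits[[zero_id, one_id]], dim=-1)
    return probs[1].item()
```

A score of this form can also be used at inference time, e.g., to rerank candidate summaries by predicted quality, as noted earlier.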
We compare the performance of mt5 SEAHORSE to several baselines:
• majority_class A majority class baseline (i.e., picking the most frequent class).
• ROUGE-L The ROUGE-L score between the article and the summary.
We note that since we are operating in the reference-free setting, other learnt metrics such as BLEURT (Sellam et al., 2020) or BERTScore (Zhang* et al., 2020) are not applicable, as they measure the similarity between the prediction and the reference.
We evaluate the SEAHORSE and baseline metrics in two ways: the area under the ROC curve and the correlation (Pearson's ρ) between the metric and human scores. These measures are not sensitive to a thresholding value and are also used in the work we compare with (Honovich et al., 2022; Aharoni et al., 2023).
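Both measures can be computed with standard libraries; a small sketch follows, assuming "human" holds the binary human ratings and "pred" the metric's continuous scores.

```python
# Threshold-free meta-evaluation of a metric against binary human ratings.
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

def meta_evaluate(human, pred):
    rho, _ = pearsonr(human, pred)      # correlation with human ratings
    auc = roc_auc_score(human, pred)    # area under the ROC curve
    return {"pearson": rho, "roc_auc": auc}
```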

Evaluation on the SEAHORSE test set
We first evaluate mt5 SEAHORSE on the SEAHORSE test set to confirm that a model is able to learn to predict the different dimensions of quality in SEAHORSE. The results are shown in Table 6. As expected, we see that the mt5 SEAHORSE model is able to predict SEAHORSE ratings better than the baselines according to both our measures. The repetition (Q2) metric performs the best out of the 6 dimensions, which is also the dimension with the highest pairwise annotator agreement. Examples of summaries paired with human, SEAHORSE, and ROUGE-L ratings can be found in Appendix C.
Reducing the size of the base mT5 model from XXL (13B parameters) to Large (1.2B) reduces the performance of the metric, but the smaller model shows similar trends and still outperforms all baseline approaches. More mt5_L SEAHORSE results can be found in Appendix D.

Evaluation on the mFACE dataset
In addition to achieving good performance on the SEAHORSE test set, we would like to evaluate how well models trained on SEAHORSE generalize to other multilingual summarization human evaluation datasets without any further tuning. This would give evidence that improving on SEAHORSE would lead to better evaluation metrics in general.
For this purpose, we choose the mFACE dataset (Aharoni et al., 2023). mFACE contains human evaluations of the XL-Sum test set, covering 45 languages, along 3 dimensions: quality, attribution, and informativeness. While their definition of attribution is the same as ours (i.e., following AIS (Rashkin et al., 2021)), their definitions of quality (Is the summary comprehensible?) and informativeness (Is the summary a good summary of the article?) do not line up exactly with any single one of our questions, a misalignment that we expect to occur in practice given the lack of standardization in summarization human evaluation.
As a result, for each mFACE dimension, we use the SEAHORSE metric for the question that is most similar; attribution clearly aligns with Q4, and for quality and informativeness, we consider Q1 and Q6 to be the closest fit, respectively.
We evaluate on both the full mFACE dataset (all languages) and the 5-language subset that is common to both mFACE and SEAHORSE (en, es, ru, tr, vi). In addition to our baseline models, we also compare to an "upper-bound" mT5_xxl model that has been directly trained on mFACE data (mt5 MFACE). Results are shown in Table 7. In all but one column, mt5 SEAHORSE outperforms the other methods that were not trained on the mFACE data and also performs well on the languages it was not finetuned on. mt5 SEAHORSE even performs comparably to mt5 MFACE on the 5-language subset on all dimensions, and on the attribution dimension on the all-language set. mt5 MFACE performs better on quality and informativeness on the all-language set, as one would expect, since it has seen supervised data from those languages and dimensions whereas mt5 SEAHORSE is applied in a zero-shot setting.

Evaluation on the TRUE benchmark
As in the prior sections, we apply mt5 SEAHORSE without any further finetuning to the datasets in the TRUE benchmark (Honovich et al., 2022), to assess its ability to evaluate attribution on datasets and tasks beyond summarization. In addition to comparing to the majority class and ROUGE-L baselines, we also compare with t5 NLI, a T5-xxl model trained on a mixture of NLI datasets (see Table 8).
Results are shown in Table 8. mt5 SEAHORSE achieves the best results across the summarization datasets, which is expected as many of these datasets are derived from XSum and CNN/DailyMail (Hermann et al., 2015); the first is also a source of the SEAHORSE summaries and the second is a different news summarization dataset. Interestingly, despite only being trained on summarization data, mt5 SEAHORSE performs competitively with t5 NLI on the dialogue datasets (BEGIN, Q², and DialFact), indicating its suitability for evaluating tasks outside of summarization. t5 NLI performs best on the FEVER, VitaminC, and PAWS tasks, which is expected given that the t5 NLI model was trained on these datasets.

Related work
We briefly review other large-scale datasets of human evaluations of summaries that have been released and compare them to SEAHORSE, but note that most focus on annotating the test data, which would lead to test data contamination when training metrics.
SummEval (Fabbri et al., 2021) and REALSumm (Bhandari et al., 2020) are summarization meta-evaluation benchmarks with 12,800 and 7,742 annotations respectively. These benchmarks focus on a single language and a single dataset: English summaries of CNN/DailyMail articles. The only other multilingual summarization evaluation dataset, to the best of our knowledge, is mFACE (Aharoni et al., 2023), which has annotations for 31,500 summaries covering a broader set of languages (45 languages). mFACE focuses on one dataset (XL-Sum) and a smaller set of models than SEAHORSE. In §4 we use mFACE as a comprehensive out-of-domain evaluation set, and view it as complementary to SEAHORSE, which aims to provide large-scale and diverse training data for metrics.

Conclusion
In this work, we present SEAHORSE, a large-scale multilingual, multifaceted dataset for summarization evaluation consisting of 96K human annotations of summaries. Due to its size and scope, SEAHORSE enables the training and evaluation of learnt metrics across several quality dimensions. Our results show that SEAHORSE-trained metrics not only achieve strong performance on our own test set but also generalize to other external and out-of-domain benchmarks: mFACE and TRUE. In the future, we are interested in exploring how SEAHORSE can be used more directly to improve the quality of summarization models and metrics, and we hope this paper and the public release of SEAHORSE enable further research on these topics.

Limitations
The summaries in this work are in 6 languages, and the selection of these languages was based on the number of datasets and articles available for each language. We would like future work to explore the incorporation of low-resource languages, perhaps with the use of cross-lingual and few-shot summarization systems. While the raters we worked with in this project went through several rounds of instructions and training, there is a degree of subjectivity inherent in the 6 text quality evaluation tasks, and human ratings are noisy, as each individual rater may interpret and rate qualities slightly differently. Finally, the mT5-based metrics presented in this work primarily serve as a demonstration of the potential of the SEAHORSE data for developing summarization metrics; they have not been optimized via thorough hyperparameter search, comparison of different modeling architectures or approaches, etc. We hope the dataset and experimental results will provide a starting point for this type of exploration in the future.

Ethics Statement
This work relies on the efforts of human evaluators, who were compensated for their work. The summaries in this work are machine-generated and should not be treated as truth; they may contain misleading or incorrect information. None of the human ratings capture this dimension of the text, as our quality dimensions focus on the relationship between the summary and the source article, not a broader set of information or perspectives. For example, if an article contains a factual error, a summary that contains the same error should be rated as "Yes" for Q4 (attribution) because it is consistent with the article. We used summarization models of varying quality in this work, but all are imperfect and their output should be treated with caution.

C SEAHORSE example summaries and scores
Figure 4 shows 3 summaries from the SEAHORSE dataset, along with ratings for the attribution (Q4) dimension from the human raters, mt5 SEAHORSE , and ROUGE-L.

D Comparison between mT5_large and mT5_xxl
Table 11 compares the results of two versions of mT5 finetuned on SEAHORSE data, mT5_large and mT5_xxl, on the SEAHORSE and mFACE test sets. Scores are generally close between the two models, but mT5_xxl outperforms the large metric in all cases except one.

Figure 2 :
Figure 2: The annotation interface used to collect SEAHORSE. First, Question 1 and the summary are shown to the evaluator. Once they confirm that the summary is comprehensible, Questions 2-3 are shown. Finally, the article and Questions 4-6 are displayed (as pictured above).

Figure 4 :
Figure 4: Example summaries and ratings from the human raters, mt5 SEAHORSE , and ROUGE-L for attribution (Q4).
Table 2 contains the median length of summaries produced by each model, along with two measures of the overlap between the summaries and the source articles.

Table 3 :
The percent of "Yes" responses, broken down by model and question.

Table 4 :
The average pairwise agreement, broken down by language and question.

Table 6 :
Metrics' ability to predict SEAHORSE ratings, measured with Pearson's coefficient (ρ) and the area under the ROC curve (roc). mt5_L SEAHORSE is a finetuned version of mT5_large; the other mt5 metrics finetune mT5_xxl.

Table 7 :
Metrics' ability to predict mFACE ratings, measured with Pearson's coefficient (ρ) and the area under the ROC curve (roc). The asterisk indicates that the associated model was trained on the training portion of the mFACE dataset.

Table 8 :
Metrics' performance on the TRUE benchmark, measured with area under the ROC curve.t5 NLI is a T5-xxl model trained on a mixture of NLI datasets that includes the FEVER, VitaminC, and PAWS training sets (and thus those numbers are indicated with an asterisk).