How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

Manual evaluation is essential to judge progress on automatic text summarization. However, our survey of recent summarization system papers reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries' linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations, and show that the best choice of evaluation method can vary from one aspect to another. Our survey also finds that study parameters such as the overall number of annotators and the distribution of annotators to annotation items are often not fully reported, and that subsequent statistical analysis ignores grouping factors arising from one annotator judging multiple summaries. Using our evaluation experiments, we show that the total number of annotators can have a strong impact on study power and that current statistical analysis methods can inflate type I error rates up to eight-fold. In addition, we highlight that, for the purpose of system comparison, the current practice of eliciting multiple judgements per summary leads to less powerful and reliable annotations given a fixed study budget.


Introduction
Current automatic metrics for summary evaluation have low correlation with human judgements on summary quality, especially for linguistic quality evaluation (Fabbri et al., 2020). As a consequence, manual evaluation is still vital to properly compare the linguistic quality of summarization systems.
While the Document Understanding Conferences (DUC) established a standard manual evaluation procedure (Dang, 2005), our comprehensive survey of recent works in text summarization reveals a wide array of different evaluation questions and methods in current use. Furthermore, DUC procedures were designed for a small set of expert judges, while current evaluation campaigns are often conducted by untrained crowd-workers. The design of the manual annotation, specifically the overall number of annotators as well as the distribution of annotators to annotation items, has a substantial impact on the power, reliability and type I error rates of subsequent statistical analysis. However, most current papers (see Section 2) do not consider the interaction of annotation design and statistical analysis. We investigate the optimal annotation methods, design and statistical analysis of summary evaluation studies, making the following contributions: 1. We conduct a comprehensive survey of current practices in manual summary evaluation in Section 2. Often, important study parameters, such as the total number of annotators, are not reported. In addition, statistical significance is either not assessed at all or assessed with tests (t-test or one-way ANOVA) that lead to inflated type I errors in the presence of grouping factors (Barr et al., 2013).
In summarization evaluation, grouping factors arise whenever one annotator rates multiple summaries.
Excluded from the analysis are sentence summarization and headline generation papers, although most of the points we make hold for their evaluation campaigns as well. Summarization evaluation papers that do not present a new system but concentrate on (sometimes large-scale) system comparisons are discussed in the Related Work section instead. Lists of all included and excluded papers are given in the Supplementary Material, which also contains the exact evaluation parameters per paper in a spreadsheet.
We assess both what studies ask annotators to judge, as well as how they elicit and analyse judgements. The survey was conducted by one of the authors: for most papers, the categories they fell into were obvious. For difficult cases (unclear specifications, papers that do not fit the normal mould) the two authors discussed the categorisations. Survey results are given in Table 1. Further details about the choices made in the survey, including category groupings/definitions and what is included under Other, can be found in Appendix B. As many papers conduct more than one human evaluation (for example on different corpora), we also list individual annotation studies (a total of 95).
Of the systems that do have human evaluation, many focus on content, including informativeness, coverage, focus, and relevance. Where linguistic quality is evaluated, most focus on general questions about fluency/readability, with a smaller number of papers evaluating coherence and repetition.
In the rest of this section we focus on the three aspects of evaluation we cover in this paper: how judgements are elicited, how they are analysed statistically, and how studies are designed.

Methods
The majority of evaluations are conducted using Likert-type judgements, with the second most frequent method being rank-based annotations, including pairwise comparison. Best-worst scaling (BWS) is a specific type of ranking-oriented evaluation that requires annotators to specify only the first and last rank (Kiritchenko and Mohammad, 2017). QA (Narayan et al., 2018) is used for content evaluation only. This motivates us to compare both Likert and ranking annotations in Section 4.1.

Statistical Analysis
If a significance test is conducted, most papers analyse their data either using ANOVA or a sequence of paired t-tests. Both tests are based on the assumption that judgements (or pairs of judgements, in the case of the paired t-test) are sampled independently from each other. However, in almost all studies, annotators give judgements on more than one summary from the same system. Thus the resulting judgements are only independent if we assume that all annotators behave identically. Given that prior work (Gillick and Liu, 2010; Amidei et al., 2018), as well as our own reliability analysis in Section 4.1, shows that crowd-workers in particular tend to disagree about judgements, this assumption does not seem warranted. As a consequence, traditional significance tests are at high risk of inflated type I error rates. This is well known in the broader field of linguistics (Barr et al., 2013), but is disregarded in summarization evaluation. We show in Section 5 that this is a substantial problem for current summarization evaluations and suggest alternative analysis methods.

Design
Most papers only report the number of documents in the evaluation and the number of judgements per summary. This, however, is not sufficient to describe the design of a study, as it gives no indication of the overall number of annotators who made these judgements. A study with 100 summaries and 3 annotations per summary can, at one extreme, mean that 3 annotators made all judgements, or, at the other, that 300 distinct annotators participated. Only 26 of the 95 studies describe their annotation design in full, almost all of which use designs in which a small number of annotators judge all summaries. Only 6 of 49 crowdsourced studies report the full design.
We show in Section 5 that a low total number of annotators aggravates type I error rates under improper statistical analysis. In Section 6 we further show that, with proper analysis, a low total number of annotators leads to less powerful experiments. Almost all analysed papers choose designs with multiple judgements per summary. However, we show in Section 6.2 that, for the purpose of system ranking, this leads to a loss of reliability as well as power when compared to a study with the same budget and only one annotation per summary.

Coherence and Repetition Annotation
To elicit summary judgements for analysis, we conduct studies on two linguistic quality tasks. In the first, we ask annotators to judge the coherence of the summaries, while in the second we ask for the repetitiveness of the summary. We select these two tasks over the more frequent Fluency task as we found in preliminary investigations that many recent summarization systems already produce highly fluent text, making them hard to differentiate. We do not evaluate Overall and Content as both require access to the input document, which differentiates these questions from the linguistic quality evaluation of the summaries.
For both tasks, we conduct one study using a seven-point Likert scale (Likert) and another using a ranking-based annotation method (Rank), where annotators rank summaries for the same document from best to worst. Screenshots of the interfaces for both approaches and full annotator instructions are given in Appendix A.
Corpus and Systems. Mirroring a common setup (see Section 2), we select four abstractive summarization systems and the reference summaries (ref) for analysis.
• The pointer generator summarizer (PG) (See et al., 2017), which is still often used as a baseline for abstractive summarization
• The abstractive sentence rewriter (ASR) of Gehrmann et al. (2018), a strong summarization system that does not rely on external pretraining for its generation step
• Seneca (Sharma et al., 2019), a system that combines explicit modelling of coreference information with an external coherence model
• BART (Lewis et al., 2020), a transformer network that achieves state-of-the-art results on CNN/DM.
We randomly sample 100 documents from the popular CNN/DM corpus (Hermann et al., 2015) with corresponding summaries from all systems to form the item set for all our studies.

Study design. We ensure a sufficient total number of annotators by using a block design. We separate our corpus into 20 blocks of 5 documents and include all 5 summaries for each document in the same block, which results in 5 × 5 = 25 summaries per block.
All items in a block were judged by the same set of three annotators. No annotator was allowed to judge more than one block. This results in a total of 3 × 20 = 60 annotators and 1500 judgements per task. Figure 1 shows a schematic overview of our design, which balances the need for a large enough annotator pool with a sufficient task size to be worthwhile to annotators.
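As a sketch of the block design described above (our own illustration, not the authors' actual tooling; function and variable names are hypothetical), the assignment of documents, systems and annotators to blocks can be constructed programmatically:

```python
from itertools import product

SYSTEMS = ["ref", "PG", "ASR", "Seneca", "BART"]

def make_blocks(doc_ids, block_size=5, annotators_per_block=3):
    """Partition documents into blocks; each block contains the summaries
    of all systems for its documents and gets its own disjoint annotator set."""
    blocks = []
    for b, start in enumerate(range(0, len(doc_ids), block_size)):
        docs = doc_ids[start:start + block_size]
        blocks.append({
            # 5 documents x 5 systems = 25 items per block
            "items": list(product(docs, SYSTEMS)),
            # annotators are unique to their block (no overlap across blocks)
            "annotators": [f"block{b}-annotator{i}"
                           for i in range(annotators_per_block)],
        })
    return blocks

blocks = make_blocks(list(range(100)))
n_judgements = sum(len(b["items"]) * len(b["annotators"]) for b in blocks)
# 20 blocks, 25 items each, 60 distinct annotators, 1500 judgements in total
```

With 100 documents this reproduces the counts from the study: 20 blocks of 25 items, 60 distinct annotators, and 1500 judgements per task.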
We recruited native English speakers from the crowdsourcing platform Prolific (prolific.com) and carefully adjusted the reward to be no lower than £7.50 per hour based on pilot studies. Summaries (or sets of summaries for Rank) within a block were presented in random order.

Ranking vs. Likert
Table 2 shows the average Likert scores and the average rank for all systems, tasks and annotation methods. We use mixed-effect ordinal regression to identify significant score differences (see Section 5 for details). Both annotation methods provide compatible system rankings for the two tasks, though for the repetition task both methods struggle to differentiate between systems. If we were interested in the true ranking, we could conduct a power analysis given some effect size of interest and elicit additional judgements to improve the ranking. However, as we are concerned with the process of system evaluation and not the evaluation itself, we do not conduct any further analysis.
In the remainder of this section, we focus on the reliability of the two methods as well as their cost-effectiveness.
We find that on coherence, Rank is more reliable than Likert, though not on repetition. An investigation of the Likert score distributions for both tasks in Figure 2 shows that coherence scores are relatively well differentiated, whereas a majority of repetition judgements give the highest score of 7, indicating no repetition at all in most summaries. We speculate that overall agreement suffers because ranking summaries with similarly low levels of repetition (and not allowing ties) is potentially arbitrary.

Cost-efficiency
While more reliable annotation methods allow for fewer annotations, the cost of a study is ultimately determined by the work-time that needs to be invested to achieve a reliable result. To investigate this, we randomly sample between 2 and 19 blocks from our annotations and compute the total time annotators spent to complete each sample. We also compute the Pearson correlation of the system scores in each sample with the scores on the full annotation set. We relate time spent to the similarity between sample and full score in Figure 3.
For coherence, Rank is more efficient than Likert. On repetition, the lower reliability of Rank also results in lower efficiency. However, with additional annotation effort, its reliability becomes on par with Likert. This is a consequence of the overall lower annotator workload for Rank.

Statistical Analysis

The two most common significance tests in summarization studies, ANOVA and the t-test (see Table 1), both assume that judgements (or pairs of judgements, in the case of the t-test) are independent. This is, however, not true for most study setups, as a single annotator typically judges multiple summaries and multiple summaries are generated from the same input document. Both documents and annotators are thus grouping factors in a study that must be taken into account by the statistical analysis. Generalized mixed effect models (Barr et al., 2013) offer a solution but have, to the best of our knowledge, not been used in summarization evaluation at all. We choose a mixed effect ordered logit model to analyse our Likert data for both tasks. We will show that traditional analysis methods have a substantially elevated risk of type I errors, i.e. differences between systems found in manual analysis might be overstated.
Method. The ordered logit model we employ can be described as follows:

logit(P(Y ≤ c)) = μ_c − Xβ − Z_a u_a − Z_d u_d

where P(Y ≤ c) is the probability that the score of a summary is at most c, μ_c is the threshold coefficient for level c, β is the vector of fixed effects, and u_a, u_d are the vectors of annotator- and document-level random effects respectively, where u_a, u_d are both drawn from normal distributions with mean 0. Finally, X, Z_a, Z_d are design matrices for the fixed and random effects. As the only fixed effect, we use a dummy-coded variable indicating the system that produced the summary, with ref as the reference level. We estimate both random intercepts and slopes for both documents and annotators, following the advice of Barr et al. (2013) to always specify the maximal random effect structure. In practical terms, this means that we allow annotators to differ both in how harsh or generous they are in their assessment and in which system they prefer. Similarly, we allow system performance to vary per document, leading to both generally higher or lower scores and different system rankings per document. We fit all models using the ordinal R package (Christensen, 2019) and compute pairwise contrasts between the parameters estimated for each system using the emmeans package (Lenth et al., 2018) with Tukey adjustment.
To demonstrate the problem of ignoring the grouping factors, we can now sample artificial data from the model distribution and try to analyse it with inappropriate tests. This Monte Carlo simulation is similar to the more general analysis of Barr et al. (2013).
We set β to 0 so that all systems perform equally well at the population level and only keep the (zero-mean) document and annotator effects in the model. The false-positive rate of statistical tests on this artificial data should thus be no higher than the significance level. We then repeatedly apply both the t-test and the approximate randomization test (ART) (Noreen, 1989), a non-parametric test, to samples drawn from the model and determine the type I error rate at p < 0.05. We set the number of documents to 100 and demand 3 judgements per summary to mirror a common setup in manual evaluation. We then vary the total number of annotators between 3 and 300 by changing how many summaries a single annotator judges.
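The mechanism behind the inflation can be illustrated with a much simpler Monte Carlo sketch than the full ordinal model (this simplification is ours, not the paper's exact setup): two systems that are identical at the population level, normally distributed scores, and a random per-annotator preference that shifts one system's scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_trial(n_annotators=3, n_docs=100, pref_sd=1.0, noise_sd=1.0):
    """Simulate two systems that are equally good on average, but where
    each annotator has a random latent preference for system B."""
    prefs = rng.normal(0.0, pref_sd, n_annotators)
    a = rng.normal(0.0, noise_sd, (n_annotators, n_docs))
    b = rng.normal(0.0, noise_sd, (n_annotators, n_docs)) + prefs[:, None]
    # Common but flawed practice: pool all judgements and ignore
    # which annotator produced them.
    return stats.ttest_rel(a.ravel(), b.ravel()).pvalue

# Fraction of trials where the (true) null hypothesis is rejected.
type1 = np.mean([one_trial() < 0.05 for _ in range(500)])
```

With only three annotators, the shared per-annotator preference acts as a systematic shift that the pooled paired t-test mistakes for a system difference, so the rejection rate lands far above the nominal 0.05.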
Results. We report results for the model estimated for Likert in Figure 4. Ignoring the dependencies between samples leads to inflated type I error rates, whether using the t-test or the ART. This is especially severe when only a few annotators judge the whole corpus. In the extreme case with only three annotators in total, the null hypothesis is rejected in about 40% of trials at a significance level of 0.05 in both tasks. Even our original design with 60 annotators still sees an increase in the type I error rate of about 3%. Only if every annotator judges a single document and annotations are averaged per document are samples independent, keeping the real error rate at the nominal 0.05 level. This design, however, is unrealistic given that annotators must be recruited and instructed.
We suggest two solutions to this problem: either use mixed effect models, or aggregate the judgements so that samples become independent. The latter allows the assumptions of simpler tools such as ART to be met. In our study, we could average the judgements in every block to obtain independent samples. This is only possible, however, if the design of the study considers this problem in advance: a crowdsourcing study that allows annotators to judge as many samples as they like is unlikely to result in such a design.

Study Design and Study Power
When conducting studies for system comparison, we are interested in maximizing their power to detect differences between systems. For traditional analysis, the power is ultimately determined by the number of documents (or judgements, when no aggregation takes place) in the study. However, when the analysis takes into account individual annotators, power additionally depends on the total number of annotators and how evenly they participated in the study. This gives additional importance to the design of evaluation studies. In this section, we thus focus on how to optimize studies for power and reliability.
We first show that for well-powered experiments, we need to ensure that a sufficient total number of annotators participates in a study. In the second part of this section, we then demonstrate that studies can improve their power by not eliciting multiple judgements per summary.

Overall Number of Annotators
To demonstrate the difference in power caused by varying the total number of annotators in a study, we determine the power for a design with the same total number of documents and judgements per document but different total numbers of annotators.
We run the experiment both with regression and with ART with proper aggregation of dependent samples as described in Section 5. We refer to the latter as ARTagg to differentiate it from normal ART. For each design, we repeatedly sample artificial data from the Likert model and apply both tests to the data. The process is the same as in Section 5, except that we do not set β to zero and count acceptances of the null hypothesis. We again set the number of documents to 100 and the number of repeat judgements to 3 and vary the total number of annotators between 3 and 75 by varying the number of blocks between 1 and 25. We test for power at a significance level of 0.05.
Figure 5 shows how power drops sharply when only a few annotators take part in the study. This is in line with the theoretical analysis of Judd et al. (2017) that shows that the number of participants is crucial for power when analysing studies with mixed effect models. The drop is worse for ARTagg as fewer annotators mean fewer independent blocks and thus a lack of data points for the analysis.
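A matching sketch shows the power side of the same phenomenon (again our simplification of the paper's simulation; all names are hypothetical): with a fixed total budget of judgements and a proper analysis that aggregates to one value per annotator, power depends strongly on how many annotators share the work.

```python
import numpy as np
from scipy import stats

def power(n_annotators, total_pairs=300, effect=0.5,
          pref_sd=1.0, noise_sd=1.0, n_trials=500, seed=0):
    """Estimate the power of a one-sample t-test over per-annotator mean
    score differences, holding the total number of judgement pairs fixed."""
    rng = np.random.default_rng(seed)
    per_annotator = total_pairs // n_annotators
    hits = 0
    for _ in range(n_trials):
        # True system difference plus a random per-annotator preference.
        prefs = rng.normal(0.0, pref_sd, n_annotators)
        diffs = (effect + prefs[:, None]
                 + rng.normal(0.0, noise_sd, (n_annotators, per_annotator)))
        # Aggregating per annotator yields independent samples.
        per_annot_means = diffs.mean(axis=1)
        hits += stats.ttest_1samp(per_annot_means, 0.0).pvalue < 0.05
    return hits / n_trials

# Same budget of 300 judgement pairs, spread over 3 vs. 60 annotators.
few, many = power(3), power(60)
```

Because annotator preference variance does not average out within a single annotator, three annotators provide only three effectively independent samples regardless of how many documents each judges, while sixty annotators with the same budget detect the effect reliably.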

Annotator Distribution
Most studies elicit multiple judgements per summary, following best practices in NLP for corpus design (Carletta, 1996). While this leads to better judgements per document, the goal of many summarization evaluations is a per-system judgement.
For this kind of study, Judd et al. (2017) show that for mixed models which include both annotator and target (in our case, input document) effects, a design where targets are nested within annotators, i.e. every annotator has their own set of documents, is always more powerful than one where they are (partially) crossed with annotators, i.e. a study with multiple annotations per summary, given the same total number of judgements. In fact, power could be maximized by having each annotator judge the summaries for only a single, unique document. However, this is usually not realistic due to the fixed costs of annotator recruitment and instruction. We demonstrate on our dataset how both reliability and power are affected by nested vs. crossed designs.
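The distinction between the two designs can be made concrete with a small sketch (our illustration; the function names are hypothetical): a nested design assigns each annotator a disjoint set of documents, while a crossed design has several annotators judge every document.

```python
def nested_design(annotators, documents):
    """Nested: documents are partitioned, so each annotator judges a
    unique set and no document is shared between annotators."""
    per = len(documents) // len(annotators)
    return {a: documents[i * per:(i + 1) * per]
            for i, a in enumerate(annotators)}

def crossed_design(annotators, documents, judgements_per_doc=3):
    """(Partially) crossed: every document is judged by several
    annotators, so annotators share documents."""
    return {d: [annotators[(i + offset) % len(annotators)]
                for offset in range(judgements_per_doc)]
            for i, d in enumerate(documents)}

# Same budget either way: 60 annotators x 5 documents each (nested), or
# 100 documents x 3 judgements each (crossed) = 300 judgements in total.
nested = nested_design([f"a{i}" for i in range(60)], list(range(300)))
crossed = crossed_design([f"a{i}" for i in range(60)], list(range(100)))
```

Under the nested assignment each judgement comes from a distinct annotator-document pair, which is what removes the document grouping factor from the analysis.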
To compare reliability, we randomly sample both nested and crossed designs from our full study and then compute the Pearson correlation of the system scores given by this smaller annotation set with the system scores given by the full study. As shown in Figure 6, nested samples are always at least as good, and mostly better, at approximating the results of the full annotation compared to a crossed sample with the same annotation effort.
We also conduct a power analysis for regression and ARTagg, comparing nested and crossed designs. We again turn to Monte Carlo simulation on the Likert models and sample nested and crossed designs with the same total number of judgements (i.e. the same cost). We keep the block size constant at 5 and vary the number of annotators between 3 and 60. For nested designs, we drop the document-level random effects from the ordinal regression, as the document is no longer a grouping factor in nested designs.
Figure 7 shows that nested designs always have a power advantage over crossed designs, especially when few judgements are elicited. We also find that ART can be used to analyse the data without loss of power when there are enough independent blocks. This might be attractive as ART is less computationally expensive than ordinal regression.

Related Work
Human evaluation has a long history in summarization research. This includes work on the correlation of automatic metrics with human judgements (Lin, 2004; Liu and Liu, 2008; Graham, 2015; Peyrard and Eckle-Kohler, 2017; Gao et al., 2019; Sun and Nenkova, 2019; Xenouleas et al., 2019; Zhao et al., 2019; Fabbri et al., 2020; Gao et al., 2020) and on improving the efficiency of the annotation process (Nenkova and Passonneau, 2004; Hardy et al., 2019; Shapira et al., 2019). The impact of annotator inconsistency on system ranking has been studied by both Owczarzak et al. (2012) and Gillick and Liu (2010). To the best of our knowledge, we are the first to investigate the implications of annotator variance for the statistical analysis and design of summarization system comparison studies.
For general NLG evaluation, van der Lee et al. (2019) establish best practices for evaluation studies. We extend their advice by conducting experimental studies specifically for summary evaluation. In addition, we show the importance of study design and of considering annotator effects in analysis on real-world data. The advice of Mathur et al. (2017) regarding annotation sequence effects should be taken into account in addition to our suggestions.
Method Comparison. Ranking has been shown to be effective in multiple NLP tasks (Kiritchenko and Mohammad, 2017; Zopf, 2018), including NLG quality evaluation (Novikova et al., 2018). In this work we confirm this for coherence evaluation, although we find evidence that ranking is less efficient for repetition, where many documents do not exhibit any problems. We also add the dimension of annotator workload, as a primary determinant of cost, to the analysis of the comparison.
Multiple methods have been suggested to reduce study cost by sample selection (Sakaguchi et al., 2014; Novikova et al., 2018; Sakaguchi and Van Durme, 2018; Liang et al., 2020) or by integration with automatic metrics (Chaganty et al., 2018). These efforts complement ours, as care still needs to be taken in analysis and study design.
Recently, rank-based magnitude estimation has been shown to be a promising method for eliciting judgements in NLG tasks, offering a combination of ranking and rating approaches (Novikova et al., 2018; Santhanam and Shaikh, 2019). However, it has not yet found widespread use in the summarization community. While magnitude estimation has been shown to reduce annotator variance, our advice regarding experimental design and grouping factors in statistical analysis applies to this method as well, as annotators can still systematically differ in which systems they prefer.
Statistical analysis. With regard to the statistical analysis of experimental results, Dror et al. (2018) give advice for hypothesis testing in NLP. However, they do not touch on the problem of dependent samples. Rankel et al. (2011) analyse TAC data, show the importance of accounting for input documents in the statistical analysis of summarizer performance, and suggest the use of the Wilcoxon signed-rank test for analysis. Sadeqi Azer et al. (2020) argue that p-values are often not well understood and advocate Bayesian methods as an alternative. While the analysis in our paper is frequentist, the mixed effect model approach can also be integrated into a Bayesian framework. Kulikov et al. (2019) model annotator bias in such a framework but do not account for differences in annotator preferences. In work conducted in parallel to ours, Card et al. (2020) show that many human experiments in NLP underreport their experimental parameters and are underpowered, including those using Likert-type judgements. Their simulation approach to power analysis is very similar to our experiments. In addition to their analysis, we show that ignoring grouping factors in the statistical analysis of human annotations leads to inflated type I error rates. We also show that power can be increased by choosing nested over crossed designs with the same budget. The problem of underpowered studies has also been tackled outside of NLP by Brysbaert (2019).
For psycholinguistics, Barr et al. (2013) demonstrate how the generalizability of results is negatively impacted by ignoring grouping factors in the analysis. Mixed effect models have found use in NLP before (Green et al., 2014; Cagan et al., 2017; Karimova et al., 2018; Kreutzer et al., 2020), but to the best of our knowledge they have not been used in summary evaluation.

Conclusion
We surveyed the current state of the art in manual summary quality evaluation and investigated the methods, statistical analysis and design of these studies. We distill our findings into the following guidelines for manual summary quality evaluation:

Method. Both ranking and Likert-type annotations are valid choices for quality judgements. However, we present preliminary evidence that the optimal choice of method depends on task characteristics: if many summaries are similar for a given aspect, Likert may be the better option.
Analysis. Analysis of elicited data should take into account variance in annotator preferences to avoid inflated type I error rates. We suggest the use of mixed effect models, which can explicitly take into account the grouping factors in studies. Alternatively, traditional tests can be used with proper study design and aggregation.
Study Design. Study designers should control the number of annotators and how many summaries each individual annotator judges to ensure sufficient study power. Additionally, to ensure the reliability of results, studies should report the design and the total number of annotators in addition to the number of documents and repeat judgements. Studies with repeat judgements on the same summary do not provide any advantage for system comparison and are less reliable and powerful than nested studies of the same size.
We hope that these findings will help researchers plan their own evaluation studies by allowing them to allocate their budget better. We also hope that our findings will encourage researchers to take more care in the statistical analysis of their results, preventing misleading conclusions caused by ignoring differences in annotator behaviour.

A Interface Screenshots
We show screenshots of the instructions for both annotation methods and tasks in Figure 8 and interfaces in Figure 9.

B.1 Categories
While most categories are self-explanatory, we elaborate on some of the decisions we made during the survey in this section.
Evaluation Questions. We allow a single study to include multiple evaluation questions, as long as all questions are answered by the same annotators and use the same method. We make no distinction between informativeness, coverage, focus and relevance and summarize them under Content. Similarly, we summarize fluency, grammaticality and readability under Fluency. Other includes:
• One study with a specialized set of evaluation questions evaluating the usefulness of a generated related-work summary
• One study of polarity in a sentiment summarization context
• One study where annotators were asked to identify the aspect a summary covers in the context of review summarization
• Two studies evaluating formality and meaning similarity of reference and system summary
• One study evaluating diversity
• One study conducting a Turing test
• One study asking paper authors whether they would consider a sentence part of a summary of their own paper.

B.2 Survey Files
All papers we considered for the survey are listed in the supplementary material in the file all papers.yaml by their id in the ACL anthology bib-file. The 58 SDS/MDS system papers that contain new human evaluation studies, and are thus included in the survey, are listed in the category with human eval.
For the sake of completeness, we further list summarization papers we did not include in our survey. We separate them into the following categories:
• no human eval: 47 SDS/MDS system papers without human evaluation
• sentsum: 27 sentence summarization and headline generation papers
• non system: 34 summarization papers that do not introduce new systems, such as surveys, opinion pieces and evaluation studies
• other: 10 papers that conduct summarization with either non-textual input or non-textual output

Figure 1 :
Figure 1: Schematic representation of our study design. Rows represent annotators, columns documents. Each blue square corresponds to a judgement of the summaries of all five systems for a document. Every rectangular group of blue squares forms one block.

Figure 2 :
Figure 2: Score distribution of Likert for both tasks. Each data point shows the number of times a particular score was assigned to each system.

Figure 3 :
Figure 3: Time spent on annotation (in minutes) vs. correlation with the full-sized score. We gather annotation times in buckets with a width of ten minutes and show the 95% confidence interval for each bucket.

Figure 4 :
Figure 4: Relation of type I error rates at p < 0.05 to the total number of annotators for different designs, all with 100 documents and 3 judgements per summary. We conduct the experiment with both the t-test and the approximate randomization test (ART). We show results both with averaging results per document and without any aggregation. We run 2000 trials per design. The red line marks the nominal error rate of 0.05.

Figure 5 :
Figure 5: Power for 100 documents and 3 judgements per summary with different numbers of total annotators.

Figure 6 :
Figure 6: Reliabilities of nested vs. crossed designs for Rank and Likert for both tasks.

Figure 7 :
Figure 7: Power for p < 0.05 of nested and crossed designs for ARTagg and regression. The x-axis shows the number of judgements elicited, the y-axis the power level.
Figure 8: Screenshots of the Annotator Instructions.

Table 1 :
Our survey for 58 system papers with manual evaluation studies

Table 2 :
Results of our annotation experiment. Numbers in brackets indicate the rank of a system for a given annotation method. Multiple ranks in brackets indicate that the systems at these ranks are not statistically significantly different (p ≥ 0.05, mixed-effects ordinal regression).

Table 3 :
Krippendorff's α with ordinal level of measurement and split-half reliability for both annotation methods on the two tasks.