Is human scoring the best criteria for summary evaluation?

Normally, summary quality measures are compared against quality scores produced by human annotators, and a higher correlation with human scores is considered a fair indicator of a better measure. We discuss observations that cast doubt on this view and attempt to show the possibility of an alternative indicator. Given a family of measures, we explore a criterion for selecting the best measure that does not rely on correlations with human scores. Our observations for the BLANC family of measures suggest that this criterion is universal across very different styles of summaries.


Introduction
The goal of summarization is to convey the important, and only the important, information of a text in a fluent, comprehensible, and concise summary, while preserving factual consistency with the text. Almost all of these desired qualities of a summary are subjective, depending on the background and opinions of the reader, with the arguable exception of factual consistency.
When it comes to choosing a good evaluation measure, correlation with human-assigned quality scores is accepted as the crucial criterion. Gabriel et al. (2020) formulated and explored a framework for judging evaluation measures by their correlation with annotated factual errors. Arguably, factual faithfulness can be annotated objectively, and with a detailed classification of factual errors (Kryscinski et al., 2020; Huang et al., 2020; Vasilyev et al., 2020b). However, other summary qualities are subjective; this forces researchers to be careful in the design and use of human annotations (Bhandari et al., 2020; Fabbri et al., 2020).
Our motivation for seeking criteria alternative or complementary to correlation with human scores comes from the following observations:
1. Annotation scores are subjective and depend on the types of texts and summaries and on the qualification of the annotators. For example, there is a big difference between the expert and crowdsourced scores in (Fabbri et al., 2020).
2. Annotators tend to have a bias favoring anything that helps them assign a score quickly: extractiveness of the summary, and focus on the top of the document (Ziegler et al., 2020).
3. The annotation task itself, assigning quality scores to a summary, is different from how summary quality is valued by a typical user. A real human reader does not have the goal of scoring a summary, but rather uses the summary to guess the content of the text.
In this paper we explore a criterion for selecting an 'optimal' evaluation measure that differs from maximizing correlation with human scores; we provide evidence that the criterion is reliably universal across different kinds of summaries. We also observe how a dubious modification of automated evaluation, imitating a human scorer's behavior, can increase correlation with human scores.

Family of measures and max-help criterion
One of the motivations for this exploration is to take a cue from a typical summary user: a user not trying to assign a score to the summary, but rather trying to guess the content of the full text with the help of the summary. In order to imitate such a user, measures based on text reconstruction or on question answering are the most natural to consider. Following (Vasilyev et al., 2020a), we consider an evaluation measure as a triplet:
1. Language task: The language task to be performed on the text, e.g. text reconstruction or question answering. The language task is generic and intuitively corresponds to the process of a user understanding the text. The models responsible for the task are trained on large datasets unrelated to the problem of summarization.
2. Setup: The setup for getting help from the summary. Somehow, the model should get help from the summary, making it easier to perform the language task on the document.
3. Metric: The specific metric used to measure the boost in the language task performance due to the help from the summary.
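The triplet view can be sketched minimally in code. This is our own illustration, not the BLANC API: the class, the field names, and the filler-based no-help baseline are all assumptions made for the sake of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class EvalMeasure:
    """An evaluation measure as a (language task, setup, metric) triplet."""
    # Language task: performance of a model on the text, given some helper text.
    task: Callable[[str, str], float]        # (text, helper) -> task performance
    # Setup: how the summary is offered as help; here just a no-help filler.
    setup: Dict[str, str]                    # e.g. {"filler": ". . ."}
    # Metric: turns (performance with help, performance without) into a boost.
    metric: Callable[[float, float], float]  # (with_help, without) -> boost

    def score(self, text: str, summary: str) -> float:
        with_help = self.task(text, summary)
        without = self.task(text, self.setup["filler"])
        return self.metric(with_help, without)
```

A measure in this family is then one concrete choice of task, setup, and metric; the families discussed below vary the setup while keeping the task fixed.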
We propose that an optimal measure should, on average, extract maximal help from the summary. Our reasoning is that the measure most capable of extracting help from summaries should be best suited for quantifying that help. Such a measure would be the most similar to an experienced summary user. Thus, if we have a family of measures, then according to this 'max-help' criterion we should choose the measure that on average (across many samples) outputs the highest value of the boost.
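The max-help selection itself is simple to state in code. The setup names and boost values below are illustrative placeholders, not real BLANC outputs:

```python
from statistics import mean

def select_max_help(boosts_by_setup):
    """Pick the setup whose measure extracts, on average, the most help
    from summaries: the one with the highest mean boost across samples."""
    return max(boosts_by_setup, key=lambda s: mean(boosts_by_setup[s]))

# Illustrative per-sample boost values for two candidate setups:
boosts = {
    "gap=2,gap_mask=1": [0.11, 0.09, 0.12],
    "gap=3,gap_mask=1": [0.08, 0.07, 0.10],
}
best = select_max_help(boosts)
```

The point of the criterion is that this selection needs only the measure's own outputs over a corpus, with no human scores involved.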
In this paper we explore the BLANC families of measures, as they leave less ambiguity in the choice of the underlying language model. The two families defined in (Vasilyev et al., 2020a) differ by the setup. The BLANC-help family gets information from the summary by having the model read the summary before reading and reconstructing the text. The BLANC-tune family gets information from the summary by lightly tuning the model on the summary before reading and reconstructing the text. Practically, the evaluation in both families processes the text not all at once but sentence by sentence.
Measures in each of the families, BLANC-help and BLANC-tune, may differ by the parameters defining the setup, or by the metric measuring the boost. Several choices of metric were explored in (Vasilyev et al., 2020b), all giving similar results. The choice of setup parameters also does not make a large difference, except for the frequency of masking the text tokens. In this paper we explore variations in the setup of both the BLANC-help and BLANC-tune families.

Universal trends
The max-help criterion, formulated in the previous section, can be credible only if it does not depend too strongly on the types of texts and summaries.
In order to verify this assumption thoroughly, we considered four types of summaries (and the corresponding texts):
1. CNN summaries from the CNN / Daily Mail dataset (Hermann et al., 2015).
2. Daily Mail summaries from the CNN / Daily Mail dataset.
3. Top two sentences from random daily news.
4. Random two sentences from random daily news.
The random daily news were selected as three random documents per day over one year, with the 'summaries' of a document being its top two and two random sentences. We used 1000 samples for each of the four types of summaries. For the BLANC-help family, we found that for all four datasets the optimal or near-optimal setup (according to the max-help criterion) happens to be at:
1. Interval between masking locations in the text: gap = 2.
2. Number of tokens allowed to be masked at each masking location: gap_mask = 1.
3. Minimal length of a one-word token allowed to be masked is 6 characters: L_normal = 6.
4. Minimal length of the leading token of a composite word is 1 character, i.e. it is always masked: L_lead = 1.
5. Minimal length of any follow-up token of a composite word is 1 character, i.e. it is always masked: L_follow = 1.
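A simplified sketch of how these setup parameters could gate the masking, using the WordPiece convention that follow-up pieces of a composite word start with '##'. The function name and its exact decision logic are our own illustration; the actual BLANC implementation may differ in details:

```python
def mask_positions(tokens, gap=2, gap_mask=1,
                   L_normal=6, L_lead=1, L_follow=1):
    """Return indices of tokens selected for masking.

    Every `gap` tokens a masking location starts; at each location up to
    `gap_mask` tokens may be masked, subject to the token-length thresholds.
    """
    positions = []
    for start in range(0, len(tokens), gap):
        for i in range(start, min(start + gap_mask, len(tokens))):
            tok = tokens[i]
            is_follow = tok.startswith("##")
            is_lead = (not is_follow and i + 1 < len(tokens)
                       and tokens[i + 1].startswith("##"))
            length = len(tok.lstrip("#"))
            if is_follow:
                ok = length >= L_follow       # follow-up piece of a composite word
            elif is_lead:
                ok = length >= L_lead         # leading piece of a composite word
            else:
                ok = length >= L_normal       # normal one-piece word
            if ok:
                positions.append(i)
    return positions
```

With the max-help setup above (gap = 2, gap_mask = 1, L_normal = 6, L_lead = L_follow = 1), all pieces of composite words are eligible, while short one-piece words are skipped.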
It makes sense that a normal word expressed by a single token in the BERT model dictionary is supposedly too common to be masked (unless it is a long enough word). This setup is almost the same as the parameters found in (Vasilyev et al., 2020b) to maximize correlation with human scores, except that there L_normal = 4 and L_follow = 100 (follow-up tokens are never masked). Ignoring the small effects of the token length thresholds, maximizing correlations in this case also maximizes the average BLANC-help value, as was noticed in (Vasilyev et al., 2020b). As we show here, such a lucky coincidence is not the rule: the "max-help" and "max-human" (maximal correlation with human scores) measures do not always coincide.
The setup may be arranged differently, and may be defined to depend on different parameters. But the question we ask is fundamental for any family of measures: does the 'optimal' max-help evaluation measure remain optimal (or at least near-optimal) for different kinds of texts and summaries? Figure 1 provides convincing evidence for a positive answer.
In Figure 1 we consider the average BLANC-help value obtained with supposedly sub-optimal (different from max-help) setups. We consider changes of gap and gap_mask that enforce less frequent and more frequent masking, and changes in the token length thresholds for masking tokens. Remarkably, the average BLANC-help value drops in each case for all four datasets. The token length thresholds have almost no influence, causing a drop of just a few percent. A change in the frequency of masking has a larger effect, leading to a drop of 10%-20%.
For the BLANC-tune family, we found that for all four datasets the max-help setup happens to be at:
1. Interval between masking locations in the text for inference: gap = 3.
2. Number of tokens allowed to be masked at each masking location for inference: gap_mask = 2.
3. The masking at tuning is not random but done 'evenly', the same way as for inference.
4. Interval between masking locations in the text for tuning: gap_tune = 4.
5. Number of tokens allowed to be masked at each masking location for tuning: gap_mask_tune = 3.
6. Minimal length of a one-word token allowed to be masked is 6 characters: L_normal = 6.
7. Minimal length of the leading token of a composite word is 1 character, i.e. it is always masked: L_lead = 1.
8. Minimal length of any follow-up token of a composite word is 1 character, i.e. it is always masked: L_follow = 1.
9. Probability of replacement of a masked token by another random token at tuning is zero: p_replace = 0.
10. Probability of leaving a masked token as it is at tuning is 0.1: p_keep = 0.1.

Figure 1: Drop of mean BLANC-help value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (top two sentences and random two sentences from random news articles). The parameters probed are: 'gap 3/1' is gap = 3 and gap_mask = 1; 'gap 3/2' is gap = 3 and gap_mask = 2; 'toks-normal 5' is L_normal = 5; 'toks-lead 2' is L_lead = 2; 'toks-follow 2' is L_follow = 2.
Notice that p_replace = 0 differs from standard BERT training, which is done with both p_replace and p_keep equal to 0.1. However, both of these probabilities have only a weak influence on BLANC-tune. Figure 2 shows several examples of changes of the setup, and again illustrates that the 'optimal' measure remains optimal across all four datasets.
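A minimal sketch of the per-token decision at tuning, assuming the standard BERT-style masking recipe; the function and its return labels are our illustration, not the BLANC implementation:

```python
import random

def tuning_mask_action(rng, p_replace=0.0, p_keep=0.1):
    """Decide what happens to a token selected for masking during tuning:
    replace it with a random token, keep it as is, or substitute [MASK].
    With the max-help setup (p_replace = 0), tokens are never replaced by
    random tokens, unlike standard BERT pretraining where both
    probabilities are 0.1."""
    r = rng.random()
    if r < p_replace:
        return "random"
    if r < p_replace + p_keep:
        return "keep"
    return "[MASK]"
```

Setting p_replace = 0 simply removes one branch of the standard recipe, which is consistent with the observation that these probabilities influence BLANC-tune only weakly.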

Experts and turkers
If we choose a measure by any criterion that is not optimized for correlation with human scores, then, naturally, such a measure would correlate with human scores less strongly than the 'max-human' (maximum-correlation) measure of the same family. It is interesting to review how these two measures diverge.
Our "max-help" criterion favors the measures from BLANC-help and BLANC-tune described in the previous section. The "max-human" criterion of maximum correlation with human scores favors somewhat different measures from the same families.

Figure 2: Drop of mean BLANC-tune value when parameters differ from optimal. The drop is shown as a fraction of the optimal mean BLANC value. The summaries probed are: CNN and DM (from the CNN/Daily Mail dataset), Top and Rand (top two sentences and random two sentences from random news articles). The parameters probed are: 'gap-infer 2/1' is gap = 2 and gap_mask = 1; 'gap-tune 2/1' is gap_tune = 2 and gap_mask_tune = 1; 'p-replace 0.1' is p_replace = 0.1; 'toks-normal 4' is L_normal = 4; 'tune-rand' is making token masking random rather than even at tuning.
The "max-help" measure was found using CNN/Daily Mail and random daily news data, with no need for human scores. There is no need, for that matter, even for human summaries: as shown in the previous section, using sentences from the text leads to the same choice. The "max-human" measure is from (Vasilyev et al., 2020b). Let us see how the measures correlate with human scores on the SummEval dataset (Fabbri et al., 2020). Table 1 shows correlations of both measures with the average expert scores assigned to four qualities in (Fabbri et al., 2020). Naturally, the correlations of the max-human measure are higher. But if there is a systematic bias in human scores, and if the max-help criterion has any merit, then we may expect that switching from max-human to max-help would decrease correlations with non-expert scores even more strongly, since these supposedly might be even further from the max-help 'truth' than the expert scores. Each summary in (Fabbri et al., 2020) was scored not only by three experts, but also by five 'turkers' (crowdsourced workers). With the switch from the max-human to the max-help measure, the ratio of the Pearson correlation with experts to the correlation with turkers indeed increases: by 10% for relevance, 70% for fluency, and 68% for consistency (yet it decreases by 1% for coherence); the p-values of the correlations with turkers also rise above 0.05 for all qualities. Similarly, the ratio of the Spearman correlation with experts to the correlation with turkers increases by 15% for relevance, 47% for fluency, and 77% for consistency (yet decreases by 6% for coherence), and again the p-values for turkers rise above 0.05. This exercise gives hope for the max-help criterion, or some similar universal principle, not dependent on maximizing correlations with human scores.
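The expert-to-turker correlation ratio compared above can be sketched with stdlib-only code. The function names are ours, the score lists below are illustrative placeholders rather than SummEval data, and the sketch omits the p-values discussed in the text:

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation coefficient (no p-value)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def expert_turker_ratio(measure, experts, turkers):
    """Ratio of a measure's correlation with expert scores to its
    correlation with turker scores: the quantity compared in the text
    for the max-human vs max-help measures."""
    return pearson(measure, experts) / pearson(measure, turkers)
```

A ratio above 1 means the measure tracks the expert scores more closely than the turker scores; the text compares how this ratio changes when switching from the max-human to the max-help measure.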

Limited comparison with text
After reading a summary, an annotator may choose not to review the whole text carefully, but to consider in detail only the part of it that attracts attention through a quick glance or a quick read. We can imitate this by using only the most relevant part of the document in calculating BLANC. By the most 'relevant' part we mean the part most related to the summary. In modifying BLANC this way, we would supposedly move in the direction opposite to the one described in the previous sections: it is reasonable to expect that correlation with human scores will increase, but this would be a dubious 'improvement' of BLANC as a measure.
Indeed, it is easy to increase the correlation of BLANC with the average expert score for the dataset of 1600 samples of SummEval (Fabbri et al., 2020). We can calculate BLANC separately for each sentence of the text, and select the n sentences with the highest BLANC. We can then consider these selected sentences as the 'text' to deal with, and calculate BLANC on it. Compared to working with the full text, the Spearman correlation with the average expert score increases, as shown by the thin lines in Figure 3. In this and the other figures throughout this section, all p-values are below 0.05. We can imagine a human expert paying more attention to several (say three or five) of the most 'promising' sentences of the text. In evaluating relevance, this might not be very different from working with the full text. But for the other qualities (coherence, consistency, fluency) the correlation increases.
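The restriction procedure can be sketched as follows. Here `blanc_fn` is a stand-in for a real BLANC call, and the helper name and its contiguous-window variant are our own illustration:

```python
def restrict_to_top_sentences(sentences, blanc_fn, summary, n=3,
                              contiguous=False):
    """Keep only the n sentences for which the summary helps most.

    `blanc_fn(summary, sentence) -> float` scores one sentence.
    If `contiguous`, keep instead the length-n window with the highest
    total score, imitating a reader skimming one contiguous passage.
    """
    scores = [blanc_fn(summary, s) for s in sentences]
    if contiguous:
        best = max(range(len(sentences) - n + 1),
                   key=lambda i: sum(scores[i:i + n]))
        return sentences[best:best + n]
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # keep original order
```

BLANC would then be recalculated on the returned sentences alone, treating them as the 'text'.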
Naturally, for a human it is easier to review a contiguous piece of text rather than separated pieces, even if this might diminish the legitimacy of the evaluation of all qualities, including relevance. And, no surprise, BLANC for such a contiguous part of the text correlates with human scores even better, as shown by the thick lines in Figure 3. Figure 4 illustrates the same trends when the resulting BLANC is calculated for each selected sentence separately and then averaged over the sentences. Figure 5 shows the increase of correlations when the text is restricted not by the number of sentences but by a threshold on the BLANC of a sentence. Selecting a part of the text for comparison with the summary is used in the SUPERT multi-document evaluation measure as a tool for creating a 'reference summary' from each document and then evaluating the summary against the created references. In the context of BLANC here, the selection of a part of the text is done differently and has a clear interpretation: instead of estimating the usefulness of the summary for guessing the whole text, we estimate how much the summary would help to guess only the most 'relevant' part of the text. 'Relevant' means the part of the text for which the summary turned out to be most helpful. We suspect that this is equivalent to using only the most promising (for an annotator who has read the summary) part of the text. This does not necessarily mean that the evaluation measure is improved, even though the correlation with human scores is stronger.

Conclusion
In this paper, we critically reviewed the assumption that maximal correlation with human scores defines the best evaluation measure for summarization, and we provided observations supporting our skepticism. We stated the motivation and made the case for an alternative, or at least complementary, criterion for choosing an optimal summary evaluation measure from a family of measures. We suggested the maximal average usefulness extracted from the summary as such a criterion. We provided observations that the criterion is fairly universal across very different kinds of summaries.