Finding a Balanced Degree of Automation for Summary Evaluation

Human evaluation for summarization tasks is reliable but brings in issues of reproducibility and high costs. Automatic metrics are cheap and reproducible but sometimes poorly correlated with human judgment. In this work, we propose flexible semi-automatic to automatic summary evaluation metrics, following the Pyramid human evaluation method. The semi-automatic Lite2Pyramid retains the reusable human-labeled Summary Content Units (SCUs) for reference(s) but replaces the manual work of judging SCUs' presence in system summaries with a natural language inference (NLI) model. The fully automatic Lite3Pyramid further substitutes SCUs with automatically extracted Semantic Triplet Units (STUs) via a semantic role labeling (SRL) model. Finally, we propose in-between metrics, Lite2.xPyramid, where we use a simple regressor to predict how well the STUs can simulate SCUs and retain SCUs that are more difficult to simulate, which provides a smooth transition and balance between automation and manual evaluation. Comparing with 15 existing metrics, we evaluate human-metric correlations on 3 existing meta-evaluation datasets and our newly collected PyrXSum (with 100/10 XSum examples/systems). We show that Lite2Pyramid consistently has the best summary-level correlations; Lite3Pyramid works better than or comparably to other automatic metrics; and Lite2.xPyramid trades off small correlation drops for larger reductions in manual effort, which can reduce costs for future data collection.


Introduction
Evaluating the quality of summaries is a challenging task. Human evaluation is usually regarded as the gold standard. Among different human evaluation methods, Pyramid (Nenkova and Passonneau, 2004) has been perceived as an objective and reliable protocol and was used by early summarization benchmarks, e.g., TAC (DBL, 2008, 2009). Given one or several reference summaries of an example, human assessors first exhaustively extract Summary Content Units (SCUs), each of which contains a single fact, from the reference(s), and then check whether each SCU is present in a system summary. Figure 1 shows an example of human-labeled SCUs. Despite its reliability, manual evaluation is usually: (1) not reproducible, as results may change when different evaluators are involved, making it hard to compare results across papers; (2) expensive, in terms of both time and cost. Thus, it is impractical to apply human evaluation extensively for model selection (e.g., to choose the best checkpoint); instead, people usually treat it as an additional quality verification step. Aiming to work as a proxy for humans, many automatic metrics have been proposed (Lin, 2004; Tratz and Hovy, 2008; Giannakopoulos and Karkaletsis, 2011; Yang et al., 2016; Zhang et al., 2019; Deutsch et al., 2021). However, most of them cannot reliably substitute human evaluation due to unstable performance across datasets (Bhandari et al., 2020), weak to moderate correlations with human judgment (Fabbri et al., 2021), or indicating topic similarity more than information overlap (Deutsch and Roth, 2021).
In this work, we want to combine human and automatic evaluations and find a balance between reliability and reproducibility (plus expense). Recall that in the Pyramid method (Nenkova and Passonneau, 2004), the SCUs for reference summaries only need to be annotated once and can then be fixed. This means SCUs can come with the datasets and are reusable for evaluating different systems. Hence, what hinders this method from being reproducible is its second step: asking humans to judge the presence of SCUs in system summaries. Whenever we have a new summarizer, we need to collect human labels for this step. Therefore, we propose to retain the reusable SCUs but replace the human effort in the second step with a neural model. Essentially, people are answering whether an SCU is entailed by the summary, which is closely related to the Natural Language Inference (NLI) task, i.e., judging whether a hypothesis is entailed by a premise. Many NLI datasets are available (Bowman et al., 2015; Williams et al., 2018a; Thorne et al., 2018; Nie et al., 2020), and recent NLI models have achieved close-to-human-level performance. Hence, we use a pretrained NLI model and finetune it on some in-domain gold labels of SCUs' presence. Then, we replace humans with the finetuned model, so that the evaluation results are reproducible as long as the same model is used. Meanwhile, it can run automatically during development to guide model selection, and the evaluation cost is dramatically reduced. Shapira et al. (2019) propose LitePyramid to simplify the standard Pyramid method via crowdsourcing. Following but different from their work, we additionally automate the presence annotation, and hence we call our method Lite2Pyramid.
Lite2Pyramid still requires human effort to extract SCUs from reference summaries, and this step is usually considered the more difficult one. Early benchmarks, e.g., TAC (DBL, 2008, 2009), are small, with fewer than 100 examples in the evaluation set, for which it is already expensive to manually collect SCUs. However, current popular summarization datasets, e.g., CNN/DM (Hermann et al., 2015), contain more than 10K evaluation examples, and hence we want to simulate SCUs via an automatic method for such large-scale datasets. For this, we make use of Semantic Role Labeling (SRL), which can automatically decompose a sentence into semantic triplets, e.g., subject-verb-object, and we take each triplet as a pseudo-SCU, which we call a Semantic Triplet Unit (STU). Figure 1 illustrates the difference between SCUs and STUs. Although STUs do not always contain a single fact and some information might be misrepresented, we find that they can reasonably simulate SCUs and lead to a fully automatic metric, Lite3Pyramid.
Lastly, instead of using either all human-labeled SCUs or all automated STUs, we investigate balanced trade-offs in between, e.g., using half SCUs and half STUs. A naive way is to randomly sample some reference sentences and substitute their SCUs with STUs. However, we find this is unstable and sometimes even works worse than using all STUs. More reasonably, we design an active learning (Settles, 2012) inspired selection method to help decide which sub-parts of the dataset are more worthy of obtaining expensive SCUs for. For this, we develop a regressor to predict the "simulation easiness" of each reference sentence: if a sentence is too complex to be well represented by STUs, we ask humans to annotate SCUs for it; otherwise, we apply automatic SRL. We call this method Lite2.xPyramid, since it provides a smooth, flexible transition from Lite2Pyramid to Lite3Pyramid and balances reliability with cost.
To comprehensively evaluate the quality of metrics, we not only use 3 existing meta-evaluation datasets (TAC2008 (DBL, 2008), TAC2009 (DBL, 2009), and REALSumm (Bhandari et al., 2020)) but also newly collect PyrXSum with 100 XSum (Narayan et al., 2018) test examples plus summaries produced by 10 systems. Next, we compare our new metrics to 15 existing automatic metrics on these 4 meta-evaluation setups for both system-level and summary-level correlations with human Pyramid scores. We find that Lite2Pyramid consistently has the best summary-level correlations and is reliable as an out-of-the-box metric. Lite3Pyramid also mostly performs better than or competitively with other metrics. Lastly, the regressor-based Lite2.xPyramid can substantially reduce annotation effort for only small correlation drops; e.g., on TAC2008 and TAC2009, it trades off only 0.01 absolute summary-level Pearson correlation and no system-level correlation for a 50% SCU reduction.

Related Works & Background
Each example in a summarization dataset contains one or several source document(s) and one or several human-written reference(s). System-generated summaries are evaluated by comparing them to the references (i.e., reference-based) or directly scored (i.e., reference-free). This evaluation process is critical and directly affects our development choices.
Human (or manual) evaluation has been considered the gold standard. Early benchmarks (DBL, 2008, 2009) conducted three human evaluations: Responsiveness, Linguistic Quality, and Pyramid. The first two ask humans to directly rate the overall responsiveness or linguistic quality on a Likert scale. Following this, some works collect ratings for different aspects, e.g., relevance and readability (Paulus et al., 2018; Kryscinski et al., 2019; Fabbri et al., 2021). However, these ratings may suffer from raters' subjectivity. Pyramid (Nenkova and Passonneau, 2004) has been perceived as a more objective method, and it is reference-based. It has two steps: pyramid creation and system evaluation. In the first step, humans exhaustively find the Summary Content Unit (SCU) contributors from references, where each contributor describes a single fact; contributors with the same meaning are merged into a single SCU; then each SCU is weighted by how many contributors it has, equal to the number of references in which it is found. In the second step, each SCU is manually checked for its presence in the system summary, and the Pyramid score is the normalized sum of the present SCUs' weights (essentially, a recall score). Passonneau (2010) normalizes it by the total weight of the best possible summary. Recently, Shapira et al. (2019) proposed LitePyramid. It removes SCU merging and weighting, allowing SCUs of the same meaning to co-exist, and they show that the evaluation can be reliably conducted by crowdsourcing workers.
Automatic metrics trade off the reliability of human evaluation for reproducibility, low cost, and fast speed. Many automatic metrics have been introduced, the majority of which are reference-based. Some metrics measure n-gram overlap (Papineni et al., 2002; Lin, 2004), among which ROUGE (Lin, 2004) remains the most widely adopted metric to this day. Other works compute similarity over n-gram graphs (Giannakopoulos and Karkaletsis, 2011; Giannakopoulos et al., 2008) or distributions (Lin et al., 2006). Since exact n-gram matching is too rigid, METEOR (Banerjee and Lavie, 2005; Denkowski and Lavie, 2014) provides flexibility via stemming, synonyms, etc., and recently, a few metrics have enabled "soft" matching through contextualized word embeddings (Zhao et al., 2019; Clark et al., 2019; Zhang et al., 2019). However, Deutsch and Roth (2021) point out that n-gram based metrics indicate topic similarity more than information overlap. Structural evaluation metrics have also been proposed beyond n-grams. BEwT-E (Tratz and Hovy, 2008) decomposes the system summary and the reference(s) into syntactic units and computes their similarities, and decomposed-ROUGE (Deutsch and Roth, 2021) computes ROUGE for each syntactic category. APES (Eyal et al., 2019) and QAEval (Deutsch et al., 2021) are QA-based metrics that assume similar answers will be obtained from similar system summaries and reference(s).
Automatic Pyramid methods have also been proposed (Yang et al., 2016; Hirao et al., 2018; Gao et al., 2019). They usually decompose both the system summary and the references into smaller units (e.g., Elementary Discourse Units) and compare the two lists of units. Differently, our Lite3Pyramid only decomposes the reference summaries into semantic triplet units (STUs), and we use NLI to judge the presence of each STU in the system summary, which is closer to the original Pyramid's procedure and leads to better correlations with human scores (refer to Section 5). Peyrard et al. (2017) propose a learned metric, S3, trained to directly predict human Pyramid or Responsiveness scores based on ROUGE, FrameNet features, etc., which is similar to how we finetune the NLI model with human labels of SCUs' presence. Xu et al. (2020) is distantly related to our work in representing texts via SRL, but there SRL is used to weight the content in the source document(s). Besides, some reference-free metrics have been introduced for summary quality estimation (Xenouleas et al., 2019; Gao et al., 2020; Vasilyev et al., 2020) or faithfulness evaluation (Durmus et al., 2020; Wang et al., 2020).
Semi-automatic evaluation was introduced by Zhou et al. (2007). They automatically decompose both the system summary and the reference(s) into semantic units and then ask humans to match/align the two lists of units. In contrast, our semi-automatic Lite2Pyramid retains the reusable SCUs while automatically judging the SCUs' presence in the system summary (via NLI).

Lite2Pyramid
Lite2Pyramid is a semi-automatic metric that retains human-labeled Summary Content Units (SCUs) to represent the reference summaries of a data example i, i.e., {SCU_ij}_{j=1..N_i}, where N_i is the total number of SCUs from all reference summaries. The original Pyramid (Nenkova and Passonneau, 2004; Passonneau, 2010) assumes multiple references are available (e.g., the TAC datasets (DBL, 2008, 2009) have 4 references per example). Therefore, each SCU comes with a weight, {w_ij}_{j=1..N_i}, representing the number of reference summaries in which the SCU is found. To evaluate a particular system summary s_i, the standard Pyramid method manually checks each SCU's presence, sums up the weights of present SCUs, and normalizes:

$$\mathrm{Pyramid}(s_i) = \frac{\sum_{j=1}^{N_i} w_{ij}\,\mathbb{1}[\mathrm{SCU}_{ij}\ \text{present in}\ s_i]}{\text{best possible score}}$$

The best possible score is the highest sum of weights the summary can obtain with the same number of present SCUs (details can be found in Passonneau (2010)). Differently, LitePyramid (Shapira et al., 2019) takes a union of SCUs from all reference summaries with duplication (we use SCU* to distinguish it from the de-duplicated SCU used above) and then samples the same number (K) of SCUs for every data example, hence:

$$\mathrm{LitePyramid}(s_i) = \frac{1}{K}\sum_{j=1}^{K} \mathbb{1}[\mathrm{SCU}^*_{ij}\ \text{present in}\ s_i]$$

Without weighting, this method also works in single-reference situations. Different from this method, we keep the exhaustive set (instead of a fixed-size sample) of SCUs for each example (also used by Bhandari et al. (2020)). Importantly, we replace the human effort of checking SCUs' presence with a Natural Language Inference (NLI) model f_NLI's entailment prediction. Using e to denote entailment, our metric can be written as:

$$\mathrm{Lite^2Pyramid}(s_i) = \frac{\sum_{j=1}^{N_i} w_{ij}\, f_{\mathrm{NLI}}(e \mid s_i, \mathrm{SCU}_{ij})}{\sum_{j=1}^{N_i} w_{ij}} \qquad (2)$$

Note that multiplying by the weights and dividing by the sum of the weights is equivalent to repeating SCU_ij for w_ij times, which shows how we treat SCUs as an exhaustive set with duplication. For single-reference datasets (CNN/DM or XSum), the weights are all 1. Also, the above equations all compute summary-level scores. To get one single score for a system, we simply average across examples, e.g., (1/|D|) * sum_{i in D} Lite2Pyramid(s_i). The f_NLI function can be implemented in four different ways, denoted as p_3c, l_3c, p_2c, l_2c, and explained below. Following the standard 3-class setting of NLI tasks, the NLI model predicts whether SCU_ij is entailed by, neutral to, or contradicted by the summary s_i. Hence, we can use either the output probability of the entailment class p_3c(e) or the predicted 1/0 entailment label l_3c(e) as the function f_NLI.
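As a minimal sketch (not the released implementation), Equation (2) can be computed as below. The `f_nli` argument stands in for the NLI model's entailment score; the `toy_nli` word-overlap function is purely illustrative and is not how the real metric judges presence:

```python
from typing import Callable, List

def lite2pyramid(
    summary: str,
    scus: List[str],
    weights: List[float],
    f_nli: Callable[[str, str], float],
) -> float:
    """Eq. (2): sum_j w_j * f_NLI(e | summary, SCU_j) / sum_j w_j.

    f_nli(premise, hypothesis) should return an entailment score in [0, 1],
    e.g., p2c(e) from a finetuned NLI model, or a hard 0/1 label.
    """
    assert scus and len(scus) == len(weights)
    num = sum(w * f_nli(summary, scu) for scu, w in zip(scus, weights))
    return num / sum(weights)

def toy_nli(premise: str, hypothesis: str) -> float:
    # Toy stand-in: an SCU counts as "present" iff all its words occur in
    # the summary. A real system would use NLI entailment probabilities.
    return 1.0 if set(hypothesis.lower().split()) <= set(premise.lower().split()) else 0.0

score = lite2pyramid(
    "catherine nevin was seen walking in dublin",
    ["catherine nevin was seen", "catherine nevin was jailed"],
    [2.0, 1.0],
    toy_nli,
)  # -> 2/3: the weight-2 SCU is judged present, the weight-1 SCU is not
```

Because the weights appear in both the numerator and the denominator, an SCU with weight w behaves exactly like w duplicated unweighted SCUs, matching the exhaustive-set view above.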
However, existing NLI datasets (Bowman et al., 2015; Williams et al., 2018b; Thorne et al., 2018; Nie et al., 2020) have different data distributions and domains from summarization data; hence, models trained on these datasets may not perform well in judging the presence of SCUs. Therefore, we finetune the pretrained NLI model on human-labeled SCUs plus presence labels. Since humans only give 2-class labels (present or not present), we adapt the model to perform two-way classification. Specifically, we add up the logits of the neutral (n) and contradiction (c) classes as the logit of the "not present" label:

$$p_{2c}(e) = \frac{\exp(\mathrm{logit}_e)}{\exp(\mathrm{logit}_e) + \exp(\mathrm{logit}_n + \mathrm{logit}_c)}$$

Again, we can use p_2c(e) or l_2c(e) as f_NLI after finetuning. In our experiments, we call the NLI model pretrained on NLI datasets "zero-shot" because it has not seen summarization data. Empirically, we find that when using the zero-shot NLI model, l_3c works best, while after finetuning, p_2c usually works best.
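The 3-class-to-2-class collapse above amounts to a two-way softmax over an entailment logit and a combined "not present" logit. A small sketch (the threshold for the hard label l_2c is our assumption, not stated in the text):

```python
import math

def p2c_entail(logit_e: float, logit_n: float, logit_c: float) -> float:
    """Collapse a 3-class NLI head into a 2-class "present" probability:
    neutral and contradiction logits are summed into one "not present"
    logit, then a 2-way softmax is applied."""
    z_present = math.exp(logit_e)
    z_absent = math.exp(logit_n + logit_c)
    return z_present / (z_present + z_absent)

def l2c_entail(logit_e: float, logit_n: float, logit_c: float,
               threshold: float = 0.5) -> float:
    """Hard 1/0 presence label from the same probability (assumed 0.5 cut)."""
    return 1.0 if p2c_entail(logit_e, logit_n, logit_c) >= threshold else 0.0
```

For instance, with logits (1.0, 0.5, 0.5) the "not present" logit is 1.0 as well, so p2c(e) = 0.5 exactly.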

Lite3Pyramid
Lite3Pyramid fully automates Lite2Pyramid by also simulating the human-annotated SCUs with automatically extracted semantic triplets. We use a Semantic Role Labeling (SRL) model (Carreras and Màrquez, 2005; Palmer et al., 2010; He et al., 2017; Shi and Lin, 2019) to achieve this goal. SRL determines the latent predicate-argument structure of a sentence, e.g., who did what to whom. As shown in Figure 1, the SRL model identifies several frames for each sentence, and each frame has one verb and a few arguments. For each frame, we keep the verb and any arguments before the verb unchanged, then we enumerate the arguments after the verb to form a list of triplets {(ARG_before, V, ARG_after_m)}_{m=1..M}, where M is the number of arguments after the verb. We concatenate the three elements in each triplet to form a short sentence, because an SCU is a short sentence and we want to resemble it as much as possible. We call these short sentences Semantic Triplet Units (STUs). For example, as illustrated in Figure 1, based on the 4 frames identified by SRL, we extract 9 STUs from the reference.
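The frame-to-STU step can be sketched as follows. The dictionary format for a frame is hypothetical (an actual SRL tool such as AllenNLP's predictor returns tagged spans that would need light post-processing into this shape):

```python
from typing import Dict, List

def frame_to_stus(frame: Dict) -> List[str]:
    """Given one SRL frame (verb + ordered argument spans), keep the verb
    and everything before it fixed, and pair them with each argument after
    the verb, yielding one short subject-verb-object-style sentence (STU)
    per post-verb argument."""
    before = " ".join(frame["args_before"])
    stus = []
    for arg in frame["args_after"]:
        parts = [p for p in (before, frame["verb"], arg) if p]
        stus.append(" ".join(parts))
    return stus

frame = {
    "args_before": ["Catherine Nevin"],
    "verb": "was seen",
    "args_after": ["on the bus", "with a pal", "in Dublin"],
}
stus = frame_to_stus(frame)
# -> ['Catherine Nevin was seen on the bus',
#     'Catherine Nevin was seen with a pal',
#     'Catherine Nevin was seen in Dublin']
```

One frame with M post-verb arguments thus yields M STUs, which is why 4 frames can produce 9 STUs in the Figure 1 example.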
Since one entity can be referred to by pronouns or different names in the summary, we also apply Coreference Resolution (Lee et al., 2018) to improve the simulation quality. As shown in Figure 1, "Catherine Nevin" and "62-year-old" are identified as coreferent, so we use "Catherine Nevin" as the subject of STUs and add an additional STU "Catherine Nevin is 62-year-old". In our experiments, we only apply coreference resolution for REALSumm because, empirically, on the TAC datasets we find that applying it works worse than not applying it, and PyrXSum has one-sentence summaries in which coreference hardly appears. Although STUs seem to reasonably simulate SCUs for the example in Figure 1, they have limitations, especially when the sentence is syntactically complicated, e.g., with many modifiers, clauses, or complements (refer to Section 5 for more discussion).
After we obtain the STUs from all reference summaries, we score a system summary s_i by:

$$\mathrm{Lite^3Pyramid}(s_i) = \frac{1}{M_i}\sum_{j=1}^{M_i} f_{\mathrm{NLI}}(e \mid s_i, \mathrm{STU}_{ij})$$

where M_i is the total number of STUs. Note that there are no weights because we extract STUs from every reference summary and take a union, which allows STUs of the same meaning to co-exist.

Lite2.xPyramid
As discussed so far, human-annotated SCUs are accurate yet expensive, whereas automatically extracted STUs are cheap yet sometimes erroneous. The next natural question is how to find a balance between them. One way is to randomly replace 50% of sentences' SCUs with STUs, but a more intuitive way is to make the decision based on the "easiness" of simulating a sentence's SCUs by STUs. If a sentence is unlikely to be well represented by STUs, we can ask humans to label SCUs for it; otherwise, we can use STUs to reduce cost. This is similar to how active learning (Settles, 2012) chooses which training examples to collect human labels for. We define simulation easiness as the average simulation accuracy of the sentence's SCUs, using ROUGE-1-F1 (R1_F1) (Lin, 2004) to measure how accurately each SCU_j is matched by its closest STU:

$$\mathrm{Acc}_j = \max_{m} \mathrm{R1}_{F1}(\mathrm{SCU}_j, \mathrm{STU}_m)$$

Then, the easiness of a sentence with N_sent SCUs is written as Easiness_sent = (1/N_sent) * sum_{j=1}^{N_sent} Acc_j. The higher the easiness score, the more accurately the STUs resemble the SCUs.
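A minimal sketch of the easiness computation, reading Acc_j as each SCU's best unigram-F1 match over the STUs (a plain unigram F1 is used here, without the stemming/stopword options of the full ROUGE package):

```python
from collections import Counter
from typing import List

def rouge1_f1(reference: str, candidate: str) -> float:
    """Plain unigram-overlap F1 between two short texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def simulation_easiness(scus: List[str], stus: List[str]) -> float:
    """Average over a sentence's SCUs of each SCU's best R1-F1 score
    against the sentence's STUs."""
    if not scus or not stus:
        return 0.0
    return sum(max(rouge1_f1(scu, stu) for stu in stus)
               for scu in scus) / len(scus)
```

A sentence whose SCUs are all exactly reproduced among its STUs gets an easiness of 1.0; easiness drops as the STUs diverge from the SCUs.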
After we obtain these gold easiness scores, we train a regressor to predict the score based on sentence complexity features. As mentioned above, a sentence's syntax can indicate its simulation difficulty. Therefore, we obtain the constituency parse tree (Joshi et al., 2018) of each sentence and define the following features: (1) sentence length; (2) linearized parse tree length; (3) parse tree depth; (4) sentence length / parse tree depth; (5) the counts of each of the 65 nonterminal tokens (e.g., NNP). In total, we represent each sentence with a 69-dim feature vector. Then, we train an XGBoost (Chen and Guestrin, 2016) regressor to predict the simulation easiness by minimizing the mean squared error. Given this regressor, we propose to replace the top 0.x scored sentences' SCUs with STUs, leading to Lite2.xPyramid. For example, Lite2.5Pyramid (illustrated in Figure 1) means that we use STUs for the top 50% scored sentences and SCUs for the other half.
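The feature extraction can be sketched as below, assuming the parse is given as a linearized bracketed tree (e.g., the string form produced by a constituency parser). Only the four scalar features are shown in full; nonterminal counts are collected generically rather than from a fixed 65-label vocabulary:

```python
import re
from collections import Counter
from typing import Dict

def complexity_features(sentence: str, parse: str) -> Dict[str, float]:
    """Sentence-complexity features for an easiness regressor, from a
    linearized constituency tree like '(S (NP (NNP John)) (VP (VBD ran)))'."""
    sent_len = len(sentence.split())
    tree_len = len(parse.split())          # linearized tree length
    depth, max_depth = 0, 0                # tree depth = max paren nesting
    for ch in parse:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    labels = Counter(re.findall(r"\((\S+)", parse))  # nonterminal counts
    feats = {
        "sent_len": float(sent_len),
        "tree_len": float(tree_len),
        "tree_depth": float(max_depth),
        "len_per_depth": sent_len / max(max_depth, 1),
    }
    feats.update({f"cnt_{label}": float(c) for label, c in labels.items()})
    return feats
```

In the full metric these features would be arranged into a fixed 69-dim vector and fed to an XGBoost regressor trained with a squared-error objective.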

Evaluation
Correlation with human scores. Following the standard meta-evaluation strategies used in previous works (Peyrard et al., 2017; Bhandari et al., 2020; Deutsch et al., 2021), we evaluate metrics by two types of correlation with gold human scores. System-level correlation evaluates how well the metric can compare different summarization systems. We denote the correlation measure as K, human scores as h, the metric as m, and generated summaries as s. Assuming there are N examples and S systems in the meta-evaluation dataset, the system-level correlation is defined as:

$$r_{sys} = K\Big(\Big[\tfrac{1}{N}\textstyle\sum_{i=1}^{N} h(s_i^j)\Big]_{j=1}^{S},\ \Big[\tfrac{1}{N}\textstyle\sum_{i=1}^{N} m(s_i^j)\Big]_{j=1}^{S}\Big)$$

Summary-level correlation answers whether the metric can reliably compare summaries generated by different systems for the same document(s). Using the same notation, this correlation is written as:

$$r_{sum} = \frac{1}{N}\sum_{i=1}^{N} K\Big(\big[h(s_i^j)\big]_{j=1}^{S},\ \big[m(s_i^j)\big]_{j=1}^{S}\Big)$$

We use Pearson r or Spearman ρ as the correlation measure K. Pearson measures linear correlation, while Spearman measures ranking correlation. We find that the exhaustive-set-based computation (replacing f_NLI in Equation 2 by gold labels) has a close-to-perfect correlation with TAC's official scores; REALSumm also uses this computation, as reflected by the gold score in Figure 1.

Models. We use the pretrained RoBERTa-large (Liu et al., 2019) based NLI model released by Nie et al. (2020), which has been trained on multiple NLI datasets. We continually finetune this model on the gold SCUs plus SCU-presence labels, always for 2 epochs. For SRL, Coreference Resolution, and the Constituency Parser, we use the out-of-the-box tools provided by AllenNLP (Shi and Lin, 2019; Lee et al., 2018; Joshi et al., 2018). See the complete implementation details in Appendix A.3.
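The two correlation levels can be sketched as below, using a hand-rolled Pearson r as the measure K (scipy's pearsonr/spearmanr would normally be used instead):

```python
from typing import List, Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def system_level(h: List[List[float]], m: List[List[float]]) -> float:
    """h[i][j] / m[i][j]: human / metric score of system j on example i.
    Correlate the per-system averages across the S systems."""
    n, s = len(h), len(h[0])
    h_sys = [sum(h[i][j] for i in range(n)) / n for j in range(s)]
    m_sys = [sum(m[i][j] for i in range(n)) / n for j in range(s)]
    return pearson(h_sys, m_sys)

def summary_level(h: List[List[float]], m: List[List[float]]) -> float:
    """Average over examples of the per-example correlation across systems."""
    n = len(h)
    return sum(pearson(h[i], m[i]) for i in range(n)) / n
```

A metric can rank systems well on average (high system-level r) yet still rank the summaries of a single example poorly, which is why both levels are reported.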

Human-Metric Correlation Results
Since we find that finetuning the NLI model with in-domain presence labels is greatly beneficial, following Peyrard et al. (2017), we evaluate by 5-fold cross-validation. For each dataset, we split it into 5 folds, finetune the NLI model on 4 folds, test on the remaining one, and repeat 5 times. For fair comparison, we report the 5-fold average correlations of both our metrics and the 15 metrics we compare to. Instead of random splitting, we split the data by examples or by systems, aiming to check generalizability across examples or systems. E.g., if we split REALSumm by examples, each fold has summaries of 20 examples; when split by systems, each fold has summaries generated by 5 systems. Table 1 shows our 5-fold (split by examples) cross-validation results. Firstly, it can be observed that our Lite2Pyramid always has the best or close-to-best correlations; in particular, it has 0.08 to 0.16 higher summary-level correlations than the best metrics we compare to. This demonstrates the advantage of semi-automatic evaluation, which dramatically improves reliability without losing reproducibility. Meanwhile, it indicates that the finetuned NLI model can generalize to new data examples and works reasonably well as a proxy of human judgment. In contrast, Lite2Pyramid-0, which uses a non-finetuned NLI model, usually works considerably worse than Lite2Pyramid, which indicates the importance of in-domain finetuning. It is surprising that Lite2Pyramid-0 works better than or similarly to Lite2Pyramid on PyrXSum. We conjecture that because PyrXSum is relatively small, finetuning does not make a big difference.
Secondly, our Lite3Pyramid has the best correlations compared to the other automatic metrics, except on PyrXSum; again, its advantage is more prominent for summary-level correlation (around 0.03 to 0.05 better). Its failure on PyrXSum is caused by the limitations of SRL. XSum's reference summary sentences usually have many modifiers, adverbial phrases/clauses, or complements, which increases the difficulty of decomposing them into STUs. E.g., for the summary "Netherlands midfielder Wesley Sneijder has joined French Ligue 1 side Nice on a free transfer", humans annotate the following 5 SCUs: "Wesley Sneijder is a midfielder", "Wesley Sneijder comes from Netherlands", "Wesley Sneijder has joined French Ligue 1 side", "Wesley Sneijder has joined Nice", and "Wesley Sneijder has been on a free transfer". However, since SRL frames are centered around verbs, it can only extract two STUs: "Netherlands midfielder Wesley Sneijder joined French Ligue 1 side Nice" and "Netherlands midfielder Wesley Sneijder joined on a free transfer". On average, humans label 4.8 SCUs per PyrXSum summary, whereas the number is only 2.8 for STUs. Hence, a better semantic unit decomposer needs to be designed to improve Lite3Pyramid's accuracy.
Lastly, Lite2.xPyramid alleviates the problem mentioned above by deferring complex sentences to humans to annotate SCUs for. As shown in Table 1, Lite2.5Pyramid, which saves half the human effort by substituting 50% of sentences' SCUs with STUs, always has a correlation reduction of less than half of the difference between Lite2Pyramid and Lite3Pyramid, and sometimes even has better system-level correlations than Lite2Pyramid. The full Lite2.xPyramid curves are shown in Figure 2, where the x-axis is the percentage of STUs (higher means less human effort involved) and the y-axis is the summary-level Pearson correlation (Figure 4 in the Appendix shows system-level correlations). We can see that Lite2.xPyramid offers a smooth transition from the semi-automatic Lite2Pyramid to the automatic Lite3Pyramid. More importantly, compared to randomly selecting sentences (yellow dashed lines), our regressor-based selection achieves a slower correlation reduction, i.e., for the same amount of human effort saved, our method retains higher metric quality. Plus, this curve gives people flexible choices per their budget.
Due to space limitations, the 5-fold (split by systems) cross-validation results are in Table 4 in the Appendix. The trends mostly hold: Lite3Pyramid works better than or comparably to other automatic metrics, except for the system-level correlations on REALSumm and PyrXSum. And Lite2.xPyramid again nicely bridges Lite2Pyramid and Lite3Pyramid and works better than random replacement. Differently, however, Lite2Pyramid does not obtain the best system-level correlations on REALSumm and PyrXSum, which may indicate a bigger generalization challenge across different systems.
Takeaway: Lite2Pyramid consistently has the best summary-level correlations and the best system-level correlations in most cases. The automatic Lite3Pyramid also mostly works better than other automatic metrics. Lite2.xPyramid provides flexible and balanced degrees of automation per budget.

Out-of-the-Box Generalization
We release the finetuned NLI models and the pretrained sentence regressors for future use, so that they can serve as out-of-the-box evaluation metrics for any summarization task. A natural question to ask, then, is how the metrics will perform on a new summarization task. To better estimate out-of-the-box performance, we simulate out-of-the-box situations by training the NLI model and the regressor on some dataset(s) and then evaluating the metrics on the other dataset(s). For example, in the last big row (starting with TAC08+TAC09+REALSumm) of Table 2, we finetune the NLI model and train the regressor on the entire TAC08+TAC09+REALSumm data, then evaluate our metrics on PyrXSum only. Meanwhile, we also compare to other metrics. Different from the numbers in Table 1, numbers in Table 2 are calculated on the entire meta-evaluation set instead of being averaged over 5 folds. It can be observed from Table 2 that our Lite2Pyramid retains its advantage in most out-of-the-box situations, especially for summary-level correlation. Though Lite3Pyramid does not always outperform the best metrics, it stays competitive. In addition, Lite2.5Pyramid retains its feature of trading off less than 50% of the correlation for saving 50% of the human effort. Surprisingly, learning from more data does not perform better: for PyrXSum, learning from all three other datasets (TAC08+TAC09+REALSumm) gets significantly worse performance than learning from TAC08 only or TAC08+TAC09. We conjecture that the difference between REALSumm (originating from CNN/DM (Hermann et al., 2015)) and PyrXSum (originating from XSum (Narayan et al., 2018)) leads to a "distribution shift", which causes the performance drop. Besides, though new metrics have been proposed, ROUGE is still the dominant evaluation metric in the summarization literature. However, based on our comparison, ROUGE is not the best evaluation choice in most cases, while METEOR (Banerjee and Lavie, 2005) and the learning-based metric S3 (Peyrard et al., 2017) have fairly good correlations with human judgment. Overall, our automatic Lite3Pyramid is on a par with them, having the best performance in 4 cases (4 underlined scores in Table 2). We provide support for our metrics through our GitHub repository, and we will also incorporate them into the SacreROUGE library (Deutsch and Roth, 2020).

Table 2: Out-of-the-box generalization results. In each column, the bold numbers are the best and the underlined numbers are the best out of the automatic metrics.

Conclusion
We propose to combine manual effort and automation for summary evaluation. We introduce the semi-automatic Lite2Pyramid, which gains reproducibility by replacing part of the human effort with an NLI model. Building on it, the automatic Lite3Pyramid is proposed by decomposing references via SRL. Plus, we propose a simple yet effective regressor to decide which sentences are more worthy of labeling SCUs for, leading to flexible transition metrics, Lite2.xPyramid. Evaluated on four meta-evaluation datasets against 15 other automatic metrics, Lite2Pyramid consistently has the best summary-level correlations; Lite3Pyramid also performs better or competitively; and Lite2.xPyramid offers flexible degrees of automation, and its regressor will provide useful expense-saving guidance for future datasets.

Then, we collect the SCUs' presence labels for each system summary on Amazon Mechanical Turk. Figure 3 illustrates the data annotation instructions and interfaces shown to crowdsourcing workers. The summaries usually contain only one sentence. We estimate it takes around 30-45 seconds for a native English speaker to finish one HIT. Following Bhandari et al. (2020), we pay $0.15 per HIT, which is respectably higher than the U.S. federal minimum wage requirement. Meanwhile, we select annotators that are located in the U.S., have an approval rate greater than 98%, and have at least 10,000 approved HITs.
We collect 4 responses per summary (100 * 10 * 4 HITs), and in total 104 workers were involved. After annotation, we filter out the annotations of a noisy worker who did 210 HITs but disagreed with the majority 72% of the time. After this filtering, we obtain an average inter-annotator agreement (Krippendorff's alpha (Krippendorff, 2011)) of 0.73. Following Bhandari et al. (2020), we use the majority vote to mark the presence of an SCU and break ties toward "not present". Table 3 shows the gold Pyramid scores of different systems. Judging the presence of SCUs is usually considered a task with little ambiguity, reflected by the high inter-annotator agreements achieved by REALSumm (0.66) (Bhandari et al., 2020) and our PyrXSum (0.73). To further verify this, on REALSumm, instead of taking the majority vote, we randomly sample 1 out of the 4 responses as the gold label. We conduct this for 3 rounds and test Lite2Pyramid's correlations with these 3 sets of human labels. We get 0.89/0.63, 0.90/0.63, and 0.90/0.63 system/summary-level Pearson correlations, respectively. They are close to each other and also close to the results obtained from the majority vote (0.89/0.64). This means workers give rather consistent SCU-presence labels.

Table 4: 5-fold (split by systems) cross-validation results. In each column, the bold numbers are the best and the underlined numbers are the best out of the automatic metrics. All Lite2Pyramid-0 numbers are based on f_NLI = l_3c. All other numbers of our metrics are based on f_NLI = p_2c, except that starred (*) numbers are based on f_NLI = l_2c.
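The majority-vote aggregation with its "not present" tie-break is a one-liner; a minimal sketch:

```python
from typing import List

def scu_presence(votes: List[int]) -> int:
    """Aggregate binary presence votes (1 = present, 0 = not present):
    majority wins, and an exact tie is broken toward 0 ("not present")."""
    yes = sum(votes)
    no = len(votes) - yes
    return 1 if yes > no else 0
```

With 4 annotators per summary, a 2-2 split therefore counts the SCU as not present.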

A.4 Additional Results & Ablations
Cross-Validation Results. As a complement to Figure 2 in the main paper, Figure 4 shows the Lite2.xPyramid curves for system-level correlations. It can be observed that, compared to random replacement, our Lite2.xPyramid always achieves higher or the same correlations when the same amount of human effort is reduced. Besides, Table 4 shows our 5-fold cross-validation (split by systems) results.