Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

With the recent undeniable advancement in reasoning abilities of large language models (LLMs) like ChatGPT and GPT-4, there is a growing trend of using LLMs for various tasks. One such use is as an alternative evaluation metric for complex generative tasks, which generally demand expensive human judges to complement traditional automatic metrics on various evaluation dimensions such as fluency and consistency. In this work, we conduct extensive analysis to investigate the stability and reliability of LLMs as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not yet ready to replace human evaluators due to significant limitations. That is, LLM evaluators rate each candidate system inconsistently and are dimension-dependent. They also struggle to compare candidates with close performance and become less reliable as summary quality improves, obtaining lower correlations with humans. In other words, with better abstractive summarization systems being introduced at a fast pace, LLMs may produce misleading and unreliable evaluations.


Introduction
The desire for inexpensive and fast automatic metrics has never stopped growing. In certain tasks like extractive summarization, where full source sentences are selected to appear in the summaries, simple n-gram overlap metrics against the "gold" summaries like ROUGE (Lin, 2004) or BLEU (Papineni et al., 2002) may work well because the correct answer space is narrow. However, for more open tasks like abstractive summarization, there are countless equally good summaries and the "gold" summaries become less important. Although many neural-based metrics, such as BERTScore and BARTScore (Zhang et al., 2020b; Yuan et al., 2021), are advocated as more human-aligned, the evaluation criteria are also becoming increasingly complex. As a result, abstractive summarization may not be sufficiently evaluated with automatic metrics (Owczarzak et al., 2012; Nenkova, 2006), and often requires extensive human evaluations as complements (Yang et al., 2023; Welbl et al., 2021). However, human evaluations often come with hefty costs and slow iteration cycles, while also being difficult to reproduce and standardize due to small sample sizes and potential human biases (Shen et al., 2022b; Liu et al., 2022).

* Chenhui is under the Joint PhD Program between Alibaba and National University of Singapore.
Recent large language models (LLMs) like ChatGPT and GPT-4 (OpenAI, 2023) have demonstrated outstanding capabilities in language comprehension and reasoning. This leads to a growing trend of employing LLMs as evaluators for complex language generation tasks by prompting them with carefully and elaborately crafted instructions (Chiang and Lee, 2023; Gao et al., 2023; Wang et al., 2023a; Wu et al., 2023; Luo et al., 2023; Liu et al., 2023). Despite the preliminary success suggested by such works, it remains inconclusive to what degree of confidence we can trust the evaluation results produced by LLMs across different dimensions, despite their supposedly high average correlation with humans. It is also unclear whether certain LLM-based metrics are more reliable than others, or whether their reliability and fairness vary across different candidate systems.
In this work, we conduct extensive analysis to assess whether LLM evaluators can reliably replace human judges. Specifically, we adopt two common human evaluation approaches for LLM evaluators, namely Likert-scale scoring (He et al., 2022; Shen et al., 2022b; Zhang et al., 2020a) and head-to-head (H2H) comparisons (Shen et al., 2022a; Li et al., 2020; Liu and Lapata, 2019). For Likert-scale scoring, we explore direct reason-then-score (RTS) generation and a multiple-choice question (MCQ) method. The former instructs the LLM to provide reasoning before giving a score, while the latter simply prompts it to choose a specific score with a pre-determined description as the reason. For the head-to-head (H2H) comparison, we prompt the LLM for a preference over the summaries from two compared candidate systems.
Our experiments show that LLM evaluators, with RTS and MCQ, outperform existing automatic metrics (Lin, 2004; Yuan et al., 2021). However, they are not yet reliable alternatives to human evaluation. Specifically, (i) LLM evaluators struggle to distinguish candidates with close performances (§ 4.2.1). (ii) LLM evaluators are candidate-dependent, meaning they do not exhibit highly consistent degrees of human alignment for different candidates (§ 4.2.3). Thus, they may unfairly favor or disfavor an evaluated candidate. (iii) LLM evaluators are dimension-dependent, meaning they have varying degrees of evaluation capabilities for different dimensions like coherence and fluency (§ 4.2.3). (iv) Lastly, as the quality of summaries improves with better candidates, LLM evaluators become less correlated with human judgments, according to our newly proposed meta-correlation metric (§ 4.2.4).
While we still call for a better automatic metric, in the meantime, we suggest a temporary solution in § 5 for abstractive summarization practitioners to use LLMs more reliably. Specifically, we advocate calculating the correlation between RTS and MCQ as a preliminary indicator of the reliability of the LLM for certain dimensions. If RTS and MCQ do not generally agree with each other, then further human evaluations are required.

Related Work
Summarization The summarization task involves generating a summary that contains concise and important (i.e., salient) contents of the original input article (Nenkova and McKeown, 2012). This task has been handled with two different approaches: extractive and abstractive. Unlike extractive summarization systems that directly extract salient phrases or sentences from the input article (Ernst et al., 2022; Chen et al., 2021; Zhou et al., 2018; Dong et al., 2018), abstractive summarization systems are expected to generate summaries using their own words and apply sentence fusion or paraphrasing techniques (Shen et al., 2023; Liu et al., 2022; Xiao et al., 2022; Lewis et al., 2020; Zhang et al., 2020a; Ziegler et al., 2019; Bing et al., 2015; Xu and Durrett, 2021). As such, abstractive summarization poses significantly more challenges for automatic and human evaluation pipelines (Saha et al., 2022; Pagnoni et al., 2021), because it is increasingly insufficient to use the provided "gold" summary as ground truth.
Human Evaluation Human evaluation can be conducted with different approaches. Some works (He et al., 2022; Shen et al., 2022b; Zhang et al., 2020a; Cheng et al., 2020; Gao et al., 2019; Liu et al., 2018; Li et al., 2017; Kryściński et al., 2018) employ a Likert scale to evaluate the summaries on discrete ranges, such as from 1 to 5. Meanwhile, many others adopt comparison approaches, asking human annotators to select the best summary out of two or more generated summaries from different systems (Shen et al., 2022a; Li et al., 2020; Liu and Lapata, 2019; Fan et al., 2018; Fabbri et al., 2019). Following this, we test LLM-based evaluators using both approaches with human-friendly instruction prompts.
Automatic Evaluation ROUGE (Lin, 2004) has been a common lexical overlap metric for evaluating summarization systems. However, ROUGE is insufficient for abstractive summarization, because the "gold" labels it relies on cannot comprehensively account for the complexity and variability of this task. In addition, the common use of sentence fusion techniques and novel words in abstractive summarization may make ROUGE even less reliable. Zhang et al. (2020b) propose the neural-based BERTScore, which leverages BERT word embeddings to compute the semantic similarity among tokens. Yuan et al. (2021) later introduce BARTScore, which uses BART (Lewis et al., 2020) to compute the probability of a summary given its input article. Nonetheless, these metrics may not reflect all of the complicated evaluation dimensions required for abstractive summarization mentioned earlier, nor do they correlate sufficiently highly with humans.

LLM-based Evaluation
There are many concurrent works that demonstrate the potential of LLMs to conduct complex human tasks (Chiang and Lee, 2023; Gao et al., 2023; Wang et al., 2023a; Wu et al., 2023; Luo et al., 2023; Liu et al., 2023; Cheng et al., 2023). The key advantage of instruction-tuned LLMs, like ChatGPT or GPT-4 (Ouyang et al., 2022; OpenAI, 2023), is that we can explicitly describe in natural language what our evaluation criteria and dimensions are and how to score the summaries, similar to how we would explain such tasks to a human expert. Chiang and Lee (2023) use LLMs for open-ended story evaluations, while Luo et al. (2023) apply ChatGPT specifically for evaluating the consistency of summaries. Wu et al. (2023) formulate LLMs as diverse role-players to evaluate summaries from the perspectives of different personas. Wang et al. (2023a) and Liu et al. (2023) also explore the LLM's evaluation potential in various dimensions for natural language generation tasks. Our work differs from the above works in that besides investigating the LLMs' capability using different approaches across various dimensions for abstractive summarization, we further focus on the reliability of LLMs across evaluated systems and dimensions.

LLM as a Zero-Shot Evaluator
We investigate an LLM's evaluation capabilities in the dimensions of coherence, consistency, fluency, and relevance respectively, as defined by Fabbri et al. (2021) (see Appendix A). Following common human evaluation approaches, we propose two methods for Likert-scale scoring, namely the reason-then-score method and the multiple-choice question method, as well as one method for head-to-head comparisons. We describe each method in § 3.1 using the relevance dimension as an example (see more prompts and details in Appendix B). We further experiment with alternative phrasings for different methods in Appendix C. Besides exploring different evaluation methods, the stability of LLM-based evaluations across different summarization systems is equally important. Ideally, a stable LLM evaluator should perform equally well regardless of the evaluated systems, with a close (if not identical) degree of alignment with human judgments. In § 3.2, we propose a meta-correlation metric and explain how it can gauge the extent to which LLM evaluators' performances depend on the evaluated systems, which indicates how stable and reliable they may be when evaluating any future candidate systems.

Summary Evaluation Methods
Reason-then-Score (RTS) Given the success of chain-of-thought prompting (Kojima et al., 2022; Wei et al., 2022), an intuitive method is to ask the LLM to evaluate a specific dimension by first generating the reasoning and then a corresponding score. Since the SummEval dataset (Fabbri et al., 2021) contains human scores on a Likert scale of 1 to 5, we also ask the LLM to score the summaries in the same range, as shown in Table 1.

Table 1: Example prompt for the RTS method on the relevance dimension. Texts in {blue} represent the article and the corresponding summary to be evaluated.

    Score the following Summary given the corresponding Article with respect to relevance from one to five, where one indicates "irrelevance", and five indicates "perfect relevance". Note that relevance measures the Summary's selection of important content from the Article, whether the Summary grasps the main message of the Article without being overwhelmed by unnecessary or less significant details.

    Summary: {summary}

    Provide your reason in one sentence, then give a final score:

MCQ Scoring (MCQ) Nevertheless, previous works find that the reasoning generated by the LLM does not always make sense (Lyu et al., 2023; Wang et al., 2023b; Gao et al., 2022). To avoid the misguidance of wrongly generated reasoning, we explore a more constrained MCQ method for the Likert-scale scoring. As shown in Table 2, instead of allowing the LLM to freely generate its thoughts, we dictate specific reasoning for each score.

Table 2: Example prompt for the MCQ method on the relevance dimension. Texts in {blue} represent the article and the corresponding summary to be evaluated.

Table 3: Example prompt for the H2H method on the relevance dimension. Text in {blue}: the specific article, and the corresponding summaries generated by a pair of compared models.
Head-to-Head Comparison (H2H) Lastly, some concurrent works also observe that ChatGPT can act as an effective ranking model (Ma et al., 2023a,b).We thus explore the head-to-head comparison approach for LLM-based evaluations.As shown in Table 3, we present 2 summaries (Summary #1 and #2) generated by different summarization systems on the same input article, then prompt the LLM to select the better summary, or to indicate a tie.Moreover, to avoid potential biases that arise from the summary IDs, we conduct each evaluation twice, presenting the same summary as either #1 or #2 respectively.
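The two-pass ID-swapping procedure above can be sketched as follows. Here `judge` is a hypothetical stand-in for one LLM call returning "1", "2", or "tie", and aggregating two disagreeing verdicts into a tie is one reasonable design choice, not a detail fixed by the method description:

```python
from collections import Counter

def h2h_vote(judge, article, summ_a, summ_b):
    """Evaluate one H2H pair twice, swapping summary IDs to cancel
    positional bias, then aggregate the two verdicts.

    judge(article, summary_1, summary_2) -> "1", "2", or "tie".
    """
    first = judge(article, summ_a, summ_b)   # system A shown as Summary #1
    second = judge(article, summ_b, summ_a)  # system A shown as Summary #2
    votes = Counter()
    # Map each positional verdict back to the underlying system.
    votes[{"1": "A", "2": "B", "tie": "tie"}[first]] += 1
    votes[{"1": "B", "2": "A", "tie": "tie"}[second]] += 1
    if votes["A"] > votes["B"]:
        return "A"
    if votes["B"] > votes["A"]:
        return "B"
    return "tie"  # the two passes disagreed, or both were explicit ties
```

A judge that always prefers position #1 regardless of content would win one pass per system, so the aggregate correctly degrades to a tie.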

Stability of LLM Evaluators
To ensure fairness across all evaluated systems, we argue that it is crucial for LLMs to produce stable evaluations.That is, regardless of evaluated systems, the LLMs should maintain a consistent degree of alignment with human judgments.We investigate such stability in two ways.
First, we categorize the summaries based on their originating summarization systems, and then examine the correlation between the LLM and human evaluations for each system. Ideally, if an LLM is stable across systems, it should produce evaluations that are similarly correlated with human evaluations. Otherwise, if the correlations differ significantly across different candidates, then we may conclude that the LLM's evaluations are system-dependent.
Second, we define a meta-correlation metric to quantify the extent to which the LLM's performance is affected by the quality of the evaluated systems. Specifically, we use the average human score for each candidate as an indicator of its summarization quality ($Q_i$), as shown in Equation (1):

$$Q_i = \frac{1}{N} \sum_{j=1}^{N} f_{\mathrm{human}}(g_{i,j}) \quad (1)$$

where $f_{\mathrm{human}}(\cdot)$ indicates the human evaluation, and $g_{i,j}$ represents the $j$-th summary generated by the $i$-th candidate system. Each candidate's quality is calculated as an average over $N$ generated summaries ($N = 100$ for all systems). Next, we use the correlation $P_i$ between LLM scores and human scores as an indicator of the LLM's evaluation performance for the $i$-th candidate, as follows:

$$P_i = \rho\big(\{f_{\mathrm{LLM}}(g_{i,j})\}_{j=1}^{N}, \{f_{\mathrm{human}}(g_{i,j})\}_{j=1}^{N}\big) \quad (2)$$

where $\rho$ denotes the correlation metric (i.e., Spearman correlation, Pearson correlation, or Kendall's Tau), and $f_{\mathrm{LLM}}(\cdot)$ indicates the LLM's evaluation for each summary $g_{i,j}$. Finally, we calculate the meta-correlation $M$ over a total of $k$ candidates as:

$$M = \rho\big(\{Q_i\}_{i=1}^{k}, \{P_i\}_{i=1}^{k}\big) \quad (3)$$

Ideally, an LLM should work well regardless of the quality of the evaluated systems, which means that $M$ should be close to zero. On the other hand, a significant $M$ would indicate an undesirable relationship between the LLM's evaluation capability and the quality of the evaluated systems, suggesting that the LLM evaluation is not stable, such that it may not evaluate each candidate system fairly using the same standards.
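As a concrete illustration, the meta-correlation can be computed with a short pure-Python sketch; the scores below are toy values (not from SummEval), and we use Kendall's Tau without tie correction as the correlation ρ:

```python
def kendall_tau(x, y):
    """Plain Kendall's Tau between two equal-length score lists
    (no tie correction; sufficient for this illustration)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def meta_correlation(human_scores, llm_scores):
    """Meta-correlation M over k candidate systems.

    human_scores, llm_scores: k lists of N per-summary scores each,
    row i holding the scores for the i-th candidate system."""
    Q = [sum(h) / len(h) for h in human_scores]   # Eq. (1): quality per system
    P = [kendall_tau(l, h)                        # Eq. (2): per-system LLM-human correlation
         for l, h in zip(llm_scores, human_scores)]
    return kendall_tau(Q, P)                      # Eq. (3): correlation of quality with performance
```

On toy data where the LLM tracks humans perfectly for the weakest system but anti-correlates for the strongest, M comes out strongly negative, exactly the unstable pattern the metric is designed to flag.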

Setups
We use the ChatGPT "gpt-3.5-turbo-0301" snapshot (§ 4.2) for all three methods. By using a fixed snapshot, we ensure all evaluations are conducted with the same LLM. In addition, we evaluate with the GPT-4 "gpt-4-0314" snapshot (§ 4.3) using the best evaluation method determined with ChatGPT to check for any potential improvement. Given that ChatGPT and GPT-4 are among the top-performing LLMs, we use their performance to estimate the potential of LLMs as reliable evaluators. Additional results using three different-sized Llama 2 models (Touvron et al., 2023), all of which perform worse, are reported in Appendix D. Similar to Luo et al. (2023) and Wu et al. (2023), we set the temperature to 0 and reset the dialogue history for each evaluation instance.
Dataset We use the SummEval benchmark dataset (Fabbri et al., 2021). This dataset contains expert human annotations for coherence, consistency, fluency, and relevance on the generation results from 12 abstractive systems (see details in Appendix Table 21) on the CNN/DM dataset (Hermann et al., 2015). Each evaluated system generates summaries for the same 100 news articles, and each summary is scored by 3 expert annotators from 1 to 5. The annotations achieve a high kappa coefficient of 0.713 (Fabbri et al., 2021). We further calculate the annotations' standard deviations across each evaluated system in Appendix Table 20. Given a step size of 1, the standard deviations are very small, suggesting that this dataset has a high level of human agreement. Following Chiang and Lee (2023), Chhun et al. (2022), and Guan and Huang (2020), we use the average human scores as the reference scores.
Prompts We conduct evaluation following our prompt formats given in Tables 1, 2, and 3. Following Fabbri et al. (2021), we re-use the definitions of the evaluation dimensions: (i) Coherence -- the collective quality of all sentences, (ii) Consistency -- the factual alignment between the summary and the summarized source, (iii) Fluency -- the quality of individual sentences, and (iv) Relevance -- the selection of important content from the source.
Measurements To compare all evaluation methods on equal ground with human evaluation, we use four different measurements. First, we count the number of correct preferences (#CP), which is the number of times each automatic metric has the same preference as the average human scores over a set of compared system pairs (§ 4.2.1). This can help measure the alignment of evaluation methods with humans at a granular level. To determine the system preferred by a particular metric, we assign a system 1 point if its generated summary is evaluated as better than that of the other system according to the metric, or assign both systems 0.5 points for a tie. Then, we aggregate the scores for the compared systems over all 100 test inputs, and the system with the higher score is considered the preferred system by that metric (see Appendix Table 22 for details).
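The point-assignment scheme above can be sketched in a few lines of Python; the function names and toy scores are ours, for illustration only:

```python
def preferred_system(metric_scores_a, metric_scores_b):
    """Decide which of two systems a metric prefers: 1 point for a
    per-article win, 0.5 points each for a tie, summed over all
    test articles."""
    points_a = points_b = 0.0
    for a, b in zip(metric_scores_a, metric_scores_b):
        if a > b:
            points_a += 1
        elif b > a:
            points_b += 1
        else:
            points_a += 0.5
            points_b += 0.5
    if points_a > points_b:
        return "A"
    if points_b > points_a:
        return "B"
    return "tie"

def count_correct_preferences(metric_prefs, human_prefs):
    """#CP: how often the metric's per-pair preference matches the
    preference derived from average human scores."""
    return sum(m == h for m, h in zip(metric_prefs, human_prefs))
```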
Next, we also use the Pearson correlation (Cohen et al., 2009), Spearman correlation (Spearman, 1987), and Kendall's Tau (Kendall, 1938) to measure the relationship between the scores of automatic evaluators and humans (§ 4.2.2, 4.2.3, 4.2.4). While the Pearson score measures linear relationships, the other two measure ordinal relationships that may be non-linear. Moreover, Kendall's Tau is less sensitive to outliers than the Spearman correlation due to its paired counting of concordant and discordant pairs.
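To make the distinction concrete, here is a small scipy demonstration on toy data: a perfectly monotone but non-linear relationship saturates the two rank-based measures while Pearson stays below 1:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# A monotone but non-linear relationship (toy data): the ordering is
# identical, so the rank-based measures are maximal, while Pearson
# reflects only the imperfect *linear* fit.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

pearson = pearsonr(x, y)[0]
spearman = spearmanr(x, y)[0]
tau = kendalltau(x, y)[0]

print(pearson, spearman, tau)
```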

ChatGPT Evaluator
In this section, we examine the ChatGPT evaluator across many aspects, ranging from human correlation to stability across different systems.

Correct Preferences
The ultimate goal of evaluation is to determine if one candidate system is better than the other in a compared pair. The number of correct preferences (#CP) metric normalizes all evaluation methods into determining whether an evaluator can, as a human expert would, pick the same better system or determine a tie. We conduct such analysis with different pairs of summarization systems on the same input articles. As H2H comparisons are costly, we restrict them to an 11-pair challenge set of systems with consecutive average performances according to human scores. However, for RTS, MCQ, and other baselines, we can easily calculate the #CP for all 66 possible pairs (see Appendix E).
Table 5 reports the #CP for both the standard 66-pair full set (in brown) and the 11-pair challenge set (in black). As shown for the larger standard set, RTS unanimously obtains the largest #CP across all dimensions, with an average of 58.5 out of 66 candidate pairs (i.e., 88.6% accuracy).
Despite the high overall accuracy, weaknesses of such evaluators are revealed as we dive into their performances on the 11-pair challenge set (black scores in Table 5), where the evaluated candidates are close matches. Specifically, BARTScore-CNN-Para performs better than RTS in coherence and consistency, possibly because it is fine-tuned with same-domain summarization data. For fluency and relevance, ChatGPT-RTS still performs best among all evaluators. Nonetheless, its average accuracy drops significantly to 63.6% (7 out of 11), which indicates that LLM evaluators struggle to differentiate closely matched candidate systems. In other words, LLM evaluators may only reliably compare candidates with a relatively large performance gap.

Correlations with Human
Table 4 reports the Spearman correlation, Pearson correlation, and Kendall's Tau between the scores of multiple automatic evaluators and humans over a total of 1200 summaries from all systems, across the four evaluation dimensions. As shown, ChatGPT RTS and MCQ demonstrate stronger correlations with humans than many automatic evaluators, such as ROUGE and BARTScore, with up to 0.2 gains in fluency. While RTS achieves higher correlations in the dimensions of consistency and relevance, MCQ has relatively strong correlations in the dimensions of coherence and fluency. Meanwhile, the specialized BARTScore-CNN family also shows competitive performance in coherence, most likely due to its fine-tuning with CNN/DM.

Per-candidate Correlations
Next, we break down the human correlation of ChatGPT-RTS for each candidate system and measure the statistical spread of the correlations across all systems (see raw results in Appendix Table 23). Ideally, a stable evaluator should exhibit the same human correlation across candidates and dimensions, displaying flattened boxes aligned in a line.
However, as illustrated in Figure 1, the spread of correlations for different candidates is particularly wide, with up to a 0.5 correlation difference in consistency. This means that the RTS evaluator exhibits a significantly varying degree of alignment with human judgment for different candidates. In other words, ChatGPT-RTS is candidate-dependent. In addition, the medians across the four dimensions also differ. This indicates that ChatGPT is also dimension-dependent and unstable. Given such varying performances across different dimensions, ChatGPT may not behave well with a newly introduced evaluation criterion.

Summary Quality vs Human Alignment
Using our proposed meta-correlation measurement in § 3.2, we analyze the relationship between summary quality and the human correlation of LLM evaluators. We illustrate the meta-correlation in terms of Kendall's Tau for both RTS and MCQ in Figure 2. As shown, both RTS and MCQ exhibit strong negative meta-correlations for consistency and fluency. This suggests that ChatGPT becomes less human-aligned as the quality of the evaluated systems improves.
To illustrate this phenomenon further, we scatter the paired coordinates of summarization system quality ($Q_i$, Equation (1)) and ChatGPT's evaluation performance ($P_i$, Equation (2)) in Figure 3. As shown, while the LLM evaluator is better human-correlated with lower-quality candidates (< 3.5), it is less reliable when dealing with high-quality candidates (> 4.7), with much lower and inconsistent correlations.
We compare the meta-correlation for all evaluation metrics in Table 6. While the ROUGE metrics exhibit no significantly negative meta-correlation, the neural metrics all display significant meta-correlations in certain dimensions. One highly likely reason for this behavior is the varying biases inherent to the neural models, which would explain why ROUGE, as a simple n-gram overlap metric, does not exhibit significant negative meta-correlations. Interestingly, ROUGE-2 even shows a strong positive meta-correlation on coherence (which is plausible, because bi-gram overlap may become more accurate as candidates produce more coherent texts).
Both the BARTScore variants and the LLMs demonstrate the most negative meta-correlations. ChatGPT-RTS has the most negative meta-correlation in the dimensions of consistency and fluency, indicating that it may be the least reliable for evaluating high-quality systems on these dimensions. On the other hand, the BARTScore family may be unreliable for comparing systems of high quality in coherence, consistency, and relevance.
So far, the observations discussed in § 4.2.3 and § 4.2.4 collectively suggest that LLM evaluators may not be a reliable standalone metric for challenging scenarios, and further human evaluation is required for conclusive decisions.

RTS and MCQ Scores
Lastly, we delve into the detailed scores generated by ChatGPT with either the RTS or MCQ method. Since both methods score the summaries in the same 1-to-5 range as the human scores (Fabbri et al., 2021), we can directly compare the average RTS and MCQ scores with human scores in Figure 4 (see more details in Appendix F). As shown, the RTS scores are much lower than the human scores across all dimensions, while MCQ scores are consistently higher and better match the human scores (except for relevance). In other words, while RTS is best aligned with humans according to § 4.2.1 and § 4.2.2, we cannot replace human scores with RTS scores in absolute terms.
The discrepancy may be attributed to the unfaithful reasoning generated by LLMs (Lyu et al., 2023; Wang et al., 2023b; Gao et al., 2022). Our further investigation suggests that ChatGPT-RTS generates false or unrelated-to-dimension reasoning. Thus, it is possible that the much lower scores are caused by ChatGPT penalizing the summaries according to false premises (more examples in Appendix G). For instance, RTS may penalize a summary's repetitiveness in the consistency dimension or suppress fluency ratings for missing important details. On the other hand, the MCQ counterpart gives higher overall scores, most likely because the confined set of pre-defined reasons prevents such unrelated penalization, though this does not lead to better human alignment.

GPT-4 Evaluator
A natural question to ask is whether the aforementioned limitations are resolved with a stronger LLM. In this section, we conduct similar analyses on GPT-4 (OpenAI, 2023) with the RTS method. We present the GPT-4 results in the last rows of Tables 4 and 5. The results suggest that a stronger LLM does not necessarily translate to a stronger LLM evaluator, although Table 4 does show that GPT-4 outperforms ChatGPT in terms of human correlation consistently across most dimensions. Unfortunately, GPT-4 still suffers from the same limitations as ChatGPT. It appears to be both candidate-dependent and dimension-dependent, as demonstrated by the large spreads with varying median values across dimensions in Figure 5 and the significantly negative meta-correlations in multiple dimensions (Table 6). However, GPT-4 is less dimension-dependent than ChatGPT, as the medians in the box plots in Figure 5 are more aligned than those in Figure 1.
In addition, there is a notable enhancement in the meta-correlation for consistency, which we attribute to a significant reduction in reported hallucinations with GPT-4 (OpenAI, 2023). It is possible that with much more instruction training to avoid hallucinations, GPT-4 is much better aligned with humans in detecting inconsistencies (i.e., hallucinations) in summaries.
Nevertheless, GPT-4 exhibits a much worse negative meta-correlation in the relevance dimension, which, interestingly, seems to reflect the challenge of maintaining both "truthfulness" and "informativeness" (Ouyang et al., 2022). This is because a model can easily be made more truthful if allowed to provide less relevant information (for instance, by refusing to answer users' questions). It is possible that with reduced capability in the informativeness dimension, the model is less able to differentiate the nuances of less relevant summaries when summary quality is generally high. Nevertheless, we leave it to future work to determine whether GPT-4's more negative meta-correlation in the relevance dimension is related to its stronger performance in consistency. We provide more details on the GPT-4 evaluator in Appendix H.

A Temporary Efficient Framework
Despite the aforementioned limitations, it may be hard to resist the temptation of using LLM evaluators given their superiority over other automatic metrics. In such a case, one should be able to tell when LLM evaluators are more likely to be unreliable and employ further human evaluation when necessary. To this end, we suggest combining the RTS and MCQ scores as a cost-efficient framework. Specifically, we calculate the correlation between the RTS and MCQ scores for the $i$-th candidate system as a reliability indicator:

$$R_i = \rho\big(\{f_{\mathrm{RTS}}(g_{i,j})\}_{j=1}^{N}, \{f_{\mathrm{MCQ}}(g_{i,j})\}_{j=1}^{N}\big) \quad (4)$$

Then, we can loosely infer that up to a reliability tolerance $r \in (0, 1)$, the LLM evaluators (either RTS or MCQ) are reliable if $R_i > r$. In other words, given a candidate $i$, if RTS and MCQ agree with each other up to a certain degree of tolerance $r$, we may assume the evaluator is reliable enough to avoid invoking further human evaluation.
To validate this theory, we measure the correlations $\rho(R_i, P_i^{\mathrm{RTS}})$ and $\rho(R_i, P_i^{\mathrm{MCQ}})$, where $P_i^{\mathrm{RTS}}$ or $P_i^{\mathrm{MCQ}}$ is the performance of the corresponding method as defined in Equation (2). Given significantly large positive values of $\rho(R_i, P_i^{\mathrm{RTS}})$ or $\rho(R_i, P_i^{\mathrm{MCQ}})$, we can then conclude that $R_i$ can be used as a reliable indicator of the performance of the corresponding method.
As shown in Table 7, $R_i$ demonstrates a significant correlation with $P_i^{\mathrm{RTS}}$ on both the consistency and fluency dimensions, and with $P_i^{\mathrm{MCQ}}$ on the coherence and consistency dimensions. This means that if RTS and MCQ generally agree with each other on a candidate's performance on a particular dimension, yielding a high $\rho(R_i, P_i^{\mathrm{RTS}})$ (or $\rho(R_i, P_i^{\mathrm{MCQ}})$), then RTS (or MCQ) is more likely to be human-aligned. Meanwhile, if RTS disagrees with MCQ ($R_i < r$), further human evaluation is required to provide a conclusive evaluation. We provide the $R_i$ values for ChatGPT on each evaluated system in Appendix Table 29.
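For illustration, this reliability gate can be sketched as follows, using Kendall's Tau as ρ over a candidate's per-summary RTS and MCQ scores; the tolerance value r = 0.3 is purely illustrative, since an appropriate r may be dataset-dependent:

```python
def kendall_tau(x, y):
    """Plain Kendall's Tau between two equal-length score lists
    (no tie correction; sufficient for this illustration)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def needs_human_evaluation(rts_scores, mcq_scores, r=0.3):
    """Reliability gate: compute R_i as the RTS-MCQ correlation for
    one candidate system; if the two methods do not agree beyond the
    tolerance r, fall back to human evaluation."""
    R_i = kendall_tau(rts_scores, mcq_scores)
    return R_i <= r
```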

Conclusion
We explore the potential of using LLMs with different prompting techniques as evaluation metrics for abstractive summarization systems. Our extensive analysis suggests that while LLMs like ChatGPT perform better than commonly used automatic metrics across different summarization systems and dimensions, they are still not ready to replace human evaluators: they are candidate- and dimension-dependent, and they do not align well with humans when comparing high-quality candidates. Nonetheless, if an LLM evaluator is to be used, we suggest combining multiple evaluation methods as a preliminary indicator of whether the metric is likely to be unreliable and whether further human evaluation is required.

Limitations
Potential Human Bias. We benchmark the LLM evaluation results against the average of three human expert scores. Naturally, these scores may exhibit the potential biases of the human experts. Nevertheless, we wish to explore whether LLM evaluators are aligned with human experts, and such evaluators may naturally exhibit the same biases as a human would. In other words, we examine whether we can reliably replace human annotators with LLMs, rather than seeking a "perfect" solution with absolutely zero bias.
Dataset Size. Given the constraints of the small human-annotated SummEval dataset, we could only evaluate 100 summaries generated by each summarization system, for a total of 12 abstractive summarization systems. Since we observe a significant correlation of LLM evaluations with humans for the consolidated 1200 summaries across all systems, it is possible that with a larger number of evaluated summaries, the per-system correlations could also improve. In addition, with only 12 evaluated systems, our meta-correlation may still be subject to sample biases. We leave further investigation to the future, once larger annotated datasets become available.
Prompt tuning. Designing better prompts for LLMs is also ongoing research. Although LLMs may act as better evaluators with better prompts, prompt tuning is not our focus. We seek to highlight the limitations of the investigated LLMs, and we have demonstrated that limitations such as negative meta-correlation are also found with a few other alternative prompts (see Appendix C).
Availability of Commercialized LLMs. We note that the "gpt-3.5-turbo-0301" snapshot has been taken down by OpenAI and replaced with a newer snapshot, "gpt-3.5-turbo-0613". This is one disadvantage of using out-of-the-box commercialized LLMs for summarization evaluation: the exact checkpoints may not remain stably available. As a result, future models may not be fairly compared against previously evaluated models if a different LLM checkpoint is used. Nevertheless, our paper only seeks to investigate the potential of LLMs as out-of-the-box evaluators, and the OpenAI models are currently among the strongest. Ultimately, we wish to raise awareness of the significant limitations found with these LLMs, which need to be resolved before LLMs can directly replace human evaluation. We also note that the cost of evaluating only 100 summaries per system is relatively low (around 2 USD per system using ChatGPT). Since LLMs also evaluate much faster than humans (around 2 minutes for LLMs versus 10 hours for a human for 100 summaries), re-evaluating all compared systems on a single LLM should not pose a significant barrier.
Limited Use of the Temporary Solution. Unfortunately, our temporary efficient framework does not apply to the relevance dimension, where R_i has no significant correlation with the performance of either RTS or MCQ. Moreover, the r value may be dataset-dependent, and it is hard to decide where to draw this line. We leave the development of better methods to gauge the reliability of LLM evaluations for future work.

E Challenging Pairs
To count the total correct pairs, we only evaluate the challenging pairs, which consist of summarization systems ranked consecutively by average human scores across all dimensions. Thus, each pair contains the two summarization systems with the smallest difference in average performance.
For instance, as shown in Table 21, M22 has the best average human score of 4.57, followed by M23 with 4.55 and then M17 with 4.52. We thus compare the model pairs "M22-M23" and "M23-M17". The full challenge set is shown in Table 22.
For RTS, MCQ, and all other baseline metrics, we simply compare the evaluated values across all systems, so each metric only needs to evaluate a total of 1,200 summaries. For H2H, however, we need to evaluate a total of 6,600 summary pairs for the full standard set, and each pair needs to be evaluated twice with different summary positions (see § 3.1), resulting in a total of 13,200 LLM evaluations. Due to a limited budget, we therefore only compare a challenge set of 11 pairs, reducing the total required LLM evaluations to 2,200.
Table 25: Average scores for ChatGPT using RTS (Chat-RTS) and MCQ (Chat-MCQ), GPT-4 using RTS (GPT-4), as well as the human scores.
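The challenge-set construction and the H2H evaluation budget described above can be sketched as follows. The three scores shown are the illustrative subset from the example (M22, M23, M17); the budget numbers follow directly from 12 systems, 100 summaries each, and two summary positions per pair.

```python
# Illustrative sketch of the challenge-set construction and the H2H
# evaluation budget described above (only a made-up subset of scores).
from itertools import combinations

avg_human = {"M22": 4.57, "M23": 4.55, "M17": 4.52}

# Challenge pairs: systems adjacent in the human ranking.
ranked = sorted(avg_human, key=avg_human.get, reverse=True)
challenge_pairs = list(zip(ranked, ranked[1:]))
print(challenge_pairs)  # [('M22', 'M23'), ('M23', 'M17')]

# Evaluation budget for the real setting: 12 systems, 100 summaries each,
# every summary pair judged twice with swapped positions.
n_systems, n_summaries, n_positions = 12, 100, 2
full_pairs = len(list(combinations(range(n_systems), 2)))     # 66 system pairs
full_cost = full_pairs * n_summaries * n_positions            # 13,200
challenge_cost = (n_systems - 1) * n_summaries * n_positions  # 2,200
print(full_cost, challenge_cost)
```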

Dimension | Generation
Consistency | The summary is inconsistent with the article as it omits important details and repeats a phrase multiple times, resulting in a score of 1.
Consistency | The summary includes some relevant information but lacks coherence and omits important details, resulting in a score of 2.5 out of 5 for consistency.
Fluency | The summary is well-written and grammatically correct, but lacks important details from the article such as the potential link between antibiotics and weight gain, and the warnings about over-prescription of antibiotics and the development of superbugs. Score: 3/5.
Fluency | The summary lacks coherence and does not provide a clear understanding of the article, resulting in a score of 1 for disfluency.

G Incorrect Reasons in RTS
We illustrate some of the observed incorrect reasons generated with the RTS method by ChatGPT that do not correspond to the evaluated dimension in Table 26, and incorrect reasons generated by GPT-4 in Table 27. We have not compiled the exact number of mismatched reasons for either LLM, but during preliminary verification we observed a non-trivial fraction of evaluations containing reasoning unrelated to the dimension. For instance, on the consistency dimension, 42% of ChatGPT-RTS evaluations contain dimension-irrelevant reasoning for the M11 model (avg. score 2.36), and 10% for the M17 model (avg. score 4.84).

Dimension | Generation
Coherence | The summary is somewhat coherent but includes incorrect information, such as stating that Merson restarted his row with Burnley instead of Andros Townsend, and does not flow smoothly; therefore, I would give it a score of 3.
Consistency | The summary is consistent with the article as it mentions the helicopter airlifting patients to the hospital, but it leaves out important details such as the number of people involved and their injuries. Final score: 3.
Consistency | The summary accurately reflects the main points of the article, although it could have mentioned Falcao's unhappiness in France.
Fluency | The summary is coherent and grammatically correct, but it does not provide enough information about the incident. Final score: 3.
Fluency | The summary is only one sentence and do not provide enough information about the article's content. Final score: 1.

H GPT-4 Evaluator
We also examine the reasoning of GPT-4 and find that it makes the same mistakes as ChatGPT, penalizing summaries for reasons unrelated to the evaluated dimension (see Table 27).
Another major difference is that GPT-4 tends to give overly generous scores. In one extreme case, GPT-4 gives full consistency scores for all generations by M12. Table 25 also shows that GPT-4's average scores are much higher than those of ChatGPT-RTS across all dimensions.

Figure 3: Relationship between per-model correlations (Kendall's Tau) and human scores on consistency.

Table 5: Number of correct preferences (#CP) on the 11-pair challenge set (in black) and the 66-pair full set (in brown). Random: for each pair, there are three possible outcomes (two for one model being better, one for a tie), so the random #CP is one-third of the total compared pairs.

Table 6: Meta-correlation for various evaluation methods. Bolded: most negative meta-correlation. Underlined: second most negative meta-correlation. Values in light gray are insignificant (p-value ≥ 0.05).

Table 7: Correlations between the RTS-MCQ R_i and RTS-Human correlations (P_i^RTS). High values suggest R_i can serve as a reliability indicator for RTS and MCQ. Light gray values are insignificant (p ≥ 0.05).

Table 18: Results of using an alternative prompt with ChatGPT. Light gray values are insignificant (p-value ≥ 0.05). Human-Corr reports the overall correlation of ChatGPT scores with human scores; Meta-Corr shows the meta-correlation.

Table 19: RTS correlation results for Llama 2 models (7B, 13B, and 70B). Light gray values are insignificant (p-value ≥ 0.05). Human-Corr reports the overall correlation of LLM scores with human scores; Meta-Corr shows the meta-correlation.

Table 20: Standard deviation of human annotations across different summarization systems and evaluation dimensions.

Table 22: #CP calculation for the ChatGPT-H2H metric of Model A over Model B. The numerical values in the middle columns are aggregated scores for Model A; we omit the values for Model B, which are simply "100 − (aggregated score for Model A)". "✓" indicates that the LLM and humans prefer the same model, and "×" otherwise. Model pairs are sorted in descending order by each model's average human score.

Table 26: Examples of incorrect reasons generated during RTS by ChatGPT that do not correspond to the evaluated dimension. Bolded: reasons that do not match the evaluated dimension.

Table 27: Examples of incorrect reasons generated during RTS by GPT-4 that do not correspond to the evaluated dimension. Bolded: reasons that do not match the evaluated dimension.