Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization

Despite recent advances, evaluating how well large language models (LLMs) follow user instructions remains an open problem. While prompt-based approaches to evaluating language models have become increasingly popular, little work has examined whether these methods are themselves correct. In this work, we perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of LLMs. Our investigation is performed on grounded, query-based summarization: we collect riSum, a new short-form, real-world dataset containing 300 document-instruction pairs with 3 answers each. All 900 answers are rated by 3 human annotators. Using riSum, we analyze the agreement between evaluation methods and human judgment. Finally, we propose new LLM-based, reference-free evaluation methods that improve upon established baselines and perform on par with costly reference-based metrics that require high-quality summaries.


Introduction
Large Language Models (LLMs) have shown human-level performance in many NLP tasks. Recent advances in instruction tuning (Ouyang et al., 2022; Brown et al., 2020) and alignment (Stiennon et al., 2020; Zhou et al., 2023) have dramatically increased the ability of these models to follow instructions. In addition to being used to tackle unseen tasks in zero-shot setups (Chung et al., 2022), these models are now also used as surrogates for human annotators, especially for NLG tasks (Chiang and Lee, 2023; Wu et al., 2023), where human evaluations are time-consuming and expensive.
Consider the instruction "Briefly describe the purpose of the assignment and assumption agreement mentioned in the paragraph" from Figure 1.
There are several dimensions to evaluate a generated output on: (i) Coherence: whether it is understandable and free of grammatical mistakes, (ii) Faithfulness: whether facts in the output are supported by the document, (iii) Style: whether specific formatting requirements (lists, brevity, ...) are met, and (iv) Alignment: whether it semantically fulfills the instruction.
Analyzing these different facets for each model output increases the cognitive load of annotators, thereby increasing the likelihood of errors or low-quality evaluations (Goyal et al., 2022). It also increases the turnaround time, and hence annotations become expensive. An increasingly popular alternative is to ask LLMs to evaluate the generated outputs. Recent work like Liu et al. (2023a) and Fu et al. (2023) shows that LLMs can produce human-like evaluations of text by using clever prompting techniques (Wei et al., 2022b; Yao et al., 2023). But preliminary studies have shown that LLMs can be inconsistent in their evaluations and can easily be influenced (Wang et al., 2023a; Shen et al., 2023). Gehrmann et al. (2023) have also looked at evaluation flaws and have recommended that metric developers focus on metrics with smaller but better-defined scopes (like instruction following).
Hence, there is an urgent need for a standard framework to analyze the specifics of the instruction-following abilities of LLMs. SummEval (Fabbri et al., 2021) proposes something similar for vanilla summarization. Doing this for instruction following can be tricky because we would like to evaluate LLMs not only as task solvers ("Summarize this document in 20 words or less") but also as task evaluators ("Does the summary satisfy the conditions of the instruction?"). The meta-evaluation framework should be robust and ideally reference-free (Liu et al., 2023a). Reference-free evaluation for text generation has been widely studied (Liu et al., 2022; Hessel et al., 2021; Ke et al., 2022), but to the best of our knowledge, there has been no prior work on reference-free evaluation of instruction following.

arXiv:2310.08394v1 [cs.CL] 12 Oct 2023

Example from Figure 1:

Document: Lee, You should be receiving a package shortly containing the following: (...) 3. Assignment and assumption agreement to move the equipment from TurboPark to the CAED I. There will be one for CAED II as well. This document is being reviewed by the bank, so I'm not convinced it is in final form. You will note that there is an acknowledgement section for GE. I cut and pasted from the consent to assignment from the TurboPark documents, but shortened the whole thing considerably. Here's that document: 4. Signature pages (signed by Enron) from the ESA deal, both the facility agreement and the override letter. Obviously, we need your signature. I will forward the final CA facility agreements to you once again, along with the blacklines against what you initialled.

Instructions:
• Briefly describe the purpose of the assignment and assumption agreement mentioned in the paragraph.
• Explain the changes made to the GE acknowledgement section in the context of the TurboPark documents.
• Summarize the final steps regarding the CA facility agreements and signature pages.

Answers (for instruction #1):
F-PaLM 2-S: The assignment and assumption agreement is to move the equipment from TurboPark to the CAED I.
F-PaLM 2-Sc: The purpose of the assignment and assumption agreement is to move the equipment from TurboPark to the CAED I.
GPT-3.5: (...) The purpose of the assignment and assumption agreement is not specified.
In this work, we take the first steps towards building such a framework. To make this problem tractable, we choose to limit our scope to the task of query-based summarization. We consider this to be an appropriate initial task since (i) numerous domains to source documents from exist, (ii) the space of appropriate instructions is broad, while still (iii) maintaining groundedness of both instructions and answers in facts present in the documents. We leave the expansion of the dataset in size and domain/instruction scope to future work.
Contributions For this purpose, we release a rated, instructed summarization dataset, riSum, consisting of 900 instruction-summary pairs with 3 human ratings each (Figure 1).
We introduce several reference-free evaluation methods which perform on-par with expensive reference-based methods and outperform existing reference-free baselines in terms of correlation with human judgement.
Lastly, we leverage riSum to perform an extensive meta-evaluation, quantifying how well different evaluation methods are able to replace human judgments by statistically ranking model outputs.

Model naming
In this work, we rely on different LLMs for a variety of tasks. Specifically, we use GPT-3.5 (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023).

Footnote 1: Dataset will be made available soon.
Footnote 2: OpenAI model id: gpt-4-0314.
Data Collection

Dataset collection
Data sourcing To create riSum, a total of 100 documents are chosen from 10 existing datasets of different domains to ensure the data is as diverse as possible.The documents are uniformly sampled from each dataset, restricting to documents with a word count between 100 and 500 words (Table 1).
Instruction generation To procure instructions for each document, we first evaluate the quality of generations from four models: F-PaLM 2-Sc, F-PaLM 2-Lc, GPT-3.5, and GPT-4. We randomly sample 10 documents from the dataset and let each model generate 3 instructions per document. Each of the 40 (10 documents × 4 models) instruction sets was rated "good", "neutral", or "bad" by three evaluators in a side-by-side setting. In this evaluation, GPT-4 outperformed the other models on 6/10 documents, so we used it to sample instructions for all documents in the dataset. This results in a total of 300 document-instruction pairs.
Answer generation Subsequently, three different models are used to generate answers for each of the 300 document-instruction pairs, yielding the final dataset with 900 data points.
Human evaluation Finally, each document-instruction-output triplet is individually evaluated by at least three human annotators. They are asked two questions:
1. Does the output follow the instruction? (Y/N)
2. Rate the output on a scale of 1 to 5, where 1 indicates the output does not follow the instruction at all and 5 indicates the instruction is followed strictly.
See Appendix C for a description of the annotator UI, Appendix D for annotator guidelines, and Appendix E for the instruction-generation prompt.

Analysis of Human Ratings
For analyzing annotator agreement (Table 2), we leverage locally and globally computed Krippendorff's α (Krippendorff, 2019). For the first, boolean question, we use the nominal distance function (indicator function), and for the second, ordinal question, we use the interval distance function (squared difference). For the local variant, we compute a localized α for each document-instruction pair and then aggregate the results over all pairs. We omit 5 document-instruction pairs from the analysis for which Krippendorff's α is not defined because there is no annotator overlap among the 3 ratings for each of the 3 model outputs.
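As a concrete illustration, the global α computation for complete data (no missing ratings) can be sketched as follows. This is our own simplification of Krippendorff's formulation, not the paper's code; the two distance functions match the nominal/interval choices described above:

```python
from itertools import permutations

def krippendorff_alpha(units, delta):
    """Krippendorff's alpha = 1 - D_o / D_e for complete rating data.

    units: ratings grouped by item (each item rated by >= 2 annotators);
    delta: pairwise distance function between two rating values.
    """
    n_total = sum(len(u) for u in units)
    # Observed disagreement: pairwise distances within each unit.
    d_o = sum(sum(delta(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
              for u in units) / n_total
    # Expected disagreement: pairwise distances over all pooled ratings.
    pooled = [v for u in units for v in u]
    d_e = sum(delta(a, b) for a, b in permutations(pooled, 2))
    d_e /= n_total * (n_total - 1)
    return 1.0 - d_o / d_e

nominal = lambda a, b: float(a != b)         # "Follows instruction?" (Y/N)
interval = lambda a, b: float((a - b) ** 2)  # "How well?" (1-5)
```

The localized variant of Table 2 would correspond to calling this with only the ratings of a single document-instruction pair's three outputs as units, then aggregating across pairs.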
We note that around 67% of the dataset has high levels of agreement on the first question and 57% on the second question. The tail of disagreement is long (Figure 2), but we hypothesize that, given the difficulty of rating outputs in these diverse and highly specific texts, disagreements would be non-negligible even with higher replication rates. Ranking two responses against each other instead of rating single responses may help, at the expense of gathering only relative information. Given the diversity of domains and instructions, hiring domain experts for future ratings could increase quality and agreement, whilst also increasing costs.
Additionally, factoring out independent rating dimensions (e.g., language level, factuality) may help quantify common mistake types in LLM instruction following and identify areas of misalignment with human expectations, at the expense of a slower and more expensive rating process.
In Table 3a, we present aggregate numbers for annotator preferences among the three model outputs. We explore the mean of ratings, majority-consensus votes (ties broken randomly), and a global mean over individual ratings (no aggregation). In Table 3b, the three model outputs are ranked for each document-instruction pair and the ranking indices are then averaged across the dataset, with ties broken randomly. Both tables are averaged over 100,000 runs to eliminate noise from tie-breaking.

Table 3: Aggregate model quality according to human ratings. "Mean" aggregation takes the mean of human ratings for each model output (n = 300), "Maj." takes the majority vote with ties broken randomly (n = 300), and "None" performs no aggregation (n = 900). Averaged across 100,000 runs. FI is the binary rating "Follows Instruction?"; HW is the qualitative rating "How well?". Ratings are normalized to 0-1 and reported as %.

Evaluation Methods
We propose and evaluate several methods that model annotator preferences, focusing our analysis on reference-based vs. reference-free methods and their effectiveness in different data regimes.

Reference-based methods
Reference-based methods require access to at least one reference answer which can be considered the "gold standard" for each document-instruction pair.
Numerous prior works note that summaries written by crowd workers exhibit limitations associated with a lack of annotator expertise in the domain (Gillick and Liu, 2010), especially for narrower tasks like query-based summarization (Jiang et al., 2018). We therefore use LLM-generated references for benchmarking reference-based methods instead.
The requirement of having access to high-quality references fundamentally limits the utility of these methods. In all our reference-based experiments, we use GPT-4- and F-PaLM 2-Lc-generated summaries as references. Since we use GPT-3.5, F-PaLM 2-S, and F-PaLM 2-Sc to generate the candidate answers under evaluation, we use larger variants of these models to generate the "gold" references, which ensures that they are generally of higher quality (see, e.g., Table 19 of Anil et al., 2023).

BLEURT Sellam et al. (2020) take a (candidate, reference) answer pair as input and aim to model the semantic similarity between the two texts. In all results below, we use the BLEURT-20 model (Pu et al., 2021). In scenarios with multiple reference answers, we take the maximum BLEURT-20 score across all reference answers.

ROUGE (n-gram-based) Lin (2004) also takes (candidate, reference) pairs as input and measures n-gram overlap to provide a numerical estimate of how well the candidate resembles the reference. We report the geometric mean of ROUGE-1, ROUGE-2, and ROUGE-Lsum and refer to this method as ROUGE-avg. Similar to BLEURT, in scenarios with multiple reference answers, we report the maximum ROUGE-avg score for a given candidate.
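The aggregation just described can be sketched as follows. Note this is a simplified stand-in: we compute plain n-gram F1 for n in {1, 2} rather than the actual ROUGE package's ROUGE-1/2/Lsum (ROUGE-Lsum requires a longest-common-subsequence computation), but the geometric-mean and max-over-references logic mirrors the description above:

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n):
    """F1 n-gram overlap: a simplified stand-in for ROUGE-n."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    c, r = ngrams(candidate, n), ngrams(reference, n)
    if not c or not r:
        return 0.0
    overlap = sum((c & r).values())  # clipped n-gram matches
    prec, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def rouge_avg(candidate, references):
    """Geometric mean over ROUGE variants, maximum over references."""
    def geo_mean(scores):
        prod = 1.0
        for s in scores:
            prod *= s
        return prod ** (1.0 / len(scores))
    return max(geo_mean([rouge_n_f1(candidate, ref, n) for n in (1, 2)])
               for ref in references)
```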

Reference-free baseline methods
We investigate popular heuristics (e.g., length of the generated response) and several LM-based approaches, varying the amount of data used. Fine-tuning a model on a subset of the collected data would also yield a viable evaluation method, but we leave that for future exploration.
Length-based heuristics The simplest reference-free methods we use are based on length heuristics. The length of the model output is a common source of bias in human ratings of summary quality: longer answers are often preferred over shorter ones, since the former usually contain more information. Length is therefore a natural baseline for assessing the degree to which the collected ratings suffer from this type of bias. We simply count words and sentences using NLTK (Bird et al., 2009) and meta-evaluate how these counts would behave if they were used as a proxy for answer quality.
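As a minimal sketch of this baseline (using naive whitespace and punctuation splits in place of NLTK's tokenizers):

```python
import re

def length_scores(answer: str):
    """Proxy 'quality' scores from length alone (baseline sanity check).

    Whitespace tokenization and a naive sentence split on ., !, ?
    stand in for NLTK's word and sentence tokenizers here.
    """
    words = answer.split()
    sentences = [s for s in re.split(r"[.!?]+\s*", answer) if s.strip()]
    return {"n_words": len(words), "n_sentences": len(sentences)}
```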

Model-based methods
We benchmark the following state-of-the-art model-based methods on the riSum dataset: (i) BARTSCORE and BARTSCORE-CNN (Yuan et al., 2021), and (ii) T5-ANLI (Honovich et al., 2022). All are encoder-decoder Transformer models; the BARTSCORE variants have around 400M parameters and T5-ANLI has 11B.

LLM-based reference-free methods
The following methods depend on an underlying LLM for evaluation. Though we use PaLM 2 models in our experiments, these methods are model-agnostic, and any LLM can be used in their place. For the following methods, we leverage either the base PaLM 2-S/L models or the instruction-tuned F-PaLM 2-Sc/Lc.

Constrained Softmax
We feed the underlying model two prompts: one for the "Follows instruction? (Y/N)" question, and another for the "How well? (1-5)" question. The prompts correspond to the task descriptions provided to annotators (prompts presented in Figure 9 of Appendix E).
Instead of sampling tokens to obtain the ratings, we use the model to compute the log-likelihood of each possible rating value ("Yes"/"No" for the first question, {1, 2, 3, 4, 5} for the second question). This approach has multiple advantages over generating tokens directly:
1. Correctness: The model can never output a rating that is not from the list of options.
2. Efficiency: All our rating values are a single token in the model's vocabulary, which makes the scoring extremely efficient. Additionally, repeated sampling is not necessary to obtain a more precise estimate of the model's rating.
3. Uncertainty: By re-normalizing the likelihoods across all rating values, we obtain a rating distribution, which lets us precisely quantify the confidence the model assigns to each rating. For an unbiased estimate with respect to the logits, we fix the softmax temperature to 1.
Finally, we return the expected value of each question's rating distribution:

E[R | d, i, a] = Σ_r r · p(R = r | d, i, a),

where R is the random variable representing the rating, r ranges over the rating values ({0, 1} for Question 1, {1, 2, 3, 4, 5} for Question 2), and (d, i, a) denote the document, instruction, and answer. Additionally, we discuss a variant called Constrained Softmax n-shot, where we contextualize the model with n examples (document-instruction-answer-rating tuples) in each of the prompts.
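The scoring step can be sketched as follows, assuming the underlying LLM API exposes per-option log-likelihoods (the scoring call itself is elided; the function name is ours):

```python
import math

def expected_rating(logprobs):
    """Expected rating from per-option log-likelihoods (Constrained Softmax).

    logprobs maps each allowed rating value (e.g. 1..5) to the LLM's
    log-likelihood of that value's token; temperature is fixed to 1.
    """
    # Renormalize over the allowed options only (softmax at T = 1).
    z = max(logprobs.values())  # subtract max for numerical stability
    weights = {r: math.exp(lp - z) for r, lp in logprobs.items()}
    total = sum(weights.values())
    probs = {r: w / total for r, w in weights.items()}
    # Expected value of the resulting rating distribution.
    return sum(r * p for r, p in probs.items())
```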

Self-Agreement
In this method, we test whether the model is consistent with itself across rating generations by repeatedly sampling the rating from the LLM n = 7 times. To diversify the samples, we experiment with various softmax temperatures, only to find that lower temperatures yield better results. The final rating is the arithmetic mean of the individual samples. We contextualize the model with k = 3 examples in the prompt (see Figure 7 in Appendix E). We also investigate the following variants:
• no intro Omitting the description of the task in the prompt and using only the k examples.
• rationale Asking the model to generate Chain-of-Thought-like "rationales" for the rating given to each few-shot example (Wei et al., 2022b).

Multi-LLM Agreement
In this method, several instances of the same LLM debate amongst each other and try to arrive at a common assessment. The rules of communication are set as follows (Figure 3):
1. The models communicate amongst each other in a controlled manner for up to 3 rounds and try to arrive at a consensus. After at most 3 rounds, one of three outcomes occurs: (i) unanimous agreement: all 3 models agree (if this happens in an earlier round, the process ends immediately), (ii) majority agreement: one model disagrees with the other two, or (iii) disagreement: all 3 models disagree with each other.
2. In each round, all models provide a rating and a brief rationale. The models do not have access to the other models' outputs until the end of a round. Before the start of rounds two and three, they receive the ratings and rationales of all models from the previous rounds.
The prompt for the models is presented in Figure 8 of Appendix E. This method is referred to as Multi-LLM Agreement henceforth. We repeat the process n = 3 times for added stability.

Evaluating Agreement with Annotators
As discussed in Section 2, we asked annotators to provide a binary Yes/No rating answering whether a model output follows the instruction, and a qualitative rating from 1 to 5 representing how well it follows the instruction. Using the meta-evaluation methods described below, we then study agreement between annotators and our evaluation methods.

"Follows Instruction?"
For the binary rating, we compute a macro-averaged Area Under the ROC Curve (AUC ROC) statistic for each evaluation method. Using AUC ROC, we analyze how effective each method would be if used as a binary classifier for "Does the output follow the instruction?", thereby assessing the degree to which it can replace human ratings. Since our classes are imbalanced towards "Yes" (Table 3a), we opt for the macro-averaged version of AUC ROC so that we can better detect which methods accurately predict the "No" class.

"How well?"
Rank-based evaluation To analyze the ability of evaluation metrics to rank model outputs in relation to each other, we compute Kendall's τ_b rank distance d_τb among the model outputs for each document-instruction pair:

d_τb = (1 - τ_b) / 2.

When the ranking produced by a metric is independent of the human ranking, the value of d_τb will be 0.5 in expectation. Values below 0.5 represent rankings that are similar to the human ranking; values above 0.5 represent orderings that are similar to the inverse of the human ranking. As opposed to the τ_b rank correlation coefficient, d_τb takes values in the range [0, 1] and can be interpreted as a distance function (lower is better). Compared to other forms of τ, τ_b adjusts for ties: situations where a metric or the annotators give the same rating to two or more model outputs for one document-instruction pair.
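Concretely, d_τb can be computed from pairwise comparisons. The tie handling below follows the standard τ_b definition; this is our own sketch, not the paper's code:

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b with tie correction, via pairwise comparisons.

    Undefined (division by zero) when one list is entirely tied,
    matching the undefined cases noted in the text.
    """
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))

def rank_distance(metric_scores, human_scores):
    """d_tau_b = (1 - tau_b) / 2: 0 for identical rankings, 1 for inverted."""
    return (1.0 - kendall_tau_b(metric_scores, human_scores)) / 2.0
```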
For our human ratings, τ_b is not defined for 9 out of 300 document-instruction pairs: the mean of the 3 annotators' ratings is constant across all 3 models, making it impossible to rank them. We report the mean and standard error of the rank distance d_τb across all non-constant pairs.

Linear value correlation Additionally, we would like evaluation-method outputs to align with annotators' notions of "good" or "bad". To study this, we compute Pearson's distance across all document-instruction-answer tuples:

d_|r| = 1 - |r|,

where r is Pearson's correlation coefficient between an evaluation method's values and the mean annotator rating. Values of d_|r| range from 0 to 1; the lower the value, the higher the linear correlation with human ratings.
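This distance can be sketched directly from the definition (d_|r| = 1 - |r| follows from the stated range and interpretation):

```python
import math

def pearson_distance(xs, ys):
    """d_|r| = 1 - |r|, with r the Pearson correlation coefficient.

    xs: one evaluation method's scores; ys: mean annotator ratings.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r = cov / (sx * sy)
    return 1.0 - abs(r)
```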

Results and Analysis
We compare the effectiveness of the evaluation methods on the three evaluation dimensions of Table 4: AUC ROC for the binary rating "Follows Instruction?", and d_τb and d_|r| for the qualitative rating "How well?". For both rating tasks, the two length-based heuristics perform the worst out of all methods, which suggests that the instructions are of good quality, as annotators are not strongly influenced by the length of model outputs.

Predicting "Follows Instruction?"
First, we focus on how good a binary classifier each method is. We report the AUC ROC and its standard error (Section 4.1) with respect to the human majority-vote labels.
Reference-based methods Having access to several reference answers that follow the instruction continues to be a good indicator when combined with ROUGE or BLEURT. However, the results show that when we have access to a capable LLM like F-PaLM 2-Lc, it is better to use it directly as a reference-free evaluator than to sample reference summaries from it and use reference-based metrics like ROUGE-avg and BLEURT-20.

Table 4 (excerpt; rows for the model-based methods of Section 3.2, columns: AUC ROC ↑, d_τb ↓, d_|r| ↓):
BARTSCORE (Yuan et al., 2021): 68.4 ± 2.5, 45.0 ± 1.9, 74.7 ± 2.2
BARTSCORE-CNN (Yuan et al., 2021): 69.7 ± 2.4, 43.7 ± 1.9, 70.3 ± 2.4
T5-ANLI (Honovich et al., 2022): 71.9 ± 2.3, 38.8 ± 1.9, 64.7 ± 2.5
(The remaining rows cover the LLM-based reference-free methods of Section 3.3; n = 900.)

From the Table 4 caption: for "How well?" (1-5 rating), we report Kendall's rank distance d_τb comparing evaluation methods' ranking of answers to that of the annotators (n = 291) and Pearson's distance from mean annotator responses d_|r| (n = 900). All values are in %, ± signifies standard error, and ↑ (↓) signifies higher (lower) is better. Methods highlighted in bold have overlapping confidence intervals with the best method per column. Non-deterministic methods (Self-Agreement, Multi-LLM Agreement) have been re-run 5× and the mean is reported.
Reference-free methods As expected, the performance of each evaluation method improves with model size. We observe that the standard error is usually higher (> 2.0) when using PaLM 2-S compared to PaLM 2-L (< 2.0), across different methods. Combined with generally lower performance, methods using PaLM 2-S as the underlying model are noisier and produce less meaningful evaluations than methods using PaLM 2-L.
We also note that Multi-LLM Agreement approaches, while interesting, are outperformed by both Self-Agreement and Constrained Softmax approaches, irrespective of the model size.
For scoring-based approaches (Constrained Softmax), non-instruction-tuned LLMs outperform their instruction-tuned counterparts. When generation is involved, instruction-tuned models outperform their base versions; this applies to rating generation, but also to generating answers directly. We therefore only report numbers for instruction-tuned LLMs on generation-based methods and, correspondingly, only report numbers for non-instruction-tuned LLMs on scoring-based approaches.

Predicting "How Well?"
In the case of qualitative ratings, obtaining a ranking of answers that matches the annotators' ranking proves to be difficult. We note sensitivity in the analysis with respect to how ratings are aggregated per answer (majority vote or mean). To minimize ties and maximize the use of annotator information, we use mean aggregation for the following analysis.
Observing d_τb ranking performance, ROUGE-avg using GPT-4-generated reference answers performs on par with F-PaLM 2-Lc Self-Agreement-based methods, as well as with the 11B-parameter T5-ANLI model of Honovich et al. (2022).
Reference-based methods In our experiments, BLEURT performs worse than ROUGE at relative ranking of model outputs. Since ROUGE is based on surface form, there is reason to believe that samples from different models within a single LM family are closer in surface form than samples from different LM families. In Table 5, we analyze the effectiveness of methods at picking the best answers out of the 3 model outputs. Perfect agreement happens when the sets of annotator and metric "winner" models are equal. Disagreement occurs when the intersection between annotator and metric winners is empty. Within disagreement, prefers own LM family means the metric winners contained at least one model output from the LM family the metric is based on. We observe that when the evaluation model is sufficiently different from the rated models, the likelihood of evaluation models preferring their own LM family goes down. However, when using a similar model, reference-based methods are more biased towards preferring their own LM family. If human-written reference answers are unavailable, using a reference-free metric is preferable.
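The winner-set comparison behind Table 5 can be sketched as follows; the function and category names are ours, and the paper's exact tie handling may differ:

```python
def winner_agreement(metric_scores, human_scores):
    """Classify winner-set overlap between a metric and the annotators.

    Scores are dicts model -> score; 'winners' are all argmax models,
    with ties kept in the set.
    """
    def winners(scores):
        best = max(scores.values())
        return {m for m, s in scores.items() if s == best}

    mw, hw = winners(metric_scores), winners(human_scores)
    if mw == hw:
        return "perfect agreement"   # identical winner sets
    if mw & hw:
        return "partial agreement"   # non-empty intersection
    return "disagreement"            # empty intersection
```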
Reference-free methods Similarly to the binary rating, we observe that methods with larger underlying models perform better. Likewise, reference-free methods based on F-PaLM 2-Lc outperform their reference-based counterparts when using the same underlying model. The base PaLM 2-L model with Constrained Softmax performs better, and at lower cost, than using the instruction-tuned F-PaLM 2-Lc to generate reference summaries. With more available compute, one can further improve performance by leveraging multi-sampling Self-Agreement methods.
Interestingly, using random examples in Self-Agreement decreases performance as opposed to hand-crafting a small (k = 4) set of held-out examples.Contrary to intuition, using Chain-of-Thought approaches (rationale) seems to degrade performance, but when removing the task description (no intro) we do not observe a big drop.
When linear correlation d_|r| with human ratings is required, methods that model the qualitative rating directly outperform more generic methods.

Related Work
Measuring instruction following with LLMs Liu et al. (2023a) use GPT-4 as a backbone model and study the correlation with human ratings on non-query-based summarization, finding a bias towards LLM-generated text. We do not study this aspect, as our rating task focuses on model-generated text. Fu et al. (2023) propose a zero-shot approach for multi-faceted evaluation of text generation.
An increase in interest in improving the instruction-following capabilities of LLMs has resulted in the creation of multiple datasets. FLAN (Wei et al., 2022a) and Natural Instructions (Mishra et al., 2022) were two of the earlier datasets that turned standard NLP tasks (e.g., sentiment classification, question answering) into instruction-following tasks. Other works like Self-Instruct (Wang et al., 2023b), Super-NaturalInstructions (Wang et al., 2022), and the H4 instruction dataset (Hugging Face, 2023) curate human-written instruction-answer pairs. Guo et al. (2023) and Qingyi Si (2023) collect instruction-answer pairs from LLM generations. All of them use standard NLP metrics or human annotation to evaluate the model outputs.
Model-based metrics A large body of prior work focuses on model-based approaches fine-tuned on human ratings. Usually, encoder models such as BERTSCORE (Zhang et al., 2020) or BLEURT (Sellam et al., 2020) are used, but encoder-decoder methods exist as well (BARTSCORE, Yuan et al., 2021). We focus on low-resource zero/few-shot methods using larger, decoder-containing models from the PaLM and GPT families.

LLM evaluation Many recent works use LLMs as evaluators for summarization tasks. Wu et al. (2023) use LLMs with "different personas" to evaluate summaries from various perspectives. Luo et al. (2023) examine whether LLMs can be used to detect factual inconsistencies. Concurrent to our work, Liu et al. (2023b) curate a human-evaluation dataset consisting of 22,000 summary-level annotations, perform a study of various automatic and LLM-based metrics for summarization, and call for more rigorous evaluation of LLM performance.

Conclusion
In this work, we investigate the effectiveness of multiple evaluation methods in quantifying the degree to which LLM-generated text follows user-given instructions. We release riSum, a new short-form dataset of 300 document-instruction pairs with 3 answers each. All 900 answers are rated by at least 3 human annotators. When analyzing agreement between evaluation methods and human judgment, we find that established metrics such as ROUGE and BLEURT are not effective at quantifying LLMs' instruction-following ability. LLM-based evaluation methods tend to have a stronger correlation with annotator judgment, without requiring high-quality reference answers. We hope that the introduced evaluation framework is adopted by the community for evaluating the instruction-following abilities of LLMs, possibly expanding into more tasks, domains, and examples.

A Limitations
While the presented data offers variety (e.g., diverse origin texts), a drawback of our work is that we only consider instruction-based summarization and its variants (e.g., long-form question answering, query-driven summarization, stylistic summarization). The extent to which the metrics generalize to other tasks is not yet explored. Furthermore, in terms of language diversity, the proposed benchmarks are restricted to English only. However, we hope that this initial benchmark allows further work to consider a larger range of tasks, as well as exploration of how these benchmarks generalize to other languages.
Our analysis of correlation with human judgment on the qualitative rating ("How well?") has a limitation when the annotators do not provide sufficient signal to distinguish between the 3 answers. This happens in only 9 out of the 300 document-instruction pairs, and we chose to skip those pairs in the analysis for this rating task. The motivation is that our focus is on the cases where there is sufficient signal from the human annotators that one answer is better than another.
We acknowledge that relying on human ratings as ground truth has drawbacks, especially as summarization is notoriously difficult to evaluate due to its subjective nature. To mitigate this, we provide extensive training and feedback to annotators and remain in active communication throughout the annotation process to provide clarifications. The annotators used in our experiment have over a year of experience with rating NLU tasks. However, a limitation is that our annotator pool represents individuals from similar backgrounds, which may mean other populations would have differing quality perspectives. The background statistics of the annotators can be found in Appendix C.2.

B Ethics Statement
The alignment of model behavior with user expectations is a crucial area of research, and we recognize the importance of contributing to the development of benchmarking methods for instruction following. Our work represents a step towards benchmarking how LLMs can self-evaluate their performance in the task of summarization. However, there are still many other aspects of summary quality, such as factuality, that warrant further exploration due to their significant downstream implications. A model's ability to follow instructions for a specific task, such as summarization, may not reflect its overall proficiency in instruction following. As such, these metrics serve as proxies to estimate the extent to which task instructions are adhered to within the context of summarization. Given the ongoing discussions regarding the risks associated with LLMs, this distinction is relevant.
During dataset construction, it is important to acknowledge the ethical concerns arising from the use of publicly sourced data without explicit permission from the original parties.While the data we employ is derived from previously released datasets, the examples are generated using LLMs trained on large, uncurated, static datasets obtained from the internet.

C Annotator methodology

C.1 Annotation UI
In Figure 5 we illustrate the user interface used for collecting the dataset. Annotators follow a multi-step process, first answering "Does the output follow the instruction?" followed by "Rate the output on a scale of 1 to 5" to qualitatively assess the answer.
The UI also allows annotators to navigate through the provided content and highlight words that appear either in the answer or in the original text.Annotators can use this as a way to verify that content is present in both the output and input.

C.2 Annotator demographics
Table 6 presents the results of an optional questionnaire given to our annotators, aimed at understanding their background factors.Out of a total of 14 annotators, we have received responses from 7 individuals, who collectively accounted for approximately 65% of the annotation coverage for our dataset (Figure 4).This information allows us to gain a better understanding of the perspectives and experiences of our annotators, which can impact the annotation outcomes.

D.1 Objective
The goal of this task is to evaluate the quality of summaries generated based on given instructions. You will be provided with a document, an instruction, and an output (summary). Your task is to answer two questions:
1. Does the output follow the instruction? (Yes/No)
2. Rate the output on a scale of 1 to 5, with 1 indicating that the output does not follow the instruction at all and 5 indicating that the output follows the instruction strictly.

D.2 General Guidelines
Understanding the Document Before evaluating the output, make sure you have a clear understanding of the document. The document can be a news article, a chat conversation, an email, etc. Read the document carefully and identify the main points, themes, or ideas.
Analyzing the Instruction The instruction will be related to summarization. It can be general (e.g., "Summarize in 3 bullet points") or specific to the paragraph (e.g., "Summarize the main novelty of the research work concisely"). Make sure you understand the instruction and its requirements.
Evaluating the Output Compare the output with the document and the instruction. Check whether the output follows the instruction and captures the main points, themes, or ideas of the document.
Evaluation Criteria For Question 1, answer "Yes" if the output follows the instruction and "No" if it does not. Consider the following factors: (i) does the output meet the format requirements (e.g., bullet points, concise summary)? and (ii) does the output address the specific focus of the instruction (e.g., main novelty, key findings)? For Question 2, rate the output based on how well it follows the instruction and captures the main points, themes, or ideas of the document. Use the

Figure 1 :
Figure 1: Randomly sampled example from riSum (data source: AESLC). Highlighted: how GPT-4 transforms parts of the input document into grounded instructions.
Human evaluation Kryściński et al. (2018); Huang et al. (2020); Shen et al. (2022b) and several others have resorted to human evaluation for analyzing the quality of reference summaries and model outputs. They adopt a Likert-type scale for rating individual aspects of generated text. Fan et al. (2018); Fabbri et al. (2019); Shen et al. (2022a) and others perform side-by-side comparisons of two or more model-generated summaries and use Elo or other rating systems to build rankings of models.

Figure 4 :
Figure 4: Number of questions annotated by each human annotator. Annotator IDs are pseudonymized to capital letters.

Figure 5 :
Figure 5: Example screenshot of the annotator UI.

Table 1 :
Data sources from which riSum is sampled, with the minimum (Min), median (Med), and maximum (Max) sampled document length (in words). 10 documents were sampled without replacement from each of the 10 data sources.
LLM instances debate amongst each other and try to arrive at a common assessment. Though there are no restrictions on which LLMs to use, we evaluate the simplest case where each instance is the same LLM. The rules of communication are set as follows (Figure 3).

Table 4 :
AUC ROC Curve measures how well methods predict Yes/No annotator responses on "Follows Instruction?"

Table 5 :
Agreement analysis with respect to mean qualitative ranking ("How well?").