Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation

We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision), while allowing direct control over compression ratio. Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation (West et al., 2022), where latent knowledge in pre-trained language models is distilled via explicit examples sampled from the teacher models and further purified with three types of filters: length, fidelity, and Information Bottleneck. Moreover, we uniquely propose iterative distillation of knowledge, where student models from the previous iteration of distillation serve as teacher models in the next iteration. Starting from a relatively modest set of GPT3-generated summaries, we demonstrate how iterative knowledge distillation can lead to considerably smaller but better summarizers with sharper controllability. A useful by-product of this iterative distillation process is a high-quality dataset of sentence-summary pairs with varying degrees of compression. Empirical results demonstrate that the final student models vastly outperform the much larger GPT3-Instruct model in terms of controllability of compression ratios, without compromising the quality of the resulting summaries.


Introduction
We introduce REFEREE, a new framework for sentence summarization that works by iteratively generating and distilling knowledge into successively better models. This allows REFEREE to be [Refer]ence fr[ee], beginning by distilling from a large language model rather than from supervised data. Yet, our method results in a more efficient, compact, and controllable summarization model than what we start with.¹

¹ See https://github.com/msclar/referee for code, models, and data.

Figure 1: Our method results in high-quality, reference-free compact summarizers. We begin by using a large language model (e.g. GPT-3) to generate many summaries that demonstrate different aspects we may want in a summary; gray represents an aspect well-represented in these generations, while black is underrepresented. We first use REFEREE-DISTILL to iteratively filter and train summarizers that better represent these desirable aspects, e.g. shorter summary length. We then use generations from REFEREE-DISTILL to train a model in which these aspects are controllable: this is REFEREE-CONTROL.
Our work follows the paradigm of Symbolic Knowledge Distillation (West et al., 2022), which transfers implicit knowledge from a massive language model to a considerably smaller student model by explicitly generating knowledge in textual form. Unlike traditional knowledge distillation (Hinton et al., 2015), where the teacher model and the student model are of the same type, symbolic knowledge distillation allows the student model to be of a different type.
Our work differs from West et al. (2022) in three key aspects. First, our distillation is iterative: each student model becomes a teacher in successive rounds, refining and improving summarization at every step. Second, REFEREE controls for more than just overall quality, improving multiple model aspects in each round, such as length, fidelity, and information bottleneck (Tishby et al., 1999), then allowing explicit length control at generation time. Third, our work is the first to show that reference-free, controlled sentence summarization can be formulated as symbolic knowledge distillation.

REFEREE works in two phases, illustrated in Figure 1. First, REFEREE-DISTILL uses a modest number of generated summaries from GPT-3 (Brown et al., 2020) to produce high-quality and compact summarizers (Goyal et al., 2022). We follow an iterative approach; in each iteration we filter generations for desirable qualities, re-train a new and better summarizer, and finally generate new summaries for the next round. Each round amplifies effects of the previous rounds, improving notions of summary quality like entailment or shorter length. Second, REFEREE-CONTROL uses these iteratively distilled summaries to train a model with explicit control: in our experiments, we use progressively shortened generations from each iteration to train a final summarizer with explicit length control.
We find that REFEREE demonstrates compelling empirical results compared to competitive baselines. REFEREE-DISTILL, even without explicit length control, is able to generate shorter summaries with more consistency and equal quality compared with the original teacher model (GPT-3, 16x larger in size) as well as a supervised model. Moreover, REFEREE-CONTROL, which has more direct length control baked in, demonstrates a sharp degree of control over length, and succeeds at generating high-quality summaries at specified lengths with significantly higher accuracy than GPT-3. In sum, the promising empirical results of REFEREE encourage further investigation to extend the framework of symbolic knowledge distillation for reference-free, controlled text summarization.

Methods
We first describe REFEREE-DISTILL (see §2.1), an iterative procedure to promote specific behaviors that may not be prevalent in the original data, while maintaining summary quality. We explore two different filters, detailed in §2.2. We then detail REFEREE-CONTROL (see §2.3), a model that separates summaries into categorical variables and is iteratively trained to summarize a given sentence within the desired category (e.g., a range of compression ratios). In this work we only consider categories that reflect different compression ratios, but the same approach could be applied to other types of control categories, such as style.

Iterative Symbolic Knowledge Distillation: REFEREE-DISTILL
Let D = D_0 ∪ ... ∪ D_t denote a sentence corpus without reference summaries. We start with a teacher model (GPT3-Instruct Curie) from which we want to distill summarization knowledge under a fixed budget. Using D_0, a small subset of D, we first generate a dataset of sentence-summary pairs (C_0) by few-shot prompting the teacher and automatically filtering low-quality generations. Filters are detailed in Section 2.2. Throughout the whole training procedure, we store each entry (s, s') as "s TL;DR: s' <eos>". Here, <eos> denotes end of sequence and TL;DR: is a separator that has been shown to encourage summarization behavior (Radford et al., 2019). Let M_0 be a pre-trained model significantly smaller than GPT-3 (GPT2-Large in our experiments). Using the seed dataset C_0, we train a student model M_1 by fine-tuning M_0 with a language modeling loss. We then iteratively refine this model by (1) using it to generate summaries for a subset of D, (2) filtering them to remove undesired behaviors, and (3) training another student model on the filtered dataset, essentially distilling a better summarizer. We execute this procedure for t steps, creating t + 1 different summarization datasets in the process: C_0, C_1, ..., C_t. We discuss two possible instantiations of filter_i below.
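The loop above can be sketched in code. This is a minimal sketch only: `finetune`, `summarize`, and the per-iteration binary `filters` are hypothetical stand-ins for the paper's actual components (language-model fine-tuning, generation, and the filters of §2.2); only the control flow is meant to mirror the text.

```python
def referee_distill(base_model, seed_pairs, sentence_subsets, filters,
                    finetune, summarize):
    """Run t distillation steps; returns the final student and C_0..C_t."""
    corpus = list(seed_pairs)      # C_0: filtered teacher (GPT-3) generations
    datasets = [corpus]
    student = None
    for subset, keep in zip(sentence_subsets, filters):
        # (3) train a new student on the previously filtered corpus
        student = finetune(base_model, corpus)
        # (1) generate summaries for a fresh subset of D
        candidates = [(s, summarize(student, s)) for s in subset]
        # (2) filter out undesired behaviors
        corpus = [(s, summ) for s, summ in candidates if keep(s, summ)]
        datasets.append(corpus)
    return student, datasets
```

Note that each round trains a fresh student on the previous round's filtered corpus, so filtering effects compound across iterations.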

Filters
There is no one summary that is better than all others; depending on the desiderata of the end users, some might prefer shorter but less informative summaries, while others might prefer longer, more informative ones. While some of these goals are universal and always desired (for example, a summary should be accurate, in that it should not contain information not present in the input), others can be tailored to the end task. We use binary filters (filter_i) to operationalize these goals. We experiment with the following filters.
Summary Fidelity Filter To encourage accurate summaries, we employ a simple but effective criterion: the summary should be entailed by the input sentence. More formally, we define a binary filter, f_NLI(s, s') := 1{s ⇒ s'}, and discard all non-entailed sentence-summary pairs to avoid using these samples when training the next iteration's student. We measure entailment using an off-the-shelf state-of-the-art NLI model (Liu et al., 2022a).
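A minimal sketch of this fidelity filter, with the NLI model abstracted away: the `entails` callable is a hypothetical stand-in for an off-the-shelf NLI classifier (the paper uses WANLI) returning the predicted label for a (premise, hypothesis) pair.

```python
def make_nli_filter(entails):
    """f_NLI(s, s') = 1 iff the source sentence entails the summary."""
    def f_nli(sentence, summary):
        return entails(sentence, summary) == "entailment"
    return f_nli
```

Pairs for which the filter returns False are discarded before training the next student.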
Summary Length Filter While underexplored in prior work, constraining the length of generated text, especially in summarization, is a desirable feature for real-world applications with limited screen space. To obtain a corpus of summaries of varying lengths, at each distillation step i we encourage the student M_i to generate progressively shorter outputs. We achieve this by constraining C_i to contain only summaries within a predefined compression ratio r_i ∈ [0, 1]. More precisely, we define f_len_i(s, s') := 1{|s'|/|s| ≤ r_i}, where r_i > r_{i+1} for all i, so that models progressively summarize more succinctly; |s'|/|s| is commonly referred to as the compression ratio. In theory, one could generate data for all desired compression ratios directly from M_1. However, since the seed dataset C_0 is heavily skewed towards longer summaries, the final corpus after filtering with f_NLI would be extremely small for lower compression ratios. We find that combining the two filters and iteratively refining models to produce shorter, accurate summaries leads to a more diverse and still high-quality final corpus.
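The length constraint above can be sketched directly; the schedule values r = (0.7, 0.5, 0.3) match the ratios reported in the experiments. The "keep if ratio ≤ r_i" reading is one plausible operationalization of constraining C_i to a predefined compression ratio.

```python
def compression_ratio(sentence, summary):
    # character-level compression ratio |s'| / |s|
    return len(summary) / len(sentence)

def make_length_filter(r_i):
    # keep a pair only if its compression ratio is at most r_i
    return lambda sentence, summary: compression_ratio(sentence, summary) <= r_i
```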
Contextual Filter For many applications, the sentences we need to summarize are part of a larger piece of text, such as a paragraph or a document (e.g. emails, articles). This contextual information may further improve sentence-summary quality, since depending on the larger context, different information could be more important to preserve, and inter-sentence redundancies could be removed. Inspired by West et al. (2019)'s interpretation of the Information Bottleneck principle (Tishby et al., 1999), we consider the following filter: f_NSP(s, s') := 1{p(s_next | s') ≥ l · p(s_next | s)}, where NSP refers to "next sentence prediction", p is an oracle language model (which we approximate with GPT2-Large), s_next denotes the sentence immediately following the input sentence s, and l ∈ [0, 1] is a hyperparameter. Intuitively, we want to find summaries that are good predictors of the next sentence, so as to select the most crucial information and preserve coherence. l allows us to strike a balance between sacrificing some of the information in s and maintaining enough to predict s_next. Adding f_NSP requires expanding the input sequence to also include the next sentence throughout the iterative distillation process defined in §2.1.
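One plausible formalization of this contextual filter, consistent with the text: keep a summary if it predicts the next sentence nearly as well as the full source does, up to the tolerance l. Here `log_p_next(context, nxt)` is a hypothetical stand-in for the oracle language model's log p(s_next | context) (approximated with GPT2-Large in the paper).

```python
import math

def make_nsp_filter(log_p_next, l):
    def f_nsp(sentence, summary, next_sentence):
        # keep iff p(s_next | s') >= l * p(s_next | s), computed in log space
        return log_p_next(summary, next_sentence) >= \
            math.log(l) + log_p_next(sentence, next_sentence)
    return f_nsp
```

A smaller l tolerates more information loss; l = 1 would demand the summary predict the next sentence at least as well as the original sentence.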

Final REFEREE-DISTILL Filters Definition
We experiment with two filters, f_1 and f_2 (or #1 and #2, as we will refer to them during experiments). f_1 does not assume the existence of any context, and so it only filters for inaccuracies and length: f_1 := f_NLI ∧ f_len. This allows f_1 to be applied in broader contexts. We also define f_2, which adds contextual filtering: f_2 := f_NLI ∧ f_len ∧ f_NSP.

Fluency Filter To ensure fluency over several self-training iterations, we consider an additional filter, used only in REFEREE-CONTROL. Given a sentence x = (x_1, ..., x_ℓ), we define AvgNLL(x) := −(1/ℓ) Σ_{i≤ℓ} log p(x_i | x_{<i}). We deem a summary fluent if and only if its mean negative log-likelihood (NLL) does not exceed that of the source sentence, leading to the filter f_AvgNLL(s, s') := 1{AvgNLL(s') ≤ AvgNLL(s)}.
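The fluency criterion above can be sketched as follows; `token_nlls(text)` is a hypothetical stand-in for a language model returning the per-token negative log-likelihoods of `text`.

```python
def avg_nll(token_nlls, text):
    # mean negative log-likelihood of a piece of text
    nlls = token_nlls(text)
    return sum(nlls) / len(nlls)

def make_fluency_filter(token_nlls):
    # a summary passes iff its mean NLL does not exceed the source's
    return lambda sentence, summary: (
        avg_nll(token_nlls, summary) <= avg_nll(token_nlls, sentence))
```

Comparing against the source's own mean NLL makes the threshold adaptive: unusual source sentences do not unfairly penalize their summaries.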

REFEREE-CONTROL
Using the high-quality corpora of varying compression ratios obtained with REFEREE-DISTILL, we train REFEREE-CONTROL, a summarization model that allows explicit control for the desired compression ratio. We divide all possible compression ratios into n buckets, where each bucket b_i = [i/n, (i+1)/n) for 0 ≤ i < n. Using b_i as control codes, we train a model that, when prompted with one, can summarize at a compression ratio within b_i.
Similar to D, we start with a corpus F = F_0 ∪ ... ∪ F_t of sentences without reference summaries. Additionally, we create a seed corpus labeled with compression ratios, E_0 = C_0 ∪ ... ∪ C_t (with F_0 = D_0 ∪ ... ∪ D_t), now representing each example (s, s') as "s <sep> <bucket_tok j> TL;DR: s' <eos>", where <bucket_tok j> corresponds to the bucket in which the example lies, that is, |s'|/|s| ∈ b_j. <sep> is a special token. We denote the subset of E_0 corresponding to bucket j as E_0^(j). This seed dataset is filtered to remove low-quality generations, with the same filter as all subsequent iterations.
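The bucket assignment and training-string format above can be sketched together: with n buckets, b_j covers compression ratios in [j/n, (j+1)/n), and each example is serialized with its bucket token. Clamping ratios at or above 1 into the last bucket is an assumption for edge cases not spelled out in the text.

```python
def bucket_index(sentence, summary, n=10):
    ratio = len(summary) / len(sentence)
    # clamp ratios >= 1 (summary not shorter than source) into the last bucket
    return min(int(ratio * n), n - 1)

def format_control_example(sentence, summary, n=10):
    # "s <sep> <bucket_tok j> TL;DR: s' <eos>"
    j = bucket_index(sentence, summary, n)
    return f"{sentence} <sep> <bucket_tok {j}> TL;DR: {summary} <eos>"
```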
Similar to REFEREE-DISTILL, starting with a pre-trained model N_0 (GPT2-Large), we train student models via iterative distillation. In each iteration i, we (1) fine-tune the student model using the bucket-labeled corpus E_i, (2) generate summaries for F_i for all buckets, and (3) filter them to create a new labeled corpus E_{i+1}. We do not reinitialize the student at each iteration, but rather fine-tune starting from the teacher's current local optimum. We use h(s, s') = f_NLI ∧ f_AvgNLL as the filter. Formally, E_{i+1} = ∪_j {(s, s') : s ∈ F_i, s' = N_i(s, b_j), h(s, s') = 1}.

Primal-Dual Problem Interpretation of Summarization
Assuming summaries are fluent and factual, sentence summaries trade off between two variables: level of compression and level of information preservation. We can effectively fix the level of compression by introducing control codes, and then develop models to maximize information preservation. This is our primal problem. Thanks to length-control codes, we can now also solve the dual problem: "what is the best shortest summary we could write?" Written more precisely: given a fixed level of tolerance for losing information from the original sentence, what is the shortest summary we could write? Furthermore, comparing summaries of similar length also allows for fairer comparisons, since we are effectively measuring changes in only one variable.

On GPT3's Fidelity and Length Control
We analyze GPT3-Instruct Curie's (Brown et al., 2020) sentence summarization capabilities. We prompt GPT-3 to summarize at different compression ratios by few-shot prompting with high-quality sentence-summary pairs in the desired compression ratios. More precisely, we do three-shot prompting with three different sets of summaries: one set of sentence-summary pairs has all three pairs with compression ratios in the interval [0.6, 0.8], another set in [0.4, 0.6], and another in [0.2, 0.4].

Table 1 [caption; beginning truncated]: ... and GPT-3). The first three rows refer to three different datasets generated through three-shot prompting GPT3-Instruct Curie with summaries from different compression ratios (c.r.). The following rows show results for the third and last iteration of our models, using each of the two described filters (f_1, f_2; see §2.2). Sentences correspond to a held-out set during training.
We show that the average compression ratio (c.r.) correlates with the prompts' compression ratio (although variance is large), and that up to 33% of the time models generate summaries longer than the original sentence (see Table 1). Qualitatively, this seems to be because of punctuation edits or hallucinations.
Besides using prompts that encourage shorter summaries, one can iteratively summarize through few-shot prompting. If f_p(s) is the summary GPT-3 generates when prompted with p, then f_p(f_p(... f_p(s))) = f_p^n(s) may also be a summary of s, possibly shorter. We find that successive application of the same prompt did not result in shorter summaries, i.e., f_p^n(s) was not substantially shorter than f_p(s) (see Appendix A.1).
These experiments motivate the need for more sophisticated approaches to length control and to reliable summarization without supervision.

Experiments
Dataset We create the corpora D and F by sampling contiguous sentence pairs from RealNews (Zellers et al., 2019) news articles. We filter out sentences shorter than 50 characters. Using GPT-3 as the teacher, we summarize sentences in D_0 and use the outputs with 60-80% compression ratio as our initial dataset C_0, since it was the best one quantitatively and qualitatively, although this implies that we will not initially have enough short summaries. The initial dataset consists of 10,000 samples, from which we keep only summaries that differ in at most 10% compression ratio and are not identical.

Table 2: Results for paired comparison between two models ("comparative" and "baseline", see first two columns). We use BERTScore (∈ [0, 1]) and ROUGE-1/2/L (∈ [0, 100]). Metric differences are computed as detailed in Section 4.1.1, and the comparative model's score is shown in brackets. A positive difference reflects an improvement of the comparative model over the baseline.

Training Details We use the off-the-shelf model WANLI (Liu et al., 2022a) to create the filter f_NLI. We run 3 iterations of REFEREE-DISTILL; the compression ratios for each training iteration are as follows: r_1 = 0.7, r_2 = 0.5, r_3 = 0.3. All generated data is decoded via beam search (with beam width 5). All experiments with f_NSP use l = e^{-6} ≈ 0.0025, decided empirically through preliminary exploration. We train REFEREE-CONTROL with n = 10 buckets for 7 iterations. For more details, refer to Appendix B. While having contextual information for f_NSP is useful, many applications do not have contextual information available. Therefore, we use the corpus C_0 ∪ ... ∪ C_t generated using f_1 as initial training data for REFEREE-CONTROL, with the goal of increasing its applicability.

Supervised Baseline
We include a supervised baseline as a comparison point. Following Rush et al. (2015), we use Gigaword (Napoles et al., 2012; Graff et al., 2003) as a silver-labeled dataset where the headline is used as the summary for an article's first sentence. We fine-tune GPT2-Large (Radford et al., 2019) (the same architecture as in our models) on this corpus, with default hyperparameters. Due to the nature of the training data, the supervised baseline has a low average compression ratio of 55% (similar to REFEREE-DISTILL Iteration 2, but longer on average than Iteration 3). Importantly, we do not (and cannot) include a conventional knowledge distillation baseline from GPT-3 (e.g. Shleifer and Rush 2020), since the full distribution of token logits is unavailable.

Evaluating REFEREE-DISTILL

Compression and Fidelity Statistics
We observe that iterative training with selection of progressively shorter summaries achieves the goal of generating shorter summaries with less variance in compression ratios (Figure 2). Moreover, in Table 1 we observe that, by using an NLI filter during distillation, REFEREE-DISTILL summaries were ~90% entailed by the original sentence according to WANLI (compared to 79% without an NLI filter during training). This vastly surpasses the comparable GPT-3 dataset (20-40%), and achieves similar fidelity to the best GPT-3 summaries, even though our model is significantly smaller. The same trends hold for both filters.

Comparison with GPT-3
We compare the quality of our summaries (s') and GPT-3-generated summaries (s''), for every trained iteration and every GPT-3 dataset. Since longer summaries will naturally be able to preserve more of the original information (Schluter, 2017), it is not reasonable to compare summaries with wildly different compression ratios. Therefore, we only compare summaries whose compression ratios differ by at most 10%: ||s'|/|s| − |s''|/|s|| ≤ 0.1, where s is the original sentence. To measure summary quality automatically, we compute BERTScore (Zhang* et al., 2020) and ROUGE-1/2/L (Lin, 2004) against the original sentence s (no references are available). Given a metric m, we evaluate models based on m(s', s) − m(s'', s). Positive values reflect that s' had higher scores than the baseline summary s'', which is desirable for all our metrics. Our models show significant improvements in all metrics when compared to every GPT-3 dataset as well as the supervised baseline (see Table 2). Our model shows especially large improvements when compared to the shortest GPT-3 summaries (20-40% prompts). This suggests our iterative procedure was able to preserve quality better during the selection for shorter summaries.
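The paired comparison protocol above can be sketched as follows; `metric` is any reference-free scorer against the source (the paper uses BERTScore and ROUGE), and only pairs whose compression ratios differ by at most 0.1 are compared.

```python
def paired_metric_diff(examples, metric, max_ratio_gap=0.1):
    """examples: iterable of (source, ours, baseline) string triples;
    returns the mean of m(s', s) - m(s'', s) over comparable pairs."""
    diffs = []
    for source, ours, baseline in examples:
        # difference between the two systems' compression ratios
        gap = abs(len(ours) - len(baseline)) / len(source)
        if gap <= max_ratio_gap:
            diffs.append(metric(ours, source) - metric(baseline, source))
    return sum(diffs) / len(diffs) if diffs else None
```

A positive return value means the first system scored higher on comparable pairs, matching the sign convention of Table 2.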
Finally, we compare the effect of introducing a contextual filter f_NSP. We compare two identical training runs that differ only in the filter applied (filter #2 vs. #1), and we observe small improvements when including f_NSP (see the last rows of Table 2; higher values mean #2 was better than #1).

Human Evaluation
We conduct a human evaluation to verify the quality of summaries from REFEREE-DISTILL. We measure 3 axes: faithfulness (is the summary true to the source?), relevance (does the summary capture important information from the source?), and fluency, each on a 3-point Likert scale. We conduct our evaluation on 100 examples and find agreement by Fleiss κ (Fleiss, 1971) of 0.32, 0.34, and 0.57 (respectively), indicating fair to moderate agreement (Landis and Koch, 1977).
We compare between different methods to obtain succinct summaries: REFEREE-DISTILL (Iteration 3, filter #1), GPT-3 20-40%, and the supervised baseline. Following §4.1.1, we only compare sentences if all generations differ in compression ratio by at most 10%. See results in Table 3.
Broadly, we find that REFEREE-DISTILL and GPT-3 achieve significantly higher quality than the supervised baseline, with REFEREE-DISTILL showing even slightly higher scores than GPT-3 on all 3 axes. We also note that this evaluation may somewhat favor baselines, since we only select examples where they achieved a similar compression ratio as us. We are not accounting for the fact that REFEREE-DISTILL can generate short summaries for many examples the other two systems cannot.

Evaluating REFEREE-CONTROL
We train all our models with n = 10 buckets to provide very fine-grained control: the average sentence in our dataset is 134 characters long, which implies that each bucket may span ~13 characters for the average sentence. This implies a model may only have one or two words of freedom.

We show that each REFEREE-CONTROL iteration increases bucket accuracy and reduces the variance of the resulting compression ratios (see Figure 3, Appendix C.2). To maximize quality, in all our experiments we only use one sampled beam. If we wished to maximize bucket accuracy at the expense of possibly reduced quality, we could take the top beams and select the most likely one that falls in the prompted bucket. This procedure increases bucket accuracy dramatically: in iteration two, using one beam yields 42% bucket accuracy for the 80-90% bucket (on a held-out set), whereas using three beams yields an accuracy of 71%, and five, 82%. The same trend holds for other iterations and buckets, reaching 93% bucket accuracy in iteration 7. This trade-off between bucket accuracy and summary quality can be seen for bucket b_3 in Table 4, although the behavior is consistent for all buckets (see Appendix C.1). There, Iteration 3 has slightly higher BERTScore and ROUGE than Iteration 7, at the expense of lower bucket accuracy. We believe this is because, as bucket accuracy increases, we reach harder-to-summarize examples at the desired length range, causing average scores to drop.
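The bucket-constrained re-ranking described above can be sketched simply: scan the top beams in likelihood order and return the first whose compression ratio falls in the prompted bucket, falling back to the most likely beam otherwise. The fallback behavior is an assumption for the case where no beam lands in the bucket.

```python
def select_beam(beams, sentence, bucket, n=10):
    lo, hi = bucket / n, (bucket + 1) / n
    for summary in beams:              # beams assumed sorted by likelihood
        if lo <= len(summary) / len(sentence) < hi:
            return summary
    return beams[0]                    # fallback: most likely beam
```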
Later iterations also show more (small) disfluencies, which we partially attribute to the aforementioned cause. Small disfluencies may also propagate over time, which we mitigate by using the mean negative log-likelihood filter f_AvgNLL. Removing the fluency filter f_AvgNLL would also gain 9 more points of bucket accuracy on average.

Importance of Iterative Distillation
We trained a GPT2-Large model for 10 epochs with the same seed dataset E_0, and compare it with REFEREE-CONTROL iteration 5 (also trained for 10 epochs in total). Bucket accuracy of REFEREE-CONTROL was ~20 points better: for the 30-40% bucket, accuracy in the non-iterative version was 42% vs. 69% for REFEREE-CONTROL; for the 70-80% bucket, accuracy was 34% vs. 58%.

Human Evaluation
We aim to explicitly test the capacity of systems to generate an acceptable summary: that is, a summary that meets minimum human measures of quality and also adheres to the desired length constraint. We omit the supervised baseline here, as it does not have explicit length control, and thus include only REFEREE-CONTROL and GPT-3. Specifically, we measure summary accuracy as the fraction of summaries (of 100 randomly preselected sentences) that adhere to length control while being sufficiently fluent, relevant, and faithful. These axes are measured as in §4.1.2, achieving agreement by Fleiss κ (Fleiss, 1971) of 0.34, 0.22, and 0.25 (respectively), indicating fair agreement (Landis and Koch, 1977). We include two accuracy measurements: acc, requiring adherence to length constraints as well as at least 2 ("fair") out of 3 on all human measures of quality; and acc+, which requires 3 out of 3 on all measures along with length adherence.
Table 5 includes results for the 20-40% and 40-60% compression ranges, following our GPT-3 datasets (see C.3 for more setup details). REFEREE-CONTROL vastly outperforms GPT-3 for both regimes and metrics. More precisely, for the 40-60% regime REFEREE-CONTROL showed +296% in acc and +279% in acc+ when compared with GPT-3; and for the 20-40% regime, REFEREE-CONTROL obtained +68% and +99% respectively. We additionally included three unsupervised summarization systems in the human evaluation. These systems perform some length control, making their summaries comparable, but all three models performed poorly when compared with REFEREE-CONTROL: they were at least 23 points below REFEREE-CONTROL in acc, and 17 points below in acc+. Lastly, we would like to emphasize that REFEREE-CONTROL aims to summarize all examples at the requested compression, regardless of the original sentence's length or difficulty. GPT-3, on the other hand, only summarizes at the requested compression for longer sentences, which generally correlate with easier cases (see details in A.3).

Related Work
Unsupervised Summarization The vast majority of prior work in sentence summarization assumed access to large-scale text-summary paired datasets from which to train supervised models (Rush et al., 2015; Nallapati et al., 2016; Narayan et al., 2018). Nonetheless, these datasets are costly to create, and naturally-occurring summarization datasets (such as news highlights) are noisy and not easily found in other domains. Therefore, recent work emphasized the need for developing unsupervised or self-supervised methods such as autoencoders (Miao and Blunsom, 2016; Baziotis et al., 2019), but in general they lead to less fluent summaries; more recent work has explored the Information Bottleneck principle (West et al., 2019) instead of the reconstruction loss of autoencoders. Our work contributes to this emerging line of research by demonstrating an entirely different method based on symbolic knowledge distillation.
Length-Controlled Summarization While real-world applications would require controlling for summary length, most prior work on automatic summarization has not proposed a principled mechanism for controlling the level of compression. Notable exceptions include Kikuchi et al. (2016), He et al. (2020), Fan et al. (2018), and Liu et al. (2018).
These last works developed supervised models for controllable-length summarization by adding control codes that correspond to a range of summary lengths, commonly referred to as buckets. However, in both works the degree of control is heavily dependent on the training dataset, since bucket bounds are defined so that each bucket has the same number of examples; this may make one bucket correspond to a wide range of compression ratios. Our work adds a unique contribution by proposing a reference-free method that allows for a full range of control, and explicitly evaluates for that behavior. Concurrently to this work, Ghalandari et al. (2022) and Liu et al. (2022b) (the latter optimizing Schumann et al., 2020) proposed unsupervised mechanisms that enforce length at the word level, either through a reward mechanism or with strict length enforcement. In contrast, our method uses iterative knowledge distillation to achieve length control. We also control length at the character level, rather than the word level. This can result in a fairer and more challenging notion of control, as it prevents the simple strategy of first removing all function words to maximize preserved meaning. Notably, and also in contrast to our work, both Ghalandari et al. (2022) and Liu et al. (2022b) require training separate models for compressing at different compression ratios.
Knowledge Distillation Many prior works have focused on similar notions of transferring knowledge between models through generation and distillation, and we draw particular inspiration from West et al. (2022). Shleifer and Rush (2020) also follow a similar form to our work, distilling a summarizer from pretrained models. Our work differs in two key ways. First, like most distillation works, Shleifer and Rush (2020) assume a model trained for the task and access to its full distribution of token logits; neither holds for GPT-3, used here. Second, like many distillation studies, Shleifer and Rush (2020) aim to mimic the teacher model's distribution, while we attempt to improve on it. This core detail sets us apart from many works employing a large teacher model (Kim and Rush, 2016; Schick and Schütze, 2021; Ye et al., 2022), which teach a student to mimic a distribution rather than improve it, as in our case.

Conclusions
We presented REFEREE, a framework for sentence summarization that can be trained without reference summaries, while allowing direct control over summary compression ratio. We uniquely proposed iterative Symbolic Knowledge Distillation, where student models from the previous iteration of distillation serve as teacher models in the next. Distilled models are significantly smaller than the original teacher, GPT-3, and empirical results demonstrated that the final student models vastly outperform the much larger GPT3-Instruct model in terms of the controllability of compression ratios, without compromising the resulting summaries' quality. A useful by-product of this iterative distillation process is a high-quality sentence summarization dataset with varying degrees of compression, which we will release jointly with our models upon publication.

Acknowledgments

This material is based upon work partly funded by the DARPA CMO under Contract No. HR001120C0124, and by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031). S.K. has been supported by a Google PhD Fellowship. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily state or reflect those of the United States Government or any agency thereof.

Limitations
REFEREE was entirely developed and tested with sentences extracted from news articles. More work is needed to assess REFEREE's robustness when applying it to other domains. These domains may differ in text type, topic, or even temporally, which may cause a distribution shift. REFEREE's success is also tied to the quality of other systems, mainly the seed dataset generator (GPT-3) and the summary fidelity filter (operationalized using WANLI entailments). REFEREE may propagate errors and biases in NLI entailments, which may be remedied in the future as NLI research progresses. We believe some edge cases generated by REFEREE may be useful to further augment data for NLI systems, but that investigation was outside the scope of this paper.
REFEREE is built entirely at the sentence level, and more work is needed to extend it to paragraph or document-level, although some of the same ideas could be applied (e.g., control codes over longer inputs).

A Further Insights on GPT-3 Datasets
A.1 Successive Application of GPT-3 Prompts

The graphic shows no real difference when repeating the same prompt, using the previously generated summary to summarize further.

A.3 Distribution of Sentence Length in Cases Where GPT-3 Respected the Desired Compression

We observe that the distribution of original sentence length is markedly different when considering all the sentences in the dataset versus when considering only the sentences where GPT-3's summary was in the desired compression range (see Figure 6). This shows that GPT-3 compresses more only in easier cases, whereas REFEREE-CONTROL follows the distribution of the original set (see Figure 7).

B Training and Dataset Details
We use GPT2-Large for all our fine-tuned models (774M weights, ~16x smaller than GPT3-Curie (Radford et al., 2019; Brown et al., 2020)), and fine-tune for 5 epochs during each iteration of REFEREE-DISTILL and for 2 epochs for REFEREE-CONTROL. D_0 consists of 100,000 sentences and each D_{i>0} consists of 40,000 sentences; F_0 consists of 220,000 sentences and each F_{i>0} consists of 10,000 sentences. All generated data is decoded with beam search using 5 beams.
We balance each bucket before training each REFEREE-CONTROL iteration to avoid over-representing some classes. We use the string characters 0 to n-1 as control codes for their respective buckets. We find that repeating the control code increases bucket accuracy (likely because the model attends more to these tokens), so the control code for bucket 2 is 2 2 2 2 2 2 2 2 2 2 (ten repetitions). It is important to note that, because variance decreases with the number of iterations, even the cases where the summary did not land in the requested compression range are close to fulfilling the constraint. For example, when prompting for 30-40%, 87.5% of the samples fell in the 20-40% range for REFEREE-CONTROL Iteration 7 (77% for Iteration 3; 83% for Iteration 5).
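The control-code scheme described above can be sketched as follows; this is a minimal illustration, assuming a hypothetical `format_input` helper (the function name and exact prompt layout are assumptions, not the released implementation):

```python
def format_input(sentence: str, bucket: int, repetitions: int = 10) -> str:
    """Prepend a repeated control code to the input sentence.

    The bucket index (a single character, '0' to 'n-1') is repeated --
    ten times in the paper's setup -- so that the model attends more
    strongly to the conditioning signal.
    """
    code = " ".join(str(bucket) for _ in range(repetitions))
    return f"{code} {sentence}"

# Bucket 2 with ten repetitions yields the prefix "2 2 2 2 2 2 2 2 2 2".
prompt = format_input("The quick brown fox jumps over the lazy dog.", bucket=2)
```

The repeated single-character code is easy for the tokenizer to isolate, which is one plausible reason repetition helps bucket accuracy.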
Regarding baselines, we trained Ghalandari et al. (2022) from scratch using RealNews; Schumann et al. (2020) was used out-of-the-box; and, due to time constraints, we used the default training data for Liu et al. (2022b), the publicly available Schumann et al. (2020) generations. All baselines measure compression ratio at the word level, in contrast to our work, which measures it at the character level. When comparing with the 20-40% regime we train baselines with a 30% target compression ratio, and with a 50% target when comparing with the 40-60% regime. This maximizes the number of examples that fall in the desired range, since some baselines enforce the target compression ratio more strictly than others.
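The character-level compression ratio used throughout this work (as opposed to the word-level ratio used by the baselines) can be sketched in a few lines; the function names here are illustrative, not from the released code:

```python
def compression_ratio(original: str, summary: str) -> float:
    """Character-level compression ratio, as measured in this work.

    Baselines instead compute the ratio over whitespace-separated
    words, which generally gives a different value for the same pair.
    """
    return len(summary) / len(original)

def in_bucket(original: str, summary: str, low: float, high: float) -> bool:
    """Check whether a summary lands in a target compression range,
    e.g. low=0.2, high=0.4 for the 20-40% regime."""
    return low <= compression_ratio(original, summary) < high
```

For instance, a 3-character summary of a 10-character sentence has a ratio of 0.3 and falls in the 20-40% bucket.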

D On BERT Score Monotonicity
In §2.4, we discussed that introducing control codes allows us to answer a related question: given a specific tolerance for information loss k, what is the shortest summary we could write? Estimating the level of information preservation with BERT Score Recall, we could sample from each bucket and select the shortest summary with BERT Score Recall ≥ k. BERT Score Recall scores are generally well-ordered when sorting buckets in increasing order: on average, the longest non-decreasing subsequence of BERT Score Recall scores has length ~7.7-7.8 across all iteration steps. A longest non-decreasing subsequence of length 10 would mean that BERT Scores are perfectly ordered (a summary in a higher bucket would always have a higher BERT Score Recall than one in a lower bucket); 7.8 means that scores are generally well-ordered, with some noisy generations.
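The ordering statistic above can be computed with a standard longest-non-decreasing-subsequence routine; the sketch below (patience sorting, O(n log n)) is an assumed implementation, not the authors' code:

```python
from bisect import bisect_right

def longest_nondecreasing_subseq_len(scores):
    """Length of the longest non-decreasing subsequence of `scores`.

    Applied to the 10 per-bucket BERT Score Recall values in bucket
    order, a result of 10 would mean the scores are perfectly ordered;
    the ~7.7-7.8 reported above means they are mostly ordered.
    """
    tails = []  # tails[k]: smallest tail of any subsequence of length k+1
    for s in scores:
        i = bisect_right(tails, s)  # bisect_right admits ties (non-decreasing)
        if i == len(tails):
            tails.append(s)
        else:
            tails[i] = s
    return len(tails)
```

Using `bisect_right` rather than `bisect_left` is what makes the subsequence non-decreasing (equal scores extend it) rather than strictly increasing.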

E Examples of Generations
Figure 1: Our method results in high-quality, reference-free, compact summarizers. We begin by using a large language model (e.g., GPT-3) to generate many summaries that demonstrate different aspects we may want in a summary: gray represents an aspect well-represented in these generations, while black is underrepresented. We first use REFEREE-DISTILL to iteratively filter and train summarizers that better represent these desirable aspects, e.g., shorter summary length. We then use generations from REFEREE-DISTILL to train a model in which these aspects are controllable: this is REFEREE-CONTROL.

Figure 3: Statistics per REFEREE-CONTROL iteration, computed on the datasets E_i generated during training.
Natural Language Inference (NLI) for Summarization. Pasunuru et al. (2017), Pasunuru and Bansal (2018), and Li et al. (2018) have used NLI to enhance summarization: Pasunuru et al. (2017) use entailment in multi-task learning, while Pasunuru and Bansal (2018) and Li et al. (2018) use entailment probability as a reward. In this work, we propose an alternative approach for incorporating NLI to enhance summarization fidelity under the Symbolic Knowledge Distillation framework.
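The fidelity-filtering idea, keeping only summaries that the source sentence entails, can be sketched generically. The helper names below are assumptions, and the toy scorer merely stands in for a real NLI model (the paper operationalizes this with a WANLI-trained entailment classifier):

```python
def fidelity_filter(pairs, entail_prob, threshold=0.5):
    """Keep only (sentence, summary) pairs whose summary is entailed
    by the source sentence.

    `entail_prob` is any callable mapping (premise, hypothesis) to an
    entailment probability; in practice this would be an NLI model.
    """
    return [(src, summ) for src, summ in pairs
            if entail_prob(src, summ) >= threshold]

# Toy scorer for illustration only: treat a summary as "entailed"
# when every one of its words appears in the source sentence.
def toy_entail_prob(premise, hypothesis):
    return 1.0 if set(hypothesis.lower().split()) <= set(premise.lower().split()) else 0.0
```

Filtering with an entailment score rather than using it as an RL reward keeps training as plain supervised fine-tuning on the surviving pairs, which is the design choice the Symbolic Knowledge Distillation framing relies on.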

Figure 4: Histogram showing compression ratios under successive application of the same summarization prompt. The plot shows no real difference when repeating the same prompt, using the previously generated summary as input for further summarization.

Figure 5: Histogram of the compression ratio distribution for each prompt set.

Figure 6: Distribution of original sentence lengths of all original sentences, and of the subset of samples where GPT-3 20-40% returned a summary in the prompted compression range.

Figure 7: The corresponding sentence-length distribution for REFEREE-CONTROL (see §A.3).

Figure 8: Boxplots showing the compression ratio distribution per training iteration, per bucket. The 30-40% bucket is shown in Figure 3 in the main paper.

Table 1: Statistics of automatically-generated datasets (ours

Table 6: BERT Score, ROUGE-1/2/L, and bucket accuracy for the 50-60% bucket. Three GPT3 datasets are shown, along with three different iterations of REFEREE-CONTROL.

Table 7: Examples of generations from each system; bracketed values give the character-level compression ratio and whether it falls in the requested 40-60% range.

Original: Specifically, the company agreed to limit pre-payments, to provide accurate estimates of charges, and to disclose details of financing agreements.
GPT3: Verizon has agreed to limit pre-payments, to provide accurate estimates of charges, and to disclose details of financing agreements. [90.41%, 30.41% too long]
REFEREE-CONTROL: The company agreed to limit pre-payments and to provide accurate estimates of charges. [58.9%, in range]
Ghalandari et al. (2022): Specifically, the company agreed to limit pre-payments to estimates financing. [53.42%, in range]
Liu et al. (2022b): company agreed to limit to provide accurate estimates of disclose details financing [56.85%, in range]

Original: "There was a family in the car, they got out before it hit.
GPT3: There was a family in the car when the car hit a tree, and they all got out before it hit the tree. [167.8%, 107.8% too long]
REFEREE-CONTROL: There was a family in the car. [50.85%, in range]
Ghalandari et al. (2022): "There was a family in the car. [52.54%, in range]
Liu et al. (2022b): there was a family they got before it [62.71%, 2.71% too long]

Original: As for the coming revolution, I can't say for certain when that will happen.
GPT3: I can't say for certain when the coming revolution will happen. [82.89%, 22.89% too long]
REFEREE-CONTROL: I can't say for certain when it will happen. [57.89%, in range]
Ghalandari et al. (2022): As for the coming revolution I can't say. [53.95%, in range]
Liu et al. (2022b): for the coming revolution i can <|unk|> t say [59.21%, in range]

Original: In one New Jersey county, the Red Cross was AWOL, says Lt. Matthew Tiedemann of the Bergen County Office of Emergency Management.
GPT3: In one New Jersey county, the Red Cross was AWOL. [37.98%, 2.02% too short]
REFEREE-CONTROL: In one New Jersey county, the Red Cross was AWOL, says Lt. Matthew Tiedemann. [59.69%, in range]
Ghalandari et al. (2022): In one New Jersey county Red Cross was AWOL says Lt. Matthew Tiedemann. [55.04%, in range]
Liu et al. (2022b): in new jersey county red cross awol lt. <|unk|> emergency management [52.71%, in range]

Original: Furthermore, though some might argue that the purpose of food is merely to remind us of our own mortality, I would argue that the best cuisine can be an art form.
GPT3: Food can be both utilitarian and artful. [24.69%, 15.31% too short]
REFEREE-CONTROL: Furthermore, I would argue that the best cuisine can be an art form, rather than a food source. [58.64%, in range]
Ghalandari et al. (2022): Furthermore of food mortality I would argue that the best cuisine can be an art form. [52.47%, in range]
Liu et al. (2022b): the is our would that the best cuisine can be an art [32.1%, 7.9% too short]

Original: Without proper conservation facilities nearby, the artifacts Batchvarov and his colleagues discovered had to be reburied underwater.
GPT3: Without proper conservation facilities nearby, the artifacts Batchvarov and his colleagues discovered had to be reburied underwater. [100.0%, 40.0% too long]
REFEREE-CONTROL: The artifacts Batchvarov and his colleagues discovered had to be reburied. [56.06%, in range]
Ghalandari et al. (2022): Without conservation artifacts Batchvarov had to reburied underwater [51.52%, in range]
Liu et al. (2022b): conservation facilities artifacts colleagues discovered to be reburied underwater [61.36%, 1.36% too long]

Original: Often people are desperate or ridiculous rather than heroic, so I do what needs to be done."
GPT3: That person is heroic. [23.91%, 16.09% too short]
REFEREE-CONTROL: Often people are desperate or ridiculous, so I do it. [57.61%, in range]
Ghalandari et al. (2022): Often people are desperate or ridiculous heroic I. [54.35%, in range]
Liu et al. (2022b): so i do what needs to be done <|unk|> [40.22%, in range]

Original: Liam's feet make a guest appearance in the picture, as does his Great Dane Watson who he bought with ex-girlfriend Sophia Smith can be seen by his settee.