Don’t Add, don’t Miss: Effective Content Preserving Generation from Pre-Selected Text Spans

The recently introduced Controlled Text Reduction (CTR) task isolates the text generation step within typical summarization-style tasks. It does so by challenging models to generate coherent text conforming to pre-selected content within the input text (“highlights”). This framing enables increased modularity in summarization-like tasks, allowing a single CTR model to be coupled with various content-selection setups and modules. However, there are currently no reliable CTR models, and the existing baseline for the task performs too poorly to be of practical use. Here, we address this gap by introducing a high-quality, open-source CTR model that tackles two key prior limitations: inadequate enforcement of the content-preservation constraint, and suboptimal silver training data. To address these, we amplify the content-preservation constraint in both training, via RL, and inference, via a controlled decoding strategy. Further, we substantially improve the silver training data quality via GPT-4 distillation. Overall, pairing the distilled dataset with the highlight-adherence strategies yields marked gains over the current baseline, of up to 30 ROUGE-L points, providing a reliable CTR model for downstream use.


Introduction
The abstractive text summarization task, aiming to generate accurate and coherent summaries from one or multiple documents, involves two principal sub-tasks: (a) identification of salient information in the input text(s) and (b) its consolidation into a coherent text. Recently, Slobodkin et al. (2022) proposed an explicit decomposition of these two subtasks, particularly concentrating on the latter as an isolated task termed Controlled Text Reduction (CTR), as illustrated in Figure 2. This task takes as input a text with pre-selected marked spans ("highlights") and expects a reduced and coherent version of the text, covering precisely the content of these input spans. In addition to a baseline model, the authors also provided crowdsourced dev and test sets, along with automatically-generated silver training data, where each instance comprises a document with highlighted content and the corresponding reduced text summary. The proposed adoption of CTR offers greater control over text generation, enabling modular summarization systems, where a single CTR model can be combined with various content selection strategies and user preferences. For example, the same CTR model could be used for generation from content selected for either generic or query-focused summarization (QFS; Dang, 2006), or for long-form question-answering (LFQA; Fan et al., 2019). Further, by excluding the subjective content selection requirement of the full summarization task, CTR offers a semantically well-defined and objective generation task, focused on coherent content consolidation.
Because the task was introduced only recently, no high-quality CTR model exists yet: the present baseline covers only half of the highlighted details while pulling in much non-highlighted content from the surrounding context. In this paper, we aim to design an efficient CTR model of sufficient quality for reliable integration in modular architectures. Our proposed method, outlined in Figure 1, addresses two different shortcomings of the existing resources. First, we examine methods to intensify the highlights signal during training and inference, in order to yield better content preservation. Second, we address the noise evident in the available silver training data, improving its quality using GPT-4.
Figure 1: Overview of our contributions, encompassing three modeling phases. Components introduced in our approach are denoted in blue. (a) We generate new target summaries using GPT-4, conditioned on the silver highlights in the original dataset. (b) During training, we fine-tune our model taking an RL approach, based on Quark (Lu et al., 2022a). (c) During inference, we employ a highlights-centric controlled decoding algorithm.
We begin our investigation by exploring strategies for enforcing the highlights signal. First, we explore the use of Reinforcement Learning (RL) during training to both accentuate the highlights signal and mitigate the impact of the data noise, adapting the recently-introduced Quark algorithm (Lu et al., 2022a), which aims to unlearn unwanted properties. Since RL methods bias the model to conform to a reward function, in addition to the training data, we hypothesize that RL has the potential to reduce the impact of noisy data biases on the model's behavior. We then design a highlight-aware decoding mechanism, following Wan et al. (2023), which biases the model to better preserve the highlights content at inference time. Finally, we address the inherent noise within the available silver training data, by employing GPT-4 (OpenAI, 2023) to generate cleaner training data for our model, essentially performing (symbolic) distillation from GPT-4.
Empirically, we demonstrate that each of the aforementioned strategies separately yields state-of-the-art results, surpassing the baseline model in terms of highlights content preservation. Further, we show that GPT-4 is indeed effective in generating better silver training data, leading to further improvements.
Hence, our contribution in this paper is twofold:
1. Proposing and investigating multiple strategies to amplify the highlights signal in the CTR setting, addressing training, inference, and data generation.
2. Developing a high-quality CTR model, significantly outperforming the available baseline.

Background
This section provides the needed background regarding our task and the methods we employ.
Controlled Text Reduction Controlled Text Reduction (CTR; Slobodkin et al., 2022) is a recently introduced task that aims to generate a reduced version of a text that exactly covers pre-determined selected content, referred to as "highlights" (see Figure 2). It effectively generalizes the sentence decontextualization task (Choi et al., 2021), which addresses only the case of rephrasing a single full sentence given in context to be comprehensible standalone. In contrast to the full summarization task, which involves a substantial degree of subjectivity in content selection, CTR requires exact preservation of modularly pre-selected content, making the task more semantically objective. At the same time, this stringent requirement for both faithfulness to and coverage of the highlights makes the task more semantically challenging than standard summarization, which requires only faithfulness, not full coverage.
Accompanying the task, the CTR authors introduced manually annotated development and test sets, as well as an automatically-generated silver train set, derived from the DUC summarization dataset using a summary-source alignment model (SuperPAL; Ernst et al., 2021). However, an approximate 30% mismatch surfaced between the 'silver highlights', namely source spans identified by SuperPAL as related to the summary, and their respective summary content. Such a discrepancy may cause a model trained on this data to develop biases toward overlooking certain types of highlights, as well as toward including salient, yet non-highlighted content. Here, we address this issue by improving the CTR training dataset.
Figure 2 (example text recovered from the figure). Input document excerpt: "Weather experts discussed the drought Thursday and its possible causes and effects, saying they hoped to produce a consensus on how much longer it will last. All of the speakers at the 1988 Drought Symposium called for more research and study. … John Hope, the network's hurricane specialist and a forecaster with the National Hurricane Center in Miami, said the drought in the Southeast might be lessened or ended soon by a heavier than normal hurricane season." Reduced output: "Weather experts discussed the possible causes and effects of the drought, with the hope of producing a consensus on how much longer it will last. John Hope, from the National Hurricane Center said the drought might end soon."
Controlling via Reinforcement Learning Reinforcement Learning (RL) has been increasingly utilized to control various facets of text generation (Pasunuru and Bansal, 2018; Yuan et al., 2019; Nakano et al., 2021), with most works relying either on REINFORCE (Williams, 1992), an algorithm notable for its direct yet high-variance approach to policy optimization, or on Proximal Policy Optimization (PPO; Schulman et al., 2017), an algorithm renowned for efficiently balancing policy stability and learning. There has also been a growing interest in using RL methods for reducing undesired behaviors, including toxicity (Faal et al., 2022) and redundancy (Mao et al., 2020).
Following this line of work, Lu et al. (2022a) recently introduced Quark, an algorithm inspired by the Proximal Policy Optimization algorithm, designed to unlearn unwanted properties, which we leverage here. While REINFORCE and PPO are used to learn a desired policy, Quark aims to remove or reduce specific behaviors from the learned policy, enabling it to effectively address the undesired behaviors in the text generation process. The algorithm iteratively alternates between three steps: (1) Exploration, where the current model state generates new input-output samples from the training inputs, and then incorporates them into the stored data pool; (2) Quantization, where a predefined reward function is used to rank and classify the accumulated data pool into K quantiles, with a reward token assigned to each; and (3) Learning, where the algorithm maximizes the likelihood of the accumulated data pool, with the instances conditioned on their quantile labels. A KL-divergence penalty is also applied during training to ensure proximity to the original language model distribution (Jaques et al., 2016; Ziegler et al., 2019).
The objective of Quark is to teach the model to generate texts of varying quality with respect to the reward function. Then, at inference, the model is asked to generate high-reward outputs. The algorithm exhibits state-of-the-art results in several attribute-removal objectives, such as toxicity and unwanted sentiment, surpassing many other RL-based text generation approaches, including PPO.
In this work, we propose modifying Quark to teach models to unlearn biases caused by the somewhat noisy training data, thereby enhancing their performance in adhering to the highlighted content.
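The Quantization step described above can be illustrated with a minimal sketch. It assumes equal-sized rank buckets and uses hypothetical reward-token names of the form `<rw_k>`; neither detail is taken from the Quark implementation, and the reward values are made up for illustration.

```python
from typing import List


def assign_quantile_tokens(rewards: List[float], num_quantiles: int) -> List[str]:
    """Rank pooled samples by reward and map each one to a reward token.

    The token names ("<rw_0>", "<rw_1>", ...) are hypothetical placeholders;
    a higher index means a higher-reward quantile.
    """
    n = len(rewards)
    # Sample indices ordered from lowest to highest reward.
    order = sorted(range(n), key=lambda i: rewards[i])
    tokens = [""] * n
    for rank, idx in enumerate(order):
        # Equal-sized buckets by rank; the last bucket holds the best samples.
        bucket = min(rank * num_quantiles // n, num_quantiles - 1)
        tokens[idx] = f"<rw_{bucket}>"
    return tokens


# Illustrative rewards for a pool of eight sampled outputs (made-up values).
pool_rewards = [0.10, 0.80, 0.45, 0.95, 0.30, 0.60, 0.20, 0.70]
tokens = assign_quantile_tokens(pool_rewards, num_quantiles=4)
```

In the Learning step, each training instance would then be conditioned on its token, and at inference the model would be conditioned on the token of the highest quantile.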
Controlled Decoding Controlled decoding algorithms aim to guide models to better address concrete output requirements that can be measured and enforced during decoding. To this end, various works designed constraint-sensitive decoding methods that modify the search space in accordance with the constraints (Anderson et al., 2017; Hokamp and Liu, 2017; Post and Vilar, 2018; Lu et al., 2021).
Recently, a faithfulness-aware decoding mechanism was proposed by Wan et al. (2023), which involves a lookahead operation during decoding, inspired by Lu et al. (2022b). At every decoding step, the algorithm projects into the future to create a complete summary commencing with the current tokens of any partially formed summary. It then selects tokens that provide paths exhibiting enhanced faithfulness within the search space. Formally, each token's score is calculated by:
$$f(y_t) = \log P(y_{\le t} \mid x) + \lambda \cdot \max_{L_l(y_{\le t})} g(y_{\le t+l}, x) \qquad (1)$$
where $x$ represents the current input tokens, $t$ is the generation step, $y$ is the generated output, $\log P(y_{\le t} \mid x)$ is the underlying generation score, $g(\cdot)$ is a faithfulness evaluation function, $\lambda$ is a hyperparameter that regulates the emphasis placed on future predictions, and $l$ is the number of tokens to look into the future. $L_l(y_{\le t})$ is a collection of length-$l$ continuations of $y_{\le t}$, and its size ranges from a single continuation in a greedy decoding approach, to $k$ potential continuations in beam-search decoding. In this work, we adapt this algorithm, shifting its focus towards precise matching with the highlighted content rather than being (only) faithful to the entire input.
Data Generation with Language Models Recently, many works proposed using LM-generated data to directly train smaller manageable models. These efforts address various challenges, including controllability (Sclar et al., 2022), model reasoning (Zelikman et al., 2022; Hsieh et al., 2023), and language understanding (Ye et al., 2022; Han et al., 2022). These works align with the Symbolic Knowledge Distillation scheme (West et al., 2022), where knowledge from the teacher model is transferred via a textual dataset used for student training. Here, we employ GPT-4 to generate improved silver training data for our models.

Highlights-Oriented RL Training
To adapt Quark (see §2) for the CTR task, we introduce a highlights-focused reward, based on the ROUGE metric (Lin, 2004). Previous RL-based text generation studies with ROUGE-based rewards calculated ROUGE scores relative to the gold reference summaries. In contrast, we propose calculating the ROUGE rewards for the generated output compared to the concatenated input highlights, to encourage their content preservation. Furthermore, to motivate the model to balance between the task's two requirements, namely, covering the entire highlights and avoiding the inclusion of excessive non-highlighted content, we suggest optimizing each of these objectives separately. Inspired by the dual-reward procedure (Pasunuru and Bansal, 2018), we propose alternating between two highlights-focused rewards: one that encourages coverage of highlights, for which we use ROUGE recall, and another that prioritizes adherence (faithfulness) to the highlights, for which we employ ROUGE precision. Indeed, we find that this alternating reward strategy works best over the development set (see Appendix A).
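As a rough illustration of the alternating reward, the sketch below computes an LCS-based (ROUGE-L-style) recall or precision of an output against the concatenated highlights. It is a simplification under stated assumptions: whitespace tokenization with lowercasing stands in for a full ROUGE implementation, and the even/odd alternation schedule is hypothetical, since the text above does not specify how the two rewards are interleaved.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def highlight_reward(output: str, highlights: List[str], step: int) -> float:
    """LCS-based reward of `output` against the concatenated highlights.

    Even steps reward coverage (recall); odd steps reward adherence
    (precision). This schedule is an assumption made for illustration.
    """
    out_tokens = output.lower().split()
    hl_tokens = " ".join(highlights).lower().split()
    if not out_tokens or not hl_tokens:
        return 0.0
    lcs = lcs_length(out_tokens, hl_tokens)
    if step % 2 == 0:
        return lcs / len(hl_tokens)  # recall: share of highlight tokens covered
    return lcs / len(out_tokens)     # precision: share of output tokens grounded


highlights = ["the drought might end soon", "experts discussed the drought"]
output = "experts said the drought might end soon"
recall_reward = highlight_reward(output, highlights, step=0)
precision_reward = highlight_reward(output, highlights, step=1)
```

Here the longest common subsequence has 5 tokens, so the recall-style reward is 5/9 and the precision-style reward is 5/7; an output that copies extra context would lower the latter without changing the former.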

Highlights-Sensitive Decoding
To bias the model to better preserve the highlights content at inference time, we follow the faithfulness-aware decoding strategy from Wan et al. (2023) (see §2). Here, we adapt this method to prioritize partially formed summaries that are likely to eventually (once completed) better match the pre-selected highlights. To that end, we substitute the score g(y ≤t+l , x) in Equation 1 with a new score, g(y ≤t+l , x h ), where x h represents the concatenation of the highlights. In essence, our strategy shifts from an input-focused to a highlights-focused scoring approach, while requiring both faithfulness and complete coverage. Within this framework, we evaluate two potential metrics for the computation of g(y ≤t+l , x h ): ROUGE-L F1 and METEOR scores. We found that ROUGE-L F1 consistently matched or exceeded METEOR's performance, and hence adopted it for our method. Please refer to Appendix B for further details.
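The following sketch shows how a highlights-focused lookahead score could rank candidate next tokens. It assumes the score is the prefix log-probability plus λ times the best g over the lookahead completions, with g taken as an LCS-based ROUGE-L F1 against the concatenated highlights. The `Candidate` structure, the whitespace tokens, and the toy log-probabilities are all hypothetical; a real decoder would obtain completions by running the model forward.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    token: str                    # next token under consideration
    log_prob: float               # log-probability of the prefix ending in this token
    completions: List[List[str]]  # lookahead completions of that prefix, as token lists


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: List[str], reference: List[str]) -> float:
    """Simplified ROUGE-L F1 (no stemming) between two token lists."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def pick_token(candidates: List[Candidate], highlight_tokens: List[str], lam: float) -> str:
    """Return the candidate token with the highest lookahead-adjusted score."""
    def score(c: Candidate) -> float:
        # Best highlight match among the lookahead completions of this candidate.
        lookahead = max((rouge_l_f1(y, highlight_tokens) for y in c.completions), default=0.0)
        return c.log_prob + lam * lookahead
    return max(candidates, key=score).token


x_h = "the drought might end soon".split()  # concatenated highlights (toy example)
candidates = [
    Candidate("drought", -1.2, ["the drought might end soon".split()]),
    Candidate("weather", -0.9, ["the weather was cloudy today".split()]),
]
with_lookahead = pick_token(candidates, x_h, lam=1.0)
without_lookahead = pick_token(candidates, x_h, lam=0.0)
```

With λ = 0 the more probable token "weather" wins, whereas with λ = 1 the lookahead term favors "drought", whose completion matches the highlights.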

Improving Silver Training Data with GPT-4
We wish to improve the quality of the CTR dataset, by leveraging the capabilities of GPT-4 (OpenAI, 2023). The existing CTR training dataset consists of summaries manually composed by experienced summarizers, while the highlights were automatically identified using a summary-source alignment model (see §2). As discussed, the performance of this alignment model leaves much room for improvement.
To capitalize on GPT-4's abilities in text generation, we employ it to generate more fitting summaries, based on the silver highlights. To that end, we supply GPT-4 with a modular prompt, inspired by the Chain-of-Thought approach (Wei et al., 2023). This custom prompt incorporates two exemplars that deconstruct the controlled reduction into three steps: (1) the listing of highlights, (2) the consolidation of highlights on a sentence-by-sentence basis, and (3) the production of the ultimate reduction. The detailed structure of this prompt can be found in Appendix C.
We will henceforth refer to models trained on the GPT-4-generated data as distilled, and to those trained on the original CTR data as non-distilled, following the notion of Symbolic Knowledge Distillation (West et al., 2022). As shown in §5.3, we observed an improved quality of the dataset, yielding better alignments between highlights and summaries.

Experimental Setup
Base Model Throughout our experiments, our primary base model is the instruction-finetuned Flan-T5 large model (Flan-T5 large; Chung et al., 2022), further fine-tuned on the highlights-focused CTR dataset. The selection of this particular model is motivated by emergent research that indicates instruction-finetuned models manifest superior performance in tasks necessitating constrained generation (Sanh et al., 2022; Wei et al., 2022; Zhou et al., 2023). We will refer to this model as Flan-T5 H, where H stands for "highlight-finetuned".
In addition to this variant of Flan-T5 H, we also show results of a large variant of the original pretrained CTR baseline model, which is the Longformer Encoder-Decoder large model (LED large; Beltagy et al., 2020), finetuned on the CTR dataset. We will refer to this model as LED H. Lastly, we also show the results of the few-shot GPT-4 (OpenAI, 2023), when guided with the same prompt used in our distillation process (see §3.3).

Highlights-Oriented RL Training
Our highlights-sensitive RL training involves finetuning the anchor Flan-T5 H model using Quark, combined with our highlights-driven reward policy (see §3.1). Specifically, the first Exploration step (see §2) utilizes the pretrained and fine-tuned Flan-T5 H to generate the initial pool of samples, which triggers the Quark unlearning process, performed on Flan-T5 H. Following Lu et al. (2022a), we set the number of quantiles to eight (see §2). We find, on the development set, that ROUGE-L is the most effective metric for our dual-reward function, yielding the greatest improvements relative to the ROUGE-1 and ROUGE-2 metrics (henceforth, ROUGE rewards refer to ROUGE-L).

Highlights-Sensitive Decoding
To score the degree of alignment between each future completion of a current generated prefix and the highlights, we utilize the F1 measure of ROUGE-L, relative to the concatenated highlights (see §2 and §3.2). We use the hyperparameters used in Wan et al. (2023). Additionally, following its tuning, we chose a beam size of 8. For a more comprehensive discussion of this tuning, see Appendix B.
Generating New Training Data with GPT-4 As part of our methodology to improve the generation of the silver training data, we evaluated the effect of varying the number of examples used in the few-shot prompt for GPT-4. This procedure involved the random sampling of fifty instances from the CTR development set, and their subsequent incorporation within a few-shot prompt. The quantity of in-context examples was systematically varied between one and five, the results of which are documented in Figure 3. These results show that the use of two examples delivered the most favorable results across all evaluative metrics, thereby leading us to select this configuration in our experiments. We also experimented with ChatGPT, and found it to be inferior to GPT-4, leading to the adoption of GPT-4 in this setting. For a more in-depth discussion of these findings and additional hyperparameter tuning, please refer to Appendix D.
Evaluation For evaluation, we adopt the evaluation approach outlined by Slobodkin et al. (2022), calculating several content-matching metrics between the generated texts and the concatenated highlights. The rationale behind evaluating against the highlights, rather than the gold summaries, is rooted in CTR's principal requirement: an exact matching to the highlighted content, not necessarily to the reference summary. While the highlights and reference summary are supposed to express the same content, some discrepancies may occur even in the gold test data. Moreover, since automatic metrics are not perfect, models that are less abstract than the human-generated gold summaries, or that exhibit different paraphrastic abstractions, would unjustly be penalized when evaluated against the gold summaries. In Appendix E, we conduct a qualitative analysis that further shows the advantage of directly comparing outputs with the highlights, rather than with the gold summaries.
Table 1: […] and BertScore results on the CTR test set, compared to the concatenated highlights, as well as coherency results. In addition to the baseline LED H and Flan-T5 H, we also evaluate the combination of Flan-T5 H with the highlights-sensitive decoding strategy ("+ Con. Decoding"), with the highlights-focused RL strategy ("+ RL"), and with their combination. We also evaluate all those variants when trained with the GPT-4-generated train set ("Distilled") as well as GPT-4 itself, in a few-shot setting with two exemplars. For each metric, the best non-distilled and distilled Flan-T5 models are in bold, with the best overall model marked with an asterisk.
In addition to ROUGE (Lin, 2004), used in Slobodkin et al. (2022), we also measure METEOR scores (Denkowski and Lavie, 2014), informed by recent research underscoring the correlation of this metric with human judgment concerning relevance and consistency (Fabbri et al., 2021). We also report results on the BertScore metric, which is less lexical and more semantic in nature, though not necessarily more reliable. Experimenting also with NLI-based metrics, we observed inadequate performance when they were applied over sub-sentence spans (our highlights), and hence leave it for future research to develop NLI-based metrics that are robust to such settings.
Finally, we also follow Slobodkin et al. (2022)'s coherency analysis, by hiring crowd-workers to assess the coherency of 50 random samples for each model, with a 5-point Likert scale (for details see Appendix F). Notably, our coherency analysis tested both coherency and fluency, following the settings in the original CTR paper. This approach follows common practice, where fluency and coherence are best evaluated manually.

Non-Distilled Models
Table 1 presents performance results for models trained over the original CTR silver training data (Slobodkin et al., 2022). Primarily, Flan-T5 H exhibits superior performance compared to LED H. This trend is consistently apparent in all subsequent variants; hence, we confine our reporting of subsequent models to Flan-T5 H's variants. We observe that further finetuning Flan-T5 H via our highlights-oriented RL protocol ("+ RL" in Table 1) yields substantial improvements. Additionally, augmenting Flan-T5 H with the highlights-aware decoding strategy ("+ Con. Decoding" in Table 1) leads to even larger performance improvements across ROUGE and BertScore metrics and a modest improvement in the METEOR metric. We find this sensible, given the aggressive nature of the controlled decoding approach in amplifying the highlight signal when actually generating the output at inference time. Interestingly, the incorporation of both strategies simultaneously results in a slight drop in performance based on the ROUGE and BertScore metrics compared to the controlled decoding variant, yet it yields the best results on the METEOR metric. Critically, all our model variants maintain a human fluency score above 4.3 (out of 5), demonstrating that the fluency requirement is not compromised. This is not surprising, given the impressive ability of current language models to generate fluent texts.

Distilled Models
From Table 1 we observe a noteworthy finding: when trained on the GPT-4-generated data, Flan-T5 H surpasses all the non-distilled alternatives, which were trained on the original CTR data. Notably, it also surpasses the GPT-4 model, where the latter is prompted in the few-shot setting, just as it was when generating the training data for Flan-T5 H. GPT-4 is not open-sourced and cannot be fine-tuned on our data, which puts it at a distinct disadvantage compared to the other models and obstructs its broader application.
We also note that Flan-T5 H appears to further benefit from both the auxiliary highlights-focused RL finetuning and the incorporation of the highlights-attentive decoding strategy. We find that the decoding strategy outperforms the RL approach based on ROUGE and BertScore metrics, similar to the findings for the non-distilled models, while the RL approach appears to be more effective on the METEOR metric. We also note that the combination of both strategies does not result in any additional advantages. Ultimately, just as in the non-distilled setting, we find that the coherency of the generated outputs remains uncompromised. In total, the combination of the highlights-focused strategies with the improved GPT-4-generated data leads to significant improvements, outperforming the baseline model by over 30 ROUGE-L points, and resulting in a reliable and effective CTR model for future downstream use.

Distillation Data Quality Assessment
To shed light on the distillation success, we analyze the quality of the generated dataset. To that end, we first sample 10 GPT-4-generated instances and manually identify their corresponding highlights, which we treat as "gold" highlights. Then, we apply Slobodkin et al. (2022)'s automated silver annotation methodology (see §2) on the GPT-4-generated summaries, leading to a new set of highlights for each pair of input text and GPT-4-generated summary. These highlights, combined with their corresponding inputs and summaries, represent the original CTR dataset. Alternatively, the original highlights presented to GPT-4, combined with the input texts and the summaries it generated, represent the new dataset. Subsequently, we proceed to calculate the ROUGE-L F1 score between the manually annotated highlights and each of the automatically-generated variants. Our analyses reveal that the GPT-4-generated instances demonstrate a significantly greater alignment with the manually-annotated instances than those from the original dataset, achieving a ROUGE-L score of 79%, as opposed to 67.9%.
Table 2: […] and coverage (R) scores for LED H, Flan-T5 H, and three variants of the distilled Flan-T5 H: regular, RL-finetuned ("+ RL") and with the controlled decoding approach ("+ Con. Decoding"). Bold marks the highest scores.

Manual Performance Analysis
To further evaluate the effectiveness of various components within our approach, we adopt the manual analysis methodology proposed in the CTR paper (Slobodkin et al., 2022). Consistent with the original authors' procedure, we select 10 random samples from the test set. Subsequently, we compute precision and recall scores for highlighted information units for a selection of five models utilized in this study: LED H, Flan-T5 H, the distilled Flan-T5 H, and the two variants of the distilled Flan-T5 H equipped with RL training and controlled decoding, respectively. Our calculations cover 195 highlighted input units and approximately 180 system summary units for each model. The results of this analysis are illustrated in Table 2.
We observe that Flan-T5 H exhibits significantly increased faithfulness to the highlights compared to LED H, with a modest improvement in highlight coverage. Conversely, training on the GPT-4-generated data markedly augments adherence to the highlights and also yields significantly higher highlight coverage, thus demonstrating the approach's inherent value.
Our findings also suggest that the application of RL finetuning primarily improves highlight coverage. The seemingly modest contribution to faithfulness could potentially be attributed to the RL mechanism's attempt to achieve a balance between reward optimization and coherence preservation (due to the KL-divergence term), a requirement that might necessitate incorporating additional information from the surrounding context. In contrast, the deployment of highlight-centric controlled decoding displays a more pronounced impact on adherence to highlights. This phenomenon could potentially be attributed to the strategy's more rigorous enforcement of highlight constraints, which limits deviation from the highlights. The modest advancement in coverage might be attributed to the lookahead mechanism of controlled decoding. At every generation step, the algorithm favors tokens whose 'greedy' completion leads to better highlight adherence, rather than exploring multiple potential trajectories for each candidate in a k-beam style. In doing so, this method may inadvertently neglect additional, better-suited candidates, which a beam search could capture and thereby enhance coverage.

Discussion and Future Work
Our empirical assessments reveal the merits of each of the proposed methods. Specifically, the utility of the distillation process emerges as complementary to both reinforcement learning (RL) training and controlled decoding. The latter methods still have a role to play both in enforcing the content preservation constraints beyond the sheer training data, and in overcoming a certain level of noise that persists also in the GPT-4-generated data (as shown in §5.3).
Conversely, the combination of controlled decoding and the highlight-centric RL approach did not yield improvements over the better-performing method (controlled decoding, in our experiments). Given that these strategies impact different stages of the generation process, namely training and inference, they may possess the potential for synergy. For example, the decoding strategy may be integrated into the RL training's sampling phase, possibly increasing the model's awareness of the decoding constraints during training. Our performance analysis, detailed in §5.4, further supports this hypothesis by demonstrating that each technique amplifies different aspects of the highlights: one notably improves their coverage while the other augments adherence, suggesting the exploration of this potential synergy in future research.
Furthermore, the existing implementation of highlight-centric controlled decoding is computationally demanding, given its requirement to generate a full completion for every candidate token at each generation step. Wan et al. (2023) motivated the need to generate entire summaries by noting that current faithfulness metrics expect full summaries as input. Yet, as our method does not employ faithfulness metrics, it would be insightful to explore how to leverage partial summaries, rather than complete ones, and thus to significantly reduce the computational overhead.
Finally, given the high-quality highlight adherence and coverage exhibited by our best models, investigating their incorporation into modular summarization pipelines emerges as a promising research direction. Additionally, these models could prove beneficial in a human-in-the-loop framework, allowing users (e.g., students) to pre-select desirable textual content aligned with their personal needs, which could subsequently be consolidated by our models into a concise summary. We suggest exploring these directions in future research.

Conclusion
In this study, we addressed the lack of a high-quality Controlled Text Reduction (CTR) model by focusing on two pivotal aspects: the amplification of the highlight signals and the mitigation of noise within the training data. We started by proposing two distinct strategies aiming at augmenting the highlights signal. The first strategy emphasized this signal during training, where we combined an RL approach with a custom highlights-oriented reward. The second strategy was introduced during inference, where we employed a controlled decoding mechanism that prioritizes generation paths ensuring higher adherence to the highlights. Furthermore, we addressed the intrinsic noise in the CTR dataset by generating new instances using GPT-4, significantly enhancing the dataset quality.
Empirical evidence shows the effectiveness of our proposed methodology. Each of our highlight-centric strategies individually led to significant improvements over the baseline model in terms of highlight-matching capabilities. Additionally, training on the GPT-4-generated data yielded further improvements, outperforming each of the non-distilled variants, trained on the original CTR dataset. In total, our highest-performing models achieved state-of-the-art results, outperforming the baseline by more than 30 ROUGE-L points, while also preserving comparable levels of coherency. Future work would focus on improving the combination and efficiency of the different components, as well as on the incorporation of our best-performing models in modular summarization pipelines.
Although the risk involved in our work is minimal, like other advanced language generation models available today, we cannot ensure that our model will consistently produce accurate information, despite our attempts to utilize clean highlighted data and controlled decoding. Hence, it is crucial to exercise caution when employing the model in real-world scenarios and to thoroughly test it before deploying it.
In addition, the controlled decoding method we use is limited by its slow speed, which may hinder its practicality for production systems. Moreover, the rigid constraints imposed during decoding can restrict the model's linguistic flexibility and creative potential, potentially limiting its suitability for generating varied or innovative outputs. These considerations highlight the need to carefully evaluate the trade-offs before implementing controlled decoding in production environments, and they should be further studied in future research.
completion's adherence to the highlights. Table 4 shows the results on the CTR development set. We note that the ROUGE-L score consistently offers superior or at least equivalent performance in comparison to the METEOR score. Therefore, we adopt the ROUGE-L score as our primary evaluation metric in all subsequent experiments. Furthermore, we experiment with beam sizes of 2, 4, 6, and 8. The outcomes on the development set are illustrated in Figure 4. Corroborating the findings of Wan et al. (2023), we observe only a marginal impact of increasing the beam size on regular decoding. However, its influence on the highlight-focused controlled decoding grows with the beam size across all metrics. In light of these findings, we opt for a beam size of 8 for the remainder of our experiments.

C GPT-4 Prompt
To prompt GPT-4, we use a prompt that consists of basic instructions and two exemplars. The in-context examples follow a modular generation process, which separates the task into three steps: (1) highlights extraction and enumeration, (2) sentence-by-sentence highlights consolidation, and (3) generation of the final reduction (see Figure 7). Additionally, we already perform the highlights extraction and enumeration step for the current example, leaving only the consolidation and generation of the final reduction to the model.

D Prompt-Tuning
To improve the quality of the CTR trainset, we test both GPT-4 and GPT-3.5 (ChatGPT), each with modular prompting and with regular prompting, where each exemplar's answer consists solely of the final reduction. From Table 5, which shows the ROUGE and METEOR scores on 50 instances of the CTR development set, we observe that the modular approach substantially improves both models' performance and that GPT-4 is superior to ChatGPT in handling the task. Additionally, from a manual analysis of GPT-4's generations, we find that it occasionally misses a highlighted span in the highlights-listing step, leading to its absence from the final summary. Consequently, we also experiment with a prompt that already incorporates the highlights-listing step, leaving only their consolidation and the generation of the final reduction to the model. This version indeed improves the model's performance (see Table 6), making it the final version we use.
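The prompt variants compared in this appendix can be sketched in Python as follows. This is only an illustrative sketch: the section labels ("Document:", "Highlights:", "Consolidation:", "Reduction:"), the `Exemplar` class, and the `build_prompt` function are hypothetical names introduced here, and the actual wording of the prompt is the one shown in Figure 7, not the one below.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exemplar:
    """One in-context example (all field names are hypothetical)."""
    document: str             # source text with the highlights marked
    highlights: List[str]     # step 1: extracted and enumerated highlights
    consolidation: List[str]  # step 2: sentence-by-sentence consolidation
    reduction: str            # step 3: final reduction


def enumerate_highlights(highlights: List[str]) -> str:
    """Render highlights as a numbered list, one per line."""
    return "\n".join(f"{i}. {h}" for i, h in enumerate(highlights, start=1))


def format_exemplar(ex: Exemplar, modular: bool) -> str:
    """Regular prompting shows only the reduction; modular shows all steps."""
    parts = [f"Document:\n{ex.document}"]
    if modular:
        parts.append("Highlights:\n" + enumerate_highlights(ex.highlights))
        parts.append("Consolidation:\n" + "\n".join(ex.consolidation))
    parts.append(f"Reduction:\n{ex.reduction}")
    return "\n\n".join(parts)


def build_prompt(
    instructions: str,
    exemplars: List[Exemplar],
    document: str,
    modular: bool = True,
    listed_highlights: Optional[List[str]] = None,
) -> str:
    """Assemble instructions, exemplars, and the current instance.

    If `listed_highlights` is given, the highlights-listing step is already
    filled in, so the model only consolidates and writes the reduction.
    """
    blocks = [instructions]
    blocks += [format_exemplar(ex, modular) for ex in exemplars]
    query = f"Document:\n{document}"
    if not modular:
        query += "\n\nReduction:"
    elif listed_highlights is None:
        query += "\n\nHighlights:"
    else:
        query += "\n\nHighlights:\n" + enumerate_highlights(listed_highlights)
        query += "\n\nConsolidation:"
    blocks.append(query)
    return "\n\n".join(blocks)
```

Passing `listed_highlights` corresponds to the "with list" setting of Table 6, while omitting it leaves the listing step to the model.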

E Qualitative Analysis
Figure 6 exhibits two instances from the test set, consisting of input texts along with their corresponding highlights. Accompanying each instance are the output generated by the non-distilled Flan-T5 H , the output generated by the distilled Flan-T5 H , and the original gold summary.
In the first example, the non-distilled model achieved a ROUGE-L score of 58.6 when compared to the gold summary, while its distilled counterpart received a score of 46.5. Despite the apparent superiority of the non-distilled model, we note that while it did cover all the highlighted spans, it also added several non-highlighted segments ("an Atlanta-based cable network", "the next business of the scientist", "the network's" and "look for" in the context of looking for cause and effect). In contrast, the distilled model, besides capturing all highlighted content, only added "look for". Consequently, the distilled model delivered a more fitting output, contradicting its supposed deficiency in terms of the ROUGE-L metric relative to the gold summary. In contrast, the scores against the concatenated highlights, 68.8 for the non-distilled model and 78.9 for the distilled model, are more reflective of the actual performance.
In the second example, the distilled model scores 83.3 against the concatenated highlights, as opposed to a mere 43.2 against the gold summary. Interestingly, although its output covers nearly all the highlighted spans (except for "with a chance of showers for...Finland" and "in Helsinki" in the context of where the eclipse would end) and introduces very little non-highlighted information ("thousands of", "3,000" and "the national airline"), it still receives a lower ROUGE-L score when compared to the gold summary. This discrepancy can be attributed to two factors. Firstly, the gold summary and the generated output differ in how they abstract the content, which can reduce the scores. Secondly, the gold summary introduces information that is absent from the original input text, and therefore from the highlights ("on July 20"). Both these elements can explain the divergence between the model's performance as per the ROUGE-L metric and its actual competency. This discrepancy is resolved when directly comparing the generated summary to the highlights, which offers a more accurate reflection of its quality.
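The two comparisons discussed above (output versus gold summary, and output versus concatenated highlights) can be illustrated with a minimal ROUGE-L sketch in Python. It is a simplification made for illustration: it tokenizes on whitespace, lowercases, applies no stemming, and returns an F1 score in [0, 1] rather than on the 0-100 scale reported here, so it will not reproduce the exact scores of the official ROUGE implementation. The function names are introduced here for the example only.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1: LCS-based precision and recall over tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_against_highlights(output: str, highlights: List[str]) -> float:
    """Score an output against the highlights joined in document order."""
    return rouge_l_f1(output, " ".join(highlights))
```

Calling `rouge_l_f1(output, gold_summary)` and `score_against_highlights(output, highlights)` on the same output gives the two views contrasted in this appendix; only the second is unaffected by gold-summary content that lies outside the highlights.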

F Fluency Human Annotation Protocol
We ask crowd-workers to rate the fluency of the texts generated by the baseline supervised model, the PPO model and the dual-reward Quark model. Our group of crowd-workers consists of reliable workers who have shown a good understanding of different semantic tasks, including summarization, in previous experiments. To evaluate, we randomly select 100 documents from our test set and evaluate the corresponding texts generated by the three aforementioned models (300 samples in total). We design a simple Amazon Mechanical Turk interface, which presents one of the 300 samples at a time (see Figure 5). Following Slobodkin et al. (2022), we use a 5-point Likert scale to evaluate the fluency of the generated summaries. Additionally, we add criteria explaining each score, to reduce ambiguity and ensure consistent ratings (see Figure 5). Estimating an average response time of 30 seconds, we priced each response at 10¢.

G Results of All Variants with LED H as the Backbone Model
Table 7 shows performance results for all the variants explored in this work, with LED H as the backbone model.

Table 7: ROUGE, METEOR and BertScore results on the CTR testset, compared to the concatenated highlights, of all the different variants tested in this work, with LED H as the backbone model. In addition to the baseline LED H , we also evaluate the combination of LED H with the highlights-sensitive decoding strategy ("+ Con. Decoding"), with the highlights-focused RL strategy ("+ RL"), and with their combination. We also evaluate all those variants when trained with the GPT-4-distilled trainset ("Distilled"). For each metric, the best non-distilled and distilled LED H models are in bold, with the best overall model marked with an asterisk.

H Performance Analysis Settings
To further examine the efficiency of our different components, we follow Slobodkin et al. (2022) and manually assess several of our models on two levels: (1) precision to the highlighted content and (2) recall of the highlighted spans. To do so, we compare the spans of each system summary to the source highlights. Specifically, we randomly select 10 samples from our test set, with their corresponding system summaries (one for each of the models, 50 in total). Then, following the notion of Summary Content Unit (SCU) in the Pyramid method for summarization evaluation (Nenkova and Passonneau, 2004), we extract such units from both the summary and the source highlighted spans using the Summary Evaluation Environment (SEE) interface, described in their paper. Then, to calculate the precision, for each summary unit, we manually search for a matched highlighted unit conveying the same information, to determine whether the summary unit is mentioned in the highlights (TP) or not (FP), and then calculate the (micro-)precision to the highlighted content.
For the recall calculations, we also count the False Negatives (FN), namely highlighted information units that were absent from the system summaries. Then, combined with the TP count, we calculate the (micro-)recall of the highlighted content.
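The micro-averaged precision and recall described in this appendix can be sketched as follows. The sketch assumes that the manual annotation yields, for each system summary, the TP, FP and FN unit counts; the `UnitCounts` class and the function name are hypothetical and introduced here for illustration only. Micro-averaging pools the counts over all summaries before dividing, rather than averaging per-summary ratios.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class UnitCounts:
    """Manually annotated unit counts for one system summary."""
    tp: int  # summary units matched to a highlighted unit
    fp: int  # summary units with no matching highlighted unit
    fn: int  # highlighted units absent from the summary


def micro_precision_recall(samples: Iterable[UnitCounts]) -> Tuple[float, float]:
    """Pool counts over all summaries, then compute precision and recall."""
    samples = list(samples)
    tp = sum(s.tp for s in samples)
    fp = sum(s.fp for s in samples)
    fn = sum(s.fn for s in samples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For one model, `samples` would hold the counts of its 10 annotated summaries, and the function returns the pair (precision, recall).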

Figure 2: Demonstration of the Controlled Text Reduction task. The input consists of a source document and highlights (left), and the desirable output covers exclusively the highlighted content while preserving coherence (right). Borrowed and adapted from Slobodkin et al. (2022).

Figure 3: ROUGE, METEOR and BertScore results on 50 instances from the CTR development set of GPT-4 models, for varying number of in-context examples in the prompt.

Figure 4: The scores of fine-tuned Flan with controlled decoding compared to regular decoding at various beam sizes. We present a comparison of the generated summary to the highlights concatenation and to the gold summary.

Figure 5: Example of the data collection interface used by the crowd-workers to evaluate the fluency of summaries.

Figure 7: Example prompt provided to GPT-4. The prompt consists of basic instructions, two in-context examples, and the instance input. The examples demonstrate a modular pipeline, where we first extract the highlights, then consolidate them sentence-by-sentence, and lastly generate the final reduction.

Table 6: ROUGE, METEOR and BertScore results, compared to the concatenated highlights, on 50 instances from the CTR development set of GPT-4 models, once when the prompt does not contain highlights enumeration of the current instance ("w\o list h ") and once when it does contain it ("with list h ").