Generating Summaries with Controllable Readability Levels

Readability refers to how easily a reader can understand a written text. Several factors affect the readability level, such as the complexity of the text, its subject matter, and the reader's background knowledge. Generating summaries based on different readability levels is critical for enabling knowledge consumption by diverse audiences. However, current text generation approaches lack refined control, resulting in texts that are not customized to readers' proficiency levels. In this work, we bridge this gap and study techniques to generate summaries at specified readability levels. Unlike previous methods that focus on a specific readability level (e.g., lay summarization), we generate summaries with fine-grained control over their readability. We develop three text generation techniques for controlling readability: (1) instruction-based readability control, (2) reinforcement learning to minimize the gap between requested and observed readability and (3) a decoding approach that uses lookahead to estimate the readability of upcoming decoding steps. We show that our generation methods significantly improve readability control on news summarization (CNN/DM dataset), as measured by various readability metrics and human judgement, establishing strong baselines for controllable readability in summarization.


Introduction
Summaries convey salient pieces of information, and their understanding depends on the reader's world and domain knowledge. The readability of a text plays a crucial role in how easily it can be understood and consumed for learning and education. Higher readability lowers reading effort and increases reading speed for any reader, and it is particularly beneficial for those who lack high comprehension (DuBay, 2004). On the other hand, lower readability favors specificity, clarity and accuracy (August et al., 2023). Therefore, the readability of a summary is important to ensure that the information is comprehensible to a wider audience, accommodating varying levels of knowledge and understanding (Pitler and Nenkova, 2008).

Figure 1: Summaries generated with different readability levels using our lookahead method (Sec. 3.3). We requested summaries with Flesch Reading Ease (FRE) readabilities of 90, 70, and 30 (corresponding to the levels of an 11-year-old, middle school, and college, respectively). The readability scores of the generated summaries (87.1, 70.3, and 30.9) are close to the requested targets.
Significant progress has been made in abstractive summarization using large language models (LLMs) (Raffel et al., 2020; Zhang et al., 2020a; Goyal et al., 2022a). This approach involves making various generation decisions, such as determining which content to paraphrase and how specific a summary should be. The goal is to generate high-quality summaries that are cohesive, readable, and factually consistent. Nevertheless, current methods provide limited mechanisms to specify stylistic preferences such as readability (Goyal et al., 2022b). While readability assessment, which measures the level of difficulty to comprehend a text, is a well-established field within NLP (Feng et al., 2010; Vajjala, 2022), the control of readability in natural language generation (NLG) tasks, such as summarization, has not been extensively explored, and current readability control performance is low (Luo et al., 2022; Pu and Demberg, 2023).
While previous work (Goldsack et al., 2022; Guo et al., 2021; Luo et al., 2022) focused on binary readability control, such as expert versus lay summaries, we instead focus on generating summaries at various fine-grained reading grade levels (Todirascu et al., 2016; Martinc et al., 2021). Figure 1 presents an example of three summaries with diverse educational levels of readability generated with our lookahead method (Sec. 3.3), given the same document. While the easier-to-understand summary, with the highest Flesch Reading Ease score (Kincaid et al., 1975), uses simpler words, shorter sentences, and less specialized knowledge, the summary with the lowest score requires the reader to understand more complex sentence structures and words (e.g., "saturated", "cholesterol", "cognitive") and contains more specific details (e.g., "University of Illinois researchers"), assuming the reader's familiarity with the necessary domain and world knowledge.
We study three methods to control the readability of generated summaries: First, we present an instruction-prompting method that prompts the model to output summaries with particular target readability scores or categories, enabling fine-grained control. Next, we develop an approach based on reinforcement learning (RL), using Proximal Policy Optimization (PPO; Schulman et al. (2017)) with a novel Gaussian-based reward that strongly penalizes significant variations in readability, optimizing for summaries with the desired readability. Finally, inspired by Wan et al. (2023), we propose a readability-aware lookahead decoding method that selects tokens at each decoding step based on the readability of expected future token generations.
Our contributions in this paper are: (1) We propose three readability-controllable methods for text generation, using instruction-prompting, RL, and lookahead decoding, and (2) show that readability can be explicitly controlled for generating abstractive summaries with finely adjustable levels. Finally, (3) we explore the relation between summary readability and aspects such as specificity, abstractiveness, factuality and informativeness (e.g., more readable summaries tend to have lower specificity and informativeness). Our results show considerable improvements over GPT3.5, a state-of-the-art approach for controlling the readability level (Pu and Demberg, 2023).
2 Task Definition: Summaries with Distinct Readability Levels

Task Statement
Readability refers to the ease of understanding a text, impacting the reader's comprehension and engagement. We aim to generate summaries with specified levels of readability, given an input document. Let $x = \langle x_1, \ldots, x_n \rangle$ denote the input document represented by a sequence of $n$ tokens, and $y = \langle y_1, \ldots, y_m \rangle$ denote the summary token sequence of length $m$, where $m \ll n$. Let $r$ denote the desired summary readability level, which can be represented by a score (Sec. 2.2) or a category name (e.g., "college level", Sec. 3.1). The following formulation represents this task as an auto-regressive problem:

$$p(y \mid x, r) = \prod_{i=1}^{m} p(y_i \mid y_{1:i-1}, x, r)$$

This task presents challenges due to multiple factors: While the method must determine salient information from the input document $x$ and compress it into a concise summary $y$, it must also be able to understand different readability levels and adapt the summarization output to match the target level $r$. The approach must strike a balance between providing a succinct, informative summary and ensuring it aligns with the reader's literacy skills and background knowledge (Collins-Thompson, 2014; August et al., 2023).

Readability Metrics
We now discuss metrics to assess text readability. Generally, readability is affected by lexical and syntactic sophistication, sentence structure, discourse cohesion, background knowledge and use of technical terms (McNamara et al., 2010; Crossley et al., 2023). Readability metrics typically focus on syntactic complexity, measuring the presence of qualified syntactic phrases, and vocabulary complexity, aligning words from the text with a particular age-related level (Beers and Nagy, 2009; Crossley et al., 2017). We explore multiple established metrics to control and evaluate the readability of the summaries. Specifically, we employ Flesch Reading Ease (FRE, Kincaid et al., 1975), the Gunning fog index (GFI, Gunning, 1952) and the Coleman-Liau index (CLI, Coleman and Liau, 1975), among others.² These metrics calculate an approximation of the (US) grade level of education expected to understand a written text. The FRE and GFI metrics are determined by sentence lengths and the number of (complex) words and syllables. Unlike syllable-based readability indices, CLI does not require analyzing the syllable content of words; only their length in characters measures readability. Higher FRE scores denote higher readability; higher GFI and CLI scores denote lower readability. The metrics' formulas are described in Appendix A. Finally, readability formulas may fail to consider significant factors such as cohesiveness and macro-level organization, which affect the overall readability and understanding of a text (Tanprasert and Kauchak, 2021). To account for this, we measure other aspects such as coherence and informativeness.
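As a concrete illustration, the FRE formula can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in this work; in particular, the vowel-group syllable counter is a rough approximation (established tools rely on pronunciation dictionaries and more careful rules):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels.
    # Real readability tools use dictionaries or better rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

A short, simple sentence such as "The cat sat on the mat." scores well above 100 under this sketch, while dense academic prose with long words can score below 0.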

Instruction-Aligning Readability Methods
Inspired by previous works (He et al., 2022; Zhang and Song, 2022) that explore prompt guidance to generate text with desired attributes, we develop instructions that encode the summary readability level. During training, the instructions are prepended to the source documents to create the input for their corresponding summaries. In contrast to recent studies that generate summaries with only two levels (expert and plain language) (Goldsack et al., 2022; Luo et al., 2022), we control the readability using fine-grained instructions based on the desired readability, as shown in Figure 2a.

Table 1: Reading level categories with their FRE score ranges and instructions.

FRE ≥ 80        Summarize this for a 11-year-old student:
60 ≤ FRE < 80   Summarize this for a middle school student:
40 ≤ FRE < 60   Summarize this for a high school student:
FRE < 40        Summarize this for a college student:

² Appendix A presents additional readability metrics.
Category-based Instructions. Drawing on established guidelines for text complexity levels (Fountas and Pinnell, 1999; DuBay, 2004), we define four instructions based on distinct reading level categories (see Table 1) aligned with particular FRE scores (Vajjala, 2022). For example, we instruct the model to summarize the input document "for a high-school student". We perform instruction-based fine-tuning, selecting the one instruction per training sample that matches the given reading level of the reference summary; that way, the model can learn to associate the observed reference summaries with the categories in the instruction prompts. At inference time, we can select any of the four instructions to request a summary in the style of the specified reading level category. We call this method CATEGORYINSTRUCT.
Score-based Instructions. We define a second instruction-based method, which we call SCOREINSTRUCT. Here, we instruct the model to summarize with a particular score r, rather than a category. For example, we instruct the model to summarize the input document "with a readability level of 62".
In supervised fine-tuning, we use as r in the instruction the exact score of each reference summary; this way, the model can learn to associate the observed summaries with the exact reading levels specified in the instructions. At inference time, we can request any score r. Compared to CATEGORYINSTRUCT, this method adds flexibility by avoiding the hard boundaries imposed by readability categories (Table 1); a training sample whose reference summary falls between two reading level categories (e.g., a FRE score of 60 is at the boundary between high school and middle school level) does not need to be forced into one or the other category.
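The two instruction schemes can be sketched as follows. This is a hypothetical illustration of how training and inference inputs could be assembled; the function names and the exact document formatting are our own, while the instruction strings follow Table 1 and the examples above:

```python
def category_instruction(fre: float) -> str:
    # Map a target FRE score to one of the four reading-level
    # instructions from Table 1 (CategoryInstruct).
    if fre >= 80:
        audience = "a 11-year-old student"
    elif fre >= 60:
        audience = "a middle school student"
    elif fre >= 40:
        audience = "a high school student"
    else:
        audience = "a college student"
    return f"Summarize this for {audience}:"

def score_instruction(fre: float) -> str:
    # ScoreInstruct: request an exact readability score instead of a category.
    return f"Summarize this with a readability level of {round(fre)}:"

def build_input(document: str, instruction: str) -> str:
    # The instruction is prepended to the source document (the exact
    # concatenation format is an assumption for illustration).
    return f"{instruction}\n{document}"
```

At inference time, any category or score can be requested, e.g. `build_input(doc, score_instruction(62))`.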

Reinforcement Learning for Readability Control
During supervised instruction fine-tuning, using CATEGORYINSTRUCT or SCOREINSTRUCT, we request certain readability levels in the prompts (i.e., readability categories or scores), and the reference summaries act as demonstrations of summaries at that readability level for the model to learn from. However, that supervised learning phase does not explicitly check whether the model indeed tends to generate summaries at the requested reading levels. It merely applies token-wise gradient updates using teacher forcing based on the reference summaries. Ouyang et al. (2022) have shown that it can be helpful to supplement such an initial instruction fine-tuning approach with a subsequent reinforcement learning phase, in which the model is further updated in response to sequence-level rewards from a reward model. In contrast to Ouyang et al. (2022), who learned a reward model based on human preferences, we define a reward function based on a readability metric. Intuitively, we want to reward the model maximally whenever it generates a summary that has the requested readability level r and decrease the reward steeply for generated summaries whose readability deviates from r.
Reward. We design a reward $R(r_y, r)$ that assigns a maximum reward of 1.0 if the observed readability $r_y$ of a generated summary $y$ is equal to the desired readability $r$, and decreases exponentially as $r_y$ deviates from $r$. As illustrated in Figure 2b, we formulate it as a normalized Gaussian centered at $r$:

$$R(r_y, r) = \exp\left(-\frac{(r_y - r)^2}{2\sigma^2}\right)$$

The use of a Gaussian function ensures that the reward decreases in a nonlinear fashion: Small readability deviations from the requested readability $r$ result in small reward reductions; larger readability deviations result in disproportionally larger reward reductions. This is analogous to the nonlinear penalties for deviations from a Gaussian prior in L2 weight regularization.
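A minimal sketch of this Gaussian reward, using σ = 10 (the value chosen in the experimental setup, i.e., half a 20-point FRE reading-level band):

```python
import math

def readability_reward(r_observed: float, r_target: float,
                       sigma: float = 10.0) -> float:
    # Normalized Gaussian centered at the requested readability:
    # reward is 1.0 when observed == target and decays exponentially
    # with the squared deviation.
    return math.exp(-((r_observed - r_target) ** 2) / (2 * sigma ** 2))
```

A summary 10 FRE points off target receives exp(-0.5) ≈ 0.61, while one 20 points off receives only exp(-2) ≈ 0.14, reflecting the disproportionate penalty for large deviations.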
PPO. We apply RL using the described reward, aiming for the model to learn improved readability control. At the same time, we wish to preserve the high salience and coherence already achieved by the summarization models tuned with supervised fine-tuning. To accomplish this, we initialize the RL policy with SCOREINSTRUCT that has been trained on supervised data. We then employ the popular policy gradient method PPO (Schulman et al., 2017), which has been successfully applied to text generation problems (Liu et al., 2022a) and has been shown to be sample efficient and stable (Wu et al., 2021). To ensure stability, PPO employs a clipping mechanism during policy updates, preventing drastic changes and avoiding the problem of diverging policy updates. We call the resulting PPO-tuned model SCOREINSTRUCT+RL.

Lookahead Readability Decoding
As a direct consequence of RL training, the model explicitly learns to generate tokens with the goal of maximizing the readability reward. However, RL requires initialization with a high-quality supervised model, which might not always be available. Consequently, we explore a decoding approach to dynamically adapt the readability level during inference, as shown in Figure 2c.
Previous work (Lu et al., 2022; Wan et al., 2023) develops lookahead approaches for controlling generation through signals such as logic-based lexical constraints or faithfulness. We extend this concept to enhance readability control in abstractive summarization at inference time. We develop a lookahead strategy that directs the model's generation process, increasing the likelihood that the chosen tokens will lead to a path with the intended level of readability in the search space. Formally, each summary token $y_i$ is selected by:

$$y_i = \arg\max_{y_i} \left[ \log p(y_i \mid y_{1:i-1}, x, r) + w \cdot \max_{\hat{y} \in L_y} h(\hat{y}) \right]$$

where $h(\cdot)$ is a function that assigns a readability score to the summary and $w$ controls the weight of the readability in the generation process. $L_y$ is a set of possible future summaries that start with the tokens $y_{1:i-1}$ and contain $n$ additional continuation tokens likely to be encountered in the future.
Our readability evaluation function $h$ is defined as:

$$h(y) = -\lvert r - r_y \rvert$$

where $r_y$ is the observed readability of the (possibly incomplete) summary $y$. $h(\cdot)$ can be defined with additional metrics, which can be combined into desired scores (see Sec. 5.4). Finally, note that this method is computationally costly since it needs to generate future summaries at each generation step.
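A simplified sketch of the per-token selection step. The rollout generation and the readability metric are abstracted behind function arguments, and the instantiation of h as the negative absolute deviation from the target is one plausible choice; the candidate/rollout interface here is hypothetical, not the paper's implementation:

```python
def lookahead_score(logprob, continuation_text, target_fre, weight,
                    readability_fn):
    # Combined decoding score for one candidate token: its log-probability
    # plus a weighted readability term h over the projected continuation.
    # h rewards continuations whose readability is close to the target.
    h = -abs(target_fre - readability_fn(continuation_text))
    return logprob + weight * h

def select_token(candidates, target_fre, weight, readability_fn):
    # candidates: list of (token, logprob, projected_continuation) triples,
    # where projected_continuation is the n-token rollout that would follow
    # if this token were chosen (hypothetical interface).
    return max(
        candidates,
        key=lambda c: lookahead_score(c[1], c[2], target_fre, weight,
                                      readability_fn),
    )[0]
```

With a toy readability function, a candidate whose rollout matches the requested level wins the selection even when log-probabilities are tied, which is exactly the steering effect the lookahead is meant to provide.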

Experimental Setup
We evaluate on CNN/DM (Hermann et al., 2015), which contains news articles and their bullet-point highlights, considered as summaries. In line with previous work (Guo et al., 2023), where summaries have a similar or easier readability level than the input document, we assess the subset of the CNN/DM test set containing documents with a FRE score below 50 (high school and college student levels).
The readability of the summaries is computed using the FRE metric and used during training for constructing the instructions for CATEGORYINSTRUCT and SCOREINSTRUCT, and as the desired readability r in SCOREINSTRUCT+RL and SCOREINSTRUCT+LA. For CATEGORYINSTRUCT, we map the FRE score to the instruction for each reference summary as shown in Table 1. In SCOREINSTRUCT+RL, we randomly sample FRE scores, which are used in the instruction to generate the summary and as r in the reward. We define σ by drawing inspiration from the readability levels (see Table 1), with each level covering a range of 20 FRE points. We set the standard deviation σ to half of that value (i.e., 10), so that more than two thirds of the reward values lie within the requested reading level centered at r. In SCOREINSTRUCT+LA, the number of future tokens considered, n, is set to 20.
Our models are initialized with Flan-T5-Large (Chung et al., 2022) and fine-tuned using a context size of 1024. We include as baselines GPT3.5 (TEXT-DAVINCI-003) and a Flan-T5 model without readability control. As an additional baseline, which we call Best-GPT3.5, we sample k summaries from GPT3.5 for each instruction and select the summary whose readability level is closest to the requested level.³ We evaluate the readability with the FRE, GFI and CLI metrics (Sec. 2.2). We use BertScore (BS, Zhang et al., 2020b) and the F1 measure of ROUGE-L (RG-L, Lin, 2004) for evaluating the summary quality; and FactCC (Kryscinski et al., 2020) and UniEval (Zhong et al., 2022) for faithfulness assessment.

Main Readability Results
Table 2 shows the results on CNN/DM summaries, where we compute the absolute difference between the desired and generated FRE readability (∆) and Pearson correlations with readability metrics. In comparison to our methods, GPT3.5 summaries are considerably distant from the intended reading levels, while Best-GPT3.5 greatly improves over GPT3.5. SCOREINSTRUCT performs better than CATEGORYINSTRUCT, suggesting that fine-grained readability signals are beneficial during training. SCOREINSTRUCT+RL enhances readability control significantly, while SCOREINSTRUCT+LA further improves all readability metrics.
Table 3 presents the detailed results by reading level.⁴ To compare SCOREINSTRUCT's methods with GPT3.5 and CATEGORYINSTRUCT, we used prompts with the specific values for each readability level (30, 50, 70 and 90). Both instruction methods generate summaries with granular readability, as measured by the three readability metrics, while maintaining summary quality close to the baseline.
For SCOREINSTRUCT+RL, the largest gains are for the 11-year-old level, improving FRE from 67.6 to 77.8. However, the summary quality is affected, with a decrease in RG-L. SCOREINSTRUCT+LA is very effective, further decreasing ∆ over all reading levels. On the other hand, the high computational expense of decoding (generating future summaries) is a disadvantage of this method.⁵ While the readability of GPT3.5-generated expert-style summaries is lower than that of easier-to-understand summaries, the readability variation between the different summary types is significantly smaller than with our methods. Recently, Pu and Demberg (2023) found that the proficiency of GPT3.5 in adapting the readability of summaries lags behind human-written texts. Conversely, while selecting the summary with the closest readability level from the sampled GPT3.5 summaries improves performance, it is time-consuming and costly.

⁵ Future work will consider distilling desired readability levels generated via lookahead into a model (Wan et al., 2023).
Figure 3 visualizes the variance of our approaches for each readability level, according to FRE, while Figure 4a gives an overview of the relation between observed and requested readability for SCOREINSTRUCT's methods. Generally, while the instruction models can distinguish summaries at different readability degrees, SCOREINSTRUCT+RL is more effective at telling apart the defined levels, while SCOREINSTRUCT+LA is the approach with the least variance and strongest fine-grained control.

Impact of Other Summary Dimensions
Specificity. Summaries can differ in the level of detail they convey, which can range from highly specific to more general. While easier sentences are less detailed and use simple vocabulary (Laban et al., 2021), casual language (Pitcher et al., 2022), and concise sentences (Scarton et al., 2018), specific sentences are more likely to contain specific words and details. Following Goyal et al. (2022b), we measure the degree of specificity of the generated summaries using the Speciteller tool (Li and Nenkova, 2015).⁶ Interestingly, we observe in Figure 4b that easy-to-read texts are less specific, while summaries with lower readability are more specific, demonstrating that our methods enable summaries with different specificity degrees.
Abstractiveness. Abstractiveness quantifies the extent of rephrasing in generated text, indicating the degree to which the words are not directly taken from the input. We employ MINT (Dreyer et al., 2023) to measure abstractiveness based on overlaps between the input document and the summary. As shown in Figure 4c, easier summaries are more abstractive than more complex summaries, and college-level summaries are slightly more abstractive. We hypothesize that this is because most CNN/DM documents have readability levels in the range of high school, making it more likely that the model copies parts from the source with a similar level, producing more extractive summaries.
Summary Lengths. FRE is sensitive to text lengths (Tanprasert and Kauchak, 2021), and hence we check whether simpler summaries are just shorter while more difficult summaries are longer. Optimizing toward higher FRE scores leads to summaries that contain shorter, easier-to-read words and shorter sentences.⁷ As shown in Figure 5, we do not observe very short or long summaries overall. Finally, most methods generate summaries that are similar in length to the (training set) reference summaries, while the RL method detaches the model from such length constraints.

⁷ Appendix F shows summaries with distinct levels.

Case Study. Table 5 shows examples of summaries with different readability levels generated by SCOREINSTRUCT+LA.⁸ Note that the observed FRE score decreases as the target FRE decreases. Summaries with the lowest readability have more specific words (e.g., "defeated", "contest") and provide more details (e.g., "El Clasico", "Brazilian"). Table 11 in the Appendix presents examples generated by the other methods.

Human Evaluation
We conduct human evaluations to determine how readable and informative the CNN/DM generated summaries are, using Amazon Mechanical Turk. Details on the setup, annotation instructions and fair compensation are described in Appendix E. We ask annotators to select the most readable and the least readable summaries among three displayed summaries. Such a relative selection is easy to do, while determining an absolute ordinal readability level per summary would require expert training (Kiritchenko and Mohammad, 2017). We adopt a rating estimation in which game players' skills are estimated based on a series of multi-player games (Weng and Lin, 2011).⁹ This is similar to the ELO score (Elo, 1978) used in chess, but it computes a rating mean and standard deviation. Ratings start at 25.0 and are adapted based on wins and losses. We draw 1,000 three-summary sets, where the summaries in each set are generated from the same input article and are randomly drawn from the pool of 18 summaries (4 methods × 4 target readability levels, plus baseline and reference) per input article. Therefore, estimated ratings can be compared across different settings. For informativeness, we instruct annotators to mark the least informative and the most informative summary. Table 4 shows the human evaluation results. For all four methods, the readability tends to increase as higher target readability is requested, confirming the effectiveness of our methods, while informativeness is negatively correlated with readability.

[Table 5: Example document with generated summaries at different readability levels, scored by FRE↑, GFI↓, CLI↓. Document: "Team-mates Neymar and Dani Alves proved their dedication to Barcelona by supporting the club's basketball side. Neymar and Alves headed to watch El Clasico on Thursday night alongside the Brazilian's sister Rafaella. Barca prevailed with a narrow 85-80 victory in the Euro League contest. ... However Real Madrid remain top of their Euro League division over their bitter rivals, just by points difference ..."]
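The win/loss rating dynamics can be illustrated with a classic Elo-style update. This is only an analogy sketch: the evaluation actually uses the Bayesian approximation of Weng and Lin (2011), which additionally tracks a rating standard deviation, and the k and scale values below are arbitrary illustration choices:

```python
def elo_update(r_winner: float, r_loser: float,
               k: float = 2.0, scale: float = 10.0):
    # Expected win probability of the winner under a logistic model.
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / scale))
    # The winner gains more rating when the win was unexpected.
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta
```

Starting two summaries at the initial rating of 25.0, a win moves the pair apart symmetrically, and repeated comparisons converge toward a stable ordering.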

Factual Consistency and Coherence
Factual consistency and coherence are crucial aspects in abstractive summarization. However, ensuring factuality and coherence while controlling for readability is challenging. Table 6 shows that in most cases easy-to-read summaries, which are more abstractive (Figure 4c), are less factual and coherent. Previous work (Ladhak et al., 2022; Dreyer et al., 2023) shows that factuality decays with increasing abstractiveness. Note that we explicitly used readability signals to control the models towards generating summaries with different readability degrees. However, a high-quality summary should also be faithful and contain relevant information, traits that may not be captured by readability metrics alone. Additionally, a signal based only on readability raises the risk of degenerate solutions, which might result in non-coherent summaries. We are interested in exploring whether combining factuality metrics with readability affects these dynamics. As shown in Table 7, we adapt the RL and LA methods with a linear combination of BS-Fact (BERTScore precision of a summary with respect to the source document) and FRE readability scores. Note that when optimizing only for readability, FRE ∆ is lower but factual consistency and RG-L are the worst, while using BS-Fact tends to improve these metrics at the cost of some readability control. Finally, Table 8

(2023) conduct an inspection of the ability of a GPT3.5 model to adapt its output to different target audiences and writing styles (formal vs. informal), while Imperial (2022) found that GPT2 models struggle to preserve the linguistic complexity of the input prompts. Importantly, there has been significant development of models for Plain Language Summarization (PLS) from scientific papers (Devaraj et al., 2021; August et al., 2022; Goldsack et al., 2023; Guo et al., 2023). However, different from our proposed methods, such works do not consider fine-grained readability degrees. Concurrent to our work, Chi et al.
(2023) employ weak supervision and prompting methods to control the readability complexity level of sentence paraphrasing.
Controllable Text Generation. Previous work explores different alternatives to tailor text for diverse target users (Cao et al., 2020; Kumar et al., 2022; Pu and Demberg, 2023), while research has also been conducted on style control in various generation tasks, including paraphrasing and story generation (Wang et al., 2017; Shen et al., 2017; Huang et al., 2019). In particular, for summarization, style control emphasizes factors such as abstractiveness (Goyal et al., 2022b; Dreyer et al., 2023), length, or content (Fan et al., 2018; He et al., 2022; Shen et al., 2022; Liu et al., 2022b). Contrary to these, Böhm et al. (2019) employ RL with rewards from human preferences for generating different summaries, while our reward function is based on readability signals. Goyal et al. (2022b)

Text Simplification. Text Simplification aims to improve the readability of sentences by reducing linguistic complexity while maintaining the original meaning (Alva-Manchego et al., 2021). Different aspects of the simplified output have been controlled, such as adapting to a specific level (Scarton and Specia, 2018; Nishihara et al., 2019) or incorporating edit operations (Alva-Manchego et al., 2017; Kumar et al., 2020; Mallinson et al., 2020) or lexical and syntactic constraints (Martin et al., 2020) into text simplifications. Maddela et al. (2021) combine linguistically motivated syntactic rules with data-driven neural models to enhance the diversity and controllability of the simplifications. In contrast to text simplification, which aims to control the degree to which a sentence is paraphrased, our approaches must provide succinct and informative summaries while maintaining different fine-grained levels of desired readability.

Conclusion
In this work, we propose three methods for fine-grained control of the readability level of summaries. We showed that instruction-based methods can be used to guide LLMs to generate summaries with fine-grained readability degrees. We then presented an RL approach that uses a Gaussian-based reward, and a new decoding method that allows controlling the readability during inference. We provided an extensive evaluation of our approaches and showed that they significantly improve the control of the summaries' readability. Future work includes adapting the methods to different NLG tasks and combining different metrics in order to capture distinct summary aspects that impact readability.

Limitations
In this paper, we propose modifications to current summarization approaches to enhance the controllability of readability levels. While these modifications are not specific to any particular language, we conducted all of our experiments and analysis exclusively on English-language summarization datasets. Additionally, we focused solely on studying newswire summaries, given their widespread use in summarization research. Hence, this paper does not offer insights into the range of style variations found in non-English and non-newswire datasets, nor does it ascertain the generalizability of our findings to other datasets and domains. Second, although the lookahead method demonstrates enhanced readability control, it incurs a heavy computational overhead, especially when used with larger beams. To alleviate these expenses, one possible solution is to employ distillation to enhance decoding speed (Wan et al., 2023). Finally, the experimental cost of requesting API responses from OpenAI to assess ChatGPT's text generation abilities imposes significant constraints on the dataset selection. For this reason, we limit our experimentation to only one summarization dataset.

Ethics Statement
While some of the investigated systems have demonstrated a high level of controllability on the CNN/DM dataset, this does not imply their use as general controllable summarization models. To ensure reliability, these models should be thoroughly evaluated before being used in different settings.
In the conducted human experiment, for informativeness, workers took a median time of 79 seconds per HIT, and we paid $0.40 plus a bonus of $0.10, which amounts to $22.80 per hour.For readability, workers took a median time of 58 seconds per HIT, and we paid $0.20 plus a bonus of $0.05, amounting to a pay of $15.50 per hour.

Appendix A Readability Metrics
The Flesch Reading Ease (FRE) (Kincaid et al., 1975) metric assigns higher scores to texts that are easier to read. It is calculated as follows:

$$\text{FRE} = 206.835 - 1.015 \left(\frac{\#\text{words}}{\#\text{sentences}}\right) - 84.6 \left(\frac{\#\text{syllables}}{\#\text{words}}\right)$$

The Gunning fog index (GFI) (Gunning, 1952) quantifies the level of formal education required for a person to comprehend a given text upon initial reading, and it is computed using the following formula:

$$\text{GFI} = 0.4 \left[\frac{\#\text{words}}{\#\text{sentences}} + 100 \left(\frac{\#\text{longWords}}{\#\text{words}}\right)\right]$$

where longWords are words longer than 7 characters. Higher values indicate lower readability. The Automated Readability Index (ARI) (Smith and Senter, 1967) is based on characters per word and words per sentence:

$$\text{ARI} = 4.71 \left(\frac{\#\text{characters}}{\#\text{words}}\right) + 0.5 \left(\frac{\#\text{words}}{\#\text{sentences}}\right) - 21.43$$

The Coleman-Liau index (CLI) (Coleman and Liau, 1975) relies on characters instead of syllables per word:

$$\text{CLI} = 0.0588 L - 0.296 S - 15.8$$

where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.

B Hyper-parameter Settings
The experiments were executed using version 3.3.1 of the transformers library released by Hugging Face (Wolf et al., 2019). The fine-tuning process is halted once the model reaches convergence in terms of ROUGE score on the validation set. In Table 9, we report the hyperparameters used to train the models on CNN/DM. We use the Adam optimizer (Kingma and Ba, 2015) and employ a linearly decreasing learning rate schedule without warm-up.
We select the following prompts, which gave the best results in terms of readability metrics:
• {Document} \n Summarize the above article in 3 sentences for a sixth-grade student.
• {Document} \n Summarize the above article in 3 sentences for a middle-school student.
• {Document} \n Summarize the above article in 3 sentences for a high-school student.
• {Document} \n Summarize the above article in 3 sentences for a college student.
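Such category-based prompts can be assembled programmatically from a target readability score. The sketch below is illustrative only: the function name and the FRE-to-audience thresholds are our assumptions, not the exact mapping used in the paper:

```python
# Hypothetical helper mapping a target FRE score to one of the four
# audience categories used in the prompts above. The numeric thresholds
# are illustrative assumptions based on common FRE interpretation bands.
def build_prompt(document: str, target_fre: float) -> str:
    if target_fre >= 80:
        audience = "sixth-grade student"
    elif target_fre >= 60:
        audience = "middle-school student"
    elif target_fre >= 40:
        audience = "high-school student"
    else:
        audience = "college student"
    return f"{document} \nSummarize the above article in 3 sentences for a {audience}."
```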

D Results with Additional Readability Metrics
Table 10 shows results using two additional readability metrics, the Dale-Chall readability score and the Automated Readability Index. CATEGORYINSTRUCT and SCOREINSTRUCT perform similarly, with SCOREINSTRUCT improving at the 11-year-old level. SCOREINSTRUCT+LA is able to better distinguish between readability levels than SCOREINSTRUCT+RL.

E Mechanical Turk Setup
We provide additional details on our Amazon Mechanical Turk setup. AMT annotators are non-experts, so we use several mitigation strategies to obtain high-quality human judgements, including simplified task setups, clear annotation guidelines, task-specific qualification tests, and time checks to exclude potential spammers, and we gave annotators fair compensation. We give detailed instructions to the annotators; see Figures 6 and 7. We add a number of tasks with known answers (i.e., cases where the most/least readable/informative summaries should be clear), enabling us to estimate in real time the accuracy of workers who complete several of these. Workers who complete the tasks too quickly or have low accuracy on the tasks with known answers are automatically removed from our worker pool, and their answers are replaced with new annotations. We also use a bonus incentive structure: every worker who passes the automatic quality checks receives a bonus at the end. In addition, we use custom qualification tests. For any worker to be accepted as an annotator for our readability and informativeness evaluations, there are three hurdles: (1) We only consider workers from a country whose main language is English who have completed 100 or more HITs with an acceptance rate of 95% or higher. (2) Workers must also have passed an initial custom qualification test for a related text classification task we conducted in the past. (3) Workers who passed (1) and (2) qualify to take the custom qualification tests for our readability and informativeness tasks. Only the workers who passed these final tests were accepted to work on the human readability or informativeness evaluations in this paper.
On our batches of 1,000 HITs, we allowed any worker to complete a maximum of 333 HITs, so that no worker can dominate the results.We use two annotators per HIT.
Even though we only display three summaries at a time and only receive an annotated relative ranking of these three (as opposed to an absolute ordinal readability level), we wish to estimate a readability score for each of the four methods (Sec. 3) and their various target readability scores r, so that these setups can be compared based on human judgements. We interpret each set of three summaries as a three-player game in which the method that generated the most readable summary wins and the method that generated the least readable summary loses. In total, 22 workers worked on our readability evaluation, while 28 worked on our informativeness evaluation.
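The paper does not specify here how these three-player game outcomes are aggregated into per-method scores. One common choice is a Bradley-Terry model fit to the pairwise comparisons implied by each three-way ranking; the following is a hypothetical sketch under that assumption (the function name and update scheme are illustrative, using standard minorization-maximization updates):

```python
from collections import defaultdict
from itertools import combinations

def bradley_terry(rankings, iters=200):
    # `rankings`: list of method names ordered most- to least-readable,
    # e.g. [["A", "B", "C"], ...]. Each three-way ranking is decomposed
    # into pairwise wins, and per-method strengths are fit by MM updates.
    wins = defaultdict(float)  # wins[(a, b)]: times a was ranked above b
    for order in rankings:
        for a, b in combinations(order, 2):  # a appears above b
            wins[(a, b)] += 1.0
    players = sorted({p for pair in wins for p in pair})
    score = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for p in players:
            num = sum(w for (a, _), w in wins.items() if a == p)
            den = sum(w / (score[a] + score[b])
                      for (a, b), w in wins.items() if p in (a, b))
            new[p] = num / den if den > 0 else score[p]
        norm = sum(new.values()) / len(players)  # keep mean strength at 1
        score = {p: v / norm for p, v in new.items()}
    return score
```

A method that is consistently ranked first across games receives the highest fitted strength, which can then be compared across target readability levels.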

F Examples of Generated Summaries
Tables 11 and 12 present summaries at distinct readability levels generated by the different methods, together with their readability scores given by the Flesch reading ease (FRE), Gunning fog index (GFI), and Coleman-Liau index (CLI) metrics.
Document: A group of U.S. senators has written to football's world governing body FIFA, calling for Russia to be removed as host of the 2018 World Cup because of its role in the Ukraine crisis and occupation of Crimea. In a letter dated Tuesday and released on Wednesday, the 13 Democratic and Republican U.S. lawmakers said they 'strongly encourage' FIFA to move the global competition. 'Allowing Russia to host the World Cup inappropriately bolsters the prestige of the (Russian President Vladimir) Putin regime at a time when it should be condemned and provides economic relief at a time when much of the international community is imposing economic sanctions,' the senators wrote to

Figure 2: Overview of the proposed methods. (a) illustrates our approach to controlling summary readability via fine-grained instructions. (b) shows our RL method, where, given an input document and a readability level, the policy generates a summary to be scored by our Gaussian-based reward. (c) shows our lookahead approach, which uses the readability score of a future summary to guide generation.

Figure 4: (a) Relation between observed readability ry and requested readability r. (b) Specificity and (c) abstractiveness of the CNN/DM summaries based on four readability levels.

Figure 5: Summary sentence lengths (left) and summary lengths (right) generated by the proposed approaches for different readability levels, compared to reference summaries of the corresponding levels.

Table 1: Category-based instructions based on readability scores (FRE).

Table 2: Readability results for the proposed methods using different automatic metrics. ∆ is the absolute difference between the desired and generated FRE readability. ρ is the correlation with the target metric. All correlations are statistically significant (p < 0.05).

Table 4: Human evaluation of readability (µr) and informativeness (µi) of CNN/DM summaries. σr and σi are the corresponding standard deviations.

Table 5: Examples of summaries of different readability levels generated using SCOREINSTRUCT+LA.
Requested FRE 90: Real Madrid and Barcelona played basketball on Thursday night. Barca won the game 85-80, but Real are top of the Euro League by points. Neymar and his sister Rafaella went to watch the game with friends.
Requested FRE 50: Barcelona defeated Real Madrid 85-80 in El Clasico on Thursday night. Neymar and his Barcelona team-mates went to watch basketball with his sister Rafaella. Real remain top of the Euro League table over Barcelona by just points.

Table 9: Hyperparameter settings for Flan-T5 methods.

Table 11: Examples of generated summaries for different readability levels.

Table 12: Examples of generated summaries for different readability levels, measured using the FRE, GFI and CLI metrics.

FIFA President Sepp Blatter. FIFA president Sepp Blatter announces Russia as the host nation for the 2018 World Cup. Republican John McCain is one of 13 US senators urging FIFA to think again. The lead signers were Senator Robert Menendez, the top Democrat on the Senate Foreign Relations Committee, and Senator Ron Johnson, the Republican chairman of the foreign relations panel's Europe subcommittee. Republican Senator John McCain, chairman of the Senate Armed Services Committee, also signed the letter, as did Senator Richard Durbin, the number two Democrat in the Senate. England are planning a bid to host Euro 2028 after last holding the tournament at home in 1996. Meanwhile, England are interested in bidding for Euro 2028 but are very unlikely to bid for the 2026 World Cup, FA chairman Greg Dyke has revealed. FIFA announced last week that the decision on the 2026 tournament will be made in 2017, when Blatter is still likely to be FIFA president. The FA has decided not to bid for FIFA tournaments while Blatter is still in power following England's failed bid for 2018, but is eyeing a bid for the European Championship in 2028.

CATEGORYINSTRUCT
Requested level: 11-year-old. 13 US senators have written to FIFA asking for Russia to be removed as host of the 2018 World Cup. They say the decision on the 2026 World Cup will be made in 2017. England are planning a bid to host Euro 2028 but are unlikely to bid for the 2026 World Cup.
Requested level: middle school. 13 Democratic and Republican US senators have written to FIFA. They are calling for Russia to be removed as host of the 2018 World Cup. The letter was signed by Republican Senator John McCain. England are interested in bidding for Euro 2028 but are unlikely to bid for the 2026 World Cup.
Requested level: high school. 13 Democratic and Republican US senators have written to FIFA. They 'strongly encourage' FIFA to move the global competition. The senators say Russia's role in the Ukraine crisis and occupation of Crimea should be condemned.
Requested level: college. 13 Democratic and Republican U.S. lawmakers 'strongly encourage' FIFA to move the global competition. The senators said allowing Russia to host the World Cup 'inappropriately bolsters the prestige of the (Russian President Vladimir) Putin regime'
FRE 21.7, GFI 13.7, CLI 13.8

SCOREINSTRUCT
Requested FRE 90. 13 US senators call for Russia to be removed as host of the 2018 World Cup. They say the decision should be made in 2017 when FIFA president Sepp Blatter is still likely to be in power. England are planning a bid to host Euro 2028 but are unlikely to bid for the 2026 World Cup.
Requested FRE 70. 13 Democratic and Republican U.S. senators write to FIFA. They say Russia's hosting the World Cup 'bolsters the prestige of the Putin regime'. England are planning a bid to host Euro 2028 but unlikely to bid for 2026.
Requested FRE 50. 13 Democratic and Republican U.S. senators write to FIFA. They 'strongly encourage' FIFA to move the global competition. England are planning a bid to host Euro 2028 after last holding the tournament at home in 1996.
Requested FRE 30. 13 Democratic and Republican U.S. lawmakers urge FIFA to move the global competition. 'Allowing Russia to host the World Cup inappropriately bolsters the prestige of the (Russian President Vladimir) Putin regime,' they wrote.
FRE 30.3, GFI 13.3, CLI 10.3

SCOREINSTRUCT+RL
Requested FRE 90. 13 senators have written to FIFA calling for Russia to be removed as host of the 2018 World Cup. The 13 senators say the tournament should be moved because of its role in the Ukraine crisis. England are planning a bid to host Euro 2028 but are unlikely to bid for the 2026 World Cup.
Requested FRE 70. 13 senators have written to FIFA calling for Russia to be removed as host of the 2018 World Cup. The 13 Democratic and Republican lawmakers say they 'strongly encourage' FIFA to move the tournament. Russia were announced as the host nation for the tournament in Russia. England are planning a bid to host Euro 2028 but are unlikely to bid for 2026.
Requested FRE 50. 13 Democratic and Republican senators have written to FIFA president Sepp Blatter calling for Russia to be removed as host of the 2018 World Cup. The senators say Russia's role in the Ukraine crisis and occupation of Crimea should be condemned. England are planning a bid to host Euro 2028 but are unlikely to bid for 2026 World Cup.
Requested FRE 30. 13 Democratic and Republican senators have written to FIFA president Sepp Blatter calling for Russia to be removed as host of the 2018 World Cup. The senators say Russia's role in the Ukraine crisis and occupation of Crimea should be condemned. England are unlikely to bid for the 2026 World Cup after failing to qualify for 2018 tournament.

SCOREINSTRUCT+LA
Requested FRE 90. 13 U.S. senators call for Russia to be removed as hosts of the 2018 World Cup. They say the decision should be made in 2017. England are planning a bid to host Euro 2028 but are unlikely to bid for the 2026 World Cup.
Requested FRE 70. US senators call on FIFA to move the 2018 World Cup from Russia. Russia was announced as the host nation for the 2018 tournament. 13 Democratic and Republican U.S. lawmakers signed the letter.
Requested FRE 50. FIFA president Sepp Blatter announced Russia as host nation for 2018 World Cup. 13 Democratic and Republican U.S. senators have written to FIFA president urging him to move the event to 2018. The lawmakers said Russia's role in the Ukraine crisis and occupation of Crimea should be condemned.
Requested FRE 30. 13 Democratic and Republican U.S. lawmakers urge football's world governing body FIFA to move the global competition. The senators said Russia hosting the 2018 World Cup 'inappropriately bolsters the prestige of the (Russian President Vladimir) Putin regime'. England are interested in Euro 2028 but are very unlikely to bid for the 2026 World Cup, FA chairman Greg Dyke has revealed.
Figure 6: Screenshot of the human readability annotation.