BLESS: Benchmarking Large Language Models on Sentence Simplification

We present BLESS, a comprehensive performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We examine how well off-the-shelf LLMs can solve this challenging task, assessing a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our analysis considers a suite of automatic metrics as well as a large-scale quantitative investigation into the types of common edit operations performed by the different models. Furthermore, we perform a manual qualitative analysis on a subset of model outputs to better gauge the quality of the generated simplifications. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines. Additionally, we find that certain LLMs demonstrate a greater range and diversity of edit operations. Our performance benchmark will be available as a resource for the development of future TS methods and evaluation metrics. We make our code and the generated system outputs available at https://github.com/ZurichNLP/BLESS.


Introduction
Large pre-trained language models (LLMs) have demonstrated strong performance on a wide range of NLP tasks without the need for task-specific fine-tuning, leading to a prevailing conventional wisdom that LLMs can solve any task. This has motivated the development of benchmarks to better understand the abilities of LLMs in specific domains such as healthcare (Sallam, 2023), finance (Dowling and Lucey, 2023), education (Baidoo-Anu and Owusu Ansah, 2023), engineering (Sobania et al., 2023), and ethics (Zhuo et al., 2023), as well as for specific NLP tasks (Li et al., 2022; Wang et al., 2023; Liu et al., 2023).
However, it remains unclear how well current LLMs can perform on the challenging task of text simplification (TS). In this paper, we focus on sentence simplification in English, which typically involves rephrasing part or all of a sentence into language which is more accessible and easier to understand. While recent work has focused on evaluating the TS abilities of select models, such as GPT-3.5-Turbo (Feng et al., 2023) and mT5 (Ryan et al., 2023), there is currently no large-scale and detailed analysis of the simplification capabilities of different LLMs.
In this study, we expand both the breadth and depth of the knowledge base on TS with LLMs, evaluating a wider variety of models on three different TS datasets: ASSET (Alva-Manchego et al., 2020a), NEWSELA (Jiang et al., 2020), and MED-EASI (Basu et al., 2023). We select these datasets to cover a variety of domains (Wikipedia, news, and medical) and a diverse set of TS operations (e.g. paraphrasing, splitting, and elaboration).
Specifically, we use in-context learning (ICL) and assess LLMs in a few-shot setting, experimenting with three different prompts. We select 44 widely used generative models (both open- and closed-weight) and evaluate their abilities from three distinct angles. First, we rely on automatic evaluation metrics commonly used in the TS literature. Second, we quantify and compare the edit operations performed by the LLMs during simplification. Finally, we perform a targeted qualitative analysis to validate our findings and to better understand the quality of the generated simplifications. Our findings reveal that closed-weight models provide significant gains over open-weight alternatives under a few-shot setting, establishing them as a strong baseline for future work on TS. We summarize our contributions as follows:
1. BLESS (Benchmarking Large language modEls on Sentence Simplification), a performance evaluation benchmark of 44 LLMs in a few-shot setting (Section 3).
2. An evaluation that includes both widely used automatic metrics and an analysis of the TS edit operations performed by the models (Section 4).
3. A qualitative analysis of the results, with manual annotation of simplification operations and an examination of the relationships between selected evaluation metrics (Section 5).

Related Work
Text Simplification Benchmarks Most simplification work treats the task as a monolingual machine translation problem, training models on datasets containing complex-simple sentence pairs (Zhu et al., 2010). Alva-Manchego et al. (2020b) performed a standardized evaluation of general data-driven simplification systems, using Wikipedia-based datasets and NEWSELA. At the document level, Alva-Manchego et al. (2019b) conducted a systematic analysis of simplification operations to demonstrate the limitations and disruptions that occur when multiple sentences are involved. Benchmarks have also been established for more specific kinds of simplification: for example, both non-neural (Paetzold and Specia, 2016) and neural (Stajner et al., 2022; Saggion et al., 2022) approaches to lexical simplification, which aims to replace complex words with simpler alternatives.

LLM-based Simplification
LLMs such as GPT-3.5-Turbo, the model behind early versions of ChatGPT, are often used out-of-the-box without any further training for a given domain or task. Some previous works have investigated the simplification capabilities of select LLMs in order to benchmark performance against dedicated approaches (Aumiller and Gertz, 2022; Vásquez-Rodríguez et al., 2022; Ryan et al., 2023; Sun et al., 2023; Chi et al., 2023). Meanwhile, Feng et al. (2023) explored the TS abilities of the two strong-performing OpenAI models, GPT-3.5-Turbo and Davinci-003. However, despite these efforts, we only have results from a very limited number of LLMs and evaluation metrics. Thus, it remains unclear how a wider spectrum of models, differing in architecture and training strategy, perform on different domains and in response to different prompts.
We aim to fill this gap and study the simplification abilities of 44 LLMs in order to highlight potential weaknesses and determine areas for further development. To the best of our knowledge, we are the first to focus on establishing the performance of recent LLMs on the task of TS.
BLESS: Benchmarking Large Language Models on Sentence Simplification

Datasets
Our assessment establishes the performance of current LLMs on TS according to three datasets, covering different domains and styles. Table 1 summarizes these datasets. ASSET (Alva-Manchego et al., 2020a) comprises 2,359 sentences from English Wikipedia paired with 10 simplified references. These references were created by crowdworkers who were instructed to use edit operations such as replacement, splitting, and deletion. We use the official test split (359 sentences) for evaluation.
MED-EASI (Basu et al., 2023) is a simplification dataset for short medical texts containing 1,979 complex (expert)-simple (layman) pairs. Each text contains one or more sentences. In this dataset, simplified texts are composed using four types of operations: elaboration, replacement, deletion, and insertion. We use the released test split (300 instances) for our evaluation. Unlike in the other two datasets, simplifications in MED-EASI are slightly longer than the complex source texts, due to the explanation and decomposition of complex medical terms.
NEWSELA (Xu et al., 2015) contains 1,130 long-form news articles that have been professionally rewritten according to four different graded readability levels. For our benchmarking experiments, we opt for the Newsela-Manual test set (Jiang et al., 2020). We extract all aligned and partially aligned sentence pairs between a complex source sentence (level 0) and the four simplified article versions (levels 1-4), keeping only those sentences for which we have a reference for all four simplification levels. This results in 256 test examples. Using this small subset of NEWSELA data ensures that sentence-level alignments are of high quality and capture important edit operations such as splitting.
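To make the selection procedure concrete, the sketch below shows one way the filtering described above could be implemented; the alignment tuple format and the helper name are illustrative assumptions, not the authors' actual preprocessing code.

```python
# Hypothetical sketch: keep only complex (level 0) sentences from Newsela-Manual that
# have an aligned or partially aligned reference at every simplification level 1-4.
from collections import defaultdict

def filter_newsela(alignments):
    """alignments: iterable of (complex_sentence, level, simple_sentence) tuples,
    taken from the aligned/partially aligned pairs (assumed input format)."""
    by_complex = defaultdict(lambda: defaultdict(list))
    for complex_sent, level, simple_sent in alignments:
        # A split may align one complex sentence to several simple sentences per level.
        by_complex[complex_sent][level].append(simple_sent)
    # Keep only source sentences that have at least one reference at every level 1-4.
    return {
        src: {lvl: " ".join(levels[lvl]) for lvl in (1, 2, 3, 4)}
        for src, levels in by_complex.items()
        if all(lvl in levels for lvl in (1, 2, 3, 4))
    }
```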

LLM Types
We investigate a total of 44 LLMs with different sizes, architectures, and training objectives. The models we consider range from 60 million to 176 billion parameters and are all based on the transformer architecture (Vaswani et al., 2017), consisting of either an encoder-decoder or a standalone decoder. Furthermore, all have undergone a self-supervised pre-training stage. Nine of these models leverage instruction-tuning, which fine-tunes a pre-trained base model on labeled instruction-response pairs from a diverse set of tasks. Finally, just three of these models have received additional training through reinforcement learning with human feedback (RLHF) to better align the model's responses with human preferences (Stiennon et al., 2020; Ouyang et al., 2022). Evaluating a wide variety of currently available models should serve as a broad baseline and give sufficient information on which models perform best in which domains as well as where key challenges remain.
We broadly distinguish between open- and closed-weight models. The former pertains to models for which the trained weights are accessible and thus allow for self-hosting. Typically, these models are considered to be "open-source." However, we note that this obfuscates specific licensing agreements attached to some models and whether or not the training data and code are also made available. In comparison, closed-weight models refer to those whose weights are kept private and can be queried only through APIs. Our open-weight models include variants of the T5 family (Raffel et al., 2020), GPT-style models (Radford et al., 2019; Wang and Komatsuzaki, 2021), OPT (Zhang et al., 2022c) and LLaMA models (Touvron et al., 2023), and the BLOOM family (Scao et al., 2022). For closed-weight models, we focus on those developed by OpenAI. Details on each model family are provided in Appendix A.

Prompts
To simplify sentences with LLMs without additional fine-tuning, we use in-context learning (ICL). ICL is a prompting technique that utilizes a small number of input-output examples to demonstrate a task (Brown et al., 2020). Previous work on related tasks has demonstrated that LLMs are sensitive to which input prompts and few-shot examples are used (Zhang et al., 2022b; Lu et al., 2022; Agrawal et al., 2023). To account for this, we construct three stylistically distinct prompts that consist of a task instruction and N few-shot examples (see Figure 1). For all generation settings, we set N=3 and randomly sample complex-simple pairs from the corresponding validation sets. We leave a detailed investigation of optimal in-context learning strategies for TS to future work.
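To illustrate how such a prompt can be assembled, the sketch below builds a structured few-shot prompt in the spirit of prompt 2. The instruction wording is taken from the paper, while the `Complex:`/`Simple:` field labels and the function name are illustrative assumptions rather than the exact template used in the benchmark.

```python
# Hypothetical sketch of structured few-shot prompt construction (not the exact
# template from Figure 1). The instruction wording follows prompt 2.
import random

INSTRUCTION = (
    "Please rewrite the following complex sentence in order to make it easier to "
    "understand by non-native speakers of English. You can do so by replacing complex "
    "words with simpler synonyms (i.e. paraphrasing), deleting unimportant information "
    "(i.e. compression), and/or splitting a long complex sentence into several simpler "
    "ones. The final simplified sentence needs to be grammatical, fluent, and retain "
    "the main ideas of its original counterpart without altering its meaning."
)

def build_prompt(validation_pairs, input_sentence, n_shots=3, seed=0):
    """validation_pairs: list of (complex, simple) tuples from the dataset's dev split."""
    rng = random.Random(seed)
    examples = rng.sample(validation_pairs, n_shots)
    demos = "\n\n".join(f"Complex: {c}\nSimple: {s}" for c, s in examples)
    # The final block ends with an empty "Simple:" prefix for the model to continue.
    return f"{INSTRUCTION}\n\n{demos}\n\nComplex: {input_sentence}\nSimple:"
```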

Inference Settings
For open-weight models, we run inference on local GPUs using the Transformers library (Wolf et al., 2020). We load the models with 8-bit quantization (Dettmers et al., 2022), which allows us to run inference efficiently on as few as 5 A100 80GB GPUs. For closed-weight models, we use the APIs provided by OpenAI. As generation hyperparameters, we use Nucleus Sampling (Holtzman et al., 2020) with a probability threshold of 0.9, a temperature of 1.0, and a maximum output length of 100 tokens. To account for the stochastic generation settings, we perform each inference run with 3 different random seeds and aggregate the results for each metric.
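A minimal sketch of the open-weight inference path is shown below, assuming the Transformers and bitsandbytes libraries; the model name is just an illustrative example, and this is not the benchmark's actual harness (closed-weight models are instead queried through the OpenAI API).

```python
# Hypothetical sketch: 8-bit loading and nucleus sampling with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-1.3b"  # illustrative open-weight, decoder-only model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    load_in_8bit=True,  # 8-bit quantization via bitsandbytes (Dettmers et al., 2022)
)

def simplify(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,       # stochastic decoding
        top_p=0.9,            # nucleus sampling threshold
        temperature=1.0,
        max_new_tokens=100,   # maximum output length
    )
    # Strip the prompt tokens and return only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```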

Baselines
We use the MUSS (Martin et al., 2022) model as our main baseline since it has been shown to achieve state-of-the-art performance. MUSS fine-tunes a BART-large (Lewis et al., 2020) model with ACCESS control tokens (Martin et al., 2020) extracted from labeled TS datasets and/or mined paraphrases to train both supervised (MUSS-wiki-mined) and unsupervised (MUSS-mined) TS systems. We use the suggested hyperparameters from the original paper to set the control tokens for simplification generation.
Figure 1 (prompt texts): Prompts 0 and 1 share the basic task instruction "Rewrite the complex sentence with simple sentence(s). Keep the meaning same, but make it simpler." Prompt 1 uses the same instruction as prompt 0 but presents few-shot examples in an inline, continuous text format: "The sentence '{complex example}' can be simplified as follows: '{simple example}'", ending with the prefix "The sentence '{input}' can be simplified as follows:". Prompt 2 instructs: "Please rewrite the following complex sentence in order to make it easier to understand by non-native speakers of English. You can do so by replacing complex words with simpler synonyms (i.e. paraphrasing), deleting unimportant information (i.e. compression), and/or splitting a long complex sentence into several simpler ones. The final simplified sentence needs to be grammatical, fluent, and retain the main ideas of its original counterpart without altering its meaning."

Automatic Metrics
To assess how well LLMs can perform TS, we evaluate all the model outputs using a suite of automatic metrics. We measure simplicity using SARI (Xu et al., 2016), meaning preservation using BERTScore (Zhang et al., 2020), and readability using FKGL (Kincaid et al., 1975). These metrics are computed using the EASSE package (Alva-Manchego et al., 2019a). Additionally, we report LENS (Maddela et al., 2023), a recently proposed learned metric, which considers both the semantic similarity and the degree of simplification performed by the system with respect to the source sentence and references. Where possible, we also establish the performance of 'gold' simplifications by evaluating available reference sentences using a 'leave-one-out' strategy. That is, in cases where multiple references are available, we select one at random and evaluate it against the remaining references.
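For concreteness, a minimal sketch of how such a metric suite can be computed with the EASSE and bert-score packages is given below; the function names reflect the public APIs of those packages as we understand them and should be treated as assumptions if installed versions differ (LENS and the leave-one-out protocol are omitted).

```python
# Hypothetical evaluation sketch using EASSE (SARI, FKGL) and bert-score.
from easse.sari import corpus_sari
from easse.fkgl import corpus_fkgl
from bert_score import score as bert_score

def evaluate_outputs(sources, outputs, references):
    """sources/outputs: lists of strings; references: one list of reference strings per source."""
    # EASSE expects refs_sents transposed: refs_sents[k][i] is the k-th reference
    # for the i-th source sentence.
    refs_sents = [list(refs) for refs in zip(*references)]
    sari = corpus_sari(orig_sents=sources, sys_sents=outputs, refs_sents=refs_sents)
    fkgl = corpus_fkgl(sentences=outputs)
    # BERTScore against the first reference only, for illustration.
    _, _, f1 = bert_score(outputs, [refs[0] for refs in references], lang="en")
    return {"SARI": sari, "FKGL": fkgl, "BERTScore_F1": f1.mean().item()}
```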

Automatic Evaluation Results
In this section, we present the results of our automatic evaluation of simplification outputs and summarize our main findings. First, we perform an exhaustive assessment using automatic metrics (Section 3.6). For brevity, we report the results of the best-performing LLMs with SARI and BERTScore in Table 2 and provide the complete results for all 44 models and metrics in Appendix B. Then, we compute edit distance statistics to quantify the simplification operations performed by each of the LLMs (Section 4.1). We begin by assessing the impact of the different prompt formats.
Structured prompting improves performance.
Figure 2 reveals that prompts 0 and 2 both offer a slight advantage over prompt 1, especially in regard to meaning preservation. This confirms that providing a structured template for few-shot examples is more beneficial than embedding them within sentences. Hence, we focus on prompt 2 for all our analysis, as it provides the most detailed description of the task and has also been used in prior work (Maddela et al., 2023).
Training method matters more than size.
Table 2 presents the performance according to SARI and BERTScore for the top-performing LLMs. Scaling LLMs has revealed strong benefits in few-shot settings (Brown et al., 2020; Chowdhery et al., 2022); however, in our evaluation, we observe numerous exceptions to this rule. For example, Flan-T5-large (770 million parameters) consistently attains higher SARI scores on ASSET than Flan-T5-xl (3 billion parameters) and Flan-T5-xxl (11 billion parameters); we include a wider comparison of selected LLMs on ASSET in Figure 7 in the Appendix. Meanwhile, we observe that training strategies such as instruction-tuning and RLHF help to deliver greater improvements, especially for meaning preservation, as measured by BERTScore. This agrees with previous findings that demonstrate the benefits of instruction-based adaptation strategies for improved generalization abilities (Schick and Schütze, 2021; Zhang et al., 2022a; Chung et al., 2022).
ASSET On Wikipedia-style data, OpenAI's Davinci-003 and GPT-3.5-Turbo outperform all other tested LLMs by a considerable margin according to SARI. Strikingly, these models also outperform the ground truth references, which are closely approximated by the previous state-of-the-art MUSS models. This is notable since MUSS-wiki-mined was trained on the in-domain TS dataset of Wiki-Large (Zhang and Lapata, 2017). Meanwhile, among open-weight contenders, we can see in Table 2 that only a small number of models are competitive, namely OPT-IML-Max-30b, Flan-T5-large, and Flan-UL2, the last of which achieves the best balance between simplicity and meaning preservation according to automatic metrics.
MED-EASI For medical-related texts, we observe that the majority of the models consistently fail to preserve meaning (our qualitative analysis in Section 5 confirms this; see Table 3). The drop in meaning preservation can likely be explained by the fact that models are known to produce inadequate generations in out-of-domain settings (Müller et al., 2020; Singhal et al., 2023). The models that do strike a reasonable balance with both SARI and BERTScore are again OpenAI's more powerful offerings and the Flan models. Notably, we also observe that the two MUSS models are able to perform competitively with the Flan models despite being multiple orders of magnitude smaller.
NEWSELA Evaluating LLMs on professionally written simplifications from NEWSELA reveals that even the best LLMs are not able to match human performance. This is observable through the clear margins of around 20 SARI points and 14 BERTScore points between the best performers and the gold simplifications. On this dataset, MUSS-wiki-mined remains a strong baseline, outperforming all LLMs on both metrics, while Davinci-002, Flan-UL2, and Flan-T5-xxl show the strongest performances among the LLMs.

Analysis of Edit Operations
To identify the main token-level edit operations performed by LLMs, we use an adaptation of the Wagner-Fischer algorithm (Wagner and Fischer, 1974), following previous work by Vásquez-Rodríguez et al. (2021a). Specifically, we calculate the proportion of insertion, replacement, deletion, and keep operations between the input source sentence and each of the system outputs for each dataset.
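The sketch below shows a straightforward token-level edit-operation count using Wagner-Fischer dynamic programming with a backtrace; it is an illustrative re-implementation of the idea described above, not the authors' exact adaptation.

```python
# Hypothetical sketch: count keep, replace, insert, and delete operations between
# a source sentence and a system output at the token level (Wagner-Fischer DP).
from collections import Counter

def edit_operations(source: str, output: str) -> Counter:
    src, out = source.split(), output.split()
    n, m = len(src), len(out)
    # dp[i][j] = minimal number of edits to turn src[:i] into out[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i - 1] == out[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # replace
                                   dp[i - 1][j],      # delete
                                   dp[i][j - 1])      # insert
    # Backtrace to recover one optimal operation sequence.
    ops = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == out[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            ops["keep"] += 1; i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops["replace"] += 1; i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops["delete"] += 1; i -= 1
        else:
            ops["insert"] += 1; j -= 1
    return ops

# Example: per-operation proportions for a single sentence pair.
counts = edit_operations("The cat sat upon the mat", "The cat sat on the mat")
total = sum(counts.values())
print({op: round(c / total, 2) for op, c in counts.items()})
```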
Figure 3 shows the distribution of token-level edit operations for the best-performing LLMs on ASSET (for a more comprehensive view across all datasets and models, see Figure 5 in the Appendix). Most models perform all four operations to differing degrees; however, similar to the gold references, the keep operation is by far the most prominent in this dataset. Notably, Davinci-003 and GPT-3.5-Turbo perform the most diverse set of operations, with fewer additions and more replacements than other models. Insertions are typically less frequent, suggesting that the majority of the models avoid adding new and potentially irrelevant content. We observe that most LLMs are within the range of the gold references in terms of the amount of information they delete when simplifying.

Qualitative Analysis
Automatic metrics are known to have blind spots and are not always entirely reliable (Alva-Manchego et al., 2021; He et al., 2022). To compensate for this, we perform a qualitative analysis on a total of 300 system outputs.
First, we check whether or not each output is a valid simplification and highlight common failure cases such as inappropriate changes to the meaning of the original text, ungrammatical outputs, and the occurrence of hallucinations. Then, we annotate occurrences of common simplification edit operations such as lexical simplification, deletion, sentence splitting, reordering, and paraphrasing. For our annotations, we select model outputs from the top five systems ranked according to performance on the individual evaluation metrics of SARI, BERTScore, FKGL, and LENS. In each ranking set, we randomly select five complex-simple pairs from all generation settings. To evaluate a unique set of models for greater diversity, if a system is repeated in the ranking (e.g. two different prompt types from the same model appear in the top five), we choose the next best system for analysis. An example of our annotated outputs is shown in Table 9 in the Appendix. Table 3 shows results from this analysis, which we describe according to different criteria below.
By Automatic Metric Overall, we find that simplicity and meaning preservation are fairly balanced. However, there is a clear trade-off between these two axes when we consider the top 5 models according to SARI and BERTScore. This agrees with earlier findings from Schwarzer and Kauchak (2018). Along with a higher degree of simplicity, the top 5 SARI models exhibit more diverse edit operations than those ranked highly by BERTScore. LENS, however, does not trade off simplicity and meaning preservation and even achieves a higher simplicity score than SARI along with its increased level of deletion. This result is in line with the previous finding that LENS achieves stronger correlations with human judgments compared to existing TS metrics (Maddela et al., 2023). The top 5 models ranked by FKGL, on the other hand, produce outputs with low simplicity and meaning preservation and an especially high amount of hallucinations. This result supports the previous finding that FKGL can be easily gamed by degenerations (Tanprasert and Kauchak, 2021) and is therefore an unsuitable metric for evaluating the outputs of automatic TS systems.
By Open-Status Open-weight models most frequently use the operations of lexical simplification, paraphrasing, and deletion, while structural operations such as sentence splitting and reordering are often neglected. Many only achieve high meaning preservation by directly copying the input sentence. However, the closed-weight models investigated here behave very differently: they produce close to 10% more splitting, lexical simplification, and reordering than open-weight ones, while simultaneously performing fewer deletions. This leads to a greater degree of paraphrasing.
By Domain When comparing performance between different domains, we observe that all LLMs do significantly better on the general encyclopedic texts in ASSET in terms of both simplicity and meaning preservation, while also exhibiting a diverse set of edit operations. Although outputs for NEWSELA contain more hallucinations, meaning preservation is still fairly high. Outputs for MED-EASI, on the other hand, have the lowest meaning preservation by far and the least diverse set of edit operations. We find that MED-EASI outputs, along with others that do not preserve meaning, often contain repetitions, hallucinations, and in some cases even copy the input prompt, demonstrating a tendency to disregard the instruction and thus fail to complete the task. These failure modes are most frequently observed from the smaller T5 models, but are also exhibited by models such as LLaMA when evaluated on MED-EASI.

Discussion
We discuss our results around the following aspects: the access level of the simplification models (open- vs. closed-weight), the training strategies (general pre-training vs. general fine-tuning strategies), and the utility of automatic metrics.

Figure 4: BERTScore, computed between the system output and reference sentence(s), correlates strongly with Levenshtein similarity, computed between the source sentence and system outputs. This indicates that BERTScore tends to reward minimally edited sentences. Levenshtein similarity is computed with the EASSE package (Alva-Manchego et al., 2019a).
Access Level Among the OpenAI models, we observe that all models perform particularly well on meaning preservation according to BERTScore but exhibit considerable differences in their ability to simplify, as indicated by SARI for 'weaker' models such as Ada-001. Among the evaluated open-weight models, we observe that the Flan models (T5 and UL2) typically perform competitively, punching well above their weight and matching much larger decoder-only models despite far smaller parameter counts. This is a promising finding for the category of open-weight models, and we hope that this encourages future work to continue investigating different methods regardless of the model size.
Training Strategies Within model families, when comparing base models to their instruction fine-tuned counterparts, we observe that instruction-tuning typically leads to better performance in our few-shot ICL setting for TS. We find this to be particularly encouraging since TS is one task often hindered by the lack of high-quality labeled training data (Stajner, 2021).
Nevertheless, improvement is not always guaranteed, as seen when comparing BLOOM with BLOOMZ. In this case, instruction fine-tuning leads to better meaning preservation but a reduction in the degree of simplification, indicating that the instruction-tuning method used to derive the multilingual BLOOMZ may be less suitable for English TS. This stands in stark contrast to the Flan instruction-tuning method, which delivers considerable gains in both SARI and BERTScore despite sharing the same underlying instruction-tuning dataset as BLOOMZ. Therefore, we hypothesize that this drop in performance may be influenced by the multilingual instruction-tuning setup that is unique to BLOOMZ.
Utility of Automatic Metrics Overall, we find SARI and BERTScore to be useful automatic evaluation metrics for inspecting the trade-off between simplicity and meaning preservation (see Figure 6 in the Appendix). In general, closed-weight models often strike a more optimal balance. This is also supported by our qualitative analysis, which confirmed that these models rely less on deletion, an oft-overused operation (Devaraj et al., 2022), and more on other edits (e.g. paraphrasing or splitting).
Furthermore, our qualitative analysis shows that outputs with higher BERTScores tend to be minimally simplified, often copying the entire input text. We validate this by studying the relationship between BERTScore (computed between the system output and the reference sentence(s)) and Levenshtein similarity (computed between the system output and the original input sentence). Figure 4 reveals a strong positive correlation across all datasets, indicating that BERTScore tends to reward minimally simplified responses. For some of the closed-weight models, which tend to perform a greater degree of paraphrasing, this leads to lower BERTScores, while models that perform more copying are rewarded. Overall, the results from our qualitative study generally agree with those from our automatic evaluation metrics, particularly SARI, BERTScore, and LENS. The qualitative analysis also enabled us to pinpoint specific operations, such as reordering, and identify issues, notably hallucinations, in system outputs.
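The following sketch shows one way this correlation check could be reproduced; the use of the python-Levenshtein and scipy packages, the character-level normalization, and the function names are illustrative assumptions rather than the exact setup behind Figure 4 (which uses EASSE's Levenshtein similarity).

```python
# Hypothetical sketch: correlate BERTScore (output vs. reference) with
# Levenshtein similarity (output vs. source) to test whether BERTScore
# rewards minimally edited outputs.
import Levenshtein                    # pip install python-Levenshtein
from bert_score import score as bert_score
from scipy.stats import pearsonr

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - Levenshtein.distance(a, b) / longest

def copying_vs_bertscore(sources, outputs, references):
    """sources/outputs/references: parallel lists of strings (single reference each)."""
    _, _, f1 = bert_score(outputs, references, lang="en")
    lev_sim = [levenshtein_similarity(src, out) for src, out in zip(sources, outputs)]
    r, p = pearsonr(f1.tolist(), lev_sim)
    return r, p
```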

Conclusion
In this paper, we provided a comprehensive assessment of how well out-of-the-box LLMs perform on the task of TS with few-shot in-context learning. We found that the best LLMs outperform state-of-the-art supervised TS baselines while also producing a more diverse set of simplification operations. We also established that closed-weight models perform better than open-weight ones and that general instruction-tuning often improves a model's abilities on TS. Furthermore, we empirically validated the trade-off between simplicity and meaning preservation through automatic evaluation and a manual analysis. Our analyses of multiple few-shot prompting strategies revealed that a more structured prompting format produces better results than presenting source-target examples in continuous text.
Our performance benchmark, BLESS, provides a strong foundation for future work. For example, it remains an open question as to which expressions and instructions are optimal for prompting LLMs to simplify texts. Furthermore, this work exclusively focused on few-shot in-context learning. Future work could explore the capabilities of these systems in zero-shot, fine-tuned, or retrieval-based settings.

Limitations
In this section, we discuss a few limitations of our work. First, we only considered English TS datasets, and it remains to be seen how these TS abilities transfer to languages other than English. Additionally, we selected only a handful of output samples for manual analysis across the three test datasets, and all annotations were performed by one of the authors and subsequently validated by another author independently. It will be necessary to perform this analysis at a larger scale to more accurately characterize the capabilities of each model for each domain and prompt. We further acknowledge the limits of the evaluation set itself. While we purposefully chose the test splits to cover a variety of domains, the test splits for all three corpora amount to 915 samples, which could potentially limit the statistical power of results obtained from the assessment. Additionally, two out of the three test sets contain only sentences as input, while the third contains short multi-sentence texts, so this assessment mostly applies to the subtask of sentence simplification. Finally, our findings confirm that proprietary, closed-source models can achieve a new state-of-the-art performance on the task of text simplification. However, very little is known about their training data, alignment strategies, and implementation behind paywalled APIs. Therefore, the comparison to open-source models, which have received no explicit training on the task and use an extremely bare-bones implementation, is potentially unfair.

Ethics Statement
This work is conducted in full awareness of and in line with the ACL Ethics Policy. In particular, this work contributes to the transparency and fairness of evaluation methodologies in line with Sections 1.1, 1.2, 2.7, and 2.9 of the code, which innately leads to avoiding seen and unseen harms (Sections 1.2, 1.4). We contribute to improving expertise in the domain of text simplification (Section 2.6). All models, datasets, and compute resources are used with permission and with concern for the appropriate access rights and licenses (Section 2.8). Our work contributes to the professional development of the research team (Section 3.5) and more widely benefits the research community and wider society (Section 3.1) by augmenting the understanding of the capacity of LLMs on the specific task of TS.

A Model Details
In this section, we describe each type of LLM we use in our experiments.

A.1 Open-weight Models
As a brief disclaimer, we note that some listed models are not truly "open-weight" and may require special permission to obtain weights for self-hosting. Further, in our descriptions, we do not distinguish between different variations of the same model. We provide the details of the training data and model sizes in Table 4. We consider both encoder-decoder and decoder-only models for our evaluation, as discussed below.
A.1.1 Encoder-Decoder Models
T5 Family We evaluate a range of model variants derived from the original T5 models (Raffel et al., 2020). Originally, training recipes for T5 employ pre-training with a span-infilling objective and are thus not suitable for left-to-right generation tasks off the shelf. We thus use the T5-LM-adapted models from Lester et al. (2021), which have undergone continued pre-training using a standard LM objective.
One later derivation includes the instruction-tuned variant Flan-T5 (Chung et al., 2022), which continues training from the aforementioned T5-LM-adapted checkpoints and uses a wide variety of labeled data for instruction fine-tuning. Notably, the dataset description by Chung et al. (2022) does not include any reference to simplification-related tasks. Similar parallel efforts led to the creation of the T0 models (Sanh et al., 2022).
Finally, UL2 (Tay et al., 2023) proposes a more diverse set of pre-training objectives beyond simple span corruption. Additional tasks include sequence distortion and extreme span corruption.

A.1.2 Decoder-only Models
GPT-J/GPT-NeoX Early reproduction efforts of large-scale GPT-style models started following the surge in popularity of GPT-2 (Radford et al., 2019). For our benchmark, we include models published by EleutherAI, namely the 6 billion parameter variant of GPT-J (Wang and Komatsuzaki, 2021) and the 20 billion parameter version of GPT-NeoX (Black et al., 2022). Both models were trained with a standard LM pre-training objective and were not fine-tuned to follow instructions.
OPT/LLaMA Large-scale decoder-only models developed by researchers at Meta AI were released under the OPT label (Zhang et al., 2022c) and, more recently, under the LLaMA label (Touvron et al., 2023).

A.2 Closed-Weight Models
As the current primary choice for commercial solutions, we benchmark a range of models by OpenAI. Previous publications regarding the GPT family (Radford et al., 2018, 2019; Brown et al., 2020) establish that these models (Ada/Babbage/Curie/Davinci) are decoder-only, with varying numbers of parameters. In Table 5, we report the total costs incurred for all three inference prompts and three seeded runs, totalling nine inference runs per dataset. Prices listed correspond to those for the API-based models available from April through June 2023. All prices are in USD.

B Supplemental Results
Tables 6, 7, and 8 show the full results for all models on ASSET, MED-EASI, and NEWSELA, respectively.

B.1 Details on Evaluation Metrics
A variety of automatic evaluation methods have been proposed. Commonly used automatic metrics like BLEU (Papineni et al., 2002) and SARI (Xu et al., 2016) can provide insights into how similar a model's outputs are to a set of gold reference simplifications. However, to more precisely understand a model's strengths and weaknesses, finer-grained evaluation is often required. For example, one can calculate the distribution of edit simplification operations (e.g. additions and deletions) (Vásquez-Rodríguez et al., 2021a).

Table 9: Annotation examples from a SARI-based model ranking. S: Simplification, P: Paraphrasing, L: Lexical Simplification, and MP: meaning preservation. We highlight lexical simplification in bold and conflicts in meaning preservation in red.
Reference: They are rivaled only by chitin in toughness.
GPT-3.5-Turbo: Chitin is the only biological material that rivals them in toughness. (S↑ P+ L+)
Davinci-003: Chitin is the only biological material tougher than them. (S↑ P+ L+)
Davinci-002: They are tough like chitin, which is the toughest known biological material. (MP↓ P+ L+)
Flan-UL2: They are only second to chitin for biological materials. (MP↓ P+ L+)
Flan-T5-large: Chitin is better than human materials in toughness. (MP↓ P+ L+)
Figure 5: Token-level edit operations computed for all models and test sets using prompt 2. For most models, the edit operations performed in ASSET and NEWSELA reflect those in the gold reference simplifications. However, on the MED-EASI dataset, we observe a sudden spike in insertions from all LLMs except for the OpenAI and Flan models. These additions indicate the presence of potentially unrelated hallucinated tokens and endless generations, which aligns with the low BERTScore results. We regard this failure case to be related to the fact that MED-EASI presents a challenging domain which is out of the distribution of most general-purpose models.
Figure 6: Adequacy-simplicity trade-off as exhibited by a limited set of models on each of the three datasets. On ASSET, higher SARI is associated with lower BERTScore. In the case of MED-EASI, we can see that smaller models, which often tend to copy the input sentence, are rewarded by BERTScore but punished by SARI. Here, only the closed-weight OpenAI models exhibit a favorable balance between the two metrics. On NEWSELA, the relationship is more linear. We suspect that this is influenced by the fact that reference sentences are taken from multiple simplification levels (1-4) and therefore cover a broader range of possible rewrites, some with more simplifying edit operations (rewarded by SARI) and some with fewer (rewarded by BERTScore).
Figure 7: Visualizing LLM performance for select models, generated using prompt 2. This visualization corresponds to the results reported in Table 2 for ASSET. Models on the x-axis are ordered by model family, and within each model family, they are ordered by size (ascending).

Figure 1: Prompts used for LLM text simplification. The blue boxes contain the task instructions. Orange boxes show how the few-shot examples are presented to the model, and yellow boxes contain the prefix for the model to continue. Prompt 0 uses a basic instruction adapted from Feng et al. (2023), followed by a list of N few-shot examples before the input sentence to be simplified. Prompt 2 repurposes the instructions from Alva-Manchego et al. (2020a) that were provided to crowdworkers in the creation of the ASSET dataset. Similarly to prompt 0, its few-shot examples are presented in a structured format.

Figure 2: Impact of prompt selection on SARI and BERTScore for all models on ASSET. Prompts 0 and 2 achieve improved meaning preservation over prompt 1.

Figure 3: Distribution of token-level edit operations produced by the best-performing LLMs.

Table 5: Pricing information for OpenAI's API models.

Table 7: Simplification Results on MED-EASI.