Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA

Large language models (e.g., GPT-4) are uniquely capable of producing highly rated text simplification, yet current human evaluation methods fail to provide a clear understanding of systems' specific strengths and weaknesses. To address this limitation, we introduce SALSA, an edit-based human annotation framework that enables holistic and fine-grained text simplification evaluation. We develop twenty-one linguistically grounded edit types, covering the full spectrum of success and failure across dimensions of conceptual, syntactic and lexical simplicity. Using SALSA, we collect 19K edit annotations on 840 simplifications, revealing discrepancies in the distribution of simplification strategies performed by fine-tuned models, prompted LLMs and humans, and finding that GPT-3.5 performs more quality edits than humans but still exhibits frequent errors. Using our fine-grained annotations, we develop LENS-SALSA, a reference-free automatic simplification metric, trained to predict sentence- and word-level quality simultaneously. Additionally, we introduce word-level quality estimation for simplification and report promising baseline results. Our data, new metric, and annotation toolkit are available at https://salsa-eval.com.


Introduction
Text simplification aims to improve a text's readability or content accessibility while preserving its fundamental meaning (Stajner, 2021; Chandrasekar et al., 1996). Traditional human evaluation for text simplification often relies on individual, shallow sentence-level ratings (Sulem et al., 2018b; Alva-Manchego et al., 2021), which are easily affected by the annotator's preference or bias. Maddela et al. (2022) recently proposed a more reliable and consistent human evaluation method that ranks and rates multiple simplifications together. However, as text simplification involves performing a series of transformations, or edits, such as paraphrasing, removing irrelevant detail or splitting a long sentence into multiple shorter ideas (Xu et al., 2012), sentence-level scoring remains difficult to interpret since it does not reflect fine-grained information about the types of edits being performed. Fine-grained human evaluation through span selection has been explored for machine translation (Lommel et al., 2014) and open-ended text generation (Dou et al., 2022). Yet, these evaluation methods are error-driven, i.e., they focus solely on evaluating failure, which punishes creative and diverse generations with minor errors in favor of generic ones. Additionally, machine translation and open-ended generation tasks usually retain none of the input words, while text simplification must balance the editing and preservation of words in the original input (Xu et al., 2016). We thus evaluate simplification quality as the aggregation of edit successes and failures (see Figure 1).
We introduce SALSA - Success and FAilure-driven Linguistic Simplification Annotation - an edit-level human evaluation framework capturing a broad range of simplification transformations. SALSA is built on a comprehensive typology (§3) encompassing 21 quality (e.g., elaboration, generalization, paraphrasing), error (e.g., hallucination, coreference deletion), and trivial (e.g., adding articles such as "the") edit types. To enable annotation with SALSA, we develop an easy-to-use interface and tutorial. Using SALSA, we collect 13K edit annotations from 700 simplifications written by five state-of-the-art language models and two humans. With these annotations, we conduct a large-scale analysis of model and automatic metric performance, and further introduce word-level quality estimation for simplification.
Our main findings are as follows:
• Few-shot GPT-3.5 simplification far surpasses other models, particularly in sentence-level syntax and content editing. However, its simplifications are not tuned to the types of operations performed in human simplification. (§5)
• Some fine-tuned models, such as MUSS (Martin et al., 2022), produce more diverse edits than GPT-3.5, yet suffer from incredibly high error rates, while others (T5, Raffel et al., 2020) learn to minimize loss by making very few changes. (§5)
• Compared to lexical and syntax edits, edits modifying sentence content, such as generalization and elaboration, are difficult for current automatic metrics to evaluate. (§6)
• Fine-tuned on SALSA annotations, our reference-free metric, LENS-SALSA, captures the subtleties of different simplification approaches more accurately than existing metrics. (§6)
• Leveraging our data, we present the word-level quality estimation task for text simplification and establish initial baselines for future modeling efforts. (§7)
Our results demonstrate SALSA provides an interpretable and exhaustive evaluation of text simplification. We release our interactive annotation interface, annotator training material, and data at http://salsa-eval.com to facilitate further development of text generation models, automatic metrics, and edit-based tasks.

Related Work
Model Evaluation. Simplification work broadly agrees that some typology of simplification operations exists (Siddharthan, 2014), starting with early rule-based systems which explicitly defined specific syntax operations (Dras, 1999). Past work has experimented with designing models to control the extent of each operation by using a pipeline to perform each operation independently (Maddela et al., 2021; Raffel et al., 2020), predicting edit operations (Dong et al., 2019), or augmenting fine-tuned models with learned control tokens (Martin et al., 2020, 2022). However, evaluation only considers a sentence in its entirety rather than rating individual operations, either using automatic metrics, shown to be an inadequate representation of quality (Alva-Manchego et al., 2021; Sulem et al., 2018a), or surface-level Likert ratings, typically asking crowd-sourced annotators to rate on scales of fluency, adequacy and simplicity. These scores are difficult to interpret as independent dimensions of quality and capture no detail about the type of simplification being written (Briakou et al., 2021; Hashimoto et al., 2019). Additionally, despite current systems often producing simplification errors (Choshen and Abend, 2018), annotating errors has primarily been performed through manual inspection and has not been incorporated into human or automatic evaluation (Gooding, 2022).

Linguistic Inspection. Manual inspection attempts to understand the behavior of simplification models or datasets, characterized by detailed typologies and often conducted by authors or domain experts. Cardon et al. (2022) perform a detailed inspection of the ASSET simplification test corpus and use their data to study the behavior of automatic metrics. Stodden and Kallmeyer (2022) and Jiang et al. (2020) propose interactive linguistic inspection interfaces for sentence alignment and corpus annotation. However, these interfaces are not designed for human evaluation of model outputs and do not provide edit-level ratings for measuring performance.
Fine-grained Human Evaluation. Human evaluation performed on a span level has been previously proposed for a variety of NLP tasks. In translation, the Multidimensional Quality Metrics (MQM) framework (Lommel et al., 2014) categorizes errors into accuracy and fluency sub-types and was later extended by Freitag et al. (2021) to weight errors by severity and combine them into a single quality score. Dou et al. (2022) propose SCARECROW to capture errors appearing in open-ended text generation. However, as these span-based evaluation schemes exclusively annotate error, they encourage generic generations and punish interesting or diverse outputs. For summarization, the FRANK typology (Pagnoni et al., 2021) aggregates errors into broader categories to benchmark metrics that measure factuality. Inspired by FRANK, Devaraj et al. (2022) introduce a framework to evaluate factuality for text simplification. To our knowledge, error-driven evaluation in text simplification has not yet been proposed beyond the context of factuality.

The SALSA Framework
We introduce SALSA, an edit-based human evaluation framework for text simplification defined by a typology of 21 linguistically grounded edit types, with the aim of capturing both successes and failures (i.e., both quality changes and errors; see Fig. 1) by evaluating each edit. Our annotation pipeline consists of: edit selection (§3.1), categorizing each edit's impact on sentence information (§3.2), classifying the fine-grained edit type (§3.3), and rating its efficacy or severity (§3.4). We implement our SALSA framework through an interactive annotation interface (Fig. 2). The SALSA typology is organized as a decision tree, as illustrated in Figure 3, where annotators only answer 3-4 intuitive questions about each edit.

Edit Selection
We formulate edit selection as a sequence tagging problem similar to phrase alignment (Yao et al., 2013; Lan et al., 2021), but different in that (1) spans are labeled by the primitive edit operation being performed: either a single-operation insertion, deletion, substitution, or word-/clause-reorder, or a multi-operation sentence split or structure change; and (2) a single span may belong to multiple edit operations to account for overlapping edits. An insertion or deletion edit exclusively modifies content, while a substitution either modifies or paraphrases content. A reorder, split or structure edit exclusively performs a content-agnostic syntax transformation. As split and structure edits are multi-operation (i.e., require a combination of primitive operations to perform), they are defined by a set of underlying single-operation constituent edits. For example, a change from passive to active voice via a structure change written by zero-shot GPT-3.5 involves an insertion, a substitution, a reorder, and four deletion edits.
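To make this formulation concrete, the sketch below shows one way selected spans could be represented programmatically; the class and field names are illustrative assumptions, not the data format of the released SALSA toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Primitive operations from Section 3.1; split and structure are multi-operation.
SINGLE_OPS = {"insertion", "deletion", "substitution", "reorder"}
MULTI_OPS = {"split", "structure"}

@dataclass
class Edit:
    operation: str                         # e.g. "substitution" or "structure"
    complex_spans: List[Tuple[int, int]]   # character offsets in the complex sentence
    simple_spans: List[Tuple[int, int]]    # character offsets in the simplification
    constituents: List["Edit"] = field(default_factory=list)  # only for split/structure

    def is_multi_operation(self) -> bool:
        return self.operation in MULTI_OPS

# A passive-to-active structure change bundles its underlying single-operation
# edits as constituents; the offsets below are placeholders.
voice_change = Edit(
    operation="structure",
    complex_spans=[(0, 52)],
    simple_spans=[(0, 47)],
    constituents=[
        Edit("insertion", [], [(0, 10)]),
        Edit("substitution", [(11, 20)], [(11, 18)]),
        Edit("reorder", [(21, 35)], [(19, 33)]),
    ],
)
```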

Categorizing by Information Change
Each selected edit is then labeled with its impact on the underlying sentence information: less, same, more or different information. Given the type of operation and the change to information, we subsequently organize each edit into three linguistic families as defined by Siddharthan (2014): Lexical edits perform simple changes in "wording". This includes paraphrasing (i.e., a substitution that keeps the same information) and inconsequential trivial changes (e.g., inserting 'the'). Syntax edits capture transformations to the distribution of information, rather than its substance. A split converts a candidate sentence into two sentences, a re-order edit re-arranges clauses or wording within a clause, and a structural edit modifies the voice, tense or clausal structure. Examples of structural edit sub-types are in Appendix B. Conceptual edits modify the underlying ideas conveyed by the text. A successful conceptual edit requires elaboration to add clarifying information absent from the input or generalization to delete unnecessary/complicated ideas. Therefore, a substitution, insertion or deletion may alter the content.

Edit Type Classification
After being categorized into lexical, syntax or conceptual edit families, we further classify each edit operation into one of 21 fine-grained success (quality), failure (error) or trivial edit types (see Fig. 3). Each specific edit type may only be introduced by certain operations (e.g., a deletion cannot introduce a hallucination error). Successful edits have only one 'type' of success, but a failed edit may introduce multiple error types. For example, a successful information insertion will always be elaboration, but an unsuccessful information insertion may be one of four errors. We also separately identify edits containing a grammar error, as sentence grammar is independent of its semantics (Chomsky, 1957). Appendix A enumerates each SALSA edit type.
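As a rough illustration of how the typology's first stages combine, the function below maps an edit's operation and information change to its linguistic family. This is a simplified sketch under the definitions in §3.2-3.3, not the full 21-type decision tree used in the annotation interface.

```python
def edit_family(operation: str, information_change: str) -> str:
    """Map an edit to its linguistic family, following Section 3.2.

    information_change is one of: "less", "same", "more", "different".
    This is a coarse, family-level sketch of the typology in Figure 3.
    """
    if operation in {"split", "reorder", "structure"}:
        # Content-agnostic syntax transformations.
        return "syntax"
    if operation == "substitution" and information_change == "same":
        # Paraphrases keep the same information -> lexical simplification.
        return "lexical"
    if information_change in {"less", "more", "different"}:
        # Insertions, deletions, and content-modifying substitutions.
        return "conceptual"
    # Remaining same-information insertions/deletions (e.g. adding "the").
    return "lexical (trivial)"
```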

Rating Edit Efficacy / Severity
As each quality and error edit has a varying degree of impact on the overall simplification quality, we define three levels to measure the efficacy of quality edits and the severity of error edits: 1 - minor, 2 - somewhat, and 3 - major.

Overall simplification score. Similar to MQM evaluation in machine translation (Lommel et al., 2014), we collapse edit annotations into a single simplification score to allow for direct system comparison. We calculate the sentence-level simplification score score(S) as a weighted sum of edit ratings:

score(S) = Σ_{e ∈ E} w(e) · r(e) · (len(e_C) + len(e_S)) / (len(C) + len(S))

where S is the simplification of complex sentence C, E is the set of edits, e_C and e_S are the parts of edit e performed on C and S respectively, w(e) is the edit weight, r(e) is the edit rating (severity / efficacy), and len denotes character length. For the weight scheme w(e), we fit a linear regression model, using the sentence-level human ratings gathered in SIMPEVAL 2022 (Maddela et al., 2022) as a gold standard, as reported in Table 1. The absolute values of the quality weights are generally higher than the error weights, as simplifications tend to make far more quality edits than error edits in all three linguistic families (see Figures 4 and 5 in §5). However, the weight for syntactic simplification errors is far larger in magnitude than the others (-5), as these errors often completely disrupt the sentence. As the type of simplification depends on the needs of each particular user group (Stajner, 2021), weights could be adjusted according to the simplification domain (Cemri et al., 2022; Basu et al., 2023) or use case (Trienes et al., 2022).
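A minimal sketch of this aggregation, assuming edits are stored as dictionaries with a type, a rating, and the character lengths of their spans; the names are illustrative, not the released toolkit's API.

```python
def salsa_score(edits, len_complex: int, len_simple: int, weights: dict) -> float:
    """Sentence-level SALSA score: a coverage-weighted sum of edit ratings.

    Each edit carries its type (to look up w(e)), its efficacy/severity
    rating r(e) in {1, 2, 3}, and the character lengths of its spans on the
    complex (e_C) and simplified (e_S) sides. The sign of each contribution
    comes from w(e), which is negative for error types.
    """
    denom = len_complex + len_simple
    total = 0.0
    for e in edits:
        coverage = (e["len_e_complex"] + e["len_e_simple"]) / denom
        total += weights[e["type"]] * e["rating"] * coverage
    return total

# Weights w(e) are fit with a linear regression against the sentence-level
# human ratings from SIMPEVAL 2022; the values below are placeholders.
example_weights = {"paraphrase": 1.0, "generalization": 1.2, "bad deletion": -2.0}
```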

Human Annotation
We describe our use of SALSA to collect 13,180 edit ratings across 2,100 human annotations on 700 simplifications written by five state-of-the-art models and two humans.

Simplification Data
We collect annotations on SIMPEVAL 2022 (Maddela et al., 2022), a challenging simplification benchmark with 360 simplifications written by four state-of-the-art models and two humans on 60 manually selected complex Wikipedia sentences originally written between Oct 2022 and Nov 2022. We further expand the dataset with 40 additional sentences from Wikipedia written in Dec 2022 and add simplifications from a fine-tuned T5-11B. As these sentences are selected to be more complex than those in previous simplification benchmarks, they allow systems to demonstrate their full capabilities in performing different simplification operations. Our SIMPEVAL 2022 inputs contain significantly longer sentences (µ = 37.87, σ = 12.73) than the previous ASSET benchmark (µ = 19.72, σ = 7.95).
Simplification Systems. We aim for a broad coverage of model approaches: MUSS (Martin et al., 2022), a BART-large model conditioned on explicit parameter tokens from Martin et al. (2020), fine-tuned on Wiki-Large (Zhang and Lapata, 2017) and mined paraphrase data. MUSS was the state-of-the-art model before GPT-3.5. T5 (Raffel et al., 2020), an encoder-decoder transformer pre-trained on 745 GB of web text. We use the T5-3B and T5-11B variants and fine-tune them on the aligned Wiki-Auto dataset (Jiang et al., 2020), shown to be of higher quality than Wiki-Large. GPT-3.5, a series of GPT-3 models pre-trained on text and code dated before Q4 2021. We use the best available text-davinci-003 model, based on InstructGPT (Ouyang et al., 2022), fine-tuned with human demonstrations and reinforcement learning from human feedback. We include both zero- and few-shot (5-shot) generation, using the same prompt setup as SIMPEVAL 2022. Humans. We ask two in-house annotators to write simplifications for the 40 newly selected sentences, replicating the instructions used in SIMPEVAL 2022.
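For reference, a rough sketch of how the zero- and few-shot GPT-3.5 generations could be reproduced with the legacy OpenAI completions client; the instruction text and decoding settings below are placeholders, since the experiments reuse the SIMPEVAL 2022 prompt, which is not shown here.

```python
import openai  # legacy (pre-1.0) client, matching the text-davinci-003 era

def simplify(complex_sentence: str, demonstrations=None) -> str:
    """Zero-shot if demonstrations is None, otherwise few-shot (5-shot in the paper).

    The instruction below is a stand-in; the paper reuses the SIMPEVAL 2022
    prompt setup rather than this wording.
    """
    prompt = "Rewrite the following sentence so that it is easier to read:\n\n"
    for complex_demo, simple_demo in (demonstrations or []):
        prompt += f"Complex: {complex_demo}\nSimple: {simple_demo}\n\n"
    prompt += f"Complex: {complex_sentence}\nSimple:"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,  # decoding settings are an assumption, not reported values
    )
    return response["choices"][0]["text"].strip()
```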

Collecting Human Annotations
As crowd-sourced annotators have been shown to have inconsistent quality (Shmueli et al., 2021), we hire 6 undergraduate students from a US university. All annotators were native English speakers and paid $15 / hour. Annotators were trained with an in-depth tutorial consisting of broad explanations of simplification concepts, over 100 examples covering each of the 21 SALSA edit types, and interactive exercises. After finishing the tutorial, annotators completed two rounds of onboarding annotations and were provided feedback by the authors. To concretely measure agreement for each stage of the SALSA framework, we collect annotations in three stages: (1) three annotators select edits, (2) a fourth annotator adjudicates the edits into a single selection, and (3) the original three annotators classify and rate the adjudicated edits. During each stage, the authors closely monitor each set of annotations to ensure quality and continually provide feedback to annotators. The final result is 2,100 annotations, with a single annotation taking 4.23 minutes on average. Figure 2 illustrates our annotation interface, with further screenshots of our tutorial included in Appendix G.

Inter-Annotator Agreement
Figure 5: Failed edits on SIMPEVAL per model, organized by edit type. Compared to humans, both GPT-3.5 setups make more syntax and lexical errors. Although humans perform bad deletion errors at a higher frequency than GPT-3.5, this reflects the inherent ambiguity in judging the relevancy of the deleted content. T5-3B, T5-11B and MUSS (w.r.t. syntax edits) make fewer errors than GPT-3.5 simply because they perform fewer edits overall.

We calculate edit selection agreement per token, as a single token may be annotated with multiple edits simultaneously, with Table 2 reporting agreement per edit, further organized by the type of information change. Agreement is highly dependent on the edit type: we observe high agreement for deletion (α=0.75), paraphrase (substitution with the same information, α=0.53), and sentence split (α=0.66) edits. We also find low agreement for substitution with more information (α=0.15), due to the subjectivity among annotators in determining whether new tokens contain 'novel' information, as it was often confused with insertion. Disagreement on reordering (α=0.12) and structure (α=0.25) may be attributed to their low frequencies and the ambiguity of overlapping syntactic and content edits, as the highly compressed SIMPEVAL outputs often make substantial edits whose annotations have multiple correct interpretations. We also report the percentage of cases in which two and three annotators agree, which we find is similar to fine-grained evaluation frameworks in other text generation tasks (Dou et al., 2022). We include further agreement analysis and examples in Appendix D.
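Token-level agreement could be computed as sketched below, treating each token as a binary item per edit type; the use of the krippendorff Python package and the exact data layout are assumptions for illustration, not necessarily the paper's implementation.

```python
import numpy as np
import krippendorff  # pip install krippendorff

def token_level_alpha(annotations, n_tokens: int, edit_type: str) -> float:
    """Per-token Krippendorff's alpha for one edit type, in the spirit of Table 2.

    `annotations` holds, per annotator, a list of (edit_type, token_indices)
    selections; every token becomes a binary item (covered by this edit type
    or not), since a single token may belong to several edits at once.
    """
    reliability = []
    for annotator_edits in annotations:
        row = np.zeros(n_tokens)
        for etype, token_indices in annotator_edits:
            if etype == edit_type:
                row[list(token_indices)] = 1
        reliability.append(row)
    # Rows are annotators, columns are tokens; nominal level for binary labels.
    return krippendorff.alpha(reliability_data=np.array(reliability),
                              level_of_measurement="nominal")
```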

Simplification Systems Edit Analysis
We aggregate our SALSA annotations to explore patterns in fine-tuned, LLM- and human-written simplifications. Figures 4 and 5 summarize the frequency of quality and error edit types. As edits vary in size, we also calculate edit coverage, the length of each edit in proportion to the entire simplification, using the same method as §3.4. Figure 6 reports edit coverage for different edit efficacy and severity ratings. We average the annotations of both human simplifications for our analysis. The following are our main findings:

All systems make far more quality than error edits, but these errors are sparse (Fig. 4 and 5).
We observe only 16% of these models' edits were errors, but these errors were distributed across all simplifications. 73% of simplifications generated by MUSS have at least one error, compared to 62% / 56% by zero-/few-shot GPT-3.5. Human simplifications have the lowest error rate of 48%.
Humans mainly produce bad deletion errors, which are often subjective (Fig. 5). After excluding bad deletion errors, humans' error rate drops from 48% to 25%, compared to few-shot GPT-3.5 only decreasing from 56% to 43%. The only anomaly in errors is bad deletion, which may be attributed to the subjectivity in judging deletions:

EXAMPLE
Human written: Unlike the first film adaptation, in which director Samuel Fuller removed... → Unlike the first film adaptation, Samuel Fuller removed...

We see that some annotators mark this as a bad deletion while others consider it appropriate, as they believe this information is not entirely relevant since the sentence is communicating whether a film is a meaningful adaptation of a book.
Fine-tuned T5-3B and T5-11B generate conservative simplifications (Fig. 4, 5, and 6). Compared to all other systems, both T5 models make minimal changes while still exhibiting high rates of error. This is likely because their training data, Wiki-Auto, contains shorter sentences that usually require simpler simplification techniques, making it difficult for the models to generalize to longer and more complex sentences. This underscores the need for explicit simplification design in fine-tuned models, such as the control tokens (Martin et al., 2020) used by MUSS.
GPT-3.5 writes quality edits at a higher frequency than humans, but human edits are longer and more effective (Fig. 4 and 6). Both zero-shot and few-shot GPT-3.5 make a larger number of content (elaboration and generalization) edits, but humans make longer edits and a higher percentage of high-efficacy edits. Human simplification typically inserts or deletes entire clauses, while GPT-3.5 edits single modifiers or words, which have less impact on sentence quality or simplicity.
Models elaborate, while humans generalize (Fig. 4). When simplifying content, all models (excluding T5) tend to elaborate at a higher ratio than humans; for example, GPT-3.5 attempts to insert content 17% more often. As LLMs have been shown to encode world knowledge in their parameters (Petroni et al., 2019; Brown et al., 2020), GPT-3.5's elaborations are far more effective than those of MUSS.

Split edits are straightforward, structure edits are far more complex (Fig. 4 and 5). Surprisingly, sentence split is shown to be the easiest edit for all models to accomplish, with a similar number made by MUSS, GPT-3.5, and humans, and even the conservative T5 models making a comparable number of split edits. However, the more complex structure and re-ordering edits are rarely seen in fine-tuned models; we speculate this may be attributed to (i) SIMPEVAL's sentences being more compressed than the models' training data and (ii) GPT-3.5's unique ability to perform complicated syntax rewriting, also reflective of findings in abstractive summarization (Goyal et al., 2022). Despite GPT-3.5's improvement, the structure error rate demonstrates it has not yet reached human-level ability. Additionally, we observe zero-shot GPT-3.5 produces structure errors (see the example below) at a rate 19% above few-shot.

EXAMPLE
Zero-shot GPT-3.5: The sentence included a fine of $400... → You will receive a fine of $400...

We also find human simplifications are more conservative with re-ordering than models, but their attempts to simplify using re-ordering often appear arbitrary:

EXAMPLE
Human written: On 3 November 2022, the British Secretary... → On November 3rd, 2022, the British Secretary...

Paraphrasing is a crucial, but tricky, mechanism (Fig. 4 and 5). MUSS, GPT-3.5, and humans all paraphrase in at least 75% of sentences. Despite low performance in conceptual and syntactic simplification, MUSS paraphrases at a human-like rate, likely due to its training on over one million paraphrase sentence pairs mined from web crawl data. Although zero-/few-shot GPT-3.5 paraphrase at a higher rate than humans, their paraphrases are often unnecessary, as shown here:

EXAMPLE
Few-shot GPT-3.5: The club said on social media that customers subdued the gunman... → The club reported on social media that customers were able...

We include further discussion and analysis of edit-level evaluation with SALSA in Appendix E.

Automatic Metric Evaluation
While automatic metrics are traditionally evaluated based on their correlation with human ratings on high-level aspects such as semantic similarity and simplicity, their ability to capture the subtleties of lexical, syntactic, and conceptual simplification is not well understood. Using the comprehensive annotations we collected, we study how well current automatic metrics capture these distinct simplification approaches. Additionally, we introduce LENS-SALSA, a reference-free metric fine-tuned on SALSA annotations.
LENS-SALSA. The automatic simplification metrics mentioned above require human-written references, which may not exist in practice and are costly to collect. To this end, we introduce LENS-SALSA, a reference-free simplification metric enabled by the edit-level information provided by SALSA annotations. Inspired by the COMETKIWI machine translation metric design (Rei et al., 2022), LENS-SALSA is first pre-trained on the SIMPEVAL sentence-level scores using UniTE (Wan et al., 2022), a multi-task learning objective with three input formats: Simp + Ref, Simp + Comp, and Simp + Comp + Ref. We then fine-tune LENS-SALSA on SALSA annotations, with Simp + Comp as the input format, using a dual objective to predict both the sentence-level score (calculated by LENS) and the word-level quality score ŷ_i ∈ [−3, 3], i.e., the efficacy or severity rating of each word w_i.

Results. Table 3 reports the Pearson correlation of each metric with the human sub-scores across each SALSA dimension. We calculate the human sub-score for each dimension of simplification as discussed in §3.4. We find our LENS-SALSA is uniquely sensitive to SALSA edit-level ratings, despite not being trained to predict the SALSA sentence-level score. In fact, fine-tuning on word-level quality scores substantially improved performance (+0.07 correlation on all edits compared to no fine-tuning). Only LENS and LENS-SALSA obtain substantial correlation with human SALSA scores (0.27 and 0.34 respectively), with other metrics demonstrating spurious and even negative correlations with human judgements. Although trained on span-based MQM ratings, COMET-MQM fails to capture monolingual simplification quality, demonstrating the need for simplification-specific quality estimation. Despite their strong performance, we find LENS-based automatic metrics mainly evaluate lexical and syntactic simplification edits, rather than conceptual edits, which may be attributed to the SIMPEVAL training data consisting of shorter, paraphrase-based simplifications. Lastly, all metrics have a higher correlation with quality than error edits. We posit this is primarily due to the sparsity of errors exhibited in the generations of current high-performing systems.
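As a rough illustration of the dual objective described above, the sketch below combines a sentence-level regression loss with a word-level rating regression in PyTorch; the module, pooling choice, and loss weighting are illustrative assumptions rather than the actual LENS-SALSA architecture, which follows the COMETKIWI/UniTE designs cited above.

```python
import torch
import torch.nn as nn

class DualObjectiveHead(nn.Module):
    """Toy sketch of a joint sentence- and word-level regression head.

    `hidden` is an encoder output of shape (batch, seq_len, dim), e.g. from a
    COMETKIWI-style encoder over the Simp + Comp input. Pre-training on
    SIMPEVAL and other LENS-SALSA details are omitted.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.sentence_head = nn.Linear(dim, 1)   # predicts the LENS sentence score
        self.word_head = nn.Linear(dim, 1)       # predicts per-word rating in [-3, 3]

    def forward(self, hidden, sent_target, word_targets, word_mask, alpha=0.5):
        sent_pred = self.sentence_head(hidden[:, 0]).squeeze(-1)   # first-token pooling
        word_pred = self.word_head(hidden).squeeze(-1)
        sent_loss = nn.functional.mse_loss(sent_pred, sent_target)
        word_loss = nn.functional.mse_loss(word_pred[word_mask], word_targets[word_mask])
        # alpha balances the two objectives; the value used in practice is an assumption.
        return alpha * sent_loss + (1 - alpha) * word_loss
```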
Word-Level Quality Estimation

Word-level Quality Estimation (QE), defined as predicting the quality of each token in the output, carries substantial value in evaluating and refining text simplification. Although its utility is well explored in machine translation (Basu et al., 2018; Zerva et al., 2022), word-level QE has not been studied for text simplification due to a lack of appropriately annotated data. In this section, we leverage our SALSA annotations to demonstrate baseline approaches and show significant potential for future work. As the presence of deletion edits is exclusive to the complex sentence, the task setup is to classify each token in both the complex and simplified sentences as quality, error, or ok.
Data.We label each word by the average efficacy/severity rating of its associated edit: < 0 as error, = 0 as ok, and > 0 as quality.Words that are not part of any edits default to the ok label.
For edit types such as reorder and substitution that span both sentences, we only label the words in the simplified sentence, leaving the words in the original sentence with ok labels. Given that split and structure edits are composed of constituent edits, such as deletions and substitutions, we deconstruct them into their constituent edits before labeling. For tokens that appear in multiple edits, we use the lowest rating to assign the label. After this process, 6.8K, 1.8K, and 27K words are labeled as quality, error, and ok, respectively, for training.
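A minimal sketch of this labeling procedure, assuming each (constituent) edit is given as the token indices it covers together with a signed average rating (negative for errors); the function name and data layout are illustrative.

```python
from collections import defaultdict

def label_tokens(n_tokens: int, edits) -> list:
    """Assign word-level QE labels as described above.

    `edits` is a list of (token_indices, rating) pairs, after split and
    structure edits have been broken into their constituent edits; ratings
    are signed averages (negative for errors). Tokens covered by several
    edits take their lowest rating.
    """
    ratings = defaultdict(list)
    for token_indices, rating in edits:
        for i in token_indices:
            ratings[i].append(rating)

    labels = []
    for i in range(n_tokens):
        if i not in ratings:
            labels.append("ok")                 # not part of any edit
            continue
        r = min(ratings[i])                     # lowest rating wins
        labels.append("error" if r < 0 else "quality" if r > 0 else "ok")
    return labels
```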
Models. We fine-tune Transformer-based models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) to perform quality estimation as a sequence tagging task. During inference, we take the hidden state of the first token of each word as the input to the classification head. We also include LENS-SALSA, as one of its training objectives is to predict word-level quality. For additional implementation details, please refer to Appendix F.
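A sketch of how such a tagger might be set up with Hugging Face Transformers, keeping only the first subtoken of each word in the loss to mirror the first-token setup above; the model choice, label order, and preprocessing details are assumptions, not the paper's exact configuration, and the sketch handles a single sentence whereas the task tags both the complex and simplified sentences.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["ok", "quality", "error"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("roberta-base",
                                                        num_labels=len(LABELS))

def encode(words, word_labels):
    """Align word-level QE labels to subword tokens.

    Only the first subtoken of each word keeps its label (mirroring the
    'hidden state of the first token per word' setup); remaining subtokens
    are ignored in the loss via the -100 convention.
    """
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            labels.append(-100)   # special tokens and non-initial subtokens
        else:
            labels.append(LABELS.index(word_labels[word_id]))
        previous = word_id
    enc["labels"] = labels
    return enc
```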
Results. Table 4 shows the F1 scores separated by edit type. RoBERTa models perform better than BERT, even at a smaller size. The overall baseline performance mirrors current models in machine translation (Yang et al., 2022). Interestingly, while word-level QE offers substantial benefits for sentence-level metrics as shown in §6, the sentence-level objective does not provide similar benefits in return, potentially because the pre-training data only covers sentence-level QE. Given the imbalance in the label distribution, we posit data augmentation could improve performance in detecting error tokens.

Conclusion
In this work, we introduce SALSA, a novel human evaluation framework that incorporates edit-based labeling, error and quality evaluation, and dimensions of lexical, syntactic and conceptual simplification. We demonstrate SALSA's benefits in granularity, accuracy, and consistency. We use SALSA to collect a 13K edit annotation dataset on simplifications written by modern models as well as humans, and analyze the strengths and limitations of GPT-3.5, fine-tuned models, and human simplifications. Finally, we use SALSA annotations to develop the first reference-free automatic metric for text simplification and demonstrate promising baselines for word-level quality estimation, showing productive avenues for future work on fine-grained human evaluation, automatic metric development, and simplification error identification.

Limitations
While we demonstrate promising results on sentence-level evaluation, simplification is often a document-level task (Laban et al., 2021; Sun et al., 2021). Incorporating higher-level operations such as sentence fusion, paragraph compression, and reordering would require an extension to SALSA and presents unique analytical challenges. Additionally, detailed human evaluation inherently requires greater resources to produce a high granularity of annotations. While we show this process can be streamlined with robust annotator training, SALSA requires a similar amount of resources as widely used fine-grained evaluation in other tasks, such as MQM (Lommel et al., 2014) or FRANK (Pagnoni et al., 2021).

Ethics Statement
Our annotations were performed using the SIMPEVAL 2022 corpus, originally collected from publicly available Wikipedia articles (Maddela et al., 2022), and we further extend the dataset with complex sentences collected using the same methodology from publicly available Wikipedia articles. As discussed in §4.2, we perform data collection with in-house annotators from a US university. Annotators were paid $15-$18/hour. We took care to manually review all data prior to annotation to exclude any triggering or sensitive material from our annotation data. Annotators were informed that they were not required to annotate any data they felt uncomfortable with. Our interface was built using the open-source Vue.js library, and training of our added T5-11B system was implemented using the open-source Hugging Face Transformers library.

A Defining the SALSA Framework
We provide details on the SALSA framework, including qualitative examples which helped guide design decisions when building the typology. Table 5 illustrates each final edit type, as organized by Figure 3. During development, we adjusted our scheme based on preliminary annotations, with the goal that SALSA evenly represent all modes of simplification and the full space of errors.

A.1 Quality Evaluation
We organize quality edits by their approach to simplification, as real-world applications and models' capabilities to simplify fall into tiers of conceptual, syntactic and lexical simplification (Stajner, 2021). An ideal simplification system demonstrates a balance of these 'tiers' and incorporates different techniques depending on the original text, context and users (Gooding and Tragut, 2022). Automatic simplification research initially focused on lexical paraphrasing (Siddharthan, 2014), but has since evolved to emphasize the importance of syntactic and conceptual editing (Alva-Manchego et al., 2020b).

A.1.1 Conceptual Simplification
These edits modify the underlying sentence information or ideas, a prerequisite for simplifying complex domains.
Elaboration. An addition of meaningful, relevant and correct information (Siddharthan, 2006), such as clarifying vague terminology, providing background information on an entity or subject, or explicating general world knowledge unknown to the audience. Elaboration has been shown to be a rare but helpful mechanism in text generation (Cao et al., 2022), and we observe its careful use in human simplifications.
Generalization. A deletion of unnecessary, irrelevant or complicated concepts. Although we ask annotators to rate the quality of an elaboration by how it improves the readability of a sentence, we ask annotators to rate the quality of a generalization by the relevancy of the deleted information to the main idea of the sentence. As 'relevancy' is inherently subjective to the user, domain and annotator, determining the threshold for 'necessary information' is crucial to standardize (Devaraj et al., 2022). A deletion will, by nature, remove some amount of information; SALSA instead focuses on ensuring the deleted information is not important to the sentence, context or users. Consider two candidate deletions:

EXAMPLE
Like so many hyped books before it, The Midnight Library excited me and gave me pause.
Like so many hyped books before it, The Midnight Library excited me and gave me pause.
Although the deletion of Midnight is shorter, it changes the subject of the sentence, and it is rated higher than the second deletion, which removes information that is not central to the main idea. Generalization via paraphrasing is more often preferred than deleting full clauses.
We observe successful conceptual edits are often performed on the clause level. For example, adjunct removal via deletion:

EXAMPLE
Born into slavery in 1856, Booker T. Washington became an influential African American leader. → Booker T. Washington became an influential African American leader.
Or information insertion through an appositive or relative clause, although the former is typically more common for the SIMPEVAL domain as it implies objective information:

EXAMPLE
Éric Gauthier is also a novella author... → Éric Gauthier, famous for his soloist dancing career, is also a novella author...

A.1.2 Syntactic Simplification
Syntax is a crucial mechanism for fluent, highly modified simplification (Štajner, 2016). Given recent attention in automatic simplification to syntax-aware datasets and systems (Cumbicus-Pineda et al., 2021; Kumar et al., 2020; Alva-Manchego et al., 2020a; Scarton et al., 2017), SALSA standardizes the first explicit evaluation accounting for these operations.

Information Reorder. We classify two levels of reorder: word-level reorder, which reorganizes modifiers within a phrase, and component-level reorder, which moves clauses or content across a sentence (Siddharthan, 2006). A component-level re-order is typically accompanied by a broader structure change, or both re-order types may overlap, as in:

EXAMPLE
The emergence of huge radio conglomerates is a direct consequence of the '96 Act. → The '96 Act had a direct consequence of the emergence of huge radio conglomerates.
When faced with two equivalent phrases (e.g., 'A and B' → 'B and A'), SALSA classifies the re-ordered span as the phrase more significant to the main idea of the sentence. In practice, we found this to be a helpful guideline, although annotators often simply selected the phrase appearing first in the candidate sentence.
Structural Change. As this syntax modification necessarily includes some discourse-preserving edits (Gooding, 2022), structural changes are defined w.r.t. some combination of constituent edits (i.e., insertion, deletion, substitution, reorder). Further discussion of structure changes is in §B, with examples of the structural change sub-types used for manual inspection in Table 6.
Sentence Split. A sub-type of a structural edit. We automatically identify split changes prior to annotation, but annotators must first select constituent spans and then associate those spans with the corresponding sentence split. We find the importance of this edit is highly domain-dependent (Figure 15).

A.1.3 Lexical Simplification
Paraphrase. Swapping complex spans with equivalent, simpler alternatives (also referred to as hypernyms, e.g., Štajner, 2016) is the most primitive, yet important, approach to simplification (Qiang et al., 2020). These edits are exclusively defined by substitutions marked as same information and positive impact.
Trivial Change. Captures any minor modification to wording, either through a synonym replacement or an inconsequential change in wording (e.g., the, a). Trivial changes are identified as trivial insertion or trivial deletion. These edits differ from content or syntax modifications in that they add no new or major modification to the presentation of information. However, Meister et al. (2020) show that trivial changes should not be ignored, as they may modify the information density and verbosity of a sentence. An example is famously shown by Jaeger and Levy (2006):

How big is the family you cook for? → How big is the family that you cook for?
The relativizer 'that' creates no syntactic or conceptual simplicity, but adds clarity as to the identity of the subject. Trivial changes have previously been described with finer granularity, including sub-categories like abbreviation, filler words, compound segmentation, and anaphora (Stodden and Kallmeyer, 2022), or even changes in number/date formatting (Cardon et al., 2022), but we exclude these groups due to their sparsity and our focus on evaluating performance.

A.2 Error Evaluation
We describe the SALSA error typology, with examples of each type in Table 5. Despite their sparsity, errors have a far greater impact on fluency and adequacy than individual quality edits (Chen et al., 2022). We refined our definition of errors by focusing on minimizing the number of error types while retaining the ability to capture the full space of possible simplification failures. Notably, we specifically exclude a hallucination category due to its ambiguous definition in related work (Ji et al., 2023), and instead define our error categories to capture any possible hallucination. Identifying errors in text generation parallels work in Grammar Error Correction (GEC), particularly in balancing the editing and preservation of input words (Napoles et al., 2015). However, we refrain from adopting techniques from GEC for two reasons: (1) To our knowledge, detailed human evaluation has not been proposed for GEC. Current fine-grained evaluation typologies for GEC, such as M2 (Dahlmeier and Ng, 2012) or ERRANT (Bryant et al., 2019), are rule-based frameworks and are too detailed and rigid to be applied in a human evaluation setup. (2) GEC is an error-driven task (i.e., it solely focuses on minimizing grammar errors), while text simplification requires creative, quality edits (see §1). As a result, work in GEC typically relies on a set of objective rules defining "correct" grammar, while the task of simplification is inherently more subjective. While we use GEC to help define our grammar error category, we instead focus on developing SALSA based on frameworks in summarization and machine translation designed for fine-grained human evaluation.

A.2.1 Conceptual Errors
We identify six types of errors in content, with errors primarily related to information insertion.

Bad deletion. As the overwhelmingly most common error, a bad deletion removes content that is necessary and relevant to the main idea of the sentence. As discussed in §A.1.1, the threshold for 'relevancy' is ambiguous.

Coreference. More precisely a failure in coreference or anaphora resolution (Maddela et al., 2021), this error captures whether an explicit entity reference is removed. This error is only observed on a deletion of information.
Factual Error. We asked annotators to use their commonsense knowledge and limited research to evaluate factuality in edits. Unlike contradictions, these claims introduce information which must be externally verified beyond the sentence context. Although factual content is an established focus of summarization evaluation (Pagnoni et al., 2021; Maynez et al., 2020), adequately retaining information (i.e., minimizing bad deletion) is a far greater concern for simplification (Devaraj et al., 2022). In the context of work studying hallucination in LLMs, our contradiction and factual error categories can be interpreted as intrinsic and extrinsic hallucination, respectively (Ji et al., 2023).

Irrelevant. A sub-type of hallucination in which the inserted information is not related to the main idea of the sentence; as before, the threshold for 'relevancy' is ambiguous (§A.1.1). For simplicity, we report irrelevancy alongside hallucination, as information insertion is generally a rare technique.

A.2.2 Syntactic Errors
Because syntactic edits are identified by their impact on information distribution, they do not need a fine-grained error typology like conceptual edits, which make a diverse set of modifications. We simply observe each type as a failed attempt at its respective transformation.

Bad Reorder. Uses the same word-/phrase-level specification as quality reorder edits. We also observe that phrase-level reorder errors almost exclusively introduce a discontinuity in the syntax tree structure (Paetzold and Specia, 2013).

Bad Structure. We manually inspect structural errors according to the same sub-type specification as quality edits (§B).

Bad Sentence Split. Although sentence splitting is rarely rated as unhelpful, simplifications may unnecessarily segment ideas or interrupt the flow of information.

A.2.3 Lexical Errors
Unrelated to information change, lexical errors evaluate primitive issues in fluency or wording.

Complex Wording. An attempted paraphrase where the exact meaning is retained, but the replacement uses more complex wording (also referred to as a hyponym, e.g., Stodden and Kallmeyer, 2022).

EXAMPLE
The researchers conducted an investigation. → The researchers conducted an assay.
Information Rewrite. A substituted span whose content concerns the same subject, but fails to substitute the wording correctly, either through misrepresenting or falsely interpreting the information. Although similar to a combination of information deletion and information insertion, the edit is still attempting to represent the same content.

Grammar Error. The edit violates grammatical convention. Past error analysis combines fluency and grammar into the same error type (Maddela et al., 2021), as the two are interrelated. Grammar errors are unique in that they can co-occur with other errors, or occur alongside a high-quality edit, as sentence fluency is independent of adequacy (Siddharthan, 2014).

B Structural Edit Examples
Examples of each structural edit sub-type are listed in Table 6. We find that training annotators to label structure change sub-types improved their ability to identify structure changes. Other work (Barancikova and Bojar, 2020), most notably Stodden and Kallmeyer (2022), annotates with a larger array of structural changes, including separate directions as distinct categories (e.g., singular → plural and plural → singular) as well as changes in sentiment and personal/impersonal form. We exclude these types as they almost never occur in the entirety of the ASSET corpus (Cardon et al., 2022). However, a case study in Italian simplification (Brunato et al., 2022) shows this structural edit distribution may vary when adapted to the needs of other languages. Similarly, German simplification often converts genitive to dative noun cases, a feature not seen in English simplification (Stodden and Kallmeyer, 2022).

C Edit-Level Annotation Example
To add clarity to our score aggregation, we show an example edit-aligned simplification in Figure 7, with categorizations and ratings for each edit shown in Table 7. Qualitatively, the sentence performed some lexical simplification, but primarily performed syntax changes to separate ideas within the sentence and removed some technical details.
Ideally, human evaluation should capture each different simplification operation performed. We report our composite scores alongside other human evaluation scores as reported by Maddela et al. (2022) in Table 8. As the domains of raw scores vary among evaluation schemes, we report the percentile of each score within the distribution of all SIMPEVAL scores for that system. Observing direct assessment (DA) scores, we find the traditional fluency / adequacy / simplicity components reach an upper bound for systems which are easily capable of high-quality writing (e.g., LLMs).
As SIMPEVAL was collected by comparing high-performing systems, our score distribution is closer to its human ratings. Additionally, with our composite scores we can see that this simplification performs well at transforming content and syntax, but does not simplify the wording to the extent of other GPT or human simplifications. We may interpret this as a function of the domain itself (e.g., the sentence contains in-domain terminology and already has a simple, basic lexicon) or as a failure to identify words to simplify (e.g., the human simplification paraphrases spacecraft).
By instead judging components of simplification rather than components of quality, we report a focused, detailed understanding of a model's ability to simplify.

D Interpreting Annotator Agreement
As the SIMPEVAL challenge dataset contains more edits than past simplification corpora, edit annotation becomes significantly more challenging: multiple groups of edits often overlap, and simplifications contain more compression and sentence-level transformations. Additionally, error-prone systems like MUSS make it challenging to disambiguate error and quality edits. Figure 8 illustrates an example of this disagreement, and Figure 9 gives a conceptual representation of the edit highlighting, showing that many of the same tokens are annotated, but with different edit spans. For example, observe the last clause in the sentence, which performs a rewrite:

EXAMPLE
that the fort stood out for its defenders' heroic resistance. → and the defenders of the fort gave their lives to save the city.
We see three different, but valid, understandings of this phrase:
1. Information was replaced - the information about the defenders' resistance is inherently different from the defenders giving their lives to save the city, and is therefore an add/deletion pair.
2. Information was retained, but paraphrased - the phrase heroic resistance is equivalent in meaning to gave their lives.
3. The subject was modified and information was replaced - the subject of the clause is swapped from the fort to the defenders; the rest is an add/deletion pair.
Varying interpretations of the same edit lead to natural disagreement. However, often a clear annotation exists and is not captured. For example, although we instructed annotators to create separate edits for overlapping syntax and conceptual edits, this occurred inconsistently in practice:

EXAMPLE
it was during the siege of the city of Elvas → Don Luis de Haro attacked the city of Elvas

1. One annotator identified the edit as a structural change, because the noun siege was replaced with a verb, modifying the voice of the sentence.
2. Another identified a paraphrase, annotating siege as a more complex word than attacked.
3. A third correctly identified that both edits occurred simultaneously.

We find the largest source of disagreement comes from overlapping edits of multiple types, most often between structural changes and other types, because they often co-occur. Figure 10 demonstrates that structural edits explain a significant portion of disagreement. Additionally, because structural edits are composite edits, the same spans are captured by their constituent spans; when re-calculating agreement using these constituent spans, disagreement instead centers on whether tokens are substituted. Within individual sentences, we often observe multiple valid interpretations for span labeling, highlighting the inherent ambiguity in the task. Despite this, annotators still successfully communicated edit performance. All three annotators identified both the bad deletion and hallucination errors contained in the sentence. For the full SIMPEVAL dataset, we report error identification agreement in Table 9, finding syntax errors (e.g., bad structure, bad reorder) are far more difficult to identify than content or lexical errors. In particular, complex wording and grammar errors exhibit both high frequency and high agreement, as the definitions of these errors are unambiguous. Broadly, we find that high span-level agreement is not necessary for capturing overall, or even fine-grained, sentence-level performance; a clear trade-off exists between the granularity of annotations and expected agreement.

E Detailed Analysis
Here, we report additional findings on the SIMPEVAL dataset and model performance, alongside observations about edit-level evaluation as a task.
Table 11 reports basic span statistics. We find paraphrases are typically annotated as pairs of a few words, while conceptual edits typically occur on the clause level and are annotated together. Surprisingly, structure changes often involved only a few words:

EXAMPLE
MUSS: ... Corbin has expanded his business to include agritourism, using his farm to host weddings ... → ... Corbin's business also offers agritourism and he uses his farm to host weddings ...

The edit converts the beginning subordinate clause to a coordinate clause, yet only requires substituting a single word. Errors exhibited a significantly higher variance in size, which may be attributed to their sparsity, as no error except bad deletion occurs in more than 20% of outputs (Table 9). However, error sizes display the same trend as their quality counterparts, with conceptual errors typically occurring on the clause level. We also found single-word conceptual errors such as the following:

EXAMPLE
Zero-shot GPT-3.5: ... Arroyo released a statement that acted as an informal concession of sorts ... → ... Arroyo released a statement that was like a formal concession.

EXAMPLE
Few-shot GPT-3.5: The sentence included a fine of $400... → They imposed a fine of $400...

Such single-word errors were less frequent than hallucinating entirely new phrasing or ideas. This may be promising for error detection, as it implies error spans are often clausal and occur across many adjacent tokens.

Quality and Error Are Interrelated. Figure 12 displays sentence-level scores for our error typology across systems on SIMPEVAL. We find the existence of an error to be a consistent predictor of a lower quality sentence, even in human simplifications. However, we find some errors correlate with a higher score (e.g., bad structure, information rewrite), which may be attributed to the multi-clause complex sentences in SIMPEVAL having a far greater number of positive edits when these corresponding errors occur. Broadly, we observe an inverse relationship between error and quality.
As the error score increases (a function of the severity, frequency and size of errors), the quality must decrease.
Increased Editing Enables, But Does Not Guarantee, Performance. Table 10 reports the mean and variance of sub-scores for the sentence-level SALSA score across each system. Edit-level scoring addresses the frequent evaluation concern that conservative systems may maximize their score by performing a minimal number of safe edits (Alva-Manchego et al., 2021). The qualitatively conservative simplifications of T5 and zero-shot GPT-3.5 often score low because they fail to make many edits. SALSA distinguishes the MUSS simplifications as having many successes, but more failures than other systems. We find that extensive editing does not by itself guarantee quality, but is a prerequisite for high performance, and that overall simplification performance is often determined by a small number of high-impact edits.
Sentence Length Impacts Edit Frequency. Previous linguistic annotation of the ASSET corpus

Figure 1 :
Figure 1: Simplification generated by few-shot GPT-3.5. Our edit-level SALSA annotation communicates a fine-grained evaluation of successes and failures.

Figure 3 :
Figure 3: The multi-stage SALSA edit evaluation framework, implemented by our edit annotation interface (Fig. 2). Underlying spans are classified into twenty-one success (↑) and failure (↓) types, rated by efficacy or error severity.

Figure 4 :
Figure 4: Successful edits on SIMPEVAL per model, organized by edit type. MUSS successfully paraphrases at a human rate but fails to capture more complex simplification techniques. Compared to GPT-3.5, human content simplification utilizes more generalization, a similar distribution of syntax edits, and slightly less paraphrasing.

Figure 6 :
Figure 6: Edit coverage of efficacy (+) and severity (-) ratings on SIMPEVAL, separated by approach to simplification. Edit coverage is defined as (len(e_C) + len(e_S))/(len(C) + len(S)) (see §3.4). The final all column ignores frequency to compare the ratio of quality and error among models. High-quality simplification is tuned to human performance, rather than maximizing the number of edits. We also report frequencies of edit ratings alone in Figure 14 in the Appendix.

EXAMPLE
Herbert Spencer's book makes the first... → His book makes the first...

Repetition. Some trivial additional information which simply repeats knowledge already contained in the candidate sentence.
EXAMPLE
... the New York City Police Department is a law enforcement agency ... → ... the New York City Police Department is a police department ...
Despite successfully paraphrasing, police department simply copies content from earlier in the sentence, instead of generating unique information.

Contradiction. A negation of the meaning of the original sentence. This notably includes modifying an existing phrase to contradict the original sentence:
EXAMPLE
... the Watergate burglars were convicted ... → ... the Watergate burglars were not convicted ...
or generating new information making the sentence contradict itself:
EXAMPLE
Dextrose adds flavor and texture to dishes, although its consumption is known for negative consequences. → Dextrose adds flavor, texture and nutrition to dishes, although its consumption is known for negative consequences.

EXAMPLE
Hilary Clinton was born in 1947. → Hilary Clinton was born in 1947 outside the United States.

Figure 8: Edit selection between three annotators on a MUSS simplification. For complex examples, multiple valid interpretations for span labeling may exist; however, we find annotators' overall judgements are consistent.

Figure 10: The vertical and horizontal axes represent the class of majority agreement and the minority decision between annotators, respectively. The left includes all edits, while the right calculates agreement using the underlying constituent spans selected for structure and split edits.

Figure 11 :
Figure 11: Average character length of edits and specific error types with 95% confidence interval.

Figure 12 :
Figure 12: Average sentence-level score across error sentences for each system.

Figure 13 :
Figure 13: Edit distance and number of annotated edits for 300 randomly sampled sentences from ASSET and SIMPEVAL. While past work found no relationship, by extending ASSET to more complex sentences we see a clear correlation arise.

Figure 17 :
Figure 17: Landing page introducing annotators to each part of the task. The 10 stages organize different concepts in the SALSA typology.

Figure 18 :
Figure 18: Example interactive exercise allowing annotators to see different spans and understand different amounts of relevancy to the main idea of the sentence.

Figure 19 :
Figure 19: One of the 100 sentence examples provided to annotators, highlighting different types of structure edits existing within the same sentence.

Table 2
After defeating PSD candidate Viorica Dȃncilȃ by a landslide in 2019, his second term.. → In 2019, Klaus Iohannis defeated PSD candidate Viorica Dȃncilȃ by a large margin. His second term..

Table 3 :
Pearson correlation between automatic metrics and SALSA sub-scores (§3.4). We exclude human-written simplifications, which are used as references. Best; Second Best.

Table 5 :
Overview of the edit-level evaluation typology. Original text for the examples: Many volatile organic chemicals are increasing in abundance in the lower troposphere.

Table 6 :
Examples of structural modification sub-types used for annotation:
passive voice → active voice: Elevation is not primarily considered by the system. → The system does not primarily consider elevation.
Part-of-Speech Change (modifies words' derivation or inflection), nominalisation (verb → noun): The ability to capture nature scenes has been improving... → The ability to capture nature scenes has seen improvement...; denominalisation (adjective → verb): The protesters turned violent when... → The violent protesters...
Tense Change (modifies verb modality or tense), past perfect → past simple: The governor told reporters he had overseen a productive conversation. → The governor oversaw a productive conversation.; present → past: We compute the Pearson correlation to assess annotation quality. → We computed the Pearson correlation when we assessed annotation quality.

Table 8 :
SALSA sub-scores of Figure 7, reporting performance across different simplification approaches, alongside other human evaluation schemes. * indicates the percentile among the raw scores for each system.

Table 9 :
Fleiss' kappa error identification agreement, measured per sentence, alongside error frequencies. As errors were comparatively rare, we observe a strong relationship between frequency and expected agreement.