Keep It Simple: Unsupervised Simplification of Multi-Paragraph Text

This work presents Keep it Simple (KiS), a new approach to unsupervised text simplification which learns to balance a reward across three properties: fluency, salience and simplicity. We train the model with a novel algorithm to optimize the reward (k-SCST), in which the model proposes several candidate simplifications, computes each candidate’s reward, and encourages candidates that outperform the mean reward. Finally, we propose a realistic text comprehension task as an evaluation method for text simplification. When tested on the English news domain, the KiS model outperforms strong supervised baselines by more than 4 SARI points, and can help people complete a comprehension task an average of 18% faster while retaining accuracy, when compared to the original text.


Introduction
The main objective of text simplification is to make a complex text accessible to a wide audience by increasing its readability. In contrast with text summarization, in which key content is selected to remain in the summary and other content is elided, in text simplification ideally all relevant content is preserved.
We propose that text simplification algorithms need to balance three properties: (1) fluency: the simplified text should use well-formed English sentences, (2) salience: the simplified text should relay the same information as the original, and (3) simplicity: the simplified text should be syntactically and lexically simpler than the original.
Figure 1: ... (Lewis, 2021). We optimize a three-component reward: fluency, salience and simplicity. We show model outputs when trained with all three components, and with a missing component.
Figure 1 provides intuition for the necessity of each of the three properties. It shows the original text and the output of the full proposed model compared to three reduced versions:
Without Fluency, the generator has no incentive to generate full sentences, and learns it can boost the simplicity score by generating short phrases with excessive punctuation.
Without Salience, the generator does not gain by covering facts in the original text, and can improve the simplicity score by learning to remove facts (e.g., not mentioning planet Mars by name).
Without Simplicity, the generator is not guided to favor syntactically and lexically simpler rewrites. In Figure 1, Model No Simplicity is in fact more complex than the original according to readability measures.
As we show in the related work section (Section 2), there are no high-quality, large datasets publicly released for text simplification. In this work, we build on recent progress of reinforcement learning (RL)-based training of text generators: we formulate a reference-free reward for text simplification and directly optimize it, circumventing the need for aligned data.
Our main contribution is the Keep it Simple (KiS) procedure, a novel unsupervised method for text simplification. Applied to the English news domain, KiS outperforms several supervised models on common simplification metrics such as SARI (Xu et al., 2016) and the Flesch-Kincaid Grade Level (Kincaid et al., 1975).
A second contribution is a new algorithm for RL-based training of text generators, k-SCST, which is an extension of Self-Critical Sequence Training (Rennie et al., 2017). For each input, we generate k sampled outputs (vs. 2 in SCST), and use the mean population reward as a baseline. We show in Section 4 that in our domain, k-SCST outperforms models trained with SCST.
A third contribution is a novel evaluation method for text simplification. Based on the assumption that simplified text should enable faster reading with better understanding, we propose a realistic Text Comprehension task. We show that people reading texts simplified by KiS are able to complete comprehension tasks faster than people reading comparison texts.
Another departure from previous work is that we work with paragraphs as units of text. Most work in text simplification is done at the sentence level, despite work such as Zhong et al. (2020) showing that common simplification phenomena occur at the level of the paragraph (e.g., the deletion, insertion or re-ordering of full sentences). Specifically, we train our models to simplify full paragraphs, and evaluate our models in a human evaluation on short documents (i.e., 3-4 paragraphs).
Through rigorous empirical evaluation, we demonstrate the strong performance of our approach; automated results show that this unsupervised approach is able to outperform strong supervised models by 4 SARI points or more. We publicly released the code and model checkpoints.

Related Work
Simplification Datasets. Early datasets were based on Simple Wikipedia: WikiSmall (Zhu et al., 2010), later expanded into WikiLarge (Zhang and Lapata, 2017). Xu et al. (2015) show there are quality concerns with Simple Wikipedia datasets, and propose Newsela as a replacement. Newsela is a project led by educators re-writing news articles targeting different school grade levels. We view Newsela as the gold standard for our work, and use the public Newsela release of 1,911 groups of articles to design and evaluate our work. Using a coarse paragraph alignment algorithm, we extract 40,000 paired simple/complex paragraphs targeting a separation of 4 grade levels. We call this dataset the paired Newsela dataset, which we use for analysis and baseline training.
Seq2Seq for Simplification. Text simplification is most commonly framed as a sequence-to-sequence (seq2seq) task, leveraging model architectures of other seq2seq tasks, such as machine translation (Zhu et al., 2010; Wubben et al., 2012). Martin et al. (2020) introduce ACCESS, a finetuned Transformer model that achieves state-of-the-art performance on WikiLarge. ACCESS can customize simplifications on parameters such as compression rate and paraphrase amount. We directly compare our approach to ACCESS.
Data availability remains one of the main limitations to seq2seq-based text simplification. We side-step this issue entirely by working with unsupervised data, only requiring a small dataset with coarse-level alignments for calibration.
Lexical Simplification focuses on the substitution of single words or phrases with simpler equivalents, with approaches ranging from lexical databases such as WordNet (Thomas and Anderson, 2012) to contextualized word vectors (Qiang et al., 2020). These methods tend to be limited, as they do not consider syntactic complexity, and have no direct way of modeling deletions and insertions. We incorporate a lexical score (L Score) as one of the rewards in our simplicity component.
Text-edit for Simplification. Recent work (Dong et al., 2019; Stahlberg and Kumar, 2020) has modeled text simplification as a text-edit task, learning sequences of word-edits that transform the input into the output. Text editing offers explainability, at the cost of added model complexity. We find that without explicitly representing edits, the KiS model easily learns to copy (using attention heads) and deviate from the original text. Outputs can be post-processed into edits, if desired.
Unsupervised Simplification has mostly been limited to lexical simplification. Recently Surya et al. (2019) (Unsup NTS) proposed a system that can perform both lexical and syntactic simplification, with a joint encoder, and two decoders (simple and complex). We directly compare our unsupervised approach to Unsup NTS.
RL for Simplification. Prior work (Zhang and Lapata, 2017; Guo et al., 2018) used Reinforcement Learning (RL)-based simplification. However, in both cases, components of the reward or training procedure involved reference simplifications, requiring an aligned dataset. By designing a reference-free reward, we are able to train our model with RL without supervision.
Evaluation of Simplification. This usually falls into two categories: automatic offline evaluation, and human evaluation. Automatic evaluations usually involve using n-gram overlap calculations such as BLEU (Papineni et al., 2002) and SARI (Xu et al., 2016). SARI was shown to correlate better with human judgements of simplicity than BLEU, and it has since become a standard (Zhang and Lapata, 2017; Surya et al., 2019; Martin et al., 2020). In our experiments, we report both SARI and BLEU.
Human evaluation is typically done in an intrinsic way, e.g., by directly rating factors like fluency, simplicity and relevance of model outputs (Surya et al., 2019; Wubben et al., 2012). In this work, we propose an extrinsic, task-based protocol. In our comprehension study, we directly measure how much simplified texts can help a human reader answer questions more efficiently. The closest to our evaluation design is that of Angrosh et al. (2014), with the important difference that we require participants to resubmit after erroneous answers. In pilot studies, we found this step to be crucial for high-quality responses.

KiS Components
In KiS, we approach unsupervised simplification as a (non-differentiable) reward maximization problem. As shown in Figure 2, there are four components to the reward: simplicity, fluency, salience and guardrails which are jointly optimized. This is essential to avoid trivial solutions that only consider subsets. We therefore use the product of all components as the total reward, because the product is sensitive to the sharp decrease of a single component. For example, the triggering of a single guardrail leads to the zeroing of the total reward. Each component is normalized to the [0, 1] range.
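As a rough illustration of why a product is used, the sketch below combines component scores multiplicatively and zeroes the result when any guardrail fails. The function name `total_reward` and the dictionary layout are illustrative assumptions, not the released implementation.

```python
from math import prod

def total_reward(components, guardrails):
    """Multiply component scores (each in [0, 1]); a failed guardrail zeroes the reward.

    `components` maps a component name to its score; `guardrails` maps a
    guardrail name to True (pass) or False (triggered). Both layouts are
    assumptions made for this sketch.
    """
    for name, score in components.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    if not all(guardrails.values()):
        return 0.0
    return prod(components.values())

passing = {"brevity": True, "inaccuracy": True}
balanced = total_reward({"simplicity": 0.8, "fluency": 0.8, "salience": 0.8}, passing)
lopsided = total_reward({"simplicity": 1.0, "fluency": 1.0, "salience": 0.1}, passing)
blocked = total_reward({"simplicity": 0.8, "fluency": 0.8, "salience": 0.8},
                       {"brevity": True, "inaccuracy": False})
```

Under a product, the lopsided output (two perfect components, one near zero) scores lower than the balanced one, whereas a plain average would rank it higher.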

Simplicity
The simplicity score should establish whether the generator's output uses simpler language than the original text. We follow prior work (Ferrés et al., 2016) and organize our score into a syntactic score S Score , and a lexical score L Score . Syntactic simplification focuses on reducing the complexity of a sentence, for example by reducing the number of words in a clause, or reducing distant dependencies. In lexical simplification, the objective is to replace complex phrases with simpler synonyms. To produce a single simplicity score, we take the product of S Score and L Score (both in [0, 1]).

Syntactic Simplicity: S Score
We measure syntactic complexity via the Flesch-Kincaid grade level (FKGL), as it is easy to compute and maps to a grade level, which also corresponds to the scale used by Newsela. Other readability metrics such as the Dale-Chall formula (Dale and Chall, 1948) or the Gunning-Fog index (Gunning, 1969) could be used, and future work could examine the effect of choosing one readability metric over the other. Another viable option is the Lexile score (Smith et al., 2016); however, because its implementation is not publicly released, we cannot use it during training and we report it only for evaluation (done manually on the Lexile Hub). Figure 3 shows the S Score algorithm. We compute the original paragraph's FKGL (FStart), which is used to compute a target FKGL (tgt). The score is a linear ramp measuring how close the achieved FKGL (FEnd) is to the target, clipped to [0, 1].
Figure 4: When the original paragraph's FKGL is higher (x-axis), the change in FKGL tends to be larger (y-axis). We fit a linear approximation, which we use to compute the S Score.
In the initial design, the target drop was a constant: 4 grade levels, independent of FStart. However, analysis on the paired Newsela corpus revealed that the target FKGL should depend on the initial FKGL. This makes sense intuitively: an already syntactically simple paragraph should not require further simplification, while more complex paragraphs require more simplification. Figure 4 shows the positive correlation between the original paragraph's FKGL and the drop of FKGL in the simplified text. We fit a piece-wise linear function to calculate the target FKGL drop from the initial paragraph.
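The ramp described above might be sketched as follows. Figure 3 is not reproduced in this text, so both the target function (`target_fkgl`, with made-up slope and intercept) and the exact ramp shape (symmetric around the target) are assumptions chosen only to illustrate the idea that more complex inputs get a larger target drop.

```python
def clip01(x):
    """Clip a value to the [0, 1] range."""
    return max(0.0, min(1.0, x))

def target_fkgl(f_start, slope=0.5, intercept=1.0):
    """Hypothetical target: the higher the starting FKGL, the larger the drop.

    The coefficients are placeholders, not the values fitted on Newsela.
    The target never exceeds the starting grade level.
    """
    return min(f_start, slope * f_start + intercept)

def s_score(f_start, f_end, target_fn=target_fkgl):
    """Linear ramp: 1 when f_end hits the target, 0 when no progress is made."""
    tgt = target_fn(f_start)
    required_drop = f_start - tgt
    if required_drop <= 0:
        # Already simple: full credit as long as complexity did not rise.
        return 1.0 if f_end <= f_start else 0.0
    return clip01(1.0 - abs(f_end - tgt) / required_drop)

on_target = s_score(12.0, 7.0)    # target for grade 12 is grade 7 here
halfway = s_score(12.0, 9.5)      # half of the required drop achieved
unchanged = s_score(12.0, 12.0)   # no simplification at all
```

With these placeholder coefficients, a grade-12 paragraph is asked to drop five grade levels while a grade-4 paragraph is asked to drop only one.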

Lexical Simplicity: L Score
Lexical simplicity focuses on whether words in the input paragraph ($W_1$) are more complex than ones in the output paragraph ($W_2$). We rely on the observation that word frequency and difficulty are correlated (Breland, 1996), and use word frequency in a large corpus of text (Brysbaert and New, 2009) to determine simplicity.
Because word frequency follows a Zipf power law, we use Speer et al. (2018)'s log normalization, adjusting the frequency on a [0, 8] range, with words at 0 being non-existent in the corpus, and 8 for most common words. As an example, the word vigorous has a frequency of 3.54, while its more common synonym strong obtains 5.23.
We compute the average Zipf frequency of the set of inserted words, $Z(W_2 - W_1)$, and of the set of deleted words, $Z(W_1 - W_2)$. The difference, $\Delta Z(W_1, W_2) = Z(W_2 - W_1) - Z(W_1 - W_2)$, should be positive. Analysis of the paired Newsela corpus reveals that 91% of pairs have a positive $\Delta Z(W_1, W_2)$, with a median value of 0.4. We use this median as the target Zipf shift in the L Score, and use a ramp shape similar to the S Score, clipped between 0 and 1 (denoted as $[\cdot]^{+}$):
$$L_{Score}(W_1, W_2) = \left[1 - \frac{|\Delta Z(W_1, W_2) - 0.4|}{0.4}\right]^{+}$$
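A minimal sketch of the Zipf-shift computation follows. The tiny frequency table is a stand-in for a real word-frequency resource: the values for vigorous and strong come from the text, the remaining entries are invented. The symmetric ramp around the 0.4 target, and returning a shift of 0 when nothing was inserted or deleted, are assumptions of this sketch.

```python
# Stand-in Zipf table; only "vigorous" and "strong" are values from the text.
ZIPF = {"vigorous": 3.54, "strong": 5.23, "a": 7.3, "debate": 4.6}

def mean_zipf(words, zipf):
    """Average Zipf frequency; words missing from the table count as 0."""
    return sum(zipf.get(w, 0.0) for w in words) / len(words)

def delta_z(original_words, simplified_words, zipf):
    """Mean Zipf of inserted words minus mean Zipf of deleted words."""
    w1, w2 = set(original_words), set(simplified_words)
    inserted, deleted = w2 - w1, w1 - w2
    if not inserted or not deleted:
        return 0.0  # assumption: no lexical change means no shift
    return mean_zipf(inserted, zipf) - mean_zipf(deleted, zipf)

def l_score(delta, target=0.4):
    """Assumed symmetric ramp around the target shift, clipped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - abs(delta - target) / target))

shift = delta_z(["a", "vigorous", "debate"], ["a", "strong", "debate"], ZIPF)
scores = [l_score(d) for d in (0.0, 0.2, 0.4)]
```

In practice the shift is averaged over all inserted and deleted words in a paragraph, so a single substitution like this one would be diluted by the rest of the edit.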

Fluency
We use two sub-components for the fluency component: a pre-trained language-model, and a discriminator trained dynamically with the generator.

Language-Model Fluency
Language models assign a probability to a sequence of words. This probability is often used to measure fluency of generated text (Kann et al., 2018; Salazar et al., 2020). The KiS fluency score is based on a language model in a way similar to Laban et al. (2020). The language model is used to obtain a likelihood of the original paragraph, $LM(p)$, and of the generated output, $LM(q)$. We use average log-likelihood, for numerical stability. The language model fluency score is then:
$$LM_{Score}(p, q) = \left[1 - \frac{LM(p) - LM(q)}{\lambda}\right]^{+}$$
where $[\cdot]^{+}$ denotes clipping to [0, 1] and $\lambda$ is a tunable hyper-parameter. If $LM(q)$ is lower than $LM(p)$ by $\lambda$ or more, $LM_{Score}(p, q) = 0$. If $LM(q)$ is above or equal to $LM(p)$, then $LM_{Score}(p, q) = 1$, and otherwise, it is a linear interpolation. We set $\lambda = 1.3$ as it is the value for which the paired Newsela dataset achieves an average LM Score of 0.9.
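The three cases described above (zero, one, and linear interpolation in between) can be written as a single clipped expression. The sketch below takes the two average log-likelihoods as plain numbers; obtaining them from an actual language model is outside its scope, and the function name `lm_score` is illustrative.

```python
def lm_score(lm_p, lm_q, lam=1.3):
    """Fluency score from average log-likelihoods of the original (lm_p)
    and the generated text (lm_q).

    Returns 1 when lm_q >= lm_p, 0 when lm_q is lower by lam or more,
    and interpolates linearly in between.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    return max(0.0, min(1.0, 1.0 - (lm_p - lm_q) / lam))

same_likelihood = lm_score(-3.0, -3.0)   # output as likely as the original
more_likely = lm_score(-3.0, -2.5)       # output more likely than the original
halfway = lm_score(-3.0, -3.65)          # output lower by lam / 2
far_below = lm_score(-3.0, -5.0)         # output lower by more than lam
```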

Discriminator Fluency
The LM Score is static and deterministic, which can be limiting, as the generator can learn during training how to adapt and exploit flaws in the language model (e.g., learning to alter capitalization).
Inspired by the Generative Adversarial Network (GAN) framework (Goodfellow et al., 2014), we create a dynamic discriminator, trained in conjunction with the generator, dynamically adapting the fluency score during training.
Specifically, we use a RoBERTa model (Liu et al., 2019) as the basis for the discriminator, a classifier with two labels: 1 for authentic paragraphs, and 0 for generator outputs.
As the generator produces outputs, they are assigned a label of 0 and added to a training buffer, while the original paragraphs are assigned a label of 1 and added to the training buffer as well.
Once the training buffer reaches a size of 2,000 samples, the discriminator is trained, using 90% of the training buffer. We train the discriminator for 5 epochs (details of training are in Appendix A.1). At the end of each epoch, we checkpoint the discriminator model. We compare the 5 checkpoints in terms of F1 performance on the remaining 10% of the training buffer, and keep the best checkpoint as the new discriminator.
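The buffer handling and checkpoint selection could look roughly like the sketch below. It leaves out the actual RoBERTa training: checkpoints are represented by arbitrary callables mapping a text to a predicted label, and the helper names (`split_buffer`, `f1_score`, `best_checkpoint`) are hypothetical.

```python
import random

def split_buffer(buffer, train_fraction=0.9, seed=0):
    """Shuffle the buffer and split it into training and held-out parts."""
    items = list(buffer)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

def f1_score(labels, predictions, positive=1):
    """F1 for the positive class (label 1 = authentic paragraph)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_checkpoint(checkpoints, held_out):
    """Index of the checkpoint with the highest F1 on the held-out split."""
    labels = [label for _, label in held_out]
    scores = [f1_score(labels, [clf(text) for text, _ in held_out])
              for clf in checkpoints]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy buffer: 1,000 authentic paragraphs (label 1) and 1,000 generator outputs (label 0).
buffer = [(f"original {i}", 1) for i in range(1000)]
buffer += [(f"generated {i}", 0) for i in range(1000)]
train, held_out = split_buffer(buffer)

# Stand-ins for per-epoch checkpoints: one always predicts "authentic", one is exact.
checkpoints = [lambda text: 1, lambda text: int(text.startswith("original"))]
chosen = best_checkpoint(checkpoints, held_out)
```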
The discriminator's probability that a paragraph (q) is authentic is the discriminator score:
$$D_{Score}(q) = p_{disc}(Y = 1 \mid X = q)$$
As with GANs, there is an equilibrium between the generator attempting to maximize the probability of generating real outputs ("fooling" the discriminator), and the discriminator succeeding at distinguishing generated and authentic texts.

Salience
For the salience component, we use the coverage model introduced in the summary loop (Laban et al., 2020) for the domain of text summarization, and adapt it to the simplification domain.
The coverage model is a Transformer-based model trained to look at generated text and answer fill-in-the-blank questions about the original text. The score is based on model accuracy at filling in the blanks: the more is filled in, the more relevant the generated content is, and the higher the score.
A key element of the coverage model is its masking procedure, which decides which words to mask. In the summary loop, a limited number of extracted keywords (up to 15 words) are masked. By contrast, for simplification, we mask all non-stop words, amounting to a masking rate of about 40%.
This change reflects a difference in expectation between summarization and simplification: in summarization, only key components are expected to be recovered from a summary, whereas in simplification most of the original paragraph should be recoverable. Coverage ranges in [0, 1], and reference simplifications in the paired Newsela corpus obtain an average score of 0.76, confirming that manual simplification can achieve high coverage.
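To make the masking procedure concrete, the sketch below masks every non-stop word of a paragraph and scores a set of fill-in predictions. The short stop-word list, the whitespace tokenization, and the `<mask>` token are simplifying assumptions; in the actual system the predictions would come from the coverage model reading the simplified text.

```python
# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "on", "in", "of", "to", "and", "is", "was", "it"}

def mask_non_stop_words(text, mask_token="<mask>"):
    """Replace every non-stop word with a mask token; return masked tokens and answers."""
    masked, answers = [], []
    for token in text.split():
        if token.lower().strip(".,;:!?\"'") in STOP_WORDS:
            masked.append(token)
        else:
            masked.append(mask_token)
            answers.append(token)
    return masked, answers

def coverage_score(predictions, answers):
    """Fraction of masked words that were filled in correctly."""
    if not answers:
        return 0.0
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)

masked, answers = mask_non_stop_words("The rover landed on Mars in February.")
masking_rate = masked.count("<mask>") / len(masked)
# Hypothetical predictions: three of the four blanks are recovered.
score = coverage_score(["rover", "landed", "Mars", "July."], answers)
```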

Guardrails
We use guardrails as simple pattern-based scores to avoid common pathological generation problems that we observed. Unlike the main components, guardrails are binary, giving a score of 1 (pass) unless they trigger (score of 0). We use two guardrails: brevity and inaccuracy.

Brevity guardrail
The brevity guardrail ensures the length of generated paragraph (L 2 ) falls in a range around the original paragraph's length (L 1 ). We compute a compression ratio: C = L 2 /L 1 . If C min ≤ C ≤ C max , the guardrail passes, otherwise it triggers.
We set [C min , C max ] = [0.6, 1.5], because these values ensure the guardrail is not triggered on 98% of the paired Newsela dataset; this can be adapted depending on the application.
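The brevity check reduces to a range test on the compression ratio, as in the sketch below. Measuring length in whitespace-separated words is an assumption; the text does not state the unit.

```python
def brevity_guardrail(original, generated, c_min=0.6, c_max=1.5):
    """Return 1 (pass) if the compression ratio lies in [c_min, c_max], else 0.

    Lengths are counted in whitespace-separated words (an assumption).
    """
    l1 = len(original.split())
    l2 = len(generated.split())
    if l1 == 0:
        return 0  # nothing to compare against: treat as triggered
    ratio = l2 / l1
    return 1 if c_min <= ratio <= c_max else 0

original = "one two three four five six seven eight nine ten"
kept = brevity_guardrail(original, "one two three four five six seven")   # ratio 0.7
too_short = brevity_guardrail(original, "one two three")                  # ratio 0.3
too_long = brevity_guardrail(original, " ".join(["word"] * 16))           # ratio 1.6
```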

Inaccuracy guardrail
Modern text generation models are known to hallucinate facts (Huang et al., 2020), which has led the community to create models to detect and correct hallucinations (Cao et al., 2020; Zhang et al., 2020; Wang et al., 2020).
We propose a light-weight inaccuracy detector as a guardrail. We use a Named Entity Recognition (NER) model (Honnibal et al., 2020) to extract entities present in the original paragraph (E 1 ) and the model's output (E 2 ). We trigger the guardrail if an entity present in E 2 is not in E 1 .
Even though human writers can successfully introduce new entities without creating inaccuracies (e.g., replacing the city La Paz with the country Bolivia), we find that text generators predominantly introduce inaccuracies with novel entities. This simple heuristic can eventually be replaced once inaccuracy detection technology matures.
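Once entities have been extracted, the guardrail is a set-inclusion test. The sketch below takes the two entity lists as given (the NER step is omitted) and compares them case-insensitively, which is an assumption of this sketch.

```python
def inaccuracy_guardrail(original_entities, output_entities):
    """Return 1 (pass) if every output entity also appears in the original, else 0.

    Entities are compared case-insensitively after trimming whitespace
    (an assumption; the matching rule is not specified in the text).
    """
    e1 = {e.strip().lower() for e in original_entities}
    e2 = {e.strip().lower() for e in output_entities}
    return 1 if e2 <= e1 else 0

# Dropping an entity passes; introducing a new one triggers the guardrail.
dropped = inaccuracy_guardrail(["La Paz", "NASA"], ["La Paz"])
introduced = inaccuracy_guardrail(["La Paz"], ["Bolivia"])
```

The second call reproduces the La Paz/Bolivia case from the text: a human writer could make that substitution safely, but the heuristic treats it as a trigger.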

KiS Training
Rennie et al. (2017) introduced Self-Critical Sequence Training (SCST) as an effective algorithm for reward-based training of text generators, successfully applying it to image captioning. The efficacy of SCST was later confirmed on other text generation tasks such as question generation (Zhang and Bansal, 2019), and summarization (Celikyilmaz et al., 2018; Laban et al., 2020).
In SCST, a probabilistic model is used to generate two distinct candidates: $C^S$, a candidate constructed by sampling the word distribution at each step, and $\hat{C}$, constructed by taking the argmax of the word distribution at each step. Each candidate is scored, obtaining rewards of $R^S$ and $\hat{R}$, respectively, and the loss is:
$$\mathcal{L} = (\hat{R} - R^S) \sum_{i=1}^{N} \log p(w_i^S \mid w_1^S, \ldots, w_{i-1}^S, P)$$
where $p(w_i^S \mid \ldots)$ represents the probability of the i-th word conditioned on the previously generated sampled sequence according to the model, $P$ is the input paragraph, and $N$ the number of words in the generated sequence. Intuitively, minimizing this loss increases the likelihood of the sampled sequence if $R^S > \hat{R}$, and decreases it otherwise, both increasing the expected total reward.
One limitation in SCST occurs when the two sequences achieve comparable rewards ($R^S \approx \hat{R}$): the loss nears zero, and the model has little to learn, wasting a training sample. In our experiments with SCST, this can occur with 30% of samples.
We propose an extension of SCST, which we call k-SCST. We generate k sampled candidates (k > 2), compute the rewards of each candidate $R^{S_1}, \ldots, R^{S_k}$, as well as the mean reward achieved by this sampled population: $\bar{R}^S = (R^{S_1} + \ldots + R^{S_k})/k$, which we use as the baseline, instead of $\hat{R}$. The loss $\mathcal{L}$ becomes:
$$\mathcal{L} = \sum_{j=1}^{k} (\bar{R}^S - R^{S_j}) \sum_{i=1}^{N} \log p(w_i^{S_j} \mid w_1^{S_j}, \ldots, w_{i-1}^{S_j}, P)$$
We use a GPT2-medium for the generator, initialized with the released pre-trained checkpoint. Experimental details such as data and optimizer used are provided in Appendix A.1.
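A numeric sketch of the k-SCST objective is given below: each sampled candidate's sequence log-likelihood is weighted by the gap between the mean population reward and its own reward. It works on plain floats to show the arithmetic; in actual training the log-likelihoods would be differentiable tensors produced by the generator, and the function name `k_scst_loss` is illustrative.

```python
def k_scst_loss(rewards, seq_log_probs):
    """k-SCST loss for one input.

    rewards[j] is the reward of the j-th sampled candidate and
    seq_log_probs[j] is the sum of its token log-probabilities.
    The baseline is the mean reward of the sampled population.
    """
    k = len(rewards)
    if k < 2 or k != len(seq_log_probs):
        raise ValueError("need k >= 2 rewards and one log-probability per reward")
    baseline = sum(rewards) / k
    # Candidates above the baseline get a negative weight, so minimizing the
    # loss raises their likelihood; candidates below it are pushed down.
    return sum((baseline - r) * lp for r, lp in zip(rewards, seq_log_probs))

two_candidates = k_scst_loss([0.0, 1.0], [-5.0, -2.0])
no_signal = k_scst_loss([0.5, 0.5, 0.5, 0.5], [-9.0, -7.0, -8.0, -6.0])
```

When all candidates earn the same reward, every weight is zero and the sample carries no learning signal; with more samples, such ties across the whole population become less likely than a tie between two candidates.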
In Figure 5, we show results of a direct comparison of SCST (k = 2) with k-SCST varying k in {4, 6, 8}, while keeping other components of the training fixed. Because of the variance involved in RL training, we recorded six independent training runs for each setting (for a total of 24 runs), and plot the average reward across runs of a setting, as well as the standard error of the mean (SEM).
We observe that increasing k leads to higher average reward, and less variation in the reward. In our setting, k-SCST boosts performance and stabilizes training. We use k = 8 in all final models, as increasing k further is impractical due to GPU memory limitations.
We believe k-SCST's advantage stems from two factors: first, obtaining a better estimate of the distribution of rewards by sampling more outputs, and second, using the mean reward as the baseline, which saves the computation of a separate baseline generation. We believe k-SCST can also improve learning in other text generation applications and plan to pursue this in future work.

Experiments
We present results experimentally validating the KiS procedure for text simplification. We give results based on automatic metrics, on a novel human comprehension task, and from an ablation study.

Models Compared
We compare the KiS Model to three strong supervised models, and an unsupervised approach.
ACCESS90 is identical to ACCESS, with different parameters (NBChar=0.90, LevSim=0.75), reducing target compression from 95% to 90%, matching the average compression rate in Newsela.
Finetune Baseline is a GPT2-medium model finetuned on the paired Newsela dataset. Large pre-trained models often perform competitively in low-resource environments, making this a strong point of comparison.
Unsup NTS (Surya et al., 2019) is an unsupervised approach based on successively encoding and denoising text using a GRU architecture.
Training details for the KiS Model and Finetune Baseline are in Appendix A.1.

Automatic Results
We put aside 500 samples from the paired Newsela dataset as a test set to compare models on automatic metrics. We compare models on SARI and BLEU, report the percentage of samples for which each readability measure improves (%FKGL and %Lexile), and compute the average compression rate (Comp.) and coverage (Cov.). Results are summarized in Table 1.
The KiS model achieves the highest SARI score by a margin of 0.04 (SARI reported on a 0-1 scale, i.e., 4 SARI points), even though it is an unsupervised approach.
Finetune Baseline achieves the highest BLEU and salience scores, but lowest SARI score. We interpret this as showing the model takes the least risk: high salience, with little simplification.
We observe that all models are able to increase readability in terms of FKGL and Lexile compared to original paragraphs. We note that for almost all models, the percentage is lower for the Lexile measure than for FKGL, showing that an improvement in Lexile score is more difficult to achieve than an improvement in FKGL. The KiS model achieves an increase in Lexile readability 72% of the time, the figure closest to the 79% of the Newsela human-written references.
We note that the perfect performance of KiS on %FKGL could be explained by the fact that FKGL is part of a component being optimized (S Score), whereas Lexile is not.
In terms of compression, the KiS model compresses the second most, most likely hurting its coverage. Adjusting the Brevity guardrail could encourage the model to compress less. ACCESS90 has the compression rate closest to Newsela references, but this only leads to a modest improvement in SARI when compared to ACCESS.
Overall, the Newsela references achieve the best percentage of Lexile readability improvement, while outperforming the KiS model at coverage: there is still a gap between human-written simplifications and model-generated ones.

Human Comprehension Study
We propose a human comprehension study to evaluate the usefulness of simplification results. Simplified text should be easier to read than the original text, while retaining accuracy and understanding. We design a task to evaluate how well both manual and automated simplifications achieve this objective. The main idea is to show readers a text and ask them to answer multiple-choice questions, evaluating the texts based on time and retries needed to select the correct answer.

Study Design
Five different versions of each document were generated as stimuli: the original document, the Newsela reference, and versions from the three best-performing methods from the last section: KiS, Finetune Baseline, and ACCESS. We did not include Unsup NTS in our analysis, because of its low performance on %FKGL and %Lexile metrics. Associated with each document are five manually generated multiple-choice questions, each with one or more correct answers and one to four distractors. The original and the Newsela texts were checked manually by experimenters to ensure that all allow for questions to be answered correctly. Crowd-workers were shown four documents in succession, in a between-participants design. Order of document and stimuli type were randomized. Figure 6 shows two stimuli of a document (original and KiS) along with the comprehension questions. (The entire set of five stimuli can be found in Figure A2 in the Appendix.)
Figure 6 (excerpt of stimuli):
ORIGINAL [Lexile Grade 11]
Each summer, libraries in St. Louis, Missouri, host many types of free camps - yoga, chess and even a Harry Potter "Sorting Hat Camp." In 2020, camp dreams seemed farfetched given the global coronavirus pandemic. That didn't stop St. Louis libraries, though. Instead of canceling, they brought camp into kids' homes. So children who signed up for ukulele camp got a beginner's guidebook, instructional DVD and an actual ukulele in the mail. It was all free. In addition, camp sessions still occurred. Advisers met with kids using virtual formats. Joe Monahan, manager of youth services for the St. Louis library system, says that of the 70 camps originally scheduled, 54 were held virtually.
Paula Langsam, a youth services manager at the soon-to-reopen Martin Luther King Junior Memorial Library in Washington, D.C., says, "In a way, our work has changed a lot. We didn't used to do videos a lot."
KIS MODEL [Lexile Grade 9]
In the summer months, St. Louis has many free classes for kids, including yoga, chess and a Harry Potter "Sorting Hat Camp." In 2020, camp dreams again seemed far-fetched given the crisis.
After several rounds of pilot testing, we arrived at the following design choices:
Document theme. We chose recent news articles involving complex themes (e.g., trajectory of iceberg) as the source of documents. For news articles, recency seems to engage participants, and technical terms increase the impact of simplification.
Section length. We chose document length of 3-4 paragraphs (or 200 words), and five comprehension questions. Document length should not be too short (makes some questions trivial), or too long (adds a retrieval component to the task).
Selection of questions. Questions were generated via a GPT2 question generation model finetuned on the NewsQA dataset (Trischler et al., 2017). We select questions answerable by both the original and Newsela references, attempting to have both factoid (answer is entity) and reasoning questions.
Re-submission until correct. When submitting answers, participants received feedback on which were incorrect, and were required to re-submit until all answers were correct. This aligns the objective of the participant (i.e., finishing the task rapidly), with the task's objective (i.e., measuring participant's efficiency at understanding). This also gives a way to discourage participants from "brute-forcing" the task, re-submitting many combinations until one works.
We note that some components of the study, such as the choice of document themes and the selection of comprehension questions, are elements that create variability in the results. We release the models used in the study, as well as all generated texts that were evaluated, to enable follow-up research and to aid reproducibility.

Study Results
We ran the study on Mechanical Turk, accepting crowd-workers with 1700+ completed tasks, and an acceptance rate of 97%+. The study was active for two weeks in December 2020, and remunerated participants completing all four sections at a rate of $10/hour. (Appendix A.2 shows crowd-worker instructions and the document/version distributions.) When removing "brute-forced" submissions (10+ re-submissions), we are left with 244 submissions, used for result analysis reported in Table 2. (A more detailed results table is included in Appendix A.4.) We measure two outcomes: question completion time (in seconds), and number of submissions to correctness. We performed a Kruskal-Wallis test (Kruskal and Wallis, 1952) with a Dunn post-hoc test (Dunn, 1964) for statistical significance between pairs of conditions.
In line with study objectives, simplified texts help participants complete the task faster than original texts do, with three of the four simplified versions leading to improvements in completion times. Participants were fastest with KiS simplifications (18% faster). The KiS model led to a statistically significant speed-up compared to the originals, Newsela references, and ACCESS simplifications. ACCESS simplifications surprisingly led to a non-significant slow-down, which we attribute to a potential loss in fluency that might have confused participants.
One important factor we consider is that shorter passages (i.e., a smaller compression ratio) might lead to a speed-up regardless of simplicity. We confirm this by finding a small positive correlation of 0.09 between passage length and completion time. We compute a compression-adjusted speed-up (CASpeed) ratio by: (1) computing the passage length of each simplified version, (2) linearly extrapolating the expected completion time for this passage length for original paragraphs, and (3) computing the ratio of the extrapolation to the observed completion time. If CASpeed > 1, participants were faster than expected for the passage length. Newsela reference paragraphs achieve the best CASpeed, followed by the KiS model. This suggests that good simplification can involve making texts longer.
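The three CASpeed steps can be sketched as below, using an ordinary least-squares line fitted on (length, time) pairs for original passages. The fitting method and all numbers in the example are illustrative assumptions, not study data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def ca_speed(orig_lengths, orig_times, passage_length, observed_time):
    """Ratio of the completion time expected for an original passage of this
    length to the completion time actually observed."""
    slope, intercept = fit_line(orig_lengths, orig_times)
    expected_time = slope * passage_length + intercept
    return expected_time / observed_time

# Made-up numbers: original passages of 100/200/300 words took 110/120/130 seconds.
ratio = ca_speed([100, 200, 300], [110, 120, 130],
                 passage_length=150, observed_time=100)
```

In this toy case a 150-word original would be expected to take 115 seconds, so an observed 100 seconds gives a ratio above 1, i.e., faster than the passage length alone would predict.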

Ablation Study
We train three ablated models, each missing a reward component, to understand the value of each component of the KiS procedure. Figure 1 gives a qualitative perspective on each ablation. Without fluency, the generator learns to generate incomplete sentences; without salience, it omits important information; and without simplicity, it can sometimes "complexify".
We computed complete automatic results for the ablated models, and find that each ablation leads to a decrease on an evaluation metric, confirming that all three components are necessary to generate high-quality simplifications (details in Appendix A.5).

Limitations and Future Work
Improved Accuracy Scoring. The current guardrail for inaccuracy is rudimentary; trained models still generate non-factual simplifications. Recent work in fact-checking for the summarization domain (Kryscinski et al., 2020; Li et al., 2018) could be adapted to the simplification domain to improve this.
Inclusion of Supervised Signal. In this work, we establish that text simplification can be approached in an unsupervised manner. In future work, Keep it Simple could be used as a pretraining strategy, or used jointly with supervised training.
Reproducibility of Human Evaluation. Even though we release the models, stimuli and comprehension questions used in the human evaluation, some elements of the procedure introduce randomness. Participating crowd-workers differ in literacy level, which may affect their performance on the task (Alonzo et al., 2021).
New Settings, Domains and Languages. We limited our experiments to the simplification of English news articles following prior work, but plan to pursue other languages in the future. Similarly, because Keep it Simple does not require labeled data, it can be applied to new settings (e.g., rewriting to reverse the effects of simplification), or to new domains (e.g., legal texts).

Conclusion
We have shown that text simplification can be approached in an unsupervised manner via KiS. By optimizing a reward composed of simplicity, fluency and salience components, KiS is able to outperform strong supervised models on automatic metrics (+0.04 in SARI). We propose a human comprehension task to evaluate the usefulness of simplification and show that simplifications tend to lead to a measurable speed-up in task completion, with KiS texts producing the best speed-up of 18% on average. These are first steps for unsupervised text simplification, and we suggest that future work should focus on adapting the methodology to new domains (e.g., legal texts) and non-English languages, and on refining the optimized rewards to take factuality into account.

Ethical Considerations
We present a method for text simplification and verify its performance on text from the news domain in the English language. Even though we expect the method to be adaptable to other domains and languages, we have not verified this assumption experimentally and limit our claims to the English news domain. When comparing to prior work (e.g., the ACCESS model), we obtained implementations directly from the authors (through GitHub repositories) and produced results following the recommended settings, with the objective of presenting prior work as a strong comparison point.
For the human evaluation, we paid the annotators above the minimum wage, and did not collect any personally identifiable information. We selected topics to avoid sensitive or political subjects and had our protocols reviewed by the university's IRB committee (Protocol ID: 2018-07-11230). We relied on a third party (Amazon Mechanical Turk) to remunerate the crowd-workers.

A.1 Training Details
We detail the architecture size, data, and optimizer of the models we train in the paper. All models were trained using PyTorch and Hugging Face's Transformers library (https://github.com/huggingface/transformers). We use the Apex library (https://github.com/nvidia/apex) to enable half-precision training.
The KiS procedure was trained on a single GPU, either an Nvidia V100 (16 GB memory) or a Quadro RTX 8000 (48 GB memory). We ran a total of around 200 experiments, with an average run-time of one week.
Because the procedure is unsupervised, the model was trained using a large unreleased corpus of 7 million news articles in English.
KiS Model is initialized with a GPT2-medium model. We used the Adam optimizer, with a learning rate of 10^-6 and a batch size of 1, using k-SCST with k = 8.
Finetune Baseline is initialized with a GPT2-medium model. We train using standard teacher forcing on the 40,000 samples in the paired Newsela dataset, reserving 2,000 samples for validation. We use the Adam optimizer, and use the validation set to choose a learning rate of 10^-5 and a batch size of 8. We run for 3 epochs before seeing a plateau in the validation loss.
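For reference, the k-SCST objective used to train the KiS Model (k = 8 above) can be sketched as below. This is a hedged illustration rather than the released implementation: it assumes the loss weights each candidate's sequence log-likelihood by the gap between the mean reward of the k candidates and that candidate's own reward, so that minimizing it raises the likelihood of above-average candidates. The function name and the numbers are hypothetical, and plain floats stand in for framework tensors.

```python
def k_scst_loss(rewards, seq_logprobs):
    """Sketch of a k-SCST style loss for one input paragraph.

    rewards[j]      -- scalar reward of candidate simplification j
    seq_logprobs[j] -- sum of token log-probabilities of candidate j
                       under the generator (a value <= 0)
    """
    if not rewards or len(rewards) != len(seq_logprobs):
        raise ValueError("need one reward and one log-prob per candidate")
    # The mean reward over the k candidates acts as the baseline.
    mean_reward = sum(rewards) / len(rewards)
    # Candidates above the mean get a negative weight: minimizing the
    # loss then increases their log-likelihood. Candidates below the
    # mean get a positive weight and are discouraged.
    return sum(
        (mean_reward - reward) * logprob
        for reward, logprob in zip(rewards, seq_logprobs)
    )


# Hypothetical rewards and log-likelihoods for k = 2 candidates.
loss = k_scst_loss(rewards=[1.0, 3.0], seq_logprobs=[-2.0, -4.0])
```

In an actual training loop the log-probabilities would be differentiable tensors and the rewards would be treated as constants, so that gradients flow only through the generator's likelihood terms.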
Discriminator Model is initialized with a RoBERTa-base model. Every time the training buffer reaches 2,000 samples, the discriminator is reset to the original RoBERTa-base and retrained. We use a standard cross-entropy loss and the Adam optimizer with a learning rate of 10^-5 and a batch size of 8. Each time we retrain, we run for 5 epochs, and checkpoint one model after each epoch. The checkpoint that achieves the highest performance on a validation set becomes the new discriminator for the next round.

A.2 Human Evaluation Instructions
Figure A1 shows the instructions given to crowdworker participants for the manual evaluation.
• The entire HIT should take no more than 15 minutes:
  (1) You will answer a pre-questionnaire.
  (2) Read 4 short news stories and answer comprehension questions about each.
• If you believe the answer is not in the document, you can select the option "Answer not in document".
• There is no time limit for each individual document or question.
• You can leave at any point but will not complete the HIT.
• You can complete this task at most once.
• If you have a question/problem, contact us at email.

Figure A2 is a complement to Figure 6, with the five stimuli that were shown for the Covid Libraries document.

A.4 Detail of Human Evaluation Results
Table A1 details the timing and number of participants for each combination of document and stimuli.

Figure A2: Complement to Figure 6. Example Task for the Comprehension Study. Participants were assigned to one of five settings: original, Newsela, KiS, Finetune Baseline, and ACCESS. Participants were instructed to answer the five comprehension questions.

[Table A1: only a partial row survived extraction: (53) 163 (44) 161 (46) 188 (52) 143 (49)]

Table A2: Automatic results of the three ablation models. SARI and BLEU are reference-based metrics. % FKGL and % Lexile are the percentage of simplified paragraphs with a lower FKGL and Lexile score than the original paragraph. Comp. is the average compression ratio (# of words), and Cov. is the average coverage score of the simplifications.

A.5 Detail of Ablation Study Results
Table A2 details the metric results of the three ablated models, an extension to Table 1. An example output of each ablated model, illustrating the limitation when a score component is missing, is given in Figure 1.
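Two of the column types in Table A2 are simple aggregates over (original, simplified) paragraph pairs and can be sketched as below. The sketch makes several assumptions: readability scores (FKGL or Lexile) are precomputed by an external tool and passed in as numbers, words are counted by whitespace splitting, and the compression ratio is taken as simplified length divided by original length. The function names are hypothetical.

```python
def percent_lower(orig_scores, simp_scores):
    """Percentage of pairs where the simplified score is strictly lower.

    Corresponds to the "% FKGL" / "% Lexile" style of column, given
    precomputed readability scores for each paragraph pair.
    """
    if not orig_scores or len(orig_scores) != len(simp_scores):
        raise ValueError("need one score pair per paragraph")
    lower = sum(1 for o, s in zip(orig_scores, simp_scores) if s < o)
    return 100.0 * lower / len(orig_scores)


def mean_compression(orig_texts, simp_texts):
    """Average compression ratio in number of words.

    Assumption: ratio = words in simplification / words in original.
    """
    if not orig_texts or len(orig_texts) != len(simp_texts):
        raise ValueError("need one text pair per paragraph")
    ratios = []
    for orig, simp in zip(orig_texts, simp_texts):
        n_orig = len(orig.split())
        if n_orig == 0:
            raise ValueError("original paragraph must not be empty")
        ratios.append(len(simp.split()) / n_orig)
    return sum(ratios) / len(ratios)


# Hypothetical scores and texts, for illustration only.
pct = percent_lower([10.0, 8.0, 12.0, 9.0], [7.0, 9.0, 11.0, 6.0])
comp = mean_compression(["a b c d", "a b c d e f g h i j"], ["a b", "a b c d e"])
```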
Surprisingly, the model trained without fluency achieves higher scores than the full model on almost all metrics. The reason is that, without fluency, the model does not learn to generate full sentences (see the example in Figure 1). Instead, it learns to concatenate high-scoring phrases, which can artificially boost automatic metrics. The strong performance of a model generating incomplete sentences reveals a limitation of current automatic metrics, such as BLEU and SARI.
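A toy example makes this limitation concrete. The sketch below uses clipped unigram precision, which is only one ingredient of BLEU and not the full BLEU or SARI computation; the sentences are invented for illustration and are not model outputs.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate against one reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate token is credited at most as often as it
    # appears in the reference (clipping).
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / total


# Invented sentences, for illustration only.
reference = "the city closed its libraries because of the virus"
fluent = "the city shut down libraries due to the virus"
fragments = "city closed libraries . virus ."

fluent_score = unigram_precision(fluent, reference)
fragment_score = unigram_precision(fragments, reference)
print(fluent_score, fragment_score)
```

Here the string of fragments scores higher than the fluent paraphrase, because overlap-based scoring rewards copied content words and has no notion of whether the output forms complete sentences.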