A Non-Autoregressive Edit-Based Approach to Controllable Text Simplification

We introduce a new approach for the task of Controllable Text Simplification, where systems rewrite a complex English sentence so that it can be understood by readers at different grade levels in the US K-12 system. It uses a non-autoregressive model to iteratively edit an input sequence and incorporates lexical complexity information seamlessly into the refinement process to generate simplifications that better match the desired output complexity than strong autoregressive baselines. Analysis shows that our model's local edit operations are combined to achieve more complex simplification operations such as content deletion and paraphrasing, as well as sentence splitting.


Introduction
Text simplification (TS) aims to automatically rewrite text so that it is easier to read. What makes text simple depends on its target audience (Xu et al., 2015): replacing complex or specialized terms with simpler synonyms might be helpful for non-native speakers (Petersen and Ostendorf, 2007; Allen, 2009) whereas restructuring text into short sentences with simple words might better match the literacy skills of children (Watanabe et al., 2009). Studies of simplification tools for deaf or hard-of-hearing users also show that they prefer lexical simplification to be applied on-demand (Alonzo et al., 2020). Yet, research in TS has mostly focused on developing models that generate a generic simplified output for a given source text (Xu et al., 2015; Zhang and Lapata, 2017). We contrast this Generic TS with Controllable TS, which specifies desired output properties. Prior work has addressed Controllable TS for either high-level properties, such as the target reading grade level for the entire text (Scarton and Specia, 2018; Nishihara et al., 2019), or low-level properties, such as the compression ratio or the nature of the simplification operation to use (Mallinson and Lapata, 2019; Martin et al., 2020). Specifying the desired reading grade level might be more intuitive for lay users. However, it provides only weak control over the nature of simplification. As illustrated in Table 1, simplifying text to different grade levels results in diverse edits.
Table 1:
Grade  Text
10     Tesla is a maker of electric cars, which do not need gas and can be charged by being plugged into a wall socket.
5      Tesla cars can be charged by being plugged in, like a phone. They do not need any gas.
3      Tesla builds cars that do not need gas.
To rewrite the grade 10 original for grade 5, the complex text is split into two sentences and paraphrased. When simplifying for grade 3, phrases are further simplified, and content is entirely deleted.
In this work, we adopt the intuitive framing for Controllable TS where the desired reading grade level is given as input, while providing fine-grained control on simplification by incorporating lexical complexity signals into our model. We adopt a non-autoregressive sequence-to-sequence model (Xu and Carpuat, 2020) that iteratively refines an input sequence to reach the desired degree of simplification and seamlessly integrates lexical complexity information.
Unlike commonly used autoregressive (AR) models for simplification (Specia, 2010;Nisioi et al., 2017;Zhang and Lapata, 2017;Wubben et al., 2012;Scarton and Specia, 2018;Nishihara et al., 2019;Martin et al., 2020;Jiang et al., 2020, among others), our model relies on explicit edit operations. It therefore has the potential of modeling the simplification process more directly than AR models which need to learn to copy operations implicitly. Unlike existing edit-based models for simplification which rely on pipelines of independently trained components (Alva-Manchego et al., 2017;Malmi et al., 2019;Mallinson et al., 2020), our model is trained end-to-end via imitation learning and thus learns to apply sequences of edits to transform the original source into the final simplified text. Furthermore, our approach does not require a custom architecture for simplification: it repurposes a non-autoregressive (NAR) model introduced for Machine Translation (MT) and can seamlessly incorporate lexical complexity information derived from data statistics in the initial sequence to be refined.
Based on extensive experiments on the Newsela English corpus, we show that our approach generates simplified outputs that match the target reading grade level better than strong AR baselines. Further analysis shows that the model learns complex editing operations such as sentence splitting, substitution and paraphrasing, and content deletion and applies these operations accordingly to match the complexity of the desired grade level.

An Edit-based Approach for Controllable TS
Task We frame Controllable TS as follows: given a complex text $c$ and a target grade level $g_t$, the task consists in generating a simplified output $s$ that is appropriate for grade level $g_t$.
Approach Our approach, illustrated in Figure 1, is based on EDITOR (Xu and Carpuat, 2020), a NAR Transformer model where the decoder layer is used to apply a sequence of edits on an initial input sequence (possibly empty). The edits are of two types: (1) reposition and (2) insertion. The reposition layer predicts the new position of each token (including deletions). The insertion layer has two components: the first layer predicts the number of placeholders to be inserted and the fill-in layer generates the actual target tokens for each placeholder. At each iteration, the model applies a reposition operation followed by insertion to the current input. This is repeated until two consecutive iterations return the same output, or a preset maximum number of operations is reached.
Figure 1: EDITOR iteratively refines a version of the input where words predicted to be too complex for 3rd grade readers have been deleted.
We tailor EDITOR for the task of Controllable TS as follows:
Control tokens The target complexity $g_t$ is encoded as a special token added at the start of the input sequence. As in prior work with autoregressive models (Scarton and Specia, 2018; Nishihara et al., 2019), this token acts as a side-constraint, gets encoded in the encoder hidden states as any other vocabulary token, and informs hypothesis generation through the source-target attention mechanism.
Lexical complexity signals We automatically identify the source words that are too complex for the target grade and delete them from the initial sequence to be refined by EDITOR. This simple strategy provides finer-grained guidance to the simplification process than the sequence-level side-constraint, while leaving the EDITOR model the flexibility to rewrite the output without constraints. We quantify the relatedness between each vocabulary word $w$ and grade level $g$ using their Pointwise Mutual Information (PMI) in the Newsela corpus (Nishihara et al., 2019; Kajiwara, 2019):
$$\mathrm{PMI}(w, g) = \log \frac{p(w \mid g)}{p(w)} \qquad (1)$$
Here, $p(w \mid g)$ is the probability that word $w$ appears in sentences of grade level $g$ and $p(w)$ is the probability of word $w$ in the entire training corpus. While the desired grade level $g_t$ is known in the task, we automatically predict the complexity $g_s$ of each source sentence $s_i$ using the Automatic Readability Index (ARI; Senter and Smith, 1967). The initial decoding sequence $\hat{s}_i$ takes the source sequence and deletes all words that are strongly related to the source grade level and unlikely to be found in text of the target grade level, with the exception of named entities (Equation 2); $E_i$ denotes the set of entities in the source sequence $s_i$. Our approach contrasts with prior work where PMI has been used in the loss to reward the generation of target grade-specific words for Controllable TS (Nishihara et al., 2019) or to exclude complex words from the decoding vocabulary using hard constraints for Generic TS (Kajiwara, 2019). Our approach combines lexical complexity information from both the source and target grade level more flexibly. Starting from $\hat{s}_i$ as an initial sequence, EDITOR can still delete further content to match the target grade level, insert new words to fix fluency and preserve the original meaning, and has the flexibility to re-generate tokens that were incorrectly dropped from the initial sequence.
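The PMI-based initialization described above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: probabilities are estimated as token-level relative frequencies, "strongly related to the source grade" is read as PMI(w, g_s) > 0, and "unlikely at the target grade" as PMI(w, g_t) < 0. The function names `train_pmi` and `initial_sequence` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def train_pmi(graded_sentences):
    """Build a PMI(word, grade) lookup from (grade, tokens) pairs.

    Assumption: p(w|g) and p(w) are token-level relative frequencies.
    """
    word_counts = Counter()  # counts over the whole corpus
    grade_counts = {}        # grade -> Counter of words seen at that grade
    for grade, tokens in graded_sentences:
        word_counts.update(tokens)
        grade_counts.setdefault(grade, Counter()).update(tokens)
    total = sum(word_counts.values())

    def pmi(word, grade):
        at_grade = grade_counts.get(grade, Counter())
        if word_counts[word] == 0 or at_grade[word] == 0:
            return float("-inf")  # word never observed (at this grade)
        p_w_given_g = at_grade[word] / sum(at_grade.values())
        p_w = word_counts[word] / total
        return math.log(p_w_given_g / p_w)

    return pmi


def initial_sequence(tokens, source_grade, target_grade, pmi, entities=frozenset()):
    """Drop words tied to the source grade and unlikely at the target grade.

    The zero thresholds are assumptions; named entities are always kept.
    """
    kept = []
    for tok in tokens:
        too_complex = pmi(tok, source_grade) > 0 and pmi(tok, target_grade) < 0
        if tok in entities or not too_complex:
            kept.append(tok)
    return kept


# Toy corpus with two grade levels (hypothetical data).
corpus = [
    (10, "tesla is a maker of electric cars".split()),
    (3, "tesla builds cars".split()),
]
pmi = train_pmi(corpus)
source = corpus[0][1]
print(initial_sequence(source, 10, 3, pmi))
print(initial_sequence(source, 10, 3, pmi, entities={"electric"}))
```

On such a tiny corpus the estimates are extreme, so most source-only words are dropped; with realistic counts the thresholds would need tuning.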
Training to generate & refine EDITOR uses imitation learning to learn an appropriate sequence of edit operations to generate the output sequence by efficiently exploring the large space of valid edit sequences that can reach a reference output. A roll-in policy is used to generate sequences to be refined and a roll-out policy is then used to estimate cost-to-go for all possible actions given the roll-in sequences. The model is trained to choose actions that minimize the cost-to-go estimates from the roll-in sequences to the true reference by comparing the model actions to the oracle actions generated by the Levenshtein edit distance algorithm. The roll-in sequences are stochastic mixtures of the initial sequences and outputs of the insertion and reposition modules given an initial sequence. The initial sequence is generated by applying random word dropping (Gu et al., 2019) and random word shuffle (Lample et al., 2018) with a probability of 0.5 and maximum shuffle distance of 3 to either the target sequence for MT tasks (Xu and Carpuat, 2020) or to the source sequence for Automatic Post Editing (Gu et al., 2019). For Controllable TS, we combine both: we first train EDITOR to generate text based on the corrupted target sequence, and then fine-tune the model for refinement based on the corrupted source sequence.
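The corruption used to build initial sequences during training can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that 0.5 is a per-token drop probability, that dropping is applied before shuffling, and that the shuffle follows the common noise-and-sort scheme in which each position index is perturbed by uniform noise in [0, k+1) before sorting, so that no token moves more than k positions. The function names are hypothetical.

```python
import random


def drop_words(tokens, drop_prob=0.5, rng=random):
    """Independently drop each token with probability drop_prob (assumed reading)."""
    return [tok for tok in tokens if rng.random() >= drop_prob]


def shuffle_words(tokens, max_distance=3, rng=random):
    """Locally shuffle tokens so that none moves more than max_distance positions.

    Each index i gets the sort key i + u with u drawn uniformly from
    [0, max_distance + 1); sorting by these keys yields a bounded permutation.
    """
    keys = [i + rng.random() * (max_distance + 1) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: keys[i])
    return [tokens[i] for i in order]


def corrupt(tokens, drop_prob=0.5, max_distance=3, rng=random):
    """Drop, then shuffle (the order of the two steps is an assumption)."""
    return shuffle_words(drop_words(tokens, drop_prob, rng), max_distance, rng)


rng = random.Random(0)
sentence = "tesla cars can be charged by being plugged in".split()
print(corrupt(sentence, rng=rng))
```

Passing a seeded `random.Random` instance keeps the corruption reproducible across runs.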
Experimental Settings

Data
The Newsela website provides high quality data to study text simplification (Xu et al., 2015). It consists of news articles rewritten by professional editors for students in different grade levels. We use English Newsela samples as extracted by Agrawal and Carpuat (2019) since their process preserves grade level information for each segment. We restrict the length of each segment to be between 5 and 80, resulting in 470k/2k/19k for the training, development and test sets respectively. We pre-process the dataset using Moses tools for normalization and truecasing. We refer to the resulting dataset as Newsela-Grade. We further segment tokens into subwords using a joint source-target byte pair encoding model with 32,000 operations. We use spaCy to identify entities in the source sequence.

Model configurations
Architecture We adopt the base Transformer architecture (Vaswani et al., 2017) with $d_{\text{model}} = 512$, $d_{\text{hidden}} = 2048$, $n_{\text{heads}} = 8$, $n_{\text{layers}} = 6$, and $p_{\text{dropout}} = 0.1$ for all our models. We add dropout to embeddings (0.1) and label smoothing (0.1). AR models are trained with the Adam optimizer with a batch size of 4096 tokens. Training stops after 8 checkpoints without improvement of validation perplexity. We decode with a beam size of 5 for the AR models. All NAR models are trained using Adam with an initial learning rate of 0.0005 and a batch size of 16,000 tokens. We select the best checkpoint based on validation perplexity. Grade side-constraints are defined using a distinct special token for each grade level (from 2 to 12). All models are implemented using the Fairseq toolkit.
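As a small illustration of the grade side-constraints, the sketch below prepends one special token per target grade (2 to 12) to the source tokens. The token spelling `<grade_N>` and the helper name `add_grade_token` are assumptions made for this example; the text does not specify the actual token strings.

```python
# One distinct special token per grade level from 2 to 12.
# The "<grade_N>" spelling is a hypothetical choice for illustration.
GRADE_TOKENS = {grade: f"<grade_{grade}>" for grade in range(2, 13)}


def add_grade_token(source_tokens, target_grade):
    """Prepend the side-constraint token for target_grade to the source."""
    if target_grade not in GRADE_TOKENS:
        raise ValueError(f"unsupported target grade: {target_grade}")
    return [GRADE_TOKENS[target_grade]] + list(source_tokens)


print(add_grade_token(["Tesla", "builds", "cars", "."], 3))
```

In practice each such token would also have to be added to the model vocabulary so that it receives its own embedding.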
Preliminary We establish that our Transformer architecture choice is strong on the more standard Generic TS task, as it performs comparably to the state of the art (Jiang et al., 2020) on the Newsela-Auto corpus (Table 2).

Experimental Conditions
We compare our approach, i.e., "NAR + PMI-based initialization", described in Section 2, to three autoregressive baselines for Controllable TS:
1. AR is an AR Transformer model that uses grade-level tokens as side constraints (Scarton and Specia, 2018).
2. AR + PMI-based constraints is an AR Transformer model which incorporates lexical complexity information as hard constraints during decoding (Kajiwara, 2019): complex words are excluded from beam search using the dynamic beam allocation algorithm (Post and Vilar, 2018). While this approach was introduced for Generic TS, we adapt it to Controllable TS by defining hard constraints using the same criteria as for deleting words in initial sequences for EDITOR (Section 2).
3. AR + PMI weighted loss (Nishihara et al., 2019) is an AR Transformer model trained with a loss that weights words based on their PMI values with the desired target grade level.

Automatic Evaluation Metrics
We evaluate the output of the models using the following text simplification evaluation metrics:
SARI (Xu et al., 2016) measures lexical simplification based on the words that are added, deleted and kept by the systems, by comparing system output against references and against the input sentence. It computes the F1 score for the n-grams that are added (add-F1). The model's deletion capability is measured by the F1 score for n-grams that are kept (keep-F1) and precision for the deleted n-grams (del-P).
Pearson's correlation coefficient (PCC) measures the strength of the linear relationship between the complexity of our system outputs and the complexity of reference outputs. We estimate the reading grade level of the system outputs and reference text using the ARI score.
Adjacency ARI Accuracy represents the percentage of sentences where the system output grade level is within 1 grade of the reference text according to the ARI score (Heilman et al., 2008).
Mean Squared Error (MSE) between the predicted ARI grade level of the system output and the desired target grade level (Scarton and Specia, 2018; Nishihara et al., 2019).
Table 3 summarizes the automatic evaluation of our approach on Controllable TS: our approach, "NAR + PMI-based initialization", improves all metrics (SARI, PCC, ARI-Accuracy and MSE) compared to the AR baseline. It also outperforms the AR + PMI-based constraints baseline across all metrics except MSE. That baseline oversimplifies the source text by always deleting the complex tokens, as shown by a decrease in keep-F1 (-6.1) and improved del-P (+4.6). This results in lower MSE but worse PCC and ARI-Accuracy. By contrast, our approach uses lexical complexity information to provide an initial canvas and yields simplified sentences that match the desired target complexity better than the AR baselines. This is reflected in the higher SARI obtained by the PMI-based initialization baseline relative to the Source; that baseline consists of outputs generated by deleting complex tokens from the source text and is therefore not well-formed by itself. AR + PMI weighted loss performs comparably to the AR baseline across all the metrics except PCC, which could be due to PMI values being a relatively noisy signal at the token level during training, especially for the target grade levels where the data is scarce. We further compare our approach with the model that uses the Oracle-keep sequence, i.e., tokens from the source sequence that are present in the target sequence. As expected, the oracle significantly outperforms all models that do not have access to the reference, further confirming EDITOR's ability to make good use of the provided initial sequence.
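The grade-level metrics defined above (PCC, adjacency ARI accuracy and MSE) all build on an ARI estimate of each sentence. The sketch below uses the standard ARI formula, 4.71 (characters / words) + 0.5 (words / sentences) - 21.43. The tokenization (whitespace words, alphanumeric characters, sentence ends detected by punctuation) and the use of unrounded scores are assumptions; the evaluation scripts behind the reported numbers may differ.

```python
import math
import re


def ari_score(text):
    """Automatic Readability Index with a simple, assumed tokenization."""
    words = text.split()
    if not words:
        raise ValueError("empty text")
    chars = sum(ch.isalnum() for ch in text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (norm_x * norm_y)


def adjacency_accuracy(system_grades, reference_grades, tolerance=1):
    """Percentage of outputs whose grade is within `tolerance` of the reference."""
    hits = sum(abs(s - r) <= tolerance for s, r in zip(system_grades, reference_grades))
    return 100.0 * hits / len(system_grades)


def mean_squared_error(system_grades, target_grades):
    """MSE between estimated output grades and the desired target grades."""
    return sum((s - t) ** 2 for s, t in zip(system_grades, target_grades)) / len(system_grades)


outputs = ["Tesla builds cars that do not need gas.", "They do not need any gas."]
references = ["Tesla makes cars that need no gas.", "The cars do not use gas."]
system = [ari_score(t) for t in outputs]
reference = [ari_score(t) for t in references]
print(adjacency_accuracy(system, reference))
```

Here PCC and adjacency accuracy compare system outputs with references, while MSE compares system outputs with the requested target grade, mirroring the definitions in the text.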
More interestingly, our method for identifying grade-specific complex tokens (Equation 2) achieves a recall of 91.3% and precision of 76.4% with respect to the oracle on the development set, indicating that the initial sequences contain appropriate vocabulary. Table 3 shows that our approach partially closes the gap in performance with the oracle by using this modified source sequence as opposed to the original source sequence (NAR). Figure 2 illustrates the refinement process that generates the simplified output. Reposition and insertion operations are used in consecutive steps to perform complex editing operations (e.g., sentence splitting and lexical substitution), which requires that the model learns to perform these operations sequentially. Furthermore, our approach recovers the tokens that were incorrectly identified as complex and thus deleted in the initial sequence, highlighting the benefits of the flexible refinement process.
Figure 2: Our approach substitutes "analyzed" correctly and splits the source sentence into two simple sentences to generate a simplified output that matches the lexical complexity of the desired grade level 6. The tokens identified as complex in the source using the proposed method are shown in bold.
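The refinement process (a reposition step followed by an insertion step, repeated until two consecutive iterations give the same output or an iteration cap is reached, as described in Section 2) can be sketched as a simple control loop. The two callables below are toy stand-ins for the learned reposition and insertion layers; they and the function name `refine` are hypothetical and serve only to show the control flow.

```python
from typing import Callable, List, Sequence

Tokens = List[str]


def refine(initial: Sequence[str],
           reposition: Callable[[Tokens], Tokens],
           insert: Callable[[Tokens], Tokens],
           max_iterations: int = 10) -> Tokens:
    """Apply reposition then insertion until the output stops changing."""
    current = list(initial)
    for _ in range(max_iterations):
        updated = insert(reposition(current))
        if updated == current:  # two consecutive iterations agree
            break
        current = updated
    return current


def toy_reposition(tokens: Tokens) -> Tokens:
    """Toy policy: delete the token 'very' and keep everything else in place."""
    return [tok for tok in tokens if tok != "very"]


def toy_insert(tokens: Tokens) -> Tokens:
    """Toy policy: insert a final period if it is missing."""
    return tokens if tokens and tokens[-1] == "." else tokens + ["."]


print(refine(["a", "very", "big", "car"], toy_reposition, toy_insert))
```

With real model components each call would run a decoder pass over the whole sequence, so the number of loop iterations is what Figure 3 reports per target grade.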

Human evaluation
We randomly sample 60 source sentences from the Newsela-Grade dataset, among sources that are simplified toward four distinct grade levels (∼240 examples). For each of these target grades, we obtain ratings of system outputs and references from Amazon Mechanical Turk workers on meaning preservation, grammaticality and simplicity (Appendix B). We compute the absolute difference ("AbsDiff") in the simplicity ratings between the reference and the system output by the same annotator, and aggregate over all examples and all ratings. Table 5 shows that our outputs are closer to the reference according to the simplicity judgements than the AR system outputs. The "Mean" ratings indicate that the two models make different tradeoffs: where the AR model under-simplifies the source sentence and preserves the meaning, our approach almost matches the mean simplicity of the reference at the cost of lower meaning preservation. Our outputs are also less grammatical than those of the AR model and the references, probably due to the independence assumptions made by the non-autoregressive model. The Adjacency Accuracy, representing the percentage of system outputs within a difference of one rating with the reference, is also higher for our approach relative to the AR model.

Ablation Experiments
Table 4 summarizes the impact of the design choices described in Section 2. Removing lexical information (-PMI-based Initialization) hurts both SARI and the grade-specific metrics. Further, using the baseline EDITOR model that is trained only to generate, without fine-tuning for refinement, significantly hurts the performance across the board. In that setting, EDITOR never learns to delete tokens from the source, but only learns to delete tokens inserted by the model. Using EDITOR to generate the output from scratch instead (-Src Initialization) recovers the performance on SARI and grade-specific metrics but fails to match the performance of our approach.
This shows that fine-tuning for refinement and providing initial sequences informed by lexical complexity are both key to the performance of EDITOR for Controllable TS.
We also compare our approach with the variant of the model that is trained to perform reposition independently of the insertion operation (-Joint), similar to Mallinson et al. (2020). Even though this variant is able to match SARI, the difference in grade-specific metrics is significant, showing the benefits of joint training of the insertion and reposition components.
Iterative refinement helps match the target grade better than single step refinement, as suggested by ARI Accuracy, MSE and PCC. Figure 3 shows the number of iterations of refinement performed by our approach as a function of the desired target grade level: simplifying to lower grade levels (2 or 3) requires on average one more refinement step than simplifying to grade 8 or 9. This suggests that the iterative process helps simplification when the gap between source and target grades is wider.
Figure 3: Our approach requires more iterations when simplifying to a lower grade level. The number of iterations performed by the model monotonically increases with the edit distance between source and reference.
Finally, we verify that using ARI to estimate the complexity of the source is effective. Replacing the ARI predictions with the gold-standard grade level improves the grade-specific metrics only marginally. Table 6 further shows the advantage of combining source and target grade information when identifying complex tokens (Equation 2) over using source or target grade only.
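The edit distance between source and reference mentioned in the Figure 3 caption can be illustrated with a token-level Levenshtein distance. This is the textbook dynamic program with insertions, deletions and substitutions, shown as a sketch; it is not claimed to be the exact oracle used by EDITOR, which also involves reposition operations.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target` (any two sequences)."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, src_item in enumerate(source, start=1):
        current = [i]
        for j, tgt_item in enumerate(target, start=1):
            substitution_cost = 0 if src_item == tgt_item else 1
            current.append(min(
                previous[j] + 1,                      # delete src_item
                current[j - 1] + 1,                   # insert tgt_item
                previous[j - 1] + substitution_cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


grade_10 = "Tesla is a maker of electric cars".split()
grade_3 = "Tesla builds cars".split()
print(levenshtein(grade_10, grade_3))
```

Keeping only two rows of the table makes memory linear in the reference length, which is convenient when scoring many sentence pairs.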

Analysis
Per Grade Analysis How does our model compare against the AR baselines for each target grade level? Figures 4a and 4b show the SARI and Adjacency Accuracy bucketed by target grade level. We observe that our approach achieves comparable or higher accuracy than the AR baselines for all grades except 2 and 3. Further analysis suggests that this is due to samples where the source grade level is 12, and where our approach deletes words too aggressively to simplify for the large grade gap (Figures 4c and 4d).

Model Edit operations
We compare the number of edit operations performed by our model and the oracle Levenshtein Edit Distance (Section 2) when simplifying to different target grade levels. Figure 5 shows that the number of operations performed by our approach to generate its output tracks the number of oracle Levenshtein edits overall. The main differences are that our approach performs more than twice as many repositions as the oracle (5c) for grades 4 and above, which suggests that the sequence of operations performed is suboptimal. Furthermore, our approach overdeletes words for target grade levels lower than 4 (5b), and performs fewer insertions than the oracle (5a). We turn to manual analysis to shed more light on these results.

Simplification Operations
Table 7 reports a manual annotation of the simplification operations observed for 50 randomly sampled segments, using an operation taxonomy from prior work (Xu et al., 2015; Jiang et al., 2020). Our approach performs content deletion in 7.5% more sentences than needed to generate the references. At the same time, it performs fewer insertions. In particular, our approach is unable to generate the elaborations and explanations found in the Newsela references (Srikanth and Li, 2020). This would require knowledge-based reasoning, which is beyond the capacity of the current model. However, our approach can model sentence splitting and substitution, which often require a sequence of insertion/deletion/reposition operations to be performed sequentially.

Related Work
AR Models for TS Generic TS is often framed as machine translation where an autoregressive sequence-to-sequence model learns to model simplification operations implicitly from pairs of complex-simple training samples (Specia, 2010; Nisioi et al., 2017; Zhang and Lapata, 2017; Wubben et al., 2012; Scarton and Specia, 2018; Nishihara et al., 2019; Martin et al., 2020; Jiang et al., 2020).
There have been efforts at controlling different aspects of the simplified output, such as controlling for a specific grade level (Scarton and Specia, 2018; Nishihara et al., 2019) or employing lexical or syntactic constraints (Mallinson and Lapata, 2019; Martin et al., 2020), where the complexity of a word is either determined by its frequency or by manually tagging the tokens at inference time. We instead use the association of a word with the grade level to define lexical constraints automatically. Furthermore, these models lack interpretability in terms of the type of operations performed, and need to generate the entire output sequence from scratch, thus potentially wasting capacity in learning copying operations.
Edit-based Generic TS Recent work incorporates edit operations into neural text simplifications more directly. These approaches rely on custom multi-step architectures. They first learn to tag the source token representing the type of edit operations to be performed, and then use a secondary model for in-filling new tokens or executing the edit operation. The tagging and editing model are either trained independently (Alva-Manchego et al., 2017;Malmi et al., 2019;Kumar et al., 2020;Mallinson et al., 2020) or jointly (Dong et al., 2019). By contrast, we use a single model trained end-to-end to generate sequences of edit operations to transform the entire source sequence.
Lexical Complexity for TS Nishihara et al. (2019) introduced a training loss for Controllable TS that weights words that frequently appear in the sentences of a specific grade level. By contrast, we use lexical complexity information to define the initial sequence for refinement, which does not require any change to the model architecture nor to the training process. For Generic TS, Kajiwara (2019) used complex words as negative constraints for decoding with an autoregressive model. By contrast, our approach provides more flexibility to the model, which results in better outputs in practice.
Non-autoregressive Seq2Seq Models NAR models have primarily been used to speed up Machine Translation by allowing parallel edit operations on the output sequence (Lee et al., 2018; Gu et al., 2018; Ghazvininejad et al., 2019; Stern et al., 2019; Chan et al., 2020; Xu and Carpuat, 2020). Refinement approaches have been used to incorporate terminology constraints in machine translation, including as hard (Susanto et al., 2020) and soft constraints (Xu and Carpuat, 2020). They have also shown promise for Automatic Post Editing (APE) (Gu et al., 2019; Wan et al., 2020) and grammatical error correction (Awasthi et al., 2019). In this work, we show that they are a good fit to incorporate lexical complexity information for Controllable TS.

Conclusion
We introduced an approach that repurposes a non-autoregressive sequence-to-sequence model to incorporate lexical complexity signals in Controllable TS. An extensive empirical study showed that our approach generates simplified outputs that better match the desired target-grade complexity than AR models. Analysis revealed promising directions for future work, such as improving grammaticality while encouraging tighter control on complexity by better aligning the model's atomic edit operations with more complex simplification operations.

A Appendix
A.1 Dataset Statistics Table 8 provides the statistics of grade pair distribution in the Newsela-Grade dataset.

A.2 Implementation Details
We train all our models on two GeForce GTX 1080Ti GPUs. The average training time for a single seed of the AR model is ∼8-9 hrs and for the EDITOR model is ∼20-22 hrs. Fine-tuning EDITOR takes an additional 4-5 hrs.

B Human Annotation
Quality Control We set the location restriction to the United States to control for the quality of annotations. The correlation between the target grade levels and the simplicity ratings of the reference text is 0.582, which suggests that workers do rank simpler output higher than a relatively complex reference of the same source sentence.
Compensation We compensate the Amazon Mechanical Turk workers at a rate of $0.03 per HIT.
Instructions We provide the following instructions to the Amazon Mechanical Turk workers to evaluate generated simplified sentences.
Meaning You are given one sentence and 3 rewrites of the same sentence. Carefully read the instructions provided and then use the sliders to indicate the extent to which the meaning expressed in the original sentence is preserved in the rewrites (Agirre et al., 2016).
Score  Category
4      they convey the same key idea
3      they convey the same key idea but differ in some unimportant details
2      they share some ideas but differ in important details
1      they convey different ideas on the same topic
0      Completely different from the first sentence
Grammar You are given three sentences. Carefully read the instructions provided and then use the sliders to indicate the extent to which each of the sentence is grammatical (Heilman et al., 2014).
Score  Category
4      Perfect: The sentence is native-sounding.
3      Comprehensible: The sentence may contain one or more minor grammatical errors
2      Somewhat Comprehensible: The sentence may contain one or more serious grammatical errors,
1      Incomprehensible: The sentence contains so many errors that it would be difficult to correct
0      Other/Incomplete: This sentence is incomplete
Simplicity You are given one sentence and 3 rewrites of the same sentence. Carefully read the instructions provided and use the sliders to indicate how simple is each of the rewrite as compared to the original sentence (0: not simplified at all, 10: most simplified). We provide the following examples for your reference.